Let's look at an example using a dataset from the UCI repository.

Data description: https://archive.ics.uci.edu/ml/datasets/banknote+authentication#

In [272]:
import pandas as pd
import numpy as np
data = pd.read_csv("data_banknote_authentication.txt", header=None)
data.head(3)

Unnamed: 0,0,1,2,3,4
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0


We have 4 features and 1 binary target variable: the task is to determine whether a banknote is forged or not.

In [273]:
print(data.shape)

(1372, 5)


There are 1372 banknotes in total.

Let's look at the class balance.

In [274]:
data.iloc[:, -1].value_counts()

0    762
1    610
Name: 4, dtype: int64
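
The counts above can also be read as class shares. A minimal sketch, assuming a stand-in `labels` Series rebuilt from the printed counts in place of `data.iloc[:, -1]`:

```python
import pandas as pd

# Stand-in for data.iloc[:, -1], rebuilt from the counts printed above
labels = pd.Series([0] * 762 + [1] * 610)

# normalize=True returns fractions instead of absolute counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

The classes are close to balanced (roughly 56% to 44%), so accuracy-style metrics are not dominated by one class here.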

Split the data into training and test sets and train a model (gradient boosting in this example).

In [275]:
from sklearn.model_selection import train_test_split

x_data = data.iloc[:,:-1]
y_data = data.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

In [276]:
import xgboost as xgb

model = xgb.XGBClassifier()

model.fit(x_train, y_train)
y_predict = model.predict(x_test)

Check the quality.

In [277]:
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score

def evaluate_results(y_test, y_predict):
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0)) 
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0)) 
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0)) 
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0)) 

    
evaluate_results(y_test, y_predict)

Classification results:
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%
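
`evaluate_results` passes hard 0/1 predictions to `roc_auc_score`. With hard predictions the ROC curve has a single operating point, and the area reduces to the mean of recall and specificity. With precision at 100% (no false positives) that is about (99.15% + 100%) / 2, which agrees with the printed 99.57%. A small sketch of this identity on toy arrays (the arrays are hypothetical, for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

# Toy labels and hard 0/1 predictions (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

tpr = recall_score(y_true, y_pred, pos_label=1)  # recall of the positive class
tnr = recall_score(y_true, y_pred, pos_label=0)  # recall of the negative class (specificity)

# On hard predictions ROC AUC equals the mean of the two recalls
auc_hard = roc_auc_score(y_true, y_pred)
print(auc_hard, (tpr + tnr) / 2)
```

To get a threshold-free AUC, one would pass scores instead of labels, for example `model.predict_proba(x_test)[:, 1]`.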


### Now it's PU learning's turn

Suppose that the negatives and part of the positives are unknown to us.

In [278]:
mod_data = data.copy()
# get the indices of the positive samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
# shuffle them
np.random.shuffle(pos_ind)
# keep just 25% of the positives labeled
pos_sample_len = int(np.ceil(0.25 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 153/610 as positives and unlabeling the rest
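
`np.random.shuffle` draws from the global random state, so a different subset of positives is selected on every run. A minimal sketch of a reproducible variant, assuming a seeded `np.random.default_rng` generator; the seed value and the `np.arange(610)` stand-in for `pos_ind` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

# Stand-in for the 610 positive row indices found above
pos_ind = np.arange(610)

# permutation returns a shuffled copy and leaves pos_ind untouched
shuffled = rng.permutation(pos_ind)
pos_sample_len = int(np.ceil(0.25 * len(shuffled)))
pos_sample = shuffled[:pos_sample_len]
print(pos_sample_len)
```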


Create a column for the new target variable, which has two classes: P (1) and U (-1).

In [279]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    1219
 1     153
Name: class_test, dtype: int64


* We now have just 153 positive samples labeled as 1 in the 'class_test' col while the rest is unlabeled as -1.

* Recall that col 4 still holds the actual label
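
It can help to quantify how many positives are hidden among the unlabeled rows. A sketch using a stand-in frame rebuilt from the counts printed above (the column name `true_label` is hypothetical; in the notebook the true label is column `4`):

```python
import numpy as np
import pandas as pd

# Stand-in frame built from the counts printed above:
# 762 negatives, 610 positives, of which 153 keep their label
true_label = np.array([0] * 762 + [1] * 610)
class_test = np.full(len(true_label), -1)
class_test[762:762 + 153] = 1  # which positives stay labeled does not matter here

frame = pd.DataFrame({'true_label': true_label, 'class_test': class_test})

# Rows: PU label, columns: true label
table = pd.crosstab(frame['class_test'], frame['true_label'])
print(table)

# Share of unlabeled rows that are actually positive
hidden_share = table.loc[-1, 1] / table.loc[-1].sum()
print(f'{hidden_share:.1%} of unlabeled rows are hidden positives')
```

With these counts, 457 of the 1219 unlabeled rows are positives, so treating a random draw from U as negatives introduces a sizable amount of label noise.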

In [280]:
mod_data.head(10)

Unnamed: 0,0,1,2,3,4,class_test
0,3.6216,8.6661,-2.8073,-0.44699,0,-1
1,4.5459,8.1674,-2.4586,-1.4621,0,-1
2,3.866,-2.6383,1.9242,0.10645,0,-1
3,3.4566,9.5228,-4.0112,-3.5944,0,-1
4,0.32924,-4.4552,4.5718,-0.9888,0,-1
5,4.3684,9.6718,-3.9606,-3.1625,0,-1
6,3.5912,3.0129,0.72888,0.56421,0,-1
7,2.0922,-6.81,8.4636,-0.60216,0,-1
8,3.2032,5.7588,-0.75345,-0.61251,0,-1
9,1.5356,9.1772,-2.2718,-0.73535,0,-1


Remember that this data frame (mod_data) still includes the former target variable, which we keep here just to compare the results.

Column -2 holds the original class label (positive vs. negative); column -1 holds the new class (positive vs. unlabeled).

In [281]:
x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

### 1. random negative sampling

In [282]:
# shuffle all rows
mod_data = mod_data.sample(frac=1)
# split into labeled positives (P) and unlabeled rows (U)
pos_sample = mod_data[mod_data['class_test']==1]
unlabeled = mod_data[mod_data['class_test']==-1]
# take as many unlabeled rows as there are positives and treat them as negatives
neg_sample = unlabeled[:len(pos_sample)]
# the remaining unlabeled rows form the test set
sample_test = unlabeled[len(pos_sample):]
print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(153, 6) (153, 6)
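
The split above can be wrapped into a reusable function. This is a sketch under assumptions: `random_negative_split` is a hypothetical helper name, and the `seed` parameter is added here for reproducibility (the original cell is unseeded).

```python
import numpy as np
import pandas as pd

def random_negative_split(frame, pu_col='class_test', seed=0):
    """Split a PU-labeled frame into a balanced train set and a test set.

    Hypothetical helper: labeled positives plus an equally sized random
    draw from the unlabeled rows form the train set; the remaining
    unlabeled rows form the test set.
    """
    shuffled = frame.sample(frac=1, random_state=seed)
    positives = shuffled[shuffled[pu_col] == 1]
    unlabeled = shuffled[shuffled[pu_col] == -1]
    pseudo_negatives = unlabeled.iloc[:len(positives)]
    test = unlabeled.iloc[len(positives):]
    train = pd.concat([pseudo_negatives, positives]).sample(frac=1, random_state=seed)
    return train, test

# Toy frame: 10 rows, 3 labeled positives
toy = pd.DataFrame({'x': np.arange(10.0),
                    'class_test': [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]})
train, test = random_negative_split(toy)
print(train.shape, test.shape)
```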


In [283]:
sample_train

Unnamed: 0,0,1,2,3,4,class_test
1311,-1.11880,3.33570,-1.345500,-1.95730,1,1
1245,-1.13060,1.84580,-1.357500,-1.38060,1,1
909,-1.73220,-9.28280,7.719000,-1.71680,1,1
1286,-6.42470,9.53110,0.022844,-6.85170,1,1
502,2.00510,-6.86380,8.132000,-0.24010,0,-1
...,...,...,...,...,...,...
842,-1.89690,-6.78930,5.276100,-0.32544,1,-1
35,2.43910,6.44170,-0.807430,-0.69139,0,-1
846,-2.14050,-0.16762,1.321000,-0.20906,1,1
971,0.00312,-4.00610,1.795600,0.91722,1,1


In [284]:
model = xgb.XGBClassifier()

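# fit on the features; note that column -2 is the original label (col 4), not class_test
model.fit(sample_train.iloc[:,:-2].values, 
          sample_train.iloc[:,-2].values)
# predict for the unlabeled rows left out of training and score against the original label
y_predict = model.predict(sample_test.iloc[:,:-2].values)
evaluate_results(sample_test.iloc[:,-2].values, y_predict)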

Classification results:
f1: 96.63%
roc: 97.73%
recall: 98.72%
precision: 94.62%
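
Note that the cell above fits on `iloc[:, -2]`, the original label. In a strict PU setting the true labels are not available at training time, so the training target would be built from `class_test` alone. A sketch of that variant follows; the synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` (swapped in for XGBoost to keep the example dependency-light) are assumptions, not the notebook's setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the banknote set)
X, y_true = make_classification(n_samples=600, n_features=4, n_informative=3,
                                n_redundant=1, random_state=0)

# Keep the label for 25% of the positives, unlabel everything else
pos_idx = rng.permutation(np.where(y_true == 1)[0])
labeled_pos = pos_idx[:int(np.ceil(0.25 * len(pos_idx)))]
unlabeled = rng.permutation(np.setdiff1d(np.arange(len(y_true)), labeled_pos))

# Random negative sampling: as many unlabeled rows as labeled positives
pseudo_neg = unlabeled[:len(labeled_pos)]
test_idx = unlabeled[len(labeled_pos):]

train_idx = np.concatenate([labeled_pos, pseudo_neg])
# Training target uses PU information only: 1 = labeled positive, 0 = pseudo-negative
y_train_pu = np.concatenate([np.ones(len(labeled_pos), dtype=int),
                             np.zeros(len(pseudo_neg), dtype=int)])

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X[train_idx], y_train_pu)

# The true labels are used only to score the held-out unlabeled rows
y_pred = clf.predict(X[test_idx])
print('f1 on held-out unlabeled rows: %.2f' % f1_score(y_true[test_idx], y_pred))
```

Because some pseudo-negatives are in fact hidden positives, scores from this variant are expected to be lower than when the true labels are used for fitting.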


### Homework

1. Take any dataset for binary classification (you can download one of the benchmark datasets from https://archive.ics.uci.edu/ml/datasets.php).
3. Do feature engineering.
4. Train any classifier (whichever you like).
5. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them.
6. Apply random negative sampling to build a classifier under the new conditions.
7. Compare the quality with the solution from item 4 (build a report: a table of metrics).
8. Experiment with the share of P in step 5 (how does model quality change as P shrinks or grows?).

<b>Bonus question:</b>

Which of the methods do you think is preferable in practice: random negative sampling or the 2-step approach?

Your answer here:

In [285]:
data = pd.read_csv("../parkinsons.data")

# reorder the columns so that the target ('status') comes last
data = pd.DataFrame(data, columns=['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE', 'status'])

data.drop('name', axis = 1, inplace=True)

pos_ind = np.where(data['status'].values == 1)[0]
np.random.shuffle(pos_ind)

In [286]:
df = data.copy()

In [287]:
x_data = df.iloc[:,:-1]
y_data = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)

In [288]:
model_df = xgb.XGBClassifier()

model_df.fit(x_train, y_train)
y_predict_df = model_df.predict(x_test)

In [289]:
evaluate_results(y_test, y_predict_df)

Classification results:
f1: 96.88%
roc: 91.29%
recall: 96.88%
precision: 96.88%


In [290]:
data.columns

Index(['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'spread1',
       'spread2', 'D2', 'PPE', 'status'],
      dtype='object')

In [291]:
pos_ind

array([  8, 103, 125,  11,  95,  23,  70, 138, 114,  71, 143,  90, 131,
        22,  19, 102,  26,   1, 121, 137,  38,  10,   0, 105,  24,  81,
        28, 177, 126,  54, 141, 109,  56, 147,  58, 154,  68, 128,  69,
       132, 115,  72,  57,  97,  40, 110, 162, 178,  13,  67, 135,  39,
        16,  20, 100,   5, 164,  94,  82,   6,  17,  21,  18,   2,  78,
       182, 120, 163,  76, 127,  73,  66, 106, 161,  15,   9,  83, 153,
       139,  88, 157, 150, 117, 134, 104, 113, 151,  86, 116, 145, 112,
        55,  85, 160, 130,  59,  75,   4,  25, 101,  87, 159, 148, 149,
       158, 108,  84,  79,  74,  14,  89, 144,  77, 123, 119, 136, 180,
       107, 181, 111, 155, 118,  80, 140,  96,   3,  93,  37, 152,  91,
        92, 124,  36,  99,  29,  27,   7, 129, 146,  41, 142, 156, 122,
       133, 179,  98,  12], dtype=int64)

In [292]:
pos_sample_len = int(np.ceil(0.45 * len(pos_ind)))

In [293]:
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 67/147 as positives and unlabeling the rest


In [294]:
data['class_test'] = -1
data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', data.iloc[:,-1].value_counts())

target variable:
 -1    128
 1     67
Name: class_test, dtype: int64


In [295]:
data.head(2)

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE,status,class_test
0,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,0.426,...,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,1,1
1,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,0.626,...,0.01929,19.085,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674,1,1


In [296]:
x_data = data.iloc[:,:-2].values # just the X 
y_labeled = data.iloc[:,-1].values # new class (just the P & U)
y_positive = data.iloc[:,-2].values # original class

In [297]:
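# shuffle all rows
data = data.sample(frac=1)
# split into labeled positives (P) and unlabeled rows (U)
pos_sample = data[data['class_test']==1]
unlabeled = data[data['class_test']==-1]
# take as many unlabeled rows as there are positives and treat them as negatives
neg_sample = unlabeled[:len(pos_sample)]
# the remaining unlabeled rows form the test set
sample_test = unlabeled[len(pos_sample):]
print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)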

(67, 24) (67, 24)


Result with random negative sampling

In [298]:
model = xgb.XGBClassifier()

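# fit on the features; note that column -2 is the original label ('status'), not class_test
model.fit(sample_train.iloc[:,:-2].values, 
          sample_train.iloc[:,-2].values)
# predict for the unlabeled rows left out of training and score against the original label
y_predict = model.predict(sample_test.iloc[:,:-2].values)
evaluate_results(sample_test.iloc[:,-2].values, y_predict)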

Classification results:
f1: 81.01%
roc: 72.64%
recall: 91.43%
precision: 72.73%


Results with the original dataset labels (baseline) and with random negative sampling at different shares of labeled positives:

| Labeled positives | f1 | roc | recall | precision |
|---|---|---|---|---|
| baseline (original labels) | 96.88% | 91.29% | 96.88% | 96.88% |
| 25% | 83.42% | 63.95% | 100.00% | 71.56% |
| 30% | 88.05% | 72.83% | 98.59% | 79.55% |
| 35% | 89.55% | 77.70% | 96.77% | 83.33% |
| 40% | 84.96% | 70.22% | 96.00% | 76.19% |
| 45% | 81.01% | 72.64% | 91.43% | 72.73% |

As the share of labeled positives grows to 35%, f1, ROC AUC, and precision rise and peak there; from 40% on they are lower again. Recall, in contrast, falls steadily, from 100.00% at 25% to 91.43% at 45%.
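
The per-share numbers above each come from one run of unseeded cells, so part of the variation between shares may be sampling noise. A sketch of how the comparison could be automated and averaged over a few seeds follows. Everything here is an assumption for illustration: the helper name `run_once`, the synthetic data, scikit-learn's `GradientBoostingClassifier` (swapped in for XGBoost), and the number of seeds. Unlike the cells above, this variant fits on the PU labels rather than on `status`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def run_once(X, y_true, share, seed):
    """Train on PU labels with random negative sampling; score on the rest."""
    rng = np.random.default_rng(seed)
    pos_idx = rng.permutation(np.where(y_true == 1)[0])
    labeled_pos = pos_idx[:int(np.ceil(share * len(pos_idx)))]
    unlabeled = rng.permutation(np.setdiff1d(np.arange(len(y_true)), labeled_pos))
    pseudo_neg, test_idx = unlabeled[:len(labeled_pos)], unlabeled[len(labeled_pos):]

    train_idx = np.concatenate([labeled_pos, pseudo_neg])
    # 1 = labeled positive, 0 = pseudo-negative drawn from the unlabeled rows
    y_train = np.concatenate([np.ones(len(labeled_pos), dtype=int),
                              np.zeros(len(pseudo_neg), dtype=int)])
    clf = GradientBoostingClassifier(random_state=seed).fit(X[train_idx], y_train)
    y_pred = clf.predict(X[test_idx])
    y_test = y_true[test_idx]  # true labels are used for scoring only
    return {'share': share,
            'f1': f1_score(y_test, y_pred),
            'roc': roc_auc_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred)}

# Synthetic stand-in data (not the Parkinson's set)
X, y_true = make_classification(n_samples=400, n_features=6, random_state=1)
shares = [0.25, 0.30, 0.35, 0.40, 0.45]
seeds = range(3)
rows = [run_once(X, y_true, share, seed) for share in shares for seed in seeds]

# One row per share, metrics averaged over the seeds
report = pd.DataFrame(rows).groupby('share').mean()
print(report.round(3))
```

Averaging over seeds makes it easier to tell a real trend across shares from run-to-run fluctuation, which matters on a dataset as small as this one (195 rows).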