### Homework No. 6, Krivonogov N.V.

1. take any dataset for binary classification (one can be downloaded from https://archive.ics.uci.edu/ml/datasets.php)
2. do feature engineering
3. train any classifier (whichever you like)
4. then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them
5. apply random negative sampling to build a classifier under the new conditions
6. compare the quality with the solution from step 3 (build a report: a table of metrics)
7. experiment with the share of P used in step 5 (how the model quality changes as P shrinks or grows)

A friend of mine, Dmitry Shelgunov, set himself a goal: cook plov 100 times and accept only the hundredth one as a success, and in that way learn to cook it.

In honor of Shel, I chose a binary classification dataset about RICE, the main ingredient of this wonderful dish: https://archive.ics.uci.edu/ml/datasets/Rice+%28Cammeo+and+Osmancik%29

Description:

A total of 3810 rice grain's images were taken for the two species (Cammeo and Osmancik), processed and feature inferences were made. 7 morphological features were obtained for each grain of rice.

In other words:

3810 images of rice grains of two varieties (Cammeo and Osmancik) were captured and processed, and 7 morphological features were extracted for each grain.

In [1]:
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')

In [2]:
df  = pd.read_csv('Rice_Osmancik_Cammeo_Dataset.csv')

In [3]:
df

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,CLASS
0,15231,525.578979,229.749878,85.093788,0.928882,15617,0.572896,Cammeo
1,14656,494.311005,206.020065,91.730972,0.895405,15072,0.615436,Cammeo
2,14634,501.122009,214.106781,87.768288,0.912118,14954,0.693259,Cammeo
3,13176,458.342987,193.337387,87.448395,0.891861,13368,0.640669,Cammeo
4,14688,507.166992,211.743378,89.312454,0.906691,15262,0.646024,Cammeo
...,...,...,...,...,...,...,...,...
3805,11441,415.858002,170.486771,85.756592,0.864280,11628,0.681012,Osmancik
3806,11625,421.390015,167.714798,89.462570,0.845850,11904,0.694279,Osmancik
3807,12437,442.498993,183.572922,86.801979,0.881144,12645,0.626739,Osmancik
3808,9882,392.296997,161.193985,78.210480,0.874406,10097,0.659064,Osmancik


Attribute Information:
1. Area: Returns the number of pixels within the boundaries of the rice grain.
2. Perimeter: Calculates the circumference by calculating the distance between pixels around the boundaries of the rice grain.
3. Major Axis Length: The longest line that can be drawn on the rice grain, i.e. the main axis distance.
4. Minor Axis Length: The shortest line that can be drawn on the rice grain, i.e. the small axis distance.
5. Eccentricity: It measures how round the ellipse, which has the same moments as the rice grain, is.
6. Convex Area: Returns the pixel count of the smallest convex shell of the region formed by the rice grain.
7. Extent: Returns the ratio of the region formed by the rice grain to the bounding box pixels.
8. Class: Cammeo and Osmancik.

The eighth attribute is the target: Cammeo (1) or Osmancik (0).

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   AREA          3810 non-null   int64  
 1   PERIMETER     3810 non-null   float64
 2   MAJORAXIS     3810 non-null   float64
 3   MINORAXIS     3810 non-null   float64
 4   ECCENTRICITY  3810 non-null   float64
 5   CONVEX_AREA   3810 non-null   int64  
 6   EXTENT        3810 non-null   float64
 7   CLASS         3810 non-null   object 
dtypes: float64(5), int64(2), object(1)
memory usage: 238.2+ KB


In [5]:
df.describe()

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT
count,3810.0,3810.0,3810.0,3810.0,3810.0,3810.0,3810.0
mean,12667.727559,454.23918,188.776222,86.31375,0.886871,12952.49685,0.661934
std,1732.367706,35.597081,17.448679,5.729817,0.020818,1776.972042,0.077239
min,7551.0,359.100006,145.264465,59.532406,0.777233,7723.0,0.497413
25%,11370.5,426.144752,174.353855,82.731695,0.872402,11626.25,0.598862
50%,12421.5,448.852493,185.810059,86.434647,0.88905,12706.5,0.645361
75%,13950.0,483.683746,203.550438,90.143677,0.902588,14284.0,0.726562
max,18913.0,548.445984,239.010498,107.54245,0.948007,19099.0,0.86105


In [6]:
# check the class balance:

df['CLASS'].value_counts()

Osmancik    2180
Cammeo      1630
Name: CLASS, dtype: int64

In [7]:
# binary-encode the target variable:

df['CLASS'] = df['CLASS'].map({'Cammeo': 1, 'Osmancik': 0})

In [8]:
# sanity check:

df.head(3)

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,CLASS
0,15231,525.578979,229.749878,85.093788,0.928882,15617,0.572896,1
1,14656,494.311005,206.020065,91.730972,0.895405,15072,0.615436,1
2,14634,501.122009,214.106781,87.768288,0.912118,14954,0.693259,1


In [9]:
# split the data into train and test parts and train a model (I chose CatBoost):

from sklearn.model_selection import train_test_split

X_data = df.drop('CLASS', axis=1)
y_data = df['CLASS']

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

In [10]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(random_state=42, verbose=100)

model.fit(X_train, y_train)
y_predict = model.predict(X_test)

Learning rate set to 0.016581
0:	learn: 0.6655457	total: 148ms	remaining: 2m 28s
100:	learn: 0.1889111	total: 747ms	remaining: 6.65s
200:	learn: 0.1659009	total: 1.4s	remaining: 5.58s
300:	learn: 0.1548290	total: 1.98s	remaining: 4.59s
400:	learn: 0.1452210	total: 2.56s	remaining: 3.82s
500:	learn: 0.1364113	total: 3.13s	remaining: 3.12s
600:	learn: 0.1288964	total: 3.7s	remaining: 2.46s
700:	learn: 0.1212032	total: 4.27s	remaining: 1.82s
800:	learn: 0.1139823	total: 4.84s	remaining: 1.2s
900:	learn: 0.1074276	total: 5.42s	remaining: 596ms
999:	learn: 0.1010663	total: 5.99s	remaining: 0us


In [11]:
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve

In [12]:
metrics_df = pd.DataFrame(columns=['model', 'thresh', 'F-Score', 'Precision', 'Recall', 'ROC AUC'])
metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC


In [13]:
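# note: y_predict holds hard 0/1 labels from model.predict, so the curve has a single
# non-trivial operating point and the "best" threshold printed below is simply 1
precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest F-score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')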

Best Threshold=1, F-Score=0.921, Precision=0.928, Recall=0.914
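
The threshold comes out as exactly 1 because `model.predict` returns hard labels, so there is nothing to tune. Below is a minimal sketch of how the same "best F-score" search behaves on continuous scores. The helper name `best_f1_threshold` and the toy data are my own illustration, not part of the notebook; in the notebook the scores would presumably come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Return (threshold, f1, precision, recall) that maximizes F1 over score cut-offs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best = (None, -1.0, 0.0, 0.0)
    for t in np.unique(scores):           # every distinct score is a candidate cut-off
        pred = scores >= t                # predict positive at or above the cut-off
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue                      # F1 is undefined/zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (float(t), float(f1), float(precision), float(recall))
    return best

# Toy data, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
threshold, f1, precision, recall = best_f1_threshold(y_true, scores)
print(threshold, round(f1, 3), precision, recall)  # 0.35 0.857 0.75 1.0
```

With probabilities the chosen threshold can land anywhere between 0 and 1, which is what makes the `thresh` column of the metrics table informative.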


In [14]:
# y_predict holds hard labels, so this value equals balanced accuracy: (TPR + TNR) / 2
roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

0.9268030513176144
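
Since the ROC AUC here is computed from hard 0/1 predictions, it reduces to balanced accuracy, (TPR + TNR) / 2. The arithmetic below checks this against the reported value. The confusion-matrix counts are my reconstruction from the class counts, the split size, and the reported precision and recall; the notebook never prints them directly.

```python
# Reconstructed counts (an inference from the outputs above, not a printed result):
# 1630 Cammeo in total, 1280 of them in the training split -> 350 positives in the test split;
# test_size=0.2 of 3810 rows -> 762 test rows -> 412 negatives.
test_pos = 1630 - 1280
test_neg = 762 - test_pos

tp = round(0.914286 * test_pos)        # recall = TP / positives            -> 320
predicted_pos = round(tp / 0.927536)   # precision = TP / predicted positives -> 345
fp = predicted_pos - tp                # false positives                     -> 25

tpr = tp / test_pos                    # true positive rate (recall)
tnr = (test_neg - fp) / test_neg       # true negative rate
auc_from_hard_labels = (tpr + tnr) / 2
print(auc_from_hard_labels)            # ~0.926803, the value reported above
```

A ranking-based ROC AUC would require scores such as `model.predict_proba(X_test)[:, 1]` instead of labels.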

In [15]:
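# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'supervised',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)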

metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803


#### Now PU learning (25%)

In [16]:
# pretend that the negatives and part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 25% of the positives marked
perc = 0.25
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 320/1280 as positives and unlabeling the rest


In [17]:
# create a column for the new target variable with two classes, P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    2728
 1     320
Name: class_test, dtype: int64


* 320 positive examples (1)
* 2728 unlabeled examples (-1)
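
The labeling step can be wrapped in a small function. This is a sketch only: the name `make_pu_labels` and the seeded `np.random.default_rng` generator are my additions (the cells above use the unseeded `np.random.shuffle`), and the array `y` just reproduces the class counts of the training split.

```python
import numpy as np

def make_pu_labels(y, labeled_share, rng):
    """Return PU labels: 1 for a random share of the positives, -1 for everything else."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)                         # positions of true positives
    n_labeled = int(np.ceil(labeled_share * len(pos_idx)))   # same rounding as the notebook
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)
    pu = np.full(len(y), -1)                                 # everything starts as unlabeled
    pu[labeled] = 1                                          # reveal only the chosen positives
    return pu

# Same class counts as the training split: 1280 positives, 1768 negatives.
y = np.array([1] * 1280 + [0] * 1768)
pu = make_pu_labels(y, labeled_share=0.25, rng=np.random.default_rng(42))
print((pu == 1).sum(), (pu == -1).sum())  # 320 2728
```

Passing the generator in explicitly makes each run repeatable, which matters when comparing different labeled shares.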

In [18]:
mod_data.head(10)

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,label,class_test
0,12529,437.838989,174.86145,92.189262,0.849733,12840,0.766019,0,-1
1,11051,424.976013,180.871903,78.26738,0.901527,11240,0.568058,0,-1
2,12975,463.851013,196.423966,85.064117,0.901363,13358,0.609126,1,-1
3,10398,405.678986,162.227158,82.393456,0.861422,10658,0.644717,0,-1
4,14541,492.785004,204.257141,92.471016,0.891653,14893,0.758292,1,-1
5,10870,409.490997,169.20607,82.095322,0.874415,11030,0.670243,0,-1
6,13913,493.606995,212.985474,83.991348,0.918959,14218,0.572128,1,-1
7,14720,494.862,207.092712,91.498894,0.897101,15071,0.704003,1,-1
8,11136,427.109985,175.653076,81.918777,0.884591,11474,0.574376,0,-1
9,14115,483.779999,206.841873,88.068649,0.904828,14312,0.747656,1,-1


#### random negative sampling

In [19]:
# note: mod_data still holds the true target (`label`, second-to-last column), kept only for quality checks;
# the last column (`class_test`) is the PU labeling that is actually used for training:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(320, 9) (320, 9)
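
The same sampling step, written against index arrays instead of DataFrame slices. It is a sketch under my own naming (`random_negative_sample` is hypothetical) and uses a seeded generator, which the cell above does not.

```python
import numpy as np

def random_negative_sample(pu_labels, rng):
    """Pick as many unlabeled rows as there are labeled positives and treat them as negatives.

    Returns (train_idx, train_target, rest_idx): row positions of the balanced
    training set, its 0/1 targets, and the unlabeled rows left out of training.
    """
    pu_labels = np.asarray(pu_labels)
    pos_idx = np.flatnonzero(pu_labels == 1)
    unl_idx = rng.permutation(np.flatnonzero(pu_labels == -1))  # shuffle the unlabeled rows
    neg_idx, rest_idx = unl_idx[:len(pos_idx)], unl_idx[len(pos_idx):]
    train_idx = np.concatenate([pos_idx, neg_idx])
    train_target = np.concatenate([np.ones(len(pos_idx), dtype=int),
                                   np.zeros(len(neg_idx), dtype=int)])
    return train_idx, train_target, rest_idx

# Same P/U sizes as in the 25% run.
pu_labels = np.array([1] * 320 + [-1] * 2728)
train_idx, train_target, rest_idx = random_negative_sample(pu_labels, np.random.default_rng(0))
print(len(train_idx), int(train_target.sum()), len(rest_idx))  # 640 320 2408
```

Note that the sampled "negatives" are only assumed to be negative: some of them are hidden positives, which is the price of this simple approach.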


In [20]:
sample_train

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,label,class_test
2831,13548,479.748993,203.271881,85.602463,0.907004,13850,0.554428,1,1
366,13683,481.571014,206.926773,84.899063,0.911957,13922,0.762199,1,1
2025,12743,449.230011,180.944763,90.807716,0.864953,13044,0.619374,0,-1
1850,14734,498.707001,211.000961,90.111481,0.904220,15171,0.784600,1,1
1882,14421,489.817993,204.956833,90.341690,0.897613,14716,0.625260,1,1
...,...,...,...,...,...,...,...,...,...
2587,11559,435.184998,178.782074,83.545471,0.884097,11796,0.666378,0,-1
2925,15072,487.584992,202.494736,95.355270,0.882185,15267,0.790559,1,1
2105,13415,464.292999,191.957779,89.882965,0.883600,13660,0.580661,0,-1
364,11413,421.953003,171.325287,85.999313,0.864888,11688,0.699326,0,-1


In [21]:
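model = CatBoostClassifier(random_state=42, verbose=100)
# the sampled unlabeled rows act as pseudo-negatives: recode -1 to 0 for the binary classifier
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

# drop both target columns so the true label does not leak into the features
model.fit(sample_train.drop(columns=['class_test', 'label']),
          sample_train['class_test'])

y_predict = model.predict(X_test)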

Learning rate set to 0.008515
0:	learn: 0.6883198	total: 8.08ms	remaining: 8.07s
100:	learn: 0.4886151	total: 492ms	remaining: 4.38s
200:	learn: 0.4414077	total: 1.03s	remaining: 4.08s
300:	learn: 0.4140807	total: 1.56s	remaining: 3.61s
400:	learn: 0.3930665	total: 2.21s	remaining: 3.3s
500:	learn: 0.3739091	total: 2.8s	remaining: 2.79s
600:	learn: 0.3555164	total: 3.24s	remaining: 2.15s
700:	learn: 0.3373478	total: 3.69s	remaining: 1.57s
800:	learn: 0.3205914	total: 4.11s	remaining: 1.02s
900:	learn: 0.3027707	total: 4.56s	remaining: 501ms
999:	learn: 0.2867361	total: 5.01s	remaining: 0us


In [22]:
precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

Best Threshold=1, F-Score=0.900, Precision=0.891, Recall=0.909


In [23]:
roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

0.9069556171983356

In [24]:
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (25%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.899576,0.890756,0.908571,0.906956
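
The remaining runs repeat the same steps with a different labeled share, so the whole experiment could be expressed as one function called in a loop. The sketch below is an assumption-laden illustration, not the notebook's code: `run_pu_experiment` and `CentroidClassifier` are hypothetical names, the centroid model is only a stand-in so the block runs without CatBoost, and the data is synthetic.

```python
import numpy as np

class CentroidClassifier:
    """Stand-in model (nearest class centroid); NOT the CatBoost model used in the notebook."""
    def fit(self, X, y):
        self.c0_ = X[y == 0].mean(axis=0)
        self.c1_ = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0_, axis=1)
        d1 = np.linalg.norm(X - self.c1_, axis=1)
        return (d1 < d0).astype(int)

def run_pu_experiment(X_train, y_train, X_test, y_test, share, make_model, seed=42):
    """Keep `share` of the positives labeled, train on P plus |P| random rows of U, score on test."""
    rng = np.random.default_rng(seed)
    pos_idx = rng.permutation(np.flatnonzero(y_train == 1))
    n_labeled = int(np.ceil(share * len(pos_idx)))
    labeled = pos_idx[:n_labeled]                                    # the set P
    unlabeled = np.setdiff1d(np.arange(len(y_train)), labeled)       # the set U
    pseudo_neg = rng.choice(unlabeled, size=n_labeled, replace=False)
    X_fit = np.vstack([X_train[labeled], X_train[pseudo_neg]])
    y_fit = np.concatenate([np.ones(n_labeled, dtype=int), np.zeros(n_labeled, dtype=int)])
    pred = make_model().fit(X_fit, y_fit).predict(X_test)
    tp = np.sum((pred == 1) & (y_test == 1))
    precision = tp / max(np.sum(pred == 1), 1)
    recall = tp / max(np.sum(y_test == 1), 1)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return {'share': share, 'n_labeled': n_labeled, 'F-Score': float(f1),
            'Precision': float(precision), 'Recall': float(recall)}

# Synthetic two-cluster data, invented so the sketch runs on its own.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)), rng.normal(3.0, 1.0, size=(200, 2))])
y = np.array([0] * 300 + [1] * 200)
order = rng.permutation(len(y))
train, test = order[:400], order[400:]

results = [run_pu_experiment(X[train], y[train], X[test], y[test], share, CentroidClassifier)
           for share in (0.1, 0.25, 0.5)]
for row in results:
    print(row)
```

In the notebook one would presumably pass `X_train.to_numpy()`, `y_train.to_numpy()` and a factory such as `lambda: CatBoostClassifier(random_state=42, verbose=0)`; the numbers printed by this sketch say nothing about the rice data.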


#### 10%

In [25]:
# pretend that the negatives and part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 10% of the positives marked
perc = 0.1
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

# create a column for the new target variable with two classes, P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

# note: mod_data still holds the true target (`label`, second-to-last column), kept only for quality checks;
# the last column (`class_test`) is the PU labeling that is actually used for training:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

model = CatBoostClassifier(random_state=42, verbose=100)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model.fit(sample_train.drop(columns=['class_test', 'label']), 
          sample_train['class_test'])

y_predict = model.predict(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (10%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Using 128/1280 as positives and unlabeling the rest
target variable:
 -1    2920
 1     128
Name: class_test, dtype: int64
(128, 9) (128, 9)
Learning rate set to 0.005758
0:	learn: 0.6897787	total: 5.27ms	remaining: 5.26s
100:	learn: 0.4962274	total: 506ms	remaining: 4.5s
200:	learn: 0.4210104	total: 951ms	remaining: 3.78s
300:	learn: 0.3768758	total: 1.4s	remaining: 3.25s
400:	learn: 0.3428081	total: 1.91s	remaining: 2.85s
500:	learn: 0.3146054	total: 2.36s	remaining: 2.35s
600:	learn: 0.2909506	total: 2.83s	remaining: 1.88s
700:	learn: 0.2688903	total: 3.27s	remaining: 1.4s
800:	learn: 0.2498657	total: 3.83s	remaining: 951ms
900:	learn: 0.2310333	total: 4.36s	remaining: 480ms
999:	learn: 0.2134452	total: 4.8s	remaining: 0us
Best Threshold=1, F-Score=0.868, Precision=0.884, Recall=0.851


Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.899576,0.890756,0.908571,0.906956
2,pu-learning (10%),1,0.86754,0.884273,0.851429,0.878384


#### 50%

In [26]:
# pretend that the negatives and part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 50% of the positives marked
perc = 0.5
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

# create a column for the new target variable with two classes, P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

# note: mod_data still holds the true target (`label`, second-to-last column), kept only for quality checks;
# the last column (`class_test`) is the PU labeling that is actually used for training:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

model = CatBoostClassifier(random_state=42, verbose=100)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model.fit(sample_train.drop(columns=['class_test', 'label']), 
          sample_train['class_test'])

y_predict = model.predict(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (50%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Using 640/1280 as positives and unlabeling the rest
target variable:
 -1    2408
 1     640
Name: class_test, dtype: int64
(640, 9) (640, 9)
Learning rate set to 0.011448
0:	learn: 0.6852183	total: 6.47ms	remaining: 6.46s
100:	learn: 0.4364951	total: 573ms	remaining: 5.1s
200:	learn: 0.3981203	total: 1.06s	remaining: 4.24s
300:	learn: 0.3767372	total: 1.66s	remaining: 3.86s
400:	learn: 0.3599859	total: 2.21s	remaining: 3.31s
500:	learn: 0.3435599	total: 2.75s	remaining: 2.74s
600:	learn: 0.3283650	total: 3.28s	remaining: 2.18s
700:	learn: 0.3117653	total: 3.79s	remaining: 1.61s
800:	learn: 0.2958129	total: 4.32s	remaining: 1.07s
900:	learn: 0.2801271	total: 4.96s	remaining: 545ms
999:	learn: 0.2657112	total: 5.44s	remaining: 0us
Best Threshold=1, F-Score=0.916, Precision=0.910, Recall=0.923


Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.899576,0.890756,0.908571,0.906956
2,pu-learning (10%),1,0.86754,0.884273,0.851429,0.878384
3,pu-learning (50%),1,0.916312,0.909859,0.922857,0.922594


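#### Brief conclusions:

A labeled share of 25% of the positives was taken as the baseline size of P, and the metrics were already fairly high (F-Score 0.900 against 0.921 for the supervised model).

When P shrinks to 10%, the metrics drop accordingly (F-Score 0.868, ROC AUC 0.878).

When P grows to 50%, the metrics rise accordingly (F-Score 0.916, ROC AUC 0.923) and come close to the supervised baseline.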
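
One plausible reason for this trend (my interpretation, not something the notebook measures) is that a larger P gives more training rows and also leaves fewer hidden positives in U, so the randomly sampled "negatives" are cleaner. The arithmetic below uses only counts printed above; the variable names are mine.

```python
train_positives = 1280   # positives in the training split
train_size = 3048        # 320 labeled + 2728 unlabeled rows in the 25% run

contamination_by_share = {}
for share, n_labeled in [(0.10, 128), (0.25, 320), (0.50, 640)]:
    hidden_pos = train_positives - n_labeled   # positives that ended up unlabeled
    n_unlabeled = train_size - n_labeled       # size of U
    contamination_by_share[share] = hidden_pos / n_unlabeled
    print(f'{share:.0%} labeled: {hidden_pos}/{n_unlabeled} = '
          f'{contamination_by_share[share]:.1%} of U is actually positive')
# 10% labeled: 1152/2920 = 39.5% of U is actually positive
# 25% labeled: 960/2728 = 35.2% of U is actually positive
# 50% labeled: 640/2408 = 26.6% of U is actually positive
```

So in expectation roughly a third of the pseudo-negatives in the 25% run are really Cammeo grains, and that fraction shrinks as more positives are labeled.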

Positive-Unlabeled (PU) learning means, literally, learning from positive and unlabeled data.

In essence, PU learning is the counterpart of binary classification for cases where labeled data exists for only one of the classes, but an unlabeled mixture of both classes is available.

In the general case we do not even know how many examples in the mixture belong to the positive class and how many to the negative one. From such datasets we want to build a binary classifier, just like the one we would build with labeled data for both classes.