# Practice assignment: Handling imbalanced data

This assignment is graded by your `submission.json`.

The cell below creates a valid `submission.json` file. Fill in your answers there.

You can press "Submit Assignment" at any time to submit partial progress.

In [155]:
%%file submission.json
{
    "q1": 0.06124,
    "q2": 7,
    "q3": 2,
    "q4": 20,
    "q5": 1,
    "q6": -202.06375,
    "q7": 0.00813,
    "q8": 0.00035,
    "q9": 0.87818,
    "q10": 0.87762,
    "q11": 0.95267,
    "q12": 0.82158,
    "q13": 0.88737,
    "q14": 0.92015,
    "q15": 0.91959,
    "q16": 0.91153,
    "q17": 0.92072,
    "q18": 0.91097
}

Overwriting submission.json


In this programming assignment, you are going to work with a dataset based on the following data:

https://archive.ics.uci.edu/ml/datasets/thyroid+disease

_Citation:_

* _(Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)_

The dataset contains various attributes of patients. Some of them have a thyroid disease (`'Class' = 1`), some of them don't have it (`'Class' = 0`).

The data is imbalanced. In this assignment, you are going to preprocess the data and apply various techniques for imbalanced classification.

In [2]:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import NearMiss, RandomUnderSampler, TomekLinks
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [3]:
df = pd.read_csv('data.csv')

## 1

**q1:** What proportion of patients in this data has a thyroid disease? Provide the answer (a number from 0 to 1), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [4]:
df.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Class
0,41.0,F,f,f,f,f,f,f,f,f,...,t,125.0,t,1.14,t,109.0,f,,SVHC,0
1,23.0,F,f,f,f,f,f,f,f,f,...,t,102.0,f,,f,,f,,other,0
2,46.0,M,f,f,f,f,f,f,f,f,...,t,109.0,t,0.91,t,120.0,f,,other,0
3,70.0,F,t,f,f,f,f,f,f,f,...,t,175.0,f,,f,,f,,other,0
4,70.0,F,f,f,f,f,f,f,f,f,...,t,61.0,t,0.87,t,70.0,f,,SVI,0


In [12]:
c1 = df[df['Class'] == 1]
c1_count = c1.shape[0]
total = df.shape[0]

round(c1_count / total, 5)

0.06124
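
Since `'Class'` only takes the values 0 and 1, the same proportion can also be read off as the column mean. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the assignment data):

```python
import pandas as pd

# Hypothetical stand-in for the real data: 1 = thyroid disease, 0 = no disease.
toy = pd.DataFrame({"Class": [0, 0, 0, 1, 0, 0, 1, 0]})

# For a 0/1 column, the mean equals the share of ones.
positive_share = toy["Class"].mean()

# value_counts(normalize=True) reports the share of every class at once.
class_shares = toy["Class"].value_counts(normalize=True)

print(round(positive_share, 5))
print(class_shares)
```

On the real data, `round(df['Class'].mean(), 5)` should agree with the count-based computation above.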

## 2

**q2:** How many columns contain missing values (NaN)?

In [22]:
df.isna().sum()

age                             1
sex                             0
on_thyroxine                    0
query_on_thyroxine              0
on_antithyroid_medication       0
sick                            0
pregnant                        0
thyroid_surgery                 0
I131_treatment                  0
query_hypothyroid               0
query_hyperthyroid              0
lithium                         0
goitre                          0
tumor                           0
hypopituitary                   0
psych                           0
TSH_measured                    0
TSH                           369
T3_measured                     0
T3                            769
TT4_measured                    0
TT4                           231
T4U_measured                    0
T4U                           387
FTI_measured                    0
FTI                           385
TBG_measured                    0
TBG                          3772
referral_source                 0
Class         
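
Instead of counting the non-zero rows of this output by eye, the count can be computed directly. A minimal sketch on a made-up frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two columns contain NaN, one does not.
toy = pd.DataFrame({
    "age": [41.0, np.nan, 46.0],
    "sex": ["F", "F", "M"],
    "TSH": [1.3, np.nan, np.nan],
})

# isna().any() is True for each column holding at least one NaN.
has_nan = toy.isna().any()

n_cols_with_nan = int(has_nan.sum())
cols_with_nan = has_nan[has_nan].index.tolist()

print(n_cols_with_nan, cols_with_nan)
```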

## 3

**q3:** How many columns contain only one unique value (count NaNs too)? If the number is bigger than 0, drop these columns.

In [24]:
df.columns[df.nunique() <= 1]

Index(['TBG_measured', 'TBG'], dtype='object')

In [26]:
df = df.drop(['TBG_measured', 'TBG'], axis=1)
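
One caveat about the check above: `nunique()` skips NaN by default, so `<= 1` also flags a column that holds a single real value plus NaNs, which has two distinct values once NaN is counted. Passing `dropna=False` counts NaN as a value. A small sketch on made-up data (the `toy` frame is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "all_nan": [np.nan, np.nan, np.nan],   # only NaN
    "constant": ["f", "f", "f"],           # one real value
    "value_or_nan": [1.0, np.nan, 1.0],    # one real value plus NaN
})

# Default behaviour: NaN is ignored when counting distinct values.
default_counts = toy.nunique()

# dropna=False: NaN counts as its own value.
with_nan_counts = toy.nunique(dropna=False)

# Columns with exactly one distinct value when NaN is counted too.
single_valued = toy.columns[with_nan_counts == 1].tolist()

print(default_counts.tolist(), with_nan_counts.tolist(), single_valued)
```

On this assignment's data both variants presumably select the same two columns, but that is worth verifying rather than assuming.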

In [27]:
df.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,referral_source,Class
0,41.0,F,f,f,f,f,f,f,f,f,...,t,2.5,t,125.0,t,1.14,t,109.0,SVHC,0
1,23.0,F,f,f,f,f,f,f,f,f,...,t,2.0,t,102.0,f,,f,,other,0
2,46.0,M,f,f,f,f,f,f,f,f,...,f,,t,109.0,t,0.91,t,120.0,other,0
3,70.0,F,t,f,f,f,f,f,f,f,...,t,1.9,t,175.0,f,,f,,other,0
4,70.0,F,f,f,f,f,f,f,f,f,...,t,1.2,t,61.0,t,0.87,t,70.0,SVI,0


## 4

**q4:** Calculate the number of binary columns (only two unique values) with `'object'` data types. Transform them with `LabelEncoder` so that their values become numbers.

In [65]:
bin_columns = list(set(list(df.select_dtypes(['object']).columns)) & set(list(df.columns[df.nunique() == 2])))
len(bin_columns)

20

In [66]:
labelencoder = LabelEncoder()

for column in bin_columns:
    column_name = f'{column}_enc'
    df[column_name] = labelencoder.fit_transform(df[column])
    df = df.drop([column], axis=1)
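
To see which number each label received, inspect the encoder's `classes_` attribute: `LabelEncoder` sorts the labels and assigns codes in that order. A minimal sketch on a made-up column (the `toy` series is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column using the same 'f'/'t' labels as the data.
toy = pd.Series(["f", "t", "f", "t", "f"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(encoded.tolist(), mapping)
```

Because the loop above refits the same encoder on every column, `classes_` only describes the most recently encoded column.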

In [67]:
df.head()

Unnamed: 0,age,TSH,T3,TT4,T4U,FTI,referral_source,Class,TT4_measured_enc,query_hyperthyroid_enc,...,pregnant_enc,I131_treatment_enc,tumor_enc,lithium_enc,TSH_measured_enc,T4U_measured_enc,thyroid_surgery_enc,sex_enc,hypopituitary_enc,goitre_enc
0,41.0,1.3,2.5,125.0,1.14,109.0,SVHC,0,1,0,...,0,0,0,0,1,1,0,0,0,0
1,23.0,4.1,2.0,102.0,,,other,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,46.0,0.98,,109.0,0.91,120.0,other,0,1,0,...,0,0,0,0,1,1,0,1,0,0
3,70.0,0.16,1.9,175.0,,,other,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,70.0,0.72,1.2,61.0,0.87,70.0,SVI,0,1,0,...,0,0,0,0,1,1,0,0,0,0


## 5

**q5:** How many categorical columns with `'object'` data types are remaining in the data? Encode them with One-Hot encoding (with `pandas.get_dummies()`) the same way as in the programming assignment in week 1.

In [68]:
df.select_dtypes(['object']).columns

Index(['referral_source'], dtype='object')

In [70]:
referral_source_df = pd.get_dummies(df['referral_source'])
referral_source_df.head()

Unnamed: 0,STMW,SVHC,SVHD,SVI,other
0,0,1,0,0,0
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,1,0


In [71]:
referral_source_df = referral_source_df.add_prefix('ref_')
referral_source_df.columns = referral_source_df.columns.str.strip().str.lower()
referral_source_df.head()

Unnamed: 0,ref_stmw,ref_svhc,ref_svhd,ref_svi,ref_other
0,0,1,0,0,0
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,1,0


In [72]:
df_concat = pd.concat([df, referral_source_df], axis=1)
df_concat.head()

Unnamed: 0,age,TSH,T3,TT4,T4U,FTI,referral_source,Class,TT4_measured_enc,query_hyperthyroid_enc,...,T4U_measured_enc,thyroid_surgery_enc,sex_enc,hypopituitary_enc,goitre_enc,ref_stmw,ref_svhc,ref_svhd,ref_svi,ref_other
0,41.0,1.3,2.5,125.0,1.14,109.0,SVHC,0,1,0,...,1,0,0,0,0,0,1,0,0,0
1,23.0,4.1,2.0,102.0,,,other,0,1,0,...,0,0,0,0,0,0,0,0,0,1
2,46.0,0.98,,109.0,0.91,120.0,other,0,1,0,...,1,0,1,0,0,0,0,0,0,1
3,70.0,0.16,1.9,175.0,,,other,0,1,0,...,0,0,0,0,0,0,0,0,0,1
4,70.0,0.72,1.2,61.0,0.87,70.0,SVI,0,1,0,...,1,0,0,0,0,0,0,0,1,0


In [73]:
df_concat = df_concat.drop(['referral_source'], axis=1)
df_concat.head()

Unnamed: 0,age,TSH,T3,TT4,T4U,FTI,Class,TT4_measured_enc,query_hyperthyroid_enc,psych_enc,...,T4U_measured_enc,thyroid_surgery_enc,sex_enc,hypopituitary_enc,goitre_enc,ref_stmw,ref_svhc,ref_svhd,ref_svi,ref_other
0,41.0,1.3,2.5,125.0,1.14,109.0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
1,23.0,4.1,2.0,102.0,,,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,46.0,0.98,,109.0,0.91,120.0,0,1,0,0,...,1,0,1,0,0,0,0,0,0,1
3,70.0,0.16,1.9,175.0,,,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,70.0,0.72,1.2,61.0,0.87,70.0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0


In [74]:
df = df_concat
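
The encode, prefix, concatenate and drop steps above can also be collapsed into one `pd.get_dummies` call with the `columns` and `prefix` arguments. The sketch below uses a made-up frame (`toy` is hypothetical) and is an alternative, not the approach taken in the cells above:

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [41.0, 23.0, 70.0],
    "referral_source": ["SVHC", "other", "SVI"],
})

# Encode only the listed column, drop the original, prefix the new columns.
encoded = pd.get_dummies(toy, columns=["referral_source"], prefix="ref", dtype=int)

# Lower-case the column names to match the naming used above.
encoded.columns = encoded.columns.str.lower()

print(encoded)
```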

## 6

We have encoded the categorical features, but we still have missing values. Fill them with the number -999.

**q6:** What is the mean value of the `'T3'` column now? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Don't be afraid if you see that the mean changed significantly after filling missing values. We just introduced a special category, and it won't influence tree-based models.

In [78]:
df.isna().sum()

age                                1
TSH                              369
T3                               769
TT4                              231
T4U                              387
FTI                              385
Class                              0
TT4_measured_enc                   0
query_hyperthyroid_enc             0
psych_enc                          0
query_hypothyroid_enc              0
on_thyroxine_enc                   0
sick_enc                           0
FTI_measured_enc                   0
query_on_thyroxine_enc             0
on_antithyroid_medication_enc      0
T3_measured_enc                    0
pregnant_enc                       0
I131_treatment_enc                 0
tumor_enc                          0
lithium_enc                        0
TSH_measured_enc                   0
T4U_measured_enc                   0
thyroid_surgery_enc                0
sex_enc                            0
hypopituitary_enc                  0
goitre_enc                         0
r

In [79]:
map_nan = {np.nan: -999}

In [82]:
df['TSH'] = df['TSH'].replace(map_nan)
df['T3']  = df['T3'].replace(map_nan)
df['TT4'] = df['TT4'].replace(map_nan)
df['T4U'] = df['T4U'].replace(map_nan)
df['FTI'] = df['FTI'].replace(map_nan)
df['age'] = df['age'].replace(map_nan)

In [83]:
df

Unnamed: 0,age,TSH,T3,TT4,T4U,FTI,Class,TT4_measured_enc,query_hyperthyroid_enc,psych_enc,...,T4U_measured_enc,thyroid_surgery_enc,sex_enc,hypopituitary_enc,goitre_enc,ref_stmw,ref_svhc,ref_svhd,ref_svi,ref_other
0,41.0,1.30,2.5,125.0,1.14,109.0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
1,23.0,4.10,2.0,102.0,-999.00,-999.0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,46.0,0.98,-999.0,109.0,0.91,120.0,0,1,0,0,...,1,0,1,0,0,0,0,0,0,1
3,70.0,0.16,1.9,175.0,-999.00,-999.0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,70.0,0.72,1.2,61.0,0.87,70.0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3767,30.0,-999.00,-999.0,-999.0,-999.00,-999.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3768,68.0,1.00,2.1,124.0,1.08,114.0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
3769,74.0,5.10,1.8,112.0,1.07,105.0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1
3770,72.0,0.70,2.0,82.0,0.94,87.0,0,1,0,0,...,1,0,1,0,0,0,0,0,1,0


In [98]:
round(df['T3'].mean(), 5)

-202.06375
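
The large shift in the mean follows directly from the arithmetic: `mean()` skips NaN before filling, while afterwards every filled cell contributes -999. A minimal sketch on made-up values (the `toy` frame is hypothetical), using `fillna` as a shorter equivalent of the per-column `replace` calls:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "T3": [2.5, 2.0, np.nan, 1.9],
    "TT4": [125.0, np.nan, 109.0, 175.0],
})

# Before filling, NaN is skipped: (2.5 + 2.0 + 1.9) / 3.
mean_before = toy["T3"].mean()

# fillna replaces every NaN in the frame with the sentinel value.
filled = toy.fillna(-999)

# After filling, the sentinel is averaged in: (2.5 + 2.0 - 999 + 1.9) / 4.
mean_after = filled["T3"].mean()

print(mean_before, mean_after)
```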

## 7

Finally, we have preprocessed the data. Next, we separate the target from the dataframe with features (`df` -> `X`, `y`).

Split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn`. Test size should be 0.25 of the whole data. Use `random_state=13`, so that your results are reproducible and similar to the original ones.

**q7:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the proportion of patients in the test set who have a thyroid disease. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [84]:
X = df.drop('Class', axis=1)
y = df['Class']

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=13)

In [92]:
c1_train = y_train[y_train == 1]
c1_train_count = c1_train.shape[0]
total_train = y_train.shape[0]

c1_train_part = c1_train_count / total_train
c1_train_part

0.06327324142806645

In [94]:
c1_test = y_test[y_test == 1]
c1_test_count = c1_test.shape[0]
total_test = y_test.shape[0]

c1_test_part = c1_test_count / total_test
c1_test_part

0.05514316012725345

In [100]:
round(abs(c1_train_part - c1_test_part), 5)

0.00813

## 8

Now split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn` with the same parameters as in task 7, but also add the `stratify=y` parameter for stratification. This may help to make the positive class proportions in train and test more similar.

**q8:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the proportion of patients in the test set who have a thyroid disease. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Is it bigger or smaller than the similar number in the previous task?

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=13, stratify=y)

In [106]:
c1_train = y_train[y_train == 1]
c1_train_count = c1_train.shape[0]
total_train = y_train.shape[0]

c1_train_part = c1_train_count / total_train
c1_train_part

0.06115235065394132

In [107]:
c1_test = y_test[y_test == 1]
c1_test_count = c1_test.shape[0]
total_test = y_test.shape[0]

c1_test_part = c1_test_count / total_test
c1_test_part

0.061505832449628844

In [108]:
round(abs(c1_train_part - c1_test_part), 5)

0.00035
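
The effect of `stratify` is easiest to see on labels whose class counts divide evenly between the two parts. The sketch below uses 1000 made-up labels with 6% positives (all data here is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives and 940 negatives.
y = np.array([1] * 60 + [0] * 940)
X = np.arange(1000).reshape(-1, 1)  # dummy single feature

# stratify=y keeps the class proportions the same in both parts.
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=13, stratify=y
)

print(y_tr.mean(), y_te.mean())
```

Here 25% of the 60 positives is exactly 15, so both parts keep a 6% positive share. On the real data the counts do not divide evenly, which is why a small difference such as 0.00035 remains.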

## 9

Let's move on to modeling. First, we write two functions that estimate the quality of a model's predictions on the test set using several metrics.

In this and all the following tasks, use the same train and test sets that you obtained in task 8 (with stratification).

Train a Random Forest classifier from `sklearn` with 50 estimators and `random_state=13`, and let all other parameters have the default values. Fit it on the train set and obtain predictions for the test set. Run the function which computes scores. 

**q9:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [110]:
def compute_metrics(y_test, y_pred):
    print('Accuracy: {:.5f}'.format(accuracy_score(y_test, y_pred)))
    print('F-score: {:.5f}'.format(f1_score(y_test, y_pred)))
    print('Precision: {:.5f}'.format(precision_score(y_test, y_pred)))
    print('Recall: {:.5f}'.format(recall_score(y_test, y_pred)))
    print('Accuracy (balanced): {:.5f}'.format(balanced_accuracy_score(y_test, y_pred)))
    print('MCC: {:.5f}'.format(matthews_corrcoef(y_test, y_pred)))

def compute_confusion_matrix(y_test, y_pred):
    compute_metrics(y_test, y_pred)
    return pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1, 0]),
        columns=['a(x) = 1', 'a(x) = 0'],
        index=['y = 1', 'y = 0'],
    ).T

In [114]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)

In [115]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98303
F-score: 0.84615
Precision: 0.95652
Recall: 0.75862
Accuracy (balanced): 0.87818
MCC: 0.84361


Unnamed: 0,y = 1,y = 0
a(x) = 1,44,2
a(x) = 0,14,883
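
For a binary problem, balanced accuracy is the average of the recall on the positive class and the recall on the negative class (specificity). As a check, the value printed above can be recomputed by hand from the four cells of this confusion matrix:

```python
# Cells of the confusion matrix shown above (task 9).
tp, fn = 44, 14   # y = 1: predicted 1, predicted 0
fp, tn = 2, 883   # y = 0: predicted 1, predicted 0

recall = tp / (tp + fn)          # share of sick patients found
specificity = tn / (tn + fp)     # share of healthy patients found
precision = tp / (tp + fp)       # share of positive predictions that are right

# Balanced accuracy: mean of the per-class recalls.
balanced_accuracy = (recall + specificity) / 2

print(round(recall, 5), round(precision, 5), round(balanced_accuracy, 5))
```

Plain accuracy is dominated by the 885 negative patients, which is why it stays near 0.98 even though about a quarter of the positive cases are missed.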


## 10

In this task, perform the same procedure as in task 9, but with the parameter `class_weight='balanced'` in the Random Forest classifier. 

**q10:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - did setting class weights improve the quality of the model?

In [118]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)

In [119]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98197
F-score: 0.83810
Precision: 0.93617
Recall: 0.75862
Accuracy (balanced): 0.87762
MCC: 0.83380


Unnamed: 0,y = 1,y = 0
a(x) = 1,44,3
a(x) = 0,14,882
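
According to the scikit-learn documentation, `class_weight='balanced'` sets each class weight to `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to made-up labels (the label counts are hypothetical):

```python
import numpy as np

# Hypothetical labels: 94 negatives and 6 positives.
y = np.array([0] * 94 + [1] * 6)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # the "balanced" heuristic

# With these weights, each class carries the same total weight.
total_weight_per_class = weights * counts

print(weights, total_weight_per_class)
```

The rarer class gets a proportionally larger weight, so errors on it cost more during training. This changes how trees are grown, but it does not add or remove any training rows.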


## 11

Let's try to balance the train set with different approaches. We will use the `imbalanced-learn` library (documentation: https://imbalanced-learn.org/stable/).

In this and all the following tasks, use the same Random Forest classifier settings as in task 10 (with `class_weight='balanced'`).

Let's start with random undersampling (`RandomUnderSampler`). Run it with the default parameter values and `random_state=13` on the initial train data (from task 8) to resample it. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q11:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did random undersampling method perform?

In [126]:
undersample = RandomUnderSampler(random_state=13)
X_res, y_res = undersample.fit_resample(X_train, y_train)

In [123]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [124]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.95652
F-score: 0.72848
Precision: 0.59140
Recall: 0.94828
Accuracy (balanced): 0.95267
MCC: 0.72953


Unnamed: 0,y = 1,y = 0
a(x) = 1,55,38
a(x) = 0,3,847
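
Conceptually, random undersampling keeps every minority row and a random subset of majority rows of the same size. The NumPy sketch below illustrates that idea on made-up data. It is a simplified illustration, not the `imbalanced-learn` implementation, and will not reproduce the rows `RandomUnderSampler` selects:

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical data: 90 majority rows (0) and 10 minority rows (1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep as many majority rows as there are minority rows, chosen at random.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

keep = np.sort(np.concatenate([minority_idx, kept_majority]))
X_res, y_res = X[keep], y[keep]

print(len(y_res), int(y_res.sum()))
```

Most majority rows are discarded, which is consistent with the pattern in the scores above: recall rises sharply while precision drops.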


## 12

Take the second version of `NearMiss` (`version=2`). Run it with `sampling_strategy=0.2`, `n_neighbors=3` and other default parameter values on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q12:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did NearMiss-2 method perform?

In [127]:
undersample = NearMiss(version=2, n_neighbors=3, sampling_strategy=0.2)
X_res, y_res = undersample.fit_resample(X_train, y_train)

In [128]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [129]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.77094
F-score: 0.32075
Precision: 0.19615
Recall: 0.87931
Accuracy (balanced): 0.82158
MCC: 0.34578


Unnamed: 0,y = 1,y = 0
a(x) = 1,51,209
a(x) = 0,7,676


## 13

Take the Tomek's links method (`TomekLinks`) with the default parameter values, run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q13:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did Tomek's links method perform? What was the best undersampling approach?

In [131]:
undersample = TomekLinks()
X_res, y_res = undersample.fit_resample(X_train, y_train)

In [132]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [133]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.86538
Precision: 0.97826
Recall: 0.77586
Accuracy (balanced): 0.88737
MCC: 0.86410


Unnamed: 0,y = 1,y = 0
a(x) = 1,45,1
a(x) = 0,13,884
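
A Tomek link is a pair of points from different classes that are each other's nearest neighbour. The method removes the majority-class member of each such pair, so it usually deletes only a few rows. The sketch below finds the links on a tiny made-up 1-D dataset with NumPy. It illustrates the definition and is not the `imbalanced-learn` implementation:

```python
import numpy as np

# Hypothetical 1-D data: class 0 is the majority here.
X = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y = np.array([0, 0, 1, 1, 1])

# Pairwise Euclidean distances; a point is never its own neighbour.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# Tomek link: mutual nearest neighbours with different labels.
links = [
    (i, int(j))
    for i, j in enumerate(nearest)
    if i < j and nearest[j] == i and y[i] != y[j]
]

# Drop the majority-class (label 0) member of each link.
to_drop = [i if y[i] == 0 else j for i, j in links]

print(links, to_drop)
```

Points 3 and 4 are also mutual nearest neighbours, but they share a label, so they do not form a Tomek link.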


## 14

Now let's move on to oversampling. Take a random oversampling approach (`RandomOverSampler`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data to resample it. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q14:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did random oversampling method perform?

In [135]:
oversample = RandomOverSampler(sampling_strategy=0.8, random_state=13)
X_res, y_res = oversample.fit_resample(X_train, y_train)

In [136]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [137]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98621
F-score: 0.88288
Precision: 0.92453
Recall: 0.84483
Accuracy (balanced): 0.92015
MCC: 0.87658


Unnamed: 0,y = 1,y = 0
a(x) = 1,49,4
a(x) = 0,9,881


## 15

Take SMOTE (`SMOTE`) with `sampling_strategy=0.8`, `k_neighbors=5`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q15:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did SMOTE method perform? Was it better than random oversampling?

In [139]:
oversample = SMOTE(sampling_strategy=0.8, random_state=13, k_neighbors=5)
X_res, y_res = oversample.fit_resample(X_train, y_train)

In [140]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [141]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87500
Precision: 0.90741
Recall: 0.84483
Accuracy (balanced): 0.91959
MCC: 0.86774


Unnamed: 0,y = 1,y = 0
a(x) = 1,49,5
a(x) = 0,9,880
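
Unlike random oversampling, which duplicates existing minority rows, SMOTE creates new points by interpolating between a minority sample and one of its minority-class nearest neighbours. The sketch below shows only that interpolation step on made-up points. The neighbour is picked by hand here, whereas SMOTE searches for the `k_neighbors` nearest ones:

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

base = minority[0]        # sample to oversample around
neighbour = minority[1]   # assumed to be one of its nearest minority neighbours

# Pick a random position on the segment between the two points.
lam = rng.uniform(0, 1)
synthetic = base + lam * (neighbour - base)

print(lam, synthetic)
```

Each synthetic point lies on the line segment between two real minority points.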


## 16

Take ADASYN (`ADASYN`) with `sampling_strategy=0.8`, `n_neighbors=5`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q16:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did ADASYN method perform? Was it better than SMOTE?

In [144]:
oversample = ADASYN(sampling_strategy=0.8, random_state=13, n_neighbors=5)
X_res, y_res = oversample.fit_resample(X_train, y_train)

In [145]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [146]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87273
Precision: 0.92308
Recall: 0.82759
Accuracy (balanced): 0.91153
MCC: 0.86632


Unnamed: 0,y = 1,y = 0
a(x) = 1,48,4
a(x) = 0,10,881


## 17

Take the first version of borderline SMOTE (`BorderlineSMOTE`, `kind='borderline-1'`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q17:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did BorderlineSMOTE-1 method perform? Was it better than SMOTE and ADASYN? What was the best oversampling approach?

In [148]:
oversample = BorderlineSMOTE(sampling_strategy=0.8, random_state=13, kind='borderline-1')
X_res, y_res = oversample.fit_resample(X_train, y_train)

In [149]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [150]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98727
F-score: 0.89091
Precision: 0.94231
Recall: 0.84483
Accuracy (balanced): 0.92072
MCC: 0.88566


Unnamed: 0,y = 1,y = 0
a(x) = 1,49,3
a(x) = 0,9,882


## 18

Finally, check the performance of the combination of oversampling and undersampling. Take SMOTE + Tomek's links (`SMOTETomek`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q18:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

What do you think, which approach was the best to deal with our data?

In [152]:
oversample = SMOTETomek(sampling_strategy=0.8, random_state=13)
X_res, y_res = oversample.fit_resample(X_train, y_train)

In [153]:
rfc = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rfc.fit(X_res, y_res)

y_pred = rfc.predict(X_test)

In [154]:
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98409
F-score: 0.86486
Precision: 0.90566
Recall: 0.82759
Accuracy (balanced): 0.91097
MCC: 0.85741


Unnamed: 0,y = 1,y = 0
a(x) = 1,48,5
a(x) = 0,10,880
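
To compare the approaches side by side, the balanced accuracy values printed in tasks 9 to 18 can be collected and sorted. The labels below are informal names chosen for this summary:

```python
import pandas as pd

# Balanced accuracy values copied from the outputs of tasks 9-18.
balanced_accuracy = pd.Series({
    "RF baseline (q9)": 0.87818,
    "RF class_weight (q10)": 0.87762,
    "RandomUnderSampler (q11)": 0.95267,
    "NearMiss-2 (q12)": 0.82158,
    "TomekLinks (q13)": 0.88737,
    "RandomOverSampler (q14)": 0.92015,
    "SMOTE (q15)": 0.91959,
    "ADASYN (q16)": 0.91153,
    "BorderlineSMOTE-1 (q17)": 0.92072,
    "SMOTETomek (q18)": 0.91097,
})

ranking = balanced_accuracy.sort_values(ascending=False)
print(ranking)
```

Balanced accuracy alone does not settle the question. Random undersampling ranks first on it, but its precision (0.59140) is far lower than that of the oversampling methods, so the best choice depends on how costly false positives are compared with missed cases.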
