# Practice assignment: Handling imbalanced data

In this programming assignment, you are going to work with a dataset based on the following data:

https://archive.ics.uci.edu/ml/datasets/thyroid+disease

_Citation:_

* _(Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)_

The dataset contains various attributes of patients. Some of them have a thyroid disease (`'Class' = 1`), some of them don't have it (`'Class' = 0`).

The data is imbalanced. In this assignment, you are going to preprocess the data and apply various techniques for imbalanced classification.

In [1]:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import NearMiss, RandomUnderSampler, TomekLinks
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, matthews_corrcoef, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv('data_4.csv')

## 1

**q1:** What proportion of patients in this data has a thyroid disease? Provide the answer (a number from 0 to 1), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [3]:
round(df.Class.value_counts(True)[1], 5)

0.06124

## 2

**q2:** How many columns contain missing values (NaN)?

In [4]:
bg = df.isna().sum()
bg[bg>0].shape

(7,)

## 3

**q3:** How many columns contain only one unique value (count NaNs too)? If the number is bigger than 0, drop these columns.

In [5]:
df.nunique(dropna=False)

age                           94
sex                            2
on_thyroxine                   2
query_on_thyroxine             2
on_antithyroid_medication      2
sick                           2
pregnant                       2
thyroid_surgery                2
I131_treatment                 2
query_hypothyroid              2
query_hyperthyroid             2
lithium                        2
goitre                         2
tumor                          2
hypopituitary                  2
psych                          2
TSH_measured                   2
TSH                          288
T3_measured                    2
T3                            70
TT4_measured                   2
TT4                          242
T4U_measured                   2
T4U                          147
FTI_measured                   2
FTI                          235
TBG_measured                   1
TBG                            1
referral_source                5
Class                          2
dtype: int

In [6]:
df.drop(columns=['TBG', 'TBG_measured'], inplace=True)

## 4

**q4:** Calculate the number of binary columns (only two unique values) with `'object'` data types. Transform them with `LabelEncoder` so that their values become numbers.

In [7]:
nun = df.nunique(dropna=False)

In [8]:
cols_le = list(set(nun[nun == 2].index).intersection(set(df.select_dtypes('object').columns)))

In [9]:
len(cols_le)

20

In [10]:
for col in cols_le:
    df[col] = LabelEncoder().fit_transform(df[col])
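
As a rough illustration of what the loop above does to a single column, the following pure-Python sketch maps each distinct value to an integer code in sorted order, which is how `LabelEncoder` assigns its codes. The helper `label_encode` and the column values are made up for this example and are not part of the assignment.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    code = {c: i for i, c in enumerate(classes)}
    return [code[v] for v in values], classes

# Hypothetical binary column with values 't' and 'f'.
encoded, classes = label_encode(['t', 'f', 'f', 't', 'f'])
print(classes)
print(encoded)
```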

## 5

**q5:** How many categorical columns with `'object'` data types are remaining in the data? Encode them with One-Hot encoding (with `pandas.get_dummies()`) the same way as in the programming assignment in week 1.

In [11]:
df.select_dtypes('object').columns

Index(['referral_source'], dtype='object')

In [12]:
df = pd.concat((df, pd.get_dummies(df['referral_source'],  prefix='referral_source', prefix_sep='-')), axis=1)

In [13]:
df.drop(columns='referral_source', inplace=True)

## 6

We have encoded the categorical features, but we still have missing values. Fill them with the number -999.

**q6:** What is the mean value of the `'T3'` column now? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Don't worry if the mean changes significantly after filling the missing values. The value -999 acts as a special "missing" category: it should lie well outside the range of the real measurements, so a tree-based model can separate it from them with a single threshold split and is barely affected by its magnitude.

In [14]:
df.fillna(-999, inplace=True)
round(df['T3'].mean(), 5)

-202.06375
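
To see why the mean moves so far, here is a minimal sketch with made-up values (they are not taken from `data_4.csv`); `None` stands in for NaN.

```python
# Hypothetical measurements; None marks a missing value.
values = [2.0, None, 1.5, None, 2.5]

observed = [v for v in values if v is not None]
filled = [-999 if v is None else v for v in values]

mean_observed = sum(observed) / len(observed)
mean_filled = sum(filled) / len(filled)

print(mean_observed)  # mean over the three observed values only
print(mean_filled)    # mean after the sentinel replaces two of five values
```

The sentinel dominates the average even when only a minority of entries are missing, which is why the filled mean is no longer meaningful as a summary of the measurements.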

## 7

Finally, we have preprocessed the data. Next, we separate the target from the dataframe with features (`df` -> `X`, `y`).

Split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn`. Test size should be 0.25 of the whole data. Use `random_state=13`, so that your results are reproducible and similar to the original ones.

**q7:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the proportion of patients in the test set who have a thyroid disease. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [15]:
X = df.drop('Class', axis=1)
y = df['Class']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=13)

In [17]:
round(abs(y_train.mean() - y_test.mean()), 5)

0.00813

## 8

Now split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn` with the same parameters as in task 7, but add the `stratify=y` parameter for stratification. This may help make the positive class proportions in train and test more similar.

**q8:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the proportion of patients in the test set who have a thyroid disease. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Is it bigger or smaller than the corresponding number in the previous task?

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=13)

In [19]:
round(abs(y_train.mean() - y_test.mean()), 5)

0.00035
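
The idea behind stratification can be sketched in plain Python: split each class separately in the same ratio, then merge the parts. This is a simplified illustration with a hypothetical helper `stratified_split` and made-up labels, not the actual implementation inside `train_test_split`.

```python
import random

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 20 positives, 180 negatives.
labels = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=13)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate, abs(train_rate - test_rate))
```

Because each class contributes the same fraction of its samples to the test part, the positive rate is the same on both sides up to rounding of the per-class counts.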

## 9

Let's move on to modeling. First, we write two functions that evaluate the quality of a machine learning model's predictions on the test set using different metrics.

In this and all the following tasks, use the same train and test sets that you obtained in task 8 (with stratification).

Train a Random Forest classifier from `sklearn` with 50 estimators and `random_state=13`, and leave all other parameters at their default values. Fit it on the train set and obtain predictions for the test set. Run the function that computes the scores.

**q9:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [20]:
def compute_metrics(y_test, y_pred):
    print('Accuracy: {:.5f}'.format(accuracy_score(y_test, y_pred)))
    print('F-score: {:.5f}'.format(f1_score(y_test, y_pred)))
    print('Precision: {:.5f}'.format(precision_score(y_test, y_pred)))
    print('Recall: {:.5f}'.format(recall_score(y_test, y_pred)))
    print('Accuracy (balanced): {:.5f}'.format(balanced_accuracy_score(y_test, y_pred)))
    print('MCC: {:.5f}'.format(matthews_corrcoef(y_test, y_pred)))

def compute_confusion_matrix(y_test, y_pred):
    compute_metrics(y_test, y_pred)
    return pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1, 0]),
        columns=['a(x) = 1', 'a(x) = 0'],
        index=['y = 1', 'y = 0'],
    ).T

In [21]:
# your code here
rf = RandomForestClassifier(n_estimators=50, random_state=13)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98091
F-score: 0.82692
Precision: 0.93478
Recall: 0.74138
Accuracy (balanced): 0.86899
MCC: 0.82312


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 43    | 3     |
| a(x) = 0 | 15    | 882   |
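
As a sanity check, balanced accuracy and MCC can be recomputed by hand from the four confusion-matrix counts reported above. The sketch below uses the standard binary definitions: balanced accuracy is the mean of recall and specificity, and MCC is the correlation between true and predicted labels.

```python
import math

# Counts taken from the confusion matrix above.
tp, fp, fn, tn = 43, 3, 15, 882

recall = tp / (tp + fn)              # share of sick patients that were found
specificity = tn / (tn + fp)         # share of healthy patients left alone
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(round(recall, 5), round(balanced_accuracy, 5), round(mcc, 5))
```

Plain accuracy is dominated by the large negative class, while balanced accuracy gives both classes equal weight, so it is the more informative number for this dataset.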


## 10

In this task, perform the same procedure as in task 9, but with the parameter `class_weight='balanced'` in the Random Forest classifier. 

**q10:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - did setting class weights improve the quality of the model?

In [22]:
# your code here
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.86275
Precision: 1.00000
Recall: 0.75862
Accuracy (balanced): 0.87931
MCC: 0.86418


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 44    | 0     |
| a(x) = 0 | 14    | 885   |
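
For reference, the scikit-learn documentation describes the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to made-up class counts; the helper name `balanced_class_weights` is hypothetical, and the counts are not those of the assignment's train set.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical labels: 100 positives, 900 negatives.
labels = [1] * 100 + [0] * 900
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives a proportionally larger weight, so the total weight of each class ends up equal.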


## 11

Let's try to balance the train set with different approaches. We will use the `imbalanced-learn` library (documentation: https://imbalanced-learn.org/stable/).

In this and all the following tasks, use the same Random Forest classifier settings as in task 10 (with `class_weight='balanced'`).

Let's start with random undersampling (`RandomUnderSampler`). Apply it with the default parameter values and `random_state=13` to the initial train data (from task 8) to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q11:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the random undersampling method perform?

In [41]:
rs = RandomUnderSampler(random_state=13)
X_res, y_res = rs.fit_resample(X_train, y_train)

In [42]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.96182
F-score: 0.75342
Precision: 0.62500
Recall: 0.94828
Accuracy (balanced): 0.95549
MCC: 0.75244


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 55    | 33    |
| a(x) = 0 | 3     | 852   |
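
Conceptually, random undersampling with default settings keeps every minority sample and draws an equally sized random subset of the majority class without replacement. The following is a minimal pure-Python sketch of that idea with a hypothetical helper `random_undersample` and made-up labels; it is not the `imbalanced-learn` implementation.

```python
import random

def random_undersample(labels, seed):
    """Return indices of a subset in which every class has the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sampling n_min items from the minority class keeps all of them.
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)

# Hypothetical labels: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
kept = random_undersample(labels, seed=13)
n_pos = sum(labels[i] for i in kept)
print(len(kept), n_pos)
```

Most majority samples are thrown away, which helps explain the pattern in the scores above: recall rises sharply while precision drops.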


## 12

Take the second version of `NearMiss` (`version=2`). Apply it with `sampling_strategy=0.2`, `n_neighbors=3` and other default parameter values to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q12:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the NearMiss-2 method perform?

In [44]:
nm = NearMiss(version=2, sampling_strategy=0.2, n_neighbors=3)
X_res, y_res = nm.fit_resample(X_train, y_train)

In [45]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.81124
F-score: 0.35971
Precision: 0.22727
Recall: 0.86207
Accuracy (balanced): 0.83499
MCC: 0.38060


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 50    | 170   |
| a(x) = 0 | 8     | 715   |


## 13

Take the Tomek's links method (`TomekLinks`) with the default parameter values and apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q13:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the Tomek's links method perform? What was the best undersampling approach?

In [47]:
tm = TomekLinks()
X_res, y_res = tm.fit_resample(X_train, y_train)

In [48]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.86792
Precision: 0.95833
Recall: 0.79310
Accuracy (balanced): 0.89542
MCC: 0.86446


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 46    | 2     |
| a(x) = 0 | 12    | 883   |
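
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force on a few made-up 2D points and then drops the majority-class member of each link, which is the usual undersampling variant. The function names and the data are hypothetical, and this is only an illustration of the definition, not the library's code.

```python
import math

def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    candidates = (j for j in range(len(points)) if j != i)
    return min(candidates, key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j) of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(points, i)
        if i < j and labels[i] != labels[j] and nearest_neighbor(points, j) == i:
            links.append((i, j))
    return links

# Hypothetical data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
labels = [0, 1, 0, 0, 0]

links = tomek_links(points, labels)
to_drop = {i for pair in links for i in pair if labels[i] == 0}
kept = [i for i in range(len(points)) if i not in to_drop]
print(links, kept)
```

Only majority samples that sit right next to a minority sample are removed, so the method changes the class balance very little. That is consistent with the scores above staying close to those of task 10.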


## 14

Now let's move on to oversampling. Take the random oversampling approach (`RandomOverSampler`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q14:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the random oversampling method perform?

In [50]:
ros = RandomOverSampler(sampling_strategy=0.8, random_state=13)
X_res, y_res = ros.fit_resample(X_train, y_train)

In [51]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87500
Precision: 0.90741
Recall: 0.84483
Accuracy (balanced): 0.91959
MCC: 0.86774


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 49    | 5     |
| a(x) = 0 | 9     | 880   |
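
For oversampling, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The sketch below shows what that means for random oversampling: minority indices are duplicated at random (with replacement) until the ratio is reached. The helper `random_oversample` and the labels are made up for illustration.

```python
import random

def random_oversample(labels, minority, ratio, seed):
    """Return indices with minority samples repeated up to ratio * n_majority."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(int(n_majority * ratio) - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return list(range(len(labels))) + extra

# Hypothetical labels: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
resampled = random_oversample(labels, minority=1, ratio=0.8, seed=13)

n_pos = sum(labels[i] for i in resampled)
n_neg = len(resampled) - n_pos
print(n_pos, n_neg, n_pos / n_neg)
```

No majority information is lost, but the added rows are exact copies, so the model sees no new minority examples.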


## 15

Take SMOTE (`SMOTE`) with `sampling_strategy=0.8`, `k_neighbors=5`, `random_state=13` and other default parameter values. Apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q15:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the SMOTE method perform? Was it better than random oversampling?

In [52]:
smote = SMOTE(sampling_strategy=0.8, k_neighbors=5, random_state=13)
X_res, y_res = smote.fit_resample(X_train, y_train)

In [53]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98621
F-score: 0.88288
Precision: 0.92453
Recall: 0.84483
Accuracy (balanced): 0.92015
MCC: 0.87658


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 49    | 4     |
| a(x) = 0 | 9     | 881   |
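
Unlike random oversampling, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that single step, with a hypothetical helper `smote_point` and made-up 2D points. It leaves out everything else the library does (for example, deciding how many points to generate).

```python
import math
import random

def smote_point(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its
    k nearest minority neighbours."""
    i = rng.randrange(len(minority))
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: math.dist(minority[i], minority[j]))
    j = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))

rng = random.Random(13)
# Hypothetical minority-class points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = [smote_point(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region already occupied by the minority class.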


## 16

Take ADASYN (`ADASYN`) with `sampling_strategy=0.8`, `n_neighbors=5`, `random_state=13` and other default parameter values. Apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q16:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the ADASYN method perform? Was it better than SMOTE?

In [56]:
ads = ADASYN(sampling_strategy=0.8, n_neighbors=5, random_state=13)
X_res, y_res = ads.fit_resample(X_train, y_train)

In [57]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98621
F-score: 0.88288
Precision: 0.92453
Recall: 0.84483
Accuracy (balanced): 0.92015
MCC: 0.87658


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 49    | 4     |
| a(x) = 0 | 9     | 881   |


## 17

Take the first version of borderline SMOTE (`BorderlineSMOTE`, `kind='borderline-1'`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q17:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did the BorderlineSMOTE-1 method perform? Was it better than SMOTE and ADASYN? What was the best oversampling approach?

In [68]:
bds = BorderlineSMOTE(kind='borderline-1', sampling_strategy=0.8, random_state=13)
X_res, y_res = bds.fit_resample(X_train, y_train)

In [70]:
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87500
Precision: 0.90741
Recall: 0.84483
Accuracy (balanced): 0.91959
MCC: 0.86774


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 49    | 5     |
| a(x) = 0 | 9     | 880   |


## 18

Finally, check the performance of a combination of oversampling and undersampling. Take SMOTE + Tomek's links (`SMOTETomek`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Apply it to the initial train data to obtain a resampled train set. Train a Random Forest classifier on the resampled data and obtain the predictions on the test data.

**q18:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

What do you think - which approach was the best for dealing with our data?

In [60]:
# your code here
stm = SMOTETomek(sampling_strategy=0.8, random_state=13)
X_res, y_res = stm.fit_resample(X_train, y_train)
rf = RandomForestClassifier(n_estimators=50, random_state=13, class_weight='balanced')

rf.fit(X_res, y_res)
y_pred = rf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87273
Precision: 0.92308
Recall: 0.82759
Accuracy (balanced): 0.91153
MCC: 0.86632


|          | y = 1 | y = 0 |
|----------|-------|-------|
| a(x) = 1 | 48    | 4     |
| a(x) = 0 | 10    | 881   |
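
To help answer the final question, the balanced accuracy values reported in tasks 9 to 18 can be collected and ranked. The numbers below are copied from the outputs above; the short labels are just names chosen for this summary.

```python
# Balanced accuracy on the test set, as reported in tasks 9-18.
balanced_accuracy = {
    'No resampling, default weights': 0.86899,
    "class_weight='balanced' only": 0.87931,
    'RandomUnderSampler': 0.95549,
    'NearMiss-2': 0.83499,
    'TomekLinks': 0.89542,
    'RandomOverSampler': 0.91959,
    'SMOTE': 0.92015,
    'ADASYN': 0.92015,
    'BorderlineSMOTE-1': 0.91959,
    'SMOTETomek': 0.91153,
}

ranking = sorted(balanced_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f'{score:.5f}  {name}')
```

Balanced accuracy alone does not settle the question. `RandomUnderSampler` ranks first on it, but its precision (0.62500) is far below that of the oversampling methods (above 0.9), so the choice depends on how costly false positives are compared with missed cases.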
