<a href="https://colab.research.google.com/github/sathyadithyarithi/ITI103_myClasswork/blob/main/imbalanced_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dealing with Imbalanced Data Set

Welcome to the hands-on lab. This is part of the series of exercises to help you acquire skills in different techniques to fine-tune your model.

In this lab, you will learn:
- how to use over-sampling correctly on an imbalanced data set
- how to perform resampling correctly with K-fold cross-validation



In this exercise, we will use an imbalanced data set from Lending Club that consists of data for both 'bad' and 'good' loans to illustrate how we can apply oversampling and undersampling techniques to improve our model performance. You will also learn to apply resampling correctly when using cross-validation.

## Import the libraries

In [40]:
import pandas as pd
import numpy as np
import urllib.request
import shutil
import zipfile

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier
)

from sklearn.model_selection import (
    train_test_split,
    RepeatedStratifiedKFold,
    cross_validate
)

from sklearn.metrics import (
    classification_report,
    roc_curve,
    roc_auc_score,
    auc,
    precision_recall_curve,
    RocCurveDisplay
)

from imblearn.pipeline import Pipeline

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

from imblearn.ensemble import (
    RUSBoostClassifier,
    EasyEnsembleClassifier
)

%matplotlib inline

## Get the data

In [41]:
url = 'https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/lending_club-data.csv.zip'
zip_file = "lending_club-data.csv.zip"

# download the zip file and save it locally as 'lending_club-data.csv.zip'
with urllib.request.urlopen(url) as response, open(zip_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# unzip the file to a folder 'data'
with zipfile.ZipFile(zip_file,"r") as zip_ref:
    zip_ref.extractall('data')

## Understand the data

Here we find out some basic information about the data set.

In [42]:
df = pd.read_csv('data/lending-club-data.csv')



Let us just find out about different features and their data types.

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 68 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           122607 non-null  int64  
 1   member_id                    122607 non-null  int64  
 2   loan_amnt                    122607 non-null  int64  
 3   funded_amnt                  122607 non-null  int64  
 4   funded_amnt_inv              122607 non-null  int64  
 5   term                         122607 non-null  object 
 6   int_rate                     122607 non-null  float64
 7   installment                  122607 non-null  float64
 8   grade                        122607 non-null  object 
 9   sub_grade                    122607 non-null  object 
 10  emp_title                    115767 non-null  object 
 11  emp_length                   118516 non-null  object 
 12  home_ownership               122607 non-null  object 
 13 

In this exercise, we are trying to predict whether a member will default on their loan. We will use the column `bad_loans` as the label for our classification task. If the value of `bad_loans` is 1, the loan is a default (a bad loan); otherwise, it is 0.

***Exercise:***

Find out how many samples in the data set are bad loans and how many are not.

Hint: `value_counts()` in [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) gives you the count of each unique value.

<p>
<details><summary>Click here for answer</summary>

```python

df.bad_loans.value_counts()

```

</details>

In [44]:
### Complete the code below ###



Is the data set imbalanced? Clearly it is: there are many more good loans than bad loans (roughly four times as many).

## Data Preparation

There are quite a lot of features in this data set, but we are going to use only a few of them for demonstration purposes (we are not really interested in the actual performance of our model).

In [45]:
features = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

In [46]:
df = df[features]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   grade                  122607 non-null  object 
 1   home_ownership         122607 non-null  object 
 2   emp_length_num         122607 non-null  int64  
 3   sub_grade              122607 non-null  object 
 4   short_emp              122607 non-null  int64  
 5   dti                    122607 non-null  float64
 6   term                   122607 non-null  object 
 7   purpose                122607 non-null  object 
 8   int_rate               122607 non-null  float64
 9   last_delinq_none       122607 non-null  int64  
 10  last_major_derog_none  122607 non-null  int64  
 11  revol_util             122607 non-null  float64
 12  total_rec_late_fee     122607 non-null  float64
 13  payment_inc_ratio      122603 non-null  float64
 14  bad_loans              122607 non-nu

Notice that `payment_inc_ratio` has a few null values (122603 non-null entries out of 122607 rows). Since so few rows are affected, we simply remove the rows where `payment_inc_ratio` is null.

In [47]:
loans_df = df.dropna()
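
`df.dropna()` removes every row that has a null in any column, which is fine here because only `payment_inc_ratio` contains nulls. If you would rather state that intent explicitly, the `subset` argument restricts the check to named columns. The sketch below uses a made-up toy frame (the values are illustrative, not taken from the Lending Club data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing payment_inc_ratio value (illustrative values only)
toy = pd.DataFrame({
    "payment_inc_ratio": [4.2, np.nan, 7.5, 1.3],
    "dti": [11.2, 9.8, 20.1, 15.0],
})

# Drop only the rows where payment_inc_ratio is null
cleaned = toy.dropna(subset=["payment_inc_ratio"])

print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```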

We will go ahead and encode our categorical columns.

In [48]:
loans_encoded = pd.get_dummies(loans_df)
loans_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122603 entries, 0 to 122606
Data columns (total 70 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   emp_length_num              122603 non-null  int64  
 1   short_emp                   122603 non-null  int64  
 2   dti                         122603 non-null  float64
 3   int_rate                    122603 non-null  float64
 4   last_delinq_none            122603 non-null  int64  
 5   last_major_derog_none       122603 non-null  int64  
 6   revol_util                  122603 non-null  float64
 7   total_rec_late_fee          122603 non-null  float64
 8   payment_inc_ratio           122603 non-null  float64
 9   bad_loans                   122603 non-null  int64  
 10  grade_A                     122603 non-null  bool   
 11  grade_B                     122603 non-null  bool   
 12  grade_C                     122603 non-null  bool   
 13  grade_D            
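
To see why the column count grows from 15 to 70, it may help to look at what `pd.get_dummies` does on a tiny example: numeric columns are kept as they are, and each text column is replaced by one indicator column per distinct value. The frame below is a made-up toy (the category values are assumptions chosen for illustration):

```python
import pandas as pd

# Toy frame: two categorical (object) columns and one numeric column
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "int_rate": [7.9, 13.5, 6.0],
})

encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by one indicator column per category
print(list(encoded.columns))
```

The indicator columns are named `<column>_<value>`, which matches the `grade_A`, `grade_B`, ... columns in the output above.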

### Split the data set into train and test set

***Exercise:***

First, separate the features and the label.  

Hint: use `df.drop()` and specify `axis=1` to remove a particular column in dataframe.

Then, split the data into train set (called `X_train, y_train`) and test set (`X_test, y_test`). Think about the splitting strategy, e.g. do you need to ensure the distribution of good/bad is the same in both train and test set?

<p>
<details><summary>Click here for answer</summary>
    
```python

X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df,
                                                    test_size = .2,
                                                    stratify = y_df,
                                                    random_state = 42)

```
</details>

In [50]:
X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df,
                                                    test_size = .2,
                                                    stratify = y_df,
                                                    random_state = 42)

In [51]:
print(y_train.value_counts())

bad_loans
0    79562
1    18520
Name: count, dtype: int64
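
The counts printed above give a concrete measure of the imbalance in the training set. A quick calculation, using only those two numbers:

```python
# Class counts copied from the y_train.value_counts() output above
n_good = 79562
n_bad = 18520

imbalance_ratio = n_good / n_bad          # good loans per bad loan
bad_share = n_bad / (n_good + n_bad)      # fraction of bad loans

print(f"good : bad = {imbalance_ratio:.1f} : 1")
print(f"bad loans  = {bad_share:.1%} of the training set")
```

So there are about 4.3 good loans for every bad loan, and bad loans make up roughly 19% of the training set.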


## Train a baseline model

For the sake of comparison, we will first evaluate a baseline model without any resampling.
As we are dealing with an imbalanced data set, accuracy can be misleading, so we will look at the ROC AUC score instead.
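
To illustrate why accuracy is a poor guide on imbalanced data, consider a toy "model" that gives every loan the same score and always predicts 'good'. The labels below are made up for illustration (90 good loans and 10 bad loans):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels: 90 good loans (0) and 10 bad loans (1)
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that scores every loan identically and always predicts 'good'
y_score = np.zeros(len(y_true))
y_pred = np.zeros(len(y_true), dtype=int)

acc = accuracy_score(y_true, y_pred)
auc_value = roc_auc_score(y_true, y_score)

print(f"accuracy = {acc:.2f}, ROC AUC = {auc_value:.2f}")
```

The accuracy looks respectable at 0.90, yet the ROC AUC of 0.50 shows that the scores cannot separate the two classes at all.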

In [52]:
clf = RandomForestClassifier(n_estimators=30, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('ROC_AUC of baseline model = {}'.format(scores['test_roc_auc'].mean()))

ROC_AUC of baseline model = 0.6745330348717996


## Oversampling

Now we will try over-sampling to see if we can improve our model's performance on the 'bad loan' class.

### The ***wrong*** way to oversample ###

With the training data created, we can oversample the minority class (`bad_loans` = 1). In this exercise, we will use SMOTE (from the [imblearn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) library) to create synthetic samples of the minority class.

After upsampling to a class ratio of 1.0 (i.e. a 1-to-1 ratio between the positive and negative classes), you will have a balanced data set. In practice, there is often no need to balance the classes completely.
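
For intuition, the core idea behind SMOTE can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration and not imblearn's implementation; the function name `smote_like_samples` and the toy points are assumptions made for this sketch.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=6)
print(X_new)
```

Every synthetic point lies between two existing minority samples, which is why the new samples carry information about the samples they were built from.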

In [58]:
# sampling_strategy='auto' resamples every class except the majority class,
# so here only the minority class (bad_loans = 1) is oversampled

sm = SMOTE(sampling_strategy='auto',random_state=0)

X_upsample, y_upsample = sm.fit_resample(X_train, y_train)

Now let's see the number of samples we have for each class. You will see that now our train set is totally balanced, with equal number of samples for each class.


In [59]:
y_upsample.value_counts()

bad_loans
0    79562
1    79562
Name: count, dtype: int64

In [60]:
clf = RandomForestClassifier(n_estimators=30, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_upsample, y_upsample, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('Cross-validation ROC_AUC score SMOTE-wrong way = {}'.format(scores['test_roc_auc'].mean()))

Cross-validation ROC_AUC score SMOTE-wrong way = 0.9352371343346733


Our ROC AUC score has improved to about 0.94. Impressive! But is this actually representative of how the model will perform? Let's put our model to the test.

Now let's train the model using the full up-sampled training set and evaluate on test set.

In [61]:
clf = RandomForestClassifier(n_estimators=30, random_state=0)
clf.fit(X_upsample, y_upsample)

y_probas = clf.predict_proba(X_test)[:,1]

roc_auc = roc_auc_score(y_test, y_probas)

print('Test ROC_AUC with SMOTE-wrong way = {}'.format(roc_auc))

Test ROC_AUC with SMOTE-wrong way = 0.6795601525071902


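You will get around 0.68, far below the cross-validation score. That’s disappointing! What has happened?

By oversampling before splitting into training and validation sets, we “leaked” information from the validation set into the training of the model: the synthetic samples in each training fold were generated from points that ended up in the validation fold, and vice versa (refer to your lecture for more details).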
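
To make the leak concrete, here is a small NumPy sketch. It uses plain duplication as a stand-in for SMOTE (an assumption made for simplicity; SMOTE interpolates between neighbours rather than copying rows) and made-up toy labels. Each row is tracked by its original id so that we can see which original rows end up on both sides of the split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 90 majority (0) and 10 minority (1) samples, tracked by row id
y = np.array([0] * 90 + [1] * 10)
row_ids = np.arange(len(y))

# Wrong order: oversample first (here by duplicating minority rows) ...
minority_ids = row_ids[y == 1]
extra = rng.choice(minority_ids, size=80, replace=True)
resampled_ids = np.concatenate([row_ids, extra])

# ... and only then hold out a random 20% as a validation fold
shuffled = rng.permutation(len(resampled_ids))
n_val = len(resampled_ids) // 5
val_ids = resampled_ids[shuffled[:n_val]]
train_ids = resampled_ids[shuffled[n_val:]]

# Original rows that appear on both sides of the split
leaked = np.intersect1d(val_ids, train_ids)
print(f"{len(leaked)} original minority rows appear in both train and validation")
```

The model is therefore validated on minority samples it has effectively already seen during training, which inflates the cross-validation score.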

### The ***right way*** to oversample

So, let us do it the right way and see what happens. This time, we will oversample only the training folds, not the training and validation folds together. Oversampling is done after we set aside the validation fold.

In [62]:
sm = SMOTE(sampling_strategy='auto', random_state=0)
clf = RandomForestClassifier(n_estimators=30, random_state=0)

# declare a pipeline that consists of the oversampler and the classifier
steps = [('ovr', sm), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# the oversampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(pipeline, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('average roc_auc = {}'.format(scores['test_roc_auc'].mean()))

average roc_auc = 0.6678994199969314
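
What the pipeline does inside each fold can be sketched by hand. The example below is an illustration on made-up toy data, with duplication standing in for SMOTE and scikit-learn's `StratifiedKFold` in place of `RepeatedStratifiedKFold`; it is not what imblearn runs internally. The point is the order of operations: split first, then resample the training fold only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Toy data: 80 majority (0) and 20 minority (1) samples
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(len(y), 3))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_summaries = []
for train_idx, val_idx in cv.split(X, y):
    y_fold = y[train_idx]
    minority_idx = train_idx[y_fold == 1]
    n_extra = int((y_fold == 0).sum() - (y_fold == 1).sum())

    # Oversample ONLY the training fold (duplication stands in for SMOTE)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    train_balanced_idx = np.concatenate([train_idx, extra_idx])

    # The validation fold is left untouched, so it keeps the original ratio
    fold_summaries.append({
        "train_counts": np.bincount(y[train_balanced_idx]).tolist(),
        "val_counts": np.bincount(y[val_idx]).tolist(),
        "overlap": int(np.intersect1d(train_balanced_idx, val_idx).size),
    })

for summary in fold_summaries:
    print(summary)
```

Each training fold is balanced, each validation fold keeps the original 4-to-1 ratio, and no validation row is used to build the training fold.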


## Undersampling

Oversampling has not helped much: with the resampling done correctly, the cross-validated ROC AUC (about 0.668) is marginally below that of the baseline model (about 0.675). Let us try undersampling to see if we can get a better model.

**Exercise:**

Complete the code cell below, using `RandomUnderSampler` to resample only the majority class. Cross-validate with `RandomForestClassifier` as before and compare the result with the oversampling approach. What do you observe about the result?

<details><summary>Click here for answer</summary>
<br/>
    
```python

undersampler  = RandomUnderSampler(sampling_strategy='auto', random_state=0)
clf = RandomForestClassifier(n_estimators=30, random_state=0)

# declare a pipeline that consists of the undersampler and the classifier
steps = [('under', undersampler), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# the undersampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('Cross-validation ROC_AUC score Random Undersampling = {}'.format(scores['test_roc_auc'].mean()))
    
```
</details>

In [65]:
undersampler  = RandomUnderSampler(sampling_strategy='auto', random_state=0)
clf = RandomForestClassifier(n_estimators=30, random_state=0)

# declare a pipeline that consists of the undersampler and the classifier
steps = [('under', undersampler), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# the undersampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('Cross-validation ROC_AUC score Random Undersampling = {}'.format(scores['test_roc_auc'].mean()))



Cross-validation ROC_AUC score Random Undersampling = 0.6811834358533816
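
For intuition about what random undersampling does to each training fold, here is a minimal NumPy sketch: every class is cut down to the size of the smallest class by sampling rows without replacement. The helper name `random_undersample` and the toy labels are assumptions made for this illustration; it is not imblearn's implementation.

```python
import numpy as np

def random_undersample(y, seed=0):
    """Return sorted row indices that keep as many rows of each class
    as there are rows in the smallest class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Toy labels: 43 majority (0) and 10 minority (1) samples
y_toy = np.array([0] * 43 + [1] * 10)
kept_idx = random_undersample(y_toy)

print(np.bincount(y_toy[kept_idx]))  # [10 10]
```

The price of balancing this way is that most majority rows are thrown away; on our training set, undersampling to a 1-to-1 ratio would keep 18520 of the 79562 good loans.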


## Boosting

Let us try a boosting algorithm to see if we can achieve a better result.

**Exercise:**

Complete the code cell below, using `GradientBoostingClassifier` with default parameters and `random_state=0`.

<details><summary>Click here for answer</summary>
<br/>
    
```python
clf = GradientBoostingClassifier(random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)
print('Cross-validate ROC_AUC with GradientBoosting = {}'.format(scores['test_roc_auc'].mean()))
```
</details>

In [None]:
clf = GradientBoostingClassifier(random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)
print('Cross-validate ROC_AUC with GradientBoosting = {}'.format(scores['test_roc_auc'].mean()))

Here we can see that, even without any resampling, the boosting algorithm is able to achieve a better result.

In [None]:
### Complete the code below ###
clf = RUSBoostClassifier(n_estimators=30, sampling_strategy='auto', learning_rate=1.0)

