# Undersampling

Background: Undersampling is the opposite of oversampling. Instead of increasing the number of minority-class samples (oversampling), it decreases the size of the majority class.

#### Two techniques of undersampling in this exercise: random and cluster centroid.

Exercise:

1) Implement the cluster centroids and random undersampling techniques with the credit card default data. Then estimate a logistic regression model and report the classification evaluation metrics from both sampling methods.

##### Drawbacks: undersampling (downsampling) discards data, so it is not a viable option when the dataset is small.

In [1]:
# Import dependencies.
import pandas as pd
from path import Path
from collections import Counter

In [2]:
# Load dataset.
data = Path('../Resources/cc_default.csv')
df = pd.read_csv(data)
df.head()

Unnamed: 0,ID,ln_balance_limit,sex,education,marriage,age,default_next_month
0,1,9.903488,1,2,0,24,1
1,2,11.695247,1,2,1,26,1
2,3,11.407565,1,2,1,34,0
3,4,10.819778,1,2,0,37,0
4,5,10.819778,0,2,0,57,0


2) Note: `ln_balance_limit` is the log of the maximum balance allowed on the card; `sex` is 1 for female and 0 for male; `education` is coded 1 = graduate school, 2 = university, 3 = high school, 4 = others; `marriage` is 1 for married and 0 for single; `default_next_month` indicates whether the person defaults in the following month (1 = yes, 0 = no).

In [3]:
# Define features (X) and target (y).
x_cols = [i for i in df.columns if i not in ('ID', 'default_next_month')]
X = df[x_cols]
y = df['default_next_month']

In [30]:
# Show shape of X and y
X.shape

(30000, 5)

Note:
x_cols = [i for i in df.columns if i not in ('ID', 'default_next_month')]
The list comprehension keeps every column except ID and default_next_month, which has the same effect as removing those two columns with .drop.
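To make the equivalence concrete, here is a minimal sketch on a made-up four-column frame (the `toy` DataFrame and its values are hypothetical, not taken from `cc_default.csv`). It selects the feature columns both ways and checks that the results match.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few of the dataset's columns.
toy = pd.DataFrame({
    "ID": [1, 2],
    "age": [24, 26],
    "sex": [1, 1],
    "default_next_month": [1, 0],
})

# Approach 1: keep every column that is not in the exclusion tuple.
x_cols = [c for c in toy.columns if c not in ("ID", "default_next_month")]
X_listcomp = toy[x_cols]

# Approach 2: drop the unwanted columns directly.
X_drop = toy.drop(columns=["ID", "default_next_month"])

# Both approaches should give the same columns in the same order.
same = X_listcomp.equals(X_drop)
print(list(X_listcomp.columns), same)
```

The list comprehension silently ignores an excluded name that is missing from the frame, whereas `.drop` raises an error by default, which may matter when column names change.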

In [4]:
# Normal train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Random Undersampling

In [5]:
# Undersample the data using `RandomUnderSampler`
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
Counter(y_resampled)

Counter({0: 4968, 1: 4968})

Note: size of majority class was decreased to equal minority class.
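As a rough illustration of the idea behind random undersampling, the sketch below uses only the standard library on made-up labels (`y_toy` is hypothetical). It is not how `RandomUnderSampler` is implemented internally, only the same principle: keep all minority rows and a random subset of majority rows of equal size.

```python
import random
from collections import Counter

# Toy labels: 8 majority-class (0) rows and 3 minority-class (1) rows.
y_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

rng = random.Random(1)  # fixed seed so the draw is repeatable

majority_idx = [i for i, label in enumerate(y_toy) if label == 0]
minority_idx = [i for i, label in enumerate(y_toy) if label == 1]

# Draw, without replacement, as many majority rows as there are minority rows.
kept_majority = rng.sample(majority_idx, k=len(minority_idx))

# Keep the selected majority rows plus every minority row.
kept_idx = sorted(kept_majority + minority_idx)
y_toy_resampled = [y_toy[i] for i in kept_idx]

print(Counter(y_toy_resampled))
```

Five of the eight majority rows are discarded here, which is the data loss that makes undersampling unattractive for small datasets.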

In [6]:
# Fit a Logistic regression model using random undersampled data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

In [7]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[3732, 2100],
       [ 740,  928]])

In [8]:
# Calculate the Balanced Accuracy Score
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred)

0.5981363057701987
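For reference, balanced accuracy is the mean of the per-class recalls, so it can be recomputed by hand from the confusion matrix shown above. A minimal sketch (variable names are illustrative):

```python
# Confusion matrix from the random undersampling model:
# rows are actual classes, columns are predicted classes.
cm = [[3732, 2100],
      [740, 928]]

# Recall for each class = correct predictions / actual members of that class.
recall_class_0 = cm[0][0] / (cm[0][0] + cm[0][1])
recall_class_1 = cm[1][1] / (cm[1][0] + cm[1][1])

# Balanced accuracy is the unweighted mean of the two recalls.
balanced_acc = (recall_class_0 + recall_class_1) / 2
print(round(balanced_acc, 4))
```

Because each class contributes equally regardless of its size, this metric is not inflated by the large number of non-defaulters in the test set.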

In [9]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.64      0.56      0.72      0.60      0.36      5832
          1       0.31      0.56      0.64      0.40      0.60      0.35      1668

avg / total       0.72      0.62      0.57      0.65      0.60      0.36      7500
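The report's column names are abbreviated. Assuming the usual definitions (pre = precision, rec = recall, spe = specificity, geo = geometric mean of recall and specificity), the row for class 1 can be reproduced from the confusion matrix of the random undersampling model. This is a sketch under that assumption, not a restatement of the library's internals.

```python
from math import sqrt

# Confusion-matrix cells, treating class 1 (default) as the positive class.
tn, fp = 3732, 2100
fn, tp = 740, 928

precision = tp / (tp + fp)      # "pre": share of predicted defaults that were real
recall = tp / (tp + fn)         # "rec": share of real defaults that were caught
specificity = tn / (tn + fp)    # "spe": share of non-defaults correctly cleared
f1 = 2 * precision * recall / (precision + recall)
geo = sqrt(recall * specificity)  # "geo": geometric mean of rec and spe

for name, value in [("pre", precision), ("rec", recall), ("spe", specificity),
                    ("f1", f1), ("geo", geo)]:
    print(f"{name}: {value:.2f}")
```

The low precision for class 1 reflects the 2,100 false positives: most applicants flagged as defaulters did not actually default.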



Summary: random undersampling produces a balanced accuracy of only 0.598 (59.8%); try cluster centroid undersampling to see whether the metrics improve.

## Cluster Centroid Undersampling

Note: cluster centroid undersampling resembles SMOTE in that it generates synthetic data points. The algorithm identifies clusters of the majority class, then generates synthetic data points, called centroids, that are representative of those clusters. The majority class is thereby undersampled down to the size of the minority class.
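The idea can be sketched with a small hand-written k-means on made-up 2-D points (the `majority` and `minority` lists and the `kmeans_centroids` helper are hypothetical). This assumes the number of clusters equals the minority-class size; it is a simplified illustration of the principle, not the `ClusterCentroids` implementation.

```python
import random
from collections import Counter


def kmeans_centroids(points, k, n_iter=10, seed=1):
    """Plain Lloyd's k-means; returns k centroids as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids


# Nine majority-class points and three minority-class points.
majority = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
            (9.9, 0.1), (10.0, 0.3), (10.2, 0.0)]
minority = [(3.0, 3.0), (6.0, 1.0), (7.0, 7.0)]

# Replace the majority class with as many centroids as there are minority points.
centroids = kmeans_centroids(majority, k=len(minority))

X_toy = centroids + minority
y_toy = [0] * len(centroids) + [1] * len(minority)
print(Counter(y_toy))
```

Unlike random undersampling, the majority rows that remain are averages rather than original observations, so the model trains on points that may not correspond to any real customer.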

In [10]:
# Fit the data using `ClusterCentroids` and check the count of each class
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
Counter(y_resampled)

Counter({0: 4968, 1: 4968})

In [11]:
# Logistic regression using cluster centroid undersampled data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', random_state=78)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=78)

In [12]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[2867, 2965],
       [ 616, 1052]])

In [13]:
# Calculate the balanced accuracy score
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred)

0.5611467616030632

In [14]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.49      0.63      0.62      0.56      0.31      5832
          1       0.26      0.63      0.49      0.37      0.56      0.31      1668

avg / total       0.70      0.52      0.60      0.56      0.56      0.31      7500



Summary: the results are worse than with random undersampling (balanced accuracy of 56.1% vs. 59.8%). Conclusion: resampling can attempt to address class imbalance, but it does not guarantee better results.

# Skill Drill: Oversampling Comparison

### Pre-processing

In [15]:
# Import dependencies.
import pandas as pd
from path import Path
from collections import Counter

In [16]:
# Load dataset.
data = Path('../Resources/cc_default.csv')
df = pd.read_csv(data)
df.head()

Unnamed: 0,ID,ln_balance_limit,sex,education,marriage,age,default_next_month
0,1,9.903488,1,2,0,24,1
1,2,11.695247,1,2,1,26,1
2,3,11.407565,1,2,1,34,0
3,4,10.819778,1,2,0,37,0
4,5,10.819778,0,2,0,57,0


In [17]:
# Normal train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Counter(y_train)

Counter({0: 17532, 1: 4968})

### 1. Fit and resample (random oversampling)

In [18]:
# implement random oversampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

Counter(y_resampled)

Counter({0: 17532, 1: 17532})

### Logistic regression

In [19]:
# Logistic regression using random oversampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

### Evaluate Metrics

In [20]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[3744, 2088],
       [ 745,  923]])

In [21]:
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_test, y_pred)

0.5976663113953282

In [22]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.64      0.55      0.73      0.60      0.36      5832
          1       0.31      0.55      0.64      0.39      0.60      0.35      1668

avg / total       0.72      0.62      0.57      0.65      0.60      0.36      7500



Analysis: Balanced accuracy is ~59.77% for random oversampling, an improvement over cluster centroid undersampling (56.11%) but marginally below random undersampling (59.81%).

### 2. Fit and resample (SMOTE)

In [23]:
# implement SMOTE oversampling
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE(random_state=1, sampling_strategy='auto').fit_resample(
    X_train, y_train
)
Counter(y_resampled)

Counter({0: 17532, 1: 17532})

### Implement logistic regression model

In [24]:
# Logistic regression using SMOTE oversampled data
model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

### Evaluate metrics

In [25]:
# Calculate the balanced accuracy score
y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

0.5975337014339146

In [26]:
# Display the confusion matrix
confusion_matrix(y_test, y_pred)

array([[3697, 2135],
       [ 732,  936]])

In [27]:
# Print the imbalanced classification report
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.63      0.56      0.72      0.60      0.36      5832
          1       0.30      0.56      0.63      0.40      0.60      0.35      1668

avg / total       0.72      0.62      0.58      0.65      0.60      0.36      7500



Summary: SMOTE oversampling yields a balanced accuracy of 59.75%, essentially the same as random oversampling (59.77%). This shows (as seen in the previous example of SMOTE vs. random oversampling) that SMOTE does not always outperform random oversampling. Final conclusion: for this dataset, random undersampling, random oversampling, and SMOTE all perform about the same (59.75% to 59.81%), and all three beat cluster centroid undersampling (56.11%). Random undersampling has the highest balanced accuracy at 59.81%, making it nominally the best resampling technique here, although its margin over the oversampling methods is only a few hundredths of a percentage point.
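To compare the four runs side by side, the balanced accuracy scores reported in the cells above can be collected and ranked. A small sketch (the dictionary keys are just labels chosen for readability):

```python
# Balanced accuracy scores copied from the notebook outputs.
scores = {
    "random undersampling": 0.5981363057701987,
    "cluster centroid undersampling": 0.5611467616030632,
    "random oversampling": 0.5976663113953282,
    "SMOTE": 0.5975337014339146,
}

# Rank from best to worst.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:32s} {score:.2%}")

# Gap between the best technique and the runner-up.
best_name, best_score = ranked[0]
gap_to_runner_up = best_score - ranked[1][1]
print(f"best: {best_name}, lead over runner-up: {gap_to_runner_up:.4%}")
```

The top three techniques sit within about 0.06 percentage points of one another, so the ranking among them could plausibly change with a different train-test split.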