Implement the cluster centroids and random undersampling techniques on the credit card default data. Then fit a logistic regression model on each resampled training set and report the classification evaluation metrics for both sampling methods.
INFO: ln_balance_limit is the log of the maximum balance allowed on the card; sex is 1 for female and 0 for male; education is coded 1 = graduate school, 2 = university, 3 = high school, 4 = others; marriage is 1 for married and 0 for single; default_next_month indicates whether the person defaults in the following month (1 = yes, 0 = no).

In [1]:
import pandas as pd 
from collections import Counter

from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids

from sklearn.linear_model import LogisticRegression

from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import balanced_accuracy_score, confusion_matrix


In [2]:
df = pd.read_csv('Resouces/cc_default.csv')
df.head()


Unnamed: 0  ID  ln_balance_limit  sex  education  marriage  age  default_next_month
         0   1          9.903488    1          2         0   24                   1
         1   2         11.695247    1          2         1   26                   1
         2   3         11.407565    1          2         1   34                   0
         3   4         10.819778    1          2         0   37                   0
         4   5         10.819778    0          2         0   57                   0


In [3]:
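# Set up features and target.
# Note: 'Unnamed: 0' (the row index saved with the CSV) is not excluded below, so it stays in X as a feature.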
x_cols = [i for i in df.columns if i not in ('ID', 'default_next_month')]
X = df[x_cols]
y = df['default_next_month']


In [4]:
# Default test_size=0.25 holds out a quarter of the rows (the 7,500 test rows reported below)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


## Random Undersampling

In [5]:
rus = RandomUnderSampler(random_state = 1)

X_resampled_train, y_resampled_train = rus.fit_resample(X_train, y_train)

# Verify that both classes now have the same number of training rows
Counter(y_resampled_train)

Counter({0: 4968, 1: 4968})
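To make the balanced counts above less of a black box, here is a minimal NumPy sketch of the idea behind random undersampling: keep a random subset of each class, without replacement, sized to the smallest class. The function name `random_undersample` and the toy arrays are illustrative assumptions, not part of imblearn, and the sketch does not reproduce the exact rows that `RandomUnderSampler` selects.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Hypothetical helper: down-sample every class to the smallest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # draw n_keep row positions from this class without replacement
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# toy data (assumed): 8 majority rows (class 0) and 3 minority rows (class 1)
X_toy = np.arange(22).reshape(11, 2)
y_toy = np.array([0] * 8 + [1] * 3)

X_bal, y_bal = random_undersample(X_toy, y_toy, seed=1)
print(np.bincount(y_bal))  # one count per class, both equal to the minority size
```

Only the training split is resampled in the notebook; the test set keeps its original class ratio so the metrics reflect the real imbalance.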

In [6]:
# Fit a logistic regression model on the randomly undersampled training data

model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled_train, y_resampled_train)
y_pred = model.predict(X_test)


In [7]:
# Evaluate on the original test set (rows = actual class, columns = predicted class)

confusion_matrix(y_test, y_pred)

array([[3732, 2100],
       [ 740,  928]], dtype=int64)

In [8]:
balanced_accuracy_score(y_test, y_pred)

0.5981363057701987
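As a cross-check, the balanced accuracy can be recomputed by hand from the confusion matrix printed above: it is the mean of the per-class recalls (the diagonal divided by the row totals). This is a small sketch using only NumPy; the variable names are illustrative.

```python
import numpy as np

# confusion matrix from the random undersampling model (rows = actual, columns = predicted)
cm = np.array([[3732, 2100],
               [ 740,  928]])

# recall of each class = correctly predicted rows / all rows of that class
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# balanced accuracy = unweighted mean of the per-class recalls
balanced_accuracy = recall_per_class.mean()

print(recall_per_class.round(2))
print(balanced_accuracy)
```

The two recalls round to 0.64 and 0.56, which are the `rec` values for classes 0 and 1 in the imbalanced classification report.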

In [9]:

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.64      0.56      0.72      0.60      0.36      5832
          1       0.31      0.56      0.64      0.40      0.60      0.35      1668

avg / total       0.72      0.62      0.57      0.65      0.60      0.36      7500



## Cluster Centroids Undersampling

In [11]:
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
Counter(y_resampled)

Counter({0: 4968, 1: 4968})
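The idea behind cluster centroid undersampling is to cluster the majority class into as many clusters as there are minority rows, then replace the majority rows with the cluster centroids. The sketch below illustrates that idea on toy data. It is an assumption-laden simplification: a plain Lloyd's k-means written in NumPy stands in for the KMeans estimator that imblearn uses, and the helper names `kmeans_centroids` and `cluster_centroid_undersample` are hypothetical.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm (simplified stand-in for KMeans); returns k centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialise centroids with k distinct rows of X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every row to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_centroid_undersample(X, y, majority=0, seed=0):
    """Replace the majority class with one centroid per minority row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min, y_min = X[y != majority], y[y != majority]
    centroids = kmeans_centroids(X[y == majority], k=len(X_min), seed=seed)
    X_out = np.vstack([centroids, X_min])
    y_out = np.concatenate([np.full(len(centroids), majority), y_min])
    return X_out, y_out

# toy data (assumed): 8 majority rows (class 0) and 3 minority rows (class 1)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(11, 2))
y_toy = np.array([0] * 8 + [1] * 3)

X_cc, y_cc = cluster_centroid_undersample(X_toy, y_toy)
print(np.bincount(y_cc))  # both classes end up with 3 rows
```

Unlike random undersampling, the majority rows returned here are synthetic averages rather than original observations, so coded columns such as sex or education can take fractional values. The dense distance matrix also makes this sketch suitable for toy data only.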

In [12]:
# Logistic regression using cluster centroid undersampled data
model = LogisticRegression(solver='lbfgs', random_state=78)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

In [13]:
# Evaluate on the original test set (rows = actual class, columns = predicted class)
confusion_matrix(y_test, y_pred)

array([[2918, 2914],
       [ 631, 1037]], dtype=int64)

In [14]:
balanced_accuracy_score(y_test, y_pred)

0.5610227867089045

In [15]:
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.50      0.62      0.62      0.56      0.31      5832
          1       0.26      0.62      0.50      0.37      0.56      0.31      1668

avg / total       0.70      0.53      0.59      0.57      0.56      0.31      7500
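To put the two sampling methods side by side, the following sketch recomputes the default-class metrics directly from the two confusion matrices reported in this notebook. The helper `summarize` is a hypothetical name; `geo` is taken here as the geometric mean of recall and specificity, which is how the `geo` column of the imbalanced report is understood in this sketch.

```python
import math

def summarize(cm):
    """Hypothetical helper: metrics for the positive (default = 1) class."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)        # share of actual defaulters that were caught
    specificity = tn / (tn + fp)   # share of non-defaulters correctly cleared
    precision = tp / (tp + fp)     # share of predicted defaults that were real
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "geo": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }

# confusion matrices copied from the notebook outputs
results = {
    "random undersampling": summarize([[3732, 2100], [740, 928]]),
    "cluster centroids": summarize([[2918, 2914], [631, 1037]]),
}

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

On this test set, random undersampling gives the higher balanced accuracy (about 0.60 versus 0.56), while cluster centroids catches more defaulters (recall about 0.62 versus 0.56) at the cost of flagging roughly half of the non-defaulters.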

