# Combination Sampling

Implement the SMOTEENN technique with the credit card default data. Then estimate a logistic regression model and report the classification evaluation metrics.

The columns are coded as follows: `ln_balance_limit` is the log of the maximum balance allowed on the card; `sex` is 1 for female and 0 for male; `education` is 1 = graduate school, 2 = university, 3 = high school, 4 = others; `marriage` is 1 for married and 0 for single; `default_next_month` indicates whether the person defaults in the following month (1 = yes, 0 = no).

In [10]:
import pandas as pd
from pathlib import Path
from collections import Counter
import matplotlib.pyplot as plt

In [11]:
data = Path('../Resources/cc_default.csv')
df = pd.read_csv(data)

In [20]:
# Preview the data; head(-5) returns every row except the last five
df.head(-5)

Unnamed: 0,ID,ln_balance_limit,sex,education,marriage,age,default_next_month
0,1,9.903488,1,2,0,24,1
1,2,11.695247,1,2,1,26,1
2,3,11.407565,1,2,1,34,0
3,4,10.819778,1,2,0,37,0
4,5,10.819778,0,2,0,57,0
...,...,...,...,...,...,...,...
29990,29991,11.849398,0,2,0,41,0
29991,29992,12.254863,0,2,0,34,1
29992,29993,9.210340,0,3,0,43,0
29993,29994,11.512925,0,1,1,38,0


In [32]:
X = df.copy()
X.drop("default_next_month", axis=1, inplace=True)
y = df["default_next_month"]
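The feature matrix above keeps every column except the target. Going by the preview, that includes the row identifier `ID`. An identifier carries no information about default risk, so one option is to drop it before modelling. The following is a minimal sketch on a tiny made-up frame: the values are hypothetical and only the column names come from the preview. Applying the same idea to the real data would change the results reported further down.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the column layout shown in the preview.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "ln_balance_limit": [9.90, 11.70, 11.41],
    "sex": [1, 1, 1],
    "education": [2, 2, 2],
    "marriage": [0, 1, 1],
    "age": [24, 26, 34],
    "default_next_month": [1, 1, 0],
})

# Drop the target and the identifier; errors="ignore" skips names that are absent.
X_toy = toy.drop(columns=["default_next_month", "ID"], errors="ignore")
y_toy = toy["default_next_month"]

feature_names = list(X_toy.columns)
print(feature_names)
```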

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [15]:
Counter(y_train)

Counter({0: 17532, 1: 4968})

In [16]:
# Resample the training data with SMOTEENN (SMOTE oversampling followed by Edited Nearest Neighbours cleaning)
from imblearn.combine import SMOTEENN

sm = SMOTEENN(random_state=1)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
Counter(y_resampled)


Counter({0: 5524, 1: 8030})

In [26]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=1)

In [27]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[1730, 4102],
       [ 528, 1140]], dtype=int64)
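To read the matrix: scikit-learn puts actual classes in rows and predicted classes in columns, so the first row is the non-defaulters and the second row is the defaulters. The per-class rates can be recomputed by hand as a sanity check. This is a small sketch using only the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 1730, 4102   # actual class 0: predicted 0, predicted 1
fn, tp = 528, 1140    # actual class 1: predicted 0, predicted 1

recall_1 = tp / (tp + fn)        # share of actual defaulters that were flagged
precision_1 = tp / (tp + fp)     # share of flagged cases that really default
recall_0 = tn / (tn + fp)        # share of non-defaulters correctly cleared
precision_0 = tn / (tn + fn)     # share of cleared cases that really do not default
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The model catches about two thirds of defaulters, but it also mislabels most non-defaulters as defaults, which is why overall accuracy is low.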

In [28]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.77      0.30      0.68      0.43      0.45      0.19      5832
          1       0.22      0.68      0.30      0.33      0.45      0.21      1668

avg / total       0.64      0.38      0.60      0.41      0.45      0.20      7500
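The `avg / total` row and the `geo` column can be reproduced from the same counts. The sketch below assumes the summary row is a support-weighted average of the per-class values and that `geo` is the square root of sensitivity times specificity; both assumptions are consistent with the numbers printed above, but check the imbalanced-learn documentation for the exact definitions.

```python
import math

# Per-class support, recall, and precision derived from the confusion matrix counts.
support = {0: 5832, 1: 1668}
recall = {0: 1730 / 5832, 1: 1140 / 1668}
precision = {0: 1730 / (1730 + 528), 1: 1140 / (1140 + 4102)}

total = sum(support.values())

def weighted_avg(per_class):
    """Average the per-class values, weighting each class by its support."""
    return sum(per_class[c] * support[c] for c in support) / total

avg_precision = weighted_avg(precision)
avg_recall = weighted_avg(recall)

# In the binary case, each class's specificity equals the other class's recall.
geo_mean = math.sqrt(recall[0] * recall[1])

print(f"avg precision={avg_precision:.2f}, avg recall={avg_recall:.2f}, geo={geo_mean:.2f}")
```

Because class 0 makes up most of the test set, the weighted averages sit close to the class 0 values, so the per-class rows are the more informative ones for judging how well defaults are detected.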

