## Synthetic Minority Over-sampling Technique
Let me use the following example to illustrate how to use SMOTE on imbalanced data and to show that the results after SMOTE are better than those obtained with the original data.

In [2]:
# import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, f1_score, cohen_kappa_score
from sklearn import preprocessing
from sklearn.model_selection import KFold
# read training data and testing data
train_data = pd.read_csv("training_data.csv")
test_data = pd.read_csv("testing_data.csv")

First, we fit a logistic regression on the training dataset and evaluate the trained model on the testing dataset.

In [3]:
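# baseline: logistic regression without any resampling
X = train_data.iloc[:, :-1]  # features: every column except the last
y = train_data.iloc[:, -1]   # label: the last column
# standardize the features before regression
X = preprocessing.scale(X)
# candidate inverse regularization strengths, selected by 10-fold cross-validation
c = [0.0001, 0.01, 1, 100]
logreg = LogisticRegressionCV(penalty='l2', solver='sag', Cs=c, refit=True, cv=10, max_iter=100)
logreg.fit(X, y)
print("The accuracy rate in training dataset is ", logreg.score(X, y))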



The accuracy rate in training dataset is  0.764872521246


In [4]:
# evaluate the fitted model on the testing dataset
X_test = test_data.iloc[:, :-1]
y_test = test_data.iloc[:, -1]
# note: X_test is used as loaded, without the scaling that was applied to X above
y_test_predict = logreg.predict(X_test)
print("The result in testing dataset is:")
print("Accuracy is ", accuracy_score(y_test, y_test_predict), "and recall is", recall_score(y_test, y_test_predict))
# AUC is computed here from the hard class predictions, not from predicted probabilities
print("AUC score is ", roc_auc_score(y_test, y_test_predict))

The result in testing dataset is:
Accuracy is  0.521292217327 and recall is 0.316103379722
AUC score is  0.47877345936


We can see that the accuracy drops dramatically on the testing dataset (from about 0.76 to about 0.52), which suggests that the model is overfitting the training data.  
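
One detail worth checking before attributing the whole drop to overfitting: in the cells above, `X` is standardized with `preprocessing.scale`, while `X_test` is passed to `predict` as loaded. Below is a minimal sketch of keeping the two consistent, using made-up toy arrays (not the loan data) and plain NumPy. The mean and standard deviation are estimated on the training features only and then reused on the test features. All variable names here are illustrative.

```python
import numpy as np

# Toy feature matrices (hypothetical values, for illustration only).
X_train_toy = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 240.0], [4.0, 260.0]])
X_test_toy = np.array([[2.5, 230.0], [5.0, 300.0]])

# Estimate the scaling statistics on the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# Apply the same transformation to both sets.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_test_scaled)
```

In scikit-learn the same idea is usually expressed with `StandardScaler`: call `fit` on the training features, then `transform` on both the training and the testing features.
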
SMOTE (Synthetic Minority Over-sampling Technique) can help us deal with an imbalanced dataset. For example, in our historical data there are 3167 good records ('1-30 DPD', 'Current', 'Paid', 'Matured') and 1805 bad records ('Balance Owed', 'Assigned for Repossession', 'Recovered', '90+ DPD', 'Bankruptcy'). There are several techniques for handling imbalanced datasets in machine learning, and SMOTE is one of them. For more information, see https://github.com/scikit-learn-contrib/imbalanced-learn. In the following example, I will show how much improvement we get after applying SMOTE with logistic regression and random forest.
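
To make the idea concrete, here is a simplified sketch of the interpolation step behind SMOTE: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not imbalanced-learn's implementation. The function `smote_sketch`, its parameters, and the toy data are hypothetical names chosen for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # sample to start from
    pick = rng.integers(0, k, size=n_new)  # neighbour to move toward
    gap = rng.random(n_new)[:, None]       # interpolation factor in [0, 1)
    target = X_min[neighbours[base, pick]]
    return X_min[base] + gap * (target - X_min[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

With `k=2` on this toy square, every synthetic point falls on one of the square's edges, because each corner's two nearest neighbours are the adjacent corners.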

In [5]:
# use SMOTE to generate a synthetic oversampled dataset
from imblearn.over_sampling import SMOTE
# note: training and testing data are combined here and resampled together
X = pd.concat([train_data.iloc[:, :-1], test_data.iloc[:, :-1]])
X = preprocessing.scale(X)
y = pd.concat([train_data.iloc[:, -1], test_data.iloc[:, -1]])
# generate the synthetic dataset
# (older imbalanced-learn releases spelled this SMOTE(kind='regular') and sm.fit_sample)
sm = SMOTE()
X_resampled, y_resampled = sm.fit_resample(X, y)
# the data size before and after SMOTE
print("The size before generating: ", X.shape)
print("The size after generating:", X_resampled.shape)

The size before generating:  (4539, 1050)
The size after generating: (5830, 1050)
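
The two sizes printed above are enough to back out the class counts, assuming SMOTE ran with its default behaviour of oversampling the minority class until both classes have the same number of rows. That is an assumption, since the cell does not print the class counts themselves.

```python
# Sizes taken from the output above.
n_before = 4539
n_after = 5830

# Assumption: after SMOTE both classes have the same number of rows.
n_majority = n_after // 2
n_minority = n_before - n_majority
n_synthetic = n_after - n_before

print(n_majority, n_minority, n_synthetic)  # 2915 1624 1291
print(round(n_majority / n_before, 3))      # 0.642
```

Under that assumption, always predicting the majority class would already reach about 64% accuracy on the original data, which is a useful baseline when reading the accuracy figures in this notebook.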


After generating the resampled dataset with SMOTE, we train our model on part of it and evaluate on a held-out test split.

In [6]:
# reuse logistic regression to train the model and test on a held-out split
from sklearn.model_selection import train_test_split
X_resampled = pd.DataFrame(X_resampled)
# keep y one-dimensional; wrapping it in a DataFrame triggers a column_or_1d warning in fit
y_resampled = np.ravel(y_resampled)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)
c = [0.01, 0.1, 1, 10]
logreg = LogisticRegressionCV(penalty='l2', solver='sag', Cs=c, refit=True, cv=10, max_iter=100)
logreg.fit(X_train, y_train)
y_test_predict = logreg.predict(X_test)
print("The result in testing dataset is:")
print("accuracy is ", accuracy_score(y_test, y_test_predict), "and recall is", recall_score(y_test, y_test_predict))
print("AUC score is ", roc_auc_score(y_test, y_test_predict))


The result in testing dataset is:
accuracy is  0.685534591195 and recall is 0.731934731935
AUC score is  0.686393853061
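
Note that `roc_auc_score` is called here on the hard 0/1 predictions from `predict` rather than on scores. For binary labels this reduces the ROC curve to a single operating point, and the area equals the average of the recall on the two classes. Here is a small pure-Python sketch with made-up labels. The helper `auc_pairwise` is illustrative, not a scikit-learn function.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and probability-like scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

# Recall on the positive class (TPR) and on the negative class (TNR).
tpr = sum(p == 1 for y, p in zip(y_true, y_pred) if y == 1) / y_true.count(1)
tnr = sum(p == 0 for y, p in zip(y_true, y_pred) if y == 0) / y_true.count(0)

print(auc_pairwise(y_true, y_pred))   # 0.75, same as (tpr + tnr) / 2
print((tpr + tnr) / 2)                # 0.75
print(auc_pairwise(y_true, y_score))  # 0.875
```

Applied to the output above, a recall of about 0.732 together with an AUC of about 0.686 implies a recall of roughly 0.641 on the other class. To score the ranking rather than a single threshold, one option is to pass `logreg.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of the predicted labels.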


Now we apply a random forest to the same resampled data and test the trained model on the same held-out split.

In [8]:
# use random forest to train the model and test on testing dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_test_predict = rf.predict(X_test)
print("The result in testing dataset is:")
print("accuracy is ", accuracy_score(y_test, y_test_predict), "and recall is",recall_score(y_test, y_test_predict))
print("AUC score is ", roc_auc_score(y_test, y_test_predict))



The result in testing dataset is:
accuracy is  0.758719268153 and recall is 0.765734265734
AUC score is  0.758849175516


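After applying SMOTE, the test accuracy of logistic regression rises from about 0.52 to about 0.69, and random forest reaches about 0.76 on the same split. Keep in mind that the evaluations before and after SMOTE use different test sets, so the comparison is indicative rather than exact. To test this further, we could apply more classification algorithms and compare the results. The point is that even with simple logistic regression, SMOTE helps.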