## Imbalanced Data
* Weighting: penalizing errors on the minority class more heavily
* Upsampling: randomly replicating instances of the minority class
* Downsampling: randomly removing instances of the majority class
* SMOTE (Synthetic Minority Over-sampling Technique): generating synthetic minority-class instances
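
The first three options can be illustrated without any library. The sketch below is only an illustration on toy data: the helper names `balanced_class_weights`, `upsample_minority`, and `downsample_majority` are hypothetical and do not come from the notebook. The weighting formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def upsample_minority(rows, labels, rng):
    """Randomly replicate rows of smaller classes until all classes match the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def downsample_majority(rows, labels, rng):
    """Randomly keep only as many rows per class as the smallest class has."""
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Toy data (assumption): 8 majority rows (label 0) and 2 minority rows (label 1)
rng = random.Random(0)
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

weights = balanced_class_weights(labels)  # {0: 0.625, 1: 2.5}
up_rows, up_labels = upsample_minority(rows, labels, rng)
down_rows, down_labels = downsample_majority(rows, labels, rng)
print(weights)
print(Counter(up_labels), Counter(down_labels))
```

Upsampling keeps every original row and adds duplicates, while downsampling discards majority rows, so the former risks overfitting to repeated points and the latter loses information.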

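### SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic data for the under-represented class.
For a sample $x_i$ belonging to the minority class, K-NN is used to find its K nearest neighbors.
One of those neighbors is chosen at random, and a new sample is generated with the following calculation:
$$
x_{\text{new}} = x_i + (x_{ih} - x_i) \cdot \delta
$$
Here $x_i$ is the reference sample from the minority class and $x_{ih}$ is one of the K nearest neighbors of $x_i$; $x_{ih}$ also belongs to the minority class. $\delta$ is a random number between 0 and 1.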
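
The interpolation step can be sketched in plain Python as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sample` and the toy points are assumptions, the base sample $x_i$ is drawn at random rather than visited in turn, and neighbors are found by a brute-force Euclidean search.

```python
import math
import random


def smote_sample(minority, k, n_new, rng):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    synthetic = []
    for _ in range(n_new):
        # Pick a base sample x_i from the minority class
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # K nearest minority-class neighbors of x_i (excluding x_i itself)
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        # Choose one neighbor x_ih and a random delta in [0, 1)
        x_ih = rng.choice(neighbors)
        delta = rng.random()
        # x_new = x_i + (x_ih - x_i) * delta, applied per feature
        synthetic.append([a + (b - a) * delta for a, b in zip(x_i, x_ih)])
    return synthetic


# Toy minority class (assumption): four 2-D points
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
new_points = smote_sample(minority, k=2, n_new=5, rng=random.Random(42))
for point in new_points:
    print(point)
```

Each synthetic point lies on the line segment between a minority sample and one of its minority-class neighbors, which is why SMOTE fills in the region occupied by the minority class instead of duplicating existing rows.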

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

In [2]:
# read data
hr = pd.read_csv("data/processed_data.csv")
hr = hr.drop(["Unnamed: 0"], axis=1)
target = "Attrition_Yes"
hr.head()

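| | Age | DailyRate | DistanceFromHome | EmployeeNumber | HourlyRate | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | TotalWorkingYears | ... | RelationshipSatisfaction_1 | RelationshipSatisfaction_2 | RelationshipSatisfaction_3 | RelationshipSatisfaction_4 | OverTime_No | OverTime_Yes | WorkLifeBalance_1 | WorkLifeBalance_2 | WorkLifeBalance_3 | WorkLifeBalance_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1102 | 1 | 1 | 94 | 5993 | 19479 | 8 | 11 | 8 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 49 | 279 | 8 | 2 | 61 | 5130 | 24907 | 1 | 23 | 10 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 37 | 1373 | 2 | 4 | 92 | 2090 | 2396 | 6 | 15 | 7 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 33 | 1392 | 3 | 5 | 56 | 2909 | 23159 | 1 | 11 | 8 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 27 | 591 | 2 | 7 | 40 | 3468 | 16632 | 9 | 12 | 6 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |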


In [3]:
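# Balancing Data
from imblearn.over_sampling import SMOTE

# Features: every column except positions 14 and 15; labels: column 15
temp = pd.concat([hr.iloc[:, :14], hr.iloc[:, 16:]], axis=1)
sm = SMOTE(random_state=2)
# fit_resample is the current name of the method formerly called fit_sample
X, y = sm.fit_resample(temp, hr.iloc[:, 15])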

In [4]:
X = pd.DataFrame(X)
y = pd.DataFrame(y)
X.shape, y.shape

((2466, 78), (2466, 1))

In [5]:
# split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1726, 78)
(740, 78)
(1726, 1)
(740, 1)
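
Note that the cells above resample the full dataset before splitting, so synthetic rows interpolated from what later becomes test data can end up in the training set, which may make the test scores optimistic. A common alternative is to split first and resample only the training portion. The sketch below shows that ordering on toy data; the function name `split_then_oversample` is hypothetical, and plain random duplication stands in for SMOTE to keep the example short.

```python
import random
from collections import Counter


def split_then_oversample(rows, labels, test_fraction, rng):
    """Hold out a test set first, then balance only the training rows."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]

    # Group the original training rows by class before adding anything
    pools = {}
    for x, y in zip(X_train, y_train):
        pools.setdefault(y, []).append(x)
    target = max(len(pool) for pool in pools.values())

    # Duplicate random training rows of the smaller classes (stand-in for SMOTE)
    for cls, pool in pools.items():
        for _ in range(target - len(pool)):
            X_train.append(rng.choice(pool))
            y_train.append(cls)
    return X_train, X_test, y_train, y_test


# Toy data (assumption): 16 majority rows and 4 minority rows, all distinct
rows = [[float(i)] for i in range(20)]
labels = [0] * 16 + [1] * 4
X_tr, X_te, y_tr, y_te = split_then_oversample(rows, labels, 0.3, random.Random(12))
print(Counter(y_tr), Counter(y_te))
```

With this ordering the test set keeps its original class ratio and contains only rows the model has never seen in any form.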


In [6]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression(random_state=12)
logisticRegr.fit(X_train, np.array(y_train).ravel())
predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, y_test)
print(score)

0.8337837837837838


In [7]:
from sklearn.ensemble import GradientBoostingClassifier  # GBM
from sklearn.metrics import classification_report
# GridSearchCV lives in sklearn.model_selection; train_test_split is already imported from there
from sklearn.model_selection import GridSearchCV

y_train = np.array(y_train)
y_train = y_train.ravel()
# Gradient Boosting
# base
base_model_args = {'max_depth': 3, 'n_estimators': 500, 'subsample': 1, 'random_state': 5,
            'min_samples_split': 2, 'min_samples_leaf':1, 'max_features':'sqrt'}
base_model = GradientBoostingClassifier(learning_rate=0.1, **base_model_args)
base_model.fit(X_train,y_train)

# learning rate, estimators
model1_args = {'learning_rate':0.1,'max_depth': 3, 'n_estimators': 1500, 'subsample': 1, 'random_state': 5,
            'min_samples_split': 2, 'min_samples_leaf':1, 'max_features':'sqrt'}
model1 = GradientBoostingClassifier(**model1_args)
model1.fit(X_train,y_train)

# sample split, leaf
model2_args = {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 1500, 'subsample': 0.95, 'random_state': 10,
            'min_samples_split': 40, 'min_samples_leaf': 7, 'max_features': 4}
model2 = GradientBoostingClassifier(**model2_args)
model2.fit(X_train,y_train)

pred=base_model.predict(X_test)
print(classification_report(y_test, pred))

pred=model1.predict(X_test)
print(classification_report(y_test, pred))

pred=model2.predict(X_test)
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

          0       0.87      0.97      0.92       360
          1       0.97      0.87      0.92       380

avg / total       0.92      0.92      0.92       740

             precision    recall  f1-score   support

          0       0.88      0.97      0.92       360
          1       0.97      0.87      0.92       380

avg / total       0.93      0.92      0.92       740

             precision    recall  f1-score   support

          0       0.88      0.99      0.93       360
          1       0.99      0.87      0.92       380

avg / total       0.93      0.93      0.93       740
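
The columns of these reports follow from simple counts of true positives, false positives, and false negatives per class. The sketch below recomputes them by hand on made-up labels; the function name `per_class_report` and the toy `y_true`/`y_pred` values are assumptions for illustration and are not taken from the models above.

```python
def per_class_report(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly predicted cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # predicted cls, was not
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # was cls, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = tp + fn  # number of true instances of cls
    return precision, recall, f1, support


# Toy labels (assumption): 4 true zeros and 6 true ones
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

for cls in (0, 1):
    precision, recall, f1, support = per_class_report(y_true, y_pred, cls)
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Read this way, a row such as precision 0.97 and recall 0.87 for class 1 means that almost every predicted 1 was correct, while about 13% of the true 1s were predicted as 0.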

