# Kaggle Competition: Home Credit Default Risk

> Predict how capable each applicant is of repaying a loan.

References:<br>
[Data Sources](https://www.kaggle.com/c/home-credit-default-risk/data) <br>
[Credit Fraud || Dealing with Imbalanced Datasets](https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets)<br>
[Credit Card Fraud Analysis: Analyzing and Handling Imbalanced Data](https://medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E7%9F%A5%E8%AD%98%E6%AD%B7%E7%A8%8B/%E4%BF%A1%E7%94%A8%E5%8D%A1%E8%A9%90%E9%A8%99%E5%88%86%E6%9E%90-%E4%B8%8D%E5%B9%B3%E8%A1%A1%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90%E8%88%87%E8%99%95%E7%90%86kernel%E7%BF%BB%E8%AD%AFpart1-7f1b0a645f9a)<br><br>

task:

> 1. Check dataset

> 2. Data Processing

> 3. Under-sampling

> 4. Over-sampling (SMOTE Technique)


# 1. Check dataset

In [143]:
import pandas as pd
import numpy as np

In [144]:
df = pd.read_csv('data/home_default/application_train.csv')

In [145]:
df.shape

(307511, 122)

In [146]:
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


#### Imbalanced data

About 92% of the rows have TARGET 0 (loans that were repaid on time) and about 8% have TARGET 1 (loans that were not repaid on time).

In [147]:
df['TARGET'].value_counts(normalize=True)

0    0.919271
1    0.080729
Name: TARGET, dtype: float64

#### train test split

In [148]:
# define training and testing data
y = df['TARGET']
X = df.copy().drop(columns = ['TARGET'])

In [149]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
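With only about 8% positives, a plain random split can leave the two parts with slightly different class ratios. The sketch below uses synthetic labels (`X_toy` and `y_toy` are illustrative stand-ins, not the competition data) to show how the `stratify` argument of `train_test_split` keeps the ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows with 8% positives (assumption).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 80 + [0] * 920)

# stratify=y_toy makes both parts keep the original positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print('train positive rate:', y_tr.mean())
print('test positive rate:', y_te.mean())
```

Whether to stratify the real split is a design choice; it mainly matters when the minority class is small enough that a random split could noticeably shift its share.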

# 2. Data Processing

task:

> a. Outliers

> b. Missing value

> c. Transform categorical data

> d. Scaling

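#### a. Deal with outliers

The value 365243 in `DAYS_EMPLOYED` is anomalous: it corresponds to roughly 1,000 years of employment, so it cannot be a real measurement. First check whether the rows carrying it default at a different rate from the rest.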

In [None]:
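anom = df[df['DAYS_EMPLOYED'] == 365243]
non_anom = df[df['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))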

Let's create a new boolean column indicating whether the value was anomalous, then replace the anomalous values with `np.nan`.

In [None]:
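# Assumption: the later cells use df_train / df_test, which are never defined above.
# Here df_train is the frame loaded earlier and df_test is application_test.csv.
df_train = df
df_test = pd.read_csv('data/home_default/application_test.csv')

for frame in (df_train, df_test):
    # Create an anomalous flag column
    frame['DAYS_EMPLOYED_ANOM'] = frame['DAYS_EMPLOYED'] == 365243
    # Replace the anomalous values with nan
    frame['DAYS_EMPLOYED'] = frame['DAYS_EMPLOYED'].replace(365243, np.nan)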
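To make the flag-then-replace step concrete, here is a minimal sketch on a four-row toy frame (the values other than the 365243 sentinel are made up for illustration). It assigns the result of `replace` back to the column rather than relying on `inplace=True` on a column selection.

```python
import numpy as np
import pandas as pd

# Toy frame; 365243 is the sentinel value discussed above.
toy = pd.DataFrame({'DAYS_EMPLOYED': [-637, 365243, -1188, 365243]})

# Keep the information that the value was a sentinel ...
toy['DAYS_EMPLOYED_ANOM'] = toy['DAYS_EMPLOYED'] == 365243
# ... then blank it out so it no longer distorts means or scaling.
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].replace(365243, np.nan)

print(toy)
```

Keeping the flag column lets a model still use the fact that the value was missing, even after the number itself has been imputed.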

#### b. missing value

In [None]:
col_missing = list(df.columns[df.isnull().any()])

Function from: [Impute categorical missing values in scikit-learn](https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn)

In [None]:
from sklearn.base import TransformerMixin

In [None]:
class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [None]:
imp = DataFrameImputer()
imp.fit(df_train)

df_train = imp.transform(df_train)
df_test = imp.transform(df_test)

In [None]:
# No missing values after imputation.

df_train[col_missing].isnull().sum(axis = 0)/df_train.shape[0]
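scikit-learn's `SimpleImputer` offers the same two strategies (`'mean'` and `'most_frequent'`). The sketch below is one possible alternative to the custom class, shown on a toy frame whose column names (`income`, `housing`) are hypothetical. As above, the imputers are fitted on the training frame only and then applied to the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical columns: one numeric, one categorical.
toy_train = pd.DataFrame({
    'income': [100.0, np.nan, 300.0, 200.0],
    'housing': ['own', 'own', np.nan, 'rent'],
})
toy_test = pd.DataFrame({
    'income': [np.nan, 50.0],
    'housing': [np.nan, 'rent'],
})

num_cols = ['income']
cat_cols = ['housing']

# Learn the fill values from the training data only.
num_imp = SimpleImputer(strategy='mean').fit(toy_train[num_cols])
cat_imp = SimpleImputer(strategy='most_frequent').fit(toy_train[cat_cols])

# Apply the training statistics to the test data.
toy_test[num_cols] = num_imp.transform(toy_test[num_cols])
toy_test[cat_cols] = cat_imp.transform(toy_test[cat_cols])

print(toy_test)
```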

#### c. Transform categorical data

* Label encoding  →  columns with 2 unique categories.  <br>
* One-hot encoding  →  columns with more than 2 unique categories.

In [None]:
#  columns with 2 unique categories
columns_ob_two = ['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE', 'DAYS_EMPLOYED_ANOM']

Label-encode the variables with two unique categories:

In [None]:
from sklearn.preprocessing import LabelEncoder

for col in columns_ob_two:
    # Fit a fresh encoder per column on the training data, then reuse it on the test data
    le = LabelEncoder()
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])

One-hot encode the variables with more than two unique categories:

In [None]:
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

One-hot encoding can create columns in the training data for categories that never occur in the testing data. To remove them, we align the dataframes so that only the columns present in both are kept.

In [None]:
train_labels = df_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
df_train, df_test = df_train.align(df_test, join = 'inner', axis = 1)

# Add the target back in
df_train['TARGET'] = train_labels

print('Training Features shape: ', df_train.shape)
print('Testing Features shape: ', df_test.shape)
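A small sketch of why the alignment is needed, using a toy `HOUSING` column (a hypothetical name) in which one category appears only in the training frame:

```python
import pandas as pd

# 'with parents' occurs only in the training frame.
toy_train = pd.DataFrame({'HOUSING': ['rent', 'own', 'with parents'],
                          'TARGET': [0, 1, 0]})
toy_test = pd.DataFrame({'HOUSING': ['own', 'rent']})

toy_train = pd.get_dummies(toy_train)
toy_test = pd.get_dummies(toy_test)

# Set the labels aside, keep only the shared columns, then restore the labels.
labels = toy_train['TARGET']
toy_train, toy_test = toy_train.align(toy_test, join='inner', axis=1)
toy_train['TARGET'] = labels

print(list(toy_train.columns))
print(list(toy_test.columns))
```

After the inner join the training frame no longer has the `HOUSING_with parents` column, so both frames expose the same features to the model.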

#### d. Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# define training and testing data
y_train = df_train['TARGET']

X_train = df_train.copy().drop(columns = ['TARGET'])
X_test = df_test.copy()

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('Training data shape: ', X_train.shape)
print('Testing data shape: ', X_test.shape)

In [None]:
df_X_train = pd.DataFrame(X_train, columns = df_train.columns.drop('TARGET'))

In [None]:
df_X_test = pd.DataFrame(X_test, columns = df_test.columns)

In [None]:
df_X_train.head()
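For reference, `MinMaxScaler` with `feature_range=(0, 1)` maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data it was fitted on. The sketch below (toy numbers, not the competition data) checks this against a manual computation and shows that a test value outside the training range lands outside the 0-1 interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[15.0], [40.0]])   # 40 lies above the training maximum

# Fit on the training column only, as in the cell above.
toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_col)
scaled = toy_scaler.transform(test_col)

# Same computation by hand, using the training min and max.
manual = (test_col - train_col.min()) / (train_col.max() - train_col.min())

print(scaled.ravel())
print(manual.ravel())
```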

# 3. Under-sampling

#### a. Random Under-Sampling

We want a sub-sample of our dataframe with a 50/50 ratio between the two classes. <br>
We then shuffle the data to check whether our models maintain a similar accuracy every time we run this script.

Warning: The main issue with "Random Under-Sampling" is information loss. Most of the majority-class rows are discarded, so our classification models may not perform as accurately as we would like.

In [None]:
df_train['TARGET'].value_counts()

* There are 24,825 cases of Default in our dataset, so we randomly draw 24,825 cases of non-Default to create our new sub-dataframe.
* We concatenate the 24,825 cases of Default and the 24,825 cases of non-Default, creating a new sub-sample.
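The two steps above can also be written without hard-coding the class count. This is a minimal pandas-only sketch on a toy frame (the `feature` column is made up), using `groupby(...).sample(...)` to draw the same number of rows from each class:

```python
import pandas as pd

# Toy frame: 2 positives, 8 negatives.
toy = pd.DataFrame({'feature': range(10), 'TARGET': [1, 1] + [0] * 8})

# Size of the smaller class decides how many rows to keep per class.
n_minority = toy['TARGET'].value_counts().min()

balanced = (
    toy.groupby('TARGET')
       .sample(n=n_minority, random_state=42)   # equal draw from each class
       .sample(frac=1, random_state=42)         # shuffle the rows
)

print(balanced['TARGET'].value_counts())
```

Fixing `random_state` makes the draw repeatable; leaving it out gives a different sub-sample on every run.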

In [None]:
df_train = df_X_train.copy() # create training dataset with target after scaling
df_train['TARGET'] = y_train.values # assign by position; the scaled frame has a fresh index

In [None]:
# Let's shuffle the data before creating the subsamples

df_train = df_train.sample(frac=1)

# Keep every Default row and take the same number of non-Default rows
default_df = df_train.loc[df_train['TARGET'] == 1]
non_default_df = df_train.loc[df_train['TARGET'] == 0][:len(default_df)]

normal_distributed_df = pd.concat([default_df, non_default_df])

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)

In [None]:
print('Shape of Default case: {}'.format(default_df.shape))
print('Shape of Non-Default case: {}'.format(non_default_df.shape))
print('Shape of All cases: {}'.format(new_df.shape))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
print('Distribution of the Classes in the subsample dataset')

sns.countplot(x='TARGET', data=new_df)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()

#### b. Fit a logistic classifier to examine the result of under-sampling

In [None]:
# define training and testing data
y_train = new_df['TARGET']

X_train = new_df.drop(columns = ['TARGET'])
X_test = df_X_test.copy()  # scaled test features, to match the scaled training features

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
params_log  = { 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                'penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2'; the default lbfgs solver rejects 'l1'
log = LogisticRegression(solver='liblinear', max_iter=10000)

In [None]:
gs_log = GridSearchCV(log, params_log, cv=5, n_jobs=-1, verbose=1)

# fitting the model for grid search 
gs_log.fit(X_train , y_train)

# summarize
print('Mean Accuracy: %.3f' % gs_log.best_score_)
print('Config: %s' % gs_log.best_params_)

Don't use accuracy as a metric with imbalanced datasets: it is usually high and misleading. Use the F1-score, precision/recall, or a confusion matrix instead.
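A quick illustration of the point, on synthetic labels with the same 92/8 split as this dataset: a "model" that always predicts 0 scores high on accuracy while never finding a single default.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Synthetic labels: 8 defaults, 92 non-defaults.
y_true = np.array([1] * 8 + [0] * 92)
# A trivial predictor that always says "no default".
y_pred_all_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_zero)
rec = recall_score(y_true, y_pred_all_zero, zero_division=0)
f1 = f1_score(y_true, y_pred_all_zero, zero_division=0)

print('accuracy: %.2f' % acc)
print('recall:   %.2f' % rec)
print('f1:       %.2f' % f1)
```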

In [None]:
from sklearn.metrics import classification_report

In [None]:
# Note: this evaluates on the same under-sampled data the model was fitted on
y_pred = gs_log.predict(X_train)

print(classification_report(y_train, y_pred, target_names=['0', '1']))

# 4. Over-sampling (SMOTE Technique)

[SMOTE for Imbalanced Classification with Python](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

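SMOTE stands for Synthetic Minority Over-sampling Technique. <br>
SMOTE creates new synthetic minority-class points by interpolating between an existing minority sample and one of its nearest minority-class neighbours, until the classes are balanced. <br>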
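The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration on random toy points, not the `imblearn` implementation: the helper `synth_samples` and the choice `k = 5` are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))   # toy minority-class points

k = 5
# Ask for k + 1 neighbours because each point is its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neigh_idx = nn.kneighbors(X_min)   # column 0 is the point itself

def synth_samples(n_new):
    """Interpolate between random minority points and one of their k neighbours."""
    base = rng.integers(0, len(X_min), size=n_new)
    pick = rng.integers(1, k + 1, size=n_new)   # skip column 0 (self)
    neighbour = neigh_idx[base, pick]
    lam = rng.random((n_new, 1))                # position along the segment
    return X_min[base] + lam * (X_min[neighbour] - X_min[base])

X_new = synth_samples(10)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, which is why SMOTE fills in the minority region instead of duplicating existing rows.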

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

In [None]:
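# Use the full (imbalanced) scaled training data here: the under-sampled
# X_train from section 3 is already balanced, so SMOTE would add nothing.
X_train = df_train.drop(columns=['TARGET'])
y_train = df_train['TARGET']

params_log  = { 'classification__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                'classification__penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2'; the default lbfgs solver rejects 'l1'
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression(solver='liblinear', max_iter=10000))
    ])

# Score with ROC AUC, since accuracy is misleading on imbalanced data
gs_log_smote = GridSearchCV(model, params_log, scoring='roc_auc', cv=5, n_jobs=-1, verbose=1)
gs_log_smote.fit(X_train, y_train)

In [None]:
# summarize
print('Mean ROC AUC: %.3f' % gs_log_smote.best_score_)
print('Config: %s' % gs_log_smote.best_params_)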

In [None]:
# Predictions must be compared with the labels of the same rows
y_pred = gs_log_smote.predict(X_train)
print(classification_report(y_train, y_pred, target_names=['0', '1']))

In [None]:
# # SMOTE happens during Cross Validation not before..

# # define pipeline
# steps = [('over', SMOTE()), ('model', LogisticRegression())]
# pipeline = Pipeline(steps=steps)

# # evaluate pipeline
# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# scores = cross_val_score(pipeline, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)