# Introduction
This notebook was created as part of a Machine Learning academic course. Any feedback is more than welcome.

## Variables
There are 25 variables:

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default.payment.next.month: Default payment (1=yes, 0=no)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sklearn as sk

### Some personal settings

In [None]:
#plt.rcParams.keys()

In [None]:
%matplotlib inline

plt.style.use('seaborn')
sns.set(style="darkgrid")
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['xtick.labelsize'] = 14 
plt.rcParams['ytick.labelsize'] = 14 
plt.rcParams['axes.labelsize'] = 18

# Set style and font scale in one call; a second sns.set() would reset font_scale to its default
sns.set(style="darkgrid", font_scale=1.8)

pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.mode.chained_assignment = None
pd.set_option('display.float_format', lambda x: '%.4f' % x)
np.set_printoptions(formatter={'float_kind':'{:f}'.format})

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

# Load Data

In [None]:
# Default of credit card clients: predict DEFAULT
df = pd.read_csv('../input/default/default of credit card clients.csv')
df.head()

Check how many examples and how many features are in the dataset

In [None]:
print(df.shape)
df.shape[0] - df.dropna().shape[0]


We have 30,000 examples and 25 columns (24 features and one label); 68 rows contain NA values. Let's look at the data.

In [None]:
df.head()

# Data Cleaning & Preprocessing

Change column names to be more convenient 

In [None]:
df = df.rename(columns=str.lower)
df = df.rename(columns={'education': 'educ', 'marriage': 'status', 'pay_0': 'pay_1'})

Drop unneeded columns

In [None]:
df = df.drop(columns=['id'])

In [None]:
df.columns.to_list()

Let's look for missing data

In [None]:
df.isna().any()

In [None]:
df.isna().mean()

Since the number of null values is really small, we'll simply drop them.

In [None]:
df = df.dropna()
df.default.value_counts(dropna=False)

### Let's make sure all columns are as documented.

In [None]:
df.describe()

According to our documentation, the PAY_n variables indicate the number of months of delay, with -1 meaning "pay duly". Then what do -2 and 0 mean? It seems to me that all three values should be treated as "pay duly" and mapped to 0. Let's fix this.

In [None]:
pay_cols = ['pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']
pay_cols = [col.lower() for col in pay_cols]

for col in pay_cols:
    fil = (df[col] == -2) | (df[col] == -1) | (df[col] == 0)
    df.loc[fil, col] = 0
    
df.pay_1.value_counts()
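
The recoding above maps every non-positive status to 0, which can also be written with `clip`. A minimal sketch on a toy Series (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy repayment-status column (hypothetical values, for illustration only).
pay = pd.Series([-2, -1, 0, 1, 2, 8])

# Everything at or below 0 is treated as "pay duly" and becomes 0;
# positive values (months of delay) are left untouched.
pay_clean = pay.clip(lower=0)

print(pay_clean.tolist())
```

This gives the same result as the boolean filter as long as the column contains no values below -2, which is an assumption about the data.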

Categorical values to 1-hot

In [None]:
df_no_dummies = df
non_dummy_df = df 
df = pd.get_dummies(df)
cols = [col.replace(' ', '_') for col in df.columns]
df.columns = cols
df.head()
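
To see what `pd.get_dummies` does, here is a small sketch on a toy frame (the column names and values are hypothetical): numeric columns pass through unchanged, while each string column is replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame with one numeric and one string column (illustrative values only).
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["female", "male", "female"],
})

# String columns are expanded into <column>_<category> indicator columns.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns may come back as booleans rather than integers, so casting to a numeric type before modelling is a reasonable precaution.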

Check all values are indeed numeric

In [None]:
df.dtypes

All features are numeric, that's great.

# Data exploration

What are the statistics of the data?

In [None]:
df.describe()

Let's take a look at the distributions of some features

In [None]:
ax = df.sex_female.value_counts(normalize=True).plot(kind='bar')
ax.set_xlabel("Gender", size=18)
ax.set_xticklabels(['Female', 'Male'], rotation = 0)
ax.tick_params(axis='both', which='major', labelsize=14)
df.sex_female.value_counts(normalize=True)

In [None]:
ax = df_no_dummies.educ.value_counts(normalize=True).plot(kind='bar')
ax.set_xlabel("Education")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)#, ha="right")
df_no_dummies.educ.value_counts(normalize=True)

In [None]:
ax = df_no_dummies.status.value_counts(normalize=True).plot(kind='bar')
ax.set_xlabel("Status")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 0)  #, ha="right")
df_no_dummies.status.value_counts(normalize=True)

In [None]:
sns.set(font_scale=1.6)

ax = df.default.value_counts(normalize=True).plot(kind='bar')
ax.set_xticklabels(['Not_defaulted', 'defaulted'], rotation = 0)
ax.tick_params(axis='both', which='major', labelsize=14)
df.default.value_counts(normalize=True)

About 22% of the clients defaulted. The dataset is somewhat imbalanced; we'll keep that in mind.

Distribution of the age in our data:

In [None]:
sns.set(font_scale=1.6)

plt.title('Age distribution')
df.age.hist(bins=17)

This histogram seems reasonable (given that the minimum age is 21).


Here we can see the distribution of the credit that was given

In [None]:
sns.set(font_scale=1.25)

plt.figure(figsize=(14,8))
ax = df.limit_bal.hist(bins=50)
ax.set_xticks(np.linspace(0, 1000000, 11))
ax.set_xlim(0,750000)
plt.title('Credit(Limit_Bal) distribution')


In [None]:
for i in range(1, 7):
    col = 'pay_' + str(i)
    # Bar chart of how often each repayment status occurs, ordered by status value
    df[col].value_counts().sort_index().plot(kind='bar', figsize=(7,4))
    plt.title('value counts for column: {}'.format(col), fontsize=18)
    plt.show()

In [None]:
for i in range(1, 7):
    col = 'bill_amt' + str(i)
    # Histogram of the bill amount for this month
    df[col].plot(kind='hist', figsize=(7,4), bins=40)
    plt.title('Histogram for column: {}'.format(col), fontsize=18)
    plt.show()

Let's check the credit limit distribution VS sex.

In [None]:
sns.set(font_scale=1.8)

fig, ax1 = plt.subplots(figsize=(14,8))
s = sns.boxplot(ax = ax1, x="sex", y="limit_bal", hue="sex",data=non_dummy_df,\
                palette="PRGn",showfliers=True)
plt.title("Limit Balance Box plot by Gender", fontsize=24)


In [None]:
sns.set(font_scale=1.8)

fig, ax1 = plt.subplots(figsize=(14, 8))
sns.boxplot(x="educ", y="limit_bal", hue='sex', data=df_no_dummies)
plt.title("Limit Balance Box plot by Gender & Education", fontsize=24)


In [None]:
sns.set(font_scale=1.8)

fig, ax1 = plt.subplots(figsize=(14,8))
s = sns.boxplot(ax = ax1, x="status", y="limit_bal", hue="sex",data=non_dummy_df,\
                palette="PRGn",showfliers=True)
plt.title("Limit Balance Box plot by Status", fontsize=24)


Now, let's see how default differs with respect to other features.

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
sns.set(font_scale=2)
sns.barplot(x="sex", y="default", data=non_dummy_df, hue="educ", capsize=.05)
plt.legend(loc = 'best', bbox_to_anchor=(0, 1), fontsize=18)
plt.title("Default average & confidence intervals by Education level & Gender", fontsize=30)

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
sns.set(font_scale=2)
sns.barplot(x="sex", y="default", data=non_dummy_df, hue="status", capsize=.05)
plt.legend(loc = 'upper left', bbox_to_anchor=(0, 1), fontsize=18)
plt.title("Default average & confidence intervals by Status & Gender", fontsize=30)

In [None]:
cols = ['pay_amt' + str(i) for i in range(1,7)]
df[cols].describe()

Now, let's see a correlation matrix heat map, and try to find interesting relations

In [None]:
sns.set(style="white")

corr = df.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed from recent NumPy versions
f, ax = plt.subplots(figsize=(10, 8))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.9, vmin=-0.9, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title("Correlation Matrix", fontsize=18)

And now plot correlations to default

In [None]:
plt.style.use('seaborn')
sns.set(font_scale=1.4)
plt.figure(figsize=(10,8))
plt.title('Default correlation with features')

df.corr()['default'].drop('default').plot(kind='barh')

The strongest relationships are with the repayment status columns (pay_x): the correlation is positive, and it weakens the further back the month is from the most recent one.
The same holds for the payment amounts (pay_amt_x), only with negative and weaker correlations.
Another notable correlation is with the amount of given credit (limit_bal).

# Preprocessing

Let's scale the x values

Because the features are on different scales, we'll rescale every column to the [0, 1] range with MinMaxScaler. Keep in mind that min-max scaling is sensitive to outliers.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

df = df.astype(float)
df = df.dropna()
scale = MinMaxScaler()
df_to_scale = df
scaled = scale.fit_transform(df_to_scale)
scaled_df = pd.DataFrame(scaled, columns=df_to_scale.columns)
scaled_df.sample(5)
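
The cell above fits the scaler on the whole dataset before splitting. A common alternative, sketched here on synthetic data (all names and values below are assumptions for illustration), is to learn the minimum and maximum from the training rows only and reuse them on the test rows, so that no test-set information reaches the preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary label

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max on the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach the test features can fall slightly outside [0, 1], which is expected and harmless for the models used here.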

### Train and test split

We have about 30k observations, so let's use 80% for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

test_size = round(0.2 * len(scaled_df))
train, test = train_test_split(scaled_df, test_size=test_size, random_state=0, shuffle=True)

label = 'default'

x_train, y_train = train.drop(label, axis=1), train[label]
x_test, y_test = test.drop(label, axis=1), test[label]

And let's split for NOT scaled data as well.

In [None]:
train_not_scaled, test_not_scaled = train_test_split(df, test_size=test_size,\
                                                     random_state=0, shuffle=True)

label = 'default'

x_train_not_scaled, y_train_not_scaled = train_not_scaled.drop(label, axis=1), train_not_scaled[label]
x_test_not_scaled, y_test_not_scaled = test_not_scaled.drop(label, axis=1), test_not_scaled[label]

In [None]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape


# Evaluation + Benchmark

As this is a classification problem with somewhat imbalanced labels, we'll use F1 and accuracy as our evaluation metrics.
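
As a reminder of what the F1 score measures, here is a small worked sketch: F1 is the harmonic mean of precision and recall. The helper name and the counts below are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score computed from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)  # share of predicted defaults that really defaulted
    recall = tp / (tp + fn)     # share of real defaults that were caught
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 30 defaults caught, 10 false alarms, 20 defaults missed.
example_f1 = f1_from_counts(tp=30, fp=10, fn=20)
print(round(example_f1, 4))
```

Unlike accuracy, this score ignores true negatives, so a model that always predicts "no default" cannot score well on it.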

The benchmark would be the most common label in the train set

In [None]:
df.default.value_counts(dropna=False)

In [None]:
y_train.value_counts(dropna=False)

In this case it's 0 (not defaulted), let's check its performance on both train and test

In [None]:
y_test.value_counts(dropna=False)

In [None]:
acc = len(df[df.default==0])/len(df)
pred = np.zeros(len(df))
f1 = sk.metrics.f1_score(df.default, pred)
print('Benchmark Accuracy:', acc)
print('Benchmark F1:', f1) # it'll be 0, since the positive class is never predicted

The class proportions in train and test are almost equal, so this benchmark performs about the same on both. Our best algorithm should beat this performance!
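
A majority-class benchmark can also be evaluated separately on train and test with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with roughly the same class balance as this dataset; the variable names are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with about 22% positives (illustrative only).
y_toy_train = np.array([0] * 78 + [1] * 22)
y_toy_test = np.array([0] * 39 + [1] * 11)

# The features are irrelevant for this strategy, so a column of zeros is enough.
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_test = np.zeros((len(y_toy_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy_train, y_toy_train)
baseline_pred = baseline.predict(X_toy_test)

baseline_acc = accuracy_score(y_toy_test, baseline_pred)
# zero_division=0 avoids a warning when no positives are predicted.
baseline_f1 = f1_score(y_toy_test, baseline_pred, zero_division=0)

print(baseline_acc, baseline_f1)
```

The accuracy equals the share of the majority class in the test labels, while the F1 score is 0 because no default is ever predicted.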

# Running KNN - on NOT scaled data

In [None]:
from sklearn.neighbors import KNeighborsClassifier

train_acc = []
test_acc = []

train_f1 = []
test_f1 = []

k_vals = list(range(5, 41, 5))

vals = k_vals

for k in vals:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train_not_scaled, y_train_not_scaled)
    
    train_acc.append(knn.score(x_train_not_scaled, y_train_not_scaled))
    test_acc.append(knn.score(x_test_not_scaled, y_test_not_scaled))
    
    y_pred_train = knn.predict(x_train_not_scaled)
    y_pred_test = knn.predict(x_test_not_scaled)

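    # Score against the labels of the same (not scaled) split that was used for fitting
    train_f1.append(sk.metrics.f1_score(y_train_not_scaled, y_pred_train))
    test_f1.append(sk.metrics.f1_score(y_test_not_scaled, y_pred_test))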


In [None]:
sns.set(font_scale=1.5)
# This will plot the accuracies as a function of k.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(vals, train_acc, '-o', label='Training Accuracy')
ax1.plot(vals ,test_acc, '-o', label='Testing Accuracy')
ax1.set_ylabel("Accuracy")
ax1.set_xlabel("k")
plt.title('KNN accuracy by K - NOT scaled data')
plt.legend(fontsize=14)
plt.show()

# This will plot the f1 score as a function of k.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(vals, train_f1, '-o', label='Training F1 Score')
ax1.plot(vals ,test_f1, '-o', label='Testing F1 Score')
ax1.set_ylabel("F1 score")
ax1.set_xlabel("k")
plt.title('KNN F1 score by K - NOT scaled data')
plt.legend(fontsize=14)
plt.show()

It doesn't seem better than the benchmark even when it's performing best on the test set (k = 30). Also, we know that KNN can suffer badly from features that are on different scales, so let's run it on scaled data.

In [None]:
k = 30
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(x_train_not_scaled, y_train_not_scaled)
y_pred_test = knn.predict(x_test_not_scaled)
pred_df = pd.DataFrame({'KNN_not_scaled': y_pred_test})
pred_df['Beanchmark'] = 0

# KNN - on SCALED data

In [None]:
from sklearn.neighbors import KNeighborsClassifier

train_acc = []
test_acc = []

train_f1 = []
test_f1 = []

k_vals = list(range(5, 41, 5))
vals = k_vals

for k in vals:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    
    test_acc.append(knn.score(x_test, y_test))
    train_acc.append(knn.score(x_train, y_train))
    
    y_pred_train = knn.predict(x_train)
    y_pred_test = knn.predict(x_test)

    train_f1.append(sk.metrics.f1_score(y_train, y_pred_train))
    test_f1.append(sk.metrics.f1_score(y_test, y_pred_test))


In [None]:

# This will plot the accuracies as a function of k.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(vals, train_acc, '-o', label='Training Accuracy')
ax1.plot(vals ,test_acc, '-o', label='Testing Accuracy')
ax1.set_ylabel("Accuracy")
ax1.set_xlabel("k")
plt.title('KNN accuracy by K - scaled data')
plt.legend(fontsize=14)
plt.show()


# This will plot the f1 score as a function of k.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(vals, train_f1, '-o', label='Training F1 Score')
ax1.plot(vals ,test_f1, '-o', label='Testing F1 Score')
ax1.set_ylabel("F1 Score")
ax1.set_xlabel("k")
plt.title('KNN F1 Score by K - scaled data')
plt.legend(fontsize=14)
plt.show()

The accuracy improved, and the F1 score improved by a lot. Based on these graphs we'll choose k = 25.

In [None]:
k = 25
knn = KNeighborsClassifier(n_neighbors=k)
# Fit and predict on the SCALED split, since this column holds the scaled-data KNN
knn.fit(x_train, y_train)
y_pred_test = knn.predict(x_test)
pred_df['KNN_scaled'] = y_pred_test
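
Choosing k by looking at test-set scores can make the final test estimate optimistic. One possible alternative, sketched on synthetic data (the dataset, names and grid are assumptions, not the notebook's results), is to pick k by cross-validation on the training set with the scaler inside a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, mildly imbalanced binary problem (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe, {"knn__n_neighbors": list(range(5, 41, 5))}, scoring="f1", cv=5
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
cv_test_f1 = search.score(X_te, y_te)  # F1 of the refitted best model on held-out data
print(best_k, cv_test_f1)
```

The test set is then touched only once, after k has been fixed.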

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

max_depth_vals = range(3, 19, 3)

min_samples_vals = range(1, 101, 10)

for depth in max_depth_vals:
    
    train_acc = []
    test_acc = []
    
    train_f1 = []
    test_f1 = []
    
    for min_sample in min_samples_vals:
        classifier = DecisionTreeClassifier(random_state=0, \
                                            max_depth=depth, min_samples_leaf=min_sample)
        classifier.fit(x_train_not_scaled, y_train_not_scaled)
        
        train_acc.append(classifier.score(x_train_not_scaled, y_train_not_scaled))
        test_acc.append(classifier.score(x_test_not_scaled, y_test_not_scaled))
        
        y_pred_train = classifier.predict(x_train_not_scaled)
        y_pred_test = classifier.predict(x_test_not_scaled)

        train_f1.append(sk.metrics.f1_score(y_train, y_pred_train))
        test_f1.append(sk.metrics.f1_score(y_test, y_pred_test))

    # This will plot the Accuracy Scores as a function of min_samples_leaf.
  
    fig, ax = plt.subplots(figsize=(7, 5))
    plt.plot(min_samples_vals, train_acc, '-o', label = 'Training Accuracy')
    plt.plot(min_samples_vals, test_acc, '-o', label = 'Test Accuracy')
    ax.set_xlabel('Min Samples Leaf', fontsize=16)
    ax.set_ylabel('Accuracy', fontsize=16)
    plt.title('Accuracy for Decision Tree with Maximum depth = {} by min_samples leaf'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n')

    # This will plot the F1 Scores as a function of min_samples_leaf.

    fig = plt.figure(figsize=(7, 5))
    ax1 = fig.add_subplot()
    ax1.plot(min_samples_vals, train_f1, '-o', label = 'Training F1 Score')
    ax1.plot(min_samples_vals, test_f1, '-o', label = 'Test F1 Score')
    ax1.set_ylabel("F1 Score", fontsize=16)
    ax1.set_xlabel('Min Samples Leaf', fontsize=16)
    plt.title('F1 Score for Decision Tree with Maximum depth = {} by min_samples leaf'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n\n\n\n')


Based on these graphs we decided to go with min_samples_leaf=60 and max_depth=15.

In [None]:
max_depth = 15
min_samples_leaf = 60
classifier = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
classifier.fit(x_train_not_scaled, y_train_not_scaled)
y_pred_test = classifier.predict(x_test_not_scaled)
pred_df['Decision_tree'] = y_pred_test

# Let's plot the Decision Tree

We will plot this with max_depth = 6 so that the tree stays readable.

In [None]:
from sklearn.tree import export_graphviz
from IPython.display import SVG, display
from graphviz import Source


def plot_tree(tree, features, labels):
    graph = Source(export_graphviz(tree, feature_names=features, class_names=labels, filled = True))
    display(SVG(graph.pipe(format='svg')))

tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=60, random_state=0)
tree.fit(x_train, y_train)
plot_tree(tree, features=x_train.columns, labels=['Not Default', 'Default'])

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

max_depth_vals = range(3, 19, 3)

n_estimators_values = range(10, 200, 20)

for depth in max_depth_vals:
    
    train_acc = []
    test_acc = []
    
    train_f1 = []
    test_f1 = []
    
    for n in n_estimators_values:
        classifier = RandomForestClassifier(random_state=0, n_estimators=n, \
                                        max_depth=depth, min_samples_leaf=60) 
                                    # The parameters we decided on earlier
        classifier.fit(x_train_not_scaled, y_train_not_scaled)
        
        train_acc.append(classifier.score(x_train_not_scaled, y_train_not_scaled))
        test_acc.append(classifier.score(x_test_not_scaled, y_test_not_scaled))
        
        y_pred_train = classifier.predict(x_train_not_scaled)
        y_pred_test = classifier.predict(x_test_not_scaled)

        train_f1.append(sk.metrics.f1_score(y_train, y_pred_train))
        test_f1.append(sk.metrics.f1_score(y_test, y_pred_test))

    # This will plot the Accuracy Scores as a function of n_estimators.

    fig, ax = plt.subplots(figsize=(7, 5))
    plt.plot(n_estimators_values, train_acc, '-o', label = 'Training Accuracy')
    plt.plot(n_estimators_values, test_acc, '-o', label = 'Test Accuracy')
    ax.set_xlabel('n_estimators', fontsize=16)
    ax.set_ylabel('Accuracy', fontsize=16)
    plt.title('Accuracy for Random Forest with Maximum depth = {} by n_estimators'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n')

    # This will plot the F1 Scores as a function of n_estimators.

    fig = plt.figure(figsize=(7, 5))
    ax1 = fig.add_subplot()
    ax1.plot(n_estimators_values, train_f1, '-o', label = 'Training F1 Score')
    ax1.plot(n_estimators_values, test_f1, '-o', label = 'Test F1 Score')
    ax1.set_ylabel("F1 Score", fontsize=16)
    ax1.set_xlabel('n_estimators', fontsize=16)
    plt.title('F1 Score for Random Forest with Maximum depth = {} by n_estimators'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n\n\n\n')


Note that the y-axis spans a very small range for both F1 and accuracy (differences < 1%).
It seems that 100 is the best number of trees. Let's look at the feature importances:

In [None]:
n_of_trees = 100
classifier = RandomForestClassifier(random_state=0, n_estimators=n_of_trees, \
                                        max_depth=15, min_samples_leaf=60) 
                                    # The parameters we decided on earlier
classifier.fit(x_train_not_scaled, y_train_not_scaled)
pred_df['random_forest_pred'] = classifier.predict(x_test_not_scaled)

In [None]:
sns.set(font_scale=1.6)

tmp = pd.DataFrame({'Feature': x_train.columns, 'Feature importance': classifier.feature_importances_})
tmp = tmp.sort_values(by='Feature importance', ascending=False)

plt.figure(figsize = (10, 8))
plt.title('Features importance of Random Forest')
s = sns.barplot(x='Feature', y='Feature importance', data=tmp)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()   

# AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

max_depth_vals = range(5, 18, 4)

n_estimators_values = range(40, 161, 40)

for depth in max_depth_vals:
    
    train_acc = []
    test_acc = []
    
    train_f1 = []
    test_f1 = []
    
    for n in n_estimators_values:
        base_estimator = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=120)
                                        # AdaBoost tends to overfit, so we'll use more suitable parameters.
        classifier = AdaBoostClassifier(random_state=0, n_estimators=n, \
                                        base_estimator=base_estimator)
        classifier.fit(x_train_not_scaled, y_train_not_scaled)
        
        train_acc.append(classifier.score(x_train_not_scaled, y_train_not_scaled))
        test_acc.append(classifier.score(x_test_not_scaled, y_test_not_scaled))
        
        y_pred_train = classifier.predict(x_train_not_scaled)
        y_pred_test = classifier.predict(x_test_not_scaled)

        train_f1.append(sk.metrics.f1_score(y_train, y_pred_train))
        test_f1.append(sk.metrics.f1_score(y_test, y_pred_test))

    # This will plot the Accuracy Scores as a function of n_estimators.

    fig, ax = plt.subplots(figsize=(7, 5))
    plt.plot(n_estimators_values, train_acc, '-o', label = 'Training Accuracy')
    plt.plot(n_estimators_values, test_acc, '-o', label = 'Test Accuracy')
    ax.set_xlabel('n_estimators', fontsize=16)
    ax.set_ylabel('Accuracy', fontsize=16)
    plt.title('Accuracy for AdaBoost with Maximum depth = {} by n_estimators'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n')

    # This will plot the F1 Scores as a function of n_estimators.

    fig = plt.figure(figsize=(7, 5))
    ax1 = fig.add_subplot()
    ax1.plot(n_estimators_values, train_f1, '-o', label = 'Training F1 Score')
    ax1.plot(n_estimators_values, test_f1, '-o', label = 'Test F1 Score')
    ax1.set_ylabel("F1 Score", fontsize=16)
    ax1.set_xlabel('n_estimators', fontsize=16)
    plt.title('F1 Score for AdaBoost with Maximum depth = {} by n_estimators'.format(depth)\
              , fontsize=18)
    plt.legend(fontsize=14)
    plt.show()
    print('\n\n\n\n')


There is major overfitting here. Note also that the y-axis spans a very small range for both F1 and accuracy (usually differences < 1%).
It seems that 40 is the best number of trees and 5 is the best max_depth. Let's look at the feature importances:

In [None]:
n_of_trees = 40
base_estimator = DecisionTreeClassifier(max_depth=5, min_samples_leaf=120)
                                # The parameters chosen from the graphs above
classifier = AdaBoostClassifier(random_state=0, n_estimators=n_of_trees, base_estimator=base_estimator)
classifier.fit(x_train_not_scaled, y_train_not_scaled)
pred_df['AdaBoost'] = classifier.predict(x_test_not_scaled)

In [None]:
sns.set(font_scale=1.6)

tmp = pd.DataFrame({'Feature': x_train.columns, 'Feature importance': classifier.feature_importances_})
tmp = tmp.sort_values(by='Feature importance', ascending=False)

plt.figure(figsize = (10, 8))
plt.title('Features importance of AdaBoost')
s = sns.barplot(x='Feature', y='Feature importance', data=tmp)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()   

# Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(penalty='l1', solver='liblinear')
classifier.fit(x_train, y_train)

y_pred_test = classifier.predict(x_test)
y_pred_train = classifier.predict(x_train)

acc_train = classifier.score(x_train, y_train)
acc_test = classifier.score(x_test, y_test)

f1_train = sk.metrics.f1_score(y_train, y_pred_train)
f1_test = sk.metrics.f1_score(y_test, y_pred_test)


pred_df['logistic_regression'] = y_pred_test

print('Accuracy for train:', acc_train)
print('F1 Score for train:', f1_train, '\n')

print('Accuracy for test:', acc_test)
print('F1 Score for test:', f1_test)

# Neural Networks with pyTorch

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [None]:
x_train.shape

In [None]:
## train data
class trainData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)


# Pass NumPy arrays (.values) so positional indexing does not depend on the pandas index
train_data = trainData(torch.FloatTensor(x_train.values), 
                       torch.FloatTensor(y_train.values))
## test data    
class testData(Dataset):
    
    def __init__(self, X_data):
        self.X_data = X_data
        
    def __getitem__(self, index):
        return self.X_data[index]
        
    def __len__ (self):
        return len(self.X_data)
    

test_data = testData(torch.FloatTensor(x_test.values))

In [None]:
BATCH_SIZE = 64
train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=1)

In [None]:
class binaryClassification(nn.Module):
    def __init__(self):
        super(binaryClassification, self).__init__()
        # The input size (30) must match the number of feature columns, x_train.shape[1].
        self.layer_1 = nn.Linear(30, 64) 
        self.layer_2 = nn.Linear(64, 64)
        self.layer_out = nn.Linear(64, 1) 
        
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.1)
        self.batchnorm1 = nn.BatchNorm1d(64)
        self.batchnorm2 = nn.BatchNorm1d(64)
        
    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.batchnorm1(x)
        x = self.relu(self.layer_2(x))
        x = self.batchnorm2(x)
        x = self.dropout(x)
        x = self.layer_out(x)
        
        return x

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
LEARNING_RATE = 0.001
model = binaryClassification()
model.to(device)
print(model)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
def binary_acc(y_pred, y_test):
    y_pred_tag = torch.round(torch.sigmoid(y_pred))

    correct_results_sum = (y_pred_tag == y_test).sum().float()
    acc = correct_results_sum/y_test.shape[0]
    acc = torch.round(acc * 100)
    
    return acc

In [None]:
EPOCHS = 50
model.train()
for e in range(1, EPOCHS+1):
    epoch_loss = 0
    epoch_acc = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        
        y_pred = model(X_batch)
        
        loss = criterion(y_pred, y_batch.unsqueeze(1))
        acc = binary_acc(y_pred, y_batch.unsqueeze(1))
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        

    print(f'Epoch {e+0:03}: | Loss: {epoch_loss/len(train_loader):.5f} |\
    Acc: {epoch_acc/len(train_loader):.3f}')


In [None]:
y_pred_list = []
model.eval()
with torch.no_grad():
    for X_batch in test_loader:
        X_batch = X_batch.to(device)
        y_test_pred = model(X_batch)
        y_test_pred = torch.sigmoid(y_test_pred)
        y_pred_tag = torch.round(y_test_pred)
        y_pred_list.append(y_pred_tag.cpu().numpy())

y_pred_list = [a.squeeze().tolist() for a in y_pred_list]
pred_df['NN'] = y_pred_list

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_list)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_list))

In [None]:
from sklearn.metrics import confusion_matrix

plt.figure(figsize=(12, 10))
confm = confusion_matrix(y_test, y_pred_list)
# confusion_matrix puts the true labels on the rows and the predictions on the columns
df_cm = pd.DataFrame(confm, index=['Actual Not Default', 'Actual Default'],
                     columns=['Predicted Not Default', 'Predicted Default'])

ax = sns.heatmap(df_cm, cmap='Oranges', annot=True, fmt='g')
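
When reading the heatmap, it helps to remember how scikit-learn lays out the matrix: rows are the true labels and columns are the predicted labels. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = default, 0 = no default.
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 1, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here the first row holds the clients who did not default (two predicted correctly, one false alarm) and the second row holds the clients who defaulted (one missed, one caught).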

# Averaging models

In [None]:
pred_df['default'] = test.default.reset_index(drop=True)
p = pred_df
pred_df.head()

In [None]:
pred_df = p
pred_df['model_avg'] = pred_df[['random_forest_pred', 'AdaBoost', 'NN']].mean(axis=1).round(decimals=0)
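
With three 0/1 predictions per row, taking the mean and rounding amounts to a majority vote. The sketch below shows this on toy predictions (the model names and values are hypothetical) and uses an explicit threshold, which avoids relying on how 0.5 is rounded if an even number of models is ever averaged.

```python
import pandas as pd

# Toy 0/1 predictions from three hypothetical models.
votes = pd.DataFrame({
    "model_a": [1, 0, 1, 0],
    "model_b": [1, 0, 0, 0],
    "model_c": [0, 1, 1, 0],
})

# A row is labelled 1 when more than half of the models vote 1.
majority = (votes.mean(axis=1) > 0.5).astype(int)

print(majority.tolist())
```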


In [None]:
acc = []
f1 = []
models = pred_df.drop('default', axis=1).columns

for model in models:
    acc.append(sk.metrics.accuracy_score(pred_df.default, pred_df[model]))
    f1.append(sk.metrics.f1_score(pred_df.default, pred_df[model]))

comparison = pd.DataFrame({'Model': models, 'Accuracy': acc, 'F1 Score': f1})
comparison

In [None]:
fig, ax = plt.subplots(figsize=(18,10))
sns.set(font_scale=1.3)

values = comparison.Accuracy.values
clrs = ['grey' if (x < max(values)) else 'red' for x in values ]

graph = sns.barplot(x="Model", y="Accuracy", data=comparison, palette= clrs)

for p in graph.patches:
        graph.annotate('{:.0f}%'.format(p.get_height()*100), (p.get_x()+0.42, p.get_height()),
                    ha='center', va='bottom')
plt.title("Accuracy Comparison by Model", fontsize=20)

In [None]:
fig, ax = plt.subplots(figsize=(18,10))
sns.set(font_scale=1.2)
sns.barplot(x="Model", y="F1 Score", data=comparison, capsize=.05)
plt.title("F1 Score Comparison by Model", fontsize=20)

# Performance vs. amount of data

In [None]:
train_acc = []
test_acc = []

train_f1 = []
test_f1 = []

percents = [0.1, 0.3, 0.5, 0.7, 1]

x_train_not_scaled = x_train_not_scaled.reset_index(drop=True)
y_train_not_scaled = y_train_not_scaled.reset_index(drop=True)

for p in percents:
    x_train_to_use = x_train_not_scaled.iloc[: int(p * len(x_train_not_scaled))]
    y_train_to_use = y_train_not_scaled.iloc[: int(p * len(y_train_not_scaled))]

    n_of_trees = 100
    classifier = RandomForestClassifier(random_state=0, n_estimators=n_of_trees, \
                                            max_depth=15, min_samples_leaf=60) 
                                        # The parameters we decided on earlier
    classifier.fit(x_train_to_use, y_train_to_use)
    
    test_acc.append(classifier.score(x_test_not_scaled, y_test_not_scaled))
    train_acc.append(classifier.score(x_train_to_use, y_train_to_use))
    
    y_pred_train = classifier.predict(x_train_to_use)
    y_pred_test = classifier.predict(x_test_not_scaled)

    train_f1.append(sk.metrics.f1_score(y_train_to_use, y_pred_train))
    test_f1.append(sk.metrics.f1_score(y_test_not_scaled, y_pred_test))


In [None]:
sns.set(font_scale=1.5)
# This will plot the accuracies as a function of % of data.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(percents, train_acc, '-o', label='Training Accuracy')
ax1.plot(percents, test_acc, '-o', label='Testing Accuracy')
ax1.set_ylabel("Accuracy")
ax1.set_xlabel("Fraction of training data")
plt.title('Random Forest Accuracy by fraction of training data')
plt.legend(fontsize=14)
plt.show()


# This will plot the f1 score as a function of % of data.

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot()
ax1.plot(percents, train_f1, '-o', label='Training F1 Score')
ax1.plot(percents, test_f1, '-o', label='Testing F1 Score')
ax1.set_ylabel("F1 Score")
ax1.set_xlabel("Fraction of training data")
plt.title('Random Forest F1 Score by fraction of training data')
plt.legend(fontsize=14)
plt.show()

The test score does not seem to improve when we go from 70% to 100% of the training data. Hence, collecting more data would probably not help this model much.
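
The curves above come from a single train/test split and always take the first rows of the training set, so part of the flat segment could be noise. As a rough cross-check, scikit-learn's `learning_curve` repeats the same experiment over several cross-validation folds. The sketch below is only an illustration: it runs on synthetic data from `make_classification` (an assumption, not the credit data), and the names `X_demo`, `y_demo` and the smaller forest settings are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic, imbalanced stand-in data (assumption). In the notebook one would
# pass x_train_not_scaled and y_train_not_scaled instead.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.78, 0.22], random_state=0)

# A smaller forest than the one above, only to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=50, max_depth=15,
                             min_samples_leaf=5, random_state=0)

# Fit on growing fractions of each training fold and score on the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    clf, X_demo, y_demo,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0],
    cv=5, scoring="f1", shuffle=True, random_state=0,
)

# One row per training size, one column per fold.
mean_val = val_scores.mean(axis=1)
std_val = val_scores.std(axis=1)
for n, m, s in zip(train_sizes, mean_val, std_val):
    print(f"{n:4d} samples: validation F1 = {m:.3f} +/- {s:.3f}")
```

If the fold-to-fold standard deviation is larger than the change between the last two sizes, the plateau is hard to distinguish from noise.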

# Stacking models - OLS-weighted average

In [None]:
pred_df = pred_df.drop('model_avg', axis=1)
pred_df.head()

In [None]:
from statsmodels.regression.linear_model import OLS

# Regress the true labels on the models' predictions (no intercept);
# the fitted coefficients serve as the stacking weights.
ols = OLS(pred_df.default, pred_df.drop(['default', 'Beanchmark'], axis=1)).fit()

ols.summary()


In [None]:
dic = dict(ols.params)
sum_params = sum(dic.values())

In [None]:
compare = pd.DataFrame(columns=['Model', 'weight'])
compare['Model'] = dic.keys()
compare['weight'] = dic.values()
norm_weight = [param/sum_params for param in ols.params]
compare['norm_weight'] = norm_weight
c = compare
compare

In [None]:
compare = c
compare_t = compare.T
compare_t = compare_t.rename(columns=compare_t.iloc[0])
compare_t = compare_t.iloc[1:]
compare = compare_t.T
compare_t

In [None]:
pred_df['final_model'] = compare_t.iloc[0]['KNN_not_scaled']*pred_df.KNN_not_scaled\
    + compare_t.iloc[0]['KNN_scaled']*pred_df.KNN_scaled\
    + compare_t.iloc[0]['Decision_tree']*pred_df.Decision_tree\
    + compare_t.iloc[0]['random_forest_pred']*pred_df.random_forest_pred\
    + compare_t.iloc[0]['AdaBoost']*pred_df.AdaBoost\
    + compare_t.iloc[0]['logistic_regression']*pred_df.logistic_regression\
    + compare_t.iloc[0]['NN']*pred_df.NN


pred_df.final_model = pred_df.final_model.round()
pred_df.head()


In [None]:
pred_df['final_norm_model'] = compare_t.iloc[1]['KNN_not_scaled']*pred_df.KNN_not_scaled\
    + compare_t.iloc[1]['KNN_scaled']*pred_df.KNN_scaled\
    + compare_t.iloc[1]['Decision_tree']*pred_df.Decision_tree\
    + compare_t.iloc[1]['random_forest_pred']*pred_df.random_forest_pred\
    + compare_t.iloc[1]['AdaBoost']*pred_df.AdaBoost\
    + compare_t.iloc[1]['logistic_regression']*pred_df.logistic_regression\
    + compare_t.iloc[1]['NN']*pred_df.NN


pred_df.final_norm_model = pred_df.final_norm_model.round()
pred_df.head()
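
The two weighted sums above can also be written as a single dot product, which avoids repeating every column name by hand. The sketch below is a minimal illustration with made-up numbers: the frame `preds`, the model names and the weights are all hypothetical, not values from this notebook.

```python
import pandas as pd

# Hypothetical hard predictions (0/1) from three models for six clients.
preds = pd.DataFrame({
    "model_a": [1, 0, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 0, 1, 1],
    "model_c": [0, 0, 1, 0, 1, 1],
})

# Hypothetical weights, standing in for the OLS coefficients.
weights = pd.Series({"model_a": 0.5, "model_b": 0.2, "model_c": 0.1})

# Normalise so the weights sum to 1, then blend all columns in one step.
norm_weights = weights / weights.sum()
blend = preds[norm_weights.index].dot(norm_weights)

# Explicit threshold instead of round(), so the handling of ties is visible.
final = (blend >= 0.5).astype(int)
print(final.tolist())
```

Selecting the columns through `norm_weights.index` ties each weight to its own column by name. Note that `round()` in pandas rounds exact halves to the nearest even number, so a blend of exactly 0.5 would become 0, whereas the threshold used here maps it to 1.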

In [None]:
acc = []
f1 = []
models = pred_df.drop('default', axis=1).columns

for model in models:
    acc.append(sk.metrics.accuracy_score(pred_df.default, pred_df[model]))
    f1.append(sk.metrics.f1_score(pred_df.default, pred_df[model]))

comparison = pd.DataFrame({'Model': models, 'Accuracy': acc, 'F1 Score': f1})
comparison

In [None]:
fig, ax = plt.subplots(figsize=(18,10))
sns.set(font_scale=1.3)

values = comparison.Accuracy.values
clrs = ['grey' if (x < max(values)) else 'red' for x in values ]

graph = sns.barplot(x="Model", y="Accuracy", data=comparison, palette=clrs)

# Label each bar with its accuracy, centred horizontally on the bar
for p in graph.patches:
    graph.annotate('{:.0f}%'.format(p.get_height()*100),
                   (p.get_x() + p.get_width() / 2, p.get_height()),
                   ha='center', va='bottom')
plt.title("Accuracy Comparison by Model", fontsize=20)

# Stacking models - with Sklearn function

In [None]:
# compare standalone models for binary classification
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot


# get a list of models to evaluate
def get_models():
    models = dict()
    models['lr'] = LogisticRegression()
    models['knn'] = KNeighborsClassifier()
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    return models


# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

In [None]:
X, y = scaled_df.drop('default', axis=1), scaled_df.default
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.show()

In [None]:
from sklearn.ensemble import StackingClassifier


def get_stacking():
    # define the base models
    level0 = list()
    level0.append(('lr', LogisticRegression()))
    level0.append(('knn', KNeighborsClassifier()))
    level0.append(('cart', DecisionTreeClassifier()))
    level0.append(('svm', SVC()))
    level0.append(('bayes', GaussianNB()))
    # define meta learner model
    level1 = LogisticRegression()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
    return model

In [None]:
def get_models():
    models = dict()
    models['lr'] = LogisticRegression()
    models['knn'] = KNeighborsClassifier()
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    models['stacking'] = get_stacking()
    return models

In [None]:
X, y = scaled_df.drop('default', axis=1), scaled_df.default

# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.show()
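
The cross-validation above scores the stacked model on accuracy only, while the earlier comparisons also report the F1 score. A minimal sketch of how a stacked model could be scored on both metrics with a held-out split is shown below. It runs on synthetic data (an assumption, not the credit data), leaves out the SVM to keep it quick, and the names `X_demo`, `y_demo`, `stack_acc` and `stack_f1` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.78, 0.22], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

# Base models feed their out-of-fold predictions to the meta learner.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("cart", DecisionTreeClassifier(random_state=1)),
        ("bayes", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)

# Score the held-out split on both metrics.
y_pred = stack.predict(X_te)
stack_acc = accuracy_score(y_te, y_pred)
stack_f1 = f1_score(y_te, y_pred)
print(f"accuracy = {stack_acc:.3f}, F1 = {stack_f1:.3f}")
```

On an imbalanced target such as default payments, the two metrics can rank models differently, so looking at both gives a fuller picture.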

# Thank you for your time!

Again, any comments are welcome.