# Introduction

We are given a dataset of bank customers who have borrowed money from the bank (which bank is not specified). There are a couple of things we can (and will) explore, but the most important question we will try to answer in this notebook is: **Based on the dataset, can we predict whether a loan will be repaid?**

# Import relevant libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from scipy.stats import uniform
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler,LabelEncoder
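from sklearn.metrics import classification_report, roc_auc_score, recall_score, make_scorer, confusion_matrix, accuracy_score, f1_score, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; alias the replacement, which takes the same (estimator, X, y) arguments
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator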


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/credit-risk/original.csv')

# Get the basic idea about the dataset

Check what type of objects our dataset contains

In [None]:
df.info()

It can be noted that the feature space is very small.

Check nulls

In [None]:
ser = df.isnull().sum()
ser[ser>0]

As we see, only `age` has missing values. Since 3 missing values is a fairly small number (given that we have 2k entries in the dataset), we will impute them. Before imputing, though, let's check the basic statistics of `age`.

In [None]:
df['age'].describe()

We see that the `age` column has **negative values**. There are two possible explanations:

1. Data has been transformed

2. Someone has made a mistake when entering the values


Let's check which explanation is more likely

In [None]:
dataframe = df
feature = 'age'
sns.set_style('ticks')
plt.figure(figsize=(10,7))
dataframe[feature].hist(bins=50)
plt.title(f"Distribution of the feature `{feature}`",fontsize=25)
plt.show()

In [None]:
df[df['age'] < 0]

Firstly, we note that only 3 entries have a negative age (out of 1997 non-null values). Furthermore, the non-negative values fall into a range that agrees with common sense about age: (1) only adults (people older than 18 or 20) can borrow money from a bank, and (2) it is unlikely that many people older than 90 (or even 100) will borrow money from a bank.

Consequently, it is reasonable to assume that the negative values were entered by mistake. Hence we will remove the entries with negative values.

In [None]:
neg_age = df[df['age'] < 0].index
df.drop(neg_age,axis=0,inplace=True)

In [None]:
df['age'].describe()

We impute `age` with the mean.

In [None]:
# Assign the result back: inplace=True on a column selection may not update `df` in recent pandas
df['age'] = df['age'].fillna(df['age'].mean())

It should also be noted that the values in `age` are mostly floats. However, by "age" we normally mean how many years a person has lived (in other words, one's age is normally a whole number, not a decimal). One possible explanation for the decimals in `age` is that the dataset defines it as:

$$ \frac{\text{number of days a person has lived for}}{365}$$


For example, if a person was born on 1995/05/01 and today is 2020/11/07, then the person has lived for $9322$ days, which implies that the person's age is

$$\frac{9322}{365}=25.53972602739726$$


This definition of age at least explains the occurrence of decimal numbers in the column `age` (although it may not be the actual definition used in the dataset; only the creator of the dataset can tell).
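
The arithmetic in the example can be checked with a few lines of Python. This is only a sketch of the assumed definition (days lived divided by 365); the helper name `age_in_years` is hypothetical and is not part of the dataset or of any library.

```python
from datetime import date


def age_in_years(born: date, today: date) -> float:
    """Age under the assumed definition: days lived divided by 365."""
    return (today - born).days / 365


born = date(1995, 5, 1)
today = date(2020, 11, 7)

days_lived = (today - born).days      # leap days are counted by `date`
example_age = age_in_years(born, today)

print(days_lived)
print(example_age)
```

Dividing by 365 ignores leap years, so this "age" drifts slightly above the calendar age over time; that is a property of the assumed definition, not of the code.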

After cleaning the feature, we proceed by answering the following question: Is `age` a good predictor for defaults?

# Distribution of the target feature

In [None]:
dataframe = df
feature1 = 'default'
sns.countplot(x=dataframe[feature1])  # pass the column as `x`; recent seaborn treats the first positional argument as `data`
plt.title(f"Distribution of the feature `{feature1}`")

We see that the distribution is heavily imbalanced.

# Is age a good predictor of a default scenario?

In [None]:
dataframe = df
feature_1 = 'default'
feature_2 = 'age'
plt.figure(figsize=(7,7))
sns.boxplot(x=feature_1, y=feature_2, data=dataframe)
plt.title("How does one's age affect risk of default?")
plt.show()

And the answer is: yes, `age` is a good predictor. As the box plots show, **younger people are more likely to default**; the conditional distributions differ significantly (one can use ANOVA to quantify the difference).
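
As a rough illustration of the ANOVA idea, the sketch below runs a one-way ANOVA with `scipy.stats.f_oneway` on two groups of ages. The two arrays are synthetic stand-ins generated here for the example (their ranges and sizes are assumptions, not values from the dataset), so the printed numbers say nothing about the real data.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic stand-ins, NOT the notebook's data:
# repaid loans spread over all adult ages, defaults concentrated among the young
age_repaid = rng.uniform(18, 64, size=500)
age_default = rng.uniform(18, 35, size=80)

# One-way ANOVA: tests whether the group means are equal
f_stat, p_value = f_oneway(age_repaid, age_default)

print(f"F statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.3g}")
```

On the notebook's data, the same call would take `df.loc[df['default'] == 0, 'age']` and `df.loc[df['default'] == 1, 'age']` as its two groups; a small p-value would support the claim that mean age differs between them.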

In [None]:
df['income'].describe()

In [None]:
dataframe = df
feature = 'income'
sns.set_style('ticks')
plt.figure(figsize=(10,7))
dataframe[feature].hist()
plt.title(f"Distribution of the feature `{feature}`",fontsize=25)
plt.show()

We see that in our dataset, income has a (roughly) uniform distribution.

# Is income a good predictor of a default scenario?

In [None]:
dataframe = df
feature_1 = 'default'
feature_2 = 'income'
plt.figure(figsize=(7,7))
sns.boxplot(x=feature_1, y=feature_2, data=dataframe)
plt.title("How does one's income affect risk of default?")
plt.show()

In [None]:
dataframe = df
cat_feat = 'default'
cont_feat = 'income'
plt.figure(figsize=(7,7))
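for value in dataframe[cat_feat].unique():
    # histplot with kde=True stands in for the deprecated sns.distplot
    sns.histplot(dataframe[dataframe[cat_feat] == value][cont_feat], kde=True, stat='density', label=value)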
plt.legend()
plt.title(f"Distribution of `{cont_feat}` conditional on `{cat_feat}`")
plt.show()


Contrary to intuition, in our dataset lower income does not imply a higher chance of default. As we see from the charts above, the conditional distributions are roughly the same, indicating that `income` does not do a good job of differentiating bad loans.

In [None]:
df['loan'].describe()

In [None]:
dataframe = df
feature = 'loan'
sns.set_style('ticks')
plt.figure(figsize=(10,7))
dataframe[feature].hist()
plt.title(f"Distribution of {feature}",fontsize=25)
plt.show()

Unlike the distributions of `age` and `income`, the distribution of `loan` is skewed to the right.

Are there loans where the amount is less than 10?

In [None]:
df[df['loan'] < 10]

Now let's see the conditional distribution

# Is loan amount a good predictor of a default scenario?

In [None]:
dataframe = df
feature_1 = 'default'
feature_2 = 'loan'
plt.figure(figsize=(7,7))
sns.boxplot(x=feature_1, y=feature_2, data=dataframe)
plt.title("How does loan amount affect risk of default?")
plt.show()

In [None]:
dataframe = df
cat_feat = 'default'
cont_feat = 'loan'
plt.figure(figsize=(7,7))
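for value in dataframe[cat_feat].unique():
    # histplot with kde=True stands in for the deprecated sns.distplot
    sns.histplot(dataframe[dataframe[cat_feat] == value][cont_feat], kde=True, stat='density', label=value)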
plt.legend()
plt.title(f"Distribution of `{cont_feat}` conditional on `{cat_feat}`")
plt.show()



As we see, there is a clear separation: a larger loan amount implies a higher likelihood of default.

# Is there any relation between features `income` and `loan`?

In [None]:
dataframe = df
feature1 = 'loan'
feature2 = 'income'


g=sns.jointplot(x=dataframe[feature1], y=dataframe[feature2], kind="kde")
g.fig.set_figwidth(11)
g.fig.set_figheight(13)
plt.show()

A couple of observations can be made:
1. For high-income customers, the amount borrowed has a very large spread (i.e. it is quite possible that a person with a high income will borrow a very small amount of money).

2. For low-income customers, the spread is very small (in other words, it is highly unlikely, perhaps even impossible, that people with a small income will borrow large amounts of money).

To check that the observations above hold quantitatively, we can do the following:

1. Bin income (i.e discretize the feature `income`)

2. For each bin (i.e income category), calculate a spread of loan amount (for example, using variance/standard deviation)


Let's try

# Does higher income imply larger loan amount spread?

In [None]:
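df1 = df[['income','loan']].copy()
# Split income into 7 equal-width bins
df1['Binned income'] = pd.cut(df1['income'],7)
# Standard deviation of each column within every income bin
df1 = df1.groupby('Binned income', observed=False).std()
df1.reset_index(inplace=True)
df1['Binned income'] = df1['Binned income'].astype(str)
# Bins stay in ascending income order (the old sort_values call discarded its result)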


fig = px.bar(df1,
             x='Binned income',
             y='loan',
             title='Spread (stdev) of the loan amount for each income category')
fig.show()

As we see, income is positively correlated with the loan spread (i.e. some high-income people borrow large sums while others borrow tiny amounts, whereas people with low income generally borrow small amounts).

# Is there a correlation between a customer's income and the amount borrowed?

In [None]:
corr = df[['income','loan']].corr().values[0][1]
print(f'Correlation between `income` and `loan`: {round(corr,2)}')

As we see, there is a correlation between income and loan, but it is a pretty weak one (for the reason we've just elaborated on: loan amounts are spread out for high-income customers).

# EDA: Conclusions
Based on our dataset, we can draw the following conclusions (note, though, that these conclusions may not hold in general: our dataset is quite small, and the sample might not be random):
1. Younger people have a much higher likelihood of default.
2. Income doesn't affect the likelihood of default.
3. Larger loan amounts imply a higher likelihood of default.
4. Amounts borrowed by high-income customers are much more spread out than those borrowed by low-income customers.

# Feature importance estimation: Random Forest

We've hypothesized that `income` is the least useful feature for predicting default. There are several ways to verify this, but I will use the most straightforward one: Random Forest feature importance.

In [None]:
X = df.drop(['clientid','default'],axis=1)
y = df['default']

In [None]:
sns.set_style('darkgrid')

forest_clf = RandomForestClassifier(n_estimators=100)
forest_clf.fit(X, y)

importances = forest_clf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(7,7))
plt.bar(range(len(indices)),importances[indices])
plt.xticks(range(len(indices)), indices)
plt.title("Feature importance (Random Forest)")
plt.xlabel('Index of a feature')
plt.ylabel('Feature importance')
plt.show()



In [None]:
lowest_importance = X.columns[indices[-1]]  # column name of the least important feature
print(f'RF estimations show that the feature with the lowest importance is: {lowest_importance}')

As we see, Random Forest estimations agree with the conclusion we reached after visualizing conditional distributions.

# Feature selection

We see that the feature space is very small, so we will use all features available to us (except `clientid`, of course; this feature has no predictive value).

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=15)


mmsc = MinMaxScaler()
X_train = mmsc.fit_transform(X_train)
X_test = mmsc.transform(X_test)

# Naive Bayes

In [None]:
nb_clf = GaussianNB().fit(X_train,y_train)
print(classification_report(y_true=y_test, y_pred=nb_clf.predict(X_test)))
plot_confusion_matrix(nb_clf, X_test, y_test)

# Logistic Regression

In [None]:
log_random_state = None
log_clf = LogisticRegression(random_state=log_random_state).fit(X_train, y_train)
print(classification_report(y_true=y_test, y_pred=log_clf.predict(X_test)))
plot_confusion_matrix(log_clf, X_test, y_test)

# Important note:
We see that the recall (taking label `1` as positive) is pretty low for both logistic regression and NB. This is to be expected, mainly because the target feature is imbalanced (there are far fewer entries with label `1`).
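
To make the point about recall concrete, here is a small sketch on made-up labels (the two lists below are toy values invented for the illustration, not model output from this notebook). It shows how accuracy can look fine on imbalanced data while recall for label `1` stays low.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels, invented for this illustration: 8 repaid loans (0) and 2 defaults (1)
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that catches only one of the two defaults
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels the flattened matrix is ordered as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_recall = recall_score(toy_true, toy_pred, pos_label=1)  # tp / (tp + fn)

print(f"accuracy: {toy_accuracy}")
print(f"recall for label 1: {toy_recall}")
```

Since the minority class contributes little to accuracy, a model can miss many defaults and still score well on it, which is why recall for label `1` is the more telling metric here.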

Let's try different models (but now we will select hyperparameters that will maximize recall for the label `1`)

# KNN

In [None]:
recall_1 = make_scorer(recall_score,pos_label=1)

MIN = 1 #Min number of neighbors
MAX = 30 #Max number of neighbors
knn_estimator = KNeighborsClassifier()
knn_clf = GridSearchCV(knn_estimator,
                       {'n_neighbors': range(MIN,MAX+1)},
                       scoring=recall_1).fit(X_train, y_train)
print(f"Best estimator: {knn_clf.best_estimator_}")
print(classification_report(y_true=y_test, y_pred=knn_clf.predict(X_test)))
plot_confusion_matrix(knn_clf, X_test, y_test)

# Random Forest

In [None]:
estimator = RandomForestClassifier(random_state=13)
rf_clf = GridSearchCV(estimator,
                      param_grid={'n_estimators':[10,20,50,100], 'criterion': ['entropy','gini']},
                      scoring=recall_1).fit(X_train, y_train)

print(classification_report(y_true=y_test, y_pred=rf_clf.predict(X_test)))
plot_confusion_matrix(rf_clf, X_test, y_test)

In [None]:
rf_clf.best_estimator_

# Training: Conclusions

We see that Random Forest does the best job (by "best" I mean that its accuracy and macro F1 score are the highest of all the models we used).

