# Loan Prediction
# XGBoost / Hyperparameter tuning

`This notebook is a work in progress; I will add comments later.`

## Plan
1. Data analysis
2. Data preprocessing
3. Model testing
4. Hyperparameter tuning

Let us first start by importing the necessary packages:

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

I will use the ggplot theme for my plots.

In [None]:
plt.style.use('ggplot')

Import the data

In [None]:
path = '/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv'
pd.set_option('display.max_columns', None)
data = pd.read_csv(path)
data.head()

# Data analysis

Before diving into model testing, we have to get familiar with the data and understand what it represents.

In [None]:
data.describe(include='all')

In [None]:
data.info()

Several columns are categorical. Since we will test various models and not only tree-based ones, we should encode these variables. <br>
On the other hand, there are no missing values in the data.
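
As a quick check of the two observations above, here is a minimal sketch on a hypothetical toy frame (the values are made up, only the column names mirror the dataset). It lists the non-numeric columns that would need encoding and counts missing values per column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loan data
toy = pd.DataFrame({
    "Income": [1200000, 750000, 3100000],
    "Married/Single": ["single", "married", "single"],
    "STATE": ["Kerala", "Bihar", "Kerala"],
})

# Non-numeric columns are the ones that need encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column
missing_per_column = toy.isna().sum()

print(categorical_cols)
print(missing_per_column)
```

On the real data, the same two calls can be run on `data` directly.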

## Age

In [None]:
data.groupby('Risk_Flag').describe()['Age']

We can see that the minimum, maximum, and mean age are nearly the same for both values of the risk flag, so age alone does not appear to separate the two groups. <br>
Both groups (0 and 1) also show the same spread around the mean: a standard deviation of about 17.

In [None]:
data[data['Risk_Flag'] == 1]['Age'].value_counts(sort=True)[:20]

There is not a big difference between the counts of the five most common ages in the risky group; from the sixth position onward the gap grows. No particular age range dominates the list: the most common age is 22 and the second most common is 66.

## Marital status

In [None]:
data.groupby('Risk_Flag')['Married/Single'].value_counts()

Here we can see that single people account for 91% of the applicants flagged as risky. However, this alone does not tell us much, since most applicants are single in the first place.
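
To separate "share of the risky group" from "risk rate within each group", a small sketch on made-up numbers may help. The toy frame below is hypothetical and does not reproduce the dataset's figures; it only shows that a group can dominate the risky flags while having the lower risk rate.

```python
import pandas as pd

# Hypothetical toy data: 8 single applicants, 2 married applicants
toy = pd.DataFrame({
    "Married/Single": ["single"] * 8 + ["married"] * 2,
    "Risk_Flag":      [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Share of each marital status among the risky applicants
risky = toy.loc[toy["Risk_Flag"] == 1, "Married/Single"]
share_within_risky = risky.value_counts(normalize=True)

# Risk rate inside each marital status (mean of a 0/1 flag)
risk_rate = toy.groupby("Married/Single")["Risk_Flag"].mean()

print(share_within_risky)
print(risk_rate)
```

In this toy example single applicants make up two thirds of the risky group, yet their risk rate is lower than that of married applicants. The second number is the one to compare across groups.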

In [None]:
data['Married/Single'].hist()

## Income

In [None]:
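# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(data['Income'], bins=20, kde=True)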
plt.show()

In [None]:
data['Status'] = np.where(data['Income']>=data['Income'].mean(), 'Above Average', 'Under Average')
data['Status'].hist()

The applicants are split roughly evenly between under-average and above-average income.

In [None]:
data.groupby('Risk_Flag')['Status'].hist()

Income alone does not appear to affect the risk flag of a person.

In [None]:
data.groupby("Risk_Flag")['Income'].describe()

The mean income of the risky group (Risk_Flag = 1) is about 0.3e+06 lower than that of the non-risky group, while the other statistics are similar for both groups.

## State

In [None]:
data.groupby("Risk_Flag")['STATE'].value_counts(sort=True)

Uttar Pradesh ranks first in both groups, which is expected since it has the largest number of applicants.

In [None]:
sns.countplot(y='STATE', data=data)
plt.show()

West Bengal ranks second in the number of risky flags, even though it is only the fourth state by number of applicants.

In [None]:
data[data['STATE'] == 'West_Bengal'].describe()

In [None]:
data.describe()

The income and age of applicants from West Bengal do not differ much from those of applicants overall.

## Current Job

In [None]:
data.groupby("Risk_Flag")['CURRENT_JOB_YRS'].value_counts(sort=True)

In [None]:
sns.countplot(x='CURRENT_JOB_YRS', hue='Risk_Flag', data=data)
plt.show()

This plot doesn't tell much.

## Profession

In [None]:
data['Profession'].value_counts()

Most applicants have well-paid jobs.

In [None]:
data.groupby('Profession')['Income'].mean().sort_values(ascending=False)

# Data Encoding:

The data does not need much cleaning; we only need to encode the categorical columns so that our algorithms can use them. For this step I will use `LabelEncoder` from scikit-learn's `sklearn.preprocessing` module.

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data["Married/Single"] = label_encoder.fit_transform(data["Married/Single"])
data["House_Ownership"] = label_encoder.fit_transform(data["House_Ownership"])
data["Car_Ownership"] = label_encoder.fit_transform(data["Car_Ownership"])
data["Profession"] = label_encoder.fit_transform(data["Profession"])
data["CITY"] = label_encoder.fit_transform(data["CITY"])
data["STATE"] = label_encoder.fit_transform(data["STATE"])
data["Status"] = label_encoder.fit_transform(data["Status"])

In [None]:
data['STATE'].value_counts()

Tada!
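
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single column, next to a one-hot alternative. The toy column and its values are hypothetical, and the one-hot variant with `pd.get_dummies` is my own addition for comparison, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column
toy = pd.DataFrame({"House_Ownership": ["rented", "owned", "norent_noown", "rented"]})

encoder = LabelEncoder()
codes = encoder.fit_transform(toy["House_Ownership"])

# classes_ is sorted, and each code is an index into it
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)
print(list(codes))

# One-hot alternative: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(toy["House_Ownership"], prefix="House_Ownership")
print(one_hot.columns.tolist())
```

The integer codes imply an order between categories, which tree-based models tolerate well but a linear model such as logistic regression will read as a numeric scale. Also note that refitting one encoder on several columns overwrites `classes_`, so `inverse_transform` only works for the last column fitted.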

# Testing Classifiers

Moving to the next step, we will start by importing the classifiers that we will test:
* Logistic Regression
* XGBoost
* Random Forest 
* Gradient Boosting

We could test other classifiers, but running cross-validation on many classifiers takes a long time.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier

And of course evaluation metrics are important, alongside the splitting function.

In [None]:
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score, train_test_split

First, we have to drop the target column from the data.

In [None]:
pred = data['Risk_Flag']
data.drop(columns=['Risk_Flag'], inplace=True)

To avoid wasting too much time running one cell of code, first we will test logistic regression and gradient boosting, and then move to XGBoost and Random Forest.

In [None]:
KFold_Score = pd.DataFrame()
classifiers = ['LogisticRegression', 'GradientBoostingClassifier']
models = [
          LogisticRegression(max_iter = 1000),
          GradientBoostingClassifier(random_state=0)
         ]

In [None]:
# Score each model with 5-fold cross-validation (same folds for every model)
cv = KFold(n_splits=5, random_state=0, shuffle=True)
for name, model in zip(classifiers, models):
    KFold_Score[name] = cross_val_score(model, data, np.ravel(pred), scoring='accuracy', cv=cv)

Both models reach nearly the same mean cross-validation score. Moving on to the next test.

In [None]:
mean = pd.DataFrame(KFold_Score.mean(), index= classifiers)
KFold_Score = pd.concat([KFold_Score,mean.T])
KFold_Score.index=['Fold 1','Fold 2','Fold 3','Fold 4','Fold 5','Mean']
KFold_Score.T.sort_values(by=['Mean'], ascending = False)
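
Accuracy can look flattering when one class is much more frequent than the other, so it may be worth checking a second metric. The sketch below is only an illustration under that assumption: it uses synthetic data from `make_classification` (not the loan data), stratified folds, and ROC AUC alongside accuracy, and compares accuracy with the trivial "always predict the majority class" baseline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption: roughly 88% / 12% classes)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.88, 0.12], random_state=0)

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

acc = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
auc = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)

# Accuracy obtained by always predicting the most frequent class
baseline = max(np.mean(y == 0), np.mean(y == 1))

print("accuracy: {:.3f}".format(acc.mean()))
print("roc_auc:  {:.3f}".format(auc.mean()))
print("majority baseline accuracy: {:.3f}".format(baseline))
```

If the mean accuracy is close to the baseline, the accuracy score alone says little about how well the model separates the two classes.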

In [None]:
KFold_Score2 = pd.DataFrame()
classifiers = ['RandomForestClassifier', 'XGBoostClassifier']
models = [
          RandomForestClassifier(n_estimators=200, random_state=0),
          xgb.XGBClassifier(n_estimators=100),
         ]
# Score each model with 5-fold cross-validation (same folds for every model)
cv = KFold(n_splits=5, random_state=0, shuffle=True)
for name, model in zip(classifiers, models):
    KFold_Score2[name] = cross_val_score(model, data, np.ravel(pred), scoring='accuracy', cv=cv)

In [None]:
mean = pd.DataFrame(KFold_Score2.mean(), index= classifiers)
KFold_Score2 = pd.concat([KFold_Score2,mean.T])
KFold_Score2.index=['Fold 1','Fold 2','Fold 3','Fold 4','Fold 5','Mean']
KFold_Score2.T.sort_values(by=['Mean'], ascending = False)

After testing four classifiers, we can see that XGBoostClassifier returns the best mean cross-validation score.

# Hyperparameter tuning:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, pred, test_size=0.2, random_state=42)

XGBoost's native training API (`xgb.train`, `xgb.cv`) does not take NumPy arrays or pandas DataFrames directly; it expects the data wrapped in a `DMatrix`.

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

We will start by defining the initial parameters of the model:

In [None]:
params = {
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective':'binary:logistic',  # binary classification; 'reg:linear' is a deprecated regression objective
}

Since this is a classification problem, we will set the evaluation metric to log loss

In [None]:
params['eval_metric'] = 'logloss'
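
For reference, binary log loss is the mean of $-[y \log p + (1 - y) \log(1 - p)]$ over all samples, where $y$ is the true label and $p$ the predicted probability of class 1. Here is a minimal NumPy sketch of that formula; the helper name `log_loss` and the toy numbers are my own illustration, not part of XGBoost.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_sample))

labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.4]
loss = log_loss(labels, probs)
print(round(loss, 4))  # 0.2362

# A confident wrong prediction is penalised far more than a hesitant one
print(log_loss([1], [0.01]) > log_loss([1], [0.4]))  # True
```

Lower is better, and a model that outputs 0.5 for every sample scores about 0.693.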

Train the model with the parameters selected, and specify early stopping at ten rounds

In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=999,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

Let us check the initial log loss

In [None]:
print("Best Log Loss: {:.2f} with {} rounds".format(
                 model.best_score,
                 model.best_iteration+1))
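
The early stopping rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not XGBoost's actual code: the helper `early_stopping_round` and the list of losses are hypothetical.

```python
def early_stopping_round(losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds have passed
    without beating the best loss seen so far.
    """
    best_index = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_index]:
            best_index = i          # new best round
        elif i - best_index >= patience:
            return best_index, i    # no improvement for `patience` rounds
    return best_index, len(losses) - 1

# Hypothetical validation losses, one per boosting round (0-based)
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.44]

print(early_stopping_round(losses, patience=3))   # (2, 5)
print(early_stopping_round(losses, patience=10))  # (6, 6)
```

With a patience of 3 the loop stops at round 5 and never sees the better loss at round 6, which is the trade-off between training time and the risk of stopping too early.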

In [None]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=999,
    seed=42,
    nfold=5,
    metrics={'logloss'},
    early_stopping_rounds=10
)
cv_results

In [None]:
cv_results['test-logloss-mean'].min()

We will start by searching two parameters: max_depth and min_child_weight

In [None]:
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9,12)
    for min_child_weight in range(5,8)
]

In [None]:
min_log = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=42,
        nfold=5,
        metrics={'logloss'},
        early_stopping_rounds=10
    )
    # Lowest mean test log loss and the (0-based) round where it occurs
    mean_log = cv_results['test-logloss-mean'].min()
    boost_rounds = cv_results['test-logloss-mean'].argmin()
    print("\tlogloss {} for {} rounds".format(mean_log, boost_rounds))
    if mean_log < min_log:
        min_log = mean_log
        best_params = (max_depth,min_child_weight)
print("Best params: {}, {}, logloss: {}".format(best_params[0], best_params[1], min_log))
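
The same search pattern is repeated for each pair of parameters, so it can be factored into a small helper. The sketch below is a hypothetical refactoring: `grid_search` and `toy_score` are names I made up, and `toy_score` is a stand-in for the cross-validated log loss so that the example runs without training anything.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score < best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

# Same ranges as above: 3 x 3 = 9 combinations
grid = {"max_depth": range(9, 12), "min_child_weight": range(5, 8)}

def toy_score(candidate):
    # Hypothetical stand-in for the mean CV log loss, lowest at (10, 6)
    return (candidate["max_depth"] - 10) ** 2 + (candidate["min_child_weight"] - 6) ** 2

best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

In the notebook, `score_fn` would update a copy of `params`, call `xgb.cv`, and return `cv_results['test-logloss-mean'].min()`.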

Save these two parameters to the params dictionary

In [None]:
params['max_depth'] = 9
params['min_child_weight'] = 7

Moving on to the subsample and colsample_bytree parameters

In [None]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(7,11)]
]

In [None]:
min_log = float("Inf")
best_params = None
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=500,
        seed=42,
        nfold=5,
        metrics={'logloss'},
        early_stopping_rounds=10
    )
    mean_log = cv_results['test-logloss-mean'].min()
    boost_rounds = cv_results['test-logloss-mean'].argmin()
    print("\tlogloss {} for {} rounds".format(mean_log, boost_rounds))
    if mean_log < min_log:
        min_log = mean_log
        best_params = (subsample,colsample)
print("Best params: {}, {}, log: {}".format(best_params[0], best_params[1], min_log))

The optimal values are subsample = 1.0 and colsample_bytree = 0.7

In [None]:
params['subsample'] = 1.0
params['colsample_bytree'] = .7

Tuning the learning rate might take some time; if you clone the code, expect to wait up to 30 minutes for this cell to run

In [None]:
min_log = float("Inf")
best_params = None
for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))
    params['eta'] = eta
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=500,
        seed=42,
        nfold=5,
        metrics=['logloss'],
        early_stopping_rounds=10
    )
    mean_log = cv_results['test-logloss-mean'].min()
    boost_rounds = cv_results['test-logloss-mean'].argmin()
    print("\tLog Loss {} for {} rounds\n".format(mean_log, boost_rounds))
    if mean_log < min_log:
        min_log = mean_log
        best_params = eta
print("Best params: {}, Log Loss: {}".format(best_params, min_log))

The best learning rate is 0.1

In [None]:
params['eta'] = .1

Let us have a look at the parameters we have got so far

In [None]:
params

Train the model

In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=900,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

print("Best Log: {:.2f} in {} rounds".format(model.best_score, model.best_iteration+1))

In [None]:
num_boost_round = model.best_iteration + 1
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

In [None]:
predictions = best_model.predict(dtest)

# ROC_AUC score on the test set

As required, we will calculate the roc auc score of the model on the test set

In [None]:
metrics.roc_auc_score(y_test, predictions)
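
ROC AUC can be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. Here is a minimal NumPy sketch of that definition on toy scores; the helper `roc_auc` is my own illustration and is quadratic in memory, so it is meant for small examples only.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Only the ranking of the scores matters, not their scale.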

We get a quite interesting result: 0.94. Of course it could be improved, but we will leave it here for the moment.
Save the model:

In [None]:
best_model.save_model("my_model.model")


# REFERENCES

For more details on the hyperparameter tuning techniques used in this notebook, refer to the following blog post, which explains hyperparameter tuning in XGBoost in detail:

XGBoost hyperparameter tuning: https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f