# GIVE ME SOME CREDIT

This notebook was created nine years after the competition ended. The main aim of this project is to predict the probability that a customer will default in the future, given the customer's record in the dataset. We will use **predict_proba** to obtain each customer's delinquency probability.

The Highlights of the notebook are:

- **Exploratory Data Analysis**
    - **Outlier Analysis**
    - **Null Handling**
    - **Distribution Analysis**
    - **Skewness Reduction (using Box Cox Transformation)**
- **Feature Engineering**
- **LightGBM using RandomizedSearchCV (Classification)**
    - **Evaluation Metrics**
        - Mean Squared Error
        - Root Mean Squared Error
        - Mean Absolute Error
        - Mean Squared Logarithmic Error
        - Root Mean Square Logarithmic Error
        - Accuracy on Training Set
        - Accuracy on Test Set
        - F-Beta Score (Beta = 2)
        - F1 Score
        - Precision
        - Recall
        - Confusion Matrix
        - ROC Curve and AUC
    - **Probability Prediction on Validation Sets**
    - **Delinquency Prediction on Validation Sets**
    - **Feature Importances**
        - Summary Plot
        - SHAP Analysis
- **XGBoost using RandomizedSearchCV (Classification)**
    - **Evaluation Metrics**
        - Mean Squared Error
        - Root Mean Squared Error
        - Mean Absolute Error
        - Mean Squared Logarithmic Error
        - Root Mean Square Logarithmic Error
        - Accuracy on Training Set
        - Accuracy on Test Set
        - F-Beta Score (Beta = 2)
        - F1 Score
        - Precision
        - Recall
        - Confusion Matrix
        - ROC Curve and AUC
    - **Probability Prediction on Validation Sets**
    - **Delinquency Prediction on Validation Sets**
    - **Feature Importances**
        - Summary Plot
        - SHAP Analysis
        
        
Let's begin by importing the libraries required for this notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import ensemble
from sklearn import tree
from sklearn import linear_model
import os, datetime, sys, random, time
import seaborn as sns
import xgboost as xgs
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from mlxtend import classifier
plt.style.use('fivethirtyeight')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy import stats, special
import shap
import catboost as ctb

In [None]:
trainingData = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv')
testData = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv')

In [None]:
trainingData.head()

# Exploratory Data Analysis

Let's first identify the datatype and null count of each column.

In [None]:
trainingData.info()

Some of the observations are:

- There are 150,000 rows for 11 features in our data.
- All columns in the training data are numeric, i.e. **int** or **float**.
- Columns **MonthlyIncome** and **NumberOfDependents** have some null values.

In [None]:
trainingData.describe()

From here we can conclude that the column **Unnamed: 0** will have no significance in the predictive modelling because it represents the ID of the customer.

In [None]:
print(trainingData.shape)
print(testData.shape)

Performing similar analysis on the Test Data.

In [None]:
testData.head()

In [None]:
testData.info()

Some of the observations on the testing data:

- There are 101,503 rows for our 11 features.
- As expected, the datatypes match the training data: every column is numeric, i.e. **int** or **float**.
- As in the training data, **MonthlyIncome** and **NumberOfDependents** contain nulls.

In [None]:
testData.describe()

Let's create a copy of our two datasets so that the changes we make from here on do not affect the original data.

In [None]:
finalTrain = trainingData.copy()
finalTest = testData.copy()

Since we need to predict the probability of delinquency for the test data, we first remove the target column **SeriousDlqin2yrs** from it.

In [None]:
finalTest.drop('SeriousDlqin2yrs', axis=1, inplace = True)

Also, as mentioned above, let's take the ID column, i.e. **Unnamed: 0**, and store it in separate variables.

In [None]:
trainID = finalTrain['Unnamed: 0']
testID = finalTest['Unnamed: 0']

finalTrain.drop('Unnamed: 0', axis=1, inplace=True)
finalTest.drop('Unnamed: 0', axis=1, inplace=True)

### Imbalance Ratio

With 150,000 rows in total, there is a high chance that the dataset is imbalanced. Let's check the ratio of positive to negative delinquency labels.

In [None]:
fig, axes = plt.subplots(1,2,figsize=(12,6))
finalTrain['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=axes[0])
axes[0].set_title('SeriousDlqin2yrs')
#axes[0].set_ylabel('')
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='SeriousDlqin2yrs',data=finalTrain,ax=axes[1])
axes[1].set_title('SeriousDlqin2yrs')
plt.show()

The negative and positive delinquency classes make up 93.3% and 6.7% of the data respectively, a ratio of approximately 14:1. Our dataset is therefore highly imbalanced, and we cannot rely on accuracy alone to judge the model's success. Several other evaluation metrics will be considered, but more on that later.
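
To illustrate why accuracy is misleading at this class ratio, here is a minimal sketch using made-up labels that mirror the 93.3% / 6.7% split (not the real data). A "model" that always predicts no delinquency scores high accuracy while catching no defaulters:

```python
# Hypothetical labels mirroring the 93.3% / 6.7% class split (not the real data).
y_true = [0] * 933 + [1] * 67

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

This is why recall-oriented metrics such as the F-Beta score are reported alongside accuracy later in the notebook.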

Now let's move on to the Outlier Analysis section of our EDA. Here we will remove potential outliers which might affect our predictive modelling.

### Outlier Analysis

In [None]:
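fig = plt.figure(figsize=[30,30])
# One regression plot per column against the target
for i, col in enumerate(finalTrain.columns, start=1):
    axes = fig.add_subplot(7,2,i)
    # x and y are passed by keyword; recent seaborn versions require this
    sns.regplot(x=finalTrain[col],y=finalTrain.SeriousDlqin2yrs,ax=axes)
plt.show()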

From the above graphs we can observe:

- In the columns **NumberOfTime30-59DaysPastDueNotWorse** , **NumberOfTime60-89DaysPastDueNotWorse** and **NumberOfTimes90DaysLate**, some counts exceed 90, and these unusually high values are common across all 3 features.
- There are some unusually high values for **DebtRatio** and **RevolvingUtilizationOfUnsecuredLines**.

Step 1: Fixing the columns **NumberOfTime30-59DaysPastDueNotWorse** , **NumberOfTime60-89DaysPastDueNotWorse** and **NumberOfTimes90DaysLate**

In [None]:
# Rows where the '30-59 Days' counter holds a special value (>= 90)
isSpecial = finalTrain['NumberOfTime30-59DaysPastDueNotWorse'] >= 90

print("Unique values in '30-59 Days' values that are more than or equal to 90:",
      np.unique(finalTrain.loc[isSpecial, 'NumberOfTime30-59DaysPastDueNotWorse']))

print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",
      np.unique(finalTrain.loc[isSpecial, 'NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",
      np.unique(finalTrain.loc[isSpecial, 'NumberOfTimes90DaysLate']))

print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",
      np.unique(finalTrain.loc[~isSpecial, 'NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",
      np.unique(finalTrain.loc[~isSpecial, 'NumberOfTimes90DaysLate']))

# Share of delinquent customers among the rows with special values
print("Proportion of positive class with special 96/98 values:",
      round(finalTrain.loc[isSpecial, 'SeriousDlqin2yrs'].mean()*100, 2), '%')

The output shows that when the value in 'NumberOfTime30-59DaysPastDueNotWorse' is 90 or more, the other two columns that count payments past due X days hold the same values. We treat these as special labels, since the proportion of the positive class among them is abnormally high at 54.65%.

These 96 and 98 values can be viewed as accounting errors. Hence, we replace them with the maximum value below 96 in each column, i.e. 13, 11 and 17 for the 30-59 day, 60-89 day and 90 day columns respectively.

In [None]:
finalTrain.loc[finalTrain['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
finalTrain.loc[finalTrain['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
finalTrain.loc[finalTrain['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 17
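
The replacement values above are hard-coded from the printed output. As a rough sketch of the same idea on plain Python lists, the cap could instead be derived from the column itself. The function name `cap_special_values`, its `threshold` parameter and the toy numbers are illustrative assumptions, not part of the notebook:

```python
def cap_special_values(values, threshold=90):
    """Replace entries >= threshold with the largest entry below it.

    Illustrative helper; the name and the default threshold are assumptions.
    """
    cap = max(v for v in values if v < threshold)
    return [cap if v >= threshold else v for v in values]

# Toy column: 96 and 98 play the role of the special codes.
past_due = [0, 1, 13, 98, 2, 96, 0]
print(cap_special_values(past_due))
```

Deriving the cap this way avoids retyping the maxima if the data changes, at the cost of hiding the exact values from the reader.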

In [None]:
print("Unique values in 30-59Days", np.unique(finalTrain['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(finalTrain['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(finalTrain['NumberOfTimes90DaysLate']))

Performing a similar analysis on the Test Set.

In [None]:
# Rows of the test set where the '30-59 Days' counter holds a special value (>= 90)
isSpecialTest = finalTest['NumberOfTime30-59DaysPastDueNotWorse'] >= 90

print("Unique values in '30-59 Days' values that are more than or equal to 90:",
      np.unique(finalTest.loc[isSpecialTest, 'NumberOfTime30-59DaysPastDueNotWorse']))

print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",
      np.unique(finalTest.loc[isSpecialTest, 'NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",
      np.unique(finalTest.loc[isSpecialTest, 'NumberOfTimes90DaysLate']))

print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",
      np.unique(finalTest.loc[~isSpecialTest, 'NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",
      np.unique(finalTest.loc[~isSpecialTest, 'NumberOfTimes90DaysLate']))

Since these values exist in the Test Set as well, we replace them with the maximum values below 96 in each test column, i.e. 19, 9 and 18.

In [None]:
finalTest.loc[finalTest['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 19
finalTest.loc[finalTest['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 9
finalTest.loc[finalTest['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 18

print("Unique values in 30-59Days", np.unique(finalTest['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(finalTest['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(finalTest['NumberOfTimes90DaysLate']))

Step 2: Checking for **DebtRatio** and **RevolvingUtilizationOfUnsecuredLines**.

In [None]:
print('Debt Ratio: \n',finalTrain['DebtRatio'].describe())
print('\nRevolving Utilization of Unsecured Lines: \n',finalTrain['RevolvingUtilizationOfUnsecuredLines'].describe())

Here you can see a massive difference between the 75th Quantile and the Max Value. Let's explore this in a greater depth.

**Debt Ratio**

In [None]:
quantiles = [0.75,0.8,0.81,0.85,0.9,0.95,0.975,0.99]

for i in quantiles:
    print(i*100,'% quantile of debt ratio is: ',finalTrain.DebtRatio.quantile(i))

As you can see, there is a huge rise in the debt ratio beyond the 81% quantile, so our main aim is to check for potential outliers above it. However, since we have 150,000 rows, let's consider the 95% and 97.5% quantiles for our further analysis.

In [None]:
finalTrain[finalTrain['DebtRatio'] >= finalTrain['DebtRatio'].quantile(0.95)][['SeriousDlqin2yrs','MonthlyIncome']].describe()

Here we can observe:

- Out of 7501 customers whose debt ratio is at or above the 95% quantile (the debt ratio being the number of times their debt exceeds their income), only 379 have Monthly Income values.
- The Max for Monthly Income is 1 and the Min is 0, which makes us suspect data entry errors. Let's check whether the Serious Delinquency in 2 years and Monthly Income values are equal.

In [None]:
finalTrain[(finalTrain["DebtRatio"] > finalTrain["DebtRatio"].quantile(0.95)) & (finalTrain['SeriousDlqin2yrs'] == finalTrain['MonthlyIncome'])]

Our suspicions are confirmed: in 331 out of 379 rows, Monthly Income is equal to Serious Delinquency in 2 years. We will remove these 331 outliers from our analysis, as their current values aren't useful for our predictive modelling and would add to the bias and variance.

The reasoning is that in these 331 rows the debt ratio is massive compared to the customer's income, yet the customers weren't scrutinized for defaulting, which points to a data entry error.

In [None]:
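# Use ~ to negate the boolean mask; unary minus on a boolean Series raises a TypeError in recent pandas
finalTrain = finalTrain[~((finalTrain["DebtRatio"] > finalTrain["DebtRatio"].quantile(0.95)) & (finalTrain['SeriousDlqin2yrs'] == finalTrain['MonthlyIncome']))]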
finalTrain

**Revolving Utilization of Unsecured Lines**

This field represents the ratio of the amount a customer owes to the customer's credit limit. A customer with a ratio higher than 1 is considered a serious defaulter. A ratio of 10 still seems functionally possible, so let's see how many customers have a Revolving Utilization of Unsecured Lines greater than 10.

In [None]:
finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines']>10].describe()

Here, if you look at the difference between the 50% and 75% quantiles for Revolving Utilization of Unsecured Lines, you'll observe a massive increase from 13 to 1891.25. Since 13 seems like a plausible (though very high) ratio, let's check how many customers lie above 13.

In [None]:
finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines']>13].describe()

Despite owing thousands, these 238 people do not show any default which means this might be another error. Even if it is not an error, these numbers will add huge bias and variance to our final predictions. Therefore, the best decision is to remove these values.

In [None]:
finalTrain = finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines']<=13]
finalTrain

The outliers are now handled. Next, we will move on to handling the missing data, as we observed at the start of this notebook that MonthlyIncome and NumberOfDependents had null values.

### Null Handling

- Since MonthlyIncome is a numeric value, we will replace its nulls with the median.
- NumberOfDependents can be treated as a categorical variable. We assume that customers with NA for the number of dependents do not have any dependents, and therefore fill the nulls with zeros.
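
The two rules above can be sketched on plain Python lists as follows. This is a toy illustration under the assumption that `None` stands in for a null; the helper names `fit_median` and `impute` are hypothetical. The point it shows is that the median is learned from the training rows only and then reused for the test rows:

```python
from statistics import median

def fit_median(values):
    """Median of the observed (non-None) entries."""
    return median(v for v in values if v is not None)

def impute(values, fill):
    """Replace None entries with `fill`."""
    return [fill if v is None else v for v in values]

# Toy data; None stands in for a null.
train_income = [2500, None, 4000, 10000, None]
test_income = [None, 3000]
train_dependents = [0, None, 2]

income_median = fit_median(train_income)            # learned from training rows only
train_income = impute(train_income, income_median)
test_income = impute(test_income, income_median)    # the same value is reused for test rows
train_dependents = impute(train_dependents, 0)      # missing dependents are assumed to be zero

print(income_median, train_income, test_income, train_dependents)
```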

In [None]:
def MissingHandler(df):
    DataMissing = df.isnull().sum()*100/len(df)
    DataMissingByColumn = pd.DataFrame({'Percentage Nulls':DataMissing})
    DataMissingByColumn.sort_values(by='Percentage Nulls',ascending=False,inplace=True)
    return DataMissingByColumn[DataMissingByColumn['Percentage Nulls']>0]

MissingHandler(finalTrain)

Therefore, we have 19.76% and 2.59% Nulls for MonthlyIncome and NumberOfDependents respectively. 

In [None]:
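# Assign the result back; fillna(inplace=True) on a selected column may not update finalTrain in recent pandas
finalTrain['MonthlyIncome'] = finalTrain['MonthlyIncome'].fillna(finalTrain['MonthlyIncome'].median())
finalTrain['NumberOfDependents'] = finalTrain['NumberOfDependents'].fillna(0)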

Rechecking Nulls

In [None]:
MissingHandler(finalTrain)

Applying Similar Analysis for the Testing Data

In [None]:
MissingHandler(finalTest)

Similar to the training data, we have 19.71% and 2.56% nulls for MonthlyIncome and NumberOfDependents respectively.

In [None]:
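# The training median is reused so that train and test are imputed with the same value
finalTest['MonthlyIncome'] = finalTest['MonthlyIncome'].fillna(finalTrain['MonthlyIncome'].median())
finalTest['NumberOfDependents'] = finalTest['NumberOfDependents'].fillna(0)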

Rechecking Nulls

In [None]:
MissingHandler(finalTest)

In [None]:
print(finalTrain.shape)
print(finalTest.shape)

### Additional EDA

Let's study a few more things about the dataset to get more familiar with it.

**CORRELATION MATRIX**

In [None]:
fig = plt.figure(figsize = [15,10])
mask = np.zeros_like(finalTrain.corr(), dtype=bool)  # the np.bool alias was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True
sns.heatmap(finalTrain.corr(), cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9), mask = mask, annot=True, center = 0)
plt.title("Correlation Matrix (HeatMap)", fontsize = 15)

From the correlation heatmap above, we can see the most correlated values to **SeriousDlqin2yrs** are **NumberOfTime30-59DaysPastDueNotWorse** , **NumberOfTime60-89DaysPastDueNotWorse** and **NumberOfTimes90DaysLate**.

Now let's move to the Feature Engineering section of our notebook.

# Feature Engineering

Let's first combine the train and test sets so that the new features are added to both and further analyses cover both. We will split them again later, before training the models.

In [None]:
SeriousDlqIn2Yrs = finalTrain['SeriousDlqin2yrs']

finalTrain.drop('SeriousDlqin2yrs', axis = 1 , inplace = True)


In [None]:
finalData = pd.concat([finalTrain, finalTest])

finalData.shape

Adding some new features:

- **MonthlyIncomePerPerson**: Monthly Income divided by the household size, i.e. the number of dependents plus one (the customer)

- **MonthlyDebt**: Monthly Income multiplied by the Debt Ratio

- **isRetired**: 1 if the person's age is greater than 65 (Assumed Retirement Age), else 0. The code checks age only, not income.

- **RevolvingLines**: Difference between Number of Open Credit Lines and Loans and Number of Real Estate Lines and Loans

- **hasRevolvingLines**: 1 if RevolvingLines is greater than 0, else 0

- **hasMultipleRealEstates**: 1 if the Number of Real Estate Loans or Lines is 2 or more, else 0

- **incomeDivByThousand**: Monthly Income divided by 1000, i.e. income expressed in thousands. The motivation is that round incomes may signal fraud, or that the person is in a new job and hasn’t had a percent raise in pay yet; both groups signal higher risk. Note that the feature as computed only rescales the income and does not flag round values.

In [None]:
#New Features

finalData['MonthlyIncomePerPerson'] = finalData['MonthlyIncome']/(finalData['NumberOfDependents']+1)
# Assign fillna results back instead of using inplace=True on a selected column
finalData['MonthlyIncomePerPerson'] = finalData['MonthlyIncomePerPerson'].fillna(0)

finalData['MonthlyDebt'] = finalData['MonthlyIncome']*finalData['DebtRatio']
finalData['MonthlyDebt'] = finalData['MonthlyDebt'].fillna(finalData['DebtRatio'])
finalData['MonthlyDebt'] = np.where(finalData['MonthlyDebt']==0, finalData['DebtRatio'],finalData['MonthlyDebt'])

finalData['isRetired'] = np.where((finalData['age'] > 65), 1, 0)

finalData['RevolvingLines'] = finalData['NumberOfOpenCreditLinesAndLoans']-finalData['NumberRealEstateLoansOrLines']

finalData['hasRevolvingLines']=np.where((finalData['RevolvingLines']>0),1,0)

finalData['hasMultipleRealEstates'] = np.where((finalData['NumberRealEstateLoansOrLines']>=2),1,0)

finalData['incomeDivByThousand'] = finalData['MonthlyIncome']/1000
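
As a small worked example, the same arithmetic can be applied to a single made-up customer record. The `engineer` helper and the `isRoundIncome` flag are hypothetical and not part of the notebook; the flag only sketches one way a "round income" signal could be encoded if it were wanted:

```python
def engineer(row):
    """Derive a few extra fields from one customer record (a plain dict)."""
    income = row["MonthlyIncome"]
    dependents = row["NumberOfDependents"]
    return {
        # +1 counts the customer, so the division is safe with zero dependents.
        "MonthlyIncomePerPerson": income / (dependents + 1),
        "MonthlyDebt": income * row["DebtRatio"],
        "incomeDivByThousand": income / 1000,
        # Hypothetical extra flag (not in the notebook): exact multiple of 1,000.
        "isRoundIncome": int(income > 0 and income % 1000 == 0),
    }

# Made-up record for illustration only.
example = {"MonthlyIncome": 6000, "NumberOfDependents": 2, "DebtRatio": 0.25}
print(engineer(example))
```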

In [None]:
finalData.shape

In [None]:
MissingHandler(finalData)

We have now added new features to our dataset. Next, we will perform a skewness check on our data by analysing the distributions of individual columns and perform Box Cox Transformation to reduce the skewness.

# Skewness Check and Box Cox Transformation

Let's check the distribution of each column first.

In [None]:
columnList = list(finalData.columns)
columnList

fig = plt.figure(figsize=[20,20])
for col,i in zip(columnList,range(1,19)):
    axes = fig.add_subplot(6,3,i)
    sns.distplot(finalData[col],ax=axes, kde_kws={'bw':1.5}, color='purple')
plt.show()

From the above distribution plots, we can see that the majority of our columns are skewed in one direction or the other. Only Age comes close to a normal distribution. Let's check the skewness value of each column.

In [None]:
def SkewMeasure(df):
    nonObjectColList = df.dtypes[df.dtypes != 'object'].index
    skewM = df[nonObjectColList].apply(lambda x: stats.skew(x.dropna())).sort_values(ascending = False)
    skewM=pd.DataFrame({'skew':skewM})
    return skewM[abs(skewM)>0.5].dropna()

skewM = SkewMeasure(finalData)
skewM

The skewness is very high for the columns listed above. We apply the Box Cox Transformation with **λ = 0.15** to these columns in order to reduce it.

In [None]:
for i in skewM.index:
    finalData[i] = special.boxcox1p(finalData[i],0.15) #lambda = 0.15
    
SkewMeasure(finalData)
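
For reference, `special.boxcox1p(x, λ)` computes ((1 + x)^λ - 1) / λ, falling back to log(1 + x) when λ = 0. A minimal pure-Python sketch of that formula, with a few arbitrary sample inputs, shows how strongly large values are compressed at λ = 0.15:

```python
import math

def boxcox1p(x, lam):
    """Box-Cox transform of (1 + x): ((1 + x)**lam - 1) / lam, or log(1 + x) when lam == 0."""
    if lam == 0:
        return math.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

# Arbitrary sample inputs spanning several orders of magnitude.
for x in [0, 1, 100, 10_000]:
    print(x, round(boxcox1p(x, 0.15), 3))
```

Zero maps to zero, and the transform is monotonic, so the ordering of customers within a column is preserved.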

The skewness has reduced considerably now that the Box Cox Transformation is applied. Let's check the distribution plots for individual columns again:

In [None]:
fig = plt.figure(figsize=[20,20])
for col,i in zip(columnList,range(1,19)):
    axes = fig.add_subplot(6,3,i)
    sns.distplot(finalData[col],ax=axes, kde_kws={'bw':1.5}, color='purple')
plt.show()

As you can see, the distributions look much less skewed now.

# Model Training

## Train-Validation Split

We split the training data into training and validation sets in a 70-30 proportion.

In [None]:
trainDF = finalData[:len(finalTrain)]
testDF = finalData[len(finalTrain):]
print(trainDF.shape)
print(testDF.shape)

In [None]:
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(trainDF.to_numpy(),SeriousDlqIn2Yrs.to_numpy(),test_size=0.3,random_state=2020)
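
The split above is purely random. With only about 6.7% positives, one option worth considering is a stratified split, which keeps the class ratio the same in both parts; `train_test_split` accepts a `stratify` argument for this. The following pure-Python sketch, on toy labels and with a hypothetical helper name `stratified_split`, illustrates the idea rather than the notebook's actual procedure:

```python
import random

def stratified_split(labels, test_size=0.3, seed=2020):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within each class
        n_test = round(len(idx) * test_size)  # same fraction taken from every class
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 14:1 imbalance (not the real data).
labels = [0] * 140 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```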

## LightGBM

**Hyperparameter Tuning**

In [None]:
lgbAttributes = lgb.LGBMClassifier(objective='binary', n_jobs=-1, random_state=2020, importance_type='gain')

lgbParameters = {
    'max_depth' : [2,3,4,5],
    'learning_rate': [0.05, 0.1,0.125,0.15],
    'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
    'n_estimators' : [400,500,600,700,800,900],
    'min_split_gain' : [0.15,0.20,0.25,0.3,0.35], #equivalent to gamma in XGBoost
    'subsample': [0.6,0.7,0.8,0.9,1],
    'min_child_weight': [6,7,8,9,10],
    'scale_pos_weight': [10,15,20],
    'min_data_in_leaf' : [100,200,300,400,500,600,700,800,900],
    'num_leaves' : [20,30,40,50,60,70,80,90,100]
}

lgbModel = model_selection.RandomizedSearchCV(lgbAttributes, param_distributions = lgbParameters, cv = 5, random_state=2020)

lgbModel.fit(xTrain,yTrain.flatten(),feature_name=trainDF.columns.to_list())
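
To put the search above in perspective: `RandomizedSearchCV` evaluates only `n_iter` configurations (10 by default) and, unless `scoring` is set, ranks them by the estimator's default score, which is accuracy for classifiers. Given the class imbalance, passing something like `scoring='roc_auc'` may be worth trying. The sketch below counts the full grid and draws ten random configurations with the standard library; it samples with replacement for simplicity, so it only approximates what scikit-learn does internally:

```python
import math
import random

# The LightGBM search space from the cell above.
lgbParameters = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.05, 0.1, 0.125, 0.15],
    'colsample_bytree': [0.2, 0.4, 0.6, 0.8, 1],
    'n_estimators': [400, 500, 600, 700, 800, 900],
    'min_split_gain': [0.15, 0.20, 0.25, 0.3, 0.35],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [6, 7, 8, 9, 10],
    'scale_pos_weight': [10, 15, 20],
    'min_data_in_leaf': list(range(100, 1000, 100)),
    'num_leaves': list(range(20, 101, 10)),
}

# Number of distinct configurations in the full grid.
n_combinations = math.prod(len(v) for v in lgbParameters.values())
print("Grid size:", n_combinations)

# Draw ten random configurations (10 mirrors the default n_iter).
rng = random.Random(2020)
samples = [{name: rng.choice(values) for name, values in lgbParameters.items()}
           for _ in range(10)]
print(samples[0])
```

Ten draws cover a tiny fraction of the grid, so the "best" estimator is best among the sampled candidates only.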

In [None]:
bestEstimatorLGB = lgbModel.best_estimator_
bestEstimatorLGB

Saving the best estimator from RandomizedSearchCV

In [None]:
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.4, importance_type='gain', max_depth=5,
               min_child_weight=6, min_data_in_leaf=600, min_split_gain=0.25,
               n_estimators=900, num_leaves=50, objective='binary',
               random_state=2020, scale_pos_weight=10, subsample=0.9).fit(xTrain,yTrain.flatten(),feature_name=trainDF.columns.to_list())

In [None]:
yPredLGB = bestEstimatorLGB.predict_proba(xTest)
yPredLGB = yPredLGB[:,1]

In [None]:
yTestPredLGB = bestEstimatorLGB.predict(xTest)
print(metrics.classification_report(yTest,yTestPredLGB))

In [None]:
metrics.confusion_matrix(yTest,yTestPredLGB)

In [None]:
LGBMMetrics = pd.DataFrame({'Model': 'LightGBM', 
                            'MSE': round(metrics.mean_squared_error(yTest, yTestPredLGB)*100,2),
                            'RMSE' : round(np.sqrt(metrics.mean_squared_error(yTest, yTestPredLGB))*100,2),  # take the root first, then scale to percent
                            'MAE' : round(metrics.mean_absolute_error(yTest, yTestPredLGB)*100,2),
                            'MSLE' : round(metrics.mean_squared_log_error(yTest, yTestPredLGB)*100,2), 
                            'RMSLE' : round(np.sqrt(metrics.mean_squared_log_error(yTest, yTestPredLGB))*100,2),  # take the root first, then scale to percent
                            'Accuracy Train' : round(bestEstimatorLGB.score(xTrain, yTrain) * 100,2),
                            'Accuracy Test' : round(bestEstimatorLGB.score(xTest, yTest) * 100,2),
                            'F-Beta Score (β=2)' : round(metrics.fbeta_score(yTest, yTestPredLGB, beta=2)*100,2)},index=[1])

LGBMMetrics
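
A note on reading this table: because the metrics are computed on hard 0/1 predictions, MSE and MAE both reduce to the misclassification rate (1 - accuracy), so the F-Beta score is the most informative entry for the imbalanced target. The sketch below works through this on toy confusion-matrix counts (not the notebook's results); the `fbeta` helper is an illustrative re-implementation of the standard formula:

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy confusion-matrix counts (not the notebook's results).
tp, fp, fn, tn = 60, 140, 40, 760

# Rebuild label and prediction vectors that produce exactly these counts.
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
accuracy = (tp + tn) / len(y_true)

print("F1 :", round(fbeta(tp, fp, fn, beta=1), 4))
print("F2 :", round(fbeta(tp, fp, fn, beta=2), 4))
print("MSE == MAE:", mse == mae, "| MSE ~ 1 - accuracy:", abs(mse - (1 - accuracy)) < 1e-12)
```

With precision 0.3 and recall 0.6 in this toy case, F2 comes out higher than F1 because it rewards the higher recall.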

**ROC AUC**

In [None]:
fpr,tpr,_ = metrics.roc_curve(yTest,yPredLGB)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12,6))
plt.title('ROC Curve')
sns.lineplot(x=fpr, y=tpr, label = 'AUC for LightGBM Model = %0.2f' % rocAuc)  # x and y by keyword for recent seaborn
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
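
One way to interpret the AUC reported in the plot: it is the probability that a randomly chosen delinquent customer receives a higher predicted probability than a randomly chosen non-delinquent one. A minimal sketch of that pairwise definition, on toy labels and scores rather than the notebook's predictions, with a hypothetical helper name `auc_by_pairs`:

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (not the notebook's results).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print(auc_by_pairs(y_true, scores))
```

Because it depends only on the ranking of the probabilities, AUC is unaffected by the 0.5 threshold used by `predict`, which makes it a useful complement to the thresholded metrics above.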

**FEATURE IMPORTANCE**

In [None]:
lgb.plot_importance(bestEstimatorLGB, importance_type='gain')

**FEATURE IMPORTANCE USING SHAP**

In [None]:
X = pd.DataFrame(xTrain, columns=trainDF.columns.to_list())

explainer = shap.TreeExplainer(bestEstimatorLGB)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values[1], X)

## XGBoost

**Hyperparameter Tuning**

In [None]:
xgbAttribute = xgs.XGBClassifier(tree_method='gpu_hist',n_jobs=-1, gpu_id=0)

xgbParameters = {
    'max_depth' : [2,3,4,5,6,7,8],
    'learning_rate':[0.05,0.1,0.125,0.15],
    'colsample_bytree' : [0.2,0.4,0.6,0.8,1],
    'n_estimators' : [400,500,600,700,800,900],
    'gamma':[0.15,0.20,0.25,0.3,0.35],
    'subsample': [0.6,0.7,0.8,0.9,1],
    'min_child_weight': [6,7,8,9,10],
    'scale_pos_weight': [10,15,20]
    
}

xgbModel = model_selection.RandomizedSearchCV(xgbAttribute, param_distributions = xgbParameters, cv = 5, random_state=2020)

xgbModel.fit(xTrain,yTrain.flatten())

In [None]:
bestEstimatorXGB = xgbModel.best_estimator_
bestEstimatorXGB

Setting the best estimator from RandomizedSearchCV

In [None]:
bestEstimatorXGB = xgs.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, gamma=0.25, gpu_id=0,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.125, max_delta_step=0, max_depth=5,
              min_child_weight=9,
              monotone_constraints='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)',
              n_estimators=800, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=10, subsample=1,
              tree_method='gpu_hist', validate_parameters=1, verbosity=None).fit(xTrain,yTrain.flatten())

In [None]:
yPredXGB = bestEstimatorXGB.predict_proba(xTest)
yPredXGB = yPredXGB[:,1]

yTestPredXGB = bestEstimatorXGB.predict(xTest)
print(metrics.classification_report(yTest,yTestPredXGB))

In [None]:
metrics.confusion_matrix(yTest,yTestPredXGB)

In [None]:
XGBMetrics = pd.DataFrame({'Model': 'XGBoost', 
                            'MSE': round(metrics.mean_squared_error(yTest, yTestPredXGB)*100,2),
                            'RMSE' : round(np.sqrt(metrics.mean_squared_error(yTest, yTestPredXGB))*100,2),  # take the root first, then scale to percent
                            'MAE' : round(metrics.mean_absolute_error(yTest, yTestPredXGB)*100,2),
                            'MSLE' : round(metrics.mean_squared_log_error(yTest, yTestPredXGB)*100,2), 
                            'RMSLE' : round(np.sqrt(metrics.mean_squared_log_error(yTest, yTestPredXGB))*100,2),  # take the root first, then scale to percent
                            'Accuracy Train' : round(bestEstimatorXGB.score(xTrain, yTrain) * 100,2),  # score the XGBoost model, not the LightGBM one
                            'Accuracy Test' : round(bestEstimatorXGB.score(xTest, yTest) * 100,2),
                            'F-Beta Score (β=2)' : round(metrics.fbeta_score(yTest, yTestPredXGB, beta=2)*100,2)},index=[2])

XGBMetrics

**ROC AUC**

In [None]:
fpr,tpr,_ = metrics.roc_curve(yTest,yPredXGB)
rocAuc = metrics.auc(fpr, tpr)
plt.figure(figsize=(12,6))
plt.title('ROC Curve')
sns.lineplot(x=fpr, y=tpr, label = 'AUC for XGBoost Model = %0.2f' % rocAuc)  # x and y by keyword for recent seaborn
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**FEATURE IMPORTANCE**

In [None]:
bestEstimatorXGB.get_booster().feature_names = trainDF.columns.to_list()
xgs.plot_importance(bestEstimatorXGB, importance_type='gain')

**FEATURE IMPORTANCE USING SHAP**

In [None]:
# resolve a conflict/bug with latest version of XGBoost and SHAP
mybooster = bestEstimatorXGB.get_booster()
model_bytearray = mybooster.save_raw()[4:]
def myfun(self=None):
    return model_bytearray

mybooster.save_raw = myfun


X = pd.DataFrame(xTrain, columns=trainDF.columns.to_list())

explainer = shap.TreeExplainer(mybooster)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)

In [None]:
frames = [LGBMMetrics, XGBMetrics]
TrainingResult = pd.concat(frames)
TrainingResult.T

### LGBM Submission

Since our LGBM model performs better, we will submit its predictions. (Late Submission)

In [None]:
lgbProbs = bestEstimatorLGB.predict_proba(testDF)
lgbDF = pd.DataFrame({'ID': testID, 'Probability': lgbProbs[:,1]})
lgbDF.to_csv('submission.csv', index=False)

This gives us the delinquency probabilities for the test set.

In [None]:
lgbDF