# L&T Loan Prediction

Financial institutions incur significant losses when vehicle loans default. This has led to tighter vehicle loan underwriting and higher rejection rates, and these institutions have called for a better credit risk scoring model. This warrants a study of the determinants of vehicle loan default. A financial institution has hired you to accurately predict the probability of a loanee/borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Instalment) on the due date. The following information about the loan and the loanee is provided in the datasets:

- Loanee information (demographic data such as age, identity proof etc.)
- Loan information (disbursal details, loan-to-value ratio etc.)
- Bureau data & history (bureau score, number of active accounts, the status of other loans, credit history etc.)

Doing so will ensure that clients capable of repayment are not rejected, and it will identify important determinants that can be used to minimise default rates.

## Downloading the Dataset

> Instructions for downloading the dataset
>
> - Find dataset on this page: https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction
> - The data is in CSV format and has 41 columns and 233154 rows
> - Download the dataset using the [`opendatasets` Python library](https://github.com/JovianML/opendatasets#opendatasets)

In [None]:
!pip install opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [None]:
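import opendatasets as od
dataset_url = 'https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction'
# download and extract the dataset (Kaggle credentials may be requested)
od.download(dataset_url)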

The dataset has been downloaded and extracted.

In [None]:
train_file = '/work/lt-vehicle-loan-default-prediction/train.csv'
test_file = '/work/lt-vehicle-loan-default-prediction/test.csv'

## Data Preparation and Cleaning

> - Learn and understand the data
> - Check for any missing values
> - Load the dataset into a data frame using Pandas
> - Explore the number of rows & columns, ranges of values etc.
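
As a small illustration of the missing-value check listed above, the sketch below reports the count and share of missing cells per column. The helper name `missing_summary` and the toy frame are assumptions made for this example; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing cells for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for the real training frame (values are made up)
toy = pd.DataFrame({
    "ltv": [80.5, 75.0, 60.2, 89.9],
    "Employment.Type": ["Salaried", np.nan, "Self employed", np.nan],
})
report = missing_summary(toy)
print(report)
```

The same helper could be called on `train_df` once it is loaded.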



In [None]:
import numpy as np
import pandas as pd

In [None]:
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

In [None]:
train_df.head()

In [None]:
train_df.shape

## Exploratory Analysis and Visualization



> 
> - Compute the mean, sum, range and other interesting statistics for numeric columns
> - Explore distributions of numeric columns using histograms etc.
> - Explore relationship between columns using scatter plots, bar charts etc.
> - Make a note of interesting insights from the exploratory analysis
> - Handle missing, incorrect and invalid data
> - Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)
> - All the numerical variables
> - Distribution of the numerical variables
> - Categorical variables
> - Cardinality of categorical variables
> - Outliers
> - Relationship between the independent features and the dependent feature (`loan_default`)
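
The list above mentions outliers. One common way to flag them is the interquartile-range (IQR) rule, sketched here on made-up amounts. The function name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions for illustration, not choices taken from the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up amounts: five similar values and one extreme value
toy_amounts = pd.Series([50_000, 52_000, 55_000, 58_000, 60_000, 400_000])
mask = iqr_outlier_mask(toy_amounts)
print(toy_amounts[mask])
```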

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
# check the columns available in the dataset

train_df.columns

In [None]:
# check a high-level view of the available values and their datatypes
train_df.info()

In [None]:
# check the statistical features of numerical values
train_df.describe()

Check if the data has any missing cells

In [None]:
train_df.isnull().sum()

In [None]:
# missing data heatmaps - our goal is to turn this heatmap to completely dark (i.e without any white spots)
sns.heatmap(train_df.isnull())
plt.show()

In [None]:
# only Employment type column has some missing values. 
# we can either remove the entire column from our dataframe or fill the missing cells with something else

In [None]:
# let's find out the relationship between the column with missing values (Employment.Type) and the target (loan_default)
train_df.groupby('Employment.Type')['loan_default'].sum().plot.bar()
plt.show()

### Dealing with Numerical features

In [None]:
numerical_features = train_df.select_dtypes(include=['int64', 'float64']).columns
print(numerical_features)
print(f"Total length: {len(numerical_features)}")

In [None]:
train_df[numerical_features].head()

#### Numerical variables are of 2 types
1. Discrete Variables
2. Continuous Variables

In [None]:
# Discrete Variables

discrete_variables = [feature for feature in numerical_features if len(train_df[feature].unique())<25]
print(discrete_variables)
print("Discrete Variables Count: {}".format(len(discrete_variables)))

In [None]:
train_df[discrete_variables].head()

In [None]:
for feature in discrete_variables:
    data = train_df.copy()
    sns.countplot(data=data, x=feature, hue='loan_default')
    plt.xlabel(feature)
    plt.ylabel('Loan count')
    plt.title(feature)
    plt.show()

In [None]:
train_df['PRI.OVERDUE.ACCTS'].value_counts()

In [None]:
# drop unnecessary columns from the dataframe
to_drop = ['UniqueID','manufacturer_id', 'State_ID', 'Passport_flag','MobileNo_Avl_Flag', 'Driving_flag']
train_df = train_df.drop(to_drop, axis=1)

In [None]:
train_df.head()

### Continuous Variables

In [None]:
## continuous variables
continuous_variables = [feature for feature in numerical_features if feature not in discrete_variables+['UniqueID']]
print(continuous_variables)
print("Continuous variables Count: {}".format(len(continuous_variables)))

In [None]:
for feature in continuous_variables[:10]:
    data = train_df.copy()
    #data[feature].hist(bins=25)
    sns.histplot(data=data, x=feature, hue='loan_default')
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

In [None]:
train_df['PERFORM_CNS.SCORE'].plot(kind='hist')
plt.show()

In [None]:
train_df['PRI.ACTIVE.ACCTS'].plot(kind='hist')
plt.show()

In [None]:
#train_df['PRI.ACTIVE.ACCTS'].value_counts()
sns.countplot(data=train_df, x='PRI.ACTIVE.ACCTS', hue='loan_default')
plt.show()

In [None]:
# drop unnecessary features
to_drop = ['branch_id', 'supplier_id', 'Current_pincode_ID', 'Employee_code_ID']
train_df = train_df.drop(to_drop, axis=1)

In [None]:
# check all the available columns now
train_df.columns

Merging Primary accounts and Secondary accounts into One

In [None]:
train_df['no_of_accts'] = train_df['PRI.NO.OF.ACCTS'] + train_df['SEC.NO.OF.ACCTS']
train_df['active_accts'] = train_df['PRI.ACTIVE.ACCTS'] + train_df['SEC.ACTIVE.ACCTS']
train_df['overdue_accts'] = train_df['PRI.OVERDUE.ACCTS'] + train_df['SEC.OVERDUE.ACCTS']
train_df['outstanding_amount'] = train_df['PRI.CURRENT.BALANCE'] + train_df['SEC.CURRENT.BALANCE']
train_df['sanctioned_amount'] = train_df['PRI.SANCTIONED.AMOUNT'] + train_df['SEC.SANCTIONED.AMOUNT']
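# named bureau_disbursed_amount so it does not overwrite a loan-level disbursed_amount column, if one exists
train_df['bureau_disbursed_amount'] = train_df['PRI.DISBURSED.AMOUNT'] + train_df['SEC.DISBURSED.AMOUNT']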
train_df['install_amt'] = train_df['PRIMARY.INSTAL.AMT'] + train_df['SEC.INSTAL.AMT']

In [None]:
# dropping merged columns
train_df.drop(['PRI.NO.OF.ACCTS',
       'PRI.ACTIVE.ACCTS', 'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE',
       'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT', 'SEC.NO.OF.ACCTS',
       'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE',
       'SEC.SANCTIONED.AMOUNT', 'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT',
       'SEC.INSTAL.AMT'], axis=1, inplace=True)

In [None]:
train_df.head()

### Categorical Features

In [None]:
categorical_features = train_df.select_dtypes(include=['object']).columns
print(categorical_features)
print("Categorical features Count: {}".format(len(categorical_features)))

In [None]:
train_df[categorical_features].head()

Feature engineering can be applied to these categorical features in several ways:
- Convert Date.of.Birth and DisbursalDate to datetime objects, and average account age and credit history length to a number of months
- Add an age column computed from date of birth and disbursal date, then remove both date columns
- Fill missing values of the Employment.Type column with either salaried or self-employed
- Finally, label encode or one-hot encode the remaining features
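
The first two steps can be sketched on toy values. The example assumes duration strings shaped like `"1yrs 11mon"` and dates in `dd-mm-yy` form. The helper name `duration_to_months` and the sample values are made up for illustration. Note that two-digit years are ambiguous, so a birth year such as `65` is parsed as 2065 and has to be shifted back a century.

```python
import re
import pandas as pd

def duration_to_months(text: str) -> int:
    """Convert a string such as '1yrs 11mon' to a month count."""
    years, months = (int(n) for n in re.findall(r"\d+", text)[:2])
    return years * 12 + months

print(duration_to_months("1yrs 11mon"))  # 23

# %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
dob = pd.to_datetime(pd.Series(["01-01-84", "15-06-65"]), format="%d-%m-%y")
disbursal = pd.to_datetime(pd.Series(["03-08-18", "26-09-18"]), format="%d-%m-%y")

# A birth date after the disbursal date must belong to the previous century
dob_fixed = dob.where(dob <= disbursal, dob - pd.DateOffset(years=100))

# Approximate age in years at disbursal
age_years = (disbursal - dob_fixed).dt.days / 365.25
print(dob_fixed.dt.year.tolist(), age_years.round(1).tolist())
```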

In [None]:
# convert DisbursalDate to a datetime object
train_df['DisbursalDate'] = pd.to_datetime(train_df['DisbursalDate'], format='%d-%m-%y')
train_df['DisbursalDate'].head()

In [None]:
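# convert Date.of.Birth to datetime; %y maps 00-68 to 20xx, so shift birth dates that fall after the disbursal date back a century
dob = pd.to_datetime(train_df['Date.of.Birth'], format='%d-%m-%y')
train_df['Date.of.Birth'] = dob.where(dob <= train_df['DisbursalDate'], dob - pd.DateOffset(years=100))
train_df['Date.of.Birth'].head()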

In [None]:
# converting average account length and credit history length to months

train_df['AVERAGE.ACCT.AGE'].value_counts()

In [None]:
import re

def to_months(s):
    """Convert a duration string holding a year count and a month count to total months."""
    nos = re.findall(r'(\d+)', string=s)
    yr = int(nos[0])
    mo = int(nos[1])
    return (yr * 12) + mo

In [None]:
train_df['AVERAGE.ACCT.AGE'] = train_df['AVERAGE.ACCT.AGE'].apply(to_months)
train_df['CREDIT.HISTORY.LENGTH'] = train_df['CREDIT.HISTORY.LENGTH'].apply(to_months)

In [None]:
# why could the data be missing in Employment.Type?
# can we remove the entire column?
# if not, how can we impute the values in missing cells? (find the relevant columns)

# imputing Employment.Type as 'Self employed',
# assuming that a salaried applicant would have stated it
train_df['Employment.Type'] = train_df['Employment.Type'].fillna('Self employed')

In [None]:
# check the categorical features now
train_df[categorical_features].head()

In [None]:
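# creating new column "Age" (in years) at the time of taking the loan, from disbursal date and date of birth
# (np.timedelta64(1, 'Y') is not supported by recent pandas versions, so divide the day count by 365.25)
train_df['Age'] = (train_df['DisbursalDate'] - train_df['Date.of.Birth']).dt.days / 365.25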

In [None]:
# drop dateofbirth and disbursaldate
train_df = train_df.drop(['Date.of.Birth', 'DisbursalDate'], axis=1)

In [None]:
# remaining categorical features
train_df.select_dtypes(include=['object']).columns

In [None]:
# Label encode
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

In [None]:
train_df['is_salaried'] = pd.get_dummies(data=train_df['Employment.Type'])['Salaried']
train_df = train_df.drop(['Employment.Type'], axis=1)

In [None]:
# dealing with PERFORM_CNS.SCORE.DESCRIPTION
train_df['PERFORM_CNS.SCORE.DESCRIPTION'].value_counts()

In [None]:
risk = []
for i in train_df['PERFORM_CNS.SCORE.DESCRIPTION']:
    if('Very Low' in i):
        risk.append('Very Low Risk')
    elif('Low' in i):
        risk.append('Low Risk')
    elif('Medium' in i):
        risk.append('Medium Risk')
    elif('Very High' in i):
        risk.append('Very High Risk')
    elif('High' in i):
        risk.append('High Risk')
    else:
        risk.append('Not Scored')

In [None]:
train_df["risk"] = risk
train_df.head()

In [None]:
risk_map = {'Not Scored':-1, 
            'Very Low Risk':4,
            'Low Risk':3,
            'Medium Risk':2, 
            'High Risk':1,
            'Very High Risk':0}

train_df['risk'] = train_df['risk'].map(risk_map)

In [None]:
train_df.drop('PERFORM_CNS.SCORE.DESCRIPTION',axis=1,inplace=True)

In [None]:
numerical_ft = train_df.select_dtypes(include=['int64', 'float64']).columns
numerical_ft = list(numerical_ft)
numerical_ft.remove('loan_default')
print(numerical_ft)

### Feature Importance

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()

model.fit(train_df[numerical_ft], train_df['loan_default'])

In [None]:
feat_imp = pd.DataFrame(model.feature_importances_, index=numerical_ft, columns=['Feature_Importances']).sort_values(by='Feature_Importances',ascending=False)

In [None]:
feat_imp

In [None]:
plt.figure(figsize=(8,8))
ranked_features = pd.Series(model.feature_importances_,index=numerical_ft)
ranked_features.nlargest(18).plot(kind='barh')
plt.show()

## Modeling

In [None]:
y = train_df['loan_default']
X = train_df.drop('loan_default',axis=1)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xscaled = sc.fit_transform(X)
Xscaled = pd.DataFrame(Xscaled,columns=X.columns)

In [None]:
!pip install statsmodels==0.12.2

In [None]:
import statsmodels.api as sm
Xc = sm.add_constant(Xscaled)
model = sm.Logit(y, Xc).fit()
model.summary()

In [None]:
from sklearn.metrics import confusion_matrix,roc_auc_score,log_loss,roc_curve,accuracy_score

In [None]:
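# in-sample predictions: the model is scored on the same rows it was fit on
y_pred = model.predict(Xc)
prob = pd.DataFrame({'probability': y_pred, 'loan_default': y})
# classify as default when the predicted probability is at least 0.5
prob['y_est'] = (prob['probability'] >= 0.5).astype(int)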
prob.head()

In [None]:
confusion_matrix(prob['loan_default'], prob['y_est'])

In [None]:
accuracy_score(prob['loan_default'], prob['y_est'])

In [None]:
roc_auc_score(prob['loan_default'],prob['probability'])
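
The AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below checks that reading against `roc_auc_score`. The labels and scores are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, only to illustrate the metric
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Because the metric depends only on the ranking of scores, it is unaffected by the 0.5 threshold used for the confusion matrix.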

In [None]:
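from sklearn.model_selection import train_test_split
# split first so the scaler is fit on the training rows only (avoids leaking test-set statistics)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=300)

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train_raw)
X_test = sc.transform(X_test_raw)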

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')

In [None]:
lr.fit(X_train,y_train)

In [None]:
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
y_train_prob = lr.predict_proba(X_train)
y_test_prob = lr.predict_proba(X_test)

In [None]:
print('train AUC score:',roc_auc_score(y_train,y_train_prob[:,1]))
print('test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
accuracy_score(y_test,y_test_pred)

In [None]:
# FNs are too high and TPs are too low.
# Balancing the training data with SMOTE may help
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))
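
To make the remark about false negatives concrete, the sketch below derives accuracy, recall and precision from a confusion matrix. The counts are hypothetical and are not the notebook's results; they only show how accuracy can look acceptable while recall on the default class stays low.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn layout: rows = actual, columns = predicted
cm = np.array([[5400, 100],    # actual 0: TN, FP
               [1400, 100]])   # actual 1: FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)        # share of actual defaulters that were caught
precision = tp / (tp + fp)     # share of predicted defaulters that were correct

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```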

Using SMOTE to handle data imbalance
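
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below reproduces only that interpolation step with NumPy. It is a simplified illustration, not the `imblearn` implementation, and the name `smote_like_samples` and the toy points are assumptions.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(X_minority, n_new=5)
print(synthetic.shape)
```

Resampling should be applied to the training split only, so that the test set keeps the original class balance.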

In [None]:
train_df['loan_default'].value_counts().plot(kind='bar')
plt.show()

In [None]:
!pip install imbalanced-learn --quiet

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_sm, y_train_sm = smote.fit_resample(X_train,y_train)
print(X_train_sm.shape, y_train_sm.shape)

In [None]:
lr_smote = LogisticRegression(solver='liblinear')
lr_smote.fit(X_train_sm,y_train_sm)

In [None]:
y_train_pred = lr_smote.predict(X_train_sm)
y_test_pred = lr_smote.predict(X_test)
y_train_prob = lr_smote.predict_proba(X_train_sm)
y_test_prob = lr_smote.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train_sm,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))

### Random Forest Classifier

Random Forest without SMOTE

In [None]:
from sklearn.ensemble import RandomForestClassifier
rsearch1_best_params = {'max_depth': 13,
 'min_samples_leaf': 10,
 'min_samples_split': 11,
 'n_estimators': 374}

In [None]:
rfc1 = RandomForestClassifier(**rsearch1_best_params, random_state=300)
rfc1.fit(X_train, y_train)

In [None]:
y_train_pred = rfc1.predict(X_train)
y_test_pred = rfc1.predict(X_test)
y_train_prob = rfc1.predict_proba(X_train)
y_test_prob = rfc1.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))

Random Forest with SMOTE

In [None]:
rsearch_best_params = {'max_depth': 17,
 'min_samples_leaf': 2,
 'min_samples_split': 4,
 'n_estimators': 317}

In [None]:
rfc = RandomForestClassifier(**rsearch_best_params, random_state=300)
rfc.fit(X_train_sm, y_train_sm)

In [None]:
y_train_pred = rfc.predict(X_train_sm)
y_test_pred = rfc.predict(X_test)
y_train_prob = rfc.predict_proba(X_train_sm)
y_test_prob = rfc.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train_sm,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))

### LightGBM with SMOTE

In [None]:
rsearch_best_params = {'learning_rate': 0.3,
 'max_depth': 12,
 'n_estimators': 540,
 'num_leaves': 31}

In [None]:
!pip install lightgbm

In [None]:
import lightgbm as lgb

In [None]:
lgbmc = lgb.LGBMClassifier(**rsearch_best_params, importance_type='gain',random_state=300)
lgbmc.fit(X_train_sm, y_train_sm)

In [None]:
y_train_pred = lgbmc.predict(X_train_sm)
y_test_pred = lgbmc.predict(X_test)
y_train_prob = lgbmc.predict_proba(X_train_sm)
y_test_prob = lgbmc.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train_sm,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))

### XGBoost with SMOTE

In [None]:
!pip install xgboost --quiet

In [None]:
import xgboost
from xgboost import XGBClassifier

rsearch_best_params = {'eval_metric': 'auc',
 'gamma': 0.2,
 'learning_rate': 0.2,
 'max_depth': 9,
 'n_estimators': 192,
 'reg_alpha': 0.1}

In [None]:
xgbc = XGBClassifier(**rsearch_best_params, random_state=300)
xgbc.fit(X_train_sm, y_train_sm)

In [None]:
y_train_pred = xgbc.predict(X_train_sm)
y_test_pred = xgbc.predict(X_test)
y_train_prob = xgbc.predict_proba(X_train_sm)
y_test_prob = xgbc.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train_sm,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))

### Model Stacking with SMOTE

In [None]:
from sklearn.ensemble import StackingClassifier

In [None]:
estimators = [
    ('rfc',RandomForestClassifier(max_depth = 17,
    min_samples_leaf = 2,
    min_samples_split = 4,
    n_estimators = 317)),
    
    ('lgbmc',lgb.LGBMClassifier(learning_rate = 0.3,
    max_depth = 12,
    n_estimators = 540,
    num_leaves = 31)),
    
    ('xgbc', XGBClassifier(eval_metric = 'auc',
    gamma = 0.2,
    learning_rate = 0.2,
    max_depth = 9,
    n_estimators = 192,
    reg_alpha = 0.1))
]

In [None]:
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(solver='liblinear'), cv = 5, n_jobs=-1)
clf.fit(X_train_sm,y_train_sm)
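
A scaled-down sketch of the same stacking idea on synthetic data may help show the mechanics. The data from `make_classification` and the small base learners are assumptions chosen so the example runs quickly; they are not the notebook's tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the loan dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    # The meta-learner is trained on out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_tr, y_tr)

toy_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"toy test AUC: {toy_auc:.3f}")
```

The `cv` argument controls how the out-of-fold predictions for the final estimator are produced, and each base model is then refit on the full training data.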

In [None]:
y_train_pred = clf.predict(X_train_sm)
y_test_pred = clf.predict(X_test)
y_train_prob = clf.predict_proba(X_train_sm)
y_test_prob = clf.predict_proba(X_test)

In [None]:
print('Train AUC score:',roc_auc_score(y_train_sm,y_train_prob[:,1]))
print('Test AUC score:',roc_auc_score(y_test,y_test_prob[:,1]))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob[:,1])
plt.plot(fpr,fpr)
plt.plot(fpr,tpr)
plt.grid()
plt.title('Test ROC curve')
plt.show()

In [None]:
confusion_matrix(y_test,y_test_pred)

In [None]:
sns.heatmap(confusion_matrix(y_test,y_test_pred),annot=True)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print('Test Classification Report\n')
print(classification_report(y_test,y_test_pred))
