# Predicting the Probability of Default of a Potential Borrower


## Credit Risk Modeling

---




## Table of Contents

#### The Problem

>*   [Why are loan defaults a problem?](https://colab.research.google.com/drive/1jKuIoX6YsxW6Bez9ahVNyP8wMa-xDaBo#scrollTo=ZX2IqfWVlFhp&line=2&uniqifier=1)



#### Examine the data



>* Simple Inspection of Data






**The Problem:**

People apply to a bank for loans. Bank employees process each application manually and evaluate the applicant on factors such as profession, age, existing debt, salary and marital status.

After analyzing all factors, the bank decides whether to approve or reject the application. This is a tedious and time-consuming process, and it is prone to human error.

For example, suppose Applicant A's application should have been rejected based on his financial condition, default history and other factors, but it was approved. Conversely, Applicant B's application should have been approved, since he/she has no debts and earns a good salary, but it was rejected.

The reason could be that the two applications were swapped by mistake, or that the person in charge of approving loan applications is biased.

In both cases, the bank bears a loss.

Let's assume that both applicants need a loan of 1000 dollars from the bank.

* Applicant A: the bank lent him 1000 dollars and he defaulted, so the total loss to the bank is 1000 dollars.
* Applicant B: the bank did not approve his/her application. Had it been approved, the bank would have earned 100 dollars from it.

The bank loses more by approving a defaulter's loan than by rejecting a non-defaulter's loan.
In other words, giving a loan to a bad customer marked as good costs the bank more than denying a loan to a good customer marked as bad.
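
The asymmetry between the two kinds of mistake can be made concrete with a small calculation. The sketch below reuses the illustrative figures from the example (1000 dollars lost on an approved defaulter, 100 dollars forgone on a rejected good applicant). The function name `total_cost` and the error counts are hypothetical, chosen only for illustration.

```python
# Illustrative costs taken from the example above (not real bank figures).
COST_APPROVED_DEFAULTER = 1000  # principal lost when a defaulter is approved
COST_REJECTED_GOOD = 100        # earnings forgone when a good applicant is rejected


def total_cost(n_approved_defaulters, n_rejected_good):
    """Total cost of a set of wrong decisions under the two illustrative costs."""
    return (n_approved_defaulters * COST_APPROVED_DEFAULTER
            + n_rejected_good * COST_REJECTED_GOOD)


# Hypothetical error counts: 12 mistakes in both cases, distributed differently.
print(total_cost(n_approved_defaulters=10, n_rejected_good=2))  # 10200
print(total_cost(n_approved_defaulters=2, n_rejected_good=10))  # 3000
```

Under these assumed costs, one approved defaulter costs as much as ten rejected good applicants. This suggests evaluating a model with a metric that penalizes missed defaulters more heavily, for example an F-beta score with beta greater than 1.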

In this project, we develop an automated process that approves or rejects loan applications based on the factors described above. It will save the bank the time spent on manual processing, reduce human error, and save the bank money by reducing the number of loans given to defaulters.







## **Import Packages**

In [1]:
# import your libraries
import seaborn as sns
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import make_scorer, fbeta_score
from sklearn.metrics import confusion_matrix
from imblearn.pipeline import Pipeline, make_pipeline
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier

import warnings
warnings.filterwarnings("ignore")





In [None]:
__author__ = "Samit Singh"
__email__ = "samitsingh.85@gmail.com"

### ---- 2 Load the data ----


---

In [2]:
# Load the data into a pandas DataFrame
def load_csv(csv_file):
    """Read a CSV file and return its contents as a DataFrame."""
    return pd.read_csv(csv_file)

In [3]:

loan_df = load_csv('/content/feature_data.csv')
loan_df.head()

| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


### Examine the data

In [4]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [5]:
# Get list of numerical and categorical columns
num_cols = loan_df.select_dtypes(include=np.number).columns.tolist()
print('Numerical columns in data:- {}'.format(num_cols))
cat_cols = loan_df.select_dtypes(include='O').columns.tolist()
print('\nCategorical columns in data:- {}'.format(cat_cols))

Numerical columns in data:- ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length']

Categorical columns in data:- ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
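
Note that `loan_status` is the target variable, yet it appears in the numerical list. A minimal sketch of keeping it out of the feature lists is shown below. It uses a small hypothetical frame (`toy_df`) in place of `loan_df`, and the names `target`, `num_features` and `cat_features` are illustrative.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for loan_df.
toy_df = pd.DataFrame({
    "person_age": [22, 21, 25],
    "loan_int_rate": [16.02, 11.14, 12.87],
    "loan_grade": ["D", "B", "C"],
    "loan_status": [1, 0, 1],
})

target = "loan_status"

# Numerical feature columns, excluding the target.
num_features = [col for col in toy_df.select_dtypes(include=np.number).columns
                if col != target]
# Every non-numerical column is treated as categorical here.
cat_features = toy_df.select_dtypes(exclude=np.number).columns.tolist()

print(num_features)  # ['person_age', 'loan_int_rate']
print(cat_features)  # ['loan_grade']
```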


In [6]:
# Check for duplicate rows
n_duplicates = loan_df.duplicated().sum()
print('Number of duplicate rows in dataframe:- {}'.format(n_duplicates))

Number of duplicate rows in dataframe:- 165


In [7]:
# Check for null values in the dataframe
loan_df.isnull().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

There are 895 nulls in `person_emp_length` and 3116 nulls in `loan_int_rate`. We can impute employment length with the median. We do not impute the interest rate, because it may be an important factor in identifying defaulters; instead, we delete the rows where it is null.
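
The two treatments can be sketched on a small hypothetical frame (`toy_df`, standing in for `loan_df`). This is an illustration under stated assumptions rather than the notebook's actual pipeline: it uses `SimpleImputer` with `strategy="median"`, and in a real workflow the imputer would normally be fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small hypothetical frame with one null in each column.
toy_df = pd.DataFrame({
    "person_emp_length": [5.0, np.nan, 1.0, 4.0, 8.0],
    "loan_int_rate": [11.14, 12.87, np.nan, 15.23, 14.27],
})

# Drop the rows where the interest rate is missing.
toy_df = toy_df.dropna(subset=["loan_int_rate"]).reset_index(drop=True)

# Fill missing employment length with the median of the remaining values.
imputer = SimpleImputer(strategy="median")
toy_df["person_emp_length"] = imputer.fit_transform(
    toy_df[["person_emp_length"]]
).ravel()

print(toy_df)
```

After the drop, the remaining employment lengths are 5, 4 and 8, so the missing value is filled with their median, 5.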

In [8]:
# Numerical columns statistics
loan_df.describe()

| | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|
| count | 32581.0 | 32581.0 | 31686.0 | 32581.0 | 29465.0 | 32581.0 | 32581.0 | 32581.0 |
| mean | 27.7346 | 66074.85 | 4.789686 | 9589.371106 | 11.011695 | 0.218164 | 0.170203 | 5.804211 |
| std | 6.348078 | 61983.12 | 4.14263 | 6322.086646 | 3.240459 | 0.413006 | 0.106782 | 4.055001 |
| min | 20.0 | 4000.0 | 0.0 | 500.0 | 5.42 | 0.0 | 0.0 | 2.0 |
| 25% | 23.0 | 38500.0 | 2.0 | 5000.0 | 7.9 | 0.0 | 0.09 | 3.0 |
| 50% | 26.0 | 55000.0 | 4.0 | 8000.0 | 10.99 | 0.0 | 0.15 | 4.0 |
| 75% | 30.0 | 79200.0 | 7.0 | 12200.0 | 13.47 | 0.0 | 0.23 | 8.0 |
| max | 144.0 | 6000000.0 | 123.0 | 35000.0 | 23.22 | 1.0 | 0.83 | 30.0 |


As shown above, the maximum **age** of a person is **144 years**, which is an outlier.
In the `person_emp_length` column, the maximum value is **123 years**, which is also an outlier.
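
A minimal sketch of how such implausible values could be counted before deciding what to do with them. The frame `toy_df` is hypothetical, and the thresholds (100 years for age, 60 years for employment length) are the ones this notebook uses in its cleaning step; they are judgment calls, not fixed rules.

```python
import pandas as pd

# Small hypothetical frame containing the two kinds of outlier and one null.
toy_df = pd.DataFrame({
    "person_age": [22, 144, 25, 23],
    "person_emp_length": [123.0, 4.0, 1.0, float("nan")],
})

MAX_AGE = 100         # assumed upper bound for a plausible age
MAX_EMP_LENGTH = 60   # assumed upper bound for a plausible employment length

# Boolean mask of rows that exceed either threshold.
implausible = ((toy_df["person_age"] > MAX_AGE)
               | (toy_df["person_emp_length"] > MAX_EMP_LENGTH))

print(implausible.sum())     # 2
print(toy_df[~implausible])  # rows that would be kept
```

Comparisons against NaN evaluate to False, so rows with a missing employment length are not flagged by this mask and need to be handled separately.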


In [9]:
# Categorical columns statistics
loan_df.describe(include='O')

| | person_home_ownership | loan_intent | loan_grade | cb_person_default_on_file |
|---|---|---|---|---|
| count | 32581 | 32581 | 32581 | 32581 |
| unique | 4 | 6 | 7 | 2 |
| top | RENT | EDUCATION | A | N |
| freq | 16446 | 6453 | 10777 | 26836 |


### ---- 3 Clean the data ----

In [10]:
# Drop duplicate rows
print('No of rows in the dataframe before change:- {}'.format(loan_df.shape[0]))
loan_df.drop_duplicates(inplace=True)
print('No of rows in the dataframe after change:-  {}'.format(loan_df.shape[0]))

No of rows in the dataframe before change:- 32581
No of rows in the dataframe after change:-  32416
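
For reference, the sketch below shows how `duplicated()` and `drop_duplicates()` behave on a small hypothetical frame (`toy_df`). It returns a new frame instead of modifying in place, and the `reset_index` call is an optional extra, not something the notebook does.

```python
import pandas as pd

# Small hypothetical frame in which rows 0 and 2 are identical.
toy_df = pd.DataFrame({
    "person_age": [22, 21, 22, 25],
    "loan_amnt": [35000, 1000, 35000, 5500],
})

# By default (keep='first') only the second and later copies are marked.
print(toy_df.duplicated().sum())            # 1
# keep=False marks every member of a duplicate group, which helps inspection.
print(toy_df.duplicated(keep=False).sum())  # 2

# Drop the repeated rows and renumber the index from 0.
deduped = toy_df.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 3
```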


In [11]:
# Delete rows where age is greater than 100, employment length is greater
# than 60, or interest rate is null
indices = loan_df[(loan_df.person_age > 100) | 
                  (loan_df.person_emp_length > 60) | 
                  (loan_df.loan_int_rate.isnull())].index
# print(indices)
loan_df.drop(indices, inplace=True)

In [12]:
# Verify the number of nulls remaining
loan_df.isnull().sum()

person_age                      0
person_income                   0
person_home_ownership           0
person_emp_length             820
loan_intent                     0
loan_grade                      0
loan_amnt                       0
loan_int_rate                   0
loan_status                     0
loan_percent_income             0
cb_person_default_on_file       0
cb_person_cred_hist_length      0
dtype: int64

### ---- 4 Explore the data (EDA) ----