# Problem Statement


You work as a Business Analytics Consultant at the Bank of Corporate. The bank witnessed slower than usual growth in its book of business in the most recent quarter of 2018. It provides financial services and products such as savings accounts, current accounts, and debit cards to its customers. The data suggested that the bank's home loan business had taken a major loss. Loans are the core business of banks, and the main profit comes directly from loan interest.

The head of the Home Loan business asked the heads of the Sales, Operations, Risk and Analytics teams to investigate, identify the root causes of the slowing growth, and solve the problem. A business such as home loans can grow or shrink because of demand-side, supply-side, or competitive factors. Some questions to ask are:

- <b> Demand Side: </b> Are interest rates high?
- <b> Demand Side: </b> Are there any macroeconomic reasons, such as a recession, low salary growth, or inflation?
- <b> Supply Side: </b> Are new and attractive housing projects not available in the markets being served?
- <b> Supply Side: </b> Have real estate prices shot up, making homes relatively unaffordable?
- <b> Competitor Side: </b> Are we losing customers to our competition? Is our competition also facing lower growth?



The team found that credit risk was abnormally high and that loan default rates were elevated. So what do default loans and credit risk mean?

<b> Default loans : </b> <br>
Default is the failure to repay a loan according to the terms agreed to in the promissory note. For most federal student loans, for example, you will default if you have not made a payment in more than 270 days.

<b> Credit risk : </b> <br>
It is the risk a bank takes when lending money to borrowers. They might default and fail to repay their dues on time, which results in losses to the bank.

## <b> So, what do banks do then? </b> <br>
They need to manage their credit risks. The goal of credit risk management in banks is to maintain credit risk exposure within proper and acceptable parameters. It is the practice of mitigating losses by understanding the adequacy of a bank’s capital and loan loss reserves at any given time. For this, banks not only need to manage the entire portfolio but also individual credits.

## <b> Measures taken </b> <br>
So in 2019, the bank started a project to build a "Credit risk estimate model" for its home loan branch. Loans should be granted only after an intensive process of verification and validation. The dataset (provided below) contains information about all the customers who were contacted during this year and were provided loans based on various parameters. The "Credit risk estimate model" needs to be cost-efficient so that the bank not only decreases its credit risk but also increases its total profit.


## <b> Business objective </b> <br>

Your aim is to build the "Credit risk estimate model" to classify new loans as "Low Risk", "Medium Risk", or "High Risk". This will help the bank sanction loans to "Low Risk" customers, follow up on the latest information/data for "Medium Risk" customers, and reject loan applications from "High Risk" customers.
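The decision rule described above can be sketched as a small lookup. The names `RISK_ACTIONS` and `loan_action` and the action strings are illustrative assumptions, not part of the bank's actual process.

```python
# Hypothetical mapping from a predicted risk class to the action described above.
RISK_ACTIONS = {
    "Low Risk": "sanction",
    "Medium Risk": "follow up with latest information",
    "High Risk": "reject",
}


def loan_action(risk_label: str) -> str:
    """Return the action for a predicted risk class; raise on unknown labels."""
    try:
        return RISK_ACTIONS[risk_label]
    except KeyError:
        raise ValueError(f"Unknown risk label: {risk_label!r}") from None


print(loan_action("Medium Risk"))
```

Failing loudly on an unknown label is a deliberate choice here: a silent default would hide a mismatch between the model's class names and the business rule.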

### Step-1: Importing the modules

In [2]:
#Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Import statsmodels and the modules used from it
import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Import the sklearn and its corresponding modules
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

### Step-2: Reading the dataset

In [4]:
loan_data = pd.read_csv('loan_data.csv')
loan_data.head()

Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc,loan_status
0,1077501,5000,5000,10.65,162.87,10,24000.0,Low Risk
1,1077430,2500,2500,15.27,59.83,1,30000.0,High Risk
2,1077175,2400,2400,15.96,84.33,10,12252.0,Low Risk
3,1076863,10000,10000,13.49,339.31,10,49200.0,Low Risk
4,1075358,3000,3000,12.69,67.79,1,80000.0,Medium Risk


<b> The dataset has the following columns: </b>

<b> id : </b>Transaction ID used to identify each transaction uniquely <br>
<b> loan_amnt : </b> Loan amount that was requested by the customer <br>
<b> funded_amnt : </b>Amount that was sanctioned by the bank <br>
<b> int_rate : </b> Interest rate offered on the loan amount <br>
<b> installment : </b>Amount of money paid during each installment <br>
<b> emp_length :  </b> Work experience (employment length of the customer)  <br>
<b> annual_inc : </b>Annual income of the customer<br>
<b> loan_status : </b> Risk class of the loan: "High Risk", "Low Risk" or "Medium Risk" <br>

In [5]:
#Check the details of the dataset
loan_data.describe()

Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc
count,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0
mean,681040.4,11291.615988,11017.101211,12.052427,326.760477,5.09205,69608.28
std,211304.5,7462.136215,7193.038828,3.716705,209.143908,3.408338,64253.2
min,54734.0,500.0,500.0,5.42,15.69,1.0,4000.0
25%,513435.0,5500.0,5500.0,9.32,168.4425,2.0,41400.0
50%,662770.5,10000.0,9950.0,11.86,282.83,4.0,60000.0
75%,836491.2,15000.0,15000.0,14.59,434.3975,9.0,83199.99
max,1077501.0,35000.0,35000.0,24.59,1305.19,10.0,6000000.0


In [6]:
#gets the information of the dataset
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38642 entries, 0 to 38641
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           38642 non-null  int64  
 1   loan_amnt    38642 non-null  int64  
 2   funded_amnt  38642 non-null  int64  
 3   int_rate     38642 non-null  float64
 4   installment  38642 non-null  float64
 5   emp_length   38642 non-null  int64  
 6   annual_inc   38642 non-null  float64
 7   loan_status  38642 non-null  object 
dtypes: float64(3), int64(4), object(1)
memory usage: 2.4+ MB


In [7]:
# Distribution of the target variable
loan_data['loan_status'].value_counts()

loan_status
Low Risk       32145
High Risk       5399
Medium Risk     1098
Name: count, dtype: int64
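The counts above are heavily skewed towards "Low Risk". As a rough sketch using only those three counts, the class shares and the accuracy of a trivial "always predict the majority class" baseline can be computed as follows. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"Low Risk": 32145, "High Risk": 5399, "Medium Risk": 1098}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<12} {share:.1%}")

# Accuracy of a trivial model that always predicts the most frequent class.
baseline_accuracy = max(shares.values())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier trained on this data should be judged against that baseline (roughly 83%) rather than against 50% or 33%.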

In [8]:
#Check for null values
loan_data.isnull().sum()

id             0
loan_amnt      0
funded_amnt    0
int_rate       0
installment    0
emp_length     0
annual_inc     0
loan_status    0
dtype: int64

In [9]:
#Check for duplicate values
loan_data.duplicated().sum()

0

### Step-3: Data Preparation

<b> fund_perc: </b> Ratio of the amount sanctioned to the loan amount requested. A higher value indicates that the bank is more willing to lend to the customer. <br>
<b> incToloan_perc: </b> Ratio of annual income to the loan amount. A higher value suggests that the customer is more likely to repay without defaulting.

In [10]:
# Adding new variables

# fund_perc variable represents the ratio of funded amount wrt loan amount
loan_data['fund_perc'] = loan_data['funded_amnt']/loan_data['loan_amnt']

# incToloan_perc variable represents the ratio of annual income wrt loan amount
loan_data['incToloan_perc'] = loan_data['annual_inc']/loan_data['loan_amnt']

In [11]:
# Understanding distribution of all the numerical variables in dataset
loan_data.describe()

Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc
count,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0,38642.0
mean,681040.4,11291.615988,11017.101211,12.052427,326.760477,5.09205,69608.28,0.985571,8.91595
std,211304.5,7462.136215,7193.038828,3.716705,209.143908,3.408338,64253.2,0.070317,13.845454
min,54734.0,500.0,500.0,5.42,15.69,1.0,4000.0,0.10125,1.204819
25%,513435.0,5500.0,5500.0,9.32,168.4425,2.0,41400.0,1.0,4.0
50%,662770.5,10000.0,9950.0,11.86,282.83,4.0,60000.0,1.0,6.066667
75%,836491.2,15000.0,15000.0,14.59,434.3975,9.0,83199.99,1.0,10.016699
max,1077501.0,35000.0,35000.0,24.59,1305.19,10.0,6000000.0,1.0,1266.666667


In [12]:
#column names
loan_data.columns

Index(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment',
       'emp_length', 'annual_inc', 'loan_status', 'fund_perc',
       'incToloan_perc'],
      dtype='object')

In [13]:
loan_data.head()

Unnamed: 0,id,loan_amnt,funded_amnt,int_rate,installment,emp_length,annual_inc,loan_status,fund_perc,incToloan_perc
0,1077501,5000,5000,10.65,162.87,10,24000.0,Low Risk,1.0,4.8
1,1077430,2500,2500,15.27,59.83,1,30000.0,High Risk,1.0,12.0
2,1077175,2400,2400,15.96,84.33,10,12252.0,Low Risk,1.0,5.105
3,1076863,10000,10000,13.49,339.31,10,49200.0,Low Risk,1.0,4.92
4,1075358,3000,3000,12.69,67.79,1,80000.0,Medium Risk,1.0,26.666667
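As a quick sanity check, the two derived columns can be recomputed for the five rows shown above. This is a minimal sketch; the `sample` frame simply re-types values from the `head()` output and is not read from `loan_data.csv`.

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output above.
sample = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
    "funded_amnt": [5000, 2500, 2400, 10000, 3000],
    "annual_inc": [24000.0, 30000.0, 12252.0, 49200.0, 80000.0],
})

# Same derived features as in the notebook.
sample["fund_perc"] = sample["funded_amnt"] / sample["loan_amnt"]
sample["incToloan_perc"] = sample["annual_inc"] / sample["loan_amnt"]
print(sample)
```

Both columns are plain ratios despite the `_perc` suffix: a `fund_perc` of 1.0 means the full requested amount was funded, and an `incToloan_perc` of 4.8 means the annual income is 4.8 times the loan amount.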


In [14]:
# Drop id (an identifier) and funded_amnt (now represented by fund_perc); neither is required for model building
loan_data = loan_data.drop(['id','funded_amnt'],axis=1)

In [15]:
#check the datatypes of the columns
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38642 entries, 0 to 38641
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   loan_amnt       38642 non-null  int64  
 1   int_rate        38642 non-null  float64
 2   installment     38642 non-null  float64
 3   emp_length      38642 non-null  int64  
 4   annual_inc      38642 non-null  float64
 5   loan_status     38642 non-null  object 
 6   fund_perc       38642 non-null  float64
 7   incToloan_perc  38642 non-null  float64
dtypes: float64(5), int64(2), object(1)
memory usage: 2.4+ MB


In [16]:
# Splitting the data into train and test sets
df_train,df_test = train_test_split(loan_data,train_size=0.7,random_state=100)
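Because the target classes are very unbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in both subsets. The sketch below uses a synthetic stand-in frame (`toy`) rather than the real `loan_data`; `stratify` is an existing parameter of `train_test_split`, but using it here is a suggestion, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan_data (assumption: not the real dataset).
rng = np.random.default_rng(100)
n = 5000
toy = pd.DataFrame({
    "loan_amnt": rng.integers(500, 35000, size=n),
    "loan_status": rng.choice(
        ["Low Risk", "High Risk", "Medium Risk"], size=n, p=[0.83, 0.14, 0.03]
    ),
})

# stratify= keeps the class proportions (nearly) identical in both splits.
toy_train, toy_test = train_test_split(
    toy, train_size=0.7, random_state=100, stratify=toy["loan_status"]
)

print(toy_train["loan_status"].value_counts(normalize=True).round(3))
print(toy_test["loan_status"].value_counts(normalize=True).round(3))
```

With a rare class such as "Medium Risk", an unstratified split can leave the test set with noticeably more or fewer examples than expected, which makes per-class metrics noisier.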

In [17]:
df_train.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,fund_perc,incToloan_perc
2162,8000,16.29,282.41,10,108000.0,Low Risk,1.0,13.5
2558,13000,15.96,315.86,4,85000.0,Medium Risk,1.0,6.538462
17406,12000,16.4,294.38,6,65000.0,Low Risk,1.0,5.416667
5337,3000,11.71,99.23,9,40000.0,Low Risk,1.0,13.333333
9667,1900,5.99,57.8,5,40200.0,Low Risk,1.0,21.157895


In [18]:
df_test.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,fund_perc,incToloan_perc
13544,24000,11.99,533.75,10,84996.0,Low Risk,1.0,3.5415
20268,12500,13.43,423.77,1,58000.0,Low Risk,1.0,4.64
35271,1700,13.16,57.41,1,14000.0,Low Risk,1.0,8.235294
29133,10000,12.73,335.67,10,60000.0,High Risk,1.0,6.0
2974,1300,7.9,40.68,1,41000.0,Low Risk,1.0,31.538462


In [19]:
# Rescaling the data using StandardScaler
scalar = StandardScaler()

In [20]:
num_col = ['loan_amnt','int_rate','installment','emp_length','annual_inc','fund_perc','incToloan_perc']

In [21]:
df_train[num_col] = scalar.fit_transform(df_train[num_col])

In [22]:
df_train.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,fund_perc,incToloan_perc
2162,-0.44637,1.141503,-0.216163,1.438351,0.553912,Low Risk,0.208475,0.306507
2558,0.223622,1.052587,-0.055835,-0.32194,0.220512,Medium Risk,0.208475,-0.157154
17406,0.089624,1.171142,-0.15879,0.264824,-0.069401,Low Risk,0.208475,-0.231869
5337,-1.116363,-0.092547,-1.094161,1.144969,-0.431792,Low Risk,0.208475,0.295406
9667,-1.263762,-1.633763,-1.292739,-0.028558,-0.428893,Low Risk,0.208475,0.816548
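To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes, on a handful of made-up values. The arrays `train` and `test` are illustrative and are not the real columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test columns (illustrative values, not the real data).
train = np.array([[8000.0], [13000.0], [12000.0], [3000.0], [1900.0]])
test = np.array([[24000.0], [12500.0]])

scaler = StandardScaler().fit(train)

# StandardScaler uses the training mean and the population std (ddof=0).
mean, std = train.mean(axis=0), train.std(axis=0)

# Test data is scaled with the *training* statistics, never its own.
manual_test = (test - mean) / std
print(scaler.transform(test).ravel())
print(manual_test.ravel())
```

Fitting the scaler on the training set only, and reusing it for the test set, avoids leaking information about the test distribution into the model.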


### Step-4: Model Building using OneVsRest Classifier

In [23]:
#Create the predictor and target datasets
X_train = df_train.drop(['loan_status'],axis=1)
y_train = df_train['loan_status']

In [24]:
X_train.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc
2162,-0.44637,1.141503,-0.216163,1.438351,0.553912,0.208475,0.306507
2558,0.223622,1.052587,-0.055835,-0.32194,0.220512,0.208475,-0.157154
17406,0.089624,1.171142,-0.15879,0.264824,-0.069401,0.208475,-0.231869
5337,-1.116363,-0.092547,-1.094161,1.144969,-0.431792,0.208475,0.295406
9667,-1.263762,-1.633763,-1.292739,-0.028558,-0.428893,0.208475,0.816548


In [25]:
y_train.head()

2162        Low Risk
2558     Medium Risk
17406       Low Risk
5337        Low Risk
9667        Low Risk
Name: loan_status, dtype: object

In [26]:
# Building a classification model using one vs rest method
LR = LogisticRegression()
ovr = OneVsRestClassifier(LR)

In [27]:
# Fitting the model with training data
model = ovr.fit(X_train,y_train)
print(model)

OneVsRestClassifier(estimator=LogisticRegression())
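Under the hood, the one-vs-rest wrapper fits one binary logistic regression per class, each separating that class from all the others. The sketch below illustrates this on synthetic two-feature data; the data, the cluster centres, and the name `ovr_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (assumption: stands in for the scaled loan features).
rng = np.random.default_rng(0)
centers = {"Low Risk": (0.0, 0.0), "High Risk": (2.0, 0.0), "Medium Risk": (0.0, 2.0)}
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 200)

ovr_demo = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# One binary LogisticRegression per class, ordered like classes_.
for label, est in zip(ovr_demo.classes_, ovr_demo.estimators_):
    # Refit the same binary problem by hand: this class vs. everything else.
    manual = LogisticRegression().fit(X, (y == label).astype(int))
    print(label, np.allclose(est.coef_, manual.coef_, atol=1e-4))
```

Each binary model sees its own class as the positive label, so a rare class such as "Medium Risk" becomes an even more unbalanced binary problem.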


In [28]:
# Making a prediction on the train set
y_train_pred = model.predict(X_train)
# Evaluating the model
print(f"Train Set Accuracy: {accuracy_score(y_train, y_train_pred) * 100} %\n\n")
print(f"Classification Report: \n\n{classification_report(y_train, y_train_pred)}")

Train Set Accuracy: 83.2304336574365 %


Classification Report: 

              precision    recall  f1-score   support

   High Risk       0.33      0.01      0.01      3734
    Low Risk       0.83      1.00      0.91     22533
 Medium Risk       0.12      0.00      0.00       782

    accuracy                           0.83     27049
   macro avg       0.43      0.34      0.31     27049
weighted avg       0.74      0.83      0.76     27049



**Training set accuracy: 83.2%**
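The "macro avg" and "weighted avg" rows of the report can be reproduced from the per-class rows, which helps explain why they differ so much. The sketch below uses only the rounded figures printed in the training-set report, so the results agree with the report only up to that rounding.

```python
# Per-class figures copied from the training-set classification report above.
report = {
    "High Risk":   {"precision": 0.33, "recall": 0.01, "f1": 0.01, "support": 3734},
    "Low Risk":    {"precision": 0.83, "recall": 1.00, "f1": 0.91, "support": 22533},
    "Medium Risk": {"precision": 0.12, "recall": 0.00, "f1": 0.00, "support": 782},
}
total_support = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean weighted by class support: dominated by the largest class."""
    return sum(row[metric] * row["support"] for row in report.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(f"{metric:<9} macro={macro_avg(metric):.2f} weighted={weighted_avg(metric):.2f}")
```

The weighted averages look healthy because "Low Risk" supplies most of the support; the macro averages expose the near-zero recall on the two riskier classes.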

### Step-5: Model Prediction

In [29]:
#Create the predictor and target variables for test dataset
X_test = df_test.drop(['loan_status'],axis=1)
y_test = df_test['loan_status']

In [30]:
X_test.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc
13544,24000,11.99,533.75,10,84996.0,1.0,3.5415
20268,12500,13.43,423.77,1,58000.0,1.0,4.64
35271,1700,13.16,57.41,1,14000.0,1.0,8.235294
29133,10000,12.73,335.67,10,60000.0,1.0,6.0
2974,1300,7.9,40.68,1,41000.0,1.0,31.538462


In [31]:
y_test.head()

13544     Low Risk
20268     Low Risk
35271     Low Risk
29133    High Risk
2974      Low Risk
Name: loan_status, dtype: object

In [32]:
#rescaling of test dataset
X_test[num_col] = scalar.transform(X_test[num_col])

In [33]:
X_test.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc
13544,1.697607,-0.017103,0.988531,1.438351,0.220454,0.208475,-0.356762
20268,0.156623,0.370895,0.461387,-1.202086,-0.170871,0.208475,-0.283598
35271,-1.290562,0.298145,-1.294608,-1.202086,-0.808679,0.208475,-0.04414
29133,-0.178373,0.182285,0.039116,1.438351,-0.14188,0.208475,-0.193018
2974,-1.344161,-1.119126,-1.374796,-1.202086,-0.417297,0.208475,1.507927


In [34]:
# Making a prediction on the test set
y_test_pred = model.predict(X_test)

# Evaluating the model
print(f"Test Set Accuracy: {accuracy_score(y_test, y_test_pred) * 100} %\n\n")
print(f"Classification Report: \n\n{classification_report(y_test, y_test_pred)}")

Test Set Accuracy: 82.94660571034245 %


Classification Report: 

              precision    recall  f1-score   support

   High Risk       0.47      0.01      0.02      1665
    Low Risk       0.83      1.00      0.91      9612
 Medium Risk       0.00      0.00      0.00       316

    accuracy                           0.83     11593
   macro avg       0.43      0.34      0.31     11593
weighted avg       0.76      0.83      0.75     11593



**Test set Accuracy: 82.9%**

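**Training and test accuracy are nearly identical (83.2% vs 82.9%), so the model is not overfitting. However, recall for "High Risk" and "Medium Risk" is close to zero on both sets: the model labels almost every loan "Low Risk", and its accuracy mirrors the share of "Low Risk" loans in the data (32,145 of 38,642, about 83%).**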
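One possible remedy for a model that rarely predicts the minority classes is to reweight them during training. The sketch below compares a plain one-vs-rest logistic regression with one using `class_weight="balanced"` on synthetic, deliberately unbalanced data. The data generator `make_data` and its cluster centres are assumptions for illustration; this is not a result on the loan dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.multiclass import OneVsRestClassifier


def make_data(seed):
    """Synthetic, heavily imbalanced three-class data (not the loan dataset)."""
    rng = np.random.default_rng(seed)
    sizes = {"Low Risk": 2000, "High Risk": 300, "Medium Risk": 60}
    centers = {"Low Risk": (0.0, 0.0), "High Risk": (1.0, 0.0), "Medium Risk": (0.0, 1.0)}
    X = np.vstack([rng.normal(centers[k], 1.0, size=(n, 2)) for k, n in sizes.items()])
    y = np.repeat(list(sizes.keys()), list(sizes.values()))
    return X, y


X_tr, y_tr = make_data(0)  # training sample
X_te, y_te = make_data(1)  # independent evaluation sample

plain = OneVsRestClassifier(LogisticRegression()).fit(X_tr, y_tr)
balanced = OneVsRestClassifier(LogisticRegression(class_weight="balanced")).fit(X_tr, y_tr)

# Macro recall averages per-class recall, so minority classes count equally.
recall_plain = recall_score(y_te, plain.predict(X_te), average="macro")
recall_balanced = recall_score(y_te, balanced.predict(X_te), average="macro")
print(f"macro recall, plain:    {recall_plain:.2f}")
print(f"macro recall, balanced: {recall_balanced:.2f}")
```

Reweighting typically trades some "Low Risk" accuracy for better recall on the rarer classes, so whether it helps depends on the relative cost of approving a risky loan versus rejecting a safe one.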

### Step-6: Analysing the probabilities and classification values

In [35]:
X_test_org = X_test.copy()

In [36]:
#Scaled feature array
X_test['Scalar Columns'] = X_test.values.tolist()

In [37]:
#Actual target variable
X_test['Actual'] = y_test

In [38]:
#OnevsRest target prediction
X_test['Predicted'] = y_test_pred

In [39]:
#OnevsRest probability prediction
X_test['Prob_Predicted'] = model.predict_proba(X_test_org).tolist()

In [40]:
#OnevsRest individual class prediction probabilities
# Columns follow model.classes_ order: 'High Risk', 'Low Risk', 'Medium Risk'
proba = model.predict_proba(X_test_org)
X_test['Prob_Predicted_highRisk'] = proba[:, 0]
X_test['Prob_Predicted_lowRisk'] = proba[:, 1]
X_test['Prob_Predicted_mediumRisk'] = proba[:, 2]

In [41]:
#The updated test set with probabilities
X_test.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,fund_perc,incToloan_perc,Scalar Columns,Actual,Predicted,Prob_Predicted,Prob_Predicted_highRisk,Prob_Predicted_lowRisk,Prob_Predicted_mediumRisk
13544,1.697607,-0.017103,0.988531,1.438351,0.220454,0.208475,-0.356762,"[1.6976067970323159, -0.017103257487488508, 0....",Low Risk,Low Risk,"[0.14540704828756673, 0.6705568181703987, 0.18...",0.145407,0.670557,0.184036
20268,0.156623,0.370895,0.461387,-1.202086,-0.170871,0.208475,-0.283598,"[0.15662313180383905, 0.37089505172108733, 0.4...",Low Risk,Low Risk,"[0.13443332466561425, 0.8637730013278433, 0.00...",0.134433,0.863773,0.001794
35271,-1.290562,0.298145,-1.294608,-1.202086,-0.808679,0.208475,-0.04414,"[-1.2905615277150784, 0.29814536874447944, -1....",Low Risk,Low Risk,"[0.17521222445481202, 0.8066509312053007, 0.01...",0.175212,0.806651,0.018137
29133,-0.178373,0.182285,0.039116,1.438351,-0.14188,0.208475,-0.193018,"[-0.1783733171588733, 0.1822847625224742, 0.03...",High Risk,Low Risk,"[0.13835020398997386, 0.8570853687482313, 0.00...",0.13835,0.857085,0.004564
2974,-1.344161,-1.119126,-1.374796,-1.202086,-0.417297,0.208475,1.507927,"[-1.3441609595491122, -1.1191262329479577, -1....",Low Risk,Low Risk,"[0.07183469705597605, 0.9259854258182494, 0.00...",0.071835,0.925985,0.00218
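The three probabilities in each row above sum to 1 even though they come from three separate binary models. A minimal sketch of how that happens, on synthetic data: each binary model's positive-class probability is collected, and each row is then divided by its sum. The data and the name `clf` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (assumption: stands in for the scaled test features).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in [(0, 0), (2, 0), (0, 2)]])
y = np.repeat(["Low Risk", "High Risk", "Medium Risk"], 100)

clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Each binary model gives P(class vs. rest); these need not sum to 1 across classes.
raw = np.column_stack([est.predict_proba(X)[:, 1] for est in clf.estimators_])

# Dividing each row by its sum reproduces the probabilities the wrapper reports.
normalised = raw / raw.sum(axis=1, keepdims=True)

print(np.allclose(normalised, clf.predict_proba(X)))
print(clf.classes_[normalised[:3].argmax(axis=1)], clf.predict(X[:3]))
```

The predicted label is the class whose binary model is most confident, so the normalisation changes the reported probabilities but not the predicted class.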


### Step-7: Display the coefficients and intercept values for each Logistic Regression model

In [42]:
# Classes for which individual models are created
print(model.classes_)

['High Risk' 'Low Risk' 'Medium Risk']


In [43]:
#Coefficient matrix for all the models created
for estimator in model.estimators_:
    print(estimator.coef_.shape)

(1, 7)
(1, 7)
(1, 7)


In [44]:
#Intercept values for all the models created
for estimator in model.estimators_:
    print(estimator.intercept_)

[-1.98702862]
[1.81436432]
[-5.11248147]


In [45]:
#Coefficient values for all the models created
for estimator in model.estimators_:
    print(estimator.coef_)

[[ 0.39851727  0.59347116 -0.38860129  0.04331238 -0.39891493  0.02934282
  -0.01457014]]
[[-1.16799074 -0.63873289  1.13309613 -0.07050531  0.30833649 -0.21800131
   0.09024585]]
[[ 4.9602886   0.96920326 -5.59110894  0.17542626  0.1334617   1.39128615
  -0.64155322]]
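As a worked check, the probabilities reported in Step-6 for test row 20268 can be rebuilt from the intercepts and coefficients printed above. The sketch assumes the feature order is that of `X_train` (`loan_amnt`, `int_rate`, `installment`, `emp_length`, `annual_inc`, `fund_perc`, `incToloan_perc`) and uses the rounded values shown in the outputs, so the result should match the table (0.134433, 0.863773, 0.001794) only up to that rounding.

```python
import numpy as np

# Intercepts and coefficients copied from the output above,
# in model.classes_ order: High Risk, Low Risk, Medium Risk.
intercepts = np.array([-1.98702862, 1.81436432, -5.11248147])
coefs = np.array([
    [0.39851727, 0.59347116, -0.38860129, 0.04331238, -0.39891493, 0.02934282, -0.01457014],
    [-1.16799074, -0.63873289, 1.13309613, -0.07050531, 0.30833649, -0.21800131, 0.09024585],
    [4.9602886, 0.96920326, -5.59110894, 0.17542626, 0.1334617, 1.39128615, -0.64155322],
])

# Scaled features of test row 20268, copied from the X_test.head() output.
x = np.array([0.156623, 0.370895, 0.461387, -1.202086, -0.170871, 0.208475, -0.283598])

# One-vs-rest: a separate sigmoid per class, then normalise so the row sums to 1.
logits = coefs @ x + intercepts
sigmoids = 1.0 / (1.0 + np.exp(-logits))
probs = sigmoids / sigmoids.sum()

for label, p in zip(["High Risk", "Low Risk", "Medium Risk"], probs):
    print(f"{label:<12} {p:.6f}")
```

Since the features are standardised, each coefficient is the change in that class's log-odds per one standard deviation of the feature, which makes the magnitudes within a model roughly comparable.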
