# <b> Problem Statement </b>


You work as a Business Analytics Consultant at the Bank of Corporate. The bank witnessed slower-than-usual growth in its book of business in the most recent quarter of 2018. The bank provides financial services and products such as savings accounts, current accounts and debit cards to its customers. The data suggested that the bank's home loan business was hit by a major loss. Loans are the core business of banks, and the main profit comes directly from loan interest. The head of the Home Loan business asked the heads of the Sales, Operations, Risk and Analytics teams to investigate, identify the root causes of the slowing growth and solve the problem. A business like selling home loans can grow or shrink depending on factors on the demand side, the supply side and the competitor side. Some questions to ask are:

- <b> Demand Side: </b> Are interest rates high?
- <b> Demand Side: </b> Are there any macroeconomic reasons, such as recession, low salary growth or inflation?
- <b> Supply Side: </b> Are new and attractive housing projects not available in the markets being served?
- <b> Supply Side: </b> Have real estate prices shot up, making homes relatively unaffordable?
- <b> Competitor Side: </b> Are we losing customers to our competition? Is our competition also facing lower growth?



The team found that credit risk was abnormally high and that loan default rates were elevated. So what do we mean by default loans and credit risk?

<b> Default loans : </b> <br>
Default is the failure to repay a loan according to the terms agreed to in the promissory note. For example, for most federal student loans, you will default if you have not made a payment in more than 270 days.

<b> Credit risk : </b> <br>
It is simply the risk a bank takes when lending money to borrowers: they might default and fail to repay their dues on time, which results in losses to the bank.

## <b> So, what do banks do then? </b> <br>
They need to manage their credit risks. The goal of credit risk management in banks is to maintain credit risk exposure within proper and acceptable parameters. It is the practice of mitigating losses by understanding the adequacy of a bank’s capital and loan loss reserves at any given time. For this, banks not only need to manage the entire portfolio but also individual credits.

## <b> Measures taken </b> <br>
So in 2019, the bank launched a project to build a "Credit risk estimate model" for its home loan branch. Loans should be granted only after an intensive process of verification and validation. The dataset (provided below) contains information about all the customers who were contacted during this year and were provided loans based on various parameters. The "Credit risk estimate model" needs to be cost-efficient so that the bank not only decreases its credit risk but also increases its total profit.


## <b> Business objective </b> <br>

Your aim is to build the "Credit risk estimate model" to classify new loans as "Low Risk", "Medium Risk" or "High Risk". This will help the bank sanction loans to "Low Risk" customers, follow up on the latest information/data for "Medium Risk" customers, and reject loan approval for "High Risk" customers.

## <b> Read the dataset

In [54]:
#Import the libraries
import pandas as pd
import numpy as np

#Load the loan dataset
df = pd.read_csv("../data/loan2.csv")

#Details of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38642 entries, 0 to 38641
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           38642 non-null  int64  
 1   loan_amnt    38642 non-null  int64  
 2   funded_amnt  38642 non-null  int64  
 3   int_rate     38642 non-null  float64
 4   installment  38642 non-null  float64
 5   emp_length   38642 non-null  int64  
 6   annual_inc   38642 non-null  float64
 7   loan_status  38642 non-null  object 
dtypes: float64(3), int64(4), object(1)
memory usage: 2.4+ MB


The dataset has the following columns: </b>

<b> id : </b> Transaction ID used to identify each transaction uniquely <br>
<b> loan_amnt : </b> Loan amount requested by the customer <br>
<b> funded_amnt : </b> Amount sanctioned by the bank <br>
<b> int_rate : </b> Interest rate offered on the loan amount <br>
<b> installment : </b> Amount of money paid in each installment <br>
<b> emp_length : </b> Work experience (employment length of the customer) <br>
<b> annual_inc : </b> Annual income of the customer <br>
<b> loan_status : </b> Risk class of the loan: "High Risk", "Low Risk" or "Medium Risk" <br>

In [55]:
#Check the details of the dataset
df.head()

| | id | loan_amnt | funded_amnt | int_rate | installment | emp_length | annual_inc | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 5000 | 5000 | 10.65 | 162.87 | 10 | 24000.0 | Low Risk |
| 1 | 1077430 | 2500 | 2500 | 15.27 | 59.83 | 1 | 30000.0 | High Risk |
| 2 | 1077175 | 2400 | 2400 | 15.96 | 84.33 | 10 | 12252.0 | Low Risk |
| 3 | 1076863 | 10000 | 10000 | 13.49 | 339.31 | 10 | 49200.0 | Low Risk |
| 4 | 1075358 | 3000 | 3000 | 12.69 | 67.79 | 1 | 80000.0 | Medium Risk |


In [56]:
# Distribution of the target variable
df['loan_status'].value_counts()

Low Risk       32145
High Risk       5399
Medium Risk     1098
Name: loan_status, dtype: int64
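
The counts above show a strongly imbalanced target. As a minimal sketch (the counts are copied from the output above; the variable names are illustrative), the class shares and the accuracy of a naive "always predict the most frequent class" baseline can be computed as follows. Any model accuracy reported later is easier to judge against this baseline.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Low Risk": 32145, "High Risk": 5399, "Medium Risk": 1098})

# Share of each class in the dataset
proportions = counts / counts.sum()
print(proportions.round(4))

# Accuracy of a baseline that always predicts the most frequent class
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```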

## <b> Data Cleaning </b>

In [57]:
#Check for null values
df.isnull().sum()

id             0
loan_amnt      0
funded_amnt    0
int_rate       0
installment    0
emp_length     0
annual_inc     0
loan_status    0
dtype: int64

In [58]:
#Check for duplicate values
df.duplicated().sum()

0

## <b> Feature Creation </b>
<b> fund_perc: </b> Ratio of the amount sanctioned to the loan amount requested. A higher value indicates that the bank is more willing to lend to the customer. <br>
<b> incToloan_perc: </b> Ratio of annual income to the loan amount. A higher value suggests that the customer is more likely to pay back without defaulting.

In [59]:
# Adding new variables
# fund_perc represents the ratio of funded amount to loan amount
df['fund_perc'] = df['funded_amnt']/df['loan_amnt']

# incToloan_perc represents the ratio of annual income to loan amount
df['incToloan_perc'] = df['annual_inc']/df['loan_amnt']
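
As a small worked example of the two derived features, the sketch below applies the same formulas to the five rows shown earlier by `df.head()` (the `sample` frame is an illustrative stand-in built from those values). Note that, despite the `_perc` suffix, both features are plain ratios rather than percentages.

```python
import pandas as pd

# First five rows copied from the df.head() output shown earlier
sample = pd.DataFrame({
    "loan_amnt":   [5000, 2500, 2400, 10000, 3000],
    "funded_amnt": [5000, 2500, 2400, 10000, 3000],
    "annual_inc":  [24000.0, 30000.0, 12252.0, 49200.0, 80000.0],
})

# Same formulas as in the cell above
sample["fund_perc"] = sample["funded_amnt"] / sample["loan_amnt"]
sample["incToloan_perc"] = sample["annual_inc"] / sample["loan_amnt"]

print(sample)
```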

In [60]:
# Understanding distribution of all the numerical variables in dataset
df.describe().round(2).transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 38642.0 | 681040.36 | 211304.55 | 54734.0 | 513435.0 | 662770.5 | 836491.25 | 1077501.0 |
| loan_amnt | 38642.0 | 11291.62 | 7462.14 | 500.0 | 5500.0 | 10000.0 | 15000.0 | 35000.0 |
| funded_amnt | 38642.0 | 11017.1 | 7193.04 | 500.0 | 5500.0 | 9950.0 | 15000.0 | 35000.0 |
| int_rate | 38642.0 | 12.05 | 3.72 | 5.42 | 9.32 | 11.86 | 14.59 | 24.59 |
| installment | 38642.0 | 326.76 | 209.14 | 15.69 | 168.44 | 282.83 | 434.4 | 1305.19 |
| emp_length | 38642.0 | 5.09 | 3.41 | 1.0 | 2.0 | 4.0 | 9.0 | 10.0 |
| annual_inc | 38642.0 | 69608.28 | 64253.2 | 4000.0 | 41400.0 | 60000.0 | 83199.99 | 6000000.0 |
| fund_perc | 38642.0 | 0.99 | 0.07 | 0.1 | 1.0 | 1.0 | 1.0 | 1.0 |
| incToloan_perc | 38642.0 | 8.92 | 13.85 | 1.2 | 4.0 | 6.07 | 10.02 | 1266.67 |


In [61]:
#column names
df.columns

Index(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment',
       'emp_length', 'annual_inc', 'loan_status', 'fund_perc',
       'incToloan_perc'],
      dtype='object')

## <b> Train-Test Split </b>

In [62]:
# choosing all the numerical variables as independent variables (classifier can only take numerical input)
# dropping id (an identifier) and funded_amnt (already captured by the derived fund_perc variable)
X = df.select_dtypes(np.number).drop(['id','funded_amnt'], axis=1)

#Dependent variable representing status of the loan
y = df['loan_status']

#splitting the dataset into train and test datasets
#Note: an integer test_size is a row count, so test_size=30 holds out exactly 30 rows for testing;
#a 70:30 split would need test_size=0.3
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=30, random_state=10)

# standardizing all the variables using standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
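
The split above passes `test_size=30`. In scikit-learn an integer `test_size` is read as an absolute number of rows, while a float between 0 and 1 is read as a fraction. The sketch below illustrates the difference on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the loan dataset) and also shows the optional `stratify` argument, which keeps class shares similar across the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 2 features (hypothetical, not the loan dataset)
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array(["Low Risk"] * 800 + ["High Risk"] * 150 + ["Medium Risk"] * 50)

# Integer test_size: absolute number of test rows
_, X_test_int, _, _ = train_test_split(X_demo, y_demo, test_size=30, random_state=10)

# Float test_size: fraction of rows; stratify keeps class shares similar in both splits
_, X_test_frac, _, y_test_frac = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo
)

print(len(X_test_int))   # rows held out with test_size=30
print(len(X_test_frac))  # rows held out with test_size=0.3
```

With a test set of only 30 rows, rare classes such as "Medium Risk" may be represented by a single observation, which makes per-class metrics very unstable.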


## <b> Model Building </b>

In [63]:
from sklearn.linear_model import LogisticRegression 
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import warnings

In [64]:

# Building a classification model using one vs rest method
LR = LogisticRegression()
oneVsRest = OneVsRestClassifier(LR)

# Fitting the model with training data
oneVsRest.fit(X_train_scaled, y_train)

OneVsRestClassifier(estimator=LogisticRegression())
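
The one-vs-rest strategy turns a three-class problem into three binary problems, one per class, each asking "is it this class or any other?". A minimal sketch of that relabelling step, using the five labels shown earlier by `df.head()` (the `binary_targets` name is illustrative):

```python
import numpy as np

# Labels of the five rows shown in the df.head() output earlier
labels = np.array(["Low Risk", "High Risk", "Low Risk", "Low Risk", "Medium Risk"])

# One binary target per class: 1 for "this class", 0 for "the rest"
binary_targets = {str(cls): (labels == cls).astype(int) for cls in np.unique(labels)}

for cls, target in binary_targets.items():
    print(f"{cls:12s} vs rest -> {target}")
```

Each of these binary targets is then used to fit its own copy of the base estimator, here a logistic regression.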

## <b> Model Prediction </b>

In [65]:
# Making a prediction on the test set
prediction_oneVsRest = oneVsRest.predict(X_test_scaled)
   
# Evaluating the model
print(f"Test set accuracy : {accuracy_score(y_test, prediction_oneVsRest)*100}%\n\n")
print(f"Classification report: \n\n {classification_report(y_test, prediction_oneVsRest)}")

Test set accuracy : 83.33333333333334%


Classification report: 

               precision    recall  f1-score   support

   High Risk       0.00      0.00      0.00         4
    Low Risk       0.83      1.00      0.91        25
 Medium Risk       0.00      0.00      0.00         1

    accuracy                           0.83        30
   macro avg       0.28      0.33      0.30        30
weighted avg       0.69      0.83      0.76        30





<b> Accuracy : </b> Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to the total number of observations. The formula is: <br>
<b> *Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + False Negatives + True Negatives)* </b> <br> <br>
<b> Precision : </b> Of all the observations predicted as positive, the fraction that are actually positive. The formula is : <br>
<b> *Precision = True Positives / (True Positives + False Positives)* </b> <br> <br>
<b> Recall : </b> Of all the observations that are actually positive, the fraction that the model correctly identifies. The result is a value between 0.0 for no recall and 1.0 for full or perfect recall. The formula is : <br>
<b> *Recall = True Positives / (True Positives + False Negatives)* </b> <br> <br>
<b> F1 score : </b> The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. The highest possible value is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, reached if either the precision or the recall is zero. The formula is : <br>
<b> *F1 score = 2\*((precision\*recall)/(precision+recall))* </b> <br> <br>
In a multiclass problem such as this one, precision, recall and F1 are computed per class by treating that class as positive and all other classes as negative.
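
To connect these formulas to the classification report above, the sketch below rebuilds the test labels from the reported supports (4 / 25 / 1). The predictions are an inference from the report rather than the stored model output: full recall with 0.83 precision for "Low Risk", and zero recall for the other two classes, imply that all 30 rows were predicted as "Low Risk".

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Labels reconstructed from the report above (supports 4 / 25 / 1);
# predictions inferred from the report: every test row predicted "Low Risk"
y_true = ["High Risk"] * 4 + ["Low Risk"] * 25 + ["Medium Risk"] * 1
y_pred = ["Low Risk"] * 30

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["High Risk", "Low Risk", "Medium Risk"], zero_division=0
)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision.round(2)}")
print(f"recall    = {recall.round(2)}")
print(f"f1        = {f1.round(2)}")
print(f"macro f1  = {f1.mean():.2f}")
```

The high accuracy therefore comes entirely from the majority class; the macro-averaged F1 score, which weights the three classes equally, gives a more sober picture.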

## <b>Analysing the probabilities and classification values </b>

In [66]:
# Adding the following variables to the test dataset

# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
X_test = X_test.copy()

#Scaled feature array
X_test['Scaled_features'] = X_test_scaled.tolist()

#Actual target variable
X_test['Actual'] = y_test

#OnevsRest target prediction
X_test['prediction_oneVsRest'] = prediction_oneVsRest

#OnevsRest probability prediction (computed once; columns follow oneVsRest.classes_)
proba_oneVsRest = oneVsRest.predict_proba(X_test_scaled)
X_test['prob_pred_oneVsRest'] = proba_oneVsRest.tolist()

#OnevsRest individual class prediction probabilities
X_test['prob_pred_oneVsRest_highRisk'] = proba_oneVsRest[:,0]
X_test['prob_pred_oneVsRest_lowRisk'] = proba_oneVsRest[:,1]
X_test['prob_pred_oneVsRest_mediumRisk'] = proba_oneVsRest[:,2]

X_test.head()


| | loan_amnt | int_rate | installment | emp_length | annual_inc | fund_perc | incToloan_perc | Scaled_features | Actual | prediction_oneVsRest | prob_pred_oneVsRest | prob_pred_oneVsRest_highRisk | prob_pred_oneVsRest_lowRisk | prob_pred_oneVsRest_mediumRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17614 | 25000 | 16.77 | 888.46 | 1 | 104000.0 | 1.0 | 4.16 | [1.8370160009053087, 1.2694953930271917, 2.685...] | Low Risk | Low Risk | [0.13572498246427991, 0.8642090689206677, 6.59...] | 0.135725 | 0.864209 | 6.6e-05 |
| 3728 | 3600 | 7.51 | 112.0 | 2 | 31800.0 | 1.0 | 8.833333 | [-1.0306474635995284, -1.2222166661825382, -1....] | Low Risk | Low Risk | [0.07644510149725968, 0.9200895460912942, 0.00...] | 0.076445 | 0.92009 | 0.003465 |
| 37880 | 9000 | 11.97 | 298.8 | 2 | 57500.0 | 1.0 | 6.388889 | [-0.3070314491917658, -0.022104810450853763, -...] | Low Risk | Low Risk | [0.11841953587921925, 0.878438228076339, 0.003...] | 0.11842 | 0.878438 | 0.003142 |
| 33951 | 17000 | 12.18 | 566.1 | 10 | 63504.0 | 1.0 | 3.735529 | [0.7649922758567714, 0.03440269845131049, 1.14...] | High Risk | Low Risk | [0.1216327375677756, 0.8775370992853335, 0.000...] | 0.121633 | 0.877537 | 0.00083 |
| 982 | 5500 | 14.27 | 188.7 | 5 | 24000.0 | 1.0 | 4.363636 | [-0.7760418289005008, 0.5967869537157096, -0.6...] | Low Risk | Low Risk | [0.19962021873041713, 0.7876522082236527, 0.01...] | 0.19962 | 0.787652 | 0.012728 |


## <b>Display the coefficient and intercept values for each Logistic Regression model </b>

In [67]:
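# Classes for which individual models are created
print(oneVsRest.classes_)

# Collect coefficients and intercepts from the fitted binary models
# (one LogisticRegression per class, stored in oneVsRest.estimators_)
Coeff_array = np.vstack([est.coef_ for est in oneVsRest.estimators_])
Intercept_array = np.array([est.intercept_ for est in oneVsRest.estimators_])

#Shape of the coefficient matrix: (number of classes, number of features)
print(Coeff_array.shape)

#Intercept values for all the models created
print("\n Intercept Values")
print(Intercept_array)

#Coefficient values for all the models created
print("\n Coefficient values")
print(Coeff_array)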

['High Risk' 'Low Risk' 'Medium Risk']
(3, 7)

 Intercept Values
[[-1.96532324]
 [ 1.79858623]
 [-5.16266085]]

 Coefficient values
[[ 0.40074855  0.57450034 -0.40606121  0.04898165 -0.356952    0.03009393
  -0.04645165]
 [-1.15154635 -0.62512408  1.13150054 -0.07120236  0.27887282 -0.2076321
   0.10740671]
 [ 5.0510304   0.99757143 -5.7033346   0.15929608  0.11869475  1.37161869
  -0.53510853]]




## <b>Analyse probability values for one test sample</b>

In [68]:
print(X_test.iloc[0]['prob_pred_oneVsRest'])

[0.13572498246427991, 0.8642090689206677, 6.594861505229697e-05]
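
The three values follow the order of `oneVsRest.classes_` and sum to 1, and the predicted label corresponds to the largest of them. A minimal sketch using the numbers printed above (the `classes` and `probs` arrays are copied from the outputs, not read from the model):

```python
import numpy as np

# Class order and probabilities copied from the outputs above
classes = np.array(["High Risk", "Low Risk", "Medium Risk"])
probs = np.array([0.13572498246427991, 0.8642090689206677, 6.594861505229697e-05])

# The predicted class is the one with the highest probability
predicted = classes[np.argmax(probs)]
print(predicted, round(float(probs.sum()), 6))
```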


## <b>Understand the mathematics and calculations inside the Model </b>

In [69]:
# The example below reproduces the predicted probabilities for one observation by hand,
# using the coefficients and intercept of each one-vs-rest model

# Choose the first observation
arr = X_test.iloc[0]['Scaled_features']

# Log of odds for a given class (1-based class index): intercept + coefficients . features
# (named log_odds rather than y so that the target variable y is not overwritten)
def log_odds(class_cat, input_arr):
    intercept = oneVsRest.estimators_[class_cat - 1].intercept_[0]
    return intercept + np.dot(Coeff_array[class_cat - 1], input_arr)

# Sigmoid: converts the log of odds into a probability
def sigmoid(a):
    return 1/(1+np.exp(-a))

#Non-normalized probability of all the classes
z_non_norm = tuple(sigmoid(log_odds(k, arr)) for k in (1, 2, 3))

#Normalized probability of all the classes (divide by the sum so they add up to 1)
z_norm = np.array(z_non_norm)/sum(z_non_norm)

print(z_non_norm)
print(z_norm)


(0.13966767227435964, 0.889313572365409, 6.786436355956968e-05)
[1.35724982e-01 8.64209069e-01 6.59486151e-05]
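
The same calculation can be checked for every row at once. The sketch below is a self-contained illustration on synthetic data (`make_classification` stands in for the loan dataset, and names such as `ovr` and `manual_proba` are hypothetical): it applies the sigmoid to each binary model's linear score, normalises each row, and compares the result with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data with 7 features (a stand-in, not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_demo, y_demo)

# Per-class sigmoid(intercept + w . x), one column per binary model
scores = np.column_stack([
    1 / (1 + np.exp(-(est.intercept_[0] + X_demo @ est.coef_[0])))
    for est in ovr.estimators_
])

# Normalise each row so the three values sum to 1
manual_proba = scores / scores.sum(axis=1, keepdims=True)

# Compare the hand calculation with the library output
print(np.allclose(manual_proba, ovr.predict_proba(X_demo)))
```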


## <b> Building Logistic Regression Model and using it in One vs One Classifier </b>

In [70]:
#Classification using OnevsOne method
LR1 = LogisticRegression()
OneVsOne = OneVsOneClassifier(LR1)

# Fitting the model with training data
OneVsOne.fit(X_train_scaled, y_train)
   
# Making a prediction on the test set
prediction_oneVsOne = OneVsOne.predict(X_test_scaled)

## <b> Model Prediction </b>

In [71]:
# Evaluating the model
print(f"Test set accuracy : {accuracy_score(y_test, prediction_oneVsOne)*100}%\n\n")
print(f"Classification report: \n\n {classification_report(y_test, prediction_oneVsOne)}")

Test set accuracy : 83.33333333333334%


Classification report: 

               precision    recall  f1-score   support

   High Risk       0.00      0.00      0.00         4
    Low Risk       0.83      1.00      0.91        25
 Medium Risk       0.00      0.00      0.00         1

    accuracy                           0.83        30
   macro avg       0.28      0.33      0.30        30
weighted avg       0.69      0.83      0.76        30





## <b>Analysing the probabilities and classification values </b>

In [72]:
# Adding the following variable to the test dataset

#OnevsOne target prediction
X_test['prediction_oneVsOne'] = prediction_oneVsOne
X_test.head()



| | loan_amnt | int_rate | installment | emp_length | annual_inc | fund_perc | incToloan_perc | Scaled_features | Actual | prediction_oneVsRest | prob_pred_oneVsRest | prob_pred_oneVsRest_highRisk | prob_pred_oneVsRest_lowRisk | prob_pred_oneVsRest_mediumRisk | prediction_oneVsOne |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17614 | 25000 | 16.77 | 888.46 | 1 | 104000.0 | 1.0 | 4.16 | [1.8370160009053087, 1.2694953930271917, 2.685...] | Low Risk | Low Risk | [0.13572498246427991, 0.8642090689206677, 6.59...] | 0.135725 | 0.864209 | 6.6e-05 | Low Risk |
| 3728 | 3600 | 7.51 | 112.0 | 2 | 31800.0 | 1.0 | 8.833333 | [-1.0306474635995284, -1.2222166661825382, -1....] | Low Risk | Low Risk | [0.07644510149725968, 0.9200895460912942, 0.00...] | 0.076445 | 0.92009 | 0.003465 | Low Risk |
| 37880 | 9000 | 11.97 | 298.8 | 2 | 57500.0 | 1.0 | 6.388889 | [-0.3070314491917658, -0.022104810450853763, -...] | Low Risk | Low Risk | [0.11841953587921925, 0.878438228076339, 0.003...] | 0.11842 | 0.878438 | 0.003142 | Low Risk |
| 33951 | 17000 | 12.18 | 566.1 | 10 | 63504.0 | 1.0 | 3.735529 | [0.7649922758567714, 0.03440269845131049, 1.14...] | High Risk | Low Risk | [0.1216327375677756, 0.8775370992853335, 0.000...] | 0.121633 | 0.877537 | 0.00083 | Low Risk |
| 982 | 5500 | 14.27 | 188.7 | 5 | 24000.0 | 1.0 | 4.363636 | [-0.7760418289005008, 0.5967869537157096, -0.6...] | Low Risk | Low Risk | [0.19962021873041713, 0.7876522082236527, 0.01...] | 0.19962 | 0.787652 | 0.012728 | Low Risk |


## <b>Display the parameters and coefficients for each Logistic Regression model </b>

In [73]:
# Classes seen by the one-vs-one classifier
OneVsOne.classes_

array(['High Risk', 'Low Risk', 'Medium Risk'], dtype=object)

In [74]:
OneVsOne.estimators_

(LogisticRegression(), LogisticRegression(), LogisticRegression())
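
With three classes, one-vs-one fits one binary model per unordered pair of classes, which gives 3 × 2 / 2 = 3 models, matching the three estimators listed above. The sketch below shows one way to display the intercept and coefficients of each pairwise model. It runs on synthetic stand-in data (not the loan dataset), and it assumes that `estimators_` follows the same pair order as `itertools.combinations` over `classes_`; all names in it are illustrative.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

# Synthetic 3-class data with 7 features (a stand-in, not the loan dataset)
X_demo, y_int = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)
names = np.array(["High Risk", "Low Risk", "Medium Risk"])
y_demo = names[y_int]

ovo = OneVsOneClassifier(LogisticRegression()).fit(X_demo, y_demo)

# One binary model per unordered pair of classes: k * (k - 1) / 2 = 3 for k = 3
# Assumption: estimators_ is ordered like combinations(classes_, 2)
pairs = list(combinations(ovo.classes_, 2))
for (first, second), est in zip(pairs, ovo.estimators_):
    print(f"{first} vs {second}: intercept={est.intercept_.round(3)}, "
          f"coef shape={est.coef_.shape}")
    print(est.coef_.round(3))
```

Each pairwise model is trained only on the rows belonging to its two classes, and the final prediction is obtained by voting across the pairwise models.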