
# **BANKING CREDIT RISK MODEL**

# **Problem Statement**
A credit score is a numerical indicator of an individual's creditworthiness, typically ranging from 300 to 850. It assists lenders in evaluating the risk associated with providing loans. A higher credit score signifies strong creditworthiness, indicating a greater likelihood of timely repayments, whereas a lower score suggests increased lending risk.


**Credit Risk Model Development**

Credit risk model development involves creating statistical or machine learning models to predict the likelihood of a borrower defaulting on their financial obligations. These models are widely used in banking, lending, and financial services to quantify risk before approving loans or credit.

A key model in this field is the Probability of Default (PD) model, which estimates the likelihood of a borrower defaulting on a loan within a specified period, typically one year. This model is crucial for financial institutions to manage credit risk effectively and make informed lending decisions.
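
To make the idea of a PD model concrete, here is a minimal sketch of how a logistic model turns a linear score into a probability of default. The coefficient values, the intercept, and the example borrower are hypothetical and chosen only for illustration; they are not estimates from this notebook.

```python
import math

def probability_of_default(features, coefficients, intercept=0.0):
    """Logistic PD: 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    score = intercept + sum(coefficients[name] * value
                            for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients and borrower, for illustration only
example_coefficients = {"utilization": 2.0, "times_30dpd": 0.5}
example_borrower = {"utilization": 0.9, "times_30dpd": 2}

pd_estimate = probability_of_default(example_borrower, example_coefficients,
                                     intercept=-4.0)
print(round(pd_estimate, 4))
```

A score of zero maps to a PD of 0.5, and larger scores push the PD towards 1.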

# **DATA EXPLORATION, VALIDATION AND CLEANING**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


In [15]:
Credit_data=pd.read_csv('/content/cs-training.csv',encoding='Latin-1')

In [None]:
Credit_data.columns

In [None]:
Credit_data.info()
#Observation: Null values observed in MonthlyIncome and NumberOfDependents

In [None]:
Credit_data.head(10)

In [20]:
#Creating a duplicate dataset for working and keeping the masterdata untouched
Credit_data_clean=Credit_data.copy(deep=True)

In [None]:
Credit_data_clean.head(10)

In [None]:
Credit_data_clean.shape

(150000, 12)

In [None]:
Credit_data_clean.info()

In [None]:
Credit_data_clean.columns.values

# **Categorical Variable Exploration**

**SeriousDlqin2yrs**

In [None]:
Credit_data_clean['SeriousDlqin2yrs'].value_counts()
#Obs: Class Imbalance exists

In [None]:
Credit_data_clean['SeriousDlqin2yrs'].value_counts(normalize=True)
#Value in %

**Age**

In [None]:
Credit_data_clean['age'].quantile((.25,.50,.75,.90,.95,.99,1))
#Obs: About 1% of the values are outliers

**30 DPD**

In [None]:
Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse'].value_counts()
#Obs: Outlier codes 96 and 98 present

**No of loans**

In [None]:
Credit_data_clean['NumberOfOpenCreditLinesAndLoans'].value_counts()
#Obs: Outliers present

**90 DPD**

In [None]:
print(Credit_data_clean['NumberOfTimes90DaysLate'].value_counts())
#Obs: Outlier codes 96 and 98 present

**No of Home Loans**

In [None]:
Credit_data_clean['NumberRealEstateLoansOrLines'].value_counts()
#Obs: Outliers present

**60 DPD**

In [None]:
Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse'].value_counts()
#Obs: Outlier codes 96 and 98 present

**No of dependents**

In [None]:
Credit_data_clean['NumberOfDependents'].value_counts()
#Obs: Outliers present and the data type is float

In [36]:
Credit_data_clean.columns

Index(['Sr_No', 'SeriousDlqin2yrs', 'monthly_utilization', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents'],
      dtype='object')

# **CONTINUOUS VARIABLE EXPLORATION**

**Monthly_utilization**

In [None]:
Credit_data_clean['monthly_utilization'].quantile([.25,.50,.75,.85,.90,.95,.96,.97,.98,.99,1])
#Obs:3% outliers

**Monthly Income**

In [None]:
Credit_data_clean['MonthlyIncome'].quantile([.25,.50,.75,.90,.95,.98,.99,1])

In [46]:
(Credit_data_clean['MonthlyIncome'].isnull().sum())
#Obs:Null values

29731

In [47]:
Credit_data_clean['MonthlyIncome'].isnull().sum()/len(Credit_data_clean)
#Obs: About 19.8% of the values are null

0.19820666666666667

**Debt Ratio**

In [None]:
Credit_data_clean['DebtRatio'].quantile([.25,.50,.75,.765,.81,.90,.95,.98,.99,1])

#Obs: About 76% of the values look clean; the remaining 24% are outliers

# **DATA CLEANING**

In [50]:
Credit_data_clean=Credit_data_clean.drop(columns='Sr_No')
#Dropping the serial number column (columns= already names the axis, so axis=1 is not needed)

**Monthly_utilization**

In [55]:
Credit_data_clean['monthly_utilization_new']=Credit_data_clean['monthly_utilization']

In [None]:
#Use .loc instead of chained indexing so the assignment is applied to the DataFrame itself
Credit_data_clean.loc[Credit_data_clean['monthly_utilization']>1,'monthly_utilization_new']=Credit_data_clean['monthly_utilization'].median()

Outliers above 100% utilization were replaced with the median utilization.
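
The replacement rule can be tried on a small toy column. This is a sketch with made-up values, not the notebook's data; it uses a single `DataFrame.loc` call, which selects the rows and the column together so the write reaches the DataFrame rather than a temporary copy.

```python
import pandas as pd

# Toy data standing in for the utilization column (values are made up)
toy = pd.DataFrame({"monthly_utilization": [0.10, 0.45, 0.80, 3.50, 12.0]})

median_utilization = toy["monthly_utilization"].median()
toy["monthly_utilization_new"] = toy["monthly_utilization"]

# Rows above 100% utilization receive the median of the original column
toy.loc[toy["monthly_utilization"] > 1, "monthly_utilization_new"] = median_utilization

print(toy)
```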

**Age**

In [None]:
Credit_data_clean['age'].quantile([0,.10,.25,.50,.75,.90,.95,.99,1])

In [59]:
Credit_data_clean['age_clean']=Credit_data_clean['age']

In [None]:
Credit_data_clean['age'].quantile([0,.01,.10,.25,.50,.75,.90,.95,.99,1])

In [None]:
#Flooring: ages below 24 are set to 24
Credit_data_clean.loc[Credit_data_clean['age']<24,'age_clean']=24
#Capping: ages above 80 are set to 80
Credit_data_clean.loc[Credit_data_clean['age']>80,'age_clean']=80
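
Flooring and capping can also be written in one step with `Series.clip`. The sketch below uses made-up ages and the same bounds of 24 and 80; it is an alternative way to express the same rule, not the notebook's own code.

```python
import pandas as pd

# Made-up ages, including values below the floor and above the cap
ages = pd.Series([0, 21, 24, 52, 80, 109], name="age")

# clip() raises values below `lower` and lowers values above `upper`
age_clean = ages.clip(lower=24, upper=80)

print(age_clean.tolist())
```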

In [None]:
Credit_data_clean=Credit_data_clean.drop('age',axis=1)
#Dropped the old var after cleaning and creating a new var

**30 DPD**

In [63]:
Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse_clean']=Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse']


In [64]:
Cross_tab_1=pd.crosstab(Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse_clean'],Credit_data_clean['SeriousDlqin2yrs'],normalize='index')

In [None]:
Cross_tab_1
#As per the class distribution, the best replacement for the codes 96 and 98 is 6.
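
One possible reading of "as per the class distribution" is to pick the legitimate count whose default rate is closest to the default rate of the sentinel codes. That reading is an assumption about the reasoning, and the toy data below is made up; the sketch only shows how such a choice could be automated.

```python
import pandas as pd

# Made-up toy data: past-due counts with sentinel codes 96/98, plus the target
toy = pd.DataFrame({
    "times_past_due": [0, 0, 0, 0, 1, 1, 2, 2, 2, 98, 98, 96, 96],
    "default":        [0, 0, 0, 1, 0, 1, 1, 1, 0, 1,  1,  1,  0],
})

# Default rate per value (the class-1 column of a row-normalised crosstab)
bad_rate = toy.groupby("times_past_due")["default"].mean()

# Pooled default rate of the sentinel codes
sentinel_mask = toy["times_past_due"] > 24
sentinel_rate = toy.loc[sentinel_mask, "default"].mean()

# Among legitimate values, take the one with the closest default rate
legit_rates = bad_rate[bad_rate.index <= 24]
replacement = (legit_rates - sentinel_rate).abs().idxmin()

print(sentinel_rate, replacement)
```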

In [None]:
Credit_data_clean.loc[Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse']>24,'NumberOfTime30-59DaysPastDueNotWorse_clean']=6
#Replace the value

In [None]:
Credit_data_clean['NumberOfTime30-59DaysPastDueNotWorse_clean'].value_counts()
Credit_data_clean=Credit_data_clean.drop('NumberOfTime30-59DaysPastDueNotWorse',axis=1)
#Drop the  old column

**DebtRatio**

In [None]:
Credit_data_clean['DebtRatio'].quantile([.25,.50,.765,.95,.99,1])
#As the outliers are significant in number we do Imputation as well as Flagging.

In [None]:
Credit_data_clean['DebtRatio_clean']=Credit_data_clean['DebtRatio']
Credit_data_clean.loc[Credit_data_clean['DebtRatio']>1,'DebtRatio_clean']=Credit_data_clean['DebtRatio'].median()
#Imputed

In [None]:
Credit_data_clean['DebtRatio_Flag']=1
Credit_data_clean.loc[Credit_data_clean['DebtRatio']>1,'DebtRatio_Flag']=0
Credit_data_clean['DebtRatio_Flag'].value_counts()
#Flagged

In [None]:
Credit_data_clean['DebtRatio_clean'].quantile([.25,.50,.765,.95,.99,1])

**Monthly Income**

In [None]:
Credit_data_clean['MonthlyIncome'].isnull().sum()/len(Credit_data_clean)
#As the Null values are significant in number we do Imputation as well as Flagging.

In [None]:
Credit_data_clean['MonthlyIncome_clean']=Credit_data_clean['MonthlyIncome']
Credit_data_clean.loc[Credit_data_clean['MonthlyIncome'].isnull(),'MonthlyIncome_clean']=Credit_data_clean['MonthlyIncome'].median()
#Imputed

In [None]:
#Flag is 1 where the income was observed and 0 where it was imputed
Credit_data_clean['MonthlyIncome_Flag']=1
Credit_data_clean.loc[Credit_data_clean['MonthlyIncome'].isnull(),'MonthlyIncome_Flag']=0
#Flagged
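
The impute-and-flag pattern used for DebtRatio and MonthlyIncome can be wrapped in a small helper. This is a sketch on made-up data; the function name `impute_and_flag` is hypothetical, and it follows the same convention of 1 for an original value and 0 for an imputed one.

```python
import pandas as pd

def impute_and_flag(frame, column, needs_fix):
    """Add <column>_clean (median-imputed) and <column>_Flag (1 = kept, 0 = imputed)."""
    out = frame.copy()
    median_value = out[column].median()  # median() skips NaN by default
    # where() keeps values where the condition is True and fills the rest
    out[column + "_clean"] = out[column].where(~needs_fix, median_value)
    out[column + "_Flag"] = (~needs_fix).astype(int)
    return out

# Made-up incomes with two missing values
toy = pd.DataFrame({"MonthlyIncome": [3000.0, None, 5000.0, None, 7000.0]})
result = impute_and_flag(toy, "MonthlyIncome", toy["MonthlyIncome"].isnull())

print(result)
```

Keeping the flag lets the model learn whether "value was missing" carries information of its own, separately from the imputed number.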

In [None]:
Credit_data_clean['MonthlyIncome_Flag'].value_counts()

In [None]:
Credit_data_clean=Credit_data_clean.drop('MonthlyIncome',axis=1)
#Drop the  old column

**No of Loans**

In [None]:
Credit_data_clean['NumberOfOpenCreditLinesAndLoans'].quantile([.25,.50,.75,.90,.95,.99,1])

In [None]:
Credit_data_clean['NumberOfOpenCreditLinesAndLoans_clean']=Credit_data_clean['NumberOfOpenCreditLinesAndLoans']
Credit_data_clean.loc[Credit_data_clean['NumberOfOpenCreditLinesAndLoans']>18,'NumberOfOpenCreditLinesAndLoans_clean']=Credit_data_clean['NumberOfOpenCreditLinesAndLoans'].median()
#Imputed

In [None]:
Credit_data_clean['NumberOfOpenCreditLinesAndLoans_clean'].quantile([.25,.50,.75,.90,.95,.99,1])

In [None]:
Credit_data_clean=Credit_data_clean.drop('NumberOfOpenCreditLinesAndLoans',axis=1)
#Drop the  old column

**90 DPD**

In [None]:
Credit_data_clean['NumberOfTimes90DaysLate'].value_counts()

In [82]:
cross_tab_2=pd.crosstab(Credit_data_clean['NumberOfTimes90DaysLate'],Credit_data_clean['SeriousDlqin2yrs'],normalize='index')

In [None]:
cross_tab_2
#As per the class distribution, the best replacement for the codes 96 and 98 is 3.

In [None]:
Credit_data_clean['NumberOfTimes90DaysLate_clean']=Credit_data_clean['NumberOfTimes90DaysLate']
Credit_data_clean.loc[Credit_data_clean['NumberOfTimes90DaysLate']>24,'NumberOfTimes90DaysLate_clean']=3

In [None]:
Credit_data_clean=Credit_data_clean.drop('NumberOfTimes90DaysLate',axis=1)
#Dropped old Var

**No of Home Loans**

In [None]:
Credit_data_clean['NumberRealEstateLoansOrLines'].quantile([.25,.50,.75,.90,.95,.99,1])

In [None]:
Credit_data_clean['NumberRealEstateLoansOrLines_clean']=Credit_data_clean['NumberRealEstateLoansOrLines']
Credit_data_clean.loc[Credit_data_clean['NumberRealEstateLoansOrLines']>10,'NumberRealEstateLoansOrLines_clean']=Credit_data_clean['NumberRealEstateLoansOrLines'].median()
#Imputing the 1% outlier

In [None]:
Credit_data_clean['NumberRealEstateLoansOrLines_clean'].quantile([.25,.50,.75,.90,.95,.99,1])

In [87]:
Credit_data_clean=Credit_data_clean.drop('NumberRealEstateLoansOrLines',axis=1)
#Dropped old Var

**60 DPD**

In [None]:
Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse'].value_counts()

In [90]:
cross_tab_3=pd.crosstab(Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse'],Credit_data_clean['SeriousDlqin2yrs'],normalize='index')

In [None]:
cross_tab_3
#As per the class distribution, the best replacement for the codes 96 and 98 is 3.

In [None]:
Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse_clean']=Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse']
Credit_data_clean.loc[Credit_data_clean['NumberOfTime60-89DaysPastDueNotWorse']>24,'NumberOfTime60-89DaysPastDueNotWorse_clean']=3

In [None]:
Credit_data_clean=Credit_data_clean.drop('NumberOfTime60-89DaysPastDueNotWorse',axis=1)
#Dropped old Var

**Number Of Dependents**

In [None]:
Credit_data_clean['NumberOfDependents'].quantile([.25,.50,.75,.90,.95,.99,.9999,1])
#<1% of Data is above 10

In [None]:
Credit_data_clean['NumberOfDependents'].isnull().sum()/len(Credit_data_clean)
#About 2.6% of the values are null

In [None]:
Credit_data_clean["NumberOfDependents_new"]=Credit_data_clean["NumberOfDependents"]
Credit_data_clean.loc[(Credit_data_clean["NumberOfDependents"]>10)|
                (Credit_data_clean["NumberOfDependents"].isnull()),"NumberOfDependents_new"]=Credit_data_clean["NumberOfDependents"].median()

#Replace the outliers (above 10) and the null values with the median

In [None]:
Credit_data_clean['NumberOfDependents_new']=Credit_data_clean['NumberOfDependents_new'].astype(int)
#Changing the data type to int

In [113]:
Credit_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 13 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   SeriousDlqin2yrs                            150000 non-null  int64  
 1   monthly_utilization_new                     150000 non-null  float64
 2   age_clean                                   150000 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse_clean  150000 non-null  int64  
 4   DebtRatio_clean                             150000 non-null  float64
 5   DebtRatio_Flag                              150000 non-null  int64  
 6   MonthlyIncome_clean                         150000 non-null  float64
 7   MonthlyIncome_Flag                          150000 non-null  int64  
 8   NumberOfOpenCreditLinesAndLoans_clean       150000 non-null  int64  
 9   NumberRealEstateLoansOrLines_clean          150000 non-null  int64  
 

# **MODEL BUILDING USING LOGISTIC REGRESSION**

In [155]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split

In [114]:
x=Credit_data_clean.drop('SeriousDlqin2yrs',axis=1)
y=Credit_data_clean['SeriousDlqin2yrs']

xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=123)

#Split the data into train and test data

In [116]:
model=sm.Logit(ytrain,xtrain).fit()
print(model.summary())

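#Build the model
#Note: sm.Logit does not add an intercept automatically, so this model is fitted without a constant term
#(sm.add_constant can be applied to the features if an intercept is wanted)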

Optimization terminated successfully.
         Current function value: 0.209203
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:       SeriousDlqin2yrs   No. Observations:               105000
Model:                          Logit   Df Residuals:                   104988
Method:                           MLE   Df Model:                           11
Date:                Thu, 13 Feb 2025   Pseudo R-squ.:                  0.1476
Time:                        19:12:44   Log-Likelihood:                -21966.
converged:                       True   LL-Null:                       -25770.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                                 coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
monthly_utilization_new                     

In [None]:
#Predictions on the train and test data
predict_train=model.predict(xtrain)
predict_test=model.predict(xtest)

#Accuracy on train data
cm_train=confusion_matrix(ytrain,np.round(predict_train))
Accuracy_train_data=accuracy_score(ytrain,np.round(predict_train))
print(cm_train)
print(Accuracy_train_data)

#Accuracy on test data
cm_test=confusion_matrix(ytest,np.round(predict_test))
Accuracy_test=accuracy_score(ytest,np.round(predict_test))
print(cm_test)
print(Accuracy_test)


##Overall accuracy is about 93.6% on both train and test data
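
The overall accuracy needs a baseline to be meaningful. The sketch below takes the training confusion matrix printed in the Model Validation section of this notebook and compares the model with a rule that predicts "no default" for every borrower. The variable names are illustrative.

```python
# Training confusion matrix as reported in this notebook:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm_train_reported = [[97277, 703], [6048, 972]]

actual_0 = sum(cm_train_reported[0])
actual_1 = sum(cm_train_reported[1])
total = actual_0 + actual_1

# Share of correct predictions made by the fitted model
model_accuracy = (cm_train_reported[0][0] + cm_train_reported[1][1]) / total

# Accuracy of always predicting class 0
always_zero_accuracy = actual_0 / total

print(round(model_accuracy, 4), round(always_zero_accuracy, 4))
```

Because only about 7% of borrowers are in class 1, the two numbers are nearly identical, which is why accuracy alone says little about how well defaulters are detected.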


**Check for Multicollinearity**

In [135]:
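#VIF function: regress each predictor on all the others and report 1/(1-R^2)
#Imported as smf so the sm alias (statsmodels.api) needed for sm.Logit is not overwritten
import statsmodels.formula.api as smf

def vif_cal(Data,col):
    #Drop the target column so only predictors remain
    x_vars=Data.drop([col], axis=1)
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        y=x_vars[xvar_names[i]]
        x=x_vars[xvar_names.drop(xvar_names[i])]
        rsq=smf.ols(formula="y~x", data=x_vars).fit().rsquared
        vif=round(1/(1-rsq),2)
        print (xvar_names[i], " VIF = " , vif)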

In [136]:
vif_cal(Credit_data_clean,'SeriousDlqin2yrs')

#All VIF values are below 5, so there is no sign of problematic multicollinearity

monthly_utilization_new  VIF =  1.18
age_clean  VIF =  1.16
NumberOfTime30-59DaysPastDueNotWorse_clean  VIF =  1.23
DebtRatio_clean  VIF =  1.4
DebtRatio_Flag  VIF =  3.29
MonthlyIncome_clean  VIF =  1.04
MonthlyIncome_Flag  VIF =  3.38
NumberOfOpenCreditLinesAndLoans_clean  VIF =  1.34
NumberRealEstateLoansOrLines_clean  VIF =  1.46
NumberOfDependents_new  VIF =  1.12
NumberOfTimes90DaysLate_clean  VIF =  1.21
NumberOfTime60-89DaysPastDueNotWorse_clean  VIF =  1.3
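
For reference, the same quantity, VIF_j = 1 / (1 - R²_j), can be computed with NumPy least squares. This is a sketch on synthetic data with a hypothetical helper name `vif_table`; it is not the notebook's function, but it applies the same definition with an intercept in each auxiliary regression.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Regress each column on the others (plus intercept) and return 1/(1-R^2)."""
    values = features.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    result = {}
    for j in range(n_cols):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        result[features.columns[j]] = 1.0 / (1.0 - r_squared)
    return pd.Series(result, name="VIF")

# Synthetic data: "a_plus_noise" is nearly a copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})

vifs = vif_table(toy)
print(vifs.round(2))
```

The two nearly collinear columns get large VIF values, while the independent column stays close to 1.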


# **MODEL VALIDATION & CLASS IMBALANCE**

---



In [146]:
#Class Based Accuracy
print('Train_data_matrix',cm_train)
print('Test_data_matrix',cm_test)

Train_data_matrix [[97277   703]
 [ 6048   972]]
Test_data_matrix [[41670   324]
 [ 2568   438]]


In [148]:
#Class 0 and Class 1 Accuracy

Class_0_Acc=cm_train[0,0]/(cm_train[0,0]+cm_train[0,1])
Class_1_Acc=cm_train[1,1]/(cm_train[1,0]+cm_train[1,1])

print("Class-0 Accuracy",Class_0_Acc)
print("Class-1 Accuracy",Class_1_Acc)

#Class-1 accuracy (the share of actual defaulters that the model catches) is very low

Class-0 Accuracy 0.9928250663400694
Class-1 Accuracy 0.13846153846153847
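
The class-wise accuracies above have standard names: Class-0 accuracy is specificity and Class-1 accuracy is recall. The sketch below recomputes them from the reported training confusion matrix and adds precision. The helper name `class_metrics` is illustrative.

```python
def class_metrics(cm):
    """cm uses the scikit-learn layout: rows = actual, columns = predicted."""
    (tn, fp), (fn, tp) = cm
    return {
        "specificity": tn / (tn + fp),  # Class-0 accuracy in this notebook
        "recall": tp / (tp + fn),       # Class-1 accuracy in this notebook
        "precision": tp / (tp + fp),    # predicted defaulters who did default
    }

# Training confusion matrix reported above (before SMOTE)
baseline_train = class_metrics([[97277, 703], [6048, 972]])

for name, value in baseline_train.items():
    print(name, round(value, 4))
```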


**Use SMOTE to handle class imbalance**

In [150]:
from imblearn.over_sampling import SMOTE

#Using SMOTE to oversample the minority class
smote=SMOTE(sampling_strategy=0.6,random_state=23)
xtrain_smote,ytrain_smote=smote.fit_resample(xtrain,ytrain)
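
With a float `sampling_strategy`, SMOTE oversamples until the minority-to-majority ratio reaches that value. The arithmetic below is a sketch that uses the training class counts from the confusion matrix reported earlier in this notebook. The total it produces can be compared with the `No. Observations` value in the model summary for the resampled data.

```python
# Class counts in the training split, taken from the reported confusion matrix
majority_train = 97277 + 703   # actual class 0 rows
minority_train = 6048 + 972    # actual class 1 rows

# sampling_strategy=0.6 means minority / majority = 0.6 after resampling
target_ratio = 0.6
minority_after = round(majority_train * target_ratio)

# Rows SMOTE has to synthesise, and the size of the resampled training set
synthetic_rows = minority_after - minority_train
total_after = majority_train + minority_after

print(minority_after, synthetic_rows, total_after)
```

Only the training split is resampled; the test split keeps its original class balance, so test metrics still reflect the real-world default rate.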

In [156]:
model_1=sm.Logit(ytrain_smote,xtrain_smote).fit()
print(model_1.summary())

Optimization terminated successfully.
         Current function value: 0.518354
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:       SeriousDlqin2yrs   No. Observations:               156768
Model:                          Logit   Df Residuals:                   156756
Method:                           MLE   Df Model:                           11
Date:                Thu, 13 Feb 2025   Pseudo R-squ.:                  0.2165
Time:                        20:09:29   Log-Likelihood:                -81261.
converged:                       True   LL-Null:                   -1.0371e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                                 coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
monthly_utilization_new                     

In [160]:
#Predictions on the train and test data
predict_train_smote=model_1.predict(xtrain_smote)
predict_test_smote=model_1.predict(xtest)

#Confusion Matrix
cm_train=confusion_matrix(ytrain_smote,np.round(predict_train_smote))
cm_test=confusion_matrix(ytest,np.round(predict_test_smote))

print(cm_train)
print(cm_test)

[[83278 14702]
 [25693 33095]]
[[35733  6261]
 [ 1259  1747]]


In [161]:
#Accuracy on test and train data (printed in that order)
smote_test_accuracy=accuracy_score(ytest,np.round(predict_test_smote))
smote_train_accuracy=accuracy_score(ytrain_smote,np.round(predict_train_smote))
print(smote_test_accuracy)
print(smote_train_accuracy)

0.8328888888888889
0.7423262400489896


In [163]:
#Class 0 and Class 1 Accuracy

Class_0_Accuracy=cm_train[0,0]/(cm_train[0,0]+cm_train[0,1])
Class_1_Accuracy=cm_train[1,1]/(cm_train[1,0]+cm_train[1,1])
print('Class-0 Accuracy',Class_0_Accuracy)
print('Class-1 Accuracy',Class_1_Accuracy)



Class-0 Accuracy 0.8499489691773832
Class-1 Accuracy 0.5629550248350004


In [None]:
#Comparison of class-wise accuracy on the train data (after SMOTE: the resampled train data)
#Before SMOTE
#Class-0 Accuracy 0.9928250663400694
#Class-1 Accuracy 0.13846153846153847
#After SMOTE
#Class-0 Accuracy 0.8499489691773832
#Class-1 Accuracy 0.5629550248350004
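
The comparison above uses training data. The same check can be made on the untouched test split with balanced accuracy, which is the mean of the two class-wise accuracies. This sketch uses the two test confusion matrices reported in this notebook; the helper name is illustrative.

```python
def balanced_accuracy(cm):
    """Mean of Class-0 and Class-1 accuracy; rows = actual, columns = predicted."""
    (tn, fp), (fn, tp) = cm
    return 0.5 * (tn / (tn + fp) + tp / (tp + fn))

# Test confusion matrices reported in this notebook
test_before_smote = [[41670, 324], [2568, 438]]
test_after_smote = [[35733, 6261], [1259, 1747]]

balanced_before = balanced_accuracy(test_before_smote)
balanced_after = balanced_accuracy(test_after_smote)

print(round(balanced_before, 3), round(balanced_after, 3))
```

Balanced accuracy rises after SMOTE even though plain accuracy falls, because the gain in detecting defaulters outweighs the loss on non-defaulters when both classes are weighted equally.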