# Credit Risk Segmentation Project

### Problem Statement:

The main objective of this project was to segment credit card users based on their risk levels. It aimed to develop a model that accurately predicts the risk level associated with a customer, helping the bank make informed decisions about credit limits, interest rates, and other credit-related policies.

### Project Process:

The project was executed using Python, with the following steps:

1. **Data Collection:** Gathered data from two sources: the bank's internal systems and national CIBIL data.
2. **Data Preprocessing:** Cleaned the data by handling missing values (encoded as `-99999`) and removing irrelevant columns, then merged the internal bank records with the external credit reports on `PROSPECTID` to ensure a complete dataset.
3. **Feature Engineering:** Created new features to provide the model with more insightful indicators, improving its ability to predict customer risk levels.
4. **Feature Selection:** Used the **Chi-squared test** for categorical features and the **Variance Inflation Factor (VIF)** followed by the **ANOVA test** for numerical features, removing redundant features and reducing multicollinearity to make the model simpler and more efficient.
5. **Model Training:** Trained a multi-class classification model using **XGBoost Classifier**.
6. **Model Evaluation:** The XGBoost model achieved a test accuracy of **77.83%** and a weighted F1 score of 0.76.
7. **Hyperparameter Tuning:** Used **Grid Search CV** to fine-tune the model parameters, reaching a best cross-validated accuracy of about 78.0%, up from 77.47% with the default parameters.

### Conclusion

This project demonstrated the effectiveness of machine learning for credit risk segmentation: categorizing customers by risk level helps mitigate credit risk and supports decisions on credit and lending policies in the banking sector.

Now, let's dive into the project!

### Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support
import warnings
import os
warnings.filterwarnings('ignore')

In [3]:
df1 = pd.read_excel("/content/cibil_data.xlsx")
df2 = pd.read_excel("/content/bank_data.xlsx")

In [4]:
print(df1.shape)
print(df2.shape)

(51336, 26)
(51336, 62)


In [5]:
df1.head()

Unnamed: 0,PROSPECTID,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,...,CC_TL,Consumer_TL,Gold_TL,Home_TL,PL_TL,Secured_TL,Unsecured_TL,Other_TL,Age_Oldest_TL,Age_Newest_TL
0,1,5,4,1,0,0,0.0,0.0,0.2,0.8,...,0,0,1,0,4,1,4,0,72,18
1,2,1,0,1,0,0,0.0,0.0,1.0,0.0,...,0,1,0,0,0,0,1,0,7,7
2,3,8,0,8,1,0,0.125,0.0,1.0,0.0,...,0,6,1,0,0,2,6,0,47,2
3,4,1,0,1,1,0,1.0,0.0,1.0,0.0,...,0,0,0,0,0,0,1,1,5,5
4,5,3,2,1,0,0,0.0,0.0,0.333,0.667,...,0,0,0,0,0,3,0,2,131,32


In [6]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PROSPECTID            51336 non-null  int64  
 1   Total_TL              51336 non-null  int64  
 2   Tot_Closed_TL         51336 non-null  int64  
 3   Tot_Active_TL         51336 non-null  int64  
 4   Total_TL_opened_L6M   51336 non-null  int64  
 5   Tot_TL_closed_L6M     51336 non-null  int64  
 6   pct_tl_open_L6M       51336 non-null  float64
 7   pct_tl_closed_L6M     51336 non-null  float64
 8   pct_active_tl         51336 non-null  float64
 9   pct_closed_tl         51336 non-null  float64
 10  Total_TL_opened_L12M  51336 non-null  int64  
 11  Tot_TL_closed_L12M    51336 non-null  int64  
 12  pct_tl_open_L12M      51336 non-null  float64
 13  pct_tl_closed_L12M    51336 non-null  float64
 14  Tot_Missed_Pmnt       51336 non-null  int64  
 15  Auto_TL            

In [7]:
df2.head()

Unnamed: 0,PROSPECTID,time_since_recent_payment,time_since_first_deliquency,time_since_recent_deliquency,num_times_delinquent,max_delinquency_level,max_recent_level_of_deliq,num_deliq_6mts,num_deliq_12mts,num_deliq_6_12mts,...,pct_CC_enq_L6m_of_L12m,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,max_unsec_exposure_inPct,HL_Flag,GL_Flag,last_prod_enq2,first_prod_enq2,Credit_Score,Approved_Flag
0,1,549,35,15,11,29,29,0,0,0,...,0.0,0.0,0.0,13.333,1,0,PL,PL,696,P2
1,2,47,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,0.86,0,0,ConsumerLoan,ConsumerLoan,685,P2
2,3,302,11,3,9,25,25,1,9,8,...,0.0,0.0,0.0,5741.667,1,0,ConsumerLoan,others,693,P2
3,4,-99999,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,9.9,0,0,others,others,673,P2
4,5,583,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,-99999.0,0,0,AL,AL,753,P1


In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 62 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PROSPECTID                    51336 non-null  int64  
 1   time_since_recent_payment     51336 non-null  int64  
 2   time_since_first_deliquency   51336 non-null  int64  
 3   time_since_recent_deliquency  51336 non-null  int64  
 4   num_times_delinquent          51336 non-null  int64  
 5   max_delinquency_level         51336 non-null  int64  
 6   max_recent_level_of_deliq     51336 non-null  int64  
 7   num_deliq_6mts                51336 non-null  int64  
 8   num_deliq_12mts               51336 non-null  int64  
 9   num_deliq_6_12mts             51336 non-null  int64  
 10  max_deliq_6mts                51336 non-null  int64  
 11  max_deliq_12mts               51336 non-null  int64  
 12  num_times_30p_dpd             51336 non-null  int64  
 13  n

### Removing Null Rows and Columns

In [9]:
df1 = df1.loc[df1['Age_Oldest_TL'] != -99999]

In [10]:
df1.shape

(51296, 26)

In [11]:
# Initialize an empty list to store column names for removal
columns_to_be_removed = []

# Check each column in the DataFrame
for i in df2.columns:
    # If the count of occurrences of -99999 in the column is greater than 10000
    if df2.loc[df2[i] == -99999].shape[0] > 10000:
        # Add the column name to the list of columns to be removed
        columns_to_be_removed.append(i)

# Drop the identified columns from the DataFrame
df2 = df2.drop(columns_to_be_removed, axis=1)

# Filter out rows where any column value is equal to -99999
for i in df2.columns:
    df2 = df2.loc[df2[i] != -99999]

In [12]:
df2.shape

(42066, 54)
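The cleaning cells above treat `-99999` as a missing-value sentinel: columns with too many sentinels are dropped, then any remaining row containing one is removed. A minimal vectorized sketch of the same idea follows; the helper name `drop_sentinel`, its `max_missing` parameter, and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = -99999  # missing-value marker used in the source data


def drop_sentinel(frame: pd.DataFrame, max_missing: int) -> pd.DataFrame:
    """Drop columns with more than `max_missing` sentinels, then rows with any sentinel."""
    masked = frame.replace(SENTINEL, np.nan)
    missing_per_column = masked.isna().sum()
    kept = masked.loc[:, missing_per_column <= max_missing]
    # Index back into the original frame so the integer dtypes are preserved
    return frame.loc[kept.notna().all(axis=1), kept.columns]


# Hypothetical toy data: column "b" is mostly missing, column "a" has one gap
toy = pd.DataFrame({
    "a": [1, 2, SENTINEL, 4],
    "b": [SENTINEL, SENTINEL, SENTINEL, 8],
    "c": [10, 20, 30, 40],
})
cleaned = drop_sentinel(toy, max_missing=2)
print(cleaned)
```

With the notebook's threshold this would be called as `drop_sentinel(df2, max_missing=10000)`. Note that the sketch also treats genuine `NaN` values as missing, which the original loop does not.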

### Merging Both Tables

In [13]:
# Checking common column names
for i in list(df1.columns):
    if i in list(df2.columns):
        print (i)

PROSPECTID


In [14]:
# Merge the two dataframes, inner join so that no nulls are present
df = pd.merge(df1, df2, how ='inner', left_on = ['PROSPECTID'], right_on = ['PROSPECTID'])
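An inner join silently drops any `PROSPECTID` that appears in only one table, and it multiplies rows if a key is duplicated. The sketch below, on hypothetical toy frames, shows two optional pandas safeguards: `validate` to assert key uniqueness and `indicator` to audit which rows would be lost. Whether the real tables satisfy the one-to-one assumption is not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 and df2
left = pd.DataFrame({"PROSPECTID": [1, 2, 3], "Total_TL": [5, 1, 8]})
right = pd.DataFrame({"PROSPECTID": [1, 3, 4], "Credit_Score": [696, 693, 673]})

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side
merged = pd.merge(left, right, on="PROSPECTID", how="inner", validate="one_to_one")

# An outer merge with indicator=True labels each row as both / left_only / right_only
audit = pd.merge(left, right, on="PROSPECTID", how="outer", indicator=True)

print(merged)
print(audit["_merge"].value_counts())
```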

In [15]:
df

Unnamed: 0,PROSPECTID,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,...,pct_PL_enq_L6m_of_L12m,pct_CC_enq_L6m_of_L12m,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,HL_Flag,GL_Flag,last_prod_enq2,first_prod_enq2,Credit_Score,Approved_Flag
0,1,5,4,1,0,0,0.000,0.00,0.200,0.800,...,0.0,0.0,0.000,0.0,1,0,PL,PL,696,P2
1,2,1,0,1,0,0,0.000,0.00,1.000,0.000,...,0.0,0.0,0.000,0.0,0,0,ConsumerLoan,ConsumerLoan,685,P2
2,3,8,0,8,1,0,0.125,0.00,1.000,0.000,...,0.0,0.0,0.000,0.0,1,0,ConsumerLoan,others,693,P2
3,5,3,2,1,0,0,0.000,0.00,0.333,0.667,...,0.0,0.0,0.000,0.0,0,0,AL,AL,753,P1
4,6,6,5,1,0,0,0.000,0.00,0.167,0.833,...,1.0,0.0,0.429,0.0,1,0,ConsumerLoan,PL,668,P3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42059,51332,3,0,3,1,0,0.333,0.00,1.000,0.000,...,0.0,0.0,0.000,0.0,0,0,ConsumerLoan,ConsumerLoan,650,P4
42060,51333,4,2,2,0,1,0.000,0.25,0.500,0.500,...,0.0,0.0,0.000,0.0,0,0,others,others,702,P1
42061,51334,2,1,1,1,1,0.500,0.50,0.500,0.500,...,1.0,0.0,1.000,0.0,0,0,ConsumerLoan,others,661,P3
42062,51335,2,1,1,0,0,0.000,0.00,0.500,0.500,...,0.0,0.0,0.000,0.0,0,0,ConsumerLoan,others,686,P2


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42064 entries, 0 to 42063
Data columns (total 79 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   PROSPECTID                  42064 non-null  int64  
 1   Total_TL                    42064 non-null  int64  
 2   Tot_Closed_TL               42064 non-null  int64  
 3   Tot_Active_TL               42064 non-null  int64  
 4   Total_TL_opened_L6M         42064 non-null  int64  
 5   Tot_TL_closed_L6M           42064 non-null  int64  
 6   pct_tl_open_L6M             42064 non-null  float64
 7   pct_tl_closed_L6M           42064 non-null  float64
 8   pct_active_tl               42064 non-null  float64
 9   pct_closed_tl               42064 non-null  float64
 10  Total_TL_opened_L12M        42064 non-null  int64  
 11  Tot_TL_closed_L12M          42064 non-null  int64  
 12  pct_tl_open_L12M            42064 non-null  float64
 13  pct_tl_closed_L12M          420

Cleaning and the inner join reduce the data from 51,336 to 42,064 rows. We can afford this loss of roughly 9,300 rows (about 18%).

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PROSPECTID,42064.0,25649.827477,14844.173396,1.0,12776.75,25706.5,38518.25,51336.0
Total_TL,42064.0,5.262980,7.463383,1.0,1.00,3.0,6.00,235.0
Tot_Closed_TL,42064.0,2.967383,6.141098,0.0,0.00,1.0,3.00,216.0
Tot_Active_TL,42064.0,2.295597,2.404086,0.0,1.00,2.0,3.00,47.0
Total_TL_opened_L6M,42064.0,0.812643,1.383559,0.0,0.00,0.0,1.00,27.0
...,...,...,...,...,...,...,...,...
pct_PL_enq_L6m_of_ever,42064.0,0.195497,0.367414,0.0,0.00,0.0,0.00,1.0
pct_CC_enq_L6m_of_ever,42064.0,0.064186,0.225989,0.0,0.00,0.0,0.00,1.0
HL_Flag,42064.0,0.252235,0.434300,0.0,0.00,0.0,1.00,1.0
GL_Flag,42064.0,0.056580,0.231042,0.0,0.00,0.0,0.00,1.0


## Feature Selection

### 1. For Categorical Features



In [17]:
# check how many columns are categorical
for i in df.columns:
    if df[i].dtype == 'object':
        print(i)

MARITALSTATUS
EDUCATION
GENDER
last_prod_enq2
first_prod_enq2
Approved_Flag


### -- Chi-square test

In [18]:
for i in ['MARITALSTATUS', 'EDUCATION', 'GENDER', 'last_prod_enq2', 'first_prod_enq2']:
    chi2, pval, _, _ = chi2_contingency(pd.crosstab(df[i], df['Approved_Flag']))
    print(i, '---', pval)

MARITALSTATUS --- 3.578180861038862e-233
EDUCATION --- 2.6942265249737532e-30
GENDER --- 1.907936100186563e-05
last_prod_enq2 --- 0.0
first_prod_enq2 --- 7.84997610555419e-287


For every column, the p-value is <= 0.05.

So we reject the null hypothesis (H0), which states that the column is not associated with `Approved_Flag`.

All five categorical columns are therefore kept for modeling.
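To make the test concrete, here is a small sketch on a hypothetical two-by-two example (the counts are invented for illustration). `chi2_contingency` compares the observed crosstab with the counts expected if the two variables were independent.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: gender is strongly associated with the flag
toy = pd.DataFrame({
    "GENDER": ["M"] * 50 + ["F"] * 50,
    "Approved_Flag": ["P1"] * 40 + ["P2"] * 10 + ["P1"] * 10 + ["P2"] * 40,
})

table = pd.crosstab(toy["GENDER"], toy["Approved_Flag"])
chi2, pval, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={pval:.3g}")
# `expected` holds the counts implied by independence (H0)
print(expected)
```

A small p-value here says only that an association exists, not how strong it is; with tens of thousands of rows, even weak associations can fall below 0.05.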


### 2. For Numerical Features

Here we first check for multicollinearity between the input features, using the VIF (Variance Inflation Factor).

### Multicollinearity

Multicollinearity occurs when two or more independent variables (IVs) are correlated with each other.

Problems with multicollinearity:

1. The effect of an individual IV can no longer be interpreted reliably
2. The coefficients of the IVs become unstable and misleading

Correlated features therefore need to be removed by some method (e.g., a VIF threshold).

-- **Variance Inflation Factor**

Used to identify multicollinearity among IVs.

Each IV is regressed on all the other IVs; the resulting R-squared gives VIF = 1 / (1 - R-squared), and the IV is eliminated if its VIF crosses a threshold.

* VIF ranges from 1 to infinity
* VIF = 1 : No multicollinearity
* VIF between 1 and 5 : Low multicollinearity
* VIF between 5 and 10 : Moderate multicollinearity
* VIF above 10 : High multicollinearity
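The definition above can be sketched directly with NumPy. The helper `vif` below is an illustrative assumption, not the notebook's method: it always adds an intercept to the auxiliary regression, whereas `statsmodels`' `variance_inflation_factor` regresses on the columns exactly as passed, so the two can give different values when the design matrix has no constant column.

```python
import numpy as np


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return float("inf") if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)


# Synthetic data: x3 is nearly a copy of x1, x2 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    print(j, round(vif(X, j), 2))
```

The two near-duplicate columns get a large VIF while the independent column stays close to 1, which is the pattern the threshold is meant to catch.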



In [19]:
# VIF for numerical columns
numeric_columns = []
for i in df.columns:
    if df[i].dtype != 'object' and i not in ['PROSPECTID','Approved_Flag']:
        numeric_columns.append(i)

In [20]:
len(numeric_columns)

72

In [21]:
vif_data = df[numeric_columns]
total_columns = vif_data.shape[1]
columns_to_be_kept = []
column_index = 0


# Loop for VIF Sequential Check
for i in range (0,total_columns):

    vif_value = variance_inflation_factor(vif_data, column_index)
    print (column_index,'---',vif_value)


    if vif_value <= 6:
        columns_to_be_kept.append( numeric_columns[i] )
        column_index = column_index+1

    else:
        vif_data = vif_data.drop([ numeric_columns[i] ] , axis=1)

0 --- inf
0 --- inf
0 --- 11.320180023967996
0 --- 8.363698035000336
0 --- 6.520647877790928
0 --- 5.149501618212625
1 --- 2.611111040579735
2 --- inf
2 --- 1788.7926256209232
2 --- 8.601028256477228
2 --- 3.832800792153077
3 --- 6.099653381646723
3 --- 5.581352009642766
4 --- 1.985584353098778
5 --- inf
5 --- 4.80953830281934
6 --- 23.270628983464636
6 --- 30.595522588100053
6 --- 4.384346405965583
7 --- 3.0646584155234238
8 --- 2.898639771299251
9 --- 4.377876915347324
10 --- 2.207853583695844
11 --- 4.916914200506864
12 --- 5.214702030064725
13 --- 3.3861625024231476
14 --- 7.840583309478997
14 --- 5.255034641721434
15 --- inf
15 --- 7.380634506427238
15 --- 1.4210050015175733
16 --- 8.083255010190316
16 --- 1.6241227524040114
17 --- 7.257811920140003
17 --- 15.59624383268298
17 --- 1.825857047132431
18 --- 1.5080839450032664
19 --- 2.172088834824578
20 --- 2.6233975535272274
21 --- 2.2959970812106176
22 --- 7.360578319196446
22 --- 2.1602387773102567
23 --- 2.8686288267891467
24 --

In [22]:
len(columns_to_be_kept)

39

The sequential VIF check has reduced the numerical columns from 72 to 39 by removing multicollinear features.

### -- ANOVA for columns_to_be_kept


In [23]:
from scipy.stats import f_oneway

In [24]:
columns_to_be_kept_numerical = []

for i in columns_to_be_kept:
    a = list(df[i])
    b = list(df['Approved_Flag'])

    group_P1 = [value for value, group in zip(a, b) if group == 'P1']
    group_P2 = [value for value, group in zip(a, b) if group == 'P2']
    group_P3 = [value for value, group in zip(a, b) if group == 'P3']
    group_P4 = [value for value, group in zip(a, b) if group == 'P4']


    f_statistic, p_value = f_oneway(group_P1, group_P2, group_P3, group_P4)

    if p_value <= 0.05:
        columns_to_be_kept_numerical.append(i)
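The same one-way ANOVA can be written more compactly with `groupby`, which also avoids hard-coding the four class labels. This is a sketch on synthetic data; the helper name `anova_pvalue` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def anova_pvalue(frame: pd.DataFrame, feature: str, target: str) -> float:
    """One-way ANOVA p-value of `feature` across the groups defined by `target`."""
    groups = [g[feature].to_numpy() for _, g in frame.groupby(target)]
    return f_oneway(*groups).pvalue


# Synthetic data: "shifted" has a different mean in each class, "noise" does not
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Approved_Flag": np.repeat(["P1", "P2", "P3", "P4"], 100),
    "shifted": np.concatenate([rng.normal(loc=m, size=100) for m in (0, 1, 2, 3)]),
    "noise": rng.normal(size=400),
})

for col in ("shifted", "noise"):
    print(col, anova_pvalue(toy, col, "Approved_Flag"))
```

In the notebook's setting, the loop would keep a column when `anova_pvalue(df, col, 'Approved_Flag') <= 0.05`.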

In [25]:
# Listing all the Final Features
features = columns_to_be_kept_numerical + ['MARITALSTATUS', 'EDUCATION', 'GENDER', 'last_prod_enq2', 'first_prod_enq2']
# Take an explicit copy so later in-place edits do not act on a view of df
data = df[features + ['Approved_Flag']].copy()

In [26]:
data.shape

(42064, 43)

In [27]:
data.sample(25)

Unnamed: 0,pct_tl_open_L6M,pct_tl_closed_L6M,Tot_TL_closed_L12M,pct_tl_closed_L12M,Tot_Missed_Pmnt,CC_TL,Home_TL,PL_TL,Secured_TL,Unsecured_TL,...,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,HL_Flag,GL_Flag,MARITALSTATUS,EDUCATION,GENDER,last_prod_enq2,first_prod_enq2,Approved_Flag
40200,0.095,0.048,1,0.048,1,2,0,2,12,9,...,0.714,0.5,1,0,Married,GRADUATE,M,PL,CC,P4
33184,0.4,0.2,5,0.333,4,0,0,0,15,0,...,0.0,0.0,1,0,Single,12TH,M,others,ConsumerLoan,P2
12172,0.5,0.0,0,0.0,1,0,0,1,3,1,...,1.0,0.0,1,0,Married,UNDER GRADUATE,F,PL,others,P2
35745,0.0,0.5,1,0.5,0,0,0,1,1,1,...,0.0,0.0,0,0,Married,SSC,F,ConsumerLoan,AL,P4
11552,0.104,0.083,5,0.104,2,2,0,2,18,30,...,0.0,0.167,1,0,Married,GRADUATE,M,ConsumerLoan,others,P2
33927,0.0,0.222,3,0.333,0,0,0,0,9,0,...,0.333,0.0,1,0,Married,SSC,M,PL,PL,P3
11557,0.5,0.5,1,0.5,1,0,0,2,0,2,...,0.6,0.0,0,0,Married,12TH,M,PL,PL,P3
3436,0.0,0.0,0,0.0,2,0,0,0,10,0,...,0.0,0.0,1,0,Married,GRADUATE,M,ConsumerLoan,others,P2
18085,0.0,0.333,1,0.333,0,0,0,0,1,2,...,0.0,0.0,0,0,Married,SSC,M,ConsumerLoan,ConsumerLoan,P2
15800,1.0,0.0,0,0.0,0,0,0,1,0,1,...,1.0,0.0,0,0,Married,12TH,M,ConsumerLoan,PL,P4


In [28]:
# Inspect the unique values of each categorical feature before encoding
print("MARITALSTATUS: ", df['MARITALSTATUS'].unique())
print("EDUCATION: ", df['EDUCATION'].unique())
print("GENDER: ", df['GENDER'].unique())
print("last_prod_enq2: ", df['last_prod_enq2'].unique())
print("first_prod_enq2: ", df['first_prod_enq2'].unique())

MARITALSTATUS:  ['Married' 'Single']
EDUCATION:  ['12TH' 'GRADUATE' 'SSC' 'POST-GRADUATE' 'UNDER GRADUATE' 'OTHERS'
 'PROFESSIONAL']
GENDER:  ['M' 'F']
last_prod_enq2:  ['PL' 'ConsumerLoan' 'AL' 'CC' 'others' 'HL']
first_prod_enq2:  ['PL' 'ConsumerLoan' 'others' 'AL' 'HL' 'CC']


### Ordinal feature -- EDUCATION
- SSC            : 1
- 12TH           : 2
- GRADUATE       : 3
- UNDER GRADUATE : 3
- POST-GRADUATE  : 4
- OTHERS         : 1
- PROFESSIONAL   : 3

##### The rank assigned to OTHERS has to be verified with the business end user

In [29]:
# Map each education level to its ordinal rank (see the list above)
education_rank = {
    'SSC': 1,
    '12TH': 2,
    'GRADUATE': 3,
    'UNDER GRADUATE': 3,
    'POST-GRADUATE': 4,
    'OTHERS': 1,
    'PROFESSIONAL': 3,
}
data['EDUCATION'] = data['EDUCATION'].map(education_rank).astype(int)

data['EDUCATION'].value_counts()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42064 entries, 0 to 42063
Data columns (total 43 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   pct_tl_open_L6M            42064 non-null  float64
 1   pct_tl_closed_L6M          42064 non-null  float64
 2   Tot_TL_closed_L12M         42064 non-null  int64  
 3   pct_tl_closed_L12M         42064 non-null  float64
 4   Tot_Missed_Pmnt            42064 non-null  int64  
 5   CC_TL                      42064 non-null  int64  
 6   Home_TL                    42064 non-null  int64  
 7   PL_TL                      42064 non-null  int64  
 8   Secured_TL                 42064 non-null  int64  
 9   Unsecured_TL               42064 non-null  int64  
 10  Other_TL                   42064 non-null  int64  
 11  Age_Oldest_TL              42064 non-null  int64  
 12  Age_Newest_TL              42064 non-null  int64  
 13  time_since_recent_payment  42064 non-null  int

### One-Hot Encoding

In [30]:
df_encoded = pd.get_dummies(data, columns=['MARITALSTATUS','GENDER', 'last_prod_enq2' ,'first_prod_enq2'])

df_encoded.info()
k = df_encoded.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42064 entries, 0 to 42063
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   pct_tl_open_L6M               42064 non-null  float64
 1   pct_tl_closed_L6M             42064 non-null  float64
 2   Tot_TL_closed_L12M            42064 non-null  int64  
 3   pct_tl_closed_L12M            42064 non-null  float64
 4   Tot_Missed_Pmnt               42064 non-null  int64  
 5   CC_TL                         42064 non-null  int64  
 6   Home_TL                       42064 non-null  int64  
 7   PL_TL                         42064 non-null  int64  
 8   Secured_TL                    42064 non-null  int64  
 9   Unsecured_TL                  42064 non-null  int64  
 10  Other_TL                      42064 non-null  int64  
 11  Age_Oldest_TL                 42064 non-null  int64  
 12  Age_Newest_TL                 42064 non-null  int64  
 13  t

### Machine Learning Model Building

In [35]:
# Data processing
# 1. Random Forest

y = df_encoded['Approved_Flag']
x = df_encoded.drop(['Approved_Flag'], axis=1)




x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


rf_classifier = RandomForestClassifier(n_estimators = 200, random_state=42)

rf_classifier.fit(x_train, y_train)

y_pred = rf_classifier.predict(x_test)


accuracy = accuracy_score(y_test, y_pred)
print ()
print(f'Accuracy: {accuracy}')
print ()
overall_f1_score = precision_recall_fscore_support(y_test, y_pred, average='weighted')[2]
print(f"Overall F1 Score: {overall_f1_score:.2f}")

print ()
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)


for i, v in enumerate(['p1', 'p2', 'p3', 'p4']):
    print(f"Class {v}:")
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")
    print()


Accuracy: 0.7636990372043266

Overall F1 Score: 0.74

Class p1:
Precision: 0.8370457209847597
Recall: 0.7041420118343196
F1 Score: 0.7648634172469203

Class p2:
Precision: 0.7957519116397621
Recall: 0.9282457879088206
F1 Score: 0.8569075937785909

Class p3:
Precision: 0.4423380726698262
Recall: 0.21132075471698114
F1 Score: 0.28600612870275793

Class p4:
Precision: 0.7178502879078695
Recall: 0.7269193391642371
F1 Score: 0.7223563495895703



In [39]:
# 2. xgboost

import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

xgb_classifier = xgb.XGBClassifier(objective='multi:softmax',  num_class=4)


y = df_encoded['Approved_Flag']
x = df_encoded.drop(['Approved_Flag'], axis=1)


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


x_train, x_test, y_train, y_test = train_test_split(x, y_encoded, test_size=0.2, random_state=42)


xgb_classifier.fit(x_train, y_train)
y_pred = xgb_classifier.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print ()
print(f'Accuracy: {accuracy:.2f}')

overall_f1_score = precision_recall_fscore_support(y_test, y_pred, average='weighted')[2]
print(f"Overall F1 Score: {overall_f1_score:.2f}")

print ()

precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1', 'p2', 'p3', 'p4']):
    print(f"Class {v}:")
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")
    print()


Accuracy: 0.78
Overall F1 Score: 0.76

Class p1:
Precision: 0.823906083244397
Recall: 0.7613412228796844
F1 Score: 0.7913890312660173

Class p2:
Precision: 0.8255418233924413
Recall: 0.913577799801784
F1 Score: 0.8673315769665036

Class p3:
Precision: 0.4756380510440835
Recall: 0.30943396226415093
F1 Score: 0.3749428440786465

Class p4:
Precision: 0.7342386032977691
Recall: 0.7356656948493683
F1 Score: 0.7349514563106796



In [36]:
# 3. Decision Tree
from sklearn.tree import DecisionTreeClassifier


y = df_encoded['Approved_Flag']
x = df_encoded.drop(['Approved_Flag'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


dt_model = DecisionTreeClassifier(max_depth=20, min_samples_split=10)
dt_model.fit(x_train, y_train)
y_pred = dt_model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print ()
print(f"Accuracy: {accuracy:.2f}")
print ()
overall_f1_score = precision_recall_fscore_support(y_test, y_pred, average='weighted')[2]
print(f"Overall F1 Score: {overall_f1_score:.2f}")

print ()

precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

for i, v in enumerate(['p1', 'p2', 'p3', 'p4']):
    print(f"Class {v}:")
    print(f"Precision: {precision[i]}")
    print(f"Recall: {recall[i]}")
    print(f"F1 Score: {f1_score[i]}")
    print()


Accuracy: 0.71

Overall F1 Score: 0.71

Class p1:
Precision: 0.7204724409448819
Recall: 0.7218934911242604
F1 Score: 0.7211822660098522

Class p2:
Precision: 0.8084320963668156
Recall: 0.824777006937562
F1 Score: 0.8165227629513344

Class p3:
Precision: 0.3415213946117274
Recall: 0.32528301886792454
F1 Score: 0.3332044839582528

Class p4:
Precision: 0.6487854251012146
Recall: 0.6229348882410107
F1 Score: 0.6355974219137334



XGBoost achieves the highest weighted F1 score of the three models (0.76, versus 0.74 for Random Forest and 0.71 for the Decision Tree), so we use XGBoost for further development and the final pipeline.
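The per-class loops in the three model cells can be condensed with `classification_report`, which is already imported at the top of the notebook. The sketch below uses hypothetical toy labels and also shows how `LabelEncoder.classes_` recovers the original `P1` to `P4` names from the integer codes XGBoost is trained on.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical true and predicted labels, for illustration only
y_true_labels = np.array(["P1", "P2", "P2", "P3", "P4", "P2", "P1", "P4"])
y_pred_labels = np.array(["P1", "P2", "P2", "P2", "P4", "P2", "P3", "P4"])

encoder = LabelEncoder()
y_true = encoder.fit_transform(y_true_labels)  # P1..P4 -> 0..3 (sorted order)
y_pred = encoder.transform(y_pred_labels)

# target_names puts the original class names back into the report
report = classification_report(
    y_true, y_pred, target_names=list(encoder.classes_), zero_division=0
)
print(report)

# inverse_transform maps integer predictions back to class names
print(encoder.inverse_transform([0, 3]))
```

Taking the names from `encoder.classes_` rather than a hand-written list keeps the report aligned with the encoder's actual label order.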

In [None]:
data.to_csv('credit_data.csv', index = False)

### Hyperparameter-tuning on XGBoost Algorithm

In [37]:
from sklearn.model_selection import cross_val_score

In [40]:
# Now use the encoded labels for cross-validation
cross_val_score(xgb_classifier, x_train, y_train, cv=5, scoring='accuracy').mean()

0.7747467738698969

In [41]:
test_pred = xgb_classifier.predict(x_test)

accuracy_score(y_test,test_pred)

0.7783192677998336

In [42]:
overall_f1_score = precision_recall_fscore_support(y_test, test_pred, average='weighted')[2]
print(f"Overall F1 Score: {overall_f1_score:.2f}")

Overall F1 Score: 0.76


### Find optimal tuning parameters for the pipeline

In [43]:
from sklearn.model_selection import GridSearchCV

In [44]:
param_grid = {
    'eta': [0.01, 0.1, 0.15],
    'gamma': [0.1, 0.2, 0.3],
    'max_depth': [5, 6, 7, 8, 9, 10],
}

grid_search = GridSearchCV(xgb_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

best_model = grid_search.best_estimator_

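test_pred = best_model.predict(x_test)
# Score the tuned model's own predictions. The recorded output below came from a run
# that passed y_pred (the untuned model's predictions), so its F1 is the untuned score.
print(f"Test accuracy: {accuracy_score(y_test, test_pred):.4f}")
overall_f1_score = precision_recall_fscore_support(y_test, test_pred, average='weighted')[2]
print(f"Overall F1 Score: {overall_f1_score:.2f}")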

Best parameters: {'eta': 0.15, 'gamma': 0.2, 'max_depth': 5}
Best score: 0.7801552374710344
Overall F1 Score: 0.76
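Note that `best_score_` is the mean cross-validated accuracy on the training folds, not a test-set score. The sketch below keeps the two apart on synthetic data; it swaps in a `DecisionTreeClassifier` and `make_classification` purely so the example runs quickly without XGBoost, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class problem standing in for the credit data
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      {"max_depth": [3, 5, 7]}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the refitted best estimator on data it has never seen
tuned_pred = search.best_estimator_.predict(X_test)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 4))
print("test accuracy:", round(accuracy_score(y_test, tuned_pred), 4))
print("test weighted F1:", round(f1_score(y_test, tuned_pred, average="weighted"), 4))
```

Reporting both numbers side by side makes it clear whether a tuning gain seen in cross-validation carries over to the held-out set.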


# Conclusion

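Using `Grid Search CV`, I fine-tuned the model parameters (`eta=0.15`, `gamma=0.2`, `max_depth=5`), reaching a best 5-fold cross-validated accuracy of about 78.0%, compared with 77.47% for the default XGBoost model, which scored 77.83% on the held-out test set.

The gain from tuning is modest (roughly half a percentage point in cross-validated accuracy), and class P3 remains the hardest to predict (F1 of 0.37, versus 0.73 to 0.87 for the other classes).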

This project underscores the importance of leveraging data-driven approaches to mitigate credit risk and optimize financial strategies in the banking sector.