#                                Home Credit Default Risk - Modelling

## Completed By  

## ABHIRAM MANNAM, JOSH HAWLEY, NEIL SAMUEL PULUKURI, KUSHAL RAM TAYI


## Table Of Contents
##### 1. Introduction
##### 2. Data Exploration

    2.1 Loading Libraries

    2.2 Loading Dataset

    2.3 Explore Dataset

##### 3. Dataset Cleaning And Imputing
##### 4. Cleaning The Test Dataset
##### 5. Majority Class Of Target Variable
##### 6. Pre-Modeling
##### 7. Logistic Regression Model
##### 8. VIF (Variance Inflation Factor)
##### 9. LASSO Model
##### 10. Random Forest Model
##### 11. XG Boost Model
##### 12. Light Gradient Boost Model
##### 13. Results
##### 14. Team Members Contribution
##### 15. Conclusion




### 1. Introduction

Many people struggle to get loans because their credit histories are insufficient or non-existent, and this underserved population is often taken advantage of by untrustworthy lenders. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience.

In this notebook, we explore the problem of credit default risk at Home Credit.

To address this challenge, Home Credit leverages various sources of alternative data, such as telecom and transactional information, to predict their clients' repayment abilities.

Our analysis begins with data cleaning, where we preprocess and prepare the data for modeling. We then build predictive models on the cleaned data to assess credit default risk. We focus on the target variable and compare different machine learning models using metrics such as accuracy and ROC AUC.

The objective is to identify the best-performing model based on Kaggle scores. We start with logistic regression and then explore penalized regression, specifically LASSO, because of collinearity issues. Additionally, we examine Random Forest, XGBoost and Light Gradient Boosting (LightGBM) to assess their effectiveness in predicting credit default risk.

### 2. Data Exploration

#### 2.1 Loading Libraries

In [1]:
# NumPy (np) for numerical arrays and pandas (pd) for dataframe support.
# Matplotlib and seaborn for visualizations; warnings are silenced to keep the output readable.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### 2.2 Loading Dataset

In [2]:
# Loading the application_train.csv to data.
data=pd.read_csv("application_train.csv")

#### 2.3 Explore Dataset

In [3]:
# head provides the first 5 rows of the dataset.
data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
# Tail provides the last 5 rows of the dataset.
data.tail()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
307510,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0


From the head() and tail() output we can see that the dataset has 122 columns and 307511 rows. Two important columns are SK_ID_CURR, the loan application ID, and TARGET, the target (default) variable.



In [5]:
data.duplicated().sum()

0

There are no duplicated rows; all rows are unique.

In [6]:
# Provides rows and columns of the dataset and also their dtypes.
data.shape
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB


We can see that the dataset has 307511 rows and 122 columns: 65 columns are of float64 type, 41 are of int64 type and 16 are of object type.

In [7]:
data.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)

Here we can see the names of the 122 columns. Examples are CODE_GENDER, the gender of the applicant; CNT_CHILDREN, the number of children the applicant has; and AMT_INCOME_TOTAL, the income of the applicant.

FLAG_DOCUMENT_2, FLAG_DOCUMENT_3, ..., FLAG_DOCUMENT_20, FLAG_DOCUMENT_21: These features are binary flags indicating the presence or absence of specific documents in the applicant's file. Each flag represents a different type of document (e.g., identification documents, income documents, etc.).

AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT and AMT_REQ_CREDIT_BUREAU_YEAR represent the number of inquiries or requests made to the Credit Bureau about the applicant within specific time intervals. Each feature corresponds to a different time unit (hour, day, week, month, quarter, year).

In [8]:
# Provide data type of every column in the dataset.
data.dtypes

SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
                               ...   
AMT_REQ_CREDIT_BUREAU_DAY     float64
AMT_REQ_CREDIT_BUREAU_WEEK    float64
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
Length: 122, dtype: object

As mentioned above the dataset has int64, float64 along with object type columns

### 3. Dataset Cleaning And Imputing

Let's take a look at the numeric and object columns and summarize them.



In [9]:
# Summary of the all numeric columns
numeric_cols = data.select_dtypes(exclude=['object']).columns
numeric_summary = data[numeric_cols].describe()
numeric_summary

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,307511.0,307511.0,307511.0,307511.0,307511.0,307499.0,307233.0,307511.0,307511.0,307511.0,...,307511.0,307511.0,307511.0,307511.0,265992.0,265992.0,265992.0,265992.0,265992.0,265992.0
mean,278180.518577,0.080729,0.417052,168797.9,599026.0,27108.573909,538396.2,0.020868,-16036.995067,63815.045904,...,0.00813,0.000595,0.000507,0.000335,0.006402,0.007,0.034362,0.267395,0.265474,1.899974
std,102790.175348,0.272419,0.722121,237123.1,402490.8,14493.737315,369446.5,0.013831,4363.988632,141275.766519,...,0.089798,0.024387,0.022518,0.018299,0.083849,0.110757,0.204685,0.916002,0.794056,1.869295
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189145.5,0.0,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-2760.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278202.0,0.0,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1213.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367142.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-289.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,...,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0


We can see some extreme values in the numeric columns: the maximum number of children is 19 and the maximum AMT_INCOME_TOTAL is 117000000, which are unusual but plausible for those columns. The DAYS_EMPLOYED maximum, however, is 365243 days, which is roughly 1,000 years and therefore not possible. We will look at outliers in more detail at a later stage.
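As a quick sanity check on DAYS_EMPLOYED, the sketch below converts a few values from the summary table into years. The three values are copied from the describe() output above (min, median, max); the 365.25-day year is an assumption made only for this illustration.

```python
import pandas as pd

# Min, median and max of DAYS_EMPLOYED, copied from the describe() output above.
days_employed = pd.Series([-17912, -1213, 365243], name="DAYS_EMPLOYED")

# Convert days to years, assuming an average year of 365.25 days.
years_employed = days_employed / 365.25
print(years_employed.round(1))
```

Negative values count days backwards from the application date, so only the large positive maximum falls outside any realistic employment length.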




Let's Summarize object type columns

In [10]:
# Unique values in categorical values.
categorical_cols = data.select_dtypes(include=['object']).columns
for col in categorical_cols:
    unique_values = data[col].unique()
    print(f"\nUnique values in {col}:")
    print(unique_values)


Unique values in NAME_CONTRACT_TYPE:
['Cash loans' 'Revolving loans']

Unique values in CODE_GENDER:
['M' 'F' 'XNA']

Unique values in FLAG_OWN_CAR:
['N' 'Y']

Unique values in FLAG_OWN_REALTY:
['Y' 'N']

Unique values in NAME_TYPE_SUITE:
['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan
 'Other_B' 'Group of people']

Unique values in NAME_INCOME_TYPE:
['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed'
 'Student' 'Businessman' 'Maternity leave']

Unique values in NAME_EDUCATION_TYPE:
['Secondary / secondary special' 'Higher education' 'Incomplete higher'
 'Lower secondary' 'Academic degree']

Unique values in NAME_FAMILY_STATUS:
['Single / not married' 'Married' 'Civil marriage' 'Widow' 'Separated'
 'Unknown']

Unique values in NAME_HOUSING_TYPE:
['House / apartment' 'Rented apartment' 'With parents'
 'Municipal apartment' 'Office apartment' 'Co-op apartment']

Unique values in OCCUPATION_TYPE:
['Laborers' 'Core staff' 'Accountants' 'Managers

We can see there are some nan (null) values in some of the columns. Values such as Unknown, not specified, XNA and Other are not nulls: they are categories in their own right, used when the value was unknown at the time or not applicable to the applicant.

Let's clean nan (null) values

In [11]:
# Calculating percentage of missing values in the columns.
missingdata=(data.isna().sum()/len(data))
missingdata.sort_values(ascending=False)

COMMONAREA_MEDI             0.698723
COMMONAREA_AVG              0.698723
COMMONAREA_MODE             0.698723
NONLIVINGAPARTMENTS_MODE    0.694330
NONLIVINGAPARTMENTS_AVG     0.694330
                              ...   
NAME_HOUSING_TYPE           0.000000
NAME_FAMILY_STATUS          0.000000
NAME_EDUCATION_TYPE         0.000000
NAME_INCOME_TYPE            0.000000
SK_ID_CURR                  0.000000
Length: 122, dtype: float64

We can see that some columns are nearly 70% null, and such columns would provide little information to the model. Let's choose 45% as the cutoff: columns in which at least 45% of the values are null carry too little usable variation and will be dropped.
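To make the cutoff rule concrete, here is a minimal sketch on a hypothetical three-column toy frame (the frame and its column names are invented for illustration). It relies on `isna().mean()`, which gives the fraction of nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is 75% missing, "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", None, "y", "z"],
})

cutoff = 0.45  # same cutoff as in the text
missing_fraction = toy.isna().mean()  # fraction of nulls per column
to_drop = missing_fraction[missing_fraction >= cutoff].index
toy_reduced = toy.drop(columns=to_drop)
print(list(toy_reduced.columns))
```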

In [12]:
# Numeric columns in which at least 45% of the values are missing.
application_train_numeric= data.select_dtypes(include=['number'])
missings=application_train_numeric.loc[:,application_train_numeric.isna().mean() >= 0.45]
highly_missing_features_numeric = missings.columns
missings.columns

Index(['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
       'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG',
       'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
       'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE',
       'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE',
       'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE',
       'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
       'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI',
       'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI',
       'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'TO

Let's remove the numeric columns in which nulls make up at least 45% of the values.

In [13]:
# Dropping the columns
data=data.drop(highly_missing_features_numeric,axis=1)
data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


After dropping we can see that we have 77 columns remaining.

Let's now drop the object-type columns that pass the same 45% cutoff.

In [14]:
# Object columns in which at least 45% of the values are missing.
application_train_object= data.select_dtypes(include=['object'])
missings=application_train_object.loc[:,application_train_object.isna().mean() >= 0.45]
highly_missing_features_object = missings.columns
missings.columns

Index(['FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE',
       'EMERGENCYSTATE_MODE'],
      dtype='object')

In [15]:
# Dropping the object columns.
data=data.drop(highly_missing_features_object,axis="columns")
data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


We can see that after dropping the numeric and object type columns we finally have 73 columns.

Next, we need to impute or remove the remaining nan (null) values. Before that, let's check whether the numeric columns contain outliers.

In [16]:
# Using IQR to find outliers.
outliers={}
for i in data.columns:
  if data[i].dtype != 'object':
    Q1 = data[i].quantile(0.25)
    Q3 = data[i].quantile(0.75)
    IQR = Q3 - Q1
    # identify outliers
    threshold = 1.5
    outliers[i] = data[(data[i] < Q1 - threshold * IQR) | (data[i] > Q3 + threshold * IQR)]
outliers;


Many of the numeric columns contain outliers. We therefore impute missing values in numeric columns with the median, since the mean is sensitive to outliers.



For object-type columns, nan means a null or missing value, whereas XNA, others, unknown and not specified are real categories that describe the applicant. So we impute nan (null) in object-type columns with the mode.
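The reasoning above can be illustrated on hypothetical toy data (the values below are invented): a single extreme value pulls the mean far from the bulk of the data, while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical income-like column with one extreme value and one missing value.
income = pd.Series([100.0, 120.0, 130.0, 150.0, 10_000.0, np.nan])

# The mean is pulled up by the extreme value; the median is not.
print("mean:", income.mean(), "median:", income.median())

# Median imputation for a numeric column, mode imputation for an object column.
income_filled = income.fillna(income.median())
suite = pd.Series(["Family", "Unaccompanied", None, "Unaccompanied"])
suite_filled = suite.fillna(suite.mode().iloc[0])
```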



In [17]:
# Imputing the mode for object columns and the median for numeric columns.
for column in data.columns:
    if data[column].dtype == 'object':
        data[column] = data[column].fillna(data[column].mode().iloc[0])
    else:
        data[column] = data[column].fillna(data[column].median())

In [18]:
missingdata=data.isna().sum()
missingdata.sort_values(ascending=False)

SK_ID_CURR                    0
REG_CITY_NOT_WORK_CITY        0
FLAG_DOCUMENT_8               0
FLAG_DOCUMENT_7               0
FLAG_DOCUMENT_6               0
                             ..
FLAG_CONT_MOBILE              0
FLAG_WORK_PHONE               0
FLAG_EMP_PHONE                0
FLAG_MOBIL                    0
AMT_REQ_CREDIT_BUREAU_YEAR    0
Length: 73, dtype: int64

Now we can see that the number of nan (null) values is 0.

We know that DAYS_EMPLOYED has some outliers, as mentioned above. So let's cap the values of that column that fall outside the bounds given by the IQR (interquartile range).

In [19]:
def remove_outlier(col):
  # Returns the lower and upper IQR bounds of a column; it does not modify the data.
  Q1,Q3=col.quantile([.25,.75])
  IQR=Q3-Q1
  lower_range=Q1-(1.5*IQR)
  upper_range=Q3+(1.5*IQR)
  return lower_range,upper_range

In [20]:
# The bounds depend only on DAYS_EMPLOYED, so they are computed once (no loop over columns is needed).
lowlevel,uplevel=remove_outlier(data['DAYS_EMPLOYED'])
# Cap values outside the bounds instead of dropping rows.
data['DAYS_EMPLOYED']=np.where(data['DAYS_EMPLOYED']>uplevel,uplevel,data['DAYS_EMPLOYED'])
data['DAYS_EMPLOYED']=np.where(data['DAYS_EMPLOYED']<lowlevel,lowlevel,data['DAYS_EMPLOYED'])


In [21]:
data['DAYS_EMPLOYED'].describe()

count    307511.000000
mean      -1203.542428
std        2732.404969
min       -6466.500000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max        3417.500000
Name: DAYS_EMPLOYED, dtype: float64

After capping the outliers, DAYS_EMPLOYED has a maximum value of 3417.5 days.
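The step above replaces values outside the IQR bounds rather than dropping rows. A minimal sketch of the same idea on hypothetical toy values, using pandas `clip`; the helper name `iqr_bounds` is invented for this example:

```python
import pandas as pd

def iqr_bounds(col: pd.Series, k: float = 1.5):
    """Return the (lower, upper) IQR bounds of a column (hypothetical helper)."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy column with one implausibly large value.
days = pd.Series([-3000.0, -2000.0, -1000.0, -500.0, 365243.0])
low, high = iqr_bounds(days)

# Capping keeps every row; values outside the bounds are replaced by the bound.
capped = days.clip(lower=low, upper=high)
print(len(capped), capped.max())
```

Because every row is kept, the row count of the dataset is unchanged, which matches the count of 307511 in the describe() output above.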

In [22]:
# The FLAG_DOCUMENT columns only flag whether the customer submitted a given document, so we drop them (together with WEEKDAY_APPR_PROCESS_START).

flag_documents = ['WEEKDAY_APPR_PROCESS_START','FLAG_DOCUMENT_2','FLAG_DOCUMENT_3','FLAG_DOCUMENT_4','FLAG_DOCUMENT_5','FLAG_DOCUMENT_6','FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8','FLAG_DOCUMENT_9','FLAG_DOCUMENT_10','FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13','FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16','FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19','FLAG_DOCUMENT_20','FLAG_DOCUMENT_21']

In [23]:
data.drop(columns=flag_documents, inplace=True)

In [24]:
data.shape

(307511, 52)

In [25]:
data.to_csv(r'train.csv', index=False)

We drop the document-flag columns, together with WEEKDAY_APPR_PROCESS_START, which leaves a dataset with 52 columns. This reduced dataset is then saved to a CSV file named 'train.csv'. Discarding these columns streamlines the dataset for the modeling steps that follow.

### 4. Cleaning The Test Dataset

Load the test dataset

In [26]:
testdata=pd.read_csv("application_test.csv") # Loading the dataset.

In [27]:
testdata.head() # Head gives first five rows of the dataset.

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [28]:
testdata.tail() # Tail gives last five rows of the dataset.

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,,,,,,
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
48743,456250,Cash loans,F,Y,N,0,135000.0,312768.0,24709.5,270000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0


We can see that there are 121 columns in the dataset.

In [29]:
testdata.duplicated().sum() # To find duplicates in the test dataset.

0

There are no duplicates in the test dataset

In [30]:
print(testdata.shape) # To print shape and information about the dataset.
testdata.info()

(48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB


The dataset has 121 columns and 48744 rows, with 65 float-type, 40 int-type and 16 object-type columns. This matches the train dataset except for the TARGET column, which the test dataset does not have.

Let's remove all the columns which we have removed for the train dataset

In [31]:
testdata_clean=testdata.drop(highly_missing_features_numeric,axis=1) # Dropping the numeric columns.
testdata_clean.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [32]:
testdata_clean=testdata_clean.drop(highly_missing_features_object,axis=1) # Dropping the object columns.
testdata_clean.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


After dropping, 72 columns remain: the same columns as in the train dataset at this stage, minus TARGET.

Let's look at the remaining columns. We can see there are still missing values.

In [33]:
missingdata=(testdata_clean.isna().sum()/len(testdata_clean))
missingdata.sort_values(ascending=False) # Fraction of missing values in the remaining columns, in descending order.

OCCUPATION_TYPE                0.320142
EXT_SOURCE_3                   0.177827
AMT_REQ_CREDIT_BUREAU_YEAR     0.124097
AMT_REQ_CREDIT_BUREAU_QRT      0.124097
AMT_REQ_CREDIT_BUREAU_MON      0.124097
                                 ...   
REG_REGION_NOT_LIVE_REGION     0.000000
REG_REGION_NOT_WORK_REGION     0.000000
LIVE_REGION_NOT_WORK_REGION    0.000000
REG_CITY_NOT_LIVE_CITY         0.000000
REG_CITY_NOT_WORK_CITY         0.000000
Length: 72, dtype: float64

OCCUPATION_TYPE has nearly 32% missing values, while several other columns have roughly 12% to 18%. Let's impute the median for numeric columns and the mode for object-type columns, as we did for the train dataset.



In [34]:
for column in testdata_clean.columns:
    if testdata_clean[column].dtype == 'object':
        testdata_clean[column] = testdata_clean[column].fillna(testdata_clean[column].mode().iloc[0]) # Impute with mode for object columns.
    else:
        testdata_clean[column] = testdata_clean[column].fillna(testdata_clean[column].median())# Impute with median for numeric columns.

In [35]:
missingdata=(testdata_clean.isna().sum()/len(testdata_clean))  # Let's see any missing values are present.
missingdata.sort_values(ascending=False)

SK_ID_CURR                    0.0
NAME_CONTRACT_TYPE            0.0
FLAG_DOCUMENT_8               0.0
FLAG_DOCUMENT_7               0.0
FLAG_DOCUMENT_6               0.0
                             ... 
FLAG_CONT_MOBILE              0.0
FLAG_WORK_PHONE               0.0
FLAG_EMP_PHONE                0.0
FLAG_MOBIL                    0.0
AMT_REQ_CREDIT_BUREAU_YEAR    0.0
Length: 72, dtype: float64

We can see that after imputing there are no missing values left.

In [36]:
testdata_clean.drop(columns=flag_documents, inplace=True)
#testdata_clean.drop(columns="ORGANIZATION_TYPE", inplace=True)

In [37]:
testdata_clean.shape

(48744, 51)

In [38]:
testdata_clean.to_csv(r'test.csv', index=False)

We remove the document-flag columns (and WEEKDAY_APPR_PROCESS_START) from testdata_clean, resulting in a dataset with 51 columns and 48,744 rows. The modified dataset is saved as 'test.csv' for use in the modeling steps.

### 5. Majority Class Of Target Variable

In [39]:
# Counts of TARGET column with default or no default.
target = data.TARGET.value_counts()
target

0    282686
1     24825
Name: TARGET, dtype: int64

Class 0 (no-default) has 282686 rows, whereas class 1 (default) has 24825. This is a clear example of class imbalance: the large majority of the target variable belongs to a single class. There are far more loans that were repaid on time than loans that were not repaid. This may affect the model and result in predictions biased toward the non-default class.

In [40]:
# Calculating the percentage of target class.
percenttarget=(target.values/len(data)*100)
percenttarget

array([91.92711805,  8.07288195])

We can see that 0 (no-default) accounts for 91.927% of the rows, whereas 1 (default) accounts for 8.073%. A majority-class classifier that always predicts 0 (no-default) would therefore reach 91.927% accuracy, which is the baseline any model has to beat.
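The majority-class baseline can be reproduced directly from the class counts reported above. This is a minimal sketch that rebuilds a label vector from those two counts; it does not use the actual dataset.

```python
import numpy as np

# Class counts reported above.
n_no_default, n_default = 282686, 24825
y_true = np.concatenate([np.zeros(n_no_default, dtype=int),
                         np.ones(n_default, dtype=int)])

# A majority-class classifier predicts 0 (no default) for everyone.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()                # share of correct predictions
recall_default = (y_pred[y_true == 1] == 1).mean()  # share of defaulters caught
print(f"accuracy={accuracy:.5f}, recall on defaults={recall_default:.1f}")
```

The high accuracy comes entirely from the class proportions: this baseline never identifies a single defaulter, which is why accuracy alone is a weak metric for this problem.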

### 6. Pre-Modeling

Loading the libraries required for the modeling.

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error, roc_auc_score)

Loading the cleaned train and cleaned test datasets that we cleaned before.

In [42]:
train_data=pd.read_csv("train.csv")
test_data=pd.read_csv("test.csv")
test_id = test_data['SK_ID_CURR']

In [43]:
# Separate the target, then collect the object (categorical) columns.
target=train_data["TARGET"]
train_data=train_data.drop(columns=["TARGET"])
train_df=train_data.select_dtypes(include=['object'])
test_df=test_data.select_dtypes(include=['object'])

Target variable from the train data is stored in target.
Removing the target variable from the train data and storing into train_data.
Taking only object columns from the train_data and storing it in train_df.
Taking only object columns from the test_data and storing it in test_df.

In [44]:
train_df.shape

(307511, 11)

train_df has 11 columns, all of object type.

In [45]:
test_df.shape

(48744, 11)

test_df has 11 columns, all of object type.

In [46]:
# Combine train and test data
combined_df = pd.concat([train_df, test_df])

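# Initialize the OneHotEncoder with handle_unknown='ignore'.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False.
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')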

# Fit and transform the encoder on the combined data
encoded_data_combined = encoder.fit_transform(combined_df)

# Split the encoded data back into train and test datasets
encoded_df_train = pd.DataFrame(encoded_data_combined[:len(train_df)], columns=encoder.get_feature_names_out(train_df.columns))
encoded_df_test = pd.DataFrame(encoded_data_combined[len(train_df):], columns=encoder.get_feature_names_out(test_df.columns))

Here, we one-hot encode the object columns of the train and test datasets, creating a dummy column for every category of every object column.

In [47]:
print(encoded_df_train.shape)
encoded_df_test.shape

(307511, 117)


(48744, 117)

encoded_df_train and encoded_df_test each have 117 columns, which are the one-hot encodings of the object columns.

In [48]:
# Dropping the object columns
X_train=train_data.drop(columns=train_df.columns)

We drop the object columns from train_data and store the resulting dataframe in X_train, so that we can add the new one-hot encoded columns.

In [49]:
# Concatenating X_train with encoded dataframe
X_train = pd.concat([X_train, encoded_df_train], axis=1)

Here, we concatenate the numeric columns in X_train with the one-hot encodings to form the X_train training dataset.

In [50]:
# Dropping the object columns
X_test=test_data.drop(columns=test_df.columns)

We drop the object columns from test_data and store the resulting dataframe in X_test, so that we can add the new one-hot encoded columns.

In [51]:
# Concatenating X_test with encoded dataframe
X_test = pd.concat([X_test, encoded_df_test], axis=1)

Here, we concatenate the numeric columns in X_test with the one-hot encodings to form the X_test testing dataset.

In [52]:
print(X_train.shape)
print(X_test.shape)

(307511, 157)
(48744, 157)


X_train and X_test each have 157 columns in the final train and test datasets.

In [53]:
# RandomUnderSampler from imblearn undersamples the majority class of the target variable.
from imblearn.under_sampling import RandomUnderSampler
under_sampler = RandomUnderSampler(random_state=0)
X_train, target_resampled = under_sampler.fit_resample(X_train,target)
X_train["TARGET"] = target_resampled

Since this is a class imbalance problem, we undersample the majority class with RandomUnderSampler from the imblearn library so that both classes have the same number of rows.
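The idea behind random undersampling can be sketched with pandas alone. This is an illustration of the technique on hypothetical toy data, not a description of how imblearn implements it: every class is randomly sampled down to the size of the smallest class.

```python
import pandas as pd

# Hypothetical toy data: 8 non-defaults (0) and 2 defaults (1).
toy = pd.DataFrame({"feature": range(10),
                    "TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Size of the minority class; every class is sampled down to this size.
n_minority = int(toy["TARGET"].value_counts().min())

balanced = (
    toy.groupby("TARGET", group_keys=False)
       .sample(n=n_minority, random_state=0)
       .reset_index(drop=True)
)
print(balanced["TARGET"].value_counts())
```

The trade-off is that most majority-class rows are discarded, so the model trains on far fewer examples than the original dataset contains.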

In [54]:
X_train["TARGET"] .value_counts()

0    24825
1    24825
Name: TARGET, dtype: int64

Here we can see that the classes are now balanced, with 24825 rows each for 0 and 1.

In [55]:
X_train.drop('SK_ID_CURR', axis=1, inplace=True)
X_test.drop('SK_ID_CURR', axis=1, inplace=True)

Dropping the SK_ID_CURR column from both X_train and X_test.

In [56]:
Y_train=X_train[['TARGET']]
X_train.drop(columns=["TARGET"],inplace=True)

Dropping TARGET from X_train and saving TARGET variable to Y_train.

In [57]:
print(X_train.shape)
Y_train.shape

(49650, 156)


(49650, 1)

X_train (training dataset) has 156 columns and Y_train (training target) has 1 column, the TARGET variable. Both X_train and Y_train have 49650 rows.

In [58]:
print(X_test.shape)

(48744, 156)


The X_test (test dataset) has 156 columns and 48744 rows.

In [59]:
# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)

We hold out 20% of X_train as a validation set (X_valid) and keep 80% for training, so that each model can be evaluated on unseen rows and checked for overfitting. Y_train (TARGET) is split the same way into y_train and y_valid. This is a single hold-out split rather than k-fold cross-validation.
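
The sizes produced by an 80/20 hold-out split of 49650 rows can be checked with a small sketch. It only illustrates the shuffle-and-cut idea in plain Python and is not sklearn's `train_test_split`; `holdout_split` is a hypothetical helper.

```python
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices once and cut them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_rows * test_size))
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = holdout_split(49650)
print(len(train_idx), len(valid_idx))
```

The resulting 39720 training rows and 9930 validation rows agree with the counts that appear later in the LightGBM training log and in the validation confusion matrices.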

### 7. Logistic Regression Model

Logistic regression is a statistical model used for binary classification. It calculates the probability of an observation belonging to a particular class based on input features and applies a logistic function to map the output in range between 0 and 1. The class with the highest probability is predicted as the outcome.
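
The mapping from features to a probability can be sketched in a few lines. The weights, bias and feature values below are hypothetical and only illustrate the logistic function and the 0.5 decision threshold; they are not coefficients from the fitted model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients for two scaled features.
weights, bias = [0.8, -0.5], 0.1
probability = predict_proba([1.0, 2.0], weights, bias)
label = 1 if probability >= 0.5 else 0
print(round(probability, 4), label)
```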

We import preprocessing from sklearn to center and scale X_train, X_valid and X_test for the logistic model, producing X_scaled, X_valid_scaled and X_test_scaled.

In [76]:
# Importing the preprocessing methods from sklearn and transforming the datasets.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
scaler = preprocessing.StandardScaler().fit(X_test)
X_test_scaled = scaler.transform(X_test)
scaler = preprocessing.StandardScaler().fit(X_valid)
X_valid_scaled = scaler.transform(X_valid)
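
The cell above fits a separate scaler on each dataset. A common alternative is to learn the mean and standard deviation on the training data only and reuse them for the validation and test data, so that all three sets share one scale. The sketch below shows that variant for a single column in plain Python; it uses the population standard deviation, as StandardScaler does, and the helper names and values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one training column."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Center and scale values with statistics learned elsewhere."""
    return [(value - mean) / std for value in column]


# Hypothetical single feature, for example an income column.
train_col = [10.0, 20.0, 30.0, 40.0]
valid_col = [20.0, 50.0]

mean, std = fit_scaler(train_col)  # statistics come from train only
train_scaled = transform(train_col, mean, std)
valid_scaled = transform(valid_col, mean, std)
print(mean, std)
print(valid_scaled)
```

With this approach the scaled validation column is not forced to have mean 0 and variance 1; it is simply expressed in the training data's units.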

We fit sklearn's LogisticRegression model on X_scaled and y_train.

In [77]:
# Fitting the logistic regression model on train dataset.
model = LogisticRegression()
model.fit(X_scaled, y_train)

We use the fitted model to predict on the validation dataset (X_valid_scaled) to measure its performance.

In [78]:
# Predict the fitted model on the validation dataset.
y_pred = model.predict(X_valid_scaled)

We calculate accuracy, RMSE and AUC (Area Under the Curve), along with the confusion matrix, for the logistic regression model on the validation dataset.

In [99]:
# Calculate accuracy
accuracy = accuracy_score(y_valid, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate the ROC-AUC score on the validation set (from the predicted class labels)
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

TP = conf_matrix[1, 1]  # True Positives
TN = conf_matrix[0, 0]  # True Negatives

print(f'True Positives: {TP}')
print(f'True Negatives: {TN}')

FN = conf_matrix[1, 0]  # False Negatives
FP = conf_matrix[0, 1]  # False Positives

print(f'False Negatives: {FN}')
print(f'False Positives: {FP}')
from sklearn.metrics import precision_score, recall_score

# Calculate sensitivity
sensitivity = recall_score(y_valid, y_pred)

# Calculate precision
precision = precision_score(y_valid, y_pred)

# Calculate specificity
specificity = TN / (TN + FP)

print(f'Sensitivity (Recall): {sensitivity}')
print(f'Precision: {precision}')
print(f'Specificity: {specificity}')

Accuracy: 0.66
RMSE: 0.59
ROC-AUC Score: 0.6575693982750695
Confusion Matrix:
[[3394 1559]
 [1842 3135]]
True Positives: 3135
True Negatives: 3394
False Negatives: 1842
False Positives: 1559
Sensitivity (Recall): 0.6298975286317059
Precision: 0.6678738815509161
Specificity: 0.6852412679184333


The accuracy on the validation dataset for the logistic model is 0.66.

AUC (Area Under the Curve) on the validation set for the logistic model is 0.6576. It is computed from the predicted class labels rather than from predicted probabilities.

RMSE on the validation set for the logistic model is 0.59.

The confusion matrix for the validation set for the logistic model is:

True Positives: 3135
True Negatives: 3394
False Negatives: 1842
False Positives: 1559

Sensitivity (recall) is 0.6299.

Precision on the validation dataset is 0.6679.

Specificity on the validation dataset is 0.6852.

The model identifies about 63% of the defaulters and about 69% of the non-defaulters, so it is slightly better at recognising the negative class.
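
The reported metrics can be recomputed directly from the four confusion-matrix counts printed above. This is a small plain-Python check; `classification_metrics` is a hypothetical helper, not a function from sklearn.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the usual rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "precision": tp / (tp + fp),     # share of predicted positives that are right
        "specificity": tn / (tn + fp),   # share of actual negatives found
    }


# Counts from the logistic regression validation output above.
logistic_metrics = classification_metrics(tp=3135, tn=3394, fp=1559, fn=1842)
for name, value in logistic_metrics.items():
    print(f"{name}: {value:.4f}")
```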


We predict on the test data and upload the output to Kaggle to get the Kaggle score.

In [64]:
# Predicting the model on test data.
y_pred = model.predict(X_test_scaled)

In [65]:
final_df = pd.DataFrame()
final_df["SK_ID_CURR"] = test_id
final_df["TARGET"] = y_pred
final_df

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,1
1,100005,1
2,100013,0
3,100028,0
4,100038,1
...,...,...
48739,456221,0
48740,456222,0
48741,456223,0
48742,456224,0


In [66]:
final_df.to_csv("logistic.csv", index = False)

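The Logistic Regression model achieved a Kaggle score of 0.67383, which is relatively low. Note that this submission contains hard 0/1 labels, while the competition is scored on ROC AUC, which generally benefits from predicted probabilities. Given this performance, we explore alternative models.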

### 8. VIF (Variance Inflation Factor)

In [67]:
# Import the libraries for the VIF from the statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a Pandas DataFrame to store the VIFs
vif_data = pd.DataFrame({'variable': X_train.columns})

# Calculating VIF for every column in train dataset.
vif_data["VIF"] = [variance_inflation_factor(X_train.values, i)
                          for i in range(len(X_train.columns))]
# Print the VIFs
print(vif_data)

                                variable        VIF
0                           CNT_CHILDREN        inf
1                       AMT_INCOME_TOTAL   1.008698
2                             AMT_CREDIT  36.862853
3                            AMT_ANNUITY   2.681190
4                        AMT_GOODS_PRICE  37.626728
..                                   ...        ...
151  ORGANIZATION_TYPE_Transport: type 2        inf
152  ORGANIZATION_TYPE_Transport: type 3        inf
153  ORGANIZATION_TYPE_Transport: type 4        inf
154         ORGANIZATION_TYPE_University        inf
155                ORGANIZATION_TYPE_XNA        inf

[156 rows x 2 columns]


VIF (Variance Inflation Factor) measures multicollinearity among the variables in the train dataset. A VIF greater than 5 is commonly taken to indicate high collinearity, and a VIF of inf means the column is perfectly predictable from the other columns, as can happen with a complete set of one-hot dummy columns.
Multicollinearity makes the coefficients of a logistic regression unstable.
Here, columns such as AMT_CREDIT and AMT_GOODS_PRICE have VIF values of about 37, which is high, so we address the collinear variables with penalized regression.
We use LASSO as the penalized regression.
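
The VIF of a column is defined as 1 / (1 - R²), where R² comes from regressing that column on all the other columns. The sketch below only applies this formula and its inverse to read the values in the table; the two helper functions are hypothetical and do not reproduce the statsmodels computation.

```python
import math


def vif_from_r2(r_squared):
    """VIF_j = 1 / (1 - R_j^2); infinite when the column is perfectly predictable."""
    if r_squared >= 1.0:
        return math.inf
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif):
    """Invert the formula to read a VIF as a share of explained variance."""
    return 1.0 - 1.0 / vif


print(round(r2_from_vif(36.862853), 4))  # AMT_CREDIT from the table above
print(vif_from_r2(0.8))                  # the VIF = 5 rule of thumb
print(vif_from_r2(1.0))                  # perfectly collinear column
```

Read this way, a VIF of about 37 means that roughly 97% of the variance of AMT_CREDIT is explained by the other columns.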

### 9. LASSO Model

Lasso is a penalized regression technique that shrinks some of the model's coefficients to exactly zero, effectively removing those features. It is used to simplify a machine learning model and select its most important features, which helps prevent overly complex models.
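
The way an L1 penalty produces exact zeros can be seen in the soft-thresholding operator. In the special case of uncorrelated, standardized features, the Lasso coefficients are the unpenalized coefficients passed through this operator; in general, solvers such as coordinate descent apply it repeatedly. The coefficient values below are hypothetical.

```python
def soft_threshold(coefficient, alpha):
    """Shrink a coefficient toward zero by alpha and clip it at zero."""
    if coefficient > alpha:
        return coefficient - alpha
    if coefficient < -alpha:
        return coefficient + alpha
    return 0.0


# Hypothetical unpenalized coefficients for four standardized features.
ols_coefficients = [0.90, -0.35, 0.08, -0.02]
lasso_coefficients = [soft_threshold(c, alpha=0.1) for c in ols_coefficients]
print(lasso_coefficients)
```

Coefficients smaller in magnitude than alpha are removed entirely, and the remaining ones are shrunk by alpha, so a larger alpha yields a sparser model.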

In [68]:
# Import the linear_model for LASSO regression.
from sklearn import linear_model

# Hyperparameter tuning for LASSO penalization.
param_grid = {'alpha': [0.1, 1, 10]}

reg = linear_model.Lasso()

# Grid search the right parameters
grid_search = GridSearchCV(reg, param_grid, cv=5)
grid_search.fit(X_scaled, y_train)


Here, the LASSO model is fit on X_scaled with a grid search over the penalization values (alpha, often written lambda) 0.1, 1 and 10, using 5-fold cross-validation to select the best value. Note that linear_model.Lasso is a regression model, so its outputs are continuous scores that we later threshold at 0.5.

In [100]:
best_alpha = grid_search.best_params_['alpha']
print(f"Best Alpha: {best_alpha}")

# Fit the model with the best alpha to the training data
best_lasso_model = linear_model.Lasso(alpha=best_alpha)
best_lasso_model.fit(X_scaled, y_train)

# Make predictions on the validation data
y_pred = best_lasso_model.predict(X_valid_scaled)

y_pred = pd.DataFrame({'pred': y_pred})

y_pred['pred'] = pd.to_numeric(y_pred['pred'])

# Threshold for the class.
threshold = 0.5

# Giving class label for every output.
y_pred = [1 if pred > threshold else 0 for pred in y_pred['pred']]

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Calculate and print accuracy
accuracy = accuracy_score(y_valid, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate the ROC-AUC score on the valid set
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

TP = conf_matrix[1, 1]  # True Positives
TN = conf_matrix[0, 0]  # True Negatives

print(f'True Positives: {TP}')
print(f'True Negatives: {TN}')

FN = conf_matrix[1, 0]  # False Negatives
FP = conf_matrix[0, 1]  # False Positives

print(f'False Negatives: {FN}')
print(f'False Positives: {FP}')
from sklearn.metrics import precision_score, recall_score

# Calculate sensitivity
sensitivity = recall_score(y_valid, y_pred)

# Calculate precision
precision = precision_score(y_valid, y_pred)

# Calculate specificity
specificity = TN / (TN + FP)

print(f'Sensitivity (Recall): {sensitivity}')
print(f'Precision: {precision}')
print(f'Specificity: {specificity}')

Best Alpha: 0.1
Confusion Matrix:
[[3394 1559]
 [1842 3135]]
Accuracy: 0.66
RMSE: 0.59
ROC-AUC Score: 0.6575693982750695
True Positives: 3135
True Negatives: 3394
False Negatives: 1842
False Positives: 1559
Sensitivity (Recall): 0.6298975286317059
Precision: 0.6678738815509161
Specificity: 0.6852412679184333


The best alpha for the LASSO model is 0.1, selected by grid search with cross-validation.

The accuracy on the validation dataset for the LASSO model is 0.66.

AUC (Area Under the Curve) on the validation set for the LASSO model is 0.6576.

RMSE on the validation set for the LASSO model is 0.59.

The confusion matrix for the validation set for the LASSO model is:

True Positives: 3135
True Negatives: 3394
False Negatives: 1842
False Positives: 1559

Sensitivity is 0.6299.

Precision on the validation dataset is 0.6679.

Specificity on the validation dataset is 0.6852.

Precision and specificity are about 0.67 and 0.69 respectively. These validation values are identical to the logistic regression results above.


In [83]:
# Predicting the model on the test dataset.
y_pred = best_lasso_model.predict(X_test_scaled)

Here, we predict with the best model on X_test_scaled (the scaled test data) and upload the predicted values to Kaggle to get the Kaggle score.

In [84]:
final_df = pd.DataFrame()
final_df["SK_ID_CURR"] = test_id
final_df["TARGET"] = y_pred
final_df

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.513049
1,100005,0.550766
2,100013,0.449834
3,100028,0.482110
4,100038,0.512801
...,...,...
48739,456221,0.453031
48740,456222,0.468438
48741,456223,0.518306
48742,456224,0.496076


In [85]:
final_df.to_csv("lasso.csv", index = False)

The LASSO model obtained a Kaggle score of 0.69292, an improvement over the Logistic model. We attribute this in part to the L1 penalty, which mitigates multicollinearity. In addition, this submission contains continuous scores rather than the hard 0/1 labels submitted for the logistic model, which can by itself raise an AUC-based score. However, there is still room for further improvement in model performance.

### 10. Random Forest Model

Random Forest is a machine learning algorithm that creates a collection of decision trees and combines their predictions to improve accuracy and reduce overfitting, making it a powerful tool for classification and regression tasks.

In [101]:
# Importing the RandomForestClassifier for the Random forest classifier.
from sklearn.ensemble import RandomForestClassifier

In [102]:
rf_classifier = RandomForestClassifier()

In [149]:
# Hyperparameter tuning with parameter grid.
# param_grid = {
#     'n_estimators': [200, 300, 1000],
#     'max_depth': [ 10, 30, 50],
#     'min_samples_split': [10, 20],
#     'min_samples_leaf': [1, 2],
#     'max_features': ['auto', 'sqrt']
# }

# Parameters after tuning.
param_grid = {
    'n_estimators': [1000],
    'max_depth': [50],
    'min_samples_split': [20],
    'min_samples_leaf': [1],
    'max_features': ['auto']
}

In [150]:
# Grid search for the hyperparameter tuning.
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_scaled, y_train)

We run a grid search for hyperparameter tuning over n_estimators, max_depth, min_samples_split, min_samples_leaf and max_features, using 5-fold cross-validation with accuracy as the metric to find the best parameters.
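
The cost of the full search explains why the final cell keeps only the tuned values. The sketch below enumerates the commented-out grid in plain Python to count the candidate settings and the cross-validation fits they imply; it does not call GridSearchCV.

```python
from itertools import product

# The commented-out search grid from the cell above.
search_grid = {
    "n_estimators": [200, 300, 1000],
    "max_depth": [10, 30, 50],
    "min_samples_split": [10, 20],
    "min_samples_leaf": [1, 2],
    "max_features": ["auto", "sqrt"],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(search_grid, values)) for values in product(*search_grid.values())]
cv_folds = 5
print(len(candidates), len(candidates) * cv_folds)
```

The full grid has 72 candidates, or 360 model fits with 5 folds, plus one final refit of the best candidate when `refit` is left at its default. The reduced grid has a single candidate, so it only needs 5 fits and the refit.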

In [151]:
best_params = grid_search.best_params_
best_rf_classifier = grid_search.best_estimator_
print(best_params)

{'max_depth': 50, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 1000}


We select the best parameters and the best estimator for the random forest and print the best parameters. The final best parameters are n_estimators=1000, max_depth=50, min_samples_split=20, min_samples_leaf=1 and max_features='auto'.

In [152]:
# Make predictions on the validation data
y_pred = best_rf_classifier.predict(X_valid_scaled)

y_pred = pd.DataFrame({'pred': y_pred})

y_pred['pred'] = pd.to_numeric(y_pred['pred'])

# Threshold for the class.
threshold = 0.5

# Giving class label for every output.
y_pred = [1 if pred > threshold else 0 for pred in y_pred['pred']]

# Calculate accuracy
accuracy = accuracy_score(y_valid, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate the ROC-AUC score on the valid set
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

TP = conf_matrix[1, 1]  # True Positives
TN = conf_matrix[0, 0]  # True Negatives

print(f'True Positives: {TP}')
print(f'True Negatives: {TN}')

FN = conf_matrix[1, 0]  # False Negatives
FP = conf_matrix[0, 1]  # False Positives

print(f'False Negatives: {FN}')
print(f'False Positives: {FP}')
from sklearn.metrics import precision_score, recall_score

# Calculate sensitivity
sensitivity = recall_score(y_valid, y_pred)

# Calculate precision
precision = precision_score(y_valid, y_pred)

# Calculate specificity
specificity = TN / (TN + FP)

print(f'Sensitivity (Recall): {sensitivity}')
print(f'Precision: {precision}')
print(f'Specificity: {specificity}')

Accuracy on the test set: 0.68
RMSE: 0.57
ROC-AUC Score: 0.6791546382894933
Confusion Matrix:
[[3365 1588]
 [1598 3379]]
True Positives: 3379
True Negatives: 3365
False Negatives: 1598
False Positives: 1588
Sensitivity (Recall): 0.6789230460116537
Precision: 0.680289913428629
Specificity: 0.6793862305673329


The accuracy on the validation dataset for the random forest model is 0.68.

AUC on the validation set for the random forest model is 0.679.

RMSE on the validation set for the random forest model is 0.57.

The confusion matrix for the validation set for the random forest model is:

True Positives: 3379
True Negatives: 3365
False Negatives: 1598
False Positives: 1588

Sensitivity is 0.679.

Precision on the validation dataset is 0.680.

Specificity on the validation dataset is 0.679.

Sensitivity, precision and specificity are all about 0.68, so the model performs evenly on both classes.

In [153]:
# Make predictions on test data.
y_pred = best_rf_classifier.predict(X_test_scaled)

Here, we predict with the best model on X_test_scaled (the scaled test data) and upload the predicted values to Kaggle to get the Kaggle score.

In [154]:
final_df = pd.DataFrame()
final_df["SK_ID_CURR"] = test_id
final_df["TARGET"] = y_pred
final_df

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0
1,100005,1
2,100013,0
3,100028,0
4,100038,1
...,...,...
48739,456221,0
48740,456222,1
48741,456223,1
48742,456224,0


In [155]:
final_df.to_csv("randomforest.csv", index = False)

The Kaggle score of the random forest model is 0.66936, which is lower than both regression models, so we turn to more advanced models for a better fit. As with the logistic model, this submission contains hard 0/1 labels rather than probabilities.

### 11. XG Boost Model

XGBoost (Extreme Gradient Boosting) is a machine learning algorithm that combines the predictions of multiple decision trees in an iterative manner. It optimizes model performance by minimizing a loss function using gradient boosting, employing techniques like regularization, and handling missing data efficiently. XGBoost is known for its speed, accuracy, and effectiveness in various machine learning tasks, especially in structured data problems.

In [133]:
param_grid = {
    'n_estimators': [50, 100, 150, 170],
    'learning_rate': [ 0.01, 0.07],
    'max_depth': [1,2],
}
# Create an XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', eval_metric='auc', nthread=4, random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, scoring='roc_auc', cv=5)

# Perform hyperparameter tuning
grid_search.fit(X_scaled, y_train)

# Get the best hyperparameters and the best estimator
best_params = grid_search.best_params_
best_xgb = grid_search.best_estimator_

# Fit the best model to the training data
best_xgb.fit(X_scaled, y_train)

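# Make predictions on the validation set
# NOTE: this scores the unscaled X_valid although the model was fit on X_scaled;
# the later evaluation cell uses X_valid_scaled.
y_pred = best_xgb.predict_proba(X_valid)[:, 1]

# Calculate the ROC-AUC score on the validation set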
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

ROC-AUC Score: 0.6734083182802411


Here we provide the parameter grid for hyperparameter tuning over n_estimators, learning_rate and max_depth, and use 5-fold cross-validation with ROC AUC as the metric to select the best parameters.

In [134]:
best_params

{'learning_rate': 0.07, 'max_depth': 2, 'n_estimators': 170}

The best parameters are learning_rate = 0.07, max_depth = 2, n_estimators = 170.

In [138]:
# Applying best_parameters
xgb = XGBClassifier(objective='binary:logistic', eval_metric='auc',learning_rate = 0.07, max_depth = 2, n_estimators = 170,random_state=25,n_jobs=-1)
xgb.fit(X_scaled, y_train)

# Make predictions on the validation set
y_pred = xgb.predict_proba(X_valid_scaled)[:, 1]

y_pred = pd.DataFrame({'pred': y_pred})

y_pred['pred'] = pd.to_numeric(y_pred['pred'])

# Threshold for the class.
threshold = 0.5

# Giving class label for every output.
y_pred = [1 if pred > threshold else 0 for pred in y_pred['pred']]

# Calculate accuracy
accuracy = accuracy_score(y_valid, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate the ROC-AUC score on the valid set
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

TP = conf_matrix[1, 1]  # True Positives
TN = conf_matrix[0, 0]  # True Negatives

print(f'True Positives: {TP}')
print(f'True Negatives: {TN}')

FN = conf_matrix[1, 0]  # False Negatives
FP = conf_matrix[0, 1]  # False Positives

print(f'False Negatives: {FN}')
print(f'False Positives: {FP}')
from sklearn.metrics import precision_score, recall_score

# Calculate sensitivity
sensitivity = recall_score(y_valid, y_pred)

# Calculate precision
precision = precision_score(y_valid, y_pred)

# Calculate specificity
specificity = TN / (TN + FP)

print(f'Sensitivity (Recall): {sensitivity}')
print(f'Precision: {precision}')
print(f'Specificity: {specificity}')

Accuracy on the test set: 0.65
RMSE: 0.59
ROC-AUC Score: 0.6524998031526489
Confusion Matrix:
[[4124  829]
 [2626 2351]]
True Positives: 2351
True Negatives: 4124
False Negatives: 2626
False Positives: 829
Sensitivity (Recall): 0.4723729154108901
Precision: 0.7393081761006289
Specificity: 0.8326266908944074


The accuracy on the validation dataset for the XGBoost model is 0.65.

AUC on the validation set for the XGBoost model is 0.652, computed from the thresholded class labels.

RMSE on the validation set for the XGBoost model is 0.59.

The confusion matrix for the validation set for the XGBoost model is:

True Positives: 2351
True Negatives: 4124
False Negatives: 2626
False Positives: 829

Sensitivity is 0.472.

Precision on the validation dataset is 0.739.

Specificity on the validation dataset is 0.833.

Precision and specificity are about 0.739 and 0.833 respectively, which is good, but sensitivity is only 0.472: at the 0.5 threshold the model misses more than half of the defaulters.

In [139]:
# Predicting the xgb model on the test data.
final_pred = xgb.predict_proba(X_test_scaled)[:, 1]

Here, we predict with the best model on X_test_scaled (the scaled test data) and upload the predicted probabilities to Kaggle to get the Kaggle score.

In [140]:
final_df = pd.DataFrame()
final_df["SK_ID_CURR"] = test_id
final_df["TARGET"] = final_pred
final_df

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.586183
1,100005,0.733518
2,100013,0.225736
3,100028,0.400433
4,100038,0.494137
...,...,...
48739,456221,0.338443
48740,456222,0.433971
48741,456223,0.583767
48742,456224,0.433949


In [141]:
final_df.to_csv("xgboost.csv", index = False)

The Kaggle score is 0.70854, which is better than the scores of the random forest and both regression models.

### 12. Light Gradient Boost Model

LightGBM is a gradient boosting framework that uses a histogram-based algorithm for efficient training and predictions. It's known for its speed and scalability, making it a popular choice for machine learning tasks. LightGBM can handle large datasets, offers good accuracy, and supports various machine learning tasks, including classification, regression, and ranking.

In [125]:
import lightgbm as lgb

params = {
    'num_leaves': 40,
    'n_estimators': 160,
    'min_child_weight': 15,
    'max_depth': 3,
    'learning_rate': 0.07,
    'colsample_bytree': 0.8
}

# Create the LightGBM classifier
lgbm = lgb.LGBMClassifier(**params)

lgbm.fit(X_scaled, y_train)

# Predict probabilities on the validation dataset.
y_pred = lgbm.predict_proba(X_valid_scaled)[:, 1]

# Calculate the ROC-AUC score on the validation set
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')


[LightGBM] [Info] Number of positive: 19848, number of negative: 19872
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.054637 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3364
[LightGBM] [Info] Number of data points in the train set: 39720, number of used features: 141
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499698 -> initscore=-0.001208
[LightGBM] [Info] Start training from score -0.001208
ROC-AUC Score: 0.7464861277280295


Here we set the LightGBM hyperparameters directly (no grid search is run in this cell) and fit the model. The ROC-AUC of 0.7465 is computed from the predicted probabilities on the validation set.

In [130]:
y_pred = pd.DataFrame({'pred': y_pred})

y_pred['pred'] = pd.to_numeric(y_pred['pred'])

# Threshold for the class.
threshold = 0.5

# Giving class label for every output.
y_pred = [1 if pred > threshold else 0 for pred in y_pred['pred']]
# Calculate accuracy
accuracy = accuracy_score(y_valid, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate the ROC-AUC score on the valid set
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC-AUC Score: {roc_auc}')

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

TP = conf_matrix[1, 1]  # True Positives
TN = conf_matrix[0, 0]  # True Negatives

print(f'True Positives: {TP}')
print(f'True Negatives: {TN}')

FN = conf_matrix[1, 0]  # False Negatives
FP = conf_matrix[0, 1]  # False Positives

print(f'False Negatives: {FN}')
print(f'False Positives: {FP}')
from sklearn.metrics import precision_score, recall_score

# Calculate sensitivity
sensitivity = recall_score(y_valid, y_pred)

# Calculate precision
precision = precision_score(y_valid, y_pred)

# Calculate specificity
specificity = TN / (TN + FP)

print(f'Sensitivity (Recall): {sensitivity}')
print(f'Precision: {precision}')
print(f'Specificity: {specificity}')

Accuracy on the test set: 0.69
RMSE: 0.56
ROC-AUC Score: 0.6860284139263507
Confusion Matrix:
[[3452 1501]
 [1617 3360]]
True Positives: 3360
True Negatives: 3452
False Negatives: 1617
False Positives: 1501
Sensitivity (Recall): 0.6751054852320675
Precision: 0.6912157992182678
Specificity: 0.696951342620634


The accuracy on the validation dataset for the Light Gradient Boost model is 0.69.

AUC on the validation set for the Light Gradient Boost model is 0.686 when computed from the thresholded class labels. Computed from the predicted probabilities, it is 0.7465, as printed in the earlier cell.

RMSE on the validation set for the Light Gradient Boost model is 0.56.

The confusion matrix for the validation set for the Light Gradient Boost model is:

True Positives: 3360
True Negatives: 3452
False Negatives: 1617
False Positives: 1501

Sensitivity is 0.675.

Precision on the validation dataset is 0.691.

Specificity on the validation dataset is 0.697.

Precision and specificity are both about 0.69, and sensitivity is close behind at 0.675, so the model performs evenly on both classes.
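
The gap between the two AUC values for this model (0.7465 from probabilities, 0.686 after thresholding) follows from how AUC is defined. AUC is the share of (positive, negative) pairs that the scores rank correctly, so collapsing the scores to 0/1 labels creates ties and discards ranking information. For 0/1 predictions, AUC reduces to the mean of sensitivity and specificity. The sketch below shows this on hypothetical toy data with a plain-Python pairwise AUC; `roc_auc` is an illustrative helper, not sklearn's `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation labels and predicted probabilities.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.60, 0.45, 0.70, 0.90]
hard_labels = [1 if s > 0.5 else 0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard_labels)
sensitivity = 2 / 3  # two of three positives are above the threshold
specificity = 2 / 3  # two of three negatives are below the threshold
print(auc_scores, auc_hard, (sensitivity + specificity) / 2)
```

The same identity holds for the outputs in this notebook: for this model, (0.6751 + 0.6970) / 2 gives the reported label-based AUC of 0.686.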

In [126]:
# Testing the model on test dataset.
lgbm_pred = lgbm.predict_proba(X_test_scaled)[:, 1]

Here, we predict with the fitted model on X_test_scaled (the scaled test data) and upload the predicted probabilities to Kaggle to get the Kaggle score.

In [127]:
final_df = pd.DataFrame()
final_df["SK_ID_CURR"] = test_id
final_df["TARGET"] = lgbm_pred
final_df

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.547884
1,100005,0.776247
2,100013,0.167433
3,100028,0.385325
4,100038,0.555009
...,...,...
48739,456221,0.381632
48740,456222,0.433633
48741,456223,0.640532
48742,456224,0.409669


In [128]:
final_df.to_csv("lightgradient.csv", index = False)

The Kaggle score is 0.72387, the highest of all the models used in this notebook so far.

### 13. Results

Here, we tabulate the accuracy, RMSE, AUC and Kaggle score for the models used in this notebook.

In [156]:
# Dataframe for results table
result ={
    'Model Name': ['Logistic Regression', 'LASSO', 'RandomForest', 'XGBoost', 'Light Gradient Boost'],
    'Accuracy': [0.66, 0.66, 0.68 , 0.65, 0.69],
    'RMSE': [0.59, 0.59, 0.57 , 0.59, 0.56],
    'AUC Score': [0.657, 0.6575, 0.679, 0.652, 0.686 ],
    'Kaggle Score': [0.67383, 0.69292 ,0.66936 , 0.70854, 0.72387]
}

result = pd.DataFrame(result)
result


| | Model Name | Accuracy | RMSE | AUC Score | Kaggle Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.66 | 0.59 | 0.657 | 0.67383 |
| 1 | LASSO | 0.66 | 0.59 | 0.6575 | 0.69292 |
| 2 | RandomForest | 0.68 | 0.57 | 0.679 | 0.66936 |
| 3 | XGBoost | 0.65 | 0.59 | 0.652 | 0.70854 |
| 4 | Light Gradient Boost | 0.69 | 0.56 | 0.686 | 0.72387 |


From the results table, Light Gradient Boost has the highest Kaggle score (0.72387) and also the highest validation AUC score (0.686).
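
Because the RMSE values in this notebook are computed on 0/1 class labels, each squared error is either 0 or 1, so the mean squared error equals the misclassification rate and RMSE = sqrt(1 - accuracy). The sketch below recomputes the Accuracy and RMSE columns from the validation confusion matrices reported earlier; only the dictionary layout is an assumption.

```python
import math

# Validation confusion matrices reported earlier, as (TN, FP, FN, TP).
confusion = {
    "Logistic Regression": (3394, 1559, 1842, 3135),
    "LASSO": (3394, 1559, 1842, 3135),
    "RandomForest": (3365, 1588, 1598, 3379),
    "XGBoost": (4124, 829, 2626, 2351),
    "Light Gradient Boost": (3452, 1501, 1617, 3360),
}

summary = {}
for model_name, (tn, fp, fn, tp) in confusion.items():
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    rmse = math.sqrt((fp + fn) / total)  # 0/1 labels: MSE equals the error rate
    summary[model_name] = (round(accuracy, 2), round(rmse, 2))
    print(model_name, summary[model_name])
```

RMSE therefore carries no information beyond accuracy in this table; the AUC and Kaggle score columns are the ones that distinguish the models' ranking quality.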

### 14. Team Members Contribution

Abhiram Mannam - Abhiram was responsible for implementing Light Gradient Boosting, preparing a notebook write-up, and evaluating the models.

Neil Samuel Pulukuri - Neil's tasks included data cleaning and modeling the Random Forest model to fit the dataset.

Josh Hawley - Josh was in charge of implementing the XG Boost model and conducting hyperparameter tuning.

Kushal Ram Tayi - Kushal's tasks included building the Logistic model, calculating the Variance Inflation Factor (VIF), and using penalized LASSO regression to address multicollinearity.

### 15. Conclusion

To conclude, this modeling notebook finds the following:

There is a significant class imbalance in the dataset, with a small number of defaults compared to non-defaults. Addressing this imbalance is important because it affects model training and evaluation. We therefore used the undersampling technique from the imblearn library to balance the target variable.

We used one-hot encoding to turn the object variables into dummy columns so that the categorical variables could also be used in modeling.

We started with a logistic regression model, which scored 0.674 on Kaggle. The VIF scores showed multicollinearity, so we fit the penalized LASSO regression model, and the Kaggle score improved to 0.693.

We then fit a random forest with hyperparameter tuning and obtained a Kaggle score of 0.669, indicating the need for further models.

Finally, we used the XG Boost model with hyperparameter tuning, which reached a Kaggle score of 0.709, and then Light Gradient Boost, which is fast and reached a Kaggle score of 0.724.

To conclude, Light Gradient Boost is the best-fitting model for the given cleaned dataset, with a Kaggle score of 0.72387.