# CAREER CAPSTONE - SPRING 2024

## Assignment - Modelling for Home Loan Credit Risk

## Table of Contents

<h2>
<ol>
1. <a href="#1.Introduction">Introduction</a>
    <ul>   
        <li><a href="#i.-Business-Problem">Business Problem</a></li>
        <li><a href="#ii.-Analytics-Approach">Analytics Approach</a></li>
        <li><a href="#iii.-Scope-of-the-Project">Scope</a></li>
        <li><a href="#iv.-Purpose-of-the-Notebook">Purpose of the Notebook</a></li>
    </ul>
2. <a href="#2.Data-Exploration">Data Exploration</a>
    <ul>   
        <li><a href="#2.i-Importing-Libraries">Importing Libraries</a></li>
        <li><a href="#2.ii-Loading-and-Reviewing-Datasets">Loading and Reviewing Datasets</a></li>
    </ul>   
3. <a href="#3.Data-Cleaning-&-Feature-Engineering">Data Cleaning & Feature Engineering</a>
    <ul>   
        <li><a href="#3.i-Train-Set">Train Set</a></li>
        <li><a href="#3.ii-Test-Set">Test Set</a></li>
    </ul>  
4. <a href="#4.Modelling">Modelling</a>
    <ul>
        <li><a href="#4.iii-Light-Gradient-Boost-Model">Light Gradient Boost Model</a></li>
    </ul>
5. <a href="#5.Results">Results</a>
    <br>



## 1.Introduction

Home Credit is an international consumer finance provider operating in 9 countries. It provides point-of-sale loans, cash loans, and revolving loans to underserved borrowers. Here, "underserved borrowers" are people who earn regular income from a job or business but have little or no credit history, and who therefore find it difficult to obtain credit from traditional lenders. Home Credit believes that a thin credit history should not prevent a borrower from fulfilling their dreams. The goal of this project is to use the accumulated borrower behavioral data to develop predictive models. These models will help Home Credit assess the risk associated with a given client and estimate a safe credit amount to lend, even for underserved borrowers with little or no credit history. Ultimately, the aim is to extend financial assistance to these customers while ensuring responsible lending practices.


### i. Business Problem

Home Credit is facing a substantial barrier in providing loans to individuals with insufficient or non-existent credit histories. This challenge hampers the company's mission of broadening financial inclusion and exposes this demographic to exploitation by untrustworthy lenders.

### ii. Analytics Approach

This is a predictive analytics project that uses supervised learning. A classification algorithm will predict whether a loan applicant is likely to repay. We will use a variety of alternative data (including telco and transactional information) to build the model. The target variable is binary, indicating whether or not the applicant is classified as capable of repayment.

### iii. Scope of the Project

The project entails developing a predictive analytics model through supervised learning and a classification algorithm for loan repayment prediction. Utilizing diverse alternative data, including telco and transactional information, the scope encompasses model development, validation, and integration. Excluded from the scope are detailed loan decline analyses and specific loan term considerations. Potential future additions involve refining the model with extra data sources or incorporating real-time updates.

### iv. Purpose of the Notebook

In this notebook, we examine credit default risk at Home Credit, a company dedicated to extending loans to individuals with little or no credit history. This demographic often falls prey to predatory lenders, which underscores the importance of Home Credit's mission to provide a secure and positive borrowing experience. To tackle this challenge, Home Credit uses alternative data sources such as telecom and transactional records to gauge clients' repayment capabilities. Our analysis starts with data cleaning, where we tidy and preprocess the data for modeling. We then build predictive models on the cleaned dataset to evaluate credit default risk. We model the target variable with several machine learning techniques and compare performance metrics such as accuracy and ROC AUC. Our aim is to identify the most effective model based on Kaggle scores. We start with Logistic Regression and then move on to Random Forest, XGBoost, and LightGBM to compare their predictive performance.

## 2.Data Exploration

### 2.i Importing Libraries

In [None]:
pip install prettytable



In [None]:
pip install xgboost



In [None]:
pip install lightgbm



In [None]:
# Standard library
import gc
import os
import pickle
import warnings
from datetime import datetime

# Data handling and plotting
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from prettytable import PrettyTable
from scipy.stats import uniform

# scikit-learn: model selection, preprocessing, metrics, and models
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, mean_squared_error,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import (
    RandomizedSearchCV, StratifiedKFold, cross_val_predict,
    cross_val_score, cross_validate, train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Gradient boosting libraries
import lightgbm as lgb
import xgboost as xgb
from lightgbm import LGBMClassifier, LGBMRegressor
from xgboost import XGBClassifier, XGBRegressor

# Notebook-wide settings
sns.set_theme(style='darkgrid', palette="colorblind")
warnings.filterwarnings('ignore')
%matplotlib inline


### 2.ii Loading and Reviewing Datasets

#### Data Source : https://www.kaggle.com/competitions/home-credit-default-risk/data

### i. Importing Data

In [None]:
# 1. Loading Train and Test Datasets
train_dataset=pd.read_csv("application_train.csv")
test_dataset=pd.read_csv("application_test.csv")


# 2. Loading Supporting Datasets
#previous_app=pd.read_csv("previous_application.csv")
#sample_submission=pd.read_csv("sample_submission.csv")
#installments_payments=pd.read_csv("installments_payments.csv")
#credit_card_balance=pd.read_csv("credit_card_balance.csv")
#bureau_balance=pd.read_csv("bureau_balance.csv")
#bureau=pd.read_csv("bureau.csv")
#POS_CASH_balance=pd.read_csv("POS_CASH_balance.csv")

### ii. Preview of Training Dataset

In [None]:
# head provides the first 5 rows of the dataset.
train_dataset.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# tail provides the last 5 rows of the dataset.
train_dataset.tail()


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
307510,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0


Based on the output of the head() and tail() functions applied to the dataset, we observe that it comprises 122 columns
and 307511 rows. Among the crucial columns are "SK_ID_CURR," representing the loan application ID, and "TARGET,"
serving as the target variable denoting default status.

In [None]:
train_dataset.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,307511.0,307511.0,307511.0,307511.0,307511.0,307499.0,307233.0,307511.0,307511.0,307511.0,...,307511.0,307511.0,307511.0,307511.0,265992.0,265992.0,265992.0,265992.0,265992.0,265992.0
mean,278180.518577,0.080729,0.417052,168797.9,599026.0,27108.573909,538396.2,0.020868,-16036.995067,63815.045904,...,0.00813,0.000595,0.000507,0.000335,0.006402,0.007,0.034362,0.267395,0.265474,1.899974
std,102790.175348,0.272419,0.722121,237123.1,402490.8,14493.737315,369446.5,0.013831,4363.988632,141275.766519,...,0.089798,0.024387,0.022518,0.018299,0.083849,0.110757,0.204685,0.916002,0.794056,1.869295
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189145.5,0.0,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-2760.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278202.0,0.0,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1213.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367142.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-289.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,...,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0
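
The `TARGET` mean of 0.080729 in the summary above means that roughly 8% of training applications are labeled 1, so the two classes are heavily imbalanced. The sketch below uses a made-up sample of 100 labels with the same 8% rate to illustrate why accuracy alone can mislead on such data, and why a ranking metric such as ROC AUC is worth reporting alongside it. The helper `pairwise_auc` is a hypothetical name written for this illustration; it is not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels: 8 defaults out of 100 applications (illustrative, not real rows)
y_true = [1] * 8 + [0] * 92

# A "model" that gives every applicant the same score (predicts no defaults)
constant_scores = [0] * 100

accuracy = sum(int(p == y) for p, y in zip(constant_scores, y_true)) / len(y_true)
auc = pairwise_auc(y_true, constant_scores)

print(f"accuracy = {accuracy:.2f}, ROC AUC = {auc:.2f}")
```

The constant predictor looks strong on accuracy simply because most labels are 0, while its ROC AUC shows it cannot separate the two classes at all.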


In [None]:
train_dataset.shape

(307511, 122)

In [None]:
# Number of each type of column
train_dataset.dtypes.value_counts()

float64    65
int64      41
object     16
Name: count, dtype: int64

In [None]:
# Count of Missing Values
train_dataset.isna().sum().sort_values(ascending=False)

COMMONAREA_MEDI             214865
COMMONAREA_AVG              214865
COMMONAREA_MODE             214865
NONLIVINGAPARTMENTS_MODE    213514
NONLIVINGAPARTMENTS_AVG     213514
                             ...  
NAME_HOUSING_TYPE                0
NAME_FAMILY_STATUS               0
NAME_EDUCATION_TYPE              0
NAME_INCOME_TYPE                 0
SK_ID_CURR                       0
Length: 122, dtype: int64
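
Raw counts are easier to judge as a share of rows: 214,865 missing values out of 307,511 rows is roughly 70% of the column. The following sketch shows one way to report counts and percentages side by side. It runs on a small made-up frame (`toy`), so the numbers it prints are illustrative only; on the real data the same expression would be applied to `train_dataset`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_dataset (values are made up)
toy = pd.DataFrame({
    "COMMONAREA_AVG": [0.1, np.nan, np.nan, np.nan],
    "AMT_ANNUITY": [100.0, 200.0, np.nan, 400.0],
    "SK_ID_CURR": [1, 2, 3, 4],
})

# isna().mean() is the fraction of missing rows per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```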

In [None]:
# Count of duplicate rows in the training dataset
train_dataset.duplicated().sum()

0

In [None]:
# Number of unique classes in each object column
train_dataset.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

### iii. Preview of Test Dataset

In [None]:
test_dataset.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [None]:
test_dataset.describe()

Unnamed: 0,SK_ID_CURR,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,48744.0,48744.0,48744.0,48744.0,48720.0,48744.0,48744.0,48744.0,48744.0,48744.0,...,48744.0,48744.0,48744.0,48744.0,42695.0,42695.0,42695.0,42695.0,42695.0,42695.0
mean,277796.67635,0.397054,178431.8,516740.4,29426.240209,462618.8,0.021226,-16068.084605,67485.366322,-4967.652716,...,0.001559,0.0,0.0,0.0,0.002108,0.001803,0.002787,0.009299,0.546902,1.983769
std,103169.547296,0.709047,101522.6,365397.0,16016.368315,336710.2,0.014428,4325.900393,144348.507136,3552.612035,...,0.039456,0.0,0.0,0.0,0.046373,0.046132,0.054037,0.110924,0.693305,1.838873
min,100001.0,0.0,26941.5,45000.0,2295.0,45000.0,0.000253,-25195.0,-17463.0,-23722.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,188557.75,0.0,112500.0,260640.0,17973.0,225000.0,0.010006,-19637.0,-2910.0,-7459.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,277549.0,0.0,157500.0,450000.0,26199.0,396000.0,0.01885,-15785.0,-1293.0,-4490.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,367555.5,1.0,225000.0,675000.0,37390.5,630000.0,0.028663,-12496.0,-296.0,-1901.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0
max,456250.0,20.0,4410000.0,2245500.0,180576.0,2245500.0,0.072508,-7338.0,365243.0,0.0,...,1.0,0.0,0.0,0.0,2.0,2.0,2.0,6.0,7.0,17.0


In [None]:
test_dataset.shape

(48744, 121)

In [None]:
test_dataset.dtypes.value_counts()

float64    65
int64      40
object     16
Name: count, dtype: int64

In [None]:
test_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB


In [None]:
test_dataset.duplicated().sum()

0

In [None]:
test_dataset.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               7
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             5
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
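
Comparing the two listings, the test set has fewer distinct values than the train set in `CODE_GENDER` (2 vs 3), `NAME_INCOME_TYPE` (7 vs 8), and `NAME_FAMILY_STATUS` (5 vs 6). A short check like the one below can list which categories appear only in the train set. This is a sketch on made-up frames; `train_only_categories` is a hypothetical helper name, and the toy values are illustrative.

```python
import pandas as pd


def train_only_categories(train, test, columns):
    """Return {column: sorted values seen in train but never in test}."""
    result = {}
    for col in columns:
        extra = set(train[col].dropna()) - set(test[col].dropna())
        if extra:
            result[col] = sorted(extra)
    return result


# Toy stand-ins for train_dataset and test_dataset
toy_train = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married", "Widow", "Unknown"],
    "FLAG_OWN_CAR": ["Y", "N", "N"],
})
toy_test = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married", "Widow", "Married"],
    "FLAG_OWN_CAR": ["N", "Y", "Y"],
})

extra = train_only_categories(
    toy_train, toy_test, ["NAME_FAMILY_STATUS", "FLAG_OWN_CAR"]
)
print(extra)
```

Categories that exist only in the train set matter for encoding, because they would produce columns or codes that the test set never uses.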

In [None]:
test_dataset.isna().sum().sort_values(ascending=False)

COMMONAREA_AVG              33495
COMMONAREA_MODE             33495
COMMONAREA_MEDI             33495
NONLIVINGAPARTMENTS_AVG     33347
NONLIVINGAPARTMENTS_MODE    33347
                            ...  
NAME_HOUSING_TYPE               0
NAME_FAMILY_STATUS              0
NAME_EDUCATION_TYPE             0
NAME_INCOME_TYPE                0
SK_ID_CURR                      0
Length: 121, dtype: int64

### Observations:

The dataset consists of two main tables: the train set and the test set, both organized by SK_ID_CURR. The train set comprises 122 columns and 307,511 rows, with 16 categorical and 106 numeric columns. On the other hand, the test set contains 121 columns and 48,744 rows, featuring 16 categorical and 105 numeric columns. These tables form the foundation of the dataset, providing a comprehensive view of the data structure and variables involved.

## 3.Data Cleaning & Feature Engineering

### 3.i Train Set

In [None]:
train_dataset.isna().sum().sort_values(ascending=False)

COMMONAREA_MEDI             214865
COMMONAREA_AVG              214865
COMMONAREA_MODE             214865
NONLIVINGAPARTMENTS_MODE    213514
NONLIVINGAPARTMENTS_AVG     213514
                             ...  
NAME_HOUSING_TYPE                0
NAME_FAMILY_STATUS               0
NAME_EDUCATION_TYPE              0
NAME_INCOME_TYPE                 0
SK_ID_CURR                       0
Length: 122, dtype: int64

In [None]:
# Number of unique classes in each object column
train_dataset.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

In [None]:
# Handling missing values
object_columns = train_dataset.select_dtypes(include='object').columns
nan_object_columns = train_dataset[object_columns].isnull().sum()
nan_object_columns = nan_object_columns[nan_object_columns > 0].index

# Replace missing values with the mode for each of these columns
for column in nan_object_columns:
    mode_value = train_dataset[column].mode()[0]  # Get the first mode value
    train_dataset[column] = train_dataset[column].fillna(mode_value)


In [None]:
numeric_columns = train_dataset.select_dtypes(include=['int64', 'float64']).columns
nan_numeric_columns = train_dataset[numeric_columns].isnull().sum()
nan_numeric_columns = nan_numeric_columns[nan_numeric_columns > 0].index

# Replace missing values with the mean for each of these numeric columns
for column in nan_numeric_columns:
    mean_value = train_dataset[column].mean()  # Calculate the mean, ignoring NaN values
    train_dataset[column] = train_dataset[column].fillna(mean_value)
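
The two imputation cells above compute a mode per categorical column and a mean per numeric column, then fill in place. As a variation (not what the notebook does), the fill values can be collected in a dictionary first, which makes it possible to reuse the training statistics on another dataset. The sketch below assumes toy data; `learn_fill_values` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def learn_fill_values(df):
    """Map each column that has missing values to its fill value.

    Numeric columns use the mean, all other columns use the first mode.
    """
    fill_values = {}
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            fill_values[col] = df[col].mean()
        else:
            fill_values[col] = df[col].mode()[0]
    return fill_values


# Toy frames (illustrative values only)
toy_train = pd.DataFrame({
    "OWN_CAR_AGE": [4.0, np.nan, 8.0],
    "OCCUPATION_TYPE": ["Laborers", None, "Laborers"],
    "SK_ID_CURR": [1, 2, 3],
})
toy_other = pd.DataFrame({
    "OWN_CAR_AGE": [np.nan, 10.0],
    "OCCUPATION_TYPE": [None, "Drivers"],
    "SK_ID_CURR": [4, 5],
})

fill_values = learn_fill_values(toy_train)
filled_other = toy_other.fillna(fill_values)  # reuse the train statistics
print(fill_values)
```

Reusing one set of fill values keeps a given missing entry mapped to the same number regardless of which dataset it appears in.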


In [None]:
# Feature Engineering

In [None]:
flag_documents = [
    "FLAG_DOCUMENT_2", "FLAG_DOCUMENT_4", "FLAG_DOCUMENT_7",
    "FLAG_DOCUMENT_10", "FLAG_DOCUMENT_12", "FLAG_DOCUMENT_17"
]

# Loop to set the value of 1 to 0 for these train set columns
for col in flag_documents:
    train_dataset[col] = train_dataset[col].apply(lambda x: 0 if x == 1 else x)

In [None]:
# Calculate the mode of the NAME_FAMILY_STATUS column in the training dataset
mode_train = train_dataset['NAME_FAMILY_STATUS'].mode()[0]


# Replace 'Unknown' values with the mode in the training data
train_dataset.loc[train_dataset['NAME_FAMILY_STATUS'] == 'Unknown', 'NAME_FAMILY_STATUS'] = mode_train

In [None]:
# Convert negative days to positive for age
train_dataset['DAYS_BIRTH'] = np.abs(train_dataset['DAYS_BIRTH'])
# Convert negative days to positive for registration period
train_dataset['DAYS_REGISTRATION'] = np.abs(train_dataset['DAYS_REGISTRATION'])
# Convert negative days to positive for days since ID publication
train_dataset['DAYS_ID_PUBLISH'] = np.abs(train_dataset['DAYS_ID_PUBLISH'])
# Convert negative days to positive for days employed
train_dataset["DAYS_EMPLOYED"] = np.abs(train_dataset["DAYS_EMPLOYED"])
# Convert negative days to positive for days since last phone change
train_dataset['DAYS_LAST_PHONE_CHANGE'] = np.abs(train_dataset['DAYS_LAST_PHONE_CHANGE'])
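# Apply conditions and set values for the new column.
# Note: missing OWN_CAR_AGE values were replaced with the column mean above, so the
# `== -1` conditions only match rows where -1 is an actual value in the data.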
train_dataset.loc[(train_dataset["FLAG_OWN_CAR"] == "N") & (train_dataset["OWN_CAR_AGE"] == -1), "NEW_CERTAINLY_HAVE_CAR"] = "N"
train_dataset.loc[(train_dataset["FLAG_OWN_CAR"] == "N") & (train_dataset["OWN_CAR_AGE"] != -1), "NEW_CERTAINLY_HAVE_CAR"] = "M"
train_dataset.loc[(train_dataset["FLAG_OWN_CAR"] == "Y") & (train_dataset["OWN_CAR_AGE"] != -1), "NEW_CERTAINLY_HAVE_CAR"] = "Y"
train_dataset.loc[(train_dataset["FLAG_OWN_CAR"] == "Y") & (train_dataset["OWN_CAR_AGE"] == -1), "NEW_CERTAINLY_HAVE_CAR"] = "M"

# Find the mode of the column excluding 'M'
mode_value = train_dataset[train_dataset["NEW_CERTAINLY_HAVE_CAR"] != "M"]["NEW_CERTAINLY_HAVE_CAR"].mode()[0]

# Replace 'M' with the mode value
train_dataset.loc[train_dataset["NEW_CERTAINLY_HAVE_CAR"] == "M", "NEW_CERTAINLY_HAVE_CAR"] = mode_value

In [None]:
train_dataset["NEW_HOUSING_TYPE_RATING"] = train_dataset['NAME_HOUSING_TYPE'].map({
    'House / apartment': 1, 'Co-op apartment': 2, 'Municipal apartment': 3,
    'Rented apartment': 4, 'Office apartment': 6, 'With parents': 5
})

# Sum of ratings indicating a better region or neighborhood
train_dataset["NEW_REGION_RATING_TOTAL"] = train_dataset["REGION_RATING_CLIENT"] + train_dataset["REGION_RATING_CLIENT_W_CITY"]

# Total rating combining housing and regional scores
train_dataset["NEW_REGION_HOUSING_TOTAL_RATING"] = train_dataset["NEW_HOUSING_TYPE_RATING"] + train_dataset["NEW_REGION_RATING_TOTAL"]

# Set credit score based on ownership and total regional/housing score.
# Note: no rule is defined for FLAG_OWN_REALTY == "Y" with a total rating of 4,
# so those rows stay NaN here (NEW_REALTY_SCORE is filled with the median later).
realty_scores = {
    "Y": {3: 100, 5: 90, 6: 85, 7: 80, 8: 75, 9: 70, 10: 65, 11: 60, 12: 55},
    "N": {3: 50, 4: 45, 5: 40, 6: 35, 7: 30, 8: 25, 9: 20, 10: 15, 11: 10, 12: 5},
}
for owns_realty, rating_to_score in realty_scores.items():
    for rating, score in rating_to_score.items():
        mask = (train_dataset["FLAG_OWN_REALTY"] == owns_realty) & (train_dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == rating)
        train_dataset.loc[mask, "NEW_REALTY_SCORE"] = score

In [None]:
# Credit amount as a percentage of the price of the goods being financed
train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] = (train_dataset['AMT_CREDIT'] / train_dataset['AMT_GOODS_PRICE']) * 100

# Classify the percentage into risk levels
train_dataset.loc[(train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 25), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Very_Safe'
train_dataset.loc[(train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 25) & (train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 50), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Safe'
train_dataset.loc[(train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 50) & (train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 75), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Moderate_Risk'
train_dataset.loc[(train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 75) & (train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 100), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Risky'
train_dataset.loc[(train_dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 100), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Very_Risky'
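
The five threshold rules above form contiguous, right-closed intervals, so they can also be written as a single `pd.cut` call. The sketch below applies the same boundaries to a few made-up percentages; it is an equivalent formulation for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up credit-to-goods-price percentages, including boundary values
pct = pd.Series([20.0, 25.0, 60.0, 100.0, 111.6])

# Bins are right-closed by default: (-inf, 25], (25, 50], (50, 75], (75, 100], (100, inf)
levels = pd.cut(
    pct,
    bins=[-np.inf, 25, 50, 75, 100, np.inf],
    labels=["Very_Safe", "Safe", "Moderate_Risk", "Risky", "Very_Risky"],
)

print(levels.tolist())
```

One difference to keep in mind is that `pd.cut` returns a categorical column rather than plain strings; missing percentages stay missing in both versions.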

In [None]:
# Assessing the ability to repay the credit based on debt-income ratio
train_dataset.loc[(train_dataset["AMT_ANNUITY"] / train_dataset["AMT_INCOME_TOTAL"]) <= 0.4, "MAY_PAY_STATUS"] = 1
train_dataset.loc[(train_dataset["AMT_ANNUITY"] / train_dataset["AMT_INCOME_TOTAL"]) > 0.4, "MAY_PAY_STATUS"] = 0

# Additional financial indicators
train_dataset['DAYS_EMPLOYED_PERC'] = train_dataset['DAYS_EMPLOYED'] / train_dataset['DAYS_BIRTH']
train_dataset['INCOME_PER_PERSON'] = train_dataset['AMT_INCOME_TOTAL'] / train_dataset['CNT_FAM_MEMBERS']
train_dataset['PAYMENT_RATE'] = train_dataset['AMT_ANNUITY'] / train_dataset['AMT_CREDIT']
train_dataset['CREDIT_YEAR'] = train_dataset['AMT_CREDIT'] / train_dataset['AMT_ANNUITY']

# Convert DAYS_BIRTH to age in years and categorize into age groups
train_dataset["NEW_DAYS_YEAR"] = (train_dataset["DAYS_BIRTH"] / 365).round().astype(int)
labels = ['Young', 'Young Adult', 'Middle-aged', 'Elderly', 'Very Elderly']
train_dataset['NEW_AGE_GROUP'] = pd.qcut(train_dataset['NEW_DAYS_YEAR'], q=5, labels=labels)

# Calculate a weighted average of external sources
train_dataset['NEW_EXT_SOURCE'] = train_dataset['EXT_SOURCE_2']*0.4 + train_dataset['EXT_SOURCE_3']*0.33 + train_dataset['EXT_SOURCE_1']*0.27

# Adjusting DAYS_EMPLOYED to represent months and setting employment duration risk levels
train_dataset['DAYS_EMPLOYED'] = train_dataset["DAYS_EMPLOYED"] / 30
train_dataset.loc[train_dataset["DAYS_EMPLOYED"] > 12, "NEW_DAYS_EMPLOYED_LEVEL"] = "Very Safe"
train_dataset.loc[(train_dataset["DAYS_EMPLOYED"] > 6) & (train_dataset["DAYS_EMPLOYED"] <= 12), "NEW_DAYS_EMPLOYED_LEVEL"] = "Safe"
train_dataset.loc[(train_dataset["DAYS_EMPLOYED"] >= 0) & (train_dataset["DAYS_EMPLOYED"] <= 6), "NEW_DAYS_EMPLOYED_LEVEL"] = "Risky"

# Setting credit request risk levels based on the number of inquiries per year
train_dataset.loc[train_dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] <= 5, 'NEW_REQ_YEAR'] = "Very Safe"
train_dataset.loc[(train_dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] > 5) & (train_dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] < 10), 'NEW_REQ_YEAR'] = "Safe"
train_dataset.loc[train_dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] >= 10, 'NEW_REQ_YEAR'] = "Risky"
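
To make the ratio features concrete, the sketch below recomputes three of them by hand for the first row shown in `train_dataset.head()` (SK_ID_CURR 100002). The input amounts are copied from that output; the variable names are chosen for this illustration.

```python
# Amounts for SK_ID_CURR 100002, taken from the head() output earlier
amt_income_total = 202500.0
amt_credit = 406597.5
amt_annuity = 24700.5

# Annuity as a share of income; at or below 0.4 gives MAY_PAY_STATUS = 1
debt_to_income = amt_annuity / amt_income_total
may_pay_status = 1 if debt_to_income <= 0.4 else 0

# Share of the credit repaid per annuity, and its inverse (number of annuities)
payment_rate = amt_annuity / amt_credit
credit_year = amt_credit / amt_annuity

print(f"debt_to_income = {debt_to_income:.4f}, MAY_PAY_STATUS = {may_pay_status}")
print(f"PAYMENT_RATE   = {payment_rate:.4f}")
print(f"CREDIT_YEAR    = {credit_year:.2f}")
```

Note that `PAYMENT_RATE` and `CREDIT_YEAR` are reciprocals of each other, so they carry the same information on different scales.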

In [None]:
from sklearn.preprocessing import LabelEncoder

def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

# Identifying binary columns
binary_cols = [col for col in train_dataset.columns if train_dataset[col].dtypes == "O" and len(train_dataset[col].unique()) == 2]
# Removing the target column from the list of binary columns
binary_cols = [col for col in binary_cols if col != "TARGET"]

for col in binary_cols:
    label_encoder(train_dataset, col)
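
scikit-learn's `LabelEncoder` assigns integer codes in the sorted order of the class labels, so for a `Y`/`N` column `N` becomes 0 and `Y` becomes 1. The pandas-only sketch below reproduces that mapping on a toy frame and keeps the label-to-code dictionaries, which the cell above does not retain. Column values here are illustrative, and `mappings` is a name introduced for this example.

```python
import pandas as pd

# Toy frame: two binary columns and one column with three classes
toy = pd.DataFrame({
    "FLAG_OWN_REALTY": ["Y", "N", "Y", "Y"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans"],
    "NAME_HOUSING_TYPE": ["With parents", "Rented apartment", "House / apartment", "With parents"],
})

# Only columns with exactly two distinct values are encoded
binary_cols = [col for col in toy.columns if toy[col].nunique() == 2]

mappings = {}
for col in binary_cols:
    labels = sorted(toy[col].unique())  # same ordering rule as LabelEncoder
    mappings[col] = {label: code for code, label in enumerate(labels)}
    toy[col] = toy[col].map(mappings[col])

print(mappings)
```

Keeping the dictionaries makes it possible to apply the identical codes to another dataset and to decode the integers later.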

In [None]:
# ONE-HOT ENCODING

# Selecting only the categorical columns excluding 'SK_ID_CURR' and 'TARGET'
categorical_columns = train_dataset.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_columns = [col for col in categorical_columns if col not in ['SK_ID_CURR', 'TARGET']]

# Applying one-hot encoding to these categorical columns
df_encoded = pd.get_dummies(train_dataset, columns=categorical_columns, drop_first=True)
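
With `drop_first=True`, `pd.get_dummies` turns a column with k categories into k-1 indicator columns, dropping the first category in sorted order as the implicit baseline. The result is a new frame: in the cell above it is stored in `df_encoded`, and `train_dataset` itself is left unchanged. A small sketch on made-up rows:

```python
import pandas as pd

# Toy frame with one three-class categorical column
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents", "Rented apartment"],
})

encoded = pd.get_dummies(toy, columns=["NAME_HOUSING_TYPE"], drop_first=True)

# 'House / apartment' sorts first, so it becomes the dropped baseline
print(list(encoded.columns))
```

A row with all indicator columns equal to zero therefore represents the baseline category.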


In [None]:
# Selecting columns with object and category data types
categorical_columns = train_dataset.select_dtypes(include=['object', 'category']).columns.tolist()
# Printing the list of categorical columns
print(categorical_columns)
# Dropping the categorical columns from the dataset
train_dataset = train_dataset.drop(columns=categorical_columns)

['CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'NEW_CERTAINLY_HAVE_CAR', 'NEW_PERCENT_CREDIT_LEVEL', 'NEW_AGE_GROUP', 'NEW_DAYS_EMPLOYED_LEVEL', 'NEW_REQ_YEAR']


In [None]:
drop_col_list = ["CODE_GENDER","DAYS_BIRTH","FLAG_OWN_CAR","REGION_RATING_CLIENT","REGION_RATING_CLIENT_W_CITY","EXT_SOURCE_1","EXT_SOURCE_2",
                 "EXT_SOURCE_3","APARTMENTS_AVG","BASEMENTAREA_AVG","YEARS_BEGINEXPLUATATION_AVG","YEARS_BUILD_AVG",
                 "COMMONAREA_AVG","ELEVATORS_AVG","ENTRANCES_AVG","FLOORSMAX_AVG","FLOORSMIN_AVG","LANDAREA_AVG",
                 "LIVINGAPARTMENTS_AVG","LIVINGAREA_AVG","NONLIVINGAPARTMENTS_AVG","NONLIVINGAREA_AVG","APARTMENTS_MODE",
                 "BASEMENTAREA_MODE","YEARS_BEGINEXPLUATATION_MODE","YEARS_BUILD_MODE","COMMONAREA_MODE","ELEVATORS_MODE",
                 "ENTRANCES_MODE","FLOORSMAX_MODE","FLOORSMIN_MODE","LANDAREA_MODE","LIVINGAPARTMENTS_MODE","LIVINGAREA_MODE",
                 "NONLIVINGAPARTMENTS_MODE","NONLIVINGAREA_MODE","APARTMENTS_MEDI","BASEMENTAREA_MEDI","YEARS_BEGINEXPLUATATION_MEDI",
                 "YEARS_BUILD_MEDI","COMMONAREA_MEDI","ELEVATORS_MEDI","ENTRANCES_MEDI","FLOORSMAX_MEDI","FLOORSMIN_MEDI",
                 "LANDAREA_MEDI","LIVINGAPARTMENTS_MEDI","LIVINGAREA_MEDI","NONLIVINGAPARTMENTS_MEDI","NONLIVINGAREA_MEDI",
                 "OBS_30_CNT_SOCIAL_CIRCLE","DEF_30_CNT_SOCIAL_CIRCLE","OBS_60_CNT_SOCIAL_CIRCLE","DEF_60_CNT_SOCIAL_CIRCLE",
                 "FLAG_DOCUMENT_2","FLAG_DOCUMENT_4","FLAG_DOCUMENT_5","FLAG_DOCUMENT_6","FLAG_DOCUMENT_7","FLAG_DOCUMENT_8",
                 "FLAG_DOCUMENT_9","FLAG_DOCUMENT_10","FLAG_DOCUMENT_11","FLAG_DOCUMENT_12","FLAG_DOCUMENT_13","FLAG_DOCUMENT_14",
                 "FLAG_DOCUMENT_15","FLAG_DOCUMENT_16","FLAG_DOCUMENT_17","FLAG_DOCUMENT_18","FLAG_DOCUMENT_19","FLAG_DOCUMENT_20",
                 "FLAG_DOCUMENT_21","AMT_REQ_CREDIT_BUREAU_HOUR","AMT_REQ_CREDIT_BUREAU_DAY","AMT_REQ_CREDIT_BUREAU_WEEK",
                 "AMT_REQ_CREDIT_BUREAU_MON","AMT_REQ_CREDIT_BUREAU_QRT","NEW_OCCUPATION_RATING"]


# Remove the specified columns
for col in drop_col_list:
  if col in train_dataset.columns.tolist():
    train_dataset.drop(columns=col, inplace=True)

In [None]:
train_dataset.shape

(307511, 45)

After dropping the listed columns, train_dataset.shape returns (307511, 45): the dataset now has 307,511 rows and 45 columns. The raw training data contained 122 columns. The drop list targets raw features such as CODE_GENDER, DAYS_BIRTH, EXT_SOURCE_1 and most of the document flags, and the loop removes only those names that are still present in the DataFrame. The final count of 45 also reflects the categorical columns dropped in the previous step and the engineered features added earlier. The number of rows is unchanged at 307,511, so no observations were lost during column removal.
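The column removal described above can be sketched more compactly. This is a minimal illustration on a toy DataFrame (the frame `toy` and the name `NOT_A_COLUMN` are made up for the example); it assumes the `errors="ignore"` option of `DataFrame.drop`, which skips names that are not present instead of raising an error.

```python
import pandas as pd

# Toy frame standing in for train_dataset (hypothetical data).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "DAYS_BIRTH": [12000, 15000, 9000],
    "EXT_SOURCE_1": [0.1, 0.5, 0.9],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})

# The last name is deliberately absent from the frame.
drop_cols = ["DAYS_BIRTH", "EXT_SOURCE_1", "NOT_A_COLUMN"]

rows_before = len(toy)

# errors="ignore" drops the names that exist and silently skips the rest.
toy = toy.drop(columns=drop_cols, errors="ignore")

print(toy.shape)            # rows are untouched, only columns are removed
print(toy.columns.tolist())
```

Comparing the row count before and after the drop is a cheap way to confirm that only columns were removed.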

In [None]:
train_dataset['NEW_REALTY_SCORE']=train_dataset['NEW_REALTY_SCORE'].fillna(train_dataset['NEW_REALTY_SCORE'].median())

In [None]:
features = train_dataset.columns.tolist()

### 3.ii Test Set

In [None]:
def clean_feature(dataset):
  object_columns = dataset.select_dtypes(include='object').columns
  nan_object_columns = dataset[object_columns].isnull().sum()
  nan_object_columns = nan_object_columns[nan_object_columns > 0].index

  # Replace missing values with the mode for each of these columns
  for column in nan_object_columns:
      mode_value = dataset[column].mode()[0]  # Get the first mode value
      dataset[column] = dataset[column].fillna(mode_value)


  numeric_columns = dataset.select_dtypes(include=['int64', 'float64']).columns
  nan_numeric_columns = dataset[numeric_columns].isnull().sum()
  nan_numeric_columns = nan_numeric_columns[nan_numeric_columns > 0].index

  # Replace missing values with the mean for each of these numeric columns
  for column in nan_numeric_columns:
      mean_value = dataset[column].mean()  # Calculate the mean, ignoring NaN values
      dataset[column] = dataset[column].fillna(mean_value)


  flag_documents = [
      "FLAG_DOCUMENT_2", "FLAG_DOCUMENT_4", "FLAG_DOCUMENT_7",
      "FLAG_DOCUMENT_10", "FLAG_DOCUMENT_12", "FLAG_DOCUMENT_17"
  ]

  # Set any value of 1 to 0 in these flag columns of the dataset passed in
  for col in flag_documents:
      dataset[col] = dataset[col].apply(lambda x: 0 if x == 1 else x)

  # Calculate the mode of the NAME_FAMILY_STATUS column
  family_status_mode = dataset['NAME_FAMILY_STATUS'].mode()[0]


  # Replace 'Unknown' values with the mode
  dataset.loc[dataset['NAME_FAMILY_STATUS'] == 'Unknown', 'NAME_FAMILY_STATUS'] = family_status_mode

  dataset['DAYS_BIRTH'] = np.abs(dataset['DAYS_BIRTH'])
  dataset['DAYS_REGISTRATION'] = np.abs(dataset['DAYS_REGISTRATION'])
  dataset['DAYS_ID_PUBLISH'] = np.abs(dataset['DAYS_ID_PUBLISH'])
  dataset["DAYS_EMPLOYED"] = np.abs(dataset["DAYS_EMPLOYED"])
  dataset['DAYS_LAST_PHONE_CHANGE'] = np.abs(dataset['DAYS_LAST_PHONE_CHANGE'])

  # Derive NEW_CERTAINLY_HAVE_CAR from the ownership flag and car age: "Y", "N", or "M" when the two disagree
  dataset.loc[(dataset["FLAG_OWN_CAR"] == "N") & (dataset["OWN_CAR_AGE"] == -1), "NEW_CERTAINLY_HAVE_CAR"] = "N"
  dataset.loc[(dataset["FLAG_OWN_CAR"] == "N") & (dataset["OWN_CAR_AGE"] != -1), "NEW_CERTAINLY_HAVE_CAR"] = "M"
  dataset.loc[(dataset["FLAG_OWN_CAR"] == "Y") & (dataset["OWN_CAR_AGE"] != -1), "NEW_CERTAINLY_HAVE_CAR"] = "Y"
  dataset.loc[(dataset["FLAG_OWN_CAR"] == "Y") & (dataset["OWN_CAR_AGE"] == -1), "NEW_CERTAINLY_HAVE_CAR"] = "M"

  # Find the mode of the column excluding 'M'
  mode_value = dataset[dataset["NEW_CERTAINLY_HAVE_CAR"] != "M"]["NEW_CERTAINLY_HAVE_CAR"].mode()[0]

  # Replace 'M' with the mode value
  dataset.loc[dataset["NEW_CERTAINLY_HAVE_CAR"] == "M", "NEW_CERTAINLY_HAVE_CAR"] = mode_value

  dataset["NEW_HOUSING_TYPE_RATING"] = dataset['NAME_HOUSING_TYPE'].map({
      'House / apartment': 1, 'Co-op apartment': 2, 'Municipal apartment': 3,
      'Rented apartment': 4, 'Office apartment': 6, 'With parents': 5
  })

  # Sum of ratings indicating a better region or neighborhood
  dataset["NEW_REGION_RATING_TOTAL"] = dataset["REGION_RATING_CLIENT"] + dataset["REGION_RATING_CLIENT_W_CITY"]

  # Total rating combining housing and regional scores
  dataset["NEW_REGION_HOUSING_TOTAL_RATING"] = dataset["NEW_HOUSING_TYPE_RATING"] + dataset["NEW_REGION_RATING_TOTAL"]

  # Set credit score based on ownership and total regional/housing score
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 3),"NEW_REALTY_SCORE"] = 100
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 5),"NEW_REALTY_SCORE"] = 90
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 6),"NEW_REALTY_SCORE"] = 85
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 7),"NEW_REALTY_SCORE"] = 80
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 8),"NEW_REALTY_SCORE"] = 75
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 9),"NEW_REALTY_SCORE"] = 70
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 10),"NEW_REALTY_SCORE"] = 65
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 11),"NEW_REALTY_SCORE"] = 60
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "Y") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 12),"NEW_REALTY_SCORE"] = 55
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 3),"NEW_REALTY_SCORE"] = 50
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 4),"NEW_REALTY_SCORE"] = 45
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 5),"NEW_REALTY_SCORE"] = 40
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 6),"NEW_REALTY_SCORE"] = 35
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 7),"NEW_REALTY_SCORE"] = 30
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 8),"NEW_REALTY_SCORE"] = 25
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 9),"NEW_REALTY_SCORE"] = 20
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 10),"NEW_REALTY_SCORE"] = 15
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 11),"NEW_REALTY_SCORE"] = 10
  dataset.loc[(dataset["FLAG_OWN_REALTY"]== "N") & (dataset["NEW_REGION_HOUSING_TOTAL_RATING"] == 12),"NEW_REALTY_SCORE"] = 5

  # Credit amount as a percentage of the goods price
  dataset['NEW_PERCENT_CREDIT_GOODPRICE'] = (dataset['AMT_CREDIT'] / dataset['AMT_GOODS_PRICE']) * 100

  # Classify the percentage into risk levels
  dataset.loc[(dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 25), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Very_Safe'
  dataset.loc[(dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 25) & (dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 50), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Safe'
  dataset.loc[(dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 50) & (dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 75), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Moderate_Risk'
  dataset.loc[(dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 75) & (dataset['NEW_PERCENT_CREDIT_GOODPRICE'] <= 100), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Risky'
  dataset.loc[(dataset['NEW_PERCENT_CREDIT_GOODPRICE'] > 100), 'NEW_PERCENT_CREDIT_LEVEL'] = 'Very_Risky'

  # Assessing the ability to repay the credit based on debt-income ratio
  dataset.loc[(dataset["AMT_ANNUITY"] / dataset["AMT_INCOME_TOTAL"]) <= 0.4, "MAY_PAY_STATUS"] = 1
  dataset.loc[(dataset["AMT_ANNUITY"] / dataset["AMT_INCOME_TOTAL"]) > 0.4, "MAY_PAY_STATUS"] = 0

  # Additional financial indicators
  dataset['DAYS_EMPLOYED_PERC'] = dataset['DAYS_EMPLOYED'] / dataset['DAYS_BIRTH']
  dataset['INCOME_PER_PERSON'] = dataset['AMT_INCOME_TOTAL'] / dataset['CNT_FAM_MEMBERS']
  dataset['PAYMENT_RATE'] = dataset['AMT_ANNUITY'] / dataset['AMT_CREDIT']
  dataset['CREDIT_YEAR'] = dataset['AMT_CREDIT'] / dataset['AMT_ANNUITY']

  # Convert DAYS_BIRTH to age in years and categorize into age groups
  dataset["NEW_DAYS_YEAR"] = (dataset["DAYS_BIRTH"] / 365).round().astype(int)
  labels = ['Young', 'Young Adult', 'Middle-aged', 'Elderly', 'Very Elderly']
  dataset['NEW_AGE_GROUP'] = pd.qcut(dataset['NEW_DAYS_YEAR'], q=5, labels=labels)

  # Calculate a weighted average of external sources
  dataset['NEW_EXT_SOURCE'] = dataset['EXT_SOURCE_2']*0.4 + dataset['EXT_SOURCE_3']*0.33 + dataset['EXT_SOURCE_1']*0.27

  # Adjusting DAYS_EMPLOYED to represent months and setting employment duration risk levels
  dataset['DAYS_EMPLOYED'] = dataset["DAYS_EMPLOYED"] / 30
  dataset.loc[dataset["DAYS_EMPLOYED"] > 12, "NEW_DAYS_EMPLOYED_LEVEL"] = "Very Safe"
  dataset.loc[(dataset["DAYS_EMPLOYED"] > 6) & (dataset["DAYS_EMPLOYED"] <= 12), "NEW_DAYS_EMPLOYED_LEVEL"] = "Safe"
  dataset.loc[(dataset["DAYS_EMPLOYED"] >= 0) & (dataset["DAYS_EMPLOYED"] <= 6), "NEW_DAYS_EMPLOYED_LEVEL"] = "Risky"

  # Setting credit request risk levels based on the number of inquiries per year
  dataset.loc[dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] <= 5, 'NEW_REQ_YEAR'] = "Very Safe"
  dataset.loc[(dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] > 5) & (dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] < 10), 'NEW_REQ_YEAR'] = "Safe"
  dataset.loc[dataset["AMT_REQ_CREDIT_BUREAU_YEAR"] >= 10, 'NEW_REQ_YEAR'] = "Risky"

  from sklearn.preprocessing import LabelEncoder

  def label_encoder(dataframe, binary_col):
      labelencoder = LabelEncoder()
      dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
      return dataframe

  binary_cols = [col for col in dataset.columns if dataset[col].dtypes == "O" and len(dataset[col].unique()) == 2]

  binary_cols = [col for col in binary_cols if col != "TARGET"]

  for col in binary_cols:
      label_encoder(dataset, col)

  # ONE-HOT ENCODING

  # Selecting only the categorical columns excluding 'SK_ID_CURR' and 'TARGET'
  categorical_columns = dataset.select_dtypes(include=['object', 'category']).columns.tolist()
  categorical_columns = [col for col in categorical_columns if col not in ['SK_ID_CURR', 'TARGET']]

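  # Applying one-hot encoding to these categorical columns
  # NOTE: df_encoded is not used after this point; the categorical columns are dropped below instead
  df_encoded = pd.get_dummies(dataset, columns=categorical_columns, drop_first=True)

  # Drop the remaining object/category columns from dataset (the list is printed for reference)
  categorical_columns = dataset.select_dtypes(include=['object', 'category']).columns.tolist()
  print(categorical_columns)
  dataset = dataset.drop(columns=categorical_columns)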


  drop_col_list = ["CODE_GENDER","DAYS_BIRTH","FLAG_OWN_CAR","REGION_RATING_CLIENT","REGION_RATING_CLIENT_W_CITY","EXT_SOURCE_1","EXT_SOURCE_2",
                  "EXT_SOURCE_3","APARTMENTS_AVG","BASEMENTAREA_AVG","YEARS_BEGINEXPLUATATION_AVG","YEARS_BUILD_AVG",
                  "COMMONAREA_AVG","ELEVATORS_AVG","ENTRANCES_AVG","FLOORSMAX_AVG","FLOORSMIN_AVG","LANDAREA_AVG",
                  "LIVINGAPARTMENTS_AVG","LIVINGAREA_AVG","NONLIVINGAPARTMENTS_AVG","NONLIVINGAREA_AVG","APARTMENTS_MODE",
                  "BASEMENTAREA_MODE","YEARS_BEGINEXPLUATATION_MODE","YEARS_BUILD_MODE","COMMONAREA_MODE","ELEVATORS_MODE",
                  "ENTRANCES_MODE","FLOORSMAX_MODE","FLOORSMIN_MODE","LANDAREA_MODE","LIVINGAPARTMENTS_MODE","LIVINGAREA_MODE",
                  "NONLIVINGAPARTMENTS_MODE","NONLIVINGAREA_MODE","APARTMENTS_MEDI","BASEMENTAREA_MEDI","YEARS_BEGINEXPLUATATION_MEDI",
                  "YEARS_BUILD_MEDI","COMMONAREA_MEDI","ELEVATORS_MEDI","ENTRANCES_MEDI","FLOORSMAX_MEDI","FLOORSMIN_MEDI",
                  "LANDAREA_MEDI","LIVINGAPARTMENTS_MEDI","LIVINGAREA_MEDI","NONLIVINGAPARTMENTS_MEDI","NONLIVINGAREA_MEDI",
                  "OBS_30_CNT_SOCIAL_CIRCLE","DEF_30_CNT_SOCIAL_CIRCLE","OBS_60_CNT_SOCIAL_CIRCLE","DEF_60_CNT_SOCIAL_CIRCLE",
                  "FLAG_DOCUMENT_2","FLAG_DOCUMENT_4","FLAG_DOCUMENT_5","FLAG_DOCUMENT_6","FLAG_DOCUMENT_7","FLAG_DOCUMENT_8",
                  "FLAG_DOCUMENT_9","FLAG_DOCUMENT_10","FLAG_DOCUMENT_11","FLAG_DOCUMENT_12","FLAG_DOCUMENT_13","FLAG_DOCUMENT_14",
                  "FLAG_DOCUMENT_15","FLAG_DOCUMENT_16","FLAG_DOCUMENT_17","FLAG_DOCUMENT_18","FLAG_DOCUMENT_19","FLAG_DOCUMENT_20",
                  "FLAG_DOCUMENT_21","AMT_REQ_CREDIT_BUREAU_HOUR","AMT_REQ_CREDIT_BUREAU_DAY","AMT_REQ_CREDIT_BUREAU_WEEK",
                  "AMT_REQ_CREDIT_BUREAU_MON","AMT_REQ_CREDIT_BUREAU_QRT","NEW_OCCUPATION_RATING"]


  # Remove the specified columns
  for col in drop_col_list:
    if col in dataset.columns.tolist():
      dataset.drop(columns=col, inplace=True)

  dataset['NEW_REALTY_SCORE']=dataset['NEW_REALTY_SCORE'].fillna(dataset['NEW_REALTY_SCORE'].median())
  return dataset

In [None]:
test_dataset=clean_feature(test_dataset)

['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'NEW_CERTAINLY_HAVE_CAR', 'NEW_PERCENT_CREDIT_LEVEL', 'NEW_AGE_GROUP', 'NEW_DAYS_EMPLOYED_LEVEL', 'NEW_REQ_YEAR']


In [None]:
test_dataset.shape

(48744, 44)

Initially, the test set had a shape of (48744, 121): 48,744 rows and 121 columns. After the cleaning and feature engineering function is applied, its shape is (48744, 44). The number of rows is unchanged, while the number of columns falls from 121 to 44. This is one column fewer than the processed training set, which is consistent with the test set having no TARGET column.
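Because the model is later asked to predict on the processed test set, it can help to confirm that its columns match the training features. The following is a minimal sketch on toy frames (the data and the names `feature_cols`, `missing_in_test`, `extra_in_test` and `test_aligned` are made up for the example), not a step taken in this notebook.

```python
import pandas as pd

# Toy stand-ins for the processed train and test frames (hypothetical data).
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1.0, 2.0], "TARGET": [0, 1]})
test = pd.DataFrame({"AMT_CREDIT": [3.0], "SK_ID_CURR": [3]})

# Features the model is trained on: every training column except the label.
feature_cols = [c for c in train.columns if c != "TARGET"]

# Columns that appear on one side only.
missing_in_test = sorted(set(feature_cols) - set(test.columns))
extra_in_test = sorted(set(test.columns) - set(feature_cols))

if missing_in_test or extra_in_test:
    raise ValueError(f"column mismatch: missing={missing_in_test}, extra={extra_in_test}")

# Put the test columns in the same order as the training features.
test_aligned = test[feature_cols]
print(test_aligned.columns.tolist())
```

The same check applied to train_dataset and test_dataset would be expected to pass if the 45 and 44 columns differ only by TARGET.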

In [None]:
target = train_dataset.TARGET.value_counts()
target

TARGET
0    282686
1     24825
Name: count, dtype: int64

In [None]:
# Calculating the percentage of target class.
percenttarget=(target.values/len(train_dataset)*100)
percenttarget

array([91.92711805,  8.07288195])

The classes are heavily imbalanced: the no-default class (0) makes up about 91.93% of the training data, while the default class (1) accounts for about 8.07%. A majority-class classifier that always predicts no default would therefore reach 91.93% accuracy without identifying a single defaulter, so accuracy alone is a weak measure of model quality here.
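The baseline figure can be reproduced directly from the class counts printed above. This short sketch uses only those two counts; the variable names are illustrative.

```python
# Class counts reported by value_counts() above.
n_negative = 282686  # TARGET == 0 (no default)
n_positive = 24825   # TARGET == 1 (default)
n_total = n_negative + n_positive

# A classifier that always predicts 0 is right on every negative
# and wrong on every positive.
baseline_accuracy = n_negative / n_total

# It never flags a defaulter, so its recall on the default class is zero.
baseline_recall = 0 / n_positive

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall on defaults: {baseline_recall:.4f}")
```

Any model evaluated on accuracy should be compared against this baseline rather than against 50%.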

## 4.Modelling

In [None]:
# Splitting train_dataset into training and validation sets
X = train_dataset.drop(columns=['TARGET'])
y = train_dataset['TARGET']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


### 4.iii Light Gradient Boost Model

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score


# Splitting train_dataset into training and validation sets
X =  train_dataset.drop(columns=['TARGET'])
y = train_dataset['TARGET']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the LightGBM data containers
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val)
params = {
    'objective': 'binary',
    'metric': 'binary_error',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'min_data_in_leaf': 20,   # higher values can prevent overfitting
    'lambda_l1': 0.5,         # L1 regularization
    'bagging_fraction': 0.8,  # enables bagging (subsampling)
    'bagging_freq': 5         # perform bagging every 5 iterations
}

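# Train for a fixed number of boosting rounds; no early-stopping callback is passed,
# so training runs for all num_round rounds and val_data is used only for evaluation
num_round = 1000
bst = lgb.train(params, train_data, num_boost_round=num_round, valid_sets=[val_data])

# Predict default probabilities on the validation set
y_val_pred = bst.predict(X_val, num_iteration=bst.best_iteration)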


[LightGBM] [Info] Number of positive: 19876, number of negative: 226132
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.152792 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4313
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 43
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080794 -> initscore=-2.431606
[LightGBM] [Info] Start training from score -2.431606


In [None]:
# Convert predicted probabilities to class labels using a 0.17 decision threshold
y_val_pred_binary = [1 if pred > 0.17 else 0 for pred in y_val_pred]
accuracy_lgbm1 = accuracy_score(y_val, y_val_pred_binary)

print("Validation Accuracy:", accuracy_lgbm1)

# Calculate additional metrics: ROC-AUC uses the raw probabilities,
# while precision, recall and F1 use the thresholded labels
roc_auc = roc_auc_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred_binary)
recall = recall_score(y_val, y_val_pred_binary)
f1 = f1_score(y_val, y_val_pred_binary)

print(f"ROC-AUC: {roc_auc}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")


Validation Accuracy: 0.8638765588670472
ROC-AUC: 0.7580289672915319, Precision: 0.25384726017546383, Recall: 0.35663770458678523, F1-Score: 0.2965888086035961


The LightGBM model reaches a validation accuracy of 86.39% at the 0.17 threshold. This is below the 91.93% that a majority-class classifier would achieve, so accuracy alone is not a useful guide here. The ROC-AUC of 0.758 indicates a moderate ability to rank defaulters above non-defaulters. Precision is about 25.38%, meaning that roughly one in four applicants flagged as defaulters actually defaults, and recall is about 35.66%, meaning that the model captures just over a third of the true defaulters. The F1-score, the harmonic mean of precision and recall, is about 29.66%.
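The relationship between these figures can be checked by hand. The sketch below recomputes the F1-score from the reported precision and recall and expresses precision as false alarms per correctly flagged defaulter; the name `false_alarms_per_hit` is introduced only for this illustration.

```python
# Precision and recall as printed in the validation output above.
precision = 0.25384726017546383
recall = 0.35663770458678523

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Precision p means a fraction (1 - p) of flagged applicants are false alarms,
# so each correctly flagged defaulter comes with (1 - p) / p false alarms.
false_alarms_per_hit = (1 - precision) / precision

print(f"F1 recomputed: {f1:.4f}")
print(f"false alarms per true defaulter flagged: {false_alarms_per_hit:.2f}")
```

The recomputed F1 agrees with the value reported by f1_score, and the second figure shows that close to three non-defaulters are flagged for every defaulter caught at this threshold.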

In [None]:
test_pred = bst.predict(test_dataset, num_iteration=bst.best_iteration)
submission = pd.DataFrame({
    'SK_ID_CURR': test_dataset['SK_ID_CURR'],  # Applicant IDs from test_dataset
    'TARGET': test_pred  # Predicted default probabilities
})
submission.to_csv('submission1.csv',index=False)

The Kaggle score is 0.75121, the highest of all the models used in this notebook.





## 5.Results




Here we tabulate the accuracy, AUC and Kaggle score of the different models used in this notebook.


In [None]:
result ={
    'Model Name': ['Logistic Regression', 'RandomForest', 'XGBoost', 'Light Gradient Boost'],
    'Accuracy': [0.91, 0.91 , 0.92, 0.86],
    'AUC Score': [0.59, 0.71, 0.50, 0.75 ],
    'Kaggle Score': [0.53499, 0.54010 , 0.74150, 0.75121]
}

result = pd.DataFrame(result)
result

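Based on the results table, Light Gradient Boost achieves the highest Kaggle score (0.75121) and the highest AUC score (0.75) of the four models. Its accuracy (0.86) is the lowest in the table, which likely reflects the 0.17 decision threshold used for its predictions rather than a weaker ability to rank applicants.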
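Accuracy depends on the decision threshold, whereas AUC depends only on how the scores rank positives against negatives. The following minimal sketch uses made-up scores (not model output) and two hypothetical helpers, `auc_from_scores` and `accuracy_at`, to show how the same scores can give a high AUC and different accuracies at different thresholds.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold):
    """Accuracy after labelling every score above the threshold as positive."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy, imbalanced example: 8 negatives followed by 2 positives (made-up scores).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.18, 0.19, 0.20, 0.30, 0.25, 0.40]

auc = auc_from_scores(y_true, scores)
acc_default = accuracy_at(y_true, scores, 0.5)   # predicts every case as negative
acc_low = accuracy_at(y_true, scores, 0.17)      # flags both positives and 4 negatives

print(f"AUC: {auc}")
print(f"accuracy at 0.5: {acc_default}, accuracy at 0.17: {acc_low}")
```

In this toy case the lower threshold catches both positives at the cost of accuracy, while the AUC is the same regardless of threshold, which is why the two columns of the results table need not agree.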