# Decision Trees

<font color = "brown" size = 5> Context </font>

<font size = 3 color = "white"> This dataset represents customer credit information used to predict whether a borrower will default on a loan. It contains demographic information, financial history, employment details, credit account behavior, and the final repayment outcome. The goal is to build a classification model that can differentiate between customers who are likely to repay their loan on time and those who are at risk of defaulting.

The dataset includes both numerical and categorical attributes (encoded as symbolic codes like A11, A12, etc.). The target variable is Default_On_Payment, where 1 indicates a loan default and 0 means successful repayment.

This dataset is commonly used for credit risk modeling, enabling financial institutions to understand borrower behavior and make informed lending decisions. In this project, it will be used to demonstrate model training, evaluation, and the effect of overfitting, especially in Decision Tree–based models. </font>

<font size =5 color = "brown"> Data Dictionary </font>

| Column Name                           | Meaning / Description                            | Data Type   | Example Values                    |
| ------------------------------------- | ------------------------------------------------ | ----------- | --------------------------------- |
| **Customer_ID**                       | Unique customer identifier                       | Integer     | `100001`, `100245`                |
| **Status_Checking_Acc**               | Status of checking (current) account             | Categorical | `A11`, `A12`, `A13`, `A14`        |
| **Duration_in_Months**                | Loan duration in months                          | Numeric     | `6`, `24`, `48`                   |
| **Credit_History**                    | Customer credit repayment history                | Categorical | `A30`, `A31`, `A32`, `A33`, `A34` |
| **Purposre_Credit_Taken** *(Purpose)* | Purpose of loan                                  | Categorical | `A40`, `A41`, `A42`, `A43`, `A46` |
| **Credit_Amount**                     | Loan amount requested/approved                   | Numeric     | `1169`, `5000`, `7882`            |
| **Savings_Acc**                       | Savings account/bond holdings                    | Categorical | `A61`, `A62`, `A63`, `A64`, `A65` |
| **Years_At_Present_Employment**       | Employment tenure at current job                 | Categorical | `A71`, `A72`, `A73`, `A74`, `A75` |
| **Inst_Rt_Income**                    | Installment rate as % of disposable income       | Numeric     | `1`, `2`, `3`, `4`                |
| **Marital_Status_Gender**             | Marital status & gender combined                 | Categorical | `A91`, `A92`, `A93`, `A94`        |
| **Other_Debtors_Guarantors**          | Other debtors / guarantors (co-applicants)       | Categorical | `A101`, `A102`, `A103`            |
| **Current_Address_Yrs**               | Years living at current residence                | Numeric     | `1`, `2`, `3`, `4`                |
| **Property**                          | Most valuable available asset                    | Categorical | `A121`, `A122`, `A123`, `A124`    |
| **Age**                               | Customer age in years                            | Numeric     | `22`, `35`, `49`, `67`            |
| **Other_Inst_Plans**                  | Other installment plans                          | Categorical | `A141`, `A142`, `A143`            |
| **Housing**                           | Housing status                                   | Categorical | `A151`, `A152`, `A153`            |
| **Num_CC**                            | # of existing credit accounts                    | Numeric     | `1`, `2`, `3`                     |
| **Job**                               | Customer job skill level                         | Categorical | `A171`, `A172`, `A173`, `A174`    |
| **Dependents**                        | Number of financial dependents                   | Numeric     | `1`, `2`                          |
| **Telephone**                         | Telephone availability                           | Categorical | `A191`, `A192`                    |
| **Foreign_Worker**                    | Whether a foreign worker                         | Categorical | `A201`, `A202`                    |
| **Default_On_Payment**                | **Target variable**: 1 = default, 0 = no default | Binary      | `0`, `1`                          |

----------------------------------

## Importing the libraries

In [63]:
# Import necessary libraries.
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_auc_score, roc_curve,
                             recall_score, precision_score, f1_score)


# To enable plotting graphs in Jupyter notebook
%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # display floats with 3 decimal places

## Reading the dataset

In [64]:
data = pd.read_csv("default_on_payment.csv")
df = data.copy()

## Checking the shape of the dataset

In [65]:
df.shape

(5000, 22)

We have 5000 observations and 22 columns: 20 predictors, the customer ID, and the target.

## Checking the column names

In [66]:
df.columns

Index(['Customer_ID', 'Status_Checking_Acc', 'Duration_in_Months',
       'Credit_History', 'Purposre_Credit_Taken', 'Credit_Amount',
       'Savings_Acc', 'Years_At_Present_Employment', 'Inst_Rt_Income',
       'Marital_Status_Gender', 'Other_Debtors_Guarantors',
       'Current_Address_Yrs', 'Property', 'Age', 'Other_Inst_Plans ',
       'Housing', 'Num_CC', 'Job', 'Dependents', 'Telephone', 'Foreign_Worker',
       'Default_On_Payment'],
      dtype='object')

## Dropping the customer ID

In [67]:
df.drop("Customer_ID", axis=1, inplace=True)

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Status_Checking_Acc          5000 non-null   object
 1   Duration_in_Months           5000 non-null   int64 
 2   Credit_History               5000 non-null   object
 3   Purposre_Credit_Taken        5000 non-null   object
 4   Credit_Amount                5000 non-null   int64 
 5   Savings_Acc                  5000 non-null   object
 6   Years_At_Present_Employment  5000 non-null   object
 7   Inst_Rt_Income               5000 non-null   int64 
 8   Marital_Status_Gender        5000 non-null   object
 9   Other_Debtors_Guarantors     5000 non-null   object
 10  Current_Address_Yrs          5000 non-null   int64 
 11  Property                     5000 non-null   object
 12  Age                          5000 non-null   int64 
 13  Other_Inst_Plans             5000

There are no missing values. After dropping the customer ID, we have 8 numeric columns (including the target) and 13 object columns.

<font size = 5 color = "brown"> Model Building </font>

Separating the predictors (X) from the target (y) and using dummy variables to convert the categorical columns to numeric type.

In [69]:
X = df.drop('Default_On_Payment', axis=1)
y = df['Default_On_Payment']

# Categorical predictors are the object-dtype columns of X (the target is numeric).
object_cols = X.select_dtypes(include=['object']).columns

X = pd.get_dummies(data = X, columns = object_cols, drop_first = True, dtype = int)


X.head()

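|   | Duration_in_Months | Credit_Amount | Inst_Rt_Income | Current_Address_Yrs | Age | Num_CC | Dependents | Status_Checking_Acc_A12 | Status_Checking_Acc_A13 | Status_Checking_Acc_A14 | ... | Property_A124 | Other_Inst_Plans _A142 | Other_Inst_Plans _A143 | Housing_A152 | Housing_A153 | Job_A172 | Job_A173 | Job_A174 | Telephone_A192 | Foreign_Worker_A202 |
| - | ------------------ | ------------- | -------------- | ------------------- | --- | ------ | ---------- | ----------------------- | ----------------------- | ----------------------- | --- | ------------- | ---------------------- | ---------------------- | ------------ | ------------ | -------- | -------- | -------- | -------------- | ------------------- |
| 0 | 6                  | 1169          | 4              | 4                   | 67  | 2      | 1          | 0                       | 0                       | 0                       | ... | 0             | 0                      | 1                      | 1            | 0            | 0        | 1        | 0        | 1              | 0                   |
| 1 | 48                 | 5951          | 2              | 2                   | 22  | 1      | 1          | 1                       | 0                       | 0                       | ... | 0             | 0                      | 1                      | 1            | 0            | 0        | 1        | 0        | 0              | 0                   |
| 2 | 12                 | 2096          | 2              | 3                   | 49  | 1      | 2          | 0                       | 0                       | 1                       | ... | 0             | 0                      | 1                      | 1            | 0            | 1        | 0        | 0        | 0              | 0                   |
| 3 | 42                 | 7882          | 2              | 4                   | 45  | 1      | 2          | 0                       | 0                       | 0                       | ... | 0             | 0                      | 1                      | 0            | 1            | 0        | 1        | 0        | 0              | 0                   |
| 4 | 24                 | 4870          | 3              | 4                   | 53  | 2      | 2          | 0                       | 0                       | 0                       | ... | 1             | 0                      | 1                      | 0            | 1            | 0        | 1        | 0        | 0              | 0                   |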
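To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not taken from the dataset; only the `Housing` codes mirror the data dictionary.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels, one numeric column.
toy = pd.DataFrame({
    "Housing": ["A151", "A152", "A153", "A152"],
    "Age": [25, 40, 33, 58],
})

# drop_first=True drops the first level (A151), which becomes the baseline:
# a row with A151 has 0 in every remaining Housing_* column.
encoded = pd.get_dummies(toy, columns=["Housing"], drop_first=True, dtype=int)
print(encoded)
```

A column with k levels therefore produces k - 1 dummy columns, which is why `Status_Checking_Acc` (four codes) appears above as only `_A12`, `_A13`, and `_A14`.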


Splitting the data into training and test sets

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
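As a quick sanity check on what `test_size=0.3` produces, the sketch below splits a synthetic array of the same length (5000 rows) and prints the resulting sizes. The arrays `X_toy` and `y_toy` are random placeholders, not the credit data. The second call adds `stratify`, which the notebook does not use; it is shown only as an option for keeping the class ratio equal across the two sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as the notebook's dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5000, 3))
y_toy = (rng.random(5000) < 0.4).astype(int)

# Same arguments as the notebook: 70% train, 30% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# Optional variant (an assumption, not part of the original notebook):
# stratify=y keeps the share of positives nearly identical in both sets.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_toy.mean(), y_tr_s.mean(), y_te_s.mean())
```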

# Base Decision Tree Model

Building a base decision tree model.

In [71]:
base_dtree = DecisionTreeClassifier(random_state=0)
base_dtree.fit(X_train, y_train)

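| Parameter                | Value  |
| ------------------------ | ------ |
| criterion                | 'gini' |
| splitter                 | 'best' |
| max_depth                | None   |
| min_samples_split        | 2      |
| min_samples_leaf         | 1      |
| min_weight_fraction_leaf | 0.0    |
| max_features             | None   |
| random_state             | 0      |
| max_leaf_nodes           | None   |
| min_impurity_decrease    | 0.0    |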


Predicting on the training dataset and checking the confusion matrix.

In [72]:
pred_train = base_dtree.predict(X_train)

confusion_matrix(y_train, pred_train)

array([[1791,  231],
       [ 605,  873]])

Predicting on the test dataset and checking the confusion matrix.

In [73]:
pred_test = base_dtree.predict(X_test)
confusion_matrix(y_test, pred_test)

array([[639, 211],
       [375, 275]])

F1 score of the train set.

In [74]:
f1_score(y_train, pred_train)

0.6762199845081333

F1 score of the test set.

In [75]:
f1_score(y_test, pred_test)

0.4841549295774648
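The two F1 scores can be reproduced by hand from the confusion matrices printed earlier. The sketch below copies those four counts per matrix and applies the standard definitions of precision, recall, and F1 for the positive class (label 1); the helper name `f1_from_confusion` is ours, not a scikit-learn function.

```python
# Confusion matrices copied from the notebook outputs
# (rows = actual class, columns = predicted class).
train_cm = [[1791, 231], [605, 873]]
test_cm = [[639, 211], [375, 275]]


def f1_from_confusion(cm):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # share of predicted defaults that were real defaults
    recall = tp / (tp + fn)     # share of real defaults that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for name, cm in [("train", train_cm), ("test", test_cm)]:
    p, r, f1 = f1_from_confusion(cm)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Breaking the score apart this way shows where the test F1 is lost: both precision and recall fall, and recall falls to well under half of the actual defaulters.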

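With the default settings (`max_depth=None`, `min_samples_split=2`, `min_samples_leaf=1`), the base decision tree keeps splitting until it can no longer split, so it behaves as a fully grown tree. Tree induction is also greedy: at each node it picks the split that looks best locally, without looking ahead. A fully grown tree tends to memorize the training data, and the drop in F1 score from 0.676 on the training set to 0.484 on the test set shows that the model is overfitting. We will see in the next code how to handle this overfitting.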
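One way to confirm that an unrestricted tree is fully grown is to inspect its depth and leaf count and compare train and test scores. The sketch below does this on synthetic data from `make_classification`, not on the credit dataset, so the numbers it prints are illustrative only; the variable names are ours.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (an assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Default settings: no depth limit, leaves may hold a single sample.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, full_tree.predict(X_tr))
test_f1 = f1_score(y_te, full_tree.predict(X_te))

print("depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("train F1:", train_f1)
print("test F1:", test_f1)
```

On continuous features like these, the unrestricted tree fits the training set perfectly while the test score is clearly lower. In the notebook the training F1 is 0.676 rather than 1.0; one plausible explanation, which we have not verified, is that some rows share identical feature values but carry different labels, so no split can separate them.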