# Decision Tree Classifier

This notebook reuses the loan_data dataset from the previous exercise to train decision trees, visualize them, and analyse the results.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv("loan_data.csv")

# cleanup and category conversion

# binary category: male -> 1, anything else -> 0
df["person_gender"] = (df["person_gender"] == "male").astype(int)
# ordinal categories
category_mapper = {'Doctorate': 4, 'Master': 3, 'Bachelor': 2, 'Associate': 1, 'High School': 0}
df['person_education'] = df['person_education'].map(category_mapper)

# convert binary categories
df["previous_loan_defaults_on_file"] = (df["previous_loan_defaults_on_file"] == "Yes").astype(int)

# One-hot encode nominal variables
from sklearn.preprocessing import OneHotEncoder

variables = ['person_home_ownership', 'loan_intent']

# use encoder
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
one_hot_encoded = encoder.fit_transform(df[variables]).astype(int)
df = pd.concat([df,one_hot_encoded],axis=1).drop(columns=variables)

df.head()

| Unnamed: 0 | person_age | person_gender | person_education | person_income | person_emp_exp | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | ... | person_home_ownership_MORTGAGE | person_home_ownership_OTHER | person_home_ownership_OWN | person_home_ownership_RENT | loan_intent_DEBTCONSOLIDATION | loan_intent_EDUCATION | loan_intent_HOMEIMPROVEMENT | loan_intent_MEDICAL | loan_intent_PERSONAL | loan_intent_VENTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 0 | 3 | 71948.0 | 0 | 35000.0 | 16.02 | 0.49 | 3.0 | 561 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 21.0 | 0 | 0 | 12282.0 | 0 | 1000.0 | 11.14 | 0.08 | 2.0 | 504 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 25.0 | 0 | 0 | 12438.0 | 3 | 5500.0 | 12.87 | 0.44 | 3.0 | 635 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 23.0 | 0 | 2 | 79753.0 | 0 | 35000.0 | 15.23 | 0.44 | 2.0 | 675 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 24.0 | 1 | 3 | 66135.0 | 1 | 35000.0 | 14.27 | 0.53 | 4.0 | 586 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
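
One thing worth checking after the ordinal mapping: `Series.map` with a dict turns any category that is missing from the dict into NaN. A minimal sketch on a toy frame, assuming a made-up `'PhD'` label that is not taken from loan_data.csv:

```python
import pandas as pd

# Toy data; 'PhD' is a hypothetical label that the mapper does not cover.
toy = pd.DataFrame({"person_education": ["Master", "High School", "PhD"]})
category_mapper = {'Doctorate': 4, 'Master': 3, 'Bachelor': 2, 'Associate': 1, 'High School': 0}

mapped = toy["person_education"].map(category_mapper)

# Categories that silently became NaN because they are not in the mapper
unmapped = toy.loc[mapped.isna(), "person_education"].unique().tolist()
print(unmapped)
```

An empty list from the same check on `df['person_education']` (run before the column is overwritten) would indicate that every category was covered.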


# Train a Decision Tree

In [3]:
# Prepare the model
y = df["loan_status"] # our target variable
X = df.drop(["loan_status"], axis=1) # our predictors

In [4]:
from sklearn.tree import DecisionTreeClassifier
# Fit the decision tree classifier with default hyper-parameters
clf = DecisionTreeClassifier()
clf.fit(X, y)
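
The cell above fits on the full dataset, so it gives no estimate of out-of-sample accuracy. Below is a minimal sketch of a held-out check using the `train_test_split` already imported in cell 1. Synthetic data from `make_classification` stands in for the loan data, and the 80/20 split and `max_depth=4` are arbitrary illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fully grown tree (default hyper-parameters) vs. a depth-limited tree
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

train_acc_full = full_tree.score(X_train, y_train)
test_acc_full = full_tree.score(X_test, y_test)
test_acc_shallow = shallow_tree.score(X_test, y_test)
print(train_acc_full, test_acc_full, test_acc_shallow)
```

A large gap between training and test accuracy for the fully grown tree would be a sign of overfitting.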

# Visualizing decision tree

In [5]:
# On my Mac, Graphviz's `dot` executable was not found on PATH, so add the Homebrew bin directory here
# https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft
import os
os.environ["PATH"] += os.pathsep + "/opt/homebrew/bin"

In [6]:
from sklearn.tree import export_graphviz
import subprocess
from sklearn import tree

# Export the decision tree to DOT format
export_graphviz(clf, 
                   feature_names=X.columns,  
                   class_names=['Not Approved', 'Approved'],
                   filled=True, rounded=True, node_ids=True, out_file='tree.dot')



# Convert DOT to SVG
subprocess.call(['dot', '-Tsvg', 'tree.dot', '-o', 'tree.svg'])

0
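
If Graphviz is not available, scikit-learn's `export_text` can print the top levels of a tree as plain text instead. A sketch on synthetic data, where the feature names `f0` to `f3` are placeholders rather than columns from the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Print only the first levels; deeper branches are summarised as truncated
text_view = export_text(demo_clf, feature_names=["f0", "f1", "f2", "f3"], max_depth=2)
print(text_view)
```

For the notebook's own model the analogous call would be `export_text(clf, feature_names=list(X.columns), max_depth=2)`.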

# Analysis of Decision Tree Nodes

### Node #0 (root node)

Feature: previous_loan_defaults_on_file <= 0.5  
gini = 0.346  
samples = 45,000  
value = [35,000 Not Approved, 10,000 Approved]  
class = Not Approved

The most important decision-making factor is whether someone has a previous loan default on file. If not (≤ 0.5 means "No"), the tree continues down the left branch. It makes sense that past defaults are an important factor here.

### Node #1 (left of node #0)

Feature: loan_percent_income <= 0.245  
gini = 0.495  
samples = 22,142  
value = [12,142 Not Approved, 10,000 Approved]  
class = Not Approved

For applicants with no previous defaults, the next most important factor is how much of their income the loan would take up.

If the loan is less than 24.5% of their income, there is a better chance of approval. The gini here is close to 0.5, the maximum for two classes, so this node is still highly mixed and no confident decision can be made at this point.

### Node #5450 (right of node #0)

gini = 0.0  
samples = 22,858  
value = [22,858 Not Approved, 0 Approved]  
class = Not Approved

Everyone in this group has a previous default, and every single one was not approved. Because gini = 0, the node is pure and the model is extremely confident here.
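
The gini values reported for these three nodes can be reproduced from the class counts with the usual formula, 1 minus the sum of squared class proportions. A small sketch (the helper name `gini` is illustrative):

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini([35_000, 10_000])        # Node #0
node1 = gini([12_142, 10_000])       # Node #1
node_right = gini([22_858, 0])       # Node #5450

print(round(root, 3), round(node1, 3), round(node_right, 3))  # 0.346 0.495 0.0
```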

## Interesting data logic/behavior:

- The tree shows that a previous loan default is the strongest predictor of loan rejection.
- Applicants with a previous default are always rejected, no matter what their income or credit score is.
- For those with no defaults, the loan amount relative to income is the next key factor to consider.

## New insights about dataset with decision tree

It’s surprising that the model rejects every applicant with a past default. There are no exceptions, as the gini of 0 shows. Also, at Node #1 the tree can’t clearly separate good from bad borrowers based on loan_percent_income alone: the class distribution is still close (approx. 12,000 vs 10,000).

# Advanced Tasks Related to Decision Trees

## Capabilities and Benefits of Decision Trees

1. Decision trees are easy to visualize and understand, and they mirror the way a human would reason through a decision. We can follow the nodes from the root to a leaf to see how a decision was made.
2. They can handle both numerical and categorical data effectively.
3. They can provide insights into feature importance by highlighting which variables contribute most to the decision-making process.
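
The feature-importance point can be illustrated with the `feature_importances_` attribute that fitted scikit-learn trees expose. A sketch on synthetic data, where the feature names `f0` to `f4` are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(5)]
demo_clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, normalised to sum to 1, sorted high to low
importances = pd.Series(demo_clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the notebook's classifier, `pd.Series(clf.feature_importances_, index=X.columns)` would be the analogous call.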

Decision trees are useful in various real-world applications:

1. Business Decision-Making
2. Healthcare Diagnostics
3. Financial Risk Assessment
4. Marketing Strategies

## Trying Decision Trees for another important variable

Here we try loan_percent_income as the target, to see how important this variable actually is and how it shows up in a decision tree.

In [7]:
X = df.drop(columns=['loan_percent_income'])
y = df['loan_percent_income']

In [8]:
# y is a continuous variable, so we need a regressor here
from sklearn.tree import DecisionTreeRegressor
# Fit the decision tree regressor with default hyper-parameters
rgr = DecisionTreeRegressor()
rgr.fit(X, y)

In [9]:
from sklearn.tree import export_graphviz
import subprocess
from sklearn import tree

# Export the decision tree to DOT format
export_graphviz(rgr, 
                   feature_names=X.columns,  
                   filled=True, rounded=True, node_ids=True, out_file='tree.dot')



# Convert DOT to SVG
subprocess.call(['dot', '-Tsvg', 'tree.dot', '-o', 'tree-alternate-variable.svg'])

0

## Decision Tree Analysis

### Node #0 (root node)

Condition: previous_loan_defaults_on_file <= 0.5  
squared_error = 0.173  
samples = 45,000  
value = 0.222

This is the first and most important split. It checks whether the applicant has no previous loan defaults. The average loan_percent_income across the whole dataset is approx. 0.222 (22.2% of income).

### Node #1 (left)

Condition: loan_percent_income <= 0.245  
squared_error = 0.248  
samples = 22,142  
value = 0.452

Among applicants with no defaults, the next most important variable is how much of their income the loan takes up. For this group the average loan_percent_income is higher, at about 45.2%. This tells us that people without previous defaults may take on relatively larger loans.

### Node #5442 (right)

squared_error = 0.0  
samples = 22,858  
value = 0.0

If a person has defaulted before, their loan_percent_income is always predicted as 0.


It's surprising that the tree can so confidently say that any applicant with a previous loan default ends up with a 0% loan-to-income ratio. Also, in Node #1 the squared error is still 0.248, meaning loan_percent_income is still widely spread among people with no defaults. This suggests that other variables such as credit_score, education, or income may interact here in more complex ways.
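
One sanity check on the numbers above, offered as an observation rather than a confirmed diagnosis: the reported node values coincide with the approval fractions from the classifier tree (10,000 of 45,000 at the root, 10,000 of 22,142 at Node #1), and for a 0/1 target with mean p the squared error is p(1 - p). In addition, cell 7 drops loan_percent_income from X, so a tree fit on that X could not split on it. Both points would be consistent with the rendered tree having been fit with loan_status as the target, though that is only an assumption; re-running cells 7 to 9 is the way to confirm. The arithmetic (the helper name `mean_and_mse` is illustrative):

```python
def mean_and_mse(n_positive, n_total):
    """Mean and squared error of a 0/1 target with n_positive ones out of n_total."""
    p = n_positive / n_total
    return p, p * (1 - p)

# Counts taken from the classifier nodes described earlier
root_value, root_mse = mean_and_mse(10_000, 45_000)
node1_value, node1_mse = mean_and_mse(10_000, 22_142)

print(round(root_value, 3), round(root_mse, 3))    # 0.222 0.173
print(round(node1_value, 3), round(node1_mse, 3))  # 0.452 0.248
```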