# 6. Decision Trees and Ensemble Learning


This week, we'll talk about decision trees and tree-based ensemble algorithms.

## 6.1 Credit risk scoring project

* Dataset: https://github.com/gastonstat/CreditScoring

In [2]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

In [3]:
import plotly.io as pio

# Create a custom theme and set it as default
pio.templates["custom"] = pio.templates["plotly_white"]
pio.templates["custom"].layout.margin = {'b': 25, 'l': 25, 'r': 25, 't': 50}
pio.templates["custom"].layout.width = 450
pio.templates["custom"].layout.height = 300
pio.templates["custom"].layout.autosize = False
pio.templates["custom"].layout.font.family="Arial"
pio.templates["custom"].layout.title.update({"x":0.5, "xref":"paper", "font_family":"Arial Black"})
pio.templates["custom"].layout.xaxis.update({"showline":True, "linecolor":"darkgray"})
pio.templates["custom"].layout.yaxis.update({"showline":True, "linecolor":"darkgray"})
pio.templates["custom"].layout.colorway = ['#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD',
                                           '#8C564B', '#E377C2', '#7F7F7F', '#BCBD22', '#17BECF']
pio.templates.default = "custom"

## 6.2 Data cleaning and preparation

* Downloading the dataset
* Re-encoding the categorical variables
* Doing the train/validation/test split

In [5]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'

In [6]:
!curl -o CreditScoring.csv $data

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  178k  100  178k    0     0   527k      0 --:--:-- --:--:-- --:--:--  525k


In [7]:
!head CreditScoring.csv

"Status","Seniority","Home","Time","Age","Marital","Records","Job","Expenses","Income","Assets","Debt","Amount","Price"
1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
1,0,1,36,26,1,1,1,46,107,0,0,310,910
1,1,2,60,36,2,1,1,75,214,3500,0,650,1645
1,29,2,60,44,2,1,1,75,125,10000,0,1600,1800
1,9,5,12,27,1,1,1,35,80,0,0,200,1093
1,0,2,60,32,2,1,3,90,107,15000,0,1200,1957


In [8]:
df = pd.read_csv(data)

In [9]:
df.columns = df.columns.str.lower()

In [10]:
df.status.value_counts()

status
1    3200
2    1254
0       1
Name: count, dtype: int64

In [11]:
status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}

df.status = df.status.map(status_values)

In [12]:
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

df.home = df.home.map(home_values)

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

df.records = df.records.map(records_values)

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}

df.job = df.job.map(job_values)

In [13]:
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


In [14]:
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


In [15]:
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

In [16]:
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4421.0,4408.0,4437.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,131.0,5403.0,343.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,86.0,11573.0,1246.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,165.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0


In [17]:
df = df[df.status != 'unk'].reset_index(drop=True)

In [18]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

In [19]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [20]:
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values

In [21]:
del df_train['status']
del df_val['status']
del df_test['status']

In [22]:
df_train

Unnamed: 0,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,10,owner,36,36,married,no,freelance,75,0.0,10000.0,0.0,1000,1400
1,6,parents,48,32,single,yes,fixed,35,85.0,0.0,0.0,1100,1330
2,1,parents,48,40,married,no,fixed,75,121.0,0.0,0.0,1320,1600
3,1,parents,48,23,single,no,partime,35,72.0,0.0,0.0,1078,1079
4,5,owner,36,46,married,no,freelance,60,100.0,4000.0,0.0,1100,1897
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,18,private,36,45,married,no,fixed,45,220.0,20000.0,0.0,800,1600
2668,7,private,60,29,married,no,fixed,60,51.0,3500.0,500.0,1000,1290
2669,1,parents,24,19,single,no,fixed,35,28.0,0.0,0.0,400,600
2670,15,owner,48,43,married,no,freelance,60,100.0,18000.0,0.0,2500,2976


## 6.3 Decision trees

* What a decision tree looks like
* Training a decision tree 
* Overfitting
* Controlling the size of a tree

In [23]:
def assess_risk(client):
    if client['records'] == 'yes':
        if client['job'] == 'partime':  # spelled 'partime' in job_values
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else:
            return 'default'

In [24]:
xi = df_train.iloc[0].to_dict()

In [25]:
assess_risk(xi)

'ok'

In [35]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score
from sklearn.tree import export_text

In [36]:
numerical = ['seniority', 'time', 'age', 'expenses', 'income', 'assets', 'debt', 'amount', 'price']
categorical = ['home', 'marital', 'records', 'job']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(sparse_output=False), categorical),
        ('num', 'passthrough', numerical)
    ]
)

X_train = preprocessor.fit_transform(df_train[categorical + numerical])

In [37]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [38]:
X_val = preprocessor.transform(df_val[categorical + numerical])

In [39]:
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

np.float64(0.6670430602310431)

In [40]:
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

np.float64(1.0)

In [41]:
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

In [42]:
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train:', auc)

y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val:', auc)

train: 0.7054989859726213
val: 0.6685264343319367


In [129]:
print(export_text(dt, feature_names=list(preprocessor.get_feature_names_out())))

|--- cat__records_no <= 0.50
|   |--- num__seniority <= 6.50
|   |   |--- num__amount <= 862.50
|   |   |   |--- num__price <= 925.00
|   |   |   |   |--- num__income <= 117.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- num__income >  117.50
|   |   |   |   |   |--- class: 1
|   |   |   |--- num__price >  925.00
|   |   |   |   |--- num__price <= 1382.00
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- num__price >  1382.00
|   |   |   |   |   |--- class: 0
|   |   |--- num__amount >  862.50
|   |   |   |--- num__assets <= 8250.00
|   |   |   |   |--- cat__job_fixed <= 0.50
|   |   |   |   |   |--- num__assets <= 3425.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- num__assets >  3425.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- cat__job_fixed >  0.50
|   |   |   |   |   |--- num__age <= 31.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- num__age >  31.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |--- num__a

## 6.4 Decision tree learning algorithm

* Finding the best split for one column
* Finding the best split for the entire dataset
* Stopping criteria
* Decision tree learning algorithm

In [44]:
data = [
    [8000, 'default'],
    [2000, 'default'],
    [   0, 'default'],
    [5000, 'ok'],
    [5000, 'ok'],
    [4000, 'ok'],
    [9000, 'ok'],
    [3000, 'default'],
]

df_example = pd.DataFrame(data, columns=['assets', 'status'])
df_example

Unnamed: 0,assets,status
0,8000,default
1,2000,default
2,0,default
3,5000,ok
4,5000,ok
5,4000,ok
6,9000,ok
7,3000,default


In [45]:
df_example.sort_values('assets')

Unnamed: 0,assets,status
2,0,default
1,2000,default
7,3000,default
5,4000,ok
4,5000,ok
3,5000,ok
0,8000,default
6,9000,ok


In [46]:
Ts = [0, 2000, 3000, 4000, 5000, 8000]

In [47]:
T = 4000
df_left = df_example[df_example.assets <= T]
df_right = df_example[df_example.assets > T]

display(df_left)
print(df_left.status.value_counts(normalize=True))
display(df_right)
print(df_right.status.value_counts(normalize=True))

Unnamed: 0,assets,status
1,2000,default
2,0,default
5,4000,ok
7,3000,default


status
default    0.75
ok         0.25
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
3,5000,ok
4,5000,ok
6,9000,ok


status
ok         0.75
default    0.25
Name: proportion, dtype: float64


In [48]:
from IPython.display import display

In [49]:
for T in Ts:
    print(T)
    df_left = df_example[df_example.assets <= T]
    df_right = df_example[df_example.assets > T]
    
    display(df_left)
    print(df_left.status.value_counts(normalize=True))
    display(df_right)
    print(df_right.status.value_counts(normalize=True))

    print()

0


Unnamed: 0,assets,status
2,0,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
1,2000,default
3,5000,ok
4,5000,ok
5,4000,ok
6,9000,ok
7,3000,default


status
ok         0.571429
default    0.428571
Name: proportion, dtype: float64

2000


Unnamed: 0,assets,status
1,2000,default
2,0,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
3,5000,ok
4,5000,ok
5,4000,ok
6,9000,ok
7,3000,default


status
ok         0.666667
default    0.333333
Name: proportion, dtype: float64

3000


Unnamed: 0,assets,status
1,2000,default
2,0,default
7,3000,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
3,5000,ok
4,5000,ok
5,4000,ok
6,9000,ok


status
ok         0.8
default    0.2
Name: proportion, dtype: float64

4000


Unnamed: 0,assets,status
1,2000,default
2,0,default
5,4000,ok
7,3000,default


status
default    0.75
ok         0.25
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
3,5000,ok
4,5000,ok
6,9000,ok


status
ok         0.75
default    0.25
Name: proportion, dtype: float64

5000


Unnamed: 0,assets,status
1,2000,default
2,0,default
3,5000,ok
4,5000,ok
5,4000,ok
7,3000,default


status
default    0.5
ok         0.5
Name: proportion, dtype: float64


Unnamed: 0,assets,status
0,8000,default
6,9000,ok


status
default    0.5
ok         0.5
Name: proportion, dtype: float64

8000


Unnamed: 0,assets,status
0,8000,default
1,2000,default
2,0,default
3,5000,ok
4,5000,ok
5,4000,ok
7,3000,default


status
default    0.571429
ok         0.428571
Name: proportion, dtype: float64


Unnamed: 0,assets,status
6,9000,ok


status
ok    1.0
Name: proportion, dtype: float64
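
The tables above can be condensed into one number per threshold. The following is a minimal sketch, assuming the misclassification rate as the impurity measure and a plain (unweighted) average of the two sides. Both choices are simplifying assumptions, and the helper names `impurity` and `split_impurity` are hypothetical, not part of any library.

```python
import pandas as pd

# Same toy data as in the cell above
df_example = pd.DataFrame(
    [[8000, 'default'], [2000, 'default'], [0, 'default'], [5000, 'ok'],
     [5000, 'ok'], [4000, 'ok'], [9000, 'ok'], [3000, 'default']],
    columns=['assets', 'status'],
)

def impurity(status):
    """Misclassification rate: share of rows outside the majority class."""
    return 1 - status.value_counts(normalize=True).max()

def split_impurity(df, feature, T):
    """Unweighted average impurity of the groups `feature <= T` and `feature > T`."""
    left = df[df[feature] <= T]
    right = df[df[feature] > T]
    return (impurity(left.status) + impurity(right.status)) / 2

Ts = [0, 2000, 3000, 4000, 5000, 8000]
results = {T: split_impurity(df_example, 'assets', T) for T in Ts}

for T, imp in results.items():
    print('%5d -> %.3f' % (T, imp))

best_T = min(results, key=results.get)
print('best threshold:', best_T)
```

On this toy data the lowest average impurity is reached at `T = 3000`, where the left group is all `default` and the right group is 80% `ok`.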



In [50]:
data = [
    [8000, 3000, 'default'],
    [2000, 1000, 'default'],
    [   0, 1000, 'default'],
    [5000, 1000, 'ok'],
    [5000, 1000, 'ok'],
    [4000, 1000, 'ok'],
    [9000,  500, 'ok'],
    [3000, 2000, 'default'],
]

df_example = pd.DataFrame(data, columns=['assets', 'debt', 'status'])
df_example

Unnamed: 0,assets,debt,status
0,8000,3000,default
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok
7,3000,2000,default


In [51]:
df_example.sort_values('debt')

Unnamed: 0,assets,debt,status
6,9000,500,ok
1,2000,1000,default
3,5000,1000,ok
2,0,1000,default
5,4000,1000,ok
4,5000,1000,ok
7,3000,2000,default
0,8000,3000,default


In [52]:
thresholds = {
    'assets': [0, 2000, 3000, 4000, 5000, 8000],
    'debt': [500, 1000, 2000]
}

In [53]:
for feature, Ts in thresholds.items():
    print('#####################')
    print(feature)
    for T in Ts:
        print(T)
        df_left = df_example[df_example[feature] <= T]
        df_right = df_example[df_example[feature] > T]

        display(df_left)
        print(df_left.status.value_counts(normalize=True))
        display(df_right)
        print(df_right.status.value_counts(normalize=True))

        print()
    print('#####################')

#####################
assets
0


Unnamed: 0,assets,debt,status
2,0,1000,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
1,2000,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok
7,3000,2000,default


status
ok         0.571429
default    0.428571
Name: proportion, dtype: float64

2000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok
7,3000,2000,default


status
ok         0.666667
default    0.333333
Name: proportion, dtype: float64

3000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default
7,3000,2000,default


status
default    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok


status
ok         0.8
default    0.2
Name: proportion, dtype: float64

4000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default
5,4000,1000,ok
7,3000,2000,default


status
default    0.75
ok         0.25
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
3,5000,1000,ok
4,5000,1000,ok
6,9000,500,ok


status
ok         0.75
default    0.25
Name: proportion, dtype: float64

5000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
7,3000,2000,default


status
default    0.5
ok         0.5
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
6,9000,500,ok


status
default    0.5
ok         0.5
Name: proportion, dtype: float64

8000


Unnamed: 0,assets,debt,status
0,8000,3000,default
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
7,3000,2000,default


status
default    0.571429
ok         0.428571
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
6,9000,500,ok


status
ok    1.0
Name: proportion, dtype: float64

#####################
#####################
debt
500


Unnamed: 0,assets,debt,status
6,9000,500,ok


status
ok    1.0
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
7,3000,2000,default


status
default    0.571429
ok         0.428571
Name: proportion, dtype: float64

1000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok


status
ok         0.666667
default    0.333333
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default
7,3000,2000,default


status
default    1.0
Name: proportion, dtype: float64

2000


Unnamed: 0,assets,debt,status
1,2000,1000,default
2,0,1000,default
3,5000,1000,ok
4,5000,1000,ok
5,4000,1000,ok
6,9000,500,ok
7,3000,2000,default


status
ok         0.571429
default    0.428571
Name: proportion, dtype: float64


Unnamed: 0,assets,debt,status
0,8000,3000,default


status
default    1.0
Name: proportion, dtype: float64

#####################
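
The search over features and thresholds, together with the stopping criteria, can be put into a small recursive procedure. This is only a sketch under stated assumptions: impurity is the misclassification rate, the two sides are averaged without weights, and the function names (`best_split`, `build_tree`, `predict`) and the parameters `max_depth` and `min_samples` are hypothetical. It is not how scikit-learn implements trees.

```python
import pandas as pd

df_example = pd.DataFrame(
    [[8000, 3000, 'default'], [2000, 1000, 'default'], [0, 1000, 'default'],
     [5000, 1000, 'ok'], [5000, 1000, 'ok'], [4000, 1000, 'ok'],
     [9000, 500, 'ok'], [3000, 2000, 'default']],
    columns=['assets', 'debt', 'status'],
)

def impurity(status):
    """Misclassification rate of a group."""
    return 1 - status.value_counts(normalize=True).max()

def best_split(df, features):
    """Try every feature and threshold; return (feature, T, impurity) or None."""
    best = None
    for feature in features:
        # Candidate thresholds: all unique values except the largest,
        # so that neither side of the split is empty.
        for T in sorted(df[feature].unique())[:-1]:
            left = df[df[feature] <= T]
            right = df[df[feature] > T]
            avg = (impurity(left.status) + impurity(right.status)) / 2
            if best is None or avg < best[2]:
                best = (feature, T, avg)
    return best

def build_tree(df, features, depth=0, max_depth=2, min_samples=2):
    majority = df.status.mode()[0]
    # Stopping criteria: depth limit reached, group too small, or group already pure
    if depth >= max_depth or len(df) < min_samples or df.status.nunique() == 1:
        return {'leaf': majority}
    split = best_split(df, features)
    if split is None:  # no feature has more than one distinct value
        return {'leaf': majority}
    feature, T, _ = split
    return {
        'feature': feature,
        'threshold': T,
        'left': build_tree(df[df[feature] <= T], features, depth + 1, max_depth, min_samples),
        'right': build_tree(df[df[feature] > T], features, depth + 1, max_depth, min_samples),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while 'leaf' not in tree:
        side = 'left' if row[tree['feature']] <= tree['threshold'] else 'right'
        tree = tree[side]
    return tree['leaf']

tree = build_tree(df_example, ['assets', 'debt'])
print(tree)
print(predict(tree, {'assets': 9000, 'debt': 500}))
```

With these settings the root splits on `assets <= 3000`, and the right branch is then split on `debt <= 1000`.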


## 6.5 Decision trees parameter tuning

* selecting `max_depth`
* selecting `min_samples_leaf`

In [54]:
depths = [1, 2, 3, 4, 5, 6, 10, 15, 20, None]

for depth in depths: 
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train, y_train)
    
    y_pred = dt.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    
    print('%4s -> %.3f' % (depth, auc))

   1 -> 0.606
   2 -> 0.669
   3 -> 0.739
   4 -> 0.761
   5 -> 0.766
   6 -> 0.758
  10 -> 0.697
  15 -> 0.661
  20 -> 0.662
None -> 0.650


In [55]:
scores = []

for depth in [4, 5, 6]:
    for s in [1, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        
        scores.append((depth, s, auc))

In [56]:
columns = ['max_depth', 'min_samples_leaf', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)

In [90]:
pivot_df_scores = df_scores.pivot_table(values='auc', 
                         index='max_depth', 
                         columns='min_samples_leaf', 
                         aggfunc='first')





In [91]:
pivot_df_scores 

max_depth / min_samples_leaf,1,5,10,15,20,100,200,500
4,0.760726,0.760726,0.760726,0.763223,0.760432,0.75567,0.74726,0.679842
5,0.766429,0.767664,0.761785,0.77176,0.772272,0.763026,0.759073,0.679842
6,0.758994,0.763602,0.778345,0.786113,0.773265,0.776604,0.768642,0.679842


In [102]:
# Create heatmap using px.imshow
fig = px.imshow(pivot_df_scores,
                labels=dict(x="min_samples_leaf", y="max_depth", color="AUC Score"),
                color_continuous_scale='Viridis',
                aspect='auto',  # This maintains reasonable dimensions
                title='AUC Scores')

fig.update_xaxes(type='category', tickmode='array', ticktext=pivot_df_scores.columns)
fig.update_yaxes(type='category', tickmode='array', ticktext=pivot_df_scores.index)

In [104]:
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)
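
The same kind of search can also be delegated to scikit-learn's `GridSearchCV`. Note that this swaps in a different technique: it scores each combination with cross-validation rather than with the single validation set used above. The sketch runs on synthetic data from `make_classification`, which is an assumption made to keep it self-contained; with the credit data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption, not the credit data)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 5, 10, 15, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring='roc_auc',  # same metric as in the manual loops
    cv=5,               # 5-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```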

In [105]:
print(export_text(dt, feature_names=list(preprocessor.get_feature_names_out())))

|--- cat__records_no <= 0.50
|   |--- num__seniority <= 6.50
|   |   |--- num__amount <= 862.50
|   |   |   |--- num__price <= 925.00
|   |   |   |   |--- num__income <= 117.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- num__income >  117.50
|   |   |   |   |   |--- class: 1
|   |   |   |--- num__price >  925.00
|   |   |   |   |--- num__price <= 1382.00
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- num__price >  1382.00
|   |   |   |   |   |--- class: 0
|   |   |--- num__amount >  862.50
|   |   |   |--- num__assets <= 8250.00
|   |   |   |   |--- cat__job_fixed <= 0.50
|   |   |   |   |   |--- num__assets <= 3425.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- num__assets >  3425.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- cat__job_fixed >  0.50
|   |   |   |   |   |--- num__age <= 31.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- num__age >  31.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |--- num__assets >  8250.00
|   |   |   |   |--- num__income <= 13

## 6.6 Ensembles and random forest

* Board of experts
* Ensembling models 
* Random forest - ensembling decision trees
* Tuning random forest
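
The "board of experts" idea can be illustrated by hand before reaching for `RandomForestClassifier`. The sketch below is an assumption-laden illustration, not scikit-learn's actual implementation: it trains several decision trees, each on a bootstrap sample of the rows and with a random subset of features per split, and averages their predicted probabilities. The data is synthetic (`make_classification`), not the credit dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

rng = np.random.default_rng(1)
predictions = []

for i in range(25):
    # Each "expert" sees a bootstrap sample of the rows ...
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # ... and considers only a random subset of features at each split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    predictions.append(tree.predict_proba(X_va)[:, 1])

# The ensemble prediction is the average over all experts
y_pred = np.mean(predictions, axis=0)

print('single tree AUC:', round(roc_auc_score(y_va, predictions[0]), 3))
print('ensemble AUC:   ', round(roc_auc_score(y_va, y_pred), 3))
```

The averaged prediction is usually more stable than any single tree because the individual trees make different mistakes.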

In [106]:
from sklearn.ensemble import RandomForestClassifier

In [107]:
scores = []

for n in range(10, 201, 10):
    rf = RandomForestClassifier(n_estimators=n, random_state=1)
    rf.fit(X_train, y_train)

    y_pred = rf.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    
    scores.append((n, auc))

In [108]:
df_scores = pd.DataFrame(scores, columns=['n_estimators', 'auc'])

In [109]:
px.line(df_scores, x='n_estimators', y='auc', title='AUC Score vs. Number of Estimators')

In [110]:
scores = []

for d in [5, 10, 15]:
    for n in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=n,
                                    max_depth=d,
                                    random_state=1)
        rf.fit(X_train, y_train)

        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)

        scores.append((d, n, auc))

In [111]:
columns = ['max_depth', 'n_estimators', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)

In [113]:
fig = go.Figure()

for d in [5, 10, 15]:
    df_subset = df_scores[df_scores.max_depth == d]
    
    fig.add_trace(
        go.Scatter(
            x=df_subset.n_estimators,
            y=df_subset.auc,
            name=f'max_depth={d}',
            mode='lines+markers'
        )
    )

# Update layout
fig.update_layout(
    title='AUC Scores by Number of Estimators',
    xaxis_title='Number of Estimators',
    yaxis_title='AUC Score',
    width=800,
    height=500,
    hovermode='x'
)

# Show plot
fig.show()

In [114]:
max_depth = 10

In [115]:
scores = []

for s in [1, 3, 5, 10, 50]:
    for n in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=n,
                                    max_depth=max_depth,
                                    min_samples_leaf=s,
                                    random_state=1)
        rf.fit(X_train, y_train)

        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)

        scores.append((s, n, auc))

In [116]:
columns = ['min_samples_leaf', 'n_estimators', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)

In [121]:
# Define colors and values
colors = ['black', 'blue', 'orange', 'red', 'grey']
values = [1, 3, 5, 10, 50]

# Create the figure
fig = go.Figure()

# Add a line for each `min_samples_leaf` value
for s, col in zip(values, colors):
    df_subset = df_scores[df_scores.min_samples_leaf == s]
    fig.add_trace(go.Scatter(
        x=df_subset.n_estimators,
        y=df_subset.auc,
        mode='lines',
        line=dict(color=col),
        name=f'min_samples_leaf={s}'
    ))

# Customize layout
fig.update_layout(
    title="AUC vs n_estimators for Different min_samples_leaf Values",
    xaxis_title="n_estimators",
    yaxis_title="AUC",
    legend_title="min_samples_leaf",
    height=600,
    width=800
)

fig.show()

In [123]:
min_samples_leaf = 3

In [124]:
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=max_depth,
                            min_samples_leaf=min_samples_leaf,
                            random_state=1)
rf.fit(X_train, y_train)

Other useful parameters:

* `max_features`
* `bootstrap`

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
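
A short sketch of how these two parameters are set, again on synthetic data (an assumption made for self-containment). With `bootstrap=True`, scikit-learn can also report an out-of-bag score via `oob_score=True`, computed on the rows each tree did not see during training. The specific values below are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

rf = RandomForestClassifier(
    n_estimators=50,
    max_features=0.5,  # consider half of the features at each split
    bootstrap=True,    # train each tree on a bootstrap sample of the rows
    oob_score=True,    # evaluate on the rows left out of each bootstrap sample
    random_state=1,
)
rf.fit(X, y)

print('out-of-bag accuracy:', round(rf.oob_score_, 3))
```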

## 6.7 Gradient boosting and XGBoost

* Gradient boosting vs random forest
* Installing XGBoost
* Training the first model
* Performance monitoring
* Parsing xgboost's monitoring output
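
To make the contrast with random forest concrete, here is a minimal sketch of the sequential idea behind gradient boosting. It assumes a regression task with squared error, where "fixing the errors of the previous model" means fitting the next tree to the current residuals. The data is synthetic and the loop is a simplification for illustration; XGBoost's actual algorithm adds regularization and other refinements.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

eta = 0.3                               # learning rate (shrinkage)
prediction = np.full(len(y), y.mean())  # start from a constant model
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction          # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=1)
    tree.fit(X, residuals)              # the next model targets those errors
    prediction += eta * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print('MSE before boosting: %.1f' % errors[0])
print('MSE after 20 rounds: %.1f' % errors[-1])
```

Unlike a random forest, where trees are trained independently and averaged, each tree here depends on all the trees before it.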

In [125]:
!pip install xgboost


In [131]:
# import xgboost as xgb
from xgboost import XGBClassifier

In [132]:
xgboost = XGBClassifier()

xgb_params = {
    'n_estimators': 10,
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

xgboost = XGBClassifier(**xgb_params)

xgboost.fit(X_train, y_train)

In [134]:
y_pred = xgboost.predict_proba(X_val)[:, 1]

In [135]:
roc_auc_score(y_val, y_pred)

np.float64(0.8200848853261)

In [142]:
xgb_params = {
    'n_estimators': 200,
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'nthread': 8,
    # 'early_stopping_rounds': 10,
    'seed': 1,
    'verbosity': 1,
}

xgboost = XGBClassifier(**xgb_params)

eval_set=[(X_train, y_train), (X_val, y_val)]

xgboost.fit(X_train, y_train, eval_set=eval_set)

[0]	validation_0-auc:0.86743	validation_1-auc:0.78218
[1]	validation_0-auc:0.89168	validation_1-auc:0.78873
[2]	validation_0-auc:0.90565	validation_1-auc:0.79428
[3]	validation_0-auc:0.91788	validation_1-auc:0.80610
[4]	validation_0-auc:0.92357	validation_1-auc:0.81074
[5]	validation_0-auc:0.92944	validation_1-auc:0.81521
[6]	validation_0-auc:0.93658	validation_1-auc:0.81924
[7]	validation_0-auc:0.93977	validation_1-auc:0.82044
[8]	validation_0-auc:0.94718	validation_1-auc:0.81853
[9]	validation_0-auc:0.95001	validation_1-auc:0.82008
[10]	validation_0-auc:0.95359	validation_1-auc:0.81956
[11]	validation_0-auc:0.95632	validation_1-auc:0.82015
[12]	validation_0-auc:0.95897	validation_1-auc:0.81814
[13]	validation_0-auc:0.96139	validation_1-auc:0.81997
[14]	validation_0-auc:0.96590	validation_1-auc:0.82163
[15]	validation_0-auc:0.96742	validation_1-auc:0.82244
[16]	validation_0-auc:0.96963	validation_1-auc:0.82355
[17]	validation_0-auc:0.97140	validation_1-auc:0.82243
[18]	validation_0-au

In [143]:
evals_result = xgboost.evals_result()

In [148]:
evals_result.keys()

dict_keys(['validation_0', 'validation_1'])

In [149]:
results_df = pd.DataFrame({
    'Round': range(1, len(evals_result['validation_0']['auc']) + 1),
    'Train AUC': evals_result['validation_0']['auc'],
    'Validation AUC': evals_result['validation_1']['auc']
})

In [152]:
# Create a line chart using Plotly Express
fig = px.line(
    results_df,
    x='Round',
    y=['Train AUC', 'Validation AUC'],  # Specify the columns to plot
    title='AUC Over Rounds',
    labels={'value': 'AUC', 'Round': 'Boosting Round'}
)

# Show the figure
fig.show()
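
When only the printed training log is available (for example, captured from a terminal), it can be parsed as plain text. The sketch below assumes the tab-separated line format shown in the training output earlier in this section and uses its first ten lines as sample input; `parse_xgb_output` is a hypothetical helper, not an XGBoost function.

```python
# First ten lines of the training log shown earlier in this section
log = """
[0]\tvalidation_0-auc:0.86743\tvalidation_1-auc:0.78218
[1]\tvalidation_0-auc:0.89168\tvalidation_1-auc:0.78873
[2]\tvalidation_0-auc:0.90565\tvalidation_1-auc:0.79428
[3]\tvalidation_0-auc:0.91788\tvalidation_1-auc:0.80610
[4]\tvalidation_0-auc:0.92357\tvalidation_1-auc:0.81074
[5]\tvalidation_0-auc:0.92944\tvalidation_1-auc:0.81521
[6]\tvalidation_0-auc:0.93658\tvalidation_1-auc:0.81924
[7]\tvalidation_0-auc:0.93977\tvalidation_1-auc:0.82044
[8]\tvalidation_0-auc:0.94718\tvalidation_1-auc:0.81853
[9]\tvalidation_0-auc:0.95001\tvalidation_1-auc:0.82008
"""

def parse_xgb_output(text):
    """Turn log lines into (iteration, train_auc, val_auc) tuples."""
    rows = []
    for line in text.strip().split('\n'):
        it_part, train_part, val_part = line.split('\t')
        iteration = int(it_part.strip('[]'))
        train_auc = float(train_part.split(':')[1])
        val_auc = float(val_part.split(':')[1])
        rows.append((iteration, train_auc, val_auc))
    return rows

rows = parse_xgb_output(log)

# Round with the highest validation AUC among the parsed lines
best = max(rows, key=lambda r: r[2])
print(best)
```

Among these ten rounds, the validation AUC peaks at iteration 7 while the training AUC keeps climbing, which is the overfitting pattern the plot above makes visible.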

## 6.8 XGBoost parameter tuning

Tuning the following parameters:

* `eta`
* `max_depth`
* `min_child_weight`


In [176]:
xgboost = XGBClassifier()

xgb_params = {
    'n_estimators': 200,
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

eval_set=[(X_val, y_val)]

In [177]:
data = []
try_eta = [0.01, 0.03, 0.1, 0.3, 1]

for eta in try_eta:
    xgb_params_copy = xgb_params.copy()
    xgb_params_copy['eta'] = eta
    
    # Train the model
    xgboost = XGBClassifier(**xgb_params_copy)
    xgboost.fit(X_train, y_train, eval_set=eval_set, verbose=False)
    
    # Get the evaluation results directly and store in data
    evals_result = xgboost.evals_result()
    auc_values = evals_result['validation_0']['auc']
    
    # Flatten the results into the data list
    for i, auc in enumerate(auc_values):
        data.append({'eta': eta, 'Round': i + 1, 'AUC': auc})

In [178]:
# Create a DataFrame
scores_df = pd.DataFrame(data)

fig = px.line(scores_df, x='Round', y='AUC', color='eta', 
              title='AUC Scores by Learning Rate (eta)')

# Show the plot
fig.show()

In [179]:
xgb_params['eta'] = 0.1

In [181]:
data = []
try_max_depth = [3, 5, 7, 9]

for max_depth in try_max_depth:
    xgb_params_copy = xgb_params.copy()
    xgb_params_copy['max_depth'] = max_depth
    
    # Train the model
    xgboost = XGBClassifier(**xgb_params_copy)
    xgboost.fit(X_train, y_train, eval_set=eval_set, verbose=False)
    
    # Get the evaluation results directly and store in data
    evals_result = xgboost.evals_result()
    auc_values = evals_result['validation_0']['auc']
    
    # Flatten the results into the data list
    for i, auc in enumerate(auc_values):
        data.append({'max_depth': max_depth, 'Round': i + 1, 'AUC': auc})

In [184]:
# Create a DataFrame
scores_df = pd.DataFrame(data)

fig = px.line(scores_df, x='Round', y='AUC', color='max_depth', 
              title='AUC Scores by max_depth')

# Show the plot
fig.show() 

In [185]:
xgb_params['max_depth'] = 3

In [186]:
data = []
try_min_child_weight = [1, 3, 10, 20]

for min_child_weight in try_min_child_weight:
    xgb_params_copy = xgb_params.copy()
    xgb_params_copy['min_child_weight'] = min_child_weight
    
    # Train the model
    xgboost = XGBClassifier(**xgb_params_copy)
    xgboost.fit(X_train, y_train, eval_set=eval_set, verbose=False)
    
    # Get the evaluation results directly and store in data
    evals_result = xgboost.evals_result()
    auc_values = evals_result['validation_0']['auc']
    
    # Flatten the results into the data list
    for i, auc in enumerate(auc_values):
        data.append({'min_child_weight': min_child_weight, 'Round': i + 1, 'AUC': auc})

In [188]:
# Create a DataFrame
scores_df = pd.DataFrame(data)

fig = px.line(scores_df, x='Round', y='AUC', color='min_child_weight', 
              title='AUC Scores by min_child_weight')

# Show the plot
fig.show() 

In [189]:
xgb_params['min_child_weight'] = 10

Other parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Useful ones:

* `subsample` and `colsample_bytree`
* `lambda` and `alpha`

## 6.9 Selecting the final model

* Choosing between xgboost, random forest and decision tree
* Training the final model
* Saving the model

In [190]:
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)

In [191]:
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

np.float64(0.7856009784214477)

In [192]:
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(X_train, y_train)

In [197]:
y_pred = rf.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

np.float64(0.8265632946646969)

In [194]:
xgb_params

{'n_estimators': 200,
 'eta': 0.1,
 'max_depth': 3,
 'min_child_weight': 10,
 'eval_metric': 'auc',
 'objective': 'binary:logistic',
 'nthread': 8,
 'seed': 1,
 'verbosity': 1}

In [195]:
xgb_params = {'n_estimators': 200,
 'eta': 0.1,
 'max_depth': 3,
 'min_child_weight': 10,
 'eval_metric': 'auc',
 'objective': 'binary:logistic',
 'nthread': 8,
 'seed': 1,
 'verbosity': 1}

xgb_model = XGBClassifier(**xgb_params)
xgb_model.fit(X_train, y_train, verbose=False)

In [199]:
y_pred = xgb_model.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

np.float64(0.8342284032840086)

In [200]:
df_full_train = df_full_train.reset_index(drop=True)

In [201]:
y_full_train = (df_full_train.status == 'default').astype(int).values

In [202]:
del df_full_train['status']

In [204]:
X_full_train = preprocessor.transform(df_full_train)
X_test = preprocessor.transform(df_test)

In [205]:
xgb_model.fit(X_full_train, y_full_train)

In [206]:
y_pred = xgb_model.predict_proba(X_test)[:, 1]

In [207]:
roc_auc_score(y_test, y_pred)

np.float64(0.8298286984995844)
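
Saving the model is listed in this section's outline but not shown. One common option is Python's `pickle` module. The sketch below is a hedged illustration: it uses a small scikit-learn tree on synthetic data as a stand-in, and the file name `model.bin` is an arbitrary choice. In this project you would typically pickle the fitted `preprocessor` together with `xgb_model`, for example as a tuple, so that both are available at prediction time.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in model on synthetic data (assumption, not the final XGBoost model)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), 'model.bin')

# Save the fitted model to disk
with open(path, 'wb') as f_out:
    pickle.dump(model, f_out)

# Load it back, as a separate prediction service would
with open(path, 'rb') as f_in:
    loaded = pickle.load(f_in)

same = np.allclose(model.predict_proba(X), loaded.predict_proba(X))
print('predictions identical after reload:', same)
```

Only unpickle files you trust, and load them with the same library versions that were used for saving.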

## 6.10 Summary

* Decision trees learn if-then-else rules from data.
* Finding the best split: select the least impure split. This algorithm can overfit, which is why we control it by limiting the max depth and the minimum size of a group (leaf).
* Random forest is a way of combining multiple decision trees. It needs a diverse set of models to make good predictions.
* Gradient boosting trains models sequentially: each model tries to fix the errors of the previous one. XGBoost is an implementation of gradient boosting.

## 6.11 Explore more

* For this dataset we didn't do EDA or feature engineering. You can do it to get more insights into the problem.
* For random forest, there are more parameters that we can tune. Check `max_features` and `bootstrap`.
* There's a variation of random forest called "extremely randomized trees", or "extra trees". Instead of selecting the best split among all possible thresholds, it draws a few thresholds at random and picks the best one among them. This extra randomness tends to make extra trees less prone to overfitting. In Scikit-Learn, they are implemented in `ExtraTreesClassifier`. Try it for this project.
* XGBoost can deal with NAs - we don't have to do `fillna` for it. Check whether leaving NAs unfilled helps improve performance.
* Experiment with other XGBoost parameters: `subsample` and `colsample_bytree`.
* When selecting the best split, decision trees find the most useful features. This information can be used for understanding which features are more important than others. See the examples for [random forest](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) (it's the same for plain decision trees) and for [xgboost](https://stackoverflow.com/questions/37627923/how-to-get-feature-importance-in-xgboost).
* Trees can also be used for solving regression problems: check `DecisionTreeRegressor`, `RandomForestRegressor` and the `objective=reg:squarederror` parameter for XGBoost.
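
As a starting point for two of the suggestions above, here is a small sketch that fits an `ExtraTreesClassifier` and reads its `feature_importances_`. It runs on synthetic data with made-up feature names (`f0` ... `f5`), which are assumptions for illustration; for this project you would fit on `X_train` and use `preprocessor.get_feature_names_out()` as the index.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with hypothetical feature names (assumption)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=1)
feature_names = [f'f{i}' for i in range(X.shape[1])]

et = ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=1)
et.fit(X, y)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(et.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```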