<div style="background-color:rgba(210, 129, 21, 0.5);">
    <h1><center>Understand the Models You Love</center></h1>
</div>

In this month's TPS, I am diving deep into some of the popular bagging and boosting models, starting with Random Forest. The purpose of this exercise is to get better at hyper-parameter tuning. Since there are already plenty of notebooks that tune Random Forest with Optuna and similar tools, this notebook only looks at the basic impact of each parameter.

I will vary the important parameters one at a time and note their impact. **Please note that these parameter observations are made independent of each other and only for the current data we have**. For speed, I am using 10000 samples with a simple 30% validation split.
The parameters are documented here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

**Feel free to run your own experiments and upvote if you find this code useful :)**

In [None]:
import random
random.seed(123)

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# importing evaluation and data split packages

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

# importing modelling packages

from sklearn.ensemble import RandomForestClassifier

In [None]:
# taking only 10000 rows as sample

train = pd.read_csv(r'../input/tabular-playground-series-oct-2021/train.csv',nrows=10000)
test = pd.read_csv(r'../input/tabular-playground-series-oct-2021/test.csv',nrows=10000)
sub = pd.read_csv(r'../input/tabular-playground-series-oct-2021/sample_submission.csv',nrows=10000)

<div style="background-color:rgba(210, 129, 21, 0.5);">
    <h1><center>Memory Reduction</center></h1>
</div>

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64','float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # inclusive bounds, so a value equal to a type's min or max still fits
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # float16 saves the most memory but keeps only about 3 significant digits
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                else:
                    df[col] = df[col].astype(np.float32)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train = reduce_mem_usage(train)
test  = reduce_mem_usage(test)

<div style="background-color:rgba(210, 129, 21, 0.5);">
    <h1><center>Data Splitting</center></h1>
</div>

In [None]:
X = train.drop(columns=["id", "target"]).copy()
y = train["target"].copy()
test_for_model = test.drop(columns=["id"]).copy()

# freeing up some memory

del train
del test 

# splitting into training and validation data

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=2021,stratify = y)

<div style="background-color:rgba(210, 129, 21, 0.5);">
    <h1><center>Random Forest</center></h1>
</div>

In [None]:
model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None,
                               min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                               max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0,
                               bootstrap=True, oob_score=False, n_jobs=None,
                               random_state=None, verbose=0, warm_start=False,
                               class_weight=None, ccp_alpha=0.0, max_samples=None)

# We will focus on the key ones: n_estimators, max_depth, max_features, criterion, max_samples, bootstrap and min_samples_leaf.
# In the experiments below, random_state is fixed at 2021 to control the bootstrapping randomness.

# **Estimators** - The number of trees in the forest

More trees usually give a more stable and somewhat better score, at the cost of more training time.
Adding trees does not generally make a random forest overfit, because the predictions of the trees are averaged. The improvement does flatten out beyond a certain number, though, so a very high value mostly costs time.

In [None]:
estimators = [100,200,500,1000,1500]

for est in estimators:
    model = RandomForestClassifier(n_estimators=est,random_state=2021)
    model.fit(X_train,y_train)
    print('No. of trees: ',est," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
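
To see where the gain flattens out without refitting from scratch each time, a forest can be grown in stages. The sketch below is only an illustration on synthetic data (`X_demo`, `y_demo` and the tree counts are made up, not the competition data). It relies on scikit-learn's `warm_start` and `oob_score` options, and reports out-of-bag accuracy rather than the AUC used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the TPS data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# warm_start=True keeps the trees already built and only adds the missing ones
forest = RandomForestClassifier(n_estimators=50, warm_start=True,
                                oob_score=True, random_state=0)

oob_by_trees = {}
for n_trees in [50, 100, 200]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_demo, y_demo)
    # out-of-bag accuracy: each row is scored only by trees that did not train on it
    oob_by_trees[n_trees] = forest.oob_score_
    print('No. of trees: ', n_trees, ' ', 'OOB accuracy: ', forest.oob_score_)
```

If the printed score barely moves between two stages, adding more trees beyond that point is unlikely to be worth the extra training time.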

# **Max Depth** - The maximum depth of each tree.

By default (None), nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
'None' is the default and a sensible starting point, but I have used some other values to compare performance.
We see no change from 50 to 100, probably because the trees stop growing before they reach a depth of 50, so the limit is never hit.

In [None]:
depth = [None,10,20,30,40,50,100]

for d in depth:
    model = RandomForestClassifier(max_depth = d,random_state=2021)
    model.fit(X_train,y_train)
    print('Max Depth: ',d," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
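
A guess about how deep the trees really grow can be checked directly, since every fitted tree in the forest exposes `get_depth()`. This is a small sketch on synthetic data (the data and the cap of 5 are assumptions chosen for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the TPS data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

unlimited = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0)
capped = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
unlimited.fit(X_demo, y_demo)
capped.fit(X_demo, y_demo)

# depth actually reached by each individual tree
depths_unlimited = [tree.get_depth() for tree in unlimited.estimators_]
depths_capped = [tree.get_depth() for tree in capped.estimators_]

print('Deepest tree with max_depth=None: ', max(depths_unlimited))
print('Deepest tree with max_depth=5:    ', max(depths_capped))
```

Any `max_depth` above the deepest unlimited tree cannot change the model, which would explain identical scores for two large values.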

# **Max Features** - Number of features to be considered when looking for the best split.

By default, we go with the sqrt of the number of features.
I have used 'sqrt' (called 'auto' in older scikit-learn versions), 'log2' (of n_features), None (all features) and 100 as trial values.
100 gave the best value, so it is worth including some integer values for this parameter when tuning.

In [None]:
max_features = ['sqrt','log2',100, None]  # 'auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn

for m in max_features:
    model = RandomForestClassifier(max_features = m,random_state=2021)
    model.fit(X_train,y_train)
    print('Max Features: ',m," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
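
To make the string options concrete, the number of features each setting resolves to can be read from the fitted trees through their `max_features_` attribute. The sketch below uses synthetic data with 64 features (an assumption picked so that the square root and log2 are round numbers); the trial value 16 is also just illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 64 features (assumption: not the TPS data)
X_demo, y_demo = make_classification(n_samples=300, n_features=64,
                                     n_informative=10, random_state=0)

resolved = {}
for m in ['sqrt', 'log2', 16, None]:
    model = RandomForestClassifier(n_estimators=5, max_features=m, random_state=0)
    model.fit(X_demo, y_demo)
    # every tree stores the feature count it actually tries at each split
    resolved[m] = model.estimators_[0].max_features_
    print('max_features: ', m, ' ', 'features tried per split: ', resolved[m])
```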

# **Criterion** - The function used to measure the quality of a split: Gini impurity or entropy (information gain)

The results are quite similar, but entropy does a little better here.

In [None]:
criterion = ['gini','entropy']

for c in criterion:
    model = RandomForestClassifier(criterion = c,random_state=2021)
    model.fit(X_train,y_train)
    print('Criterion: ',c," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
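
The two criteria are different ways of scoring how mixed the classes in a node are. As a minimal sketch of the textbook formulas (the helper names `gini` and `entropy` are mine, not scikit-learn functions), both are zero for a pure node and largest for an even split:

```python
import numpy as np

def gini(class_probs):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    p = np.asarray(class_probs, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(class_probs):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]  # a class with probability 0 contributes nothing
    return -np.sum(p * np.log2(p))

for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print('Class mix: ', probs, ' ', 'Gini: ', gini(probs), ' ', 'Entropy: ', entropy(probs))
```

Because both measures rank splits in a very similar way, close scores between the two settings are what one would usually expect.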

# **Bootstrap** - Default is True (each tree is built on a sample drawn with replacement); if False, the whole dataset is used to build each tree

The recommended value is True, which also does better here.

In [None]:
bootstrap = [True,False]

for b in bootstrap:
    model = RandomForestClassifier(bootstrap = b,random_state=2021)
    model.fit(X_train,y_train)
    print('Bootstrap: ',b," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
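
Sampling with replacement means each tree sees some rows several times and misses others entirely, which is what makes the trees differ from one another. The sketch below imitates one bootstrap draw with NumPy (the row count of 7000 is an assumption matching a 70% split of 10000 rows; this is not scikit-learn's internal code).

```python
import numpy as np

rng = np.random.default_rng(2021)
n_rows = 7000  # assumption: size of the training split

# one bootstrap sample: n_rows row indices drawn with replacement
drawn = rng.integers(0, n_rows, size=n_rows)

# share of distinct rows that made it into the sample
unique_fraction = np.unique(drawn).size / n_rows
print('Share of distinct rows in one bootstrap sample: ', unique_fraction)
```

The share comes out close to 1 - 1/e, roughly 63%, so about a third of the rows are left out of any single tree.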

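# **Max Samples** - Number of samples drawn from the data to build each tree (only used when bootstrap=True)

If None, each tree draws as many samples as there are rows, with replacement. A float is read as a fraction of the rows, while an integer is read as an absolute number of rows.
Using all samples gave us the best results; the score improved as the fraction increased up to 1.0.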

In [None]:
samples = [None,0.1,0.2,0.4,0.8,1.0]  # 1.0 is a fraction (needs a recent scikit-learn); the integer 1 would mean one row per tree

for s in samples:
    model = RandomForestClassifier(max_samples = s,random_state=2021)
    model.fit(X_train,y_train)
    print('Max Samples: ',s," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
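
The difference between integer and float values is easy to miss. The helper below is a hypothetical function written for this notebook (it is not part of scikit-learn); it follows the rule described in the documentation to show how many rows each setting draws, assuming 7000 training rows.

```python
def rows_drawn_per_tree(max_samples, n_rows):
    """Hypothetical helper: rows drawn per tree for a given max_samples value."""
    if max_samples is None:
        return n_rows              # draw as many rows as the data has
    if isinstance(max_samples, int):
        return max_samples         # integer: absolute number of rows
    return max(round(n_rows * max_samples), 1)  # float: fraction of the rows

n_rows = 7000  # assumption: size of the training split
for s in [None, 0.1, 0.8, 1.0, 1]:
    print('max_samples: ', s, ' ', 'rows per tree: ', rows_drawn_per_tree(s, n_rows))
```

Note how `1.0` and `1` give very different forests: the first uses a full-size bootstrap sample, the second builds every tree from a single row.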

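# **Min Samples Leaf** - The minimum number of samples required to be at a leaf node.

A split is only considered if it leaves at least this many training samples in each branch. This may have the effect of smoothing the model, especially in regression.
Float values, as used below, are read as a fraction of the training rows, so 0.1 means every leaf must hold at least 10% of the data.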

In [None]:
min_samples_leaf = [0.1,0.2,0.3,0.4,0.5]

for s in min_samples_leaf:
    model = RandomForestClassifier(min_samples_leaf = s,random_state=2021)
    model.fit(X_train,y_train)
    print('Min Samples in Leaf: ',s," ",'AUC: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
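
Integer values give finer control than large fractions. As a rough sketch on synthetic data (the data and the grid `[1, 5, 20, 50]` are assumptions for illustration), counting the leaves per tree shows how quickly this parameter shrinks the trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the TPS data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

mean_leaves = {}
for leaf in [1, 5, 20, 50]:
    model = RandomForestClassifier(n_estimators=20, min_samples_leaf=leaf, random_state=0)
    model.fit(X_demo, y_demo)
    # average number of leaves over the trees in the forest
    leaves = [tree.get_n_leaves() for tree in model.estimators_]
    mean_leaves[leaf] = sum(leaves) / len(leaves)
    print('Min Samples in Leaf: ', leaf, ' ', 'Average leaves per tree: ', mean_leaves[leaf])
```

Fewer, larger leaves mean smoother predicted probabilities, which can help or hurt AUC depending on the data.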

<div style="background-color:rgba(210, 129, 21, 0.5);">
    <h1><center>Your Turn</center></h1>
</div>

Please use this notebook to understand the parameters involved and then run your own trials, using Optuna or other optimization packages.
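
As one possible starting point for such trials, here is a minimal sketch using scikit-learn's own `RandomizedSearchCV` in place of Optuna. The synthetic data, the search ranges and the five iterations are all assumptions chosen to keep it fast; swap in `X_train` and `y_train` and wider ranges for a real search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the TPS data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# illustrative ranges for the parameters covered in this notebook
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2', None],
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 5, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2021),
    param_distributions=param_distributions,
    n_iter=5,            # number of random combinations to try
    scoring='roc_auc',   # same metric as the experiments above
    cv=3,
    random_state=2021,
)
search.fit(X_demo, y_demo)

print('Best parameters: ', search.best_params_)
print('Best CV AUC: ', search.best_score_)
```

Unlike the one-at-a-time loops above, this varies the parameters together, so it can pick up interactions between them.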
**I hope you found this useful, please upvote if you did :)**