# 1. Optimizations using Random Forest Classifiers 

This follow-up notebook aims to improve on the prediction accuracies established in the Baseline notebook, using `RandomForestClassifier` together with randomized search and grid search for hyperparameter tuning.

To do this, we first import the necessary libraries and the dataset, and carry over the insights gained from the Baseline. After that, we apply the aforementioned classifiers, tune their hyperparameters and evaluate the results.

Refer to the Baseline document (`Task02_SupervisedLearning_BaselineBarba_Guerrero_Schmidt.ipynb`) for further details on the steps performed so far.


The authors of this project are:

- [Raúl Barba Rojas](Raul.Barba@alu.uclm.es)
- [Diego Guerrero Del Pozo](Diego.Guerrero@alu.uclm.es)
- [Marvin Schmidt](Marvin.Schmidt@alu.uclm.es)

# 2. Preparations

Before we can work on the classifiers, the necessary libraries and data are loaded. This part follows the same steps as the Baseline document, so the code is not commented further.

## 2.1 - Library Imports

In [1]:
%matplotlib inline

from google.colab import output # only used for console print clearing (output.clear())
import timeit # runtime measurements
from datetime import datetime


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.metrics import f1_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV






## 2.2 - Dataset initialization



Data imports, just like in the baseline.
Feature selection from insights of decision tree classifier.

In [2]:
# import train_values.csv
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
df_train_values= pd.read_csv("train_values.csv", index_col = "building_id")

# import train_labels.csv
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
df_train_labels = pd.read_csv("train_labels.csv", index_col = "building_id")

# join into full dataset train_full.csv
df_train_values.join(df_train_labels).to_csv("train_full.csv")

# import test_values.csv
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
df_test_values = pd.read_csv('test_values.csv', index_col='building_id')

# import submission_format.csv
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/submission_format.csv
df_submission_format = pd.read_csv('submission_format.csv', index_col='building_id')

# clear console prints
output.clear()

df_train_values.head()

building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
802906,6,487,12198,2,30,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
28830,8,900,2812,2,10,8,7,o,r,n,...,0,0,0,0,0,0,0,0,0,0
94947,21,363,8973,2,10,5,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
590882,22,418,10694,2,10,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
201944,11,131,1488,3,30,8,9,t,r,n,...,0,0,0,0,0,0,0,0,0,0


## 2.3 - Pre-processing of dataset

OneHot encoded full dataset (full set of features)

In [3]:
# create df_train_values_norm: one-hot encode the categorical variables of the training data
df_train_values_norm = pd.get_dummies(df_train_values)

# create df_test_values_norm: one-hot encode the categorical variables of the test data
df_test_values_norm = pd.get_dummies(df_test_values)

#df_train_values_norm.head()

Selected features from DT classifier analysis

In [4]:
selected_features_classifier_dt =  ["geo_level_1_id", 
                                    "geo_level_2_id", 
                                    "geo_level_3_id",
                                    "foundation_type_r", 
                                    "area_percentage", 
                                    "age", 
                                    "height_percentage"]

OneHot encoded dataset with filtered features

In [5]:
# only keep selected features for training data
df_train_values_subset_norm = df_train_values_norm[selected_features_classifier_dt]

# only keep selected features for testing data
df_test_values_subset_norm = df_test_values_norm[selected_features_classifier_dt]

df_train_values_subset_norm.head()

building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,foundation_type_r,area_percentage,age,height_percentage
802906,6,487,12198,1,6,30,5
28830,8,900,2812,1,8,10,7
94947,21,363,8973,1,5,10,5
590882,22,418,10694,1,6,10,5
201944,11,131,1488,1,8,30,9


# 3. First approaches to Random Forests

## 3.1 Decision Tree classifier using Bagging

Split dataset into training and testing data.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df_train_values_subset_norm, df_train_labels, random_state=1)

for df in [X_train, X_test, y_train, y_test]:
  print(df.shape)

(195450, 7)
(65151, 7)
(195450, 1)
(65151, 1)
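
The shapes above follow from the default behaviour of `train_test_split`, which holds out 25% of the rows when neither `test_size` nor `train_size` is given. A quick check of the arithmetic, assuming the test size is rounded up (the variable names are only illustrative):

```python
import math

n_rows = 195450 + 65151      # total number of labelled buildings (train + test split)
test_fraction = 0.25         # default hold-out share when no size is passed

n_test = math.ceil(n_rows * test_fraction)   # the test size is rounded up
n_train = n_rows - n_test                    # the remainder is used for training

print(n_train, n_test)
```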


In [None]:
def train_classify_dt_bagging():
  # parametrization of classifier
  dt_model = DecisionTreeClassifier()
  bagging_model = BaggingClassifier(dt_model, n_estimators=100, max_samples=0.8, random_state=1)

  # training with dataset
  bagging_model.fit(X_train, y_train)

  # predicting the labels of the test split
  y_pred_bagging = bagging_model.predict(X_test)

  # F1-score
  f1 = f1_score(y_test, y_pred_bagging, average='micro')
  print('F1 score: ' + '{:10.4f}'.format(f1))

# execute and time
elapsed_time = timeit.timeit(stmt=train_classify_dt_bagging, number=1)
print('Elapsed time: ' + '{:10.2f}'.format(elapsed_time) + " seconds.")

  y = column_or_1d(y, warn=True)


F1 score:     0.7091
Elapsed time:         69 seconds.
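
All scores in this notebook are computed with `average='micro'`. For a single-label multiclass problem, the micro-averaged F1 score equals plain accuracy, because every misclassification counts once as a false positive and once as a false negative. A minimal pure-Python sketch with made-up labels illustrates this (the helper `micro_f1` is hypothetical and not part of scikit-learn):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# made-up damage grades, for illustration only
y_true = [1, 2, 3, 2, 2, 3, 1, 2]
y_pred = [1, 2, 2, 2, 3, 3, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = micro_f1(y_true, y_pred)
print(accuracy, f1)
```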


## 3.2 Classification using `RandomForestClassifier`

In [None]:
def train_classify_rf():
  # parametrization of classifier
  rf_model = RandomForestClassifier(n_estimators=100, random_state=0)

  # training with dataset
  rf_model.fit(X_train, y_train)

  # predicting the labels of the test split
  y_pred_rf = rf_model.predict(X_test)

  # F1-score
  f1 = f1_score(y_test, y_pred_rf, average='micro')
  print('F1 score: ' + '{:10.4f}'.format(f1))

# execute and time
elapsed_time = timeit.timeit(stmt=train_classify_rf, number=1)
print('Elapsed time: ' + '{:10.2f}'.format(elapsed_time) + " seconds.")

  rf_model.fit(X_train, y_train)


F1 score:     0.7020
Elapsed time:      36.12 seconds.


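The results of this simple approach are encouraging, considering that no hyperparameters were tuned: both ensembles score higher than the decision tree (DT) from the Baseline, and only KNN remains ahead. Note that the two new scores were measured on the local test split and have not been submitted to DrivenData yet.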

| **Model used** | **Score obtained** | **Ranking** |
|:--------------:|:-------------------:|:-----------:|
|   BernoulliNB  |        0.5665       |     1561    |
|   KNN  |        0.7215       |     995    |
|   DT  |        0.6910       |     1275    |
|   DT + Bagging |        0.7091       |     not tested    |
|   Random Forests  |        0.7020       |     not tested    |

Let's try to increase the number of estimators in the classifier from `100` to `1000`. 

In [None]:
def train_classify_rf_mod1():
  # parametrization of classifier
  rf_model = RandomForestClassifier(n_estimators=1000, random_state=0)

  # training with dataset
  rf_model.fit(X_train, y_train)

  # predicting the labels of the test split
  y_pred_rf = rf_model.predict(X_test)

  # F1-score
  f1 = f1_score(y_test, y_pred_rf, average='micro')
  print('F1 score: ' + '{:10.4f}'.format(f1))

# execute and time
elapsed_time = timeit.timeit(stmt=train_classify_rf_mod1, number=1)
print('Elapsed time: ' + '{:10.2f}'.format(elapsed_time) + " seconds.")

  rf_model.fit(X_train, y_train)


F1 score:     0.7046
Elapsed time:     336.47 seconds.


The score increases by around 0.37% to 0.7046. This is a noticeable gain, keeping in mind that the best score on DrivenData is 0.7558, which would correspond to an improvement of 7.66% over our 0.7020.
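
The percentages are relative gains over the score of the 100-tree forest. The short sketch below recomputes them from the numbers printed above and also relates the two measured runtimes; `relative_gain` is an illustrative helper, not part of any library.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


rf_100_trees = 0.7020          # F1 with n_estimators=100
rf_1000_trees = 0.7046         # F1 with n_estimators=1000
best_on_leaderboard = 0.7558   # best DrivenData score quoted in the text

gain_more_trees = relative_gain(rf_100_trees, rf_1000_trees)
gain_to_best = relative_gain(rf_100_trees, best_on_leaderboard)

# runtime cost of the additional trees, from the two timings above
runtime_factor = 336.47 / 36.12

print(f"{gain_more_trees:.2f}% {gain_to_best:.2f}% {runtime_factor:.1f}x")
```

The gain from ten times as many trees is small compared with the increase in runtime, which is why the number of trees has to be traded off against the time budget.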

In further tuning steps, `n_estimators` should be as high as possible while keeping the runtime acceptable.

# 4. Automated Hyperparameter Tuning
Instead of figuring out the best parameters manually, we'll use `RandomizedSearchCV` as well as `GridSearchCV` to automate this process. To do this, we first specify candidate values for each hyperparameter. Then, a model is trained and evaluated for different combinations of these values.






## 4.1 Randomized Search

Randomized search is a technique where a range of values is defined for each hyperparameter, and a random combination of these values is selected and used to train and evaluate the model. This process is repeated multiple times, with a different combination being selected each time. Randomized search is **less exhaustive than grid search**, but it can still be effective for finding good hyperparameter values for a model.
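
The sampling step can be illustrated without scikit-learn. The sketch below enumerates a small grid with `itertools.product` and draws a few distinct combinations from it at random, which is conceptually what a randomized search does when every hyperparameter is given as a list of values. The grid values and the name `sample_candidates` are illustrative assumptions, and the code is a stand-in rather than the library's own sampler.

```python
import itertools
import random


def sample_candidates(param_lists, n_iter, seed=0):
    """Draw n_iter distinct hyperparameter combinations at random."""
    names = sorted(param_lists)
    all_combinations = [
        dict(zip(names, values))
        for values in itertools.product(*(param_lists[name] for name in names))
    ]
    rng = random.Random(seed)   # fixed seed for a reproducible selection
    return rng.sample(all_combinations, n_iter)


# illustrative grid: 3 * 3 * 2 = 18 possible combinations
grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": [8, 16, 32],
    "min_samples_leaf": [1, 8],
}

candidates = sample_candidates(grid, n_iter=5)
for candidate in candidates:
    print(candidate)
```

Only 5 of the 18 combinations are evaluated here; the real search would train and cross-validate one model per sampled combination.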

Let's look at the parameters of a classifier that was initialized with `n_estimators` and `random_state` only.

In [None]:
# Print out the parameters of the (not yet trained) classifier
print(RandomForestClassifier(n_estimators=1000, random_state=0).__dict__)

{'base_estimator': DecisionTreeClassifier(), 'n_estimators': 1000, 'estimator_params': ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'random_state', 'ccp_alpha'), 'bootstrap': True, 'oob_score': False, 'n_jobs': None, 'random_state': 0, 'verbose': 0, 'warm_start': False, 'class_weight': None, 'max_samples': None, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'ccp_alpha': 0.0}


We try to define a couple of values and adjust them later based on the outcome of each try. This cycle of parameter adjustment - training - evaluation will be repeated four times.

In cycle four, the available parameter values had been narrowed down so far that all combinations could be tested exhaustively. An exhaustive search should normally be done using GridSearch, so in the future, Randomized Search should be reserved for finding rough parameter ranges. Also, parameters found by Randomized Search will from here on not be used as final values, only as preparation for GridSearch.

| **Cycle** | **n_iter, cv** | **#fits** | **Best parameters** | **Best F1** | **Execution time** | **Insights for next cycle** |
|:--------------:|:-------------------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
|   1  |     20, 3  |    60  |    not saved  |       0.651  |        ~15min  | - only keep `n_estimators` >= 256,<br> - include higher values for `max_depth`, <br>- only keep `bootstrap` = True |
|   2  |      50,3  |     150  | {'n_estimators': 250,<br> 'min_samples_split': 2,<br> 'min_samples_leaf': 8,<br> 'max_features': 'sqrt',<br> 'max_depth': 16,<br> 'bootstrap': True}  |       0.694  |        1h23min  |  - Include even higher values for `max_depth`<br>-`min_samples_leaf` = 16 can be removed<br>-`n_iter, cv` values are fine  |  
|   3.1  |      50,3  |     150  |    not saved<br>(slightly different<br>due to slight changes<br>at input)  |       0.707  |  1h51min  |  execute training locally (way faster) |
|   3.2<br>(ran locally)  |      50,3  |     150  |    {'n_estimators': 475,<br> 'min_samples_split': 2,<br> 'min_samples_leaf': 8,<br> 'max_features': 'auto',<br> 'max_depth': 32,<br> 'bootstrap': True}  |     0.7081    |        40min  | - fix `min_samples_leaf` to 8 <br>- try `max_depth` around the range of 30-60<br>- replace `n_estimators` = 250 with 550<br>- try cv = 2 to run higher `n_iter`  |
|   4<br>(ran locally)  |      96,2  |     192  |    {'n_estimators': 475,<br> 'min_samples_split': 4,<br> 'min_samples_leaf': 8,<br> 'max_features': 'sqrt',<br> 'max_depth': 40,<br> 'bootstrap': True} |       0.704  |        ~44min  |  - ideal `max_depth` seems to be <br>somewhere between 30-40<br> |

As we can see, the parameters from cycle 3.2 yield the best score so far and seem to be quite close to the best values for the problem at hand. So, using GridSearch, we'll narrow down the parameters even further.
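
The `#fits` column in the table above is the number of sampled parameter settings multiplied by the number of cross-validation folds, since every sampled setting is trained once per fold. A small sketch that recomputes the column from the `n_iter` and `cv` values of each cycle (the dictionary layout is only illustrative):

```python
# (n_iter, cv) per cycle, copied from the table above
cycles = {
    "1": (20, 3),
    "2": (50, 3),
    "3.1": (50, 3),
    "3.2": (50, 3),
    "4": (96, 2),
}

# one fit per sampled setting and fold
fits_per_cycle = {name: n_iter * cv for name, (n_iter, cv) in cycles.items()}

for name, fits in fits_per_cycle.items():
    print(f"cycle {name}: {fits} fits")
```

Lowering `cv` from 3 to 2 in cycle 4 is what made room for almost twice as many sampled settings at a similar number of fits.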

In [None]:
# "report" utility from ML2020 example to generate report on best results
def report(results, n_top=3): 
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


print("INIT: " + datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

# deciding parameters
chosen_n_iter = 50
chosen_cv = 3

# definition of classifier without parameters
rf_randsearch_model = RandomForestClassifier()

# definition of possible parameters to test
params_to_test = {"n_estimators": [325, 400, 475, 550],  # Number of trees in random forest
                  "max_features": ['auto', 'sqrt'],  # Number of features to consider at every split ('auto' means 'sqrt' for classifiers; removed in scikit-learn 1.3)
                  "max_depth": [32, 40, 48, 56],  # Maximum number of levels in tree
                  "min_samples_split": [2, 4, 6],  # Minimum number of samples required to split a node
                  "min_samples_leaf": [8],  # Minimum number of samples required at each leaf node
                  "bootstrap": [True]  # Method of selecting samples for training each tree
                  }

# definition of RandSearch instance
randsearch_rf_model = RandomizedSearchCV(
    estimator = rf_randsearch_model, 
    param_distributions = params_to_test, 
    n_iter = chosen_n_iter,               # (!) number of parameter settings that are sampled
    cv = chosen_cv,                       # k-fold cross validation with k = chosen_cv
    random_state=0,                       # generate reproducible list of possible values
    n_jobs = -1,                          # (!) use all available processors
    scoring='f1_micro',                   # use micro-averaged F1 score as evaluation method
    verbose=1
    )

# fit classifier using hyperparameter combinations from RandSearch approach 
randsearch_rf_model.fit(X_train, np.array(y_train).ravel())

print("DONE: " + datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

# show results
report(randsearch_rf_model.cv_results_, n_top=10)
print("Best parameters: " + str(randsearch_rf_model.best_params_))
print("Best score: " + str(randsearch_rf_model.best_score_))

## 4.2 Grid Search

Grid search is a technique where a grid of hyperparameter values is defined, and the model is trained and evaluated using **all possible combinations** of these values. This can be an exhaustive and time-consuming process, but it is effective for finding the optimal hyperparameter values for a model.

Essentially, we run the same structure of code statements, but replace `RandomizedSearchCV` with `GridSearchCV`. Furthermore, since this search is exhaustive, we should include a good range of the parameters we find interesting, so that this step only has to be run once. As found out above, we know roughly which values are close to optimal, so we'll use them as reference points and interpolate between them.

In [None]:
print("INIT: " + datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

# deciding parameters
chosen_cv = 2

# definition of classifier without parameters
rf_gridsearch_model = RandomForestClassifier()

# definition of possible parameters to test
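params_to_test = {"n_estimators": [350, 450, 450, 550, 650], # Number of trees in random forest (450 is listed twice, so its combinations are fitted twice)
                  "max_features": ['auto', 'sqrt'], # Number of features to consider at every split ('auto' means 'sqrt' for classifiers; removed in scikit-learn 1.3)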
                  "max_depth": [30, 33, 36, 40], # Maximum number of levels in tree
                  "min_samples_split": [2, 4, 6], #  Minimum number of samples required to split a node
                  "min_samples_leaf": [8], # Minimum number of samples required at each leaf node
                  "bootstrap": [True] # Method of selecting samples for training each tree
                  }


# definition of GridSearch instance
gridsearch_rf_model = GridSearchCV(
    estimator = rf_gridsearch_model, 
    param_grid = params_to_test, 
    cv = chosen_cv,                       # k-fold cross validation with k = chosen_cv
    n_jobs = -1,                          # (!) use all available processors
    scoring='f1_micro',                   # use micro-averaged F1 score as evaluation method
    verbose=10
)

# fit classifier using all hyperparameter combinations from the GridSearch approach 
gridsearch_rf_model.fit(X_train, np.array(y_train).ravel())

print("DONE: " + datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

# show results
report(gridsearch_rf_model.cv_results_, n_top=10)
print("Best parameters: " + str(gridsearch_rf_model.best_params_))
print("Best score: " + str(gridsearch_rf_model.best_score_))


|  **#fits** | **Best parameters** | **Best F1** | **execution time** |
|:--------------:|:-------------------:|:-----------:|:-----------:|
|   240  |  {'bootstrap': True,<br> 'max_depth': 36,<br> 'max_features': 'auto',<br> 'min_samples_leaf': 8,<br> 'min_samples_split': 4,<br> 'n_estimators': 450}  |    0.704  |  52min | 
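
The 240 fits in the table can be reproduced by expanding the grid as a Cartesian product, here with `itertools.product` as a stand-in for the expansion that an exhaustive search performs. Note that `450` appears twice in `n_estimators`; assuming duplicated values are not removed (which is consistent with the 240 fits reported), those combinations are evaluated twice.

```python
import itertools

# grid and cv copied from the GridSearchCV cell above
param_grid = {
    "n_estimators": [350, 450, 450, 550, 650],
    "max_features": ["auto", "sqrt"],
    "max_depth": [30, 33, 36, 40],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [8],
    "bootstrap": [True],
}
cv = 2

# every element of the Cartesian product is one candidate
candidates = list(itertools.product(*param_grid.values()))
unique_candidates = set(candidates)

total_fits = len(candidates) * cv           # including the duplicated value
unique_fits = len(unique_candidates) * cv   # each combination counted once

print(total_fits, unique_fits)
```

Under that assumption, removing the duplicated entry would save the difference between the two numbers without changing which combinations are searched.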

Since cycle 4 of the Randomized Search was already exhaustive, we had technically performed a grid search there and had already found a good set of parameters. Using values in between the already-found ones unfortunately did not yield better results either.

Therefore, the best result we can find with the `RandomForestClassifier` using parameter search yields an F1 score of `0.7081` using the following hyperparameters.

- `n_estimators`: 475
- `min_samples_split`: 2
- `min_samples_leaf`: 8
- `max_features`: auto
- `max_depth`: 32
- `bootstrap`: True

## 4.3 Preparing the submission

Just like with the baseline algorithms, we want to produce an entry file for the competition. This gives us a result that is comparable to the baseline `DecisionTreeClassifier` and also indicates the degree of overfitting on the training data.

The steps necessary to do this are very similar to the steps explained in the Baseline document and are therefore not explained further.

In [None]:
# produce final RandomForestClassifier instance with the found parameters
rf_model =  RandomForestClassifier(
              n_estimators=475, 
              min_samples_split=2, 
              min_samples_leaf=8, 
              max_features='auto', 
              max_depth=32, 
              bootstrap=True)

# training with dataset
rf_model.fit(X_train, np.array(y_train).ravel())

RandomForestClassifier(max_depth=32, min_samples_leaf=8, n_estimators=475)

In [None]:
# obtain the predictions
rf_predictions = rf_model.predict(df_test_values_subset_norm)


# create the submission file
rf_submission = pd.DataFrame(data=rf_predictions,
                             columns=df_submission_format.columns, # only one column: 'damage_grade' 
                             index=df_submission_format.index)
rf_submission.to_csv('rf_submission_baseline.csv')

The score returned by DrivenData is `0.7123`, which exceeds the F1 score achieved in this notebook, and is ranked 1155 (more people have entered the competition since our last submission).

In comparison to all of our results, we can see the following progress. 

Pay attention to the "Comparison w/ best possible result" column, as the improvements are otherwise easy to dismiss as minor. The best F1-Score in the competition as of 12/12/22 is `0.7558`.

| Technique used | F1-Score | Comparison w/ best possible result |
|:--------------:|:-------------------:|:-------------------:|
| DT without Tuning | 0.6545 | 86.59% |
| DT w/ manual Tuning | 0.6910 | 91.42% |
| DT + Bagging | 0.7091 | 93.82% |
| RFs without Tuning | 0.7020 | 92.88% |
| RFs w/ tuned Hyperparameters | 0.7123 | 94.24% |
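
The last column expresses each score as a share of the best leaderboard score. A minimal sketch that recomputes it from the F1 scores in the table (the dictionary is only an illustrative container; the printed values are rounded, so they can differ from the table in the last digit):

```python
best_score = 0.7558  # best F1 score in the competition as of 12/12/22

scores = {
    "DT without Tuning": 0.6545,
    "DT w/ manual Tuning": 0.6910,
    "DT + Bagging": 0.7091,
    "RFs without Tuning": 0.7020,
    "RFs w/ tuned Hyperparameters": 0.7123,
}

# share of the best possible result, in percent
share_of_best = {name: score / best_score * 100 for name, score in scores.items()}

for name, share in share_of_best.items():
    print(f"{name:30s} {share:6.2f}%")
```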