# 1. Introduction

In this Google Colab notebook, we apply what we have learned about supervised learning algorithms. The task addressed in this milestone is predicting the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal. Further information on the task is available on the **drivendata.org** competition page: "[Richter's Predictor: Modeling Earthquake Damage](https://www.drivendata.org/competitions/57/nepal-earthquake/)".

The authors of this project are:

- [Raúl Barba Rojas](mailto:Raul.Barba@alu.uclm.es)
- [Diego Guerrero Del Pozo](mailto:Diego.Guerrero@alu.uclm.es)
- [Marvin Schmidt](mailto:Marvin.Schmidt@alu.uclm.es)

# 2. Preparations

## 2.1 Importing libraries

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## 2.2 Importing training data

All the datasets from the DrivenData competition can be accessed in this GitHub repository.

In this section, we simply load the three different datasets as pandas dataframes, so that we can work with them to achieve the desired results.

---

There are two different csv files related to the training dataset:

1. `train_values.csv`: this file contains the values of the different features with which the training will be performed.
2. `train_labels.csv`: this file contains the labels for the output feature that we are trying to predict, which is called `damage_grade`.

Thus, we first need to download the datasets from the GitHub repository and load them as dataframes:

In [None]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
df_train_values = pd.read_csv("train_values.csv", index_col="building_id")
df_train_values

| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 25 | 1335 | 1621 | 1 | 55 | 6 | 3 | n | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 669485 | 17 | 715 | 2060 | 2 | 0 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 602512 | 17 | 51 | 8163 | 3 | 55 | 6 | 7 | t | r | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 151409 | 26 | 39 | 1851 | 2 | 10 | 14 | 6 | t | r | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [None]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
df_train_labels = pd.read_csv("train_labels.csv", index_col = "building_id")
df_train_labels

| building_id | damage_grade |
|---|---|
| 802906 | 3 |
| 28830 | 2 |
| 94947 | 3 |
| 590882 | 2 |
| 201944 | 3 |
| ... | ... |
| 688636 | 2 |
| 669485 | 3 |
| 602512 | 3 |
| 151409 | 2 |


Once we have loaded both datasets we need to join them, obtaining the complete training dataset:

In [None]:
df_train_values.join(df_train_labels).to_csv("train_full.csv")
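
To illustrate what the join does, here is a minimal sketch on two toy dataframes that share a `building_id` index. The values and the names `toy_values`, `toy_labels` and `toy_full` are made up for illustration; only the `join` call mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (values are made up).
toy_values = pd.DataFrame(
    {"age": [30, 10, 55], "count_floors_pre_eq": [2, 2, 1]},
    index=pd.Index([802906, 28830, 688636], name="building_id"),
)
toy_labels = pd.DataFrame(
    {"damage_grade": [2, 3, 2]},
    index=pd.Index([28830, 802906, 688636], name="building_id"),
)

# join() aligns rows on the index, so the row order of the labels does not matter.
toy_full = toy_values.join(toy_labels)
print(toy_full)
```

Because the alignment happens on `building_id`, each building receives its own label even though the two toy frames list the buildings in a different order.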

## 2.3 Importing testing data

To evaluate our findings, we also need the testing data and the template for the submission file. These datasets can also be accessed from this GitHub repository.

1. `test_values.csv`: this file contains the values of the different features with which the testing will be performed.
2. `submission_format.csv`: this file contains "empty" labels for all the buildings we're trying to predict the damage grade for. It's a template file to be modified later, in which every label for ``damage_grade`` is ``1``.
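
As a sketch of how such a template can be filled in later, predictions can be written into the `damage_grade` column and exported as CSV. The toy template, the building ids and the predicted values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy template in the same shape as submission_format.csv (ids are made up).
toy_submission = pd.DataFrame(
    {"damage_grade": [1, 1, 1]},
    index=pd.Index([300051, 99355, 890251], name="building_id"),
)

# Placeholder predictions, one per row and in the same row order as the template.
toy_predictions = np.array([3, 2, 2])

# Overwrite the placeholder labels and render the result as CSV text.
toy_submission["damage_grade"] = toy_predictions
csv_text = toy_submission.to_csv()
print(csv_text)
```

Passing a file name to `to_csv` would write the same content to disk instead of returning it as a string.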

In [None]:
from sklearn.preprocessing import StandardScaler

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
test_values = pd.read_csv('test_values.csv', index_col='building_id')
test_values = pd.get_dummies(test_values)

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/submission_format.csv
submission_format = pd.read_csv('submission_format.csv', index_col='building_id')


# 3. Model implementation

## 3.1 LightGBM

The first step is to choose the features. We use the features selected by our earlier decision-tree analysis, which have given the best results so far.

In [None]:
df_train_values_subset = pd.get_dummies(df_train_values)

selected_features = ['age',
                         'area_percentage',
                         'height_percentage',
                         'geo_level_1_id',
                         'geo_level_2_id',
                         'geo_level_3_id',
                         'has_superstructure_adobe_mud',
                         'has_superstructure_mud_mortar_stone',
                         'has_superstructure_stone_flag',
                         'has_superstructure_cement_mortar_stone',
                         'has_superstructure_mud_mortar_brick',
                         'has_superstructure_cement_mortar_brick',
                         'has_superstructure_timber',
                         'has_superstructure_bamboo',
                         'has_superstructure_rc_non_engineered',
                         'has_superstructure_rc_engineered',
                         'has_superstructure_other',
                         'foundation_type_r',
                         'ground_floor_type_v',
                         'other_floor_type_q']

df_train_values_subset = df_train_values_subset[selected_features]
df_train_values_subset

| building_id | age | area_percentage | height_percentage | geo_level_1_id | geo_level_2_id | geo_level_3_id | has_superstructure_adobe_mud | has_superstructure_mud_mortar_stone | has_superstructure_stone_flag | has_superstructure_cement_mortar_stone | has_superstructure_mud_mortar_brick | has_superstructure_cement_mortar_brick | has_superstructure_timber | has_superstructure_bamboo | has_superstructure_rc_non_engineered | has_superstructure_rc_engineered | has_superstructure_other | foundation_type_r | ground_floor_type_v | other_floor_type_q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 30 | 6 | 5 | 6 | 487 | 12198 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 28830 | 10 | 8 | 7 | 8 | 900 | 2812 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 94947 | 10 | 5 | 5 | 21 | 363 | 8973 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 590882 | 10 | 6 | 5 | 22 | 418 | 10694 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 201944 | 30 | 8 | 9 | 11 | 131 | 1488 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 55 | 6 | 3 | 25 | 1335 | 1621 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 669485 | 0 | 6 | 5 | 17 | 715 | 2060 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 602512 | 55 | 6 | 7 | 17 | 51 | 8163 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 151409 | 10 | 14 | 6 | 26 | 39 | 1851 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
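
The decision-tree analysis that produced this feature list is not part of this section. As a rough sketch of how such a ranking can be obtained, the example below fits a scikit-learn `DecisionTreeClassifier` on synthetic data and sorts the columns by `feature_importances_`. The data, the column names `informative` and `noise`, and the `max_depth` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: only `informative` determines the label, `noise` does not.
X_toy = pd.DataFrame({
    "informative": rng.integers(0, 10, size=500),
    "noise": rng.integers(0, 10, size=500),
})
y_toy = (X_toy["informative"] > 4).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Rank the features by impurity-based importance, highest first.
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
ranking = importances.sort_values(ascending=False)
print(ranking)
```

Features near the top of such a ranking are natural candidates for a reduced feature set like `selected_features`.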


We need to normalize the non-binary features to the [0, 1] range (min-max scaling):

In [None]:
# Min-max scale the non-binary features to the [0, 1] range.
numeric_features = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
                    'age', 'area_percentage', 'height_percentage']

# Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
df_train_values_subset = df_train_values_subset.copy()
train_min = df_train_values_subset[numeric_features].min()
train_range = df_train_values_subset[numeric_features].max() - train_min

df_train_values_subset[numeric_features] = (df_train_values_subset[numeric_features] - train_min) / train_range

# The test set reuses the training minimum and range, so that a given raw
# value is mapped to the same scaled value in both datasets.
test_values[numeric_features] = (test_values[numeric_features] - train_min) / train_range
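
The same min-max scaling can also be expressed with scikit-learn's `MinMaxScaler`. The sketch below uses made-up numbers; it fits the scaler on a training column only and then reuses it for a test column, so that both share one scale. This is an alternative formulation under those assumptions, not the code used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values for a single numeric feature.
toy_train = pd.DataFrame({"age": [0, 10, 30, 55]})
toy_test = pd.DataFrame({"age": [5, 55, 110]})

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training data only.
train_scaled = scaler.fit_transform(toy_train)
# Apply the training minimum and maximum to the test data; values outside
# the training range end up outside [0, 1].
test_scaled = scaler.transform(toy_test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Since gradient-boosted trees split on thresholds, what matters most is that training and test data go through the same transformation.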

Then we split the labelled data into a training part and a held-out part for evaluation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_train_values_subset, df_train_labels.damage_grade, random_state=1)
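
By default, `train_test_split` holds out 25% of the rows. If the class balance of `damage_grade` should be preserved in both parts, the `stratify` argument can be used. The sketch below shows this on made-up labels; stratification is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 20% grade 1, 50% grade 2, 30% grade 3.
y_toy = np.array([1] * 20 + [2] * 50 + [3] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# Both parts keep roughly the original class proportions.
print(np.bincount(y_tr)[1:] / len(y_tr))
print(np.bincount(y_val)[1:] / len(y_val))
```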

In our case, it is also necessary to install the `lightgbm` library.

In [None]:
!pip install lightgbm

We also need a custom evaluation metric, the micro-averaged F1 score, which is implemented in the function below.

In [None]:
from sklearn.metrics import f1_score

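def evaluate_microF1_lgb(truth, predictions):
    # This follows the discussion in https://github.com/Microsoft/LightGBM/issues/1483
    n_classes = len(np.unique(truth))
    if predictions.ndim == 1:
        # Flat, class-major scores: reshape to (n_samples, n_classes).
        predictions = predictions.reshape(n_classes, -1).T
    pred_labels = predictions.argmax(axis=1)
    f1 = f1_score(truth, pred_labels, average='micro')
    return ('f1', f1, True)  # (metric name, value, is_higher_better)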
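
For single-label multiclass problems, the micro-averaged F1 score equals plain accuracy. The sketch below checks this on made-up labels and also shows how a flat, class-major score array is turned into predicted classes, in the spirit of the function above. The score values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Encoded labels 0, 1, 2 for four samples (made-up).
truth = np.array([0, 1, 2, 1])

# Flat, class-major scores: all class-0 scores first, then class 1, then class 2.
flat_scores = np.array([
    0.7, 0.2, 0.1, 0.3,   # class 0
    0.2, 0.5, 0.3, 0.3,   # class 1
    0.1, 0.3, 0.6, 0.4,   # class 2
])

# One row per class, one column per sample; argmax over rows picks the class.
pred_labels = flat_scores.reshape(len(np.unique(truth)), -1).argmax(axis=0)

micro_f1 = f1_score(truth, pred_labels, average="micro")
accuracy = accuracy_score(truth, pred_labels)
print(pred_labels, micro_f1, accuracy)
```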

Now that everything is ready, we can fit a first model with default hyperparameters as a preliminary evaluation:

In [None]:
import lightgbm as lgb

model = lgb.LGBMClassifier(random_state = 0, n_jobs = -1, objective='multiclass')

model.fit(X_train,Y_train,eval_set=[(X_test,Y_test),(X_train,Y_train)], verbose=20, eval_metric = evaluate_microF1_lgb)

print('Testing accuracy {:.4f}'.format(model.score(X_test,Y_test)))

[20]	training's multi_logloss: 0.711064	training's f1: 0.681448	valid_0's multi_logloss: 0.70727	valid_0's f1: 0.684917
[40]	training's multi_logloss: 0.680316	training's f1: 0.692029	valid_0's multi_logloss: 0.677525	valid_0's f1: 0.693128
[60]	training's multi_logloss: 0.664404	training's f1: 0.699289	valid_0's multi_logloss: 0.663286	valid_0's f1: 0.699636
[80]	training's multi_logloss: 0.653268	training's f1: 0.704533	valid_0's multi_logloss: 0.654048	valid_0's f1: 0.704333
[100]	training's multi_logloss: 0.644691	training's f1: 0.708749	valid_0's multi_logloss: 0.647433	valid_0's f1: 0.706113
Testing accuracy 0.7061


For hyperparameter tuning, we define a range or a list of candidate values for each parameter, so that the random search can try many different combinations.

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
param_test ={'num_leaves': sp_randint(12, 20), 
             'n_estimators' : sp_randint(64, 4096),
             'min_child_samples': sp_randint(40, 100), 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': sp_uniform(loc=0.75, scale=0.25), 
             'colsample_bytree': sp_uniform(loc=0.8, scale=0.15),
             'reg_alpha': [0, 1e-3, 1e-1, 1, 10, 50, 100],
             'reg_lambda': [0, 1e-3, 1e-1, 1, 10, 50, 100],
             'learning_rate' : sp_uniform(0.1, 0.9)
            }
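
Note how SciPy parameterizes these distributions: `randint(low, high)` draws integers from `low` up to `high - 1`, and `uniform(loc, scale)` draws from the interval `[loc, loc + scale]`. So `sp_uniform(0.1, 0.9)` covers learning rates between 0.1 and 1.0. The sketch below samples from two of the distributions above to make the ranges visible; the sample size and the seed are arbitrary choices.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Same distributions as in param_test above.
num_leaves_dist = sp_randint(12, 20)        # integers 12..19
learning_rate_dist = sp_uniform(0.1, 0.9)   # floats in [0.1, 1.0]

leaves = num_leaves_dist.rvs(size=1000, random_state=0)
rates = learning_rate_dist.rvs(size=1000, random_state=0)

print(leaves.min(), leaves.max())
print(rates.min(), rates.max())
```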

In [None]:
def report(results, n_top=3): 
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
from sklearn.model_selection import RandomizedSearchCV

model = lgb.LGBMClassifier(random_state = 0, n_jobs = -1)

rs = RandomizedSearchCV(
    estimator = model, 
    param_distributions=param_test, 
    n_iter= 100,
    scoring='f1_micro',
    cv=5,
    random_state = 0,
    error_score = 'raise',
    verbose=True)

random_search_models = rs.fit(X_train,Y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
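
Once a search has finished, the fitted `RandomizedSearchCV` object exposes the best configuration through `best_params_` and `best_score_`, and with the default `refit=True` it also retrains the best model as `best_estimator_`. The sketch below demonstrates this on synthetic data with a small decision tree as a stand-in estimator, so that it runs in seconds. The data, the stand-in model and the tiny search space are assumptions, not the setup used above.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the earthquake data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

toy_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": sp_randint(2, 8),
                         "min_samples_leaf": sp_randint(1, 20)},
    n_iter=5,
    scoring="f1_micro",
    cv=3,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
# With the default refit=True, best_estimator_ is already trained.
toy_predictions = toy_search.best_estimator_.predict(X_toy[:5])
print(toy_predictions)
```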


## 3.2 Pre-evaluation

Finally, we can print the best configurations found by the random search:

In [None]:
report(random_search_models.cv_results_, n_top = 5)

Model with rank: 1
Mean validation score: 0.736 (std: 0.002)
Parameters: {'colsample_bytree': 0.93555796096189, 'learning_rate': 0.1750801918978167, 'min_child_samples': 70, 'min_child_weight': 1e-05, 'n_estimators': 1754, 'num_leaves': 16, 'reg_alpha': 1, 'reg_lambda': 0.1, 'subsample': 0.8217287608251511}

Model with rank: 2
Mean validation score: 0.736 (std: 0.002)
Parameters: {'colsample_bytree': 0.8324825531636558, 'learning_rate': 0.22169635606490684, 'min_child_samples': 76, 'min_child_weight': 10.0, 'n_estimators': 1989, 'num_leaves': 18, 'reg_alpha': 0, 'reg_lambda': 10, 'subsample': 0.8466222452814655}

Model with rank: 3
Mean validation score: 0.736 (std: 0.001)
Parameters: {'colsample_bytree': 0.8689783825634012, 'learning_rate': 0.14015107112870268, 'min_child_samples': 50, 'min_child_weight': 0.1, 'n_estimators': 2850, 'num_leaves': 15, 'reg_alpha': 0, 'reg_lambda': 1, 'subsample': 0.8721025149850975}

Model with rank: 4
Mean validation score: 0.735 (std: 0.002)
Parameter

Since the best mean validation F1 score (about 0.736) is not good enough for our purposes, we decided to try alternatives that may lead to better results, such as stacked models.