# 1. Introduction

In this Google Colab notebook, we apply the supervised learning techniques we have covered so far. The task addressed in this milestone is predicting the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal. Further information on the task is available on the **drivendata.org** competition page: "[Richter's Predictor: Modeling Earthquake Damage](https://www.drivendata.org/competitions/57/nepal-earthquake/)".

The authors of this project are:

- [Raúl Barba Rojas](mailto:Raul.Barba@alu.uclm.es)
- [Diego Guerrero Del Pozo](mailto:Diego.Guerrero@alu.uclm.es)
- [Marvin Schmidt](mailto:Marvin.Schmidt@alu.uclm.es)

# 2. Preparations

## 2.1 Importing libraries

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## 2.2 Importing training data

All the datasets from the DrivenData competition can be accessed in [this GitHub repository](https://github.com/alan-flint/Richter-DrivenData).

In this section, we load the datasets as pandas DataFrames so that we can work with them.

---

There are two CSV files for the training dataset:

1. `train_values.csv`: this file contains the values of the features used for training.
2. `train_labels.csv`: this file contains the labels of the output feature we are trying to predict, which is called `damage_grade`.

We first download both files from the GitHub repository and load them as DataFrames:

In [2]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
df_train_values= pd.read_csv("train_values.csv", index_col = "building_id")
df_train_values


building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
802906,6,487,12198,2,30,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
28830,8,900,2812,2,10,8,7,o,r,n,...,0,0,0,0,0,0,0,0,0,0
94947,21,363,8973,2,10,5,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
590882,22,418,10694,2,10,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
201944,11,131,1488,3,30,8,9,t,r,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
688636,25,1335,1621,1,55,6,3,n,r,n,...,0,0,0,0,0,0,0,0,0,0
669485,17,715,2060,2,0,6,5,t,r,n,...,0,0,0,0,0,0,0,0,0,0
602512,17,51,8163,3,55,6,7,t,r,q,...,0,0,0,0,0,0,0,0,0,0
151409,26,39,1851,2,10,14,6,t,r,x,...,0,0,0,0,0,0,0,0,0,0


In [3]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
df_train_labels = pd.read_csv("train_labels.csv", index_col = "building_id")
df_train_labels


building_id,damage_grade
802906,3
28830,2
94947,3
590882,2
201944,3
...,...
688636,2
669485,3
602512,3
151409,2


Once both datasets are loaded, we join them on their shared `building_id` index to obtain the complete training dataset, which we save as `train_full.csv`:

In [4]:
df_train_values.join(df_train_labels).to_csv("train_full.csv")
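
`DataFrame.join` matches rows by index label, so the two frames do not need to be in the same row order. A minimal sketch with toy values (the ages and grades below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two training files (values are made up)
values = pd.DataFrame(
    {"age": [30, 10, 55]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)
labels = pd.DataFrame(
    {"damage_grade": [2, 3, 3]},
    index=pd.Index([28830, 802906, 94947], name="building_id"),  # different row order
)

# join matches rows by index label, not by position
full = values.join(labels)
print(full)
```

The result keeps the row order of the left frame and attaches each label to the building with the same id.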

## 2.3 Importing testing data

To evaluate our findings, we also need the testing data and the template for the submission file. Both can be downloaded from the same GitHub repository.

1. `test_values.csv`: this file contains the values of the features used for testing.
2. `submission_format.csv`: this file contains placeholder labels for all the buildings whose damage grade we are trying to predict. It is a template to be filled in later, in which every `damage_grade` label is `1`.

In [5]:
from sklearn.preprocessing import StandardScaler

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
test_values = pd.read_csv('test_values.csv', index_col='building_id')
test_values = pd.get_dummies(test_values)

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/submission_format.csv
submission_format = pd.read_csv('submission_format.csv', index_col='building_id')

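
`pd.get_dummies` is applied to the test frame on its own here and to the training frame separately later on, so a category that is missing from one of them would produce a different set of columns. One way to guard against this, sketched on toy data with a hypothetical helper `align_columns` (not part of this notebook), is to reindex the test frame to the training columns:

```python
import pandas as pd

def align_columns(train_df, test_df):
    """Hypothetical helper: give test_df exactly the columns of train_df."""
    # Columns missing from the test frame are filled with 0; extra ones are dropped
    return test_df.reindex(columns=train_df.columns, fill_value=0)

# Toy frames: the category 'x' appears only in the training data
train = pd.get_dummies(pd.DataFrame({"roof_type": ["n", "q", "x"]}))
test = pd.get_dummies(pd.DataFrame({"roof_type": ["n", "n", "q"]}))

test_aligned = align_columns(train, test)
print(list(test_aligned.columns))
```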

# 3. Model implementation

## 3.1. Basic Stacked Model

To implement a basic stacked model from the baseline models, we first select a set of features. In our case, we reuse the features obtained with decision trees.

In [6]:
df_train_values_subset = pd.get_dummies(df_train_values)

selected_features = ['age',
                         'area_percentage',
                         'height_percentage',
                         'geo_level_1_id',
                         'geo_level_2_id',
                         'geo_level_3_id',
                         'has_superstructure_adobe_mud',
                         'has_superstructure_mud_mortar_stone',
                         'has_superstructure_stone_flag',
                         'has_superstructure_cement_mortar_stone',
                         'has_superstructure_mud_mortar_brick',
                         'has_superstructure_cement_mortar_brick',
                         'has_superstructure_timber',
                         'has_superstructure_bamboo',
                         'has_superstructure_rc_non_engineered',
                         'has_superstructure_rc_engineered',
                         'has_superstructure_other',
                         'foundation_type_r',
                         'ground_floor_type_v',
                         'other_floor_type_q']

df_train_values_subset = df_train_values_subset[selected_features]
df_train_values_subset

building_id,age,area_percentage,height_percentage,geo_level_1_id,geo_level_2_id,geo_level_3_id,has_superstructure_adobe_mud,has_superstructure_mud_mortar_stone,has_superstructure_stone_flag,has_superstructure_cement_mortar_stone,has_superstructure_mud_mortar_brick,has_superstructure_cement_mortar_brick,has_superstructure_timber,has_superstructure_bamboo,has_superstructure_rc_non_engineered,has_superstructure_rc_engineered,has_superstructure_other,foundation_type_r,ground_floor_type_v,other_floor_type_q
802906,30,6,5,6,487,12198,1,1,0,0,0,0,0,0,0,0,0,1,0,1
28830,10,8,7,8,900,2812,0,1,0,0,0,0,0,0,0,0,0,1,0,1
94947,10,5,5,21,363,8973,0,1,0,0,0,0,0,0,0,0,0,1,0,0
590882,10,6,5,22,418,10694,0,1,0,0,0,0,1,1,0,0,0,1,0,0
201944,30,8,9,11,131,1488,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
688636,55,6,3,25,1335,1621,0,1,0,0,0,0,0,0,0,0,0,1,0,0
669485,0,6,5,17,715,2060,0,1,0,0,0,0,0,0,0,0,0,1,0,1
602512,55,6,7,17,51,8163,0,1,0,0,0,0,0,0,0,0,0,1,0,1
151409,10,14,6,26,39,1851,0,0,0,0,0,1,0,0,0,0,0,1,1,0


Next, we apply min-max normalization to the numerical features that need it: `age`, `area_percentage` and `height_percentage`.

In [7]:
# Min-max scale each numerical column to [0, 1], using each dataset's own minimum and maximum
numerical_features = ['age', 'area_percentage', 'height_percentage']

for df in (df_train_values_subset, test_values):
    for col in numerical_features:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
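
The cell above scales the training and test frames with their own minima and maxima, so the same raw value can map to different scaled values in the two sets. A common alternative, sketched here on toy data and not what this notebook does, is to compute the statistics on the training data only and reuse them for the test data. `minmax_fit` and `minmax_apply` are hypothetical helper names:

```python
import pandas as pd

def minmax_fit(df, columns):
    """Return the per-column (min, max) learned from the training frame."""
    return {col: (df[col].min(), df[col].max()) for col in columns}

def minmax_apply(df, stats):
    """Scale the given columns with previously learned (min, max) pairs."""
    scaled = df.copy()
    for col, (lo, hi) in stats.items():
        scaled[col] = (df[col] - lo) / (hi - lo)
    return scaled

# Toy data (made up): the test frame has a narrower range than the training frame
train = pd.DataFrame({"age": [0, 10, 50]})
test = pd.DataFrame({"age": [10, 25]})

stats = minmax_fit(train, ["age"])
train_scaled = minmax_apply(train, stats)
test_scaled = minmax_apply(test, stats)   # reuses the training min and max
print(test_scaled)
```

With this approach an `age` of 10 is scaled to the same value in both frames, at the cost that test values outside the training range fall outside [0, 1].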

Then we split the dataset into training and test sets. Since no `test_size` is given, `train_test_split` holds out 25% of the rows by default.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_train_values_subset, df_train_labels.damage_grade, random_state=1)
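
If the damage grades are unevenly distributed, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The split above does not stratify; the following is only a sketch on synthetic labels with a made-up 60/30/10 distribution:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 60/30/10 class distribution (made up for illustration)
y = np.array([2] * 240 + [3] * 120 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Share of each class in the held-out split
classes, counts = np.unique(y_te, return_counts=True)
shares = dict(zip(classes.tolist(), (counts / len(y_te)).tolist()))
print(shares)
```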

Now, let us create a stacked model. First, we need to define the baseline models it is built from. Let's begin with the XGBoost model:

In [9]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

xgb_model = XGBClassifier(
    random_state = 0,
    subsample = 0.7,
    reg_lambda =  1.5,
    n_estimators = 475,
    max_depth = 9,
    learning_rate = 0.1,
    gamma = 0.1,
    colsample_bytree = 0.8
)

xgb_model.fit(X_train, Y_train)

Y_pred = xgb_model.predict(X_test)    # obtain the test predictions

# F1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

F1 score:     0.7464
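
For a single-label multiclass problem such as this one, the micro-averaged F1 score equals plain accuracy, because every misclassification counts once as a false positive and once as a false negative. A small pure-Python check on made-up labels (`micro_f1` is a hypothetical helper, not the scikit-learn implementation):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1          # predicted this class and it was correct
            elif p == label and t != label:
                fp += 1          # predicted this class but it was another one
            elif p != label and t == label:
                fn += 1          # missed a member of this class
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels: 5 of the 8 predictions are correct
y_true = [1, 2, 2, 3, 3, 3, 2, 1]
y_pred = [1, 2, 3, 3, 2, 3, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```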


Next comes the KNN model:

In [10]:
df_train_values_knn_after_dt_normalized = df_train_values_subset.join(df_train_labels)
train_knn_after_dt_normalized, test_knn_after_dt_normalized = train_test_split(df_train_values_knn_after_dt_normalized, test_size = 0.33)

In [11]:
from sklearn import neighbors

# Constructor
n_neighbors = 32
weights = 'distance'
knn_model_after_dt = neighbors.KNeighborsClassifier(n_neighbors = n_neighbors, weights = weights, p = 1)

# Fitting and predicting
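# Pass the labels as a 1-D Series rather than a DataFrame to avoid a DataConversionWarning
knn_model_after_dt.fit(X = df_train_values_knn_after_dt_normalized[selected_features], y = df_train_values_knn_after_dt_normalized['damage_grade'])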
y_pred_knn_after_dt_normalized = knn_model_after_dt.predict(X = test_knn_after_dt_normalized[selected_features])
accuracy_knn_after_dt_normalized = f1_score(test_knn_after_dt_normalized['damage_grade'], y_pred_knn_after_dt_normalized, average = 'micro')

print('Accuracy with F1 score:', accuracy_knn_after_dt_normalized)

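Accuracy with F1 score: 0.9634414353655275

This score is not comparable to the held-out scores of the other models. The KNN model is fitted on the full dataset, which includes every row of `test_knn_after_dt_normalized`, and with `weights = 'distance'` a row that is also in the fitted data is predicted from its own label.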
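
The effect of scoring a distance-weighted KNN on rows it was fitted on can be reproduced on synthetic data with random labels, where no real signal exists. The sketch below contrasts that with a fit on the training rows only (all data here is made up):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))       # synthetic features, no duplicate rows
y = rng.integers(1, 4, size=300)    # random labels 1..3, unrelated to X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Leaky: fitted on all rows, so every test row is its own zero-distance neighbour
leaky = KNeighborsClassifier(n_neighbors=32, weights="distance", p=1).fit(X, y)
leaky_f1 = f1_score(y_te, leaky.predict(X_te), average="micro")

# Held-out: fitted on the training rows only
clean = KNeighborsClassifier(n_neighbors=32, weights="distance", p=1).fit(X_tr, y_tr)
clean_f1 = f1_score(y_te, clean.predict(X_te), average="micro")

print(leaky_f1, clean_f1)
```

The leaky score is perfect even though the labels are random, which is why a KNN score should be measured on rows that were excluded from `fit`.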


Finally, the decision tree model:

In [12]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(criterion="entropy", 
                                  max_depth = 22, 
                                  #class_weight={1:5, 2:30, 3:18},
                                  min_samples_split = 60, 
                                  min_samples_leaf = 20)
dt_model.fit(X_train, Y_train)

# predicting the labels of the test split
y_pred_dt = dt_model.predict(X_test)

# F1-score
f1 = f1_score(Y_test, y_pred_dt, average='micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

F1 score:     0.7151


Now that the three models are ready, we can combine them into a stacked model as follows:

In [13]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

kfold = StratifiedKFold(n_splits = 5)

stacked_model = StackingClassifier(
    estimators = [('xgb', xgb_model), ('knn', knn_model_after_dt), ('dt', dt_model)],
    final_estimator = LogisticRegression(),
    cv = kfold,
    n_jobs = -1
)

stacked_model.fit(X_train, Y_train)

StackingClassifier(cv=StratifiedKFold(n_splits=5,
        random_state=RandomState(MT19937) at 0x7FE818D22A40, shuffle=False),
                   estimators=[('xgb',
                                XGBClassifier(colsample_bytree=0.8, gamma=0.1,
                                              max_depth=9, n_estimators=475,
                                              objective='multi:softprob',
                                              reg_lambda=1.5, subsample=0.7)),
                               ('knn',
                                KNeighborsClassifier(n_neighbors=32, p=1,
                                                     weights='distance')),
                               ('dt',
                                DecisionTreeClassifier(criterion='entropy',
                                                       max_depth=22,
                                                       min_samples_leaf=20,
                                                       min_samples_split=60))]
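
Conceptually, `StackingClassifier` trains the final estimator on out-of-fold predictions of the base models, so the final estimator never sees a prediction made on a row that the predicting base model was fitted on. A simplified sketch of that idea on synthetic data follows. It mirrors the principle rather than scikit-learn's exact internals, and uses two small stand-in base models instead of the ones above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (made up for illustration)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
kfold = StratifiedKFold(n_splits=5)
base_models = [KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier(max_depth=4, random_state=0)]

# Each base model predicts class probabilities for rows it was not trained on
oof_blocks = [cross_val_predict(m, X, y, cv=kfold, method="predict_proba")
              for m in base_models]
meta_features = np.hstack(oof_blocks)   # shape: (n_samples, n_models * n_classes)

# The final estimator learns how to combine the base models' probabilities
final_estimator = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)
```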

## 3.2. Pre-evaluation

What is left is to evaluate the stacked model on the held-out split to obtain an F1 score.

In [14]:
Y_pred = stacked_model.predict(X_test)    # Obtain the test predictions

# f1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

F1 score:     0.7499


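We obtain 0.7499 on the held-out split, slightly above the XGBoost model alone (0.7464) and clearly above the decision tree (0.7151). Whether this carries over to unseen data depends on the model not overfitting, which the competition submission helps to check.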

## 3.3. Preparing the submission

Finally, we generate predictions for the competition test set and submit them to check how much the model has improved.

In [15]:
# Apply feature reduction
test_values_subset = test_values[selected_features]

# Obtain the predictions
predictions = stacked_model.predict(test_values_subset)

# Create the submission file
xgboost_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns, # only one column: 'damage_grade' 
                             index=submission_format.index)
xgboost_submission.to_csv('xgboost_submission_grid_search.csv')
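
Before uploading, it can help to verify that the file matches the template. The sketch below uses a hypothetical helper `check_submission` on toy data, and assumes that valid damage grades are 1, 2 and 3:

```python
import pandas as pd

def check_submission(submission, template):
    """Hypothetical sanity check before uploading a submission file."""
    assert list(submission.columns) == list(template.columns), "column mismatch"
    assert submission.index.equals(template.index), "building_id mismatch"
    # Assumption: the only valid grades are 1, 2 and 3
    assert submission["damage_grade"].isin([1, 2, 3]).all(), "invalid grade"
    return True

# Toy template and predictions (ids and grades are made up)
index = pd.Index([11, 22, 33], name="building_id")
template = pd.DataFrame({"damage_grade": [1, 1, 1]}, index=index)
submission = pd.DataFrame({"damage_grade": [2, 3, 1]}, index=index)

print(check_submission(submission, template))
```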

The submission obtained an F1 score of `0.7423`, which corresponds to rank `#532`.