# 1. Introduction

In this Google Colab notebook, we apply what we have learned about supervised learning algorithms. The task addressed in this milestone is predicting the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal. Further information on the task is available on the **drivendata.org** competition page: "[Richter's Predictor: Modeling Earthquake Damage](https://www.drivendata.org/competitions/57/nepal-earthquake/)".

The authors of this project are:

- [Raúl Barba Rojas](mailto:Raul.Barba@alu.uclm.es)
- [Diego Guerrero Del Pozo](mailto:Diego.Guerrero@alu.uclm.es)
- [Marvin Schmidt](mailto:Marvin.Schmidt@alu.uclm.es)

# 2. Preparations

## 2.1. Installing CatBoost

In [1]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
     |████████████████████████████████| 76.6 MB 19 kB/s
Installing collected packages: catboost
Successfully installed catboost-1.1.1


## 2.2. Importing libraries

In [2]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## 2.3. Importing training data

All the datasets from the DrivenData competition can be accessed in this [GitHub repository](https://github.com/alan-flint/Richter-DrivenData).

In this section, we simply load the three different datasets as pandas dataframes, so that we can work with them to achieve the desired results.

---

There are two different csv files related to the training dataset:

1. `train_values.csv`: this file contains the feature values used for training.
2. `train_labels.csv`: this file contains the label we are trying to predict for each building, which is called `damage_grade`.

Thus, we first download both files from the GitHub repository and load them as dataframes:

In [3]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
df_train_values = pd.read_csv("train_values.csv", index_col="building_id")
df_train_values

--2022-12-12 22:18:18--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_values.csv
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_values.csv [following]
--2022-12-12 22:18:18--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_values.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23442727 (22M) [text/plain]
Saving to: ‘train_values.csv’


2022-12-12 22:18:19 (210 MB/s) - ‘train_values.csv’ saved [23442727/23442727]



| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 25 | 1335 | 1621 | 1 | 55 | 6 | 3 | n | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 669485 | 17 | 715 | 2060 | 2 | 0 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 602512 | 17 | 51 | 8163 | 3 | 55 | 6 | 7 | t | r | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 151409 | 26 | 39 | 1851 | 2 | 10 | 14 | 6 | t | r | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
df_train_labels = pd.read_csv("train_labels.csv", index_col = "building_id")
df_train_labels

--2022-12-12 22:18:20--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/train_labels.csv
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_labels.csv [following]
--2022-12-12 22:18:21--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/train_labels.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2330792 (2.2M) [text/plain]
Saving to: ‘train_labels.csv’


2022-12-12 22:18:22 (50.1 MB/s) - ‘train_labels.csv’ saved [2330792/2330792]



| building_id | damage_grade |
| --- | --- |
| 802906 | 3 |
| 28830 | 2 |
| 94947 | 3 |
| 590882 | 2 |
| 201944 | 3 |
| ... | ... |
| 688636 | 2 |
| 669485 | 3 |
| 602512 | 3 |
| 151409 | 2 |


Once we have loaded both datasets we need to join them, obtaining the complete training dataset:

In [5]:
df_train_values.join(df_train_labels).to_csv("train_full.csv")
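
Because both frames are indexed by `building_id`, `DataFrame.join` aligns rows on that index rather than on row position. The following is a minimal sketch with toy frames; the values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-ins for df_train_values and df_train_labels (illustrative values only)
values = pd.DataFrame(
    {"age": [30, 10, 55]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)
labels = pd.DataFrame(
    {"damage_grade": [2, 3, 3]},
    # Same buildings, but deliberately in a different row order
    index=pd.Index([28830, 94947, 802906], name="building_id"),
)

# join() matches rows by index label, so the differing order does not matter
full = values.join(labels)

# Sanity check: count buildings that did not receive a label
missing_labels = int(full["damage_grade"].isna().sum())
print(full)
print("rows without a label:", missing_labels)
```

A non-zero count of missing labels would indicate that the two files do not cover the same buildings.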

## 2.4. Importing testing data

To evaluate our findings, we also need the testing data and the template for the submission file. These files can be accessed from the same GitHub repository.

1. `test_values.csv`: this file contains the feature values of the buildings whose damage grade we have to predict.
2. `submission_format.csv`: this file contains placeholder labels for all the buildings in the test set. It is a template to be filled in later, in which every `damage_grade` label is `1`.

In [6]:
from sklearn.preprocessing import StandardScaler

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
test_values = pd.read_csv('test_values.csv', index_col='building_id')
test_values = pd.get_dummies(test_values)

!wget https://github.com/alan-flint/Richter-DrivenData/raw/master/input/submission_format.csv
submission_format = pd.read_csv('submission_format.csv', index_col='building_id')

--2022-12-12 22:18:28--  https://github.com/alan-flint/Richter-DrivenData/raw/master/input/test_values.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/test_values.csv [following]
--2022-12-12 22:18:28--  https://raw.githubusercontent.com/alan-flint/Richter-DrivenData/master/input/test_values.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7815385 (7.5M) [text/plain]
Saving to: ‘test_values.csv’


2022-12-12 22:18:29 (114 MB/s) - ‘test_values.csv’ saved [7815385/7815385]

--2022-12-12 22:18:30--  https://github.com/alan-flint/Richter-DrivenData/r

# 3. Model implementation

## 3.1. CatBoost

First, we need to decide which features to use. In our case, we use the ones obtained from the decision trees in the baseline.

In [7]:
df_train_values_subset = pd.get_dummies(df_train_values)

selected_features = ['age',
                         'area_percentage',
                         'height_percentage',
                         'geo_level_1_id',
                         'geo_level_2_id',
                         'geo_level_3_id',
                         'has_superstructure_adobe_mud',
                         'has_superstructure_mud_mortar_stone',
                         'has_superstructure_stone_flag',
                         'has_superstructure_cement_mortar_stone',
                         'has_superstructure_mud_mortar_brick',
                         'has_superstructure_cement_mortar_brick',
                         'has_superstructure_timber',
                         'has_superstructure_bamboo',
                         'has_superstructure_rc_non_engineered',
                         'has_superstructure_rc_engineered',
                         'has_superstructure_other',
                         'foundation_type_r',
                         'ground_floor_type_v',
                         'other_floor_type_q']

# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later
df_train_values_subset = df_train_values_subset[selected_features].copy()
df_train_values_subset

| building_id | age | area_percentage | height_percentage | geo_level_1_id | geo_level_2_id | geo_level_3_id | has_superstructure_adobe_mud | has_superstructure_mud_mortar_stone | has_superstructure_stone_flag | has_superstructure_cement_mortar_stone | has_superstructure_mud_mortar_brick | has_superstructure_cement_mortar_brick | has_superstructure_timber | has_superstructure_bamboo | has_superstructure_rc_non_engineered | has_superstructure_rc_engineered | has_superstructure_other | foundation_type_r | ground_floor_type_v | other_floor_type_q |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 802906 | 30 | 6 | 5 | 6 | 487 | 12198 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 28830 | 10 | 8 | 7 | 8 | 900 | 2812 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 94947 | 10 | 5 | 5 | 21 | 363 | 8973 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 590882 | 10 | 6 | 5 | 22 | 418 | 10694 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 201944 | 30 | 8 | 9 | 11 | 131 | 1488 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688636 | 55 | 6 | 3 | 25 | 1335 | 1621 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 669485 | 0 | 6 | 5 | 17 | 715 | 2060 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 602512 | 55 | 6 | 7 | 17 | 51 | 8163 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 151409 | 10 | 14 | 6 | 26 | 39 | 1851 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
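
Note that `pd.get_dummies` is applied separately to the training and test frames, so a category level that appears in only one of them would yield mismatched columns. One common safeguard, sketched here on toy data with made-up values, is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Toy frames (illustrative values): level 'w' is missing from the test data,
# and level 'i' never appears in the training data
train_raw = pd.DataFrame({"age": [30, 10, 55], "foundation_type": ["r", "w", "r"]})
test_raw = pd.DataFrame({"age": [5, 20], "foundation_type": ["r", "i"]})

train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw)

# Keep exactly the training columns: missing ones are added as 0,
# columns unseen during training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

With the fixed `selected_features` list used in this notebook, a missing column would surface as a `KeyError` instead, so the reindexing step is mainly useful when the full set of dummy columns is kept.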


Next, we apply min-max normalization to the numerical features: `age`, `area_percentage` and `height_percentage`.

In [8]:
numeric_features = ['age', 'area_percentage', 'height_percentage']

# Min-max scale each numerical column; every dataframe uses its own min and max
for df in (df_train_values_subset, test_values):
    for col in numeric_features:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
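
The cell above rescales the training and test frames independently, each with its own minimum and maximum. An alternative that is often preferred is to compute the statistics on the training data only and reuse them for the test data, so that the same raw value maps to the same scaled value in both. This is a minimal sketch on toy data, not the method used in this notebook; the helper names `fit_min_max` and `apply_min_max` are hypothetical:

```python
import pandas as pd

numeric_features = ["age", "area_percentage", "height_percentage"]

def fit_min_max(df, columns):
    """Return the per-column minimum and range observed in df."""
    col_min = df[columns].min()
    col_range = df[columns].max() - col_min
    return col_min, col_range

def apply_min_max(df, columns, col_min, col_range):
    """Return a copy of df with the columns rescaled by the supplied statistics."""
    out = df.copy()
    for col in columns:
        out[col] = (df[col] - col_min[col]) / col_range[col]
    return out

# Toy data (illustrative values only)
train = pd.DataFrame({"age": [0, 10, 50],
                      "area_percentage": [2, 6, 10],
                      "height_percentage": [3, 5, 7]})
test = pd.DataFrame({"age": [25, 100],
                     "area_percentage": [4, 8],
                     "height_percentage": [5, 9]})

# Statistics come from the training data and are reused for the test data
col_min, col_range = fit_min_max(train, numeric_features)
train_scaled = apply_min_max(train, numeric_features, col_min, col_range)
test_scaled = apply_min_max(test, numeric_features, col_min, col_range)
print(test_scaled)
```

Test values outside the training range then fall outside `[0, 1]`, which is expected with this approach.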

CatBoost handles categorical features natively, so we cast the geo-level identifiers to the `category` dtype and later pass them to the model as categorical features.

In [9]:
geo_features = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id']

# Cast the geo-level identifiers to the category dtype in both dataframes
for df in (df_train_values_subset, test_values):
    for col in geo_features:
        df[col] = df[col].astype('category')
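
To illustrate what the cast does: `astype('category')` stores the identifiers as unordered labels rather than magnitudes, so no ordering between the codes is implied. The small sketch below uses toy identifiers and also shows one way to collect the categorical column names by dtype, for instance to pass them as `cat_features`:

```python
import pandas as pd

# Toy data (illustrative values only)
df = pd.DataFrame({"geo_level_1_id": [6, 8, 21, 6], "age": [30, 10, 10, 55]})
df["geo_level_1_id"] = df["geo_level_1_id"].astype("category")

# Names of all columns that carry the category dtype
cat_columns = df.select_dtypes(include="category").columns.tolist()

# Number of distinct levels, and whether pandas treats them as ordered
n_levels = df["geo_level_1_id"].cat.categories.size
is_ordered = df["geo_level_1_id"].cat.ordered
print(cat_columns, n_levels, is_ordered)
```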

Then we split the labelled data into a training set and a held-out evaluation set.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_train_values_subset, df_train_labels.damage_grade, random_state=1)
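
By default, `train_test_split` holds out 25% of the rows. If the damage grades are imbalanced, passing `stratify` keeps the class proportions the same in both parts. The notebook does not do this; the following is an optional variation sketched on synthetic labels with made-up proportions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=1000)})
# Synthetic, deliberately imbalanced labels (the proportions are made up)
y = pd.Series([1] * 100 + [2] * 600 + [3] * 300, name="damage_grade")

# stratify=y preserves the 10% / 60% / 30% class mix in both parts
X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, stratify=y, random_state=1)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_ev.value_counts(normalize=True).sort_index())
```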

## 3.2. Pre-evaluation

Finally, we are ready to train the CatBoost model and evaluate its damage grade predictions on the held-out set:

In [11]:
import time
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

model = CatBoostClassifier(random_state = 0)

%time model.fit(X_train, Y_train, cat_features = ['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id'])

Y_pred = model.predict(X_test)    # Obtain the test predictions

# f1-score
f1 = f1_score(Y_test, Y_pred, average = 'micro')
print('F1 score: ' + '{:10.4f}'.format(f1))

Learning rate set to 0.103554
0:	learn: 1.0255558	total: 1.1s	remaining: 18m 15s
1:	learn: 0.9689118	total: 2s	remaining: 16m 35s
2:	learn: 0.9230123	total: 2.49s	remaining: 13m 48s
3:	learn: 0.8842905	total: 2.88s	remaining: 11m 56s
4:	learn: 0.8526345	total: 3.32s	remaining: 11m 1s
5:	learn: 0.8255464	total: 3.7s	remaining: 10m 13s
6:	learn: 0.8027421	total: 4.07s	remaining: 9m 37s
7:	learn: 0.7830673	total: 4.47s	remaining: 9m 14s
8:	learn: 0.7663164	total: 4.91s	remaining: 9m
9:	learn: 0.7518190	total: 5.32s	remaining: 8m 46s
10:	learn: 0.7388563	total: 5.67s	remaining: 8m 29s
11:	learn: 0.7280713	total: 6.13s	remaining: 8m 24s
12:	learn: 0.7180350	total: 6.56s	remaining: 8m 18s
13:	learn: 0.7091895	total: 6.96s	remaining: 8m 10s
14:	learn: 0.6991482	total: 7.38s	remaining: 8m 4s
15:	learn: 0.6899480	total: 7.79s	remaining: 7m 58s
16:	learn: 0.6821826	total: 8.15s	remaining: 7m 51s
17:	learn: 0.6755864	total: 8.52s	remaining: 7m 44s
18:	learn: 0.6696449	total: 8.85s	remaining: 7m 3
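
A note on the metric: for single-label multiclass problems such as this one, the micro-averaged F1 score pools true positives, false positives and false negatives over all classes, which makes it numerically equal to plain accuracy. A short check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 5 of the 8 predictions are correct
y_true = np.array([1, 2, 2, 3, 3, 3, 2, 1])
y_pred = np.array([1, 2, 3, 3, 3, 2, 2, 2])

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
print(f"micro F1 = {micro_f1:.4f}, accuracy = {accuracy:.4f}")
```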

## 3.3. Preparing the submission

Now we can write out the predictions so that they can be uploaded to the competition.

In [12]:
# Apply feature reduction
test_values_subset = test_values[selected_features]

# Obtain the predictions
predictions = model.predict(test_values_subset)

# Create the submission file
catboost_submission = pd.DataFrame(data=predictions,
                                   columns=submission_format.columns, # Only one column: 'damage_grade'
                                   index=submission_format.index)
catboost_submission.to_csv('catboost_submission1.csv')
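
Before uploading, it can be worth checking that the submission has the same index and column as the template and that every label is a valid grade. This is a minimal sketch with toy stand-ins for `submission_format` and the predictions; the helper `check_submission` is hypothetical, the building ids are made up, and the set of valid grades (1 to 3) is an assumption:

```python
import numpy as np
import pandas as pd

def check_submission(submission, template, valid_grades=(1, 2, 3)):
    """Return True if submission matches the template layout and holds only valid grades."""
    same_index = submission.index.equals(template.index)
    same_columns = list(submission.columns) == list(template.columns)
    valid_values = bool(submission["damage_grade"].isin(valid_grades).all())
    return same_index and same_columns and valid_values

# Toy template (made-up ids), mirroring the all-ones placeholder labels
template = pd.DataFrame(
    {"damage_grade": [1, 1, 1]},
    index=pd.Index([300051, 99355, 890251], name="building_id"),
)
# Toy predictions shaped as a column vector, one row per building
predictions = np.array([[3], [2], [2]])

submission = pd.DataFrame(predictions, columns=template.columns, index=template.index)
ok = check_submission(submission, template)
print(ok)
```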

This submission scores `0.7419` in the competition, which places us at rank `#557`. Let's see whether it can be improved in other colabs.