I am very attached to Nepal, having had a great trip there some years ago. While I was staying in Kathmandu, I could see the damage from the 2015 earthquake and all the effort being put into rebuilding the city.

This dataset and classification problem come from DrivenData.org, a platform that hosts data science competitions for social good. Don't hesitate to check them out, they host great projects!

Link to the competition: https://www.drivendata.org/competitions/57/nepal-earthquake

# IMPORTING USEFUL LIBRARIES

In [None]:
import numpy as np
import pandas as pd

import pprint

import seaborn as sb
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb

from sklearn.metrics import classification_report, f1_score, confusion_matrix

from sklearn import ensemble, tree, linear_model, svm, naive_bayes, neural_network, neighbors

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"
# lets a cell display the result of every expression (e.g. both head() and info() of a df), not only the last one
from IPython.display import display_html 

# LOADING DATA

In [None]:
train = pd.read_csv('/kaggle/input/richters-predictor-modeling-earthquake-damage/train_values.csv')
target = pd.read_csv('/kaggle/input/richters-predictor-modeling-earthquake-damage/train_labels.csv')
test = pd.read_csv('/kaggle/input/richters-predictor-modeling-earthquake-damage/test_values.csv')
sub_format = pd.read_csv('/kaggle/input/richters-predictor-modeling-earthquake-damage/submission_format.csv')

In [None]:
train = pd.merge(train, target, on = 'building_id', how = 'left')
train.set_index('building_id', drop = True, inplace = True)
test.set_index('building_id', drop = True, inplace = True)

In [None]:
train.head()
train.info()

We have a total of 37 parameters, mostly numerical, with a few stored as strings. We'll convert these strings to numerical / categorical values later on.

# Problem description

We're trying to predict the ordinal variable damage_grade, which represents a level of damage to the building that was hit by the earthquake.

**There are 3 grades of damage:**
1. represents low damage
2. represents a medium amount of damage
3. represents almost complete destruction

# Performance metric

We are predicting the level of damage from 1 to 3. The level of damage is an ordinal variable, meaning that ordering is important. This can be viewed as a classification or an ordinal regression problem. (Ordinal regression is sometimes described as a problem somewhere in between classification and regression.)

To measure the performance of our algorithms, we'll use the F1 score which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate performance on a binary classifier, but since we have three possible labels we will use a variant called the micro averaged F1 score.
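
To make the metric concrete, here is a minimal pure-Python sketch of the micro-averaged F1 score: true positives, false positives and false negatives are summed over all labels before precision and recall are computed. The helper name `micro_f1` and the toy labels are made up for illustration. When every sample has exactly one label, as here, the result is equal to plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP / FP / FN over all labels, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for real, predicted in zip(y_true, y_pred):
            if predicted == label and real == label:
                tp += 1
            elif predicted == label:
                fp += 1
            elif real == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy damage grades (made-up values, not taken from the competition data)
toy_true = [1, 2, 2, 3, 2, 3]
toy_pred = [1, 2, 3, 3, 2, 2]

score = micro_f1(toy_true, toy_pred)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(score, accuracy)
```

In the rest of the notebook the same quantity is obtained with `f1_score(..., average='micro')` from scikit-learn.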

# A few visualizations to discover the data we have to work with

In [None]:
# Keyword arguments keep this working on recent seaborn releases
sb.countplot(x='damage_grade', data=train)
print(train['damage_grade'].value_counts())

We can note that damage_grade = 2 is much more represented than 1, and about twice as much as 3. Let's investigate further, using geo_level as another variable: according to the host of the competition, the 'geo level' data represents the geographic region in which the building exists, from largest (level 1) to most specific sub-region (level 3). Let's see how it relates to the damage grade:

In [None]:
plt.figure(figsize=(20,5))

# x / y are passed as keywords: recent seaborn releases no longer accept them positionally
plt.subplot(1,3,1)
sb.barplot(x='damage_grade', y='geo_level_1_id', data=train)

plt.subplot(1,3,2)
sb.barplot(x='damage_grade', y='geo_level_2_id', data=train)

plt.subplot(1,3,3)
sb.barplot(x='damage_grade', y='geo_level_3_id', data=train)

plt.show()
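
The geo_level columns are region identifiers rather than quantities, so the mean ID per damage grade is not easy to interpret. A complementary view, sketched below on a handful of made-up rows (the IDs and grades are invented, only the column names come from the dataset), is the share of each damage grade inside each region, computed with `pd.crosstab`.

```python
import pandas as pd

# Made-up rows: region identifiers and damage grades are illustrative only
toy = pd.DataFrame({
    "geo_level_1_id": [6, 6, 6, 8, 8, 21, 21, 21],
    "damage_grade":   [2, 3, 3, 1, 2, 2, 2, 3],
})

# normalize="index" turns the counts of each row (region) into shares summing to 1
shares = pd.crosstab(toy["geo_level_1_id"], toy["damage_grade"], normalize="index")
print(shares.round(2))
```

On the real data the same call would use `train['geo_level_1_id']` and `train['damage_grade']`.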

In [None]:
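plt.figure(figsize=(20,5))

# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
plt.subplot(1,3,1)
sb.histplot(data=train, x='age', kde = False)

# Zoom on buildings up to 200 years old
plt.subplot(1,3,2)
plt.hist(train['age'], range=(0,200))

plt.subplot(1,3,3)
sb.barplot(x='damage_grade', y='age', data=train)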

Almost all buildings are under 100 years old. For each damage grade, only a few are around a thousand years old. Unsurprisingly, newer buildings were less damaged.

Let's now look at the height / area percentage and floor data:

In [None]:
plt.figure(figsize = (20,5))

plt.subplot(1,3,1)
sb.barplot(x='damage_grade', y='height_percentage', data=train)

plt.subplot(1,3,2)
sb.barplot(x='damage_grade', y='area_percentage', data=train)

plt.subplot(1,3,3)
sb.barplot(x='damage_grade', y='count_floors_pre_eq', data=train)

We can see a slight correlation between the height/area data and the damage grade. In addition, taller buildings tend to get more damaged.

What about the 'superstructure' data?

In [None]:
superstructure_cols = [x for x in train.columns if 'super' in x]
secondary_use_cols = [x for x in train.columns if 'secondary' in x]

superstructure_corr = train[superstructure_cols+['damage_grade']].corr()
secondary_use_corr = train[secondary_use_cols+['damage_grade']].corr()

plt.figure(figsize=(30,8))

plt.subplot(1,2,1)
sb.heatmap(secondary_use_corr)

plt.subplot(1,2,2)
sb.heatmap(superstructure_corr)
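
As a reminder of what each cell of these heatmaps holds: `DataFrame.corr()` computes the Pearson correlation by default. The sketch below spells that formula out in pure Python for a binary flag and an ordinal grade. The function name `pearson` and the four toy rows are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Toy data: a 0/1 "has_superstructure_..." style flag against a damage grade
toy_flag = [0, 0, 1, 1]
toy_grade = [1, 2, 2, 3]

r = pearson(toy_flag, toy_grade)
print(round(r, 3))
```

Because damage_grade is ordinal and the flags are binary, these coefficients are best read as a rough indication of direction and strength rather than as a precise measure.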

The correlation between damage and secondary_use is weak, but there might be something interesting with the superstructures:

In [None]:
plt.figure(figsize = (20,5))

plt.subplot(1,3,1)
sb.barplot(x='damage_grade', y='has_superstructure_adobe_mud', data=train)

plt.subplot(1,3,2)
sb.barplot(x='damage_grade', y='has_superstructure_mud_mortar_stone', data=train)

plt.subplot(1,3,3)
sb.barplot(x='damage_grade', y='has_superstructure_cement_mortar_brick', data=train)

We can clearly see that mud structures were much more damaged than more solid ones like cement.

From here, we'll move on to data preparation and a first modeling approach.

# Data preparation (split, cleaning, ...)

We encode the categorical data with the pandas 'get_dummies' function.

In [None]:
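# Collect the string-typed (categorical) columns
text_features = [column for column in train.columns if train[column].dtype == 'object']

for feature in text_features:
    # One-hot encode the column, then drop the original string version
    train = train.join(pd.get_dummies(train[feature], prefix = feature, dtype = int))
    test = test.join(pd.get_dummies(test[feature], prefix = feature, dtype = int))
    
    train.drop(feature, axis = 1, inplace = True)
    test.drop(feature, axis = 1, inplace = True)


features = train.drop('damage_grade', axis = 1).columns

# Give test exactly the training feature columns (a category missing from test becomes an all-zero column)
test = test.reindex(columns = features, fill_value = 0)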
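
One pitfall when train and test are one-hot encoded separately: a category that never appears in test produces no column there, and selecting the training feature list on test then fails. The toy sketch below (the frames and category letters are invented for illustration) shows how `reindex` with `fill_value=0` restores the missing column.

```python
import pandas as pd

# Toy frames: category "x" appears in train but never in test
train_toy = pd.DataFrame({"roof_type": ["n", "q", "x"]})
test_toy = pd.DataFrame({"roof_type": ["n", "n", "q"]})

train_enc = pd.get_dummies(train_toy, columns=["roof_type"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["roof_type"], dtype=int)
print(list(test_enc.columns))  # no roof_type_x column here

# Align test on the training columns; absent categories become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

If test contained a category unseen in train, the same `reindex` would drop that extra column, which keeps the feature matrix consistent with what the model was fitted on.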

In [None]:
train.head()

Our train dataset is ready for the train/test split:

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(train[features], train.damage_grade, random_state = 42)

We will fit a few common classifiers on the training data to see which ones perform best as a first approach:

In [None]:
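# Note: recent xgboost releases expect class labels to start at 0,
# so the 1-3 grades may need to be shifted before fitting XGBClassifier.
classifiers = [neighbors.KNeighborsClassifier(),
               tree.DecisionTreeClassifier(),
               ensemble.RandomForestClassifier(),
               ensemble.GradientBoostingClassifier(),
               xgb.XGBClassifier()]

def test_models(classifiers):
    """Fit each classifier on the training split and print its micro-averaged F1 score on the hold-out split."""
    
    for model in classifiers:
        
        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        
        print(model)
        score = f1_score(Y_test, Y_pred, average='micro')
        print(score)
        print('############')
        
test_models(classifiers)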

It looks like RandomForest and XGBoost perform pretty well: both give an F1 score that will get you at least into the top 20% of the competition.

But I am sure we can do better than that. Let's try to improve both of these results with feature engineering and parameter optimization.

# Feature engineering

Let's dive into some changes / removals we could perform on our features to get a better result.

First, let's re-train our 2 best base models and compare their confusion matrices:

In [None]:
rf_clf = ensemble.RandomForestClassifier()
xgb_clf = xgb.XGBClassifier()

rf_clf.fit(X_train, Y_train)
y_pred_rf = rf_clf.predict(X_test)

xgb_clf.fit(X_train, Y_train)
y_pred_xgb = xgb_clf.predict(X_test)

In [None]:
df_cm_rf = pd.DataFrame(confusion_matrix(Y_test, y_pred_rf), columns=np.unique(Y_test), index = np.unique(Y_test))
df_cm_rf.index.name = 'Real'
df_cm_rf.columns.name = 'Predicted'

df_cm_xgb = pd.DataFrame(confusion_matrix(Y_test, y_pred_xgb), columns=np.unique(Y_test), index = np.unique(Y_test))
df_cm_xgb.index.name = 'Real'
df_cm_xgb.columns.name = 'Predicted'

plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
sb.heatmap(df_cm_rf, annot=True, fmt='d', annot_kws={"size": 24})

plt.subplot(1,2,2)
sb.heatmap(df_cm_xgb, annot=True, fmt='d', annot_kws={"size": 24})

Both of our models over-estimate damage_grade level 2: they predict it 42300 and 43813 times against 36994 real cases, which is roughly 14% and 18% too many.

On the other hand, they tend to under-estimate both damage_grade levels 1 and 3.
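
The over-estimation can be checked with a couple of lines of arithmetic on the counts read from the two confusion matrices. The helper name `over_prediction` is made up for this sketch.

```python
def over_prediction(predicted_count, real_count):
    """Relative excess of predictions for a class compared with its real count."""
    return predicted_count / real_count - 1


# Counts for damage_grade = 2 on the hold-out split, as read from the heatmaps
real_grade_2 = 36994
predicted_grade_2 = [42300, 43813]

for predicted in predicted_grade_2:
    excess = over_prediction(predicted, real_grade_2)
    print(f"{predicted} predicted vs {real_grade_2} real: {excess:+.1%}")
```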

Let's look at the classification reports for both of our models:

In [None]:
print("Random Forest")
print(classification_report(Y_test, y_pred_rf))
print('############################################################')
print("XG Boost")
print(classification_report(Y_test, y_pred_xgb))

These classification reports show that the prediction for damage_grade = 1 isn't really good, which is a fairly predictable result given the distribution of damage_grade in the training set seen before.
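
For readers who want to see where the numbers in a classification report come from, here is a minimal sketch that derives per-class precision, recall and F1 from a confusion matrix laid out like the ones above (rows are real grades, columns are predicted grades). The matrix values and the helper name `per_class_metrics` are made up for illustration; they are not the results of this notebook.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = samples whose real label is labels[i] and predicted label is labels[j]."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = cm[k][k]
        predicted_total = sum(row[k] for row in cm)  # column total
        real_total = sum(cm[k])                      # row total
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / real_total if real_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical confusion matrix for grades 1, 2, 3 (invented counts)
toy_cm = [[50, 45, 5],
          [10, 170, 20],
          [2, 60, 38]]

toy_metrics = per_class_metrics(toy_cm, labels=[1, 2, 3])
for grade, values in toy_metrics.items():
    print(grade, {name: round(value, 2) for name, value in values.items()})
```

In this toy matrix, half of the real grade 1 buildings are predicted as grade 2, which is the kind of pattern that drags the recall of a minority class down.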

In [None]:
importance_rf = pd.DataFrame({"Features":features, "Importance_RF":rf_clf.feature_importances_}).sort_values(by='Importance_RF', ascending = False).head(15)
importance_xgb = pd.DataFrame({"Features":features, "Importance_XGB":xgb_clf.feature_importances_}).sort_values(by='Importance_XGB', ascending = False).head(15)

RF_styler = importance_rf.style.set_table_attributes("style='display:inline'").set_caption('Top 15 Random Forest importance')
XGB_styler = importance_xgb.style.set_table_attributes("style='display:inline'").set_caption('Top 15 XGBoost importance')

display_html(RF_styler._repr_html_()+XGB_styler._repr_html_(), raw=True)

We see here that, for the Random Forest, the geographical data is essential to the model. For XGBoost, on the other hand, the categorical data seems to have more influence.

Let's just check the 'foundation_type_r' feature, which has a high importance for XGB:

In [None]:
train['foundation_type_r'].value_counts()
sb.barplot(x='damage_grade', y='foundation_type_r', data=train)

This feature seems to be important to predict an output of damage_grade=3.

Let's now see whether we can find outliers in the other numerical features:

In [None]:
num_features = ['geo_level_1_id','geo_level_2_id','geo_level_3_id','age','area_percentage','height_percentage']
i = 1

plt.figure(figsize=(20,10))

for col in num_features:
    plt.subplot(3,3,i)
    ax = sb.boxplot(x=train[col].dropna())
    plt.xlabel(col)
    i+=1
plt.show()
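
The points drawn beyond the whiskers of a boxplot follow, with the default settings, the 1.5 x IQR rule. The sketch below applies that rule in pure Python to a few made-up ages; the helper name `iqr_outliers` is hypothetical and the quartiles are computed with linear interpolation.

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values lying more than whisker * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [value for value in values if value < low or value > high]


# Made-up building ages, with one extreme value
toy_ages = [0, 5, 10, 10, 15, 15, 20, 25, 30, 995]

flagged = iqr_outliers(toy_ages)
print(flagged)
```

Whether such points should be removed, capped or kept is a modeling decision; tree-based models such as the ones used here are usually not very sensitive to them.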

# TUNING XGBOOST CLF

Let's go back to our baseline XGBoost classifier and tune a few of its parameters:

In [None]:
print('Baseline f1 score :')
print(f1_score(Y_test, y_pred_xgb, average='micro'))
print('Parameters associated :')
pprint.pprint(xgb_clf.get_params())

In [None]:
param_1 = {'max_depth' : [10, 20, 40, 60, 80]}

xgb_gs = GridSearchCV(xgb_clf, param_1, n_jobs=4,verbose=5, scoring='f1_micro', cv=3)

xgb_gs.fit(X_train, Y_train)
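
To get a feeling for the cost of a grid search, the sketch below expands a parameter grid into its candidate settings and counts the cross-validation fits. The helper `expand_grid` is a hypothetical stand-in written for this illustration, not part of scikit-learn.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination of a {name: [values]} grid."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


param_1 = {'max_depth': [10, 20, 40, 60, 80]}
cv_folds = 3

candidates = expand_grid(param_1)
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the whole training split at the end. The winning setting and its mean cross-validated score are then available as `xgb_gs.best_params_` and `xgb_gs.best_score_`.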

In [None]:
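def make_submission(test_data, classifier):
    
    # Refit on the training split (for a GridSearchCV object this reruns the whole search)
    classifier.fit(X_train, Y_train)
    
    # Build the submission separately so the caller's test frame is left untouched
    submission = pd.DataFrame({'damage_grade': classifier.predict(test_data[features])},
                              index = test_data.index)

    # The index (building_id) is written as the first column
    submission.to_csv('submission.csv', index = True)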

In [None]:
make_submission(test, xgb_gs)
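
Before uploading, a quick sanity check of the file can save a wasted submission. The sketch below assumes the layout produced by the function above (a `building_id` column followed by a `damage_grade` column with values 1 to 3); the helper name `check_submission` and the two demo rows are made up.

```python
import csv
import os
import tempfile


def check_submission(path, valid_grades=(1, 2, 3)):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if header != ["building_id", "damage_grade"]:
            raise ValueError(f"unexpected header: {header}")
        n_rows = 0
        for building_id, grade in reader:
            if int(grade) not in valid_grades:
                raise ValueError(f"building {building_id}: invalid grade {grade}")
            n_rows += 1
    return n_rows


# Demo on a throwaway file with invented building ids
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submission.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("building_id,damage_grade\n300051,3\n99355,2\n")
    n_checked = check_submission(demo_path)

print(n_checked, "rows checked")
```

On the real file, the returned row count can also be compared with `len(sub_format)`.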

Thanks for reading. Do not hesitate to comment if you have questions about the competition or my notebook. I will regularly update it with better classifiers in order to climb, I hope, a few ranks. :D

Remarks about what I could do better are greatly appreciated!