# Richter's Predictor: Modeling Earthquake Damage

#### DrivenData is hosting a competition to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal, based on aspects of building location and construction.

#### I took part in the competition with this script and obtained a micro-averaged F1 score of 0.7289 (position 381 out of 2980).

#### More information can be found in the following [link]
[link]: https://www.drivendata.org/competitions/57/nepal-earthquake/

In [None]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import copy
import math
from sklearn.model_selection import cross_val_score

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_X = dd.read_csv("/kaggle/input/richters-predictor-modeling-earthquake-damage/train_values.csv")
train_y = dd.read_csv("/kaggle/input/richters-predictor-modeling-earthquake-damage/train_labels.csv")

df = train_X.merge(train_y, how="inner", on = "building_id")
sample = df.sample(frac=1, random_state=12).compute()

In [None]:
object_list = list(sample.select_dtypes("object").columns)
print(object_list)

#### After analyzing the columns and their meaning, we decided to remove those related to the secondary use of the building, on the assumption that they have little impact on the damage.

In [None]:
# Collect every column whose name mentions a secondary use, then drop them all at once.
colToDrop = [col for col in sample.columns if "secondary" in col]
sample = sample.drop(columns=colToDrop)

#### Let's see if the dataset is balanced:

In [None]:
sns.countplot(x='damage_grade', data=sample)

#### We can clearly see that the dataset is not balanced. However, in our experiments the model scored better on this dataset than on an oversampled or undersampled version, so we keep it as is.
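
#### For reference, a minimal sketch of what random oversampling could look like. It runs on made-up toy data (the `toy` frame and its values are assumptions) and is not the resampling code used in the experiments mentioned above:

```python
import pandas as pd
from sklearn.utils import resample

# Toy, deliberately imbalanced data: 2 / 6 / 4 rows for grades 1 / 2 / 3.
toy = pd.DataFrame({
    "age": range(12),
    "damage_grade": [1] * 2 + [2] * 6 + [3] * 4,
})

# Size of the largest class; every class is resampled up to this size.
target_size = toy["damage_grade"].value_counts().max()

# Draw rows with replacement from each class until it reaches target_size.
parts = [
    resample(group, replace=True, n_samples=target_size, random_state=0)
    for _, group in toy.groupby("damage_grade")
]
oversampled = pd.concat(parts, ignore_index=True)

print(oversampled["damage_grade"].value_counts().sort_index())
```

#### Undersampling would follow the same pattern with `replace=False` and the size of the smallest class as the target.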

#### The next step is to create dummy variables for the categorical columns:

In [None]:
sample = pd.get_dummies(sample)
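
#### To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (the column values are made up for illustration). Numeric columns pass through unchanged, and each category of a text column becomes its own indicator column:

```python
import pandas as pd

# Toy frame with one numeric and one categorical (object) column.
toy = pd.DataFrame({
    "age": [5, 10, 15],
    "roof_type": ["n", "q", "n"],
})

encoded = pd.get_dummies(toy)

# One indicator column per category, named <column>_<category>.
print(list(encoded.columns))
print(encoded)
```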

#### We are now ready to prepare the dataset for prediction:

In [None]:
# Features: every column except the target. Target: the damage grade.
X = sample.drop(columns="damage_grade").values
y = sample["damage_grade"].values

#### We split the dataset so that the model is trained on one part of the data and tested on another:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
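
#### Because the classes are imbalanced, one option (not used in this notebook) is a stratified split, which keeps the class proportions the same in both parts. A minimal sketch on toy labels, where the arrays are assumptions made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 10 / 30 / 20 samples for grades 1 / 2 / 3.
y_toy = np.array([1] * 10 + [2] * 30 + [3] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Count how many samples of each grade ended up in each part.
train_counts = np.bincount(y_tr)[1:]
test_counts = np.bincount(y_te)[1:]
print(train_counts, test_counts)
```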

#### The best result for this case is obtained with gradient boosting:

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

boost = GradientBoostingClassifier(learning_rate=0.15, max_depth=5, min_samples_split=1200,
                                   n_estimators=300, verbose=1)
# ravel() gives the 1-D label array that scikit-learn expects.
boost.fit(X_train, y_train.ravel())
y_pred = boost.predict(X_test)
# Store the result under a new name so the imported f1_score function is not overwritten.
score = f1_score(y_test.ravel(), y_pred, average='micro')
print(score)
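
#### A note on the metric: when every sample has exactly one label, as here, the micro-averaged F1 score equals plain accuracy. The sketch below checks this on a few made-up labels (the arrays are toy values, not competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy true and predicted damage grades; 5 of the 8 predictions are correct.
y_true = np.array([1, 2, 2, 3, 3, 3, 2, 1])
y_hat = np.array([1, 2, 3, 3, 3, 2, 2, 2])

micro_f1 = f1_score(y_true, y_hat, average="micro")
accuracy = accuracy_score(y_true, y_hat)

# Both values are the fraction of correct predictions.
print(micro_f1, accuracy)
```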

#### Let's take a look at the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix
conf_matriz = confusion_matrix(y_test, y_pred)
print(conf_matriz)
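
#### In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted ones, so dividing the diagonal by the row sums gives the recall of each damage grade. A small sketch on made-up labels (toy values, not the model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted damage grades.
y_true = np.array([1, 2, 2, 3, 3, 3, 2, 1])
y_hat = np.array([1, 2, 3, 3, 3, 2, 2, 2])

# Rows: true grade 1, 2, 3. Columns: predicted grade 1, 2, 3.
cm = confusion_matrix(y_true, y_hat, labels=[1, 2, 3])

# Per-class recall: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall)
```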

#### The last step is to save the predictions in the right format so that the file can be uploaded to the DrivenData website for scoring:

In [None]:
test_x = pd.read_csv("/kaggle/input/richters-predictor-modeling-earthquake-damage/test_values.csv")
test_data = pd.get_dummies(test_x)

In [None]:
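# Keep exactly the training feature columns, in training order; dummy columns
# that do not appear in the test set are filled with 0.
feature_cols = sample.columns.drop("damage_grade")
test_data = test_data.reindex(columns=feature_cols, fill_value=0)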

In [None]:
y_pred_test = boost.predict(test_data)
y_pred_test = pd.DataFrame(y_pred_test, columns = ["damage_grade"])
building = test_x.loc[:,"building_id"]
building = pd.DataFrame(building, columns = ["building_id"])
solucion = pd.concat([building, y_pred_test], axis = 1)

#solucion.to_csv("/kaggle/output/solution.csv", index = False)