# Richter's Predictor: Modeling Earthquake Damage

Source: https://www.drivendata.org/competitions/57/nepal-earthquake/

Based on aspects of building location and construction, your goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

EVALUATION METRIC
$$F_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}$$
The metric used for this competition is the micro-averaged F1 score.
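
As a rough illustration of the metric, the sketch below pools true positives, false positives, and false negatives over all classes before computing precision and recall. The helper `micro_f1` and the toy labels are made up for this example and are not part of the competition code. For single-label multiclass predictions the result equals plain accuracy, which the last lines check.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1          # correctly predicted this class
            elif pred == label and truth != label:
                fp += 1          # predicted this class, but it was another
            elif pred != label and truth == label:
                fn += 1          # missed an instance of this class
    if tp == 0:
        return 0.0               # avoids division by zero when nothing is right
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only
y_true_toy = [1, 2, 2, 3, 3, 3]
y_pred_toy = [1, 2, 3, 3, 3, 2]

score = micro_f1(y_true_toy, y_pred_toy)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(score, accuracy)
```

On real predictions, `f1_score(..., average='micro')` from scikit-learn computes the same quantity.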

# Get data from website

In [6]:
!wget https://s3.amazonaws.com/drivendata/data/57/public/train_values.csv -nc -P ./nepal
!wget https://s3.amazonaws.com/drivendata/data/57/public/train_labels.csv -nc -P ./nepal
!wget https://s3.amazonaws.com/drivendata/data/57/public/test_values.csv  -nc -P ./nepal

--2020-02-13 16:19:07--  https://s3.amazonaws.com/drivendata/data/57/public/train_values.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.179.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.179.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23442727 (22M) [text/csv]
Saving to: ‘./nepal/train_values.csv’


2020-02-13 16:19:08 (89.1 MB/s) - ‘./nepal/train_values.csv’ saved [23442727/23442727]

File ‘./nepal/train_labels.csv’ already there; not retrieving.

--2020-02-13 16:19:08--  https://s3.amazonaws.com/drivendata/data/57/public/test_values.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.179.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.179.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7815385 (7.5M) [text/csv]
Saving to: ‘./nepal/test_values.csv’


2020-02-13 16:19:08 (48.3 MB/s) - ‘./nepal/test_values.csv’ saved [7815385/7815385]
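
Outside a notebook, where the `!wget` shell escape is not available, the same download can be sketched in plain Python. The helper name `download_if_missing` is hypothetical. It mimics `wget -nc -P` by skipping files that already exist in the target directory.

```python
import urllib.request
from pathlib import Path

# URLs taken from the wget commands above
BASE_URL = "https://s3.amazonaws.com/drivendata/data/57/public/"
FILES = ["train_values.csv", "train_labels.csv", "test_values.csv"]


def download_if_missing(url, dest_dir):
    """Fetch url into dest_dir unless the file is already there.

    Returns (path, downloaded) where downloaded is False if the file existed.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]   # file name from the URL
    if target.exists():
        return target, False                     # same effect as wget -nc
    urllib.request.urlretrieve(url, str(target))
    return target, True


# Example call, commented out so the sketch does not touch the network:
# for name in FILES:
#     download_if_missing(BASE_URL + name, "./nepal")
```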



# Import Data

In [10]:
import pandas as pd

In [21]:
X = pd.read_csv('./nepal/train_values.csv', index_col= 'building_id' )

In [22]:
X.head()

| building_id | geo_level_1_id | geo_level_2_id | geo_level_3_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ... | has_secondary_use_agriculture | has_secondary_use_hotel | has_secondary_use_rental | has_secondary_use_institution | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 802906 | 6 | 487 | 12198 | 2 | 30 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28830 | 8 | 900 | 2812 | 2 | 10 | 8 | 7 | o | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 94947 | 21 | 363 | 8973 | 2 | 10 | 5 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 590882 | 22 | 418 | 10694 | 2 | 10 | 6 | 5 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201944 | 11 | 131 | 1488 | 3 | 30 | 8 | 9 | t | r | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [23]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 802906 to 747594
Data columns (total 38 columns):
geo_level_1_id                            260601 non-null int64
geo_level_2_id                            260601 non-null int64
geo_level_3_id                            260601 non-null int64
count_floors_pre_eq                       260601 non-null int64
age                                       260601 non-null int64
area_percentage                           260601 non-null int64
height_percentage                         260601 non-null int64
land_surface_condition                    260601 non-null object
foundation_type                           260601 non-null object
roof_type                                 260601 non-null object
ground_floor_type                         260601 non-null object
other_floor_type                          260601 non-null object
position                                  260601 non-null object
plan_configuration                        2606

In [20]:
y = pd.read_csv('./nepal/train_labels.csv', index_col = 'building_id')['damage_grade']
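
Because `X` and `y` are read from separate files, `train_test_split` pairs them purely by row position, so it may be worth confirming that both share the same `building_id` order. The sketch below uses small toy frames: the three IDs and heights come from the `X.head()` output above, while the labels are made up for illustration.

```python
import pandas as pd

# Toy features, in the order shown by X.head()
X_toy = pd.DataFrame(
    {"height_percentage": [5, 7, 5]},
    index=pd.Index([802906, 28830, 94947], name="building_id"),
)

# Toy labels (made up), deliberately stored in a different row order
y_toy = pd.Series(
    [3, 2, 3],
    index=pd.Index([94947, 802906, 28830], name="building_id"),
    name="damage_grade",
)

# True only if both indexes hold the same IDs in the same order
same_order = X_toy.index.equals(y_toy.index)

# Reorder the labels to follow the feature rows
y_aligned = y_toy.reindex(X_toy.index)

print(same_order)
print(y_aligned)
```

On the real data the equivalent check would be `X.index.equals(y.index)`. If it returns `False`, `y.reindex(X.index)` restores the pairing before splitting.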

## Attempt 1: Model with One Feature

In [24]:
X_height = X[['height_percentage']]
X_height.head()

| building_id | height_percentage |
|---|---|
| 802906 | 5 |
| 28830 | 7 |
| 94947 | 5 |
| 590882 | 5 |
| 201944 | 9 |


In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_height,
                                                    y,
                                                    test_size=0.2,
                                                    random_state =42)

In [33]:
one_feat_model = LogisticRegression(solver='lbfgs', multi_class = 'auto')
one_feat_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [34]:
y_train_pred = one_feat_model.predict(X_train)

In [37]:
print('In-sample f1 score:')
f1_score(y_train, y_train_pred, average ='micro')

In-sample f1 score:


0.5699779355333845

In [38]:
y_test_pred = one_feat_model.predict(X_test)

In [39]:
print('Out-of-sample f1 score:')
f1_score(y_test, y_test_pred, average ='micro')

Out-of-sample f1 score:


0.5660290477926364
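
One way to put a score like this in context is to compare it with a baseline that always predicts the most frequent class. The sketch below does this on synthetic data. `X_toy`, `y_toy`, and the class proportions are assumptions for illustration, not properties of the Nepal dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for one feature and three damage grades (made up)
rng = np.random.default_rng(42)
X_toy = rng.integers(1, 10, size=(1000, 1))
y_toy = rng.choice([1, 2, 3], size=1000, p=[0.1, 0.6, 0.3])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Baseline that ignores the features and predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_f1 = f1_score(y_te, baseline_pred, average="micro")

# Share of the held-out rows that belong to the predicted class
majority_share = np.mean(y_te == baseline_pred[0])

print(baseline_f1, majority_share)
```

If a model's micro F1 is close to the baseline's, it has learned little beyond the class frequencies. Running the same comparison with `X_train`, `y_train`, `X_test`, and `y_test` would show how much the single feature adds.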

# Create Submission file

In [40]:
X_comp_test = pd.read_csv('./nepal/test_values.csv', index_col = 'building_id')

In [42]:
y_comp_pred = one_feat_model.predict(X_comp_test[['height_percentage']])

In [48]:
y_submission = pd.DataFrame(y_comp_pred, index= X_comp_test.index, columns = ['damage_grade'])

In [49]:
y_submission.to_csv('nepal/1st_sub.csv')
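
Before uploading, a quick sanity check on the written file can catch formatting slips. The sketch below is a hypothetical helper, not part of the competition tooling. It assumes the expected format is a `building_id` index with a single `damage_grade` column, and that valid grades are 1, 2, and 3. The toy IDs are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

ALLOWED_GRADES = [1, 2, 3]   # assumption about the valid label set


def check_submission(path, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    sub = pd.read_csv(path, index_col="building_id")
    if list(sub.columns) != ["damage_grade"]:
        return ["unexpected columns: " + ", ".join(sub.columns)]
    problems = []
    if not sub.index.equals(pd.Index(expected_ids)):
        problems.append("building_id values or order differ from the test set")
    if not sub["damage_grade"].isin(ALLOWED_GRADES).all():
        problems.append("damage_grade contains values outside the allowed set")
    return problems


# Toy submission with made-up IDs, written to a temporary directory
toy_ids = [11, 22, 33]
toy_sub = pd.DataFrame(
    {"damage_grade": [2, 3, 2]},
    index=pd.Index(toy_ids, name="building_id"),
)
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_sub.csv"
    toy_sub.to_csv(toy_path)
    problems = check_submission(toy_path, toy_ids)

print(problems)
```

For the real file, the call would be `check_submission('nepal/1st_sub.csv', X_comp_test.index)`.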

Score: 0.56