
<h1><center>WELCOME TO THE PROJECT:</center></h1>

<h2><center>Richter's Predictor: Modeling Earthquake Damage. </center></h2>

##### Contributor: Stephen Vu (Vu Kim Thanh)

### Starting Date: Sep 5th, 2020. 
<font color='red'>Stephen</font> is writing this from **S506 of QUT lab room**

#### The dataset comes from a DrivenData competition. More information about the competition: https://www.drivendata.org/competitions/57/nepal-earthquake/

## Problem Formulation
Based on aspects of building location and construction, your goal is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal.

The data was collected through surveys by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal. This survey is one of the largest post-disaster datasets ever collected, containing valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics.

## Problem description
We are trying to predict the ordinal variable `damage_grade`, which represents the level of damage to a building hit by the earthquake. There are three damage grades:

1. represents low damage

2. represents a medium amount of damage

3. represents almost complete destruction



## Performance metric
We are predicting the level of damage from 1 to 3. The level of damage is an ordinal variable, meaning that the ordering of its values matters. This can be viewed as a classification problem or as an ordinal regression problem. (Ordinal regression is sometimes described as a problem somewhere in between classification and regression.)

To measure the performance of our algorithms, we'll use the F1 score, which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate a binary classifier, but since we have three possible labels we will use a variant called the micro-averaged F1 score.

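$$F_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}$$

where

$$P_{micro} = \frac{\sum_{k=1}^{3} TP_k}{\sum_{k=1}^{3} (TP_k + FP_k)}, \qquad R_{micro} = \frac{\sum_{k=1}^{3} TP_k}{\sum_{k=1}^{3} (TP_k + FN_k)}$$

and $TP$ is True Positive, $FP$ is False Positive, $FN$ is False Negative, and $k$ represents each class in $\{1, 2, 3\}$.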

In Python, you can calculate this score using `sklearn.metrics.f1_score` with the keyword argument `average='micro'`.
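To make the formula concrete, here is a minimal sketch that pools the true positives, false positives, and false negatives over the three classes and combines them as defined above. The helper name `micro_f1` and the sample labels are hypothetical and only serve as an illustration; they are not part of the competition code.

```python
import numpy as np

def micro_f1(y_true, y_pred, labels=(1, 2, 3)):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then combine."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = fp = fn = 0
    for k in labels:
        tp += int(np.sum((y_pred == k) & (y_true == k)))  # predicted k, truly k
        fp += int(np.sum((y_pred == k) & (y_true != k)))  # predicted k, truly other
        fn += int(np.sum((y_pred != k) & (y_true == k)))  # missed a true k
    if tp == 0:
        return 0.0  # no correct prediction at all; avoids dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels, for illustration only
y_true = [1, 2, 2, 3, 3, 3, 2, 1]
y_pred = [1, 2, 3, 3, 2, 3, 2, 2]

score = micro_f1(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(score, accuracy)
```

When every sample has exactly one label, each wrong prediction counts once as a false positive and once as a false negative, so the micro-averaged F1 score coincides with plain accuracy. Calling `f1_score(y_true, y_pred, average='micro')` from `sklearn.metrics` on the same labels should return the same value.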

## <font color='blue'>Cleaning and EDA </font>

In [1]:
# Import basic packages for exploratory analysis
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
from scipy import stats
import seaborn as sns 
from pandas.api.types import is_numeric_dtype

In [8]:
# Load the training features, training labels, and test features
data_dir = 'C:/Users/n10648771/OneDrive - Queensland University of Technology/EarthQuake_DrivenData/dataset'
df_X = pd.read_csv(data_dir + '/train_values.csv')
df_y = pd.read_csv(data_dir + '/train_labels.csv')
# Attach the damage_grade label to each building's features
df = df_X.merge(df_y, on='building_id')
df_test = pd.read_csv(data_dir + '/test_values.csv')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 0 to 260600
Data columns (total 40 columns):
building_id                               260601 non-null int64
geo_level_1_id                            260601 non-null int64
geo_level_2_id                            260601 non-null int64
geo_level_3_id                            260601 non-null int64
count_floors_pre_eq                       260601 non-null int64
age                                       260601 non-null int64
area_percentage                           260601 non-null int64
height_percentage                         260601 non-null int64
land_surface_condition                    260601 non-null object
foundation_type                           260601 non-null object
roof_type                                 260601 non-null object
ground_floor_type                         260601 non-null object
other_floor_type                          260601 non-null object
position                                  260601 non

As the `df.info()` output shows, the data is **clean: no null values and no unexpected datatypes**.
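The same check can be made explicit instead of reading it off the `info()` listing. The sketch below uses a small hypothetical frame (`toy`, with one deliberately missing value) rather than the real training data; on the actual `df`, the list of columns with nulls would be expected to be empty.

```python
import pandas as pd

# Hypothetical miniature of the training frame, for illustration only
toy = pd.DataFrame({
    "building_id": [1, 2, 3],
    "age": [30, 10, 25],
    "roof_type": ["n", "q", None],  # one deliberately missing value
})

null_counts = toy.isnull().sum()                   # missing values per column
columns_with_nulls = null_counts[null_counts > 0]  # keep only affected columns
dtype_counts = toy.dtypes.value_counts()           # number of columns per dtype

print(columns_with_nulls)
print(dtype_counts)
```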

In [10]:
display(df.head(10).T)

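| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| building_id | 802906 | 28830 | 94947 | 590882 | 201944 | 333020 | 728451 | 475515 | 441126 | 989500 |
| geo_level_1_id | 6 | 8 | 21 | 22 | 11 | 8 | 9 | 20 | 0 | 26 |
| geo_level_2_id | 487 | 900 | 363 | 418 | 131 | 558 | 475 | 323 | 757 | 886 |
| geo_level_3_id | 12198 | 2812 | 8973 | 10694 | 1488 | 6089 | 12066 | 12236 | 7219 | 994 |
| count_floors_pre_eq | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 1 |
| age | 30 | 10 | 10 | 10 | 30 | 10 | 25 | 0 | 15 | 0 |
| area_percentage | 6 | 8 | 5 | 6 | 8 | 9 | 3 | 8 | 8 | 13 |
| height_percentage | 5 | 7 | 5 | 5 | 9 | 5 | 4 | 6 | 6 | 4 |
| land_surface_condition | t | o | t | t | t | t | n | t | t | t |
| foundation_type | r | r | r | r | r | r | r | w | r | i |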


In [11]:
# Identify outliers with the z-score rule
def detect_outlier(data_1, threshold=3):
    """Count the values whose absolute z-score exceeds `threshold`."""
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)
    # A constant column has zero spread, so it cannot contain outliers
    if std_1 == 0:
        return 0
    z_scores = (data_1 - mean_1) / std_1
    return int((np.abs(z_scores) > threshold).sum())

In [12]:
# Report the outlier count for every numeric column
for col in df.columns:
    if is_numeric_dtype(df[col]):
        a = detect_outlier(df[col])
        display(col + " : " + str(a) + ' outliers')

'building_id : 0 outliers'

'geo_level_1_id : 0 outliers'

'geo_level_2_id : 0 outliers'

'geo_level_3_id : 0 outliers'

'count_floors_pre_eq : 2496 outliers'

'age : 1390 outliers'

'area_percentage : 3845 outliers'

'height_percentage : 2434 outliers'

'has_superstructure_adobe_mud : 23101 outliers'

'has_superstructure_mud_mortar_stone : 0 outliers'

'has_superstructure_stone_flag : 8947 outliers'

'has_superstructure_cement_mortar_stone : 4752 outliers'

'has_superstructure_mud_mortar_brick : 17761 outliers'

'has_superstructure_cement_mortar_brick : 19615 outliers'

'has_superstructure_timber : 0 outliers'

'has_superstructure_bamboo : 22154 outliers'

'has_superstructure_rc_non_engineered : 11099 outliers'

'has_superstructure_rc_engineered : 4133 outliers'

'has_superstructure_other : 3905 outliers'

'count_families : 2330 outliers'

'has_secondary_use : 0 outliers'

'has_secondary_use_agriculture : 16777 outliers'

'has_secondary_use_hotel : 8763 outliers'

'has_secondary_use_rental : 2111 outliers'

'has_secondary_use_institution : 245 outliers'

'has_secondary_use_school : 94 outliers'

'has_secondary_use_industry : 279 outliers'

'has_secondary_use_health_post : 49 outliers'

'has_secondary_use_gov_office : 38 outliers'

'has_secondary_use_use_police : 23 outliers'

'has_secondary_use_other : 1334 outliers'

'damage_grade : 0 outliers'
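These counts mix continuous columns with 0/1 flags, and the two should be read differently. For a binary flag in which a share $p$ of the rows equals 1, the z-score of a 1 is $\sqrt{(1-p)/p}$, which exceeds 3 exactly when $p < 0.1$. For a rare flag, the "outlier" count is therefore just the number of rows where the flag is set, and for identifier columns such as `building_id` a z-score says little either way. The sketch below illustrates this on synthetic data; the helper names `count_z_outliers` and `is_binary` are hypothetical.

```python
import numpy as np

def count_z_outliers(values, threshold=3):
    """Count the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return 0  # constant column: no spread, no outliers
    z = (values - values.mean()) / std
    return int((np.abs(z) > threshold).sum())

def is_binary(values):
    """True if the column only contains the values 0 and 1."""
    return set(np.unique(values)) <= {0, 1}

# Synthetic flags, for illustration only
rare_flag = np.array([1] * 5 + [0] * 95)     # 5% ones: every 1 has |z| > 3
common_flag = np.array([1] * 30 + [0] * 70)  # 30% ones: no value has |z| > 3

rare_count = count_z_outliers(rare_flag)
common_count = count_z_outliers(common_flag)
print(rare_count, common_count)
```

One possible design choice, under these assumptions, is to apply the z-score rule only to columns for which `is_binary` returns `False`, such as `age` or `area_percentage`, and to inspect the flags with simple frequency counts instead.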

In [13]:
df.count_floors_pre_eq.max()

9