# Predicting Building Damage from Earthquakes

![](https://www.travelinggeologist.com/wp-content/uploads/2015/05/Fig.-8.-Patan-Durbar-Square.jpg)

## Introduction

This notebook was created to help me become familiar with Python, as I have traditionally been an R user. The task is a multi-class supervised learning problem: predicting the damage classification of buildings following the Gorkha earthquake of April 2015. The earthquake had a magnitude of 7.8 and occurred near the Gorkha district of Gandaki Pradesh, Nepal. Almost 9,000 lives were lost, millions of people were instantly made homeless, and $10 billion in damages, about half of Nepal's nominal GDP, was incurred.

This analysis is applied to one of the largest post-disaster datasets ever collected, containing valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics. The target variable has 5 classes, labelled 'Grade 1' to 'Grade 5', each representing a different level of damage sustained by the building.

This notebook explores the data and then develops a predictive model. The general structure of the work is as follows:

* [Prepare Environment](#section-one)
* [Data Reference](#section-two)
* [Exploratory Analysis](#section-three)
* [Feature Engineering](#section-four)
* [Model Building](#section-five)
    - [Split Data](#section-five-subsection-one)
    - [Data Processing](#section-five-subsection-two)        
    - [Cross Validation](#section-five-subsection-four)
    - [Model Building & Comparison](#section-five-subsection-five)        
* [Model Evaluation](#section-six)
* [Conclusion](#section-seven)  

<a id="section-one"></a>
## Prepare Environment

In this section we prepare the environment for analysis which includes loading libraries for functionality and data to work with. 

The following packages are used in this analysis:

1. Numpy: Numerical computing
1. Pandas: Dataframes
1. Matplotlib: General visualisations
1. Seaborn: Statistical visualisations
1. Sklearn: Modelling framework

In [None]:
import numpy as np # linear algebra
import pandas as pd # dataframes
import matplotlib.pyplot as plt # General visualisations
import matplotlib.ticker as mtick # Axis visuals
import seaborn as sns # Statistical visualisations
from math import pi # Radar chart support
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, FunctionTransformer, LabelEncoder, RobustScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix 

# Load data into pandas
df_stru = pd.read_csv("../input/earthquake-magnitude-damage-and-impact/csv_building_structure.csv",
                     index_col = 'building_id')

# Plot styling
plt.style.use('ggplot')

<a id="section-two"></a>
## Data Reference

The dataset contains a single dataframe which, as shown below, has 762k rows and 30 columns. One of the fields, 'damage_grade', is our target variable.

In [None]:
print("Structure data has {} rows and {} columns".format(*df_stru.shape))

The two printouts below show the number of variables by type and the type of each variable in the dataset. There are 20 integer variables and 10 categorical variables. However, the second printout shows that the first three variables are actually reference ids and need to be treated as categorical.

In [None]:
display(df_stru.dtypes.value_counts())
df_stru.dtypes

In [None]:
# Convert data types to categorical
df_stru = df_stru.astype({'district_id': 'object', 'vdcmun_id': 'object', 'ward_id': 'object'})

The output below shows that most of the data is intact; only a handful of columns are not fully populated. As there are only a small number of missing entries, the affected rows are dropped from the dataframe. Only 12 observations are dropped in total.

In [None]:
# View missingness
df_temp = df_stru.isnull().sum().reset_index(name='count')
display(df_temp[df_temp['count'] > 0])

# Drop Rows with missing data
df_stru.dropna(inplace = True)

<a id="section-three"></a>
## Exploratory Analysis

To start, we look at the distribution of the target variable. The occurrence of each grade increases with the classification, such that Grade 5 occurs most frequently in the dataset while Grade 1 appears least frequently. As this is a classification problem, it is worth noting that the classes are imbalanced, with different grades accounting for very different proportions of observations.

In [None]:
plt.figure(figsize=(12,5))
ax = sns.countplot(x='damage_grade', data=df_stru, order = ['Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5'])
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('{x:,.0f}'))
plt.title("Distribution of Damage Grade")
plt.xlabel("Damage Grading")
plt.show()
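
To put numbers on the imbalance, the class shares can be tabulated directly. The sketch below uses a small made-up series (`toy_grades` is hypothetical, not the survey data); on the real data the same call could be applied to `df_stru['damage_grade']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for df_stru['damage_grade']
toy_grades = pd.Series(['Grade 5'] * 5 + ['Grade 4'] * 3 + ['Grade 1'] * 2)

# Share of each class as a percentage of all observations
class_share = toy_grades.value_counts(normalize=True).mul(100).round(1)
print(class_share)
```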

### Categorical Variables

The output below shows the non-numeric fields in our dataset, which pandas has recorded as type object. These variables will be explored in turn to understand how they relate to the damage grade classification.

In [None]:
# view Names
df_stru.select_dtypes(include=object).dtypes # Data types

The first variable to analyse is the district. The plot below shows that most districts are spread relatively evenly across all damage grades, although a few districts are mostly associated with grade 5.

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['district_id','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['district_id', 'damage_grade'])

# Calculate proportion of each grade within a district
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['district_id'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by District")
plt.xlabel("Damage Grade")
plt.ylabel("District")
plt.show()

# Clean up
del df_temp, ax
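
The count-then-normalise pattern used for these plots can also be written as a row-normalised cross-tabulation. This is only a sketch on a hypothetical toy frame (`toy` and its values are made up); `pd.crosstab` with `normalize='index'` is a swapped-in alternative, not the approach used in the cells of this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df_stru[['district_id', 'damage_grade']]
toy = pd.DataFrame({
    'district_id': ['12', '12', '12', '12', '20', '20'],
    'damage_grade': ['Grade 1', 'Grade 5', 'Grade 5', 'Grade 5', 'Grade 1', 'Grade 5'],
})

# Row-normalised cross-tabulation: each row (district) sums to 100%
pct = pd.crosstab(toy['district_id'], toy['damage_grade'], normalize='index') * 100
print(pct)
```

The resulting table has one row per district and one column per grade, which is the shape `sns.heatmap` expects.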

While the plot is very busy, there are certain clusters of 'vdcmun_id' which are associated with different grades. One example is the ids associated with grade 5.

The output below shows that there are 110 unique vdcmun_id values and 945 unique ward_id values in the dataset. It might be prudent to apply some dimensionality reduction when it comes to model building.

In [None]:
print("Unique vdcmun: {}".format(df_stru.vdcmun_id.nunique()))
print("Unique wards: {}".format(df_stru.ward_id.nunique()))
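
One simple way to reduce the number of levels in a high-cardinality column is to pool rarely seen ids into a single bucket. The sketch below illustrates that idea on made-up data; `toy_ids`, the `min_count` threshold and the `'other'` label are all assumptions, and this is not necessarily the reduction applied later in the notebook.

```python
import pandas as pd

# Hypothetical toy ids standing in for a high-cardinality column such as ward_id
toy_ids = pd.Series(['a'] * 6 + ['b'] * 3 + ['c'] + ['d'])

# Keep levels seen at least `min_count` times and pool the rest (threshold is an assumption)
min_count = 2
counts = toy_ids.value_counts()
keep = counts[counts >= min_count].index
toy_grouped = toy_ids.where(toy_ids.isin(keep), other='other')

print(toy_grouped.value_counts())
```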

Starting with the land surface condition, the plot below indicates that the vast majority (80% +) of buildings have a flat surface. Roughly 15% of buildings were on land with a moderate slope and the smallest percentage (less than 5%) had a severe slope.

In [None]:
# Create a plot
plt.figure(figsize=(12,5))
ax = df_stru.land_surface_condition.value_counts(normalize = True).plot(kind = "bar")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1))
plt.title("Land Surface Condition")
plt.ylabel("Percentage of Buildings")
plt.xlabel("Land Condition")
plt.show()

# clean script
del ax

The plot below combines the land surface condition with the damage grade of each building, showing how the grades are distributed within each of the three land surface conditions. The proportions of grades 1 and 2 are much closer together for buildings on flat and steep slopes, whereas the grades are more evenly spaced for buildings on a moderate slope. In all cases, the frequency of a grade increases with its label.

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['land_surface_condition','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['land_surface_condition', 'damage_grade'])

# Calculate proportion of each grade within a land surface condition
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.barplot(data = df_temp, 
                 x = 'land_surface_condition', 
                 y = 'count', 
                 hue = 'damage_grade')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Distribution of Damage Grade by Land Surface Type")
plt.xlabel("Land Surface Condition")
plt.ylabel("Percentage of Buildings")
plt.show()

# Clean up
del df_temp, ax

Moving on to the next categorical variable, the output below shows that there are 5 entries in the foundation type, with the 5th option being a grouping of less popular options. The most common foundation is mud mortar, followed by a big jump down to bamboo as the next most popular option. The chart shows quite a change in the distribution of damage grade across foundation types. Bamboo and cement foundations tend to have an inverse relationship with grade: grade 1 is the most common and the share decreases as the grade increases. Mud mortar foundations have a very different relationship with damage grade, following the proportions of the overall grade variable. Other and RC show a less consistent pattern with damage grade, which could be driven by the smaller sample size (especially for Other). Visually, this variable appears to have some degree of predictive power.

In [None]:
df_stru.foundation_type.value_counts()

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['foundation_type','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['foundation_type', 'damage_grade'])

# Calculate proportion of each grade within a foundation type
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.barplot(data = df_temp, 
                 x = 'foundation_type', 
                 y = 'count', 
                 hue = 'damage_grade')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
plt.xticks(rotation = 50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Distribution of Damage Grade by Foundation Type")
plt.xlabel("Foundation Type")
plt.ylabel("Percentage of Buildings")
plt.show()

# Clean up
del df_temp, ax


According to the roof type variable there are three different types of roof. The vast majority of buildings have the Bamboo-Timber/ light roof, but there are still over 200k buildings with timber heavy and just over 5k with RCC.

In [None]:
plt.figure(figsize = (12,6))
ax = df_stru.roof_type.value_counts().plot(kind = "bar")
plt.title("Number of Buildings by Roof Type")
plt.show()

The spread of buildings across the grades is quite similar for the two bamboo/timber roof types. The RCC roof type is most commonly associated with grade 1 and then grade 2, and is only minimally associated with grades 4 and 5. Given the similarities between the two bamboo/timber roof types, they could potentially be grouped together.

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['roof_type','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['roof_type', 'damage_grade'])

# Calculate proportion of each grade within a roof type
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['roof_type'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by Roof Type")
plt.xlabel("Damage Grade")
plt.ylabel("Roof Type")
plt.show()

# Clean up
del df_temp, ax
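
Grouping the two bamboo/timber roof types could look something like the sketch below. The category spellings and the merged label `'Bamboo/Timber'` are assumptions made for illustration; the real spellings should be checked with `df_stru.roof_type.unique()` before applying a mapping like this.

```python
import pandas as pd

# Hypothetical labels; check df_stru.roof_type.unique() for the real spellings
toy_roof = pd.Series(['Bamboo/Timber-Light roof', 'Bamboo/Timber-Heavy roof', 'RCC/RB/RBC'])

# Collapse the two bamboo/timber roof types into one level; other levels are left unchanged
roof_map = {
    'Bamboo/Timber-Light roof': 'Bamboo/Timber',
    'Bamboo/Timber-Heavy roof': 'Bamboo/Timber',
}
toy_roof_grouped = toy_roof.replace(roof_map)
print(toy_roof_grouped.value_counts())
```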

Multiple types of ground floor material are used across the buildings; the most popular is mud, as can be seen below. RC and brick are a distant second and third but still occur quite frequently. Interestingly, the association between ground floor type and damage grade differs significantly depending on the material used. Other and Timber are spread quite evenly across the grades, brick/stone and mud are more often associated with the higher grades 4 and 5, and RC is most commonly associated with the lower grades. These patterns indicate that this variable might have some predictive power.

In [None]:
df_stru.ground_floor_type.value_counts()

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['ground_floor_type','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['ground_floor_type', 'damage_grade'])

# Calculate proportion of each grade within a ground floor type
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['ground_floor_type'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by Ground Floor Type")
plt.xlabel("Damage Grade")
plt.ylabel("Ground Floor Type")
plt.show()

# Clean up
del df_temp, ax

The two outputs below focus on the 'other_floor_type' variable. Timber/bamboo-mud is a very common flooring material, present in over 63% of homes. There is also an option called not applicable, indicating that some buildings only have a ground floor, which might be worth extracting as a feature. Beyond this, the chart below shows that the timber-based materials share a similar relationship with damage grade, so it might be useful to group them. RCC has a stronger association with the lower grades, while N/A is more mixed, with grades 5 and 1 appearing most often.

In [None]:
df_stru.other_floor_type.value_counts(normalize = True)

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['other_floor_type','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['other_floor_type', 'damage_grade'])

# Calculate proportion of each grade within an other floor type
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.barplot(data = df_temp, 
                 x = 'other_floor_type', 
                 y = 'count', 
                 hue = 'damage_grade')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
plt.xticks(rotation = 50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Distribution of Damage Grade by Other Floor Type")
plt.xlabel("Other Floor Type")
plt.ylabel("Percentage of Buildings")
plt.show()

# Clean up
del df_temp, ax


With regard to the position of the building, there can be an attachment on between 1 and 3 sides or no attachment at all. As with other features explored so far, there is a dominant attribute amongst buildings, or in this case a lack of one: the 'Not attached' position accounts for almost 80% of buildings in the dataset. The association between position and damage grade differs little between buildings with no attachment and buildings attached on one side. Buildings attached on 2 or 3 sides appear to follow different patterns from the rest.

In [None]:
df_stru.position.value_counts(normalize = True)

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['position','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['position', 'damage_grade'])

# Calculate proportion of each grade within a position
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.barplot(data = df_temp, 
                 x = 'position', 
                 y = 'count', 
                 hue = 'damage_grade')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
plt.xticks(rotation = 50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Distribution of Damage Grade by Position")
plt.xlabel("Position")
plt.ylabel("Percentage of Buildings")
plt.show()

# Clean up
del df_temp, ax

The plan configuration represents the general shape of the building, and the output below shows that there are 10 different types of shape. The most popular is rectangular, followed by square and then L-shape. Generally speaking, most shapes are evenly spread across the damage grades, although a few index slightly higher on one grade or another. For example, 'building with central courtyard' and 'others' have a higher association with lower grades, while 'rectangular' and 'square' index higher on grade 5.

In [None]:
df_stru.plan_configuration.value_counts()

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['plan_configuration','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['plan_configuration', 'damage_grade'])

# Calculate proportion of each grade within a plan configuration
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['plan_configuration'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by Plan Configuration")
plt.xlabel("Damage Grade")
plt.ylabel("Plan Configuration")
plt.show()

# Clean up
del df_temp, ax

The condition field appears to record the state of the building after the earthquake. There are several common values, unfortunately all pointing towards some form of damage (only 8% of buildings were not damaged). The chart allows us to interpret the grades with a little more clarity: grade 5 appears to represent the buildings which were damaged the most, and some conditions align solely with particular grades.

In [None]:
df_stru.condition_post_eq.value_counts()

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['condition_post_eq','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['condition_post_eq', 'damage_grade'])

# Calculate proportion of each grade within a condition
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['condition_post_eq'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,5))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by Condition")
plt.xlabel("Damage Grade")
plt.ylabel("Condition")
plt.show()

# Clean up
del df_temp, ax

The final categorical variable groups the scale of damage caused to the building into 4 levels of proposed intervention, ranging from no need for intervention to complete reconstruction. The majority of buildings required reconstruction or some degree of repair. The plot shows a clear pattern: Grade 1 buildings generally require no intervention, grades 2 and 3 require minor and major repair respectively, and grades 4 and 5 require reconstruction.

In [None]:
df_stru.technical_solution_proposed.value_counts()

In [None]:
# Calculate counts
df_temp = df_stru.groupby(['technical_solution_proposed','damage_grade']).size().reset_index(name='count')

# Set Index
df_temp = df_temp.set_index(['technical_solution_proposed', 'damage_grade'])

# Calculate proportion of each grade within a proposed solution
df_temp = (100 * df_temp / df_temp.groupby(level=0).transform('sum')).reset_index()

# Pivot table
df_temp = pd.pivot_table(df_temp, values='count', index=['technical_solution_proposed'], columns='damage_grade')

# Plot chart
plt.figure(figsize=(12,7))
ax = sns.heatmap(data = df_temp, annot = True )
plt.xticks(rotation = 50)
plt.title("Distribution of Damage Grade by Proposed Solution")
plt.xlabel("Damage Grade")
plt.ylabel("Proposed Solution")
plt.show()

# Clean up
del df_temp, ax

### Numerical Variables

This section is focussed on exploring the relationship between damage grade and the numerical variables. As with the categorical subsection, the aim is to identify variables which may have predictive power. We can see from the print out below that there are 17 variables with type numerical.

In [None]:
# view Names
df_stru.select_dtypes(include='number').dtypes # Data types

The output below shows descriptive statistics for the numerical variables in our dataset. The first thing to notice is that all the variables with the 'has' prefix are binary. For the non-binary variables, the mean and median values tend to be in a reasonable range, indicating an absence of extreme values. Building age has an 8 year gap between the mean and median, and the maximum value shows an age recorded as 999 years. This is possible but potentially an outlier, which might need to be addressed in later sections.

In [None]:
# Describe the numerical variables
df_stru.describe().transpose().drop('count', axis=1)
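
If extreme ages such as 999 do turn out to be a problem, one simple option is to cap them at an upper bound. The sketch below shows this on made-up values; `toy_age` is hypothetical and the 100 year cut-off is an assumption chosen for illustration, not a rule taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ages standing in for df_stru['age_building']
toy_age = pd.Series([5, 12, 20, 35, 60, 999])

# Cap values above an upper bound; 100 years is an assumed cut-off
age_capped = toy_age.clip(upper=100)
print(age_capped.tolist())
```

Capping keeps the affected rows in the data, whereas filtering them out would discard the rest of their information.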

The plot below looks at the number of floors a building had before and after the earthquake (the `count_floors_pre_eq` and `count_floors_post_eq` columns).

Focusing on the left plot (before the earthquake), it shows the distribution of floors per building. Across all grades the vast majority of buildings had between 1 and 4 floors; there are, however, a number of outliers in every grade with as many as 9 floors. A pattern emerges moving from grade 1 to grade 5: the core of the distribution appears to grow. Essentially, buildings with more floors before the earthquake tend to have a worse outcome assessment. This might be related to area, proximity to the earthquake's epicentre, or another reason entirely.

Focusing on the right plot (after the earthquake), higher damage grades show a lower distribution of floors than before, while lower damage grades have a very similar distribution. In the following sections, a new feature capturing the difference between these two variables could be interesting.

In [None]:
fig, ax = plt.subplots(1,2,figsize = (16,5), sharey='row')
fig.suptitle("Comparison of Floors per Building Pre & Post Earthquake by Damage Grade")
ax[0].set_title("Before Earthquake")
ax[1].set_title("After Earthquake")
sns.boxplot(data = df_stru, 
            x = "damage_grade", 
            y = "count_floors_pre_eq", 
            order = ['Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5'],
            ax = ax[0])
sns.boxplot(data = df_stru, 
            x = "damage_grade", 
            y = "count_floors_post_eq", 
            order = ['Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5'],
            ax = ax[1])
plt.setp(ax[:], xlabel='Damage Grade')
plt.setp(ax[0], ylabel='Number of Floors')
plt.setp(ax[1], ylabel=None)
plt.show()

The plot below shows the empirical cumulative distribution function (ECDF) of building age, split by damage grade. The descriptive statistics indicated that there might be outliers, so the plot is shown for both the complete dataset and a filtered set of observations (buildings aged 100 years or less). The ECDF highlights a relationship between building age and damage grade: as a building's age increases, it becomes more associated with higher damage grades.

In [None]:
def ecdf(series):
    ''' This function calculates the ECDF for a series of real numbers'''
    
    # Number of data points
    n = len(series)
    
    # Sort output
    x = np.sort(series)
    
    # Sequence proportion
    y = np.arange(1, n+ 1) / n * 100
    
    # Return dataframe
    return pd.DataFrame({'x':x, 'y':y})
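
As a quick sanity check of the calculation, the sketch below repeats it on a hypothetical four-value series. `ecdf_sketch` is a compact copy of the helper, repeated only so that this example runs on its own; the toy ages are made up.

```python
import numpy as np
import pandas as pd

def ecdf_sketch(series):
    """Same calculation as the ecdf helper, repeated so this example is self-contained."""
    n = len(series)
    return pd.DataFrame({'x': np.sort(series), 'y': np.arange(1, n + 1) / n * 100})

# Hypothetical toy ages, deliberately unsorted
toy_ecdf = ecdf_sketch(pd.Series([30, 10, 20, 40]))
print(toy_ecdf)
```

Each row gives a value `x` and the percentage of observations less than or equal to it, so the last row always reaches 100.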

In [None]:
# List each grade
v_grades = ['Grade ' + str(x) for x in range(1,6)]

# Create a plot template
fig, ax = plt.subplots(1,2, figsize = (16,5), sharey='row')
fig.suptitle("ECDF for Building Age")
ax[0].set_title("Complete Dataset")
ax[1].set_title("Filtered Dataset")

# Create a ecdf for each grade
for grade in v_grades:
    df_temp = df_stru[df_stru.damage_grade == grade] # filter for grade only 
    df_temp = ecdf(df_temp.age_building) # Compute ecdf for grade
    sns.lineplot(data = df_temp, x = 'x', y = 'y', label = grade, ax = ax[0]) # Generate line plot
    del df_temp

# Create a ecdf for each grade - Remove outliers
for grade in v_grades:
    df_temp = df_stru[(df_stru.damage_grade == grade) & (df_stru.age_building <= 100)] # filter
    df_temp = ecdf(df_temp.age_building) # Compute ecdf for grade
    sns.lineplot(data = df_temp, x = 'x', y = 'y', label = grade, ax = ax[1]) # Generate line plot
    del df_temp

# show plot
plt.setp(ax[:], xlabel='Building Age (Years)')
plt.setp(ax[0], ylabel='Percentage of Buildings')
plt.setp(ax[1], ylabel=None)
plt.show()

# Clear objects
del grade, v_grades, fig, ax

The plot below shows the distribution of the building's plinth area by grade. Ignoring the height of the bars between grades, the building area has broadly the same distribution in every grade, with a peak at circa 250 sq ft. I would have hoped to see different distributions across the grades, but as no real differences are present, this variable is likely to have limited predictive capacity.

In [None]:
# List of grades
v_grades = ['Grade ' + str(x) for x in range(1,6)]

# Plot structure
fig, ax = plt.subplots(1, 5, figsize=(18,5), sharey="row", sharex="row")
fig.suptitle("Distribution of Plinth Area by Grade")

# Build plot
count = 0
for grade in v_grades:
    df_temp = df_stru[(df_stru.damage_grade == grade) & (df_stru.plinth_area_sq_ft <= 2000)] # filter for grade
    ax[count].set_title(grade)
    sns.histplot(x = df_temp['plinth_area_sq_ft'], label = grade, ax=ax[count]) # distplot is deprecated in recent seaborn
    del df_temp
    count += 1
    
# Plot Aesthetics
plt.setp(ax[:], xlabel='Plinth Area (Square ft)')
plt.setp(ax[0], ylabel='Frequency')
plt.show()

# Clear objects
del grade, v_grades, fig, ax, count

The plot below shows the distribution of building heights before and after the earthquake by damage grade. For damage grades 1-3, the distribution of building heights is essentially unchanged before and after the quake. Damage grade 4 displays a slight reduction in height between the two time periods, while for damage grade 5 the bulk of the distribution drops to zero, indicating that these buildings completely fell down.

In [None]:
fig, ax = plt.subplots(1,2,figsize = (16,5), sharey='row')
fig.suptitle("Comparison of Building Height Pre & Post Earthquake by Damage Grade")
ax[0].set_title("Before Earthquake")
ax[1].set_title("After Earthquake")
sns.boxplot(data = df_stru, 
            x = "damage_grade", 
            y = "height_ft_pre_eq", 
            order = ['Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5'],
            ax = ax[0])
sns.boxplot(data = df_stru, 
            x = "damage_grade", 
            y = "height_ft_post_eq", 
            order = ['Grade 1', 'Grade 2', 'Grade 3', 'Grade 4', 'Grade 5'],
            ax = ax[1])
plt.setp(ax[:], xlabel='Damage Grade')
plt.setp(ax[0], ylabel='Building Height (feet)')
plt.setp(ax[1], ylabel=None)
plt.show()

# clear objects
del fig, ax

The plot below summarises the mean occurrence of each of the superstructure variables. There is a radar plot for each grade, with the distance from the centre indicating how frequently each attribute was present across buildings of that grade. Grades 4 & 5 (those which sustained the most damage) are almost all associated with the mud mortar stone superstructure. This superstructure is also present in grades 1 - 3, but at a much lower concentration. Grade 1 is the most spread out with regard to superstructures, aligning with timber and brick in fair amounts.

In [None]:
# list of binary columns
v_cols = ['has_superstructure_adobe_mud', 'has_superstructure_mud_mortar_stone',
          'has_superstructure_stone_flag', 'has_superstructure_cement_mortar_stone',
          'has_superstructure_mud_mortar_brick', 'has_superstructure_cement_mortar_brick', 
          'has_superstructure_timber', 'has_superstructure_bamboo', 
          'has_superstructure_rc_non_engineered', 'has_superstructure_rc_engineered', 
          'has_superstructure_other']

# New Names of Binary Cols
v_names = {'has_superstructure_adobe_mud':'adobe_mud', 
           'has_superstructure_mud_mortar_stone':'mud_mortar_stone',
          'has_superstructure_stone_flag':'stone_flag', 
           'has_superstructure_cement_mortar_stone':'cement_mortar_stone',
          'has_superstructure_mud_mortar_brick':'mud_mortar_brick', 
           'has_superstructure_cement_mortar_brick':'cement_mortar_brick', 
          'has_superstructure_timber':'timber', 
           'has_superstructure_bamboo':'bamboo', 
          'has_superstructure_rc_non_engineered':'rc_non_engineered', 
           'has_superstructure_rc_engineered':'rc_engineered', 
          'has_superstructure_other':'other'}

# Summarise and rename columns
df_temp = df_stru.groupby('damage_grade')[v_cols].agg('mean').reset_index()
df_temp.rename(columns=v_names, inplace = True)

# Lists to use
v_grades = ['Grade ' + str(x) for x in range(1,6)]
v_colour = ['b', 'g', 'r', 'c', 'm']

# number of variables
v_categories = list(df_temp)[1:]
v_N = len(v_categories)

# Angles
v_angles = [n / float(v_N) * 2 * pi for n in range(v_N)]
v_angles += v_angles[:1]

# Initialise the plot
fig, ax = plt.subplots(2,3, figsize = (16,12), subplot_kw=dict(polar=True))

# Format axis
plt.setp(ax, # X
         xticks = v_angles[:-1], 
         xticklabels = v_categories,
         yticks = [0.10,0.25,0.50,0.75,1.0],
         yticklabels = ["10%","25%","50%","75%","100%"],
         ylim = (0,1))

# Populate plot in a loop
count, row, col = 0, 0, 0
for grade in v_grades:   
    values = df_temp.loc[count].drop('damage_grade').values.flatten().tolist()
    values += values[:1]
    ax[row, col].plot(v_angles, values, linewidth=1, linestyle='solid', label=grade)
    ax[row, col].fill(v_angles, values, v_colour[count], alpha=0.1)
    ax[row, col].set_title(grade)

    # Increment counters
    if count >= 2: # Ensure reference correct row
        row = 1
    
    if col < 2: # Ensure reference correct col
        col += 1
    else:
        col = 0
    
    count += 1 # increase count var

# Drop 6th subplot
fig.delaxes(ax[1,2])

# Clear objects
del v_cols, v_names, df_temp, v_grades, v_colour,\
    v_categories, v_N, v_angles, fig, ax, count, row, col, grade

<a id="section-four"></a>
## Feature Engineering

This is usually the opportunity to identify the really important predictive features to help the models succeed. This section will be kept light and will only focus on the following two features:

1. Net Rooms (floor count after the quake - floor count before the quake; stored as `net_rooms`)
1. Net Height (Building height after the quake - Building height before the quake)

In [None]:
# New fields to add
df_stru['net_rooms'] = df_stru.count_floors_post_eq - df_stru.count_floors_pre_eq
df_stru['net_height'] = df_stru.height_ft_post_eq - df_stru.height_ft_pre_eq

<a id="section-five"></a>
## Model Building

In this section the following predictive models will be built:

1. KNN
1. Random Forest
1. Gradient Boosted Machines

The aim is to identify the model which best predicts the damage grade on new data given the input features. Within this section, a pre-processing pipeline will be set up to prepare the data for training. A test dataset will be extracted from the main data to give us the opportunity to assess how the models perform on completely new data. The models will be evaluated on the test data using the F1 evaluation metric.
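
For a multi-class target, F1 has to be averaged across the classes. The text above does not say which averaging is used, so the sketch below simply contrasts two common choices on made-up labels (`y_true` and `y_pred` are hypothetical).

```python
from sklearn.metrics import f1_score

# Hypothetical encoded labels (0-4 standing in for Grades 1-5)
y_true = [0, 1, 2, 3, 4, 4, 4, 3]
y_pred = [0, 1, 2, 4, 4, 4, 3, 3]

# Macro average treats every grade equally; weighted average weights each grade by its support
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_weighted = f1_score(y_true, y_pred, average='weighted')
print(round(f1_macro, 3), round(f1_weighted, 3))
```

With an imbalanced target the two can differ noticeably, because the weighted average is dominated by the most frequent grades.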

<a id="section-five-subsection-one"></a>
### Creating a Test Dataset

A test set is created to help evaluate the performance of the predictive models developed in this section. It provides the opportunity to assess how a model performs on unseen data. The test set will account for 20% of the observations, chosen at random but stratified on the target variable to ensure that the class proportions are the same in the training and testing data.

The table below shows that this has been achieved: the training and test dataframes have very similar proportions of observations for each damage grade.

In [None]:
# Create training and testing data
x_train, x_test, y_train, y_test = train_test_split(df_stru.drop('damage_grade', axis = 1), 
                                                    df_stru['damage_grade'],
                                                    test_size = 0.2, 
                                                    random_state = 1989, 
                                                    stratify = df_stru['damage_grade'],
                                                    shuffle=True)


# Visualise proportions on train and test, aligned on the damage grade
pd.concat([y_train.value_counts(normalize = True).rename("train"),
           y_test.value_counts(normalize = True).rename("test")],
          axis = 1).sort_index()
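
As a minimal sketch of what stratification does, the example below splits a made-up, imbalanced target and compares the class shares in each part. The data and the helper `class_shares` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (made-up data): 60% class 0, 30% class 1, 10% class 2
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1989, stratify=y
)

def class_shares(labels):
    """Return the share of each class as a dict {class: proportion}."""
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): float(n) / len(labels) for c, n in zip(classes, counts)}

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("train:", train_shares)
print("test: ", test_shares)
```

Without `stratify`, a rare class could by chance be over- or under-represented in the test set, which would make the test score harder to interpret.
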

<a id="section-five-subsection-two"></a>
### Data Processing Pipelines

This subsection is focused on preparing the data to give the models the best opportunity of identifying the signal in the data. The following steps will be applied:

1. Label encode the target variable
1. Ensure variables have the correct data types
1. Convert nominal variables to numeric with one-hot encoding
1. Restrain outliers
1. Centre and scale all numerical variables
1. Remove variables with either no or minimal variance

Firstly we encode the objects y_train and y_test to be value labels ranging from 0 to 4, corresponding to Grades 1 - 5.

In [None]:
# ------------ Target Processing ------------

# Target Variable Transformer
preprocessor_tar = LabelEncoder()
y_train = preprocessor_tar.fit_transform(y_train) # learn the label mapping on train
y_test = preprocessor_tar.transform(y_test) # reuse the same mapping on test

# Visualise Counts
print(pd.Series(y_train).value_counts(normalize = True).sort_index())

# Clean objects
del preprocessor_tar
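
The sketch below illustrates the encoding step on a handful of made-up labels: the mapping is learned once from the training labels and then reused for the test labels. It also shows `inverse_transform`, which recovers the original grade names and is one reason to keep the fitted encoder around rather than deleting it.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
train_labels = ["Grade 3", "Grade 1", "Grade 5", "Grade 2", "Grade 4", "Grade 1"]
test_labels = ["Grade 5", "Grade 2"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # learn mapping from training labels
test_codes = encoder.transform(test_labels)        # reuse the mapping, do not refit

# Map the integer codes back to the original grade names
decoded = encoder.inverse_transform(test_codes)

print(list(encoder.classes_))
print(list(test_codes), list(decoded))
```
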

The second step is to create a pipeline of processing steps which can be applied to the predictor features. These are all combined into a single pipe which easily allows for the same processing steps to be applied to different dataframes.

The printout below shows that we have significantly increased the dimensionality of our dataset as a result of one-hot encoding the categorical variables. I did not venture into it this time, but it would have been sensible to apply some dimensionality reduction to the columns created by one-hot encoding.

In [None]:
# Print x train shape before processing
print("Before preprocessing there were {} rows and {} columns".format(*x_train.shape))

# ------------ Predictor Processing ------------
# Identify columns
fts_cvt_obj = ['district_id', 'vdcmun_id', 'ward_id']
fts_outlier = ['age_building']
fts_cat = df_stru.drop(fts_cvt_obj, axis = 1).select_dtypes(include=['object']).drop('damage_grade', axis = 1).columns
fts_num = df_stru.select_dtypes(np.number).columns

# Convert to object Transformer
def covert_to_object(x):
    '''Converts a column to object'''
    return pd.DataFrame(x).astype(object)
trans_to_object = Pipeline(steps = [('convert_to_object', FunctionTransformer(covert_to_object))])

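# Outlier Restriction
# Note: quantile_range is expressed in percentiles (0-100), so (0, 0.9) spans the
# 0th to 0.9th percentile; use (0, 90) if the 90th percentile was intended
trans_outlier = Pipeline(steps = [('Outlier_scaler', RobustScaler(quantile_range = (0,0.9)))])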

# Categorical Transformer
trans_cat = Pipeline(steps = [('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Numerical Transformer
trans_num = Pipeline(steps = [('scaler', StandardScaler()),
                              ('MinMax', MinMaxScaler())])

# Zero or Near Zero variance
trans_nzv = Pipeline(steps = [('nzv', VarianceThreshold(threshold = 0.01))])

# Create a single Preprocessing step for predictors
preprocessor_preds = ColumnTransformer(
    transformers=[
        ('convert_to_object', trans_to_object, fts_cvt_obj), # Convert data types
        ('Outlier', trans_outlier, fts_outlier), # Outlier treatment 
        ('num', trans_num, fts_num), # Centre and scale
        ('cat', trans_cat, fts_cat), # One Hot encode
        ('nzv', trans_nzv,[]) # Variance filter (no columns listed, so currently a no-op)
    ])
      
# Fit the transformations on train, then apply the fitted transformations to test
x_train = preprocessor_preds.fit_transform(x_train)
x_test = preprocessor_preds.transform(x_test)

# Print x train shape after processing
print("After preprocessing there are {} rows and {} columns".format(*x_train.shape))

# Clean objects
del fts_cvt_obj, fts_outlier, fts_cat, fts_num, covert_to_object, trans_to_object, trans_outlier, trans_cat, trans_num, trans_nzv
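
The reason for fitting the preprocessor on the training data and only transforming the test data can be sketched on a toy example. The two small dataframes and their values below are made up. The point is that a one-hot encoder refitted on the test data may produce a different set of columns from the one the model was trained on.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: the test set does not contain every category seen in train
train = pd.DataFrame({"roof_type": ["bamboo", "rcc", "timber", "bamboo"],
                      "age_building": [10, 25, 40, 5]})
test = pd.DataFrame({"roof_type": ["rcc", "bamboo"],
                     "age_building": [15, 30]})

pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age_building"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["roof_type"]),
])

# Fit on train, then reuse the fitted preprocessor on test
n_train_cols = pre.fit_transform(train).shape[1]
n_test_cols = pre.transform(test).shape[1]

# Refitting an unfitted copy on test yields a different column layout
n_refit_cols = clone(pre).fit_transform(test).shape[1]

print(n_train_cols, n_test_cols, n_refit_cols)
```

Here the fitted preprocessor gives train and test the same width (one scaled column plus three one-hot columns), whereas the refitted copy only sees two categories and so produces one column fewer.
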

<a id="section-five-subsection-three"></a>
### Cross Validation Strategy

Five-fold cross validation will be used as the strategy. It guards against overfitting to a single split and allows us to make fair comparisons between competing models. The approach splits the training data into 5 random partitions and builds 5 models. Each model is trained on 4 partitions and validated on the remaining one, so every partition is used for validation exactly once.

In [None]:
# Store the Kfold object
kfold = KFold(n_splits=5, random_state=1989, shuffle = True)
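
As a small sketch of how the folds are formed, the example below applies the same `KFold` settings to 20 made-up observations and records the size of each split. The array `X` is a stand-in for the real training data.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 made-up observations
kfold = KFold(n_splits=5, shuffle=True, random_state=1989)

fold_sizes = []  # (train size, validation size) per fold
seen = []        # every index that has appeared in a validation fold
for train_idx, val_idx in kfold.split(X):
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen.extend(val_idx.tolist())

print(fold_sizes)
print(sorted(seen) == list(range(20)))
```

Each observation lands in a validation fold exactly once. Note that `KFold` does not stratify on the target; `StratifiedKFold` is the scikit-learn variant that keeps the class proportions similar in every fold.
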

<a id="section-five-subsection-four"></a>
### Model Building and Comparison

The three models are built across the same 5 folds of data and the output below shows how each model performs using the F1 score evaluation metric (micro-averaged, `f1_micro`). The mean and standard deviation present a summarised evaluation, indicating how each model performed generally across all of the folds.

* KNN model: It performed the worst, achieving an F1 score of 79.2%. The low standard deviation indicates that similar performance was achieved across the 5 folds.
* Random Forest model: This model achieved an F1 score of 90%, just over a 10 percentage point increase relative to the simple KNN model. The standard deviation for this model also indicates similar performance across the folds.
* Gradient Boosted Machine model: This model achieved an F1 score of 88.3%, which underperforms the random forest but is well ahead of the KNN model.

In [None]:
# List of classification models
classifiers = [('KNN', KNeighborsClassifier(3)),
               ('RF', RandomForestClassifier()),
               ('GBM', GradientBoostingClassifier())]

# Evaluate each model
results = []
names = []
for name, model in classifiers:
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='f1_micro')
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

In [None]:
# Summarise scores
pd.DataFrame(np.transpose(results), columns = names).reset_index()
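
The scores above use `scoring='f1_micro'`. As a minimal sketch on made-up predictions, the example below shows that for a single-label multi-class problem the micro-averaged F1 score is the same number as overall accuracy, whereas the macro average weights every class equally and can differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels for five classes (0 to 4)
y_true = [0, 1, 2, 2, 3, 4, 4, 1]
y_pred = [0, 1, 2, 3, 3, 4, 2, 1]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

print("micro F1: %.3f, accuracy: %.3f, macro F1: %.3f" % (micro, acc, macro))
```

With imbalanced damage grades, reporting the macro average alongside the micro average would show whether the rarer grades are predicted as well as the common ones.
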

<a id="section-six"></a>
## Model Evaluation on Test

The random forest model will be the only model progressed. It will be rebuilt on the full training data and then used to predict the damage grade of the test data. The F1 score will again be used to evaluate how well the model performs in predicting the grades.

The F1 score in the output below shows that the model performs about as well on the test data as it did in cross validation on the training data. The confusion matrix shows that the model is least accurate when predicting Grade 3, and this would likely be part of my focus in a different scenario to help increase the scores.

In [None]:
# Build the random forest on the full training data
rf = RandomForestClassifier() # instance
rf.fit(x_train, y_train) # fit model
y_pred = rf.predict(x_test) # predict on test

# Calculate confusion matrix, normalised so each row (true grade) sums to 1
con_mat = confusion_matrix(y_test, y_pred)
con_mat = con_mat / con_mat.sum(axis=1, keepdims=True)

# Evaluate model
print("F1 Score: %f " % f1_score(y_test, y_pred,average='micro'))

# Plot Model
plt.figure(figsize = (12,6))
ax = sns.heatmap(con_mat, annot = True)
plt.title("Confusion Matrix")
plt.show()

# Clear objects
del rf, y_pred, con_mat, ax
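
The row normalisation of the confusion matrix is easy to get subtly wrong, so here is a small worked sketch on made-up counts. Dividing by `sum(axis=1, keepdims=True)` keeps the row totals as a column vector, so each row is divided by its own total and the diagonal becomes the per-class recall.

```python
import numpy as np

# Made-up confusion matrix counts: rows are true classes, columns are predictions
counts = np.array([[8, 2, 0],
                   [1, 3, 1],
                   [0, 5, 15]])

# keepdims=True gives shape (3, 1), so broadcasting divides row i by row total i
row_norm = counts / counts.sum(axis=1, keepdims=True)

# Without keepdims the totals have shape (3,), so column j is divided by row total j
col_wise = counts / counts.sum(axis=1)

print(np.round(row_norm, 2))
print(row_norm.sum(axis=1))  # each row sums to 1
print(np.diag(row_norm))     # per-class recall
```

scikit-learn's `confusion_matrix` also accepts `normalize='true'`, which performs the same row normalisation directly.
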

<a id="section-seven"></a>
## Conclusion

This task was used as an opportunity to explore Python by building a supervised learning model. Many aspects of the analysis could have been extended, such as more extensive feature engineering, dimensionality reduction and hyperparameter tuning. That said, even without these an F1 score of 89.4% isn't that bad. If you have made it this far, I hope you enjoyed reading it; I certainly enjoyed learning.