## $\text{Table of Contents}$

* [1. Introduction](#1)
* [2. Missing Values And Preprocessing](#2)
* [3. Exploratory Data Analysis (EDA)](#3)
* [4. Clustering](#4)


<a id="1"></a>
# $\text{Introduction}$

![](https://cdn.britannica.com/34/127134-050-49EC55CD/Building-foundation-earthquake-Japan-Kobe-January-1995.jpg)

**In this kernel we explore earthquake records.
We visualize the data to gain insight into what affects earthquakes. Is there a connection between time periods and earthquakes? Is there a connection between an earthquake's magnitude and its depth? These are some of the key questions, and we also look at the mean magnitude and depth across the years, together with their standard deviation.
Why the standard deviation? The answer may not be intuitive.
A year can be considered 'special', or worth looking into, if it had a high mean magnitude and a low standard deviation, which means that the earthquakes recorded throughout that year were consistently strong.
The opposite case is interesting as well: a low mean magnitude with a high standard deviation.**
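
To make the "high mean, low spread" idea concrete, here is a minimal sketch on a small made-up table. The values, the `toy` name and the choice of the median year as the reference point are illustrative assumptions, not part of the dataset or of the analysis below.

```python
import pandas as pd

# Toy records (made-up values, for illustration only).
toy = pd.DataFrame({
    "Year":      [2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002],
    "Magnitude": [7.0,  7.1,  6.9,  5.0,  5.1,  7.0,  6.0,  6.3,  6.6],
})

# Mean and sample standard deviation of magnitude for each year.
yearly = toy.groupby("Year")["Magnitude"].agg(["mean", "std"])

# Assumption: a year is "high" or "low" relative to the median year.
mean_ref = yearly["mean"].median()
std_ref = yearly["std"].median()

# Consistently strong years: high mean magnitude with low spread.
consistently_strong = yearly.index[
    (yearly["mean"] > mean_ref) & (yearly["std"] < std_ref)
].tolist()

# Erratic years: low mean magnitude with high spread.
erratic = yearly.index[
    (yearly["mean"] < mean_ref) & (yearly["std"] > std_ref)
].tolist()

print(yearly)
print("consistently strong:", consistently_strong)
print("erratic:", erratic)
```

Any other reference point (for example the overall mean, or a fixed quantile) would work the same way; the median is used here only because it keeps the toy example simple.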

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')


In [None]:
e_data = pd.read_csv('/kaggle/input/earthquake-database/database.csv')
e_data.head()

### A quick look at what our data looks like and the features we will work on

<a id="2"></a>
# $\text{Missing Values And Preprocessing}$

In [None]:
missing = e_data.isna().sum()
missing = missing[missing>0]
missing = missing.reset_index()
tr = go.Bar(x=missing['index'],y=missing[0],name='Missing')
tr2 = go.Bar(x=missing['index'],y=[e_data.shape[0]]*len(missing['index']),name='Total')

data = [tr2,tr]
fig = go.Figure(data=data,layout={'title':'Proportion Of Missing Values In Our Dataset','barmode':'overlay'})
fig.show()

### As the bar chart above shows, half of our features have missing values, and all of them except the magnitude type are missing more than 70% of their data. Another important point is that the missing features describe the behavior of other features. For example, 'Depth Error' tells us the degree of error in the estimated depth measurement. We shall replace the small number of missing magnitude type values with the mode of that feature. Some of the other features could be imputed using regression or a nearest-neighbor approach, but we will leave them out. Such imputation may be mathematically justified by high correlation, but remember: correlation does not imply causation, and when looking at a natural disaster like an earthquake I am interested in elements that can highlight causation rather than correlation.
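
The strategy described above can be sketched on a tiny made-up frame. The column values are invented, and the 0.7 cutoff is an assumption that mirrors the 70% figure mentioned in the text; the notebook itself simply drops every column that still has gaps after the mode imputation.

```python
import numpy as np
import pandas as pd

# Made-up frame: one categorical column with a few gaps, one mostly empty column.
toy = pd.DataFrame({
    "Magnitude Type": ["MW", "MW", np.nan, "MB", "MW", np.nan],
    "Depth Error":    [np.nan, np.nan, 1.2, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Mode imputation: fill the few categorical gaps with the most common label.
mode_value = toy["Magnitude Type"].mode()[0]
toy["Magnitude Type"] = toy["Magnitude Type"].fillna(mode_value)

# Assumed cutoff: columns missing more than 70% of their values are dropped.
mostly_empty = missing_share[missing_share > 0.7].index.tolist()
toy = toy.drop(columns=mostly_empty)

print(missing_share)
print("dropped:", mostly_empty)
```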



In [None]:
#Tackle Missing Values
e_data['Magnitude Type'] = e_data['Magnitude Type'].fillna(e_data['Magnitude Type'].mode()[0])

missing = e_data.isna().sum()
missing = missing[missing>0]
missing = missing.reset_index()
not_missing = [col for col in e_data.columns if col not in missing['index'].values]

In [None]:
def get_day_of_week(sir):
    return sir.weekday()
def get_month(sir):
    return sir.month
def get_year(sir):
    return sir.year


e_data = e_data[not_missing].copy()  # copy so the column assignments below do not act on a slice
e_data.Date = pd.to_datetime(e_data.Date)

e_data['Day_of_Week'] = e_data.Date.apply(get_day_of_week)
e_data['Month'] = e_data.Date.apply(get_month)
e_data['Year'] = e_data.Date.apply(get_year)


<a id="3"></a>
# $\text{Exploratory Data Analysis (EDA)}$

In [None]:
Info = e_data.describe()
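# numeric_only=True keeps kurt()/skew() from failing on the non-numeric columns
Info.loc['kurt'] = e_data.kurt(numeric_only=True)
Info.loc['skew'] = e_data.skew(numeric_only=True)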
Info

### The skewness values show, even before we plot the distributions, that the magnitude and depth features are positively skewed, and by a fair amount. We can already assume that the skewness is caused by 'black swans'. As with any natural disaster, we do not know when those swans, or outliers, will occur, but some earthquakes have significantly higher magnitudes and greater depths than the average earthquake, which takes fairly low values of the same features.
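
As a rough illustration of how a handful of extreme events produces positive skew, the sketch below compares a symmetric set of made-up magnitudes with the same set plus two very strong events. The numbers are invented for the example.

```python
import pandas as pd

# Made-up magnitudes: a symmetric bulk of moderate events.
typical = pd.Series([5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1] * 10)

# The same bulk plus two rare, very strong events (the "black swans").
with_extremes = pd.concat([typical, pd.Series([8.8, 9.1])], ignore_index=True)

skew_typical = typical.skew()
skew_with_extremes = with_extremes.skew()

print(f"skew without extremes: {skew_typical:.3f}")
print(f"skew with extremes:    {skew_with_extremes:.3f}")
```

Two values out of 72 are enough to pull the skewness from roughly zero to a clearly positive number, which is the pattern the summary table shows for magnitude and depth.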

In [None]:
tmp = e_data.groupby(by='Year').count()
tmp = tmp.reset_index()[['Year','Date']]
tmp
fig = ex.line(tmp,x='Year',y='Date')
fig.update_layout(
    title= 'Number Of Earthquakes Over The Years 1965-2016',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.show()

### From the 1960s until the last few years there is an overall upward trend in the number of earthquake events per year. In the last 3 years of our dataset the trend suddenly drops.

In [None]:
# Select the column before aggregating so non-numeric columns do not break mean()
tmp = e_data.groupby(by='Year')['Magnitude'].mean()
tmp = tmp.reset_index()
tmp
fig = ex.line(tmp,x='Year',y='Magnitude')
fig.update_layout(
    title= 'Mean Earthquake Magnitude Over The Years 1965-2016',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.show()

### We see that in the 1960s the mean earthquake magnitude was very high in comparison to later years, where the mean magnitude stays around the same value.

In [None]:
# Select the column before aggregating so non-numeric columns do not break std()
tmp = e_data.groupby(by='Year')['Magnitude'].std()
tmp = tmp.reset_index()
tmp
fig = ex.line(tmp,x='Year',y='Magnitude')
fig.update_layout(
    title= 'Earthquake Magnitude Standard Deviation Over The Years 1965-2016, Mean SD Shown With Red Line',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.add_shape(
        # Line Horizontal
            type="line",
            x0=1965,
            y0=tmp['Magnitude'].mean(),
            x1=2016,
            y1=tmp['Magnitude'].mean(),
            line=dict(
                color="red",
                width=2.5,
                dash="dashdot",
            ),
    )
fig.show()

### The standard deviation of earthquake magnitude over the years gives us the following insight. As explained at the beginning, a year in which the spread of earthquake magnitudes is large can imply that something unusual happened that year and is worth investigating. In other words, why did the magnitude vary so much during that year in comparison to an average year, which has a certain number of small earthquakes with little to no peculiar behavior?

In [None]:
tmp = e_data[['Year','Day_of_Week']]
tmp=tmp.groupby(by='Year').agg(lambda x:x.value_counts().index[0])
tmp = tmp.reset_index()
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
            'Sunday']
days = {k:days[k] for k in range(0,7)}
tmp['Day_of_Week'] = tmp['Day_of_Week'].replace(days)
fig = ex.pie(tmp,names='Day_of_Week',title='Proportion Of Years In Which Each Day Of Week Had The Most Earthquakes')
fig.show()

### Saturday and Monday are most often the day of the week with the most recorded earthquakes in a year.
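
The pie chart above counts each year's single most frequent weekday, which is not the same as the share of all earthquakes per weekday. The sketch below shows the difference on made-up data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up events: weekday codes follow datetime.weekday(), where 0 is Monday.
toy = pd.DataFrame({
    "Year":        [2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "Day_of_Week": [0,    0,    5,    5,    5,    5,    2],
})

# What the pie chart counts: the most frequent weekday within each year.
modal_day = toy.groupby("Year")["Day_of_Week"].agg(lambda s: s.mode().iloc[0])

# Alternative view: the share of all events that fall on each weekday.
overall_share = toy["Day_of_Week"].value_counts(normalize=True)

print(modal_day)
print(overall_share)
```

Here Monday and Saturday each "win" one year, yet Saturday alone accounts for more than half of all events. Both views are valid; they just answer different questions.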

In [None]:
ex.pie(e_data,names='Status',title='Proportion Of Different Earthquake Statuses',hole=.3)

In [None]:
ex.pie(e_data,names='Source',title='Proportion Of Different Earthquake Sources',hole=.3)

### 88 percent of our records come from the source labeled 'US'

In [None]:
pivot_table = e_data.pivot_table(index='Year',columns='Month',values='Magnitude')
sns.clustermap(pivot_table,annot=True,cmap='coolwarm',col_cluster=False,figsize=(20,13))

### The idea behind the heatmap above was to find out whether any years stand out in their average magnitude for a certain month. In other words, I wanted to find out whether stronger earthquakes are more common in, say, February of every fifth year, or whether some other pattern might reveal cyclic behavior in earthquakes. I found that all the years behave similarly except 1966, 1968 and 1971, which had a significantly higher average earthquake magnitude.
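
The same year-by-month table can also be checked numerically instead of by eye. The following is a minimal sketch on made-up values; the rule "more than one standard deviation above the average year" is an assumed threshold chosen for the example, not something the heatmap itself applies.

```python
import pandas as pd

# Made-up records: one strong year (1966) and three ordinary ones.
toy = pd.DataFrame({
    "Year":      [1966, 1966, 1966, 1990, 1990, 1991, 1991, 1992, 1992],
    "Month":     [1,    1,    2,    1,    2,    1,    2,    1,    2],
    "Magnitude": [6.4,  6.6,  6.7,  5.8,  5.9,  5.9,  5.8,  5.8,  5.8],
})

# Mean magnitude for every (year, month) cell, as in the heatmap.
pivot = toy.pivot_table(index="Year", columns="Month", values="Magnitude")

# Average over months, then standardize the yearly averages.
year_mean = pivot.mean(axis=1)
z_score = (year_mean - year_mean.mean()) / year_mean.std()

# Assumed rule: a year stands out if it lies more than 1 SD above the average year.
standout = z_score[z_score > 1].index.tolist()

print(pivot)
print("standout years:", standout)
```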

In [None]:
plt.figure(figsize=(20,11))
ax = sns.histplot(e_data['Latitude'],label='Latitude',kde=True,stat='density')  # histplot replaces the deprecated distplot
ax.set_title('Distribution Of Earthquake Latitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Latitude',fontsize=16)

plt.show()

### The latitudes in our dataset follow a multimodal distribution, a trimodal one to be precise. I assume we can use clustering to separate the 3 groups of latitudes and try to understand why the earthquakes in each cluster share similar latitudes and what unites them.

In [None]:
plt.figure(figsize=(20,11))
ax = sns.histplot(e_data['Longitude'],label='Longitude',color='teal',kde=True,stat='density')  # histplot replaces the deprecated distplot
ax.set_title('Distribution Of Earthquake Longitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Longitude',fontsize=16)

plt.show()

### The longitude follows a trimodal distribution similar to the latitude, so we will perform clustering on those features at a later stage of this kernel, hopefully gaining some insight into different groups of earthquake sites.
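
To give a feel for what clustering coordinates into three groups means, here is a plain NumPy sketch of Lloyd's algorithm (the basic k-means loop). It is not scikit-learn's `KMeans`, the coordinates are made up, and the helper name `kmeans` and the choice of initial centers are assumptions for the example.

```python
import numpy as np

def kmeans(points, init_centers, n_iter=20):
    """Plain Lloyd's algorithm: assign points to the nearest center, then update centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Euclidean distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers

# Made-up (latitude, longitude) pairs forming three loose groups.
coords = np.array([
    [-20.0, -175.0], [-18.5, -173.0], [-21.0, -176.5],
    [36.0, 140.0], [38.0, 142.0], [35.0, 139.0],
    [-33.0, -71.0], [-30.0, -72.0], [-35.0, -70.5],
])

# Assumption: start from one point of each visible group.
labels, centers = kmeans(coords, init_centers=coords[[0, 3, 6]])

print(labels)
print(centers.round(2))
```

One caveat of this simplification: plain Euclidean distance on raw degrees ignores that longitude wraps around at ±180, so sites on either side of the date line look far apart even when they are close on the globe.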

In [None]:
ex.pie(e_data,names='Type',title='Proportion Of Different Earthquake Types In Our Dataset')

### We see that there are other event types in our data, such as explosions, but they make up less than 1 percent of the records, so we will treat them as extreme outliers, or 'black swans', the term we already use for extreme unpredictable outliers.

In [None]:
plt.figure(figsize=(20,11))
ax = sns.histplot(e_data['Depth'],label='Depth',color='red',kde=True,stat='density')  # histplot replaces the deprecated distplot
ax.set_title('Distribution Of Earthquake Depths',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Depth',fontsize=16)

plt.show()

### Looking at the distribution of earthquake depths, we see that most earthquakes follow a bimodal distribution around depth 60, but we also have some records of earthquakes at depths of 600-700. These are rare and we treat them as black swans, so we will ignore those values in the next steps in order to see the true distribution without an extremely long tail.
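
Before trimming a long tail it helps to check how much data the cut removes and how it changes the summary statistics. The sketch below does this on made-up depths, using the same cutoff of 300 that the outlier-removal cell further down applies; the depth values themselves are invented.

```python
import pandas as pd

# Made-up depths: a shallow bulk plus two very deep events.
depths = pd.Series([10, 10, 12, 30, 33, 35, 40, 55, 60, 620, 650], name="Depth")

cutoff = 300  # same threshold as the notebook's outlier-removal step

# Share of records that the cut would remove.
tail_share = (depths >= cutoff).mean()

# Keep only the records below the cutoff.
trimmed = depths[depths < cutoff]

print(f"removed share: {tail_share:.1%}")
print(f"mean before/after:   {depths.mean():.1f} / {trimmed.mean():.1f}")
print(f"median before/after: {depths.median():.1f} / {trimmed.median():.1f}")
```

The mean shifts a lot while the median barely moves, which is typical when a few extreme values sit far from the bulk of the data.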

In [None]:
plt.figure(figsize=(20,11))
ax = sns.histplot(e_data['Magnitude'],label='Magnitude',color='teal',kde=True,stat='density')  # histplot replaces the deprecated distplot
ax.set_title('Distribution Of Earthquake Magnitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Magnitude',fontsize=16)

plt.show()

In [None]:
#Outlier Removal
e_data = e_data[e_data['Depth'] <300]

In [None]:
tmp = e_data.copy()
tmp = tmp[tmp['Magnitude']<=6.1]
tmp = tmp[tmp['Depth']<60]

sns.jointplot(data=tmp,x='Depth',y='Magnitude',kind='kde',cmap='coolwarm',height=12,levels=30)

### So we can confirm that an earthquake is most likely to occur at a depth of 10 or 30 with a magnitude of 5.5. Looking at the marginal distribution of each parameter, we see that, as concluded earlier, the depth follows a bimodal distribution, while the magnitude follows a higher-order multimodal distribution. This suggests that the magnitude scale is rounded rather than continuous. It behaves more like a 'discrete' feature, which possibly should be interpreted as an ordinal scale rather than as a measured value.
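
Whether a feature is effectively rounded can be checked directly. The following sketch, on made-up magnitudes, measures how many values already sit on a 0.1 grid and how few distinct values there are; the 0.1 step is an assumption about the rounding, not something stated in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up magnitudes reported with one decimal place.
magnitudes = pd.Series([5.5, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.8, 6.0, 6.1])

# Share of values that are unchanged by rounding to one decimal (assumed 0.1 grid).
on_grid = np.isclose(magnitudes, magnitudes.round(1)).mean()

# Few distinct values relative to the number of records hints at a discrete scale.
n_distinct = magnitudes.nunique()

# Number of records per reported value; each spike is one mode of the KDE.
counts = magnitudes.value_counts().sort_index()

print(f"on 0.1 grid: {on_grid:.0%}, distinct values: {n_distinct}")
print(counts)
```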

<a id="4"></a>
# $\text{Clustering}$

In [None]:
# Clustering
from sklearn.cluster import KMeans

# Use a separate name so the KMeans class is not overwritten by its instance
kmeans = KMeans(n_clusters=3)

c_data = e_data.copy()
kmeans.fit(e_data[['Magnitude','Depth']])
c_data['Cluster'] = kmeans.labels_

fig = ex.scatter_3d(c_data,x='Longitude',y='Depth',z='Magnitude',color='Cluster',height=900)
fig.show()

In [None]:
name = 'eye = (x:2, y:2, z:0.4)'  # label kept in sync with the camera settings below
camera = dict(
    eye=dict(x=2, y=2, z=0.4)
)

fig.update_layout(scene_camera=camera)

fig.show()