<h1 style="background-color:gray;font-family:newtimeroman;font-size:250%;color:whitesmoke;text-align:center;border-radius: 15px 50px;">Table of Contents</h1>


* [1. Introduction](#1)
* [2. Data Preprocessing](#2)
* [3. Exploratory Data Analysis (EDA)](#3)
* [4. Clustering](#4)


<a id="1"></a>
<h1 style="background-color:gray;font-family:newtimeroman;font-size:250%;color:whitesmoke;text-align:center;border-radius: 15px 50px;">Introduction</h1>


![](https://cdn.britannica.com/34/127134-050-49EC55CD/Building-foundation-earthquake-Japan-Kobe-January-1995.jpg)

**In this kernel we explore earthquake records. We try to visualize the data in a way that gives some insight into what affects earthquakes. Is there a connection between time periods and earthquakes? Is there a connection between an earthquake's magnitude and its depth? We visualize these key points, along with the yearly mean and standard deviation of magnitude and depth. Why the standard deviation? The answer may not be intuitive. A year is 'special', or worth looking into, if it had a high mean magnitude and a low standard deviation, which means that the earthquakes recorded throughout that year were consistently strong. The opposite case, a low mean magnitude with a high standard deviation, is also interesting.**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.subplots import make_subplots
pyo.init_notebook_mode()
sns.set_style('darkgrid')


In [None]:
e_data = pd.read_csv('/kaggle/input/earthquake-database/database.csv')
e_data.head()

### A quick look at what our data looks like and the features we will work on

<a id="2"></a>
<h1 style="background-color:gray;font-family:newtimeroman;font-size:250%;color:whitesmoke;text-align:center;border-radius: 15px 50px;">Data Preprocessing</h1>


In [None]:
missing = e_data.isna().sum()
missing = missing[missing>0]
missing = missing.reset_index()
tr = go.Bar(x=missing['index'],y=missing[0],name='Missing')
tr2 = go.Bar(x=missing['index'],y=[e_data.shape[0]]*len(missing['index']),name='Total')

data = [tr2,tr]
fig = go.Figure(data=data,layout={'title':'Proportion Of Missing Values In Our Dataset','barmode':'overlay'})
fig.show()

### As the bar chart above shows, about half of our features have missing values, and all of those features except the magnitude type are missing more than 70% of their data. Another important point is that the missing features describe the behavior of other features; for example, 'Depth Error' describes the estimated error in the depth measurement. We shall use the mode of the magnitude type feature to replace its small number of missing values. Some of the other features could be imputed using regression or a nearest-neighbor approach, but we will drop them instead: such imputation may be mathematically sound thanks to high correlation, but remember that correlation does not imply causation, and when looking at a natural disaster like an earthquake, I am interested in features that can highlight causation rather than correlation.
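
The handling described above can be sketched on a small toy frame. The column names mirror the dataset, but the values are made up for illustration, and `toy`, `missing_share`, and `cleaned` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for e_data; the values are invented for illustration.
toy = pd.DataFrame({
    'Magnitude Type': ['MW', 'MW', None, 'MB', 'MW'],
    'Depth Error': [np.nan, np.nan, np.nan, 1.2, np.nan],
    'Magnitude': [5.5, 6.0, 5.8, 5.6, 7.1],
})

# Fraction of missing values per column, keeping only columns that have gaps.
missing_share = toy.isna().mean()
missing_share = missing_share[missing_share > 0]

# Fill the few categorical gaps with the most frequent category (the mode).
mode_value = toy['Magnitude Type'].mode()[0]
toy['Magnitude Type'] = toy['Magnitude Type'].fillna(mode_value)

# Keep only the columns that are fully populated after the fill.
keep = [col for col in toy.columns if not toy[col].isna().any()]
cleaned = toy[keep]

print(missing_share)
print(cleaned)
```

Working with the fraction of missing values, rather than the raw count, makes the "more than 70%" statement directly readable from `missing_share`.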



In [None]:
#Tackle Missing Values
e_data['Magnitude Type'] = e_data['Magnitude Type'].fillna(e_data['Magnitude Type'].mode()[0])

missing = e_data.isna().sum()
missing = missing[missing>0]
missing = missing.reset_index()
not_missing = [col for col in e_data.columns if col not in missing['index'].values]

In [None]:
def get_day_of_week(sir):
    return sir.weekday()
def get_month(sir):
    return sir.month
def get_year(sir):
    return sir.year


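# Keep only the fully populated columns; .copy() avoids chained-assignment warnings below
e_data = e_data[not_missing].copy()
e_data['Date'] = pd.to_datetime(e_data['Date'])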

e_data['Day_of_Week'] = e_data.Date.apply(get_day_of_week)
e_data['Month'] = e_data.Date.apply(get_month)
e_data['Year'] = e_data.Date.apply(get_year)


<a id="3"></a>
<h1 style="background-color:gray;font-family:newtimeroman;font-size:250%;color:whitesmoke;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis (EDA)</h1>


In [None]:
Info = e_data.describe()
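# numeric_only=True skips text and date columns, which newer pandas versions refuse to aggregate
Info.loc['kurt'] = e_data.kurt(numeric_only=True)
Info.loc['skew'] = e_data.skew(numeric_only=True)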
Info

### As the skewness values show, even before plotting the feature distributions, the magnitude and depth features are positively skewed, and by a fair amount. We can already assume that the skewness is caused by 'black swans': as with any natural disaster, we do not know when those outliers will occur, but some earthquakes have significantly higher magnitudes and greater depths than the average earthquake, which takes fairly low values of both features.
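
As a rough illustration of how a handful of extreme events produces positive skew, here is a sketch on synthetic data. The distribution parameters are assumptions chosen only to mimic "many shallow events plus a few very deep ones"; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a depth column: 1000 ordinary events and 20 extreme ones.
ordinary = rng.normal(loc=30, scale=10, size=1000)
extreme = rng.normal(loc=600, scale=30, size=20)
depth = pd.Series(np.concatenate([ordinary, extreme]))

skewness = depth.skew()   # positive when the long tail is on the right
kurtosis = depth.kurt()   # excess kurtosis, large when the tails are heavy
mean_minus_median = depth.mean() - depth.median()

print(f'skew={skewness:.2f} kurt={kurtosis:.2f} mean-median={mean_minus_median:.2f}')
```

The few extreme values pull the mean above the median, which is the same signature the `skew` row in the table above points to.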

In [None]:
f_data = e_data.copy().rename(columns={'Date':'date'})
partitions = []
partitions.append(f_data.loc[44:int(len(f_data)/3)-1,:])
partitions.append(f_data.loc[int(len(f_data)/3):2*int(len(f_data)/3)-1,:])
partitions.append(f_data.loc[2*int(len(f_data)/3):3*int(len(f_data)/3),:])


neg_part_means =[]
neg_part_std   =[]
pos_part_means =[]
pos_part_std   =[]
for part in partitions:
    neg_part_means.append(part['Magnitude'].mean())
    neg_part_std.append(part['Magnitude'].std())
    pos_part_means.append(part['Depth'].mean())
    pos_part_std.append(part['Depth'].std())
    
res_df = pd.DataFrame({'Depth Mean':pos_part_means,'Magnitude Mean':neg_part_means,'Depth SD':pos_part_std,'Magnitude SD':neg_part_std},
                     index = [f'Partition_{i}' for i in range(1,4)])



res_df

In [None]:
fig = make_subplots(rows=3, cols=2)

for idx,prt in enumerate(partitions):
    fig.add_trace(
    go.Scatter(x=prt['date'], y=prt['Depth'],name=f'Depth Part {idx+1}'),
    row=idx+1, col=1)
    fig.add_trace(
    go.Scatter(x=prt['date'], y=prt['Magnitude'],name=f'Magnitude Part {idx+1}'),
    row=idx+1, col=2)

fig.update_layout(height=600, width=900, title_text="Distribution Of Magnitude/Depth Over Our Timeline For Each Partition")
fig.show()

In [None]:
fig = make_subplots(rows=4, cols=2, subplot_titles=('Observed Depth', 'Observed Magnitude', 'Trend Depth','Trend Magnitude','Seasonal Depth','Seasonal Magnitude','Residual Depth','Residual Magnitude'))

lbl = ['Depth','Magnitude']

for idx,column in enumerate(['Depth','Magnitude']):
    res = seasonal_decompose(f_data[column], period=100, model='additive', extrapolate_trend='freq')
    
    fig.add_trace(
    go.Scatter(x=np.arange(0,len(res.observed)), y=res.observed,name='{} Observed'.format(lbl[idx])),
    row=1, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=np.arange(0,len(res.trend)), y=res.trend,name='{} Trend'.format(lbl[idx])),
    row=2, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=np.arange(0,len(res.seasonal)), y=res.seasonal,name='{} Seasonal'.format(lbl[idx])),
    row=3, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=np.arange(0,len(res.resid)), y=res.resid,name='{} Residual'.format(lbl[idx])),
    row=4, col=idx+1)
            
fig.update_layout(height=600, width=900, title_text="Decomposition Of Magnitude/Depth Into Observed, Trend, Seasonal And Residual Components")
fig.show()

In [None]:
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 10))

ax[0].set_title('Depth Autocorrelation Analysis ',fontsize=18,fontweight='bold')
autocorrelation_plot(e_data['Depth'],ax=ax[0])
ax[1].set_title('Magnitude Autocorrelation Analysis ',fontsize=18,fontweight='bold')
autocorrelation_plot(e_data['Magnitude'],ax=ax[1],color='tab:red')
plt.show()

In [None]:
f, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 10))
ax[0,0].set_ylim(-0.1,0.1)
ax[1,0].set_ylim(-0.1,0.1)
ax[0,1].set_ylim(-0.1,0.1)
ax[1,1].set_ylim(-0.1,0.1)
plot_acf(e_data['Magnitude'],lags=50, ax=ax[0,0],title='Autocorrelation Magnitude')
plot_pacf(e_data['Magnitude'],lags=50, ax=ax[1,0],title='Partial Autocorrelation Magnitude')
plot_acf(e_data['Depth'],lags=50, ax=ax[0,1],color='tab:red',title='Autocorrelation Depth')
plot_pacf(e_data['Depth'],lags=50, ax=ax[1,1],color='tab:red',title='Partial Autocorrelation Depth')
plt.show()

In [None]:
tmp = e_data.groupby(by='Year').count()
tmp = tmp.reset_index()[['Year','Date']]
tmp
fig = ex.line(tmp,x='Year',y='Date')
fig.update_layout(
    title= 'Number Of Earthquakes Over The Years 1965-2016',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.add_shape(type="line",
    x0=tmp['Year'].values[0], y0=tmp['Date'].mean(), x1=tmp['Year'].values[-1], y1=tmp['Date'].mean(),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
        name='Mean'
)
fig.show()

### From the 1960s until the last few years of the dataset, the number of earthquake events per year shows an overall climbing trend. In the last 3 years of the dataset, the trend drops suddenly.
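
One hedged way to put a number on a "climbing trend" is to fit a least-squares line to the yearly counts. The sketch below uses invented event counts, and the names `events`, `per_year`, and `slope` are hypothetical; the notebook itself only inspects the line chart visually.

```python
import numpy as np
import pandas as pd

# Invented events: one row per earthquake, only the year matters here.
events = pd.DataFrame({
    'Year': np.repeat([1965, 1966, 1967, 1968, 1969], [3, 4, 4, 6, 7])
})

# Number of events per year.
per_year = events.groupby('Year').size()

# Fit count = slope * (years since first year) + intercept.
x = per_year.index.to_numpy() - per_year.index.min()
slope, intercept = np.polyfit(x, per_year.to_numpy(), deg=1)

print(f'average change per year: {slope:.2f} events')
```

A positive `slope` corresponds to an upward trend; a sudden drop in the last few years would show up as large negative residuals from the fitted line.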

In [None]:
# Select the column before aggregating so non-numeric columns are never averaged
tmp = e_data.groupby(by='Year')['Magnitude'].mean().reset_index()
fig = ex.line(tmp,x='Year',y='Magnitude')
fig.update_layout(
    title= 'Mean Earthquake Magnitude Over The Years 1965-2016',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.show()

### We see that in the 1960s the mean earthquake magnitude was very high in comparison to later years, where the mean magnitude stays around the same value.

In [None]:
# Select the column before aggregating so non-numeric columns are never included
tmp = e_data.groupby(by='Year')['Magnitude'].std().reset_index()
fig = ex.line(tmp,x='Year',y='Magnitude')
fig.update_layout(
    title= 'Standard Deviation Of Earthquake Magnitude Over The Years 1965-2016, Mean SD Shown With Red Line',
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0.0,
        dtick = 1
    )
)
fig.add_shape(
        # Line Horizontal
            type="line",
            x0=1965,
            y0=tmp['Magnitude'].mean(),
            x1=2016,
            y1=tmp['Magnitude'].mean(),
            line=dict(
                color="red",
                width=2.5,
                dash="dashdot",
            ),
    )
fig.show()

### The standard deviation of earthquake magnitude per year gives us the following insight. As explained in the introduction, a year in which the variance of earthquake magnitudes is large can imply that something unusual happened and is worth investigating: why did the magnitude vary so much during that year, in comparison to an average year with a number of small earthquakes and little to no peculiar behavior?
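
The idea of comparing each year against the red mean-SD line can be sketched as follows. The data are invented, and using the mean of the yearly standard deviations as the cut-off is an assumption made for illustration, not a rule stated in the notebook.

```python
import pandas as pd

# Invented magnitudes for three years; 1966 is deliberately more spread out.
toy = pd.DataFrame({
    'Year': [1965] * 4 + [1966] * 4 + [1967] * 4,
    'Magnitude': [5.5, 5.6, 5.5, 5.6,
                  5.5, 6.5, 7.5, 8.5,
                  5.8, 5.9, 5.7, 5.8],
})

# Yearly mean and standard deviation of magnitude.
yearly = toy.groupby('Year')['Magnitude'].agg(['mean', 'std'])

# The red line in the chart: the mean of the yearly standard deviations.
mean_sd = yearly['std'].mean()

# Years whose spread lies above that line are candidates for a closer look.
flagged = yearly[yearly['std'] > mean_sd]

print(flagged)
```

Reading `mean` and `std` side by side also covers the two cases from the introduction: a high mean with a low spread, and a low mean with a high spread.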

In [None]:
fig = make_subplots(
    rows=2, cols=2,
    column_widths=[0.6, 0.4],subplot_titles=('Location Of Recorded Earthquakes','Distribution Of Magnitudes',  'Distribution Of Depths'),
    row_heights=[0.4, 0.6],
    specs=[[{"type": "scattergeo", "rowspan": 2}, {"type": "histogram"}],
           [            None                    , {"type": "bar"}]])

fig.add_trace(
    go.Scattergeo(lat=e_data["Latitude"],
                  lon=e_data["Longitude"],
                  mode="markers",
                  hoverinfo="text",
                  text=e_data.Magnitude,
                  showlegend=False,
                  marker=dict(color="crimson", size=4, opacity=0.8)),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=e_data.Magnitude,name='Magnitude'),
    row=1, col=2
)
fig.add_trace(
    go.Histogram(x=e_data.Depth,name='Depth'),
    row=2, col=2
)


fig.update_geos(
    projection_type="orthographic",
    landcolor="white",
    oceancolor="MidnightBlue",
    showocean=True,
    lakecolor="LightBlue"
)

fig.update_xaxes(tickangle=45)

fig.update_layout(
    template="plotly_dark",
    margin=dict(r=10, t=25, b=40, l=60),
    
)

fig.show()

In [None]:
tmp = e_data[['Year','Day_of_Week']]
tmp=tmp.groupby(by='Year').agg(lambda x:x.value_counts().index[0])
tmp = tmp.reset_index()
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
            'Sunday']
days = {k:days[k] for k in range(0,7)}
tmp['Day_of_Week'] = tmp['Day_of_Week'].replace(days)
fig = ex.pie(tmp,names='Day_of_Week',title='Proportion Of Earthquakes On A Certain Day Of Week Over The Years ')
fig.show()

### Saturday and Monday are the days that most often turn out to be the weekday with the most recorded earthquakes in a given year
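
The pie chart above is built from the single most frequent weekday of each year, which is not the same thing as the share of all earthquakes per weekday. The sketch below computes both on a few invented dates so the difference is visible; `top_day_per_year` and `share_per_day` are hypothetical names.

```python
import pandas as pd

# Invented event dates covering two years.
toy = pd.DataFrame({'Date': pd.to_datetime([
    '2015-01-05', '2015-01-12', '2015-01-17',
    '2016-01-02', '2016-01-09', '2016-01-11', '2016-01-13',
])})
toy['Year'] = toy['Date'].dt.year
toy['Day'] = toy['Date'].dt.day_name()

# What the pie chart counts: one "winning" weekday per year.
top_day_per_year = toy.groupby('Year')['Day'].agg(lambda s: s.value_counts().index[0])

# Share of all events that fall on each weekday.
share_per_day = toy['Day'].value_counts(normalize=True)

print(top_day_per_year)
print(share_per_day)
```

With many years the two views can disagree, so it helps to state which one a given chart shows.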

In [None]:
ex.pie(e_data,names='Status',title='Proportion Of Different Earthquake Statuses',hole=.3)

In [None]:
ex.pie(e_data,names='Source',title='Proportion Of Different Earthquake Sources',hole=.3)

### 88 percent of our records list 'US' as their source

In [None]:
pivot_table = e_data.pivot_table(index='Year',columns='Month',values='Magnitude')
sns.clustermap(pivot_table,annot=True,cmap='coolwarm',col_cluster=False,figsize=(20,13))

### The intuition behind the heatmap above was to find out whether any years stand out in their average magnitude in a certain month. In other words, I wanted to find out whether stronger earthquakes are more common in, say, February of every fifth year, or whether some other pattern might uncover cyclic behavior of earthquakes. All years behave similarly except 1966, 1968, and 1971, which had significantly higher average earthquake magnitudes.
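
To complement the visual reading of the heatmap, standout years can also be listed numerically. The sketch below builds the same kind of Year by Month pivot on invented data and flags years whose average magnitude is more than one standard deviation above the rest; the threshold of 1 is an assumption chosen for illustration.

```python
import pandas as pd

# Invented magnitudes; 1966 is deliberately stronger than the other years.
toy = pd.DataFrame({
    'Year': [1966, 1966, 1967, 1967, 1968, 1968, 1969, 1969],
    'Month': [1, 2, 1, 2, 1, 2, 1, 2],
    'Magnitude': [6.4, 6.5, 5.8, 5.9, 5.9, 5.8, 5.8, 5.9],
})

# Mean magnitude per (Year, Month) cell; pivot_table averages by default.
pivot = toy.pivot_table(index='Year', columns='Month', values='Magnitude')

# Average over the months of each year, then standardize across years.
row_mean = pivot.mean(axis=1)
z_score = (row_mean - row_mean.mean()) / row_mean.std()

standout = z_score[z_score > 1].index.tolist()
print(standout)
```

The same standardization could be applied per column to look for a standout month instead of a standout year.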

In [None]:
plt.figure(figsize=(20,11))
ax = sns.distplot(e_data['Latitude'],label='Latitude')
ax.set_title('Distribution Of Earthquake Latitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Latitude',fontsize=16)

plt.show()

### The latitudes in our dataset follow a multimodal distribution, a trimodal one to be precise. I assume we can use clustering to separate the 3 groups of latitudes and try to understand why each cluster shares similar latitudes and what unites it.

In [None]:
plt.figure(figsize=(20,11))
ax = sns.distplot(e_data['Longitude'],label='Longitude',color='teal')
ax.set_title('Distribution Of Earthquake Longitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Longitude',fontsize=16)

plt.show()

### The longitude follows a trimodal distribution similar to the latitude. We will perform clustering on those features in a later stage of this kernel, hopefully giving some insight into different groups of earthquake sites.

In [None]:
ex.pie(e_data,names='Type',title='Proportion Of Different Earthquake Types In Our Dataset')

### We see that our data contains other event types, such as explosions. They make up less than 1 percent of the data, so we will treat them as extreme outliers, or 'black swans', as we have been calling extreme, unpredictable outliers.

In [None]:
plt.figure(figsize=(20,11))
ax = sns.distplot(e_data['Depth'],label='Depth',color='red')
ax.set_title('Distribution Of Earthquake Depths',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Depth',fontsize=16)

plt.show()

### Looking at the distribution of earthquake depths, we see that most earthquakes follow a bimodal distribution around depth 60. Still, we have some records of earthquakes at depths of 600-700, which are rare and which we define as black swans. We will ignore those values in the next steps to see the true distribution without an extremely long tail.

In [None]:
plt.figure(figsize=(20,11))
ax = sns.distplot(e_data['Magnitude'],label='Magnitude',color='teal')
ax.set_title('Distribution Of Earthquake Magnitudes',fontsize=19)
ax.set_ylabel('Density',fontsize=16)
ax.set_xlabel('Magnitude',fontsize=16)

plt.show()

In [None]:
#Outlier Removal
e_data = e_data[e_data['Depth'] <300]

In [None]:
tmp = e_data.copy()
tmp = tmp[tmp['Magnitude']<=6.1]
tmp = tmp[tmp['Depth']<60]

sns.jointplot(data=tmp,x='Depth',y='Magnitude',kind='kde',cmap='coolwarm',height=12,levels=30)

### So we can confirm that an earthquake is most likely to occur at depth 10 or 30 with a magnitude of about 5.5. Looking at the marginal distributions, we can see that, as concluded earlier, the depth follows a bimodal distribution. The magnitude feature, however, follows a multimodal distribution of higher order, which suggests that the magnitude scale is rounded: it behaves as a 'discrete' rather than a continuous feature and should possibly be interpreted as an ordinal scale rather than as a measured value.
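
A quick way to check the "rounded scale" suspicion is to test whether the magnitudes sit on a 0.1 grid and how few distinct values they take. The sketch below uses a small invented array; on the real data the same lines could be run on `e_data['Magnitude'].to_numpy()`, which is an assumption about how one would apply it rather than something the notebook does.

```python
import numpy as np

# Invented magnitudes, all reported to one decimal place.
magnitudes = np.array([5.5, 5.5, 5.6, 5.7, 5.5, 6.0, 5.8, 5.6, 7.1, 5.5])

# A value is "on the grid" if ten times the value is (numerically) an integer.
scaled = magnitudes * 10
on_grid = np.isclose(scaled, np.round(scaled))
share_on_grid = on_grid.mean()

# Distinct reported values and how often each occurs.
values, counts = np.unique(np.round(magnitudes, 1), return_counts=True)

print(f'share on 0.1 grid: {share_on_grid:.2f}')
print(dict(zip(values.tolist(), counts.tolist())))
```

If nearly all values lie on the grid and only a few dozen distinct values exist, the separate peaks in the density plot are largely an effect of rounding.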

<a id="4"></a>
<h1 style="background-color:gray;font-family:newtimeroman;font-size:250%;color:whitesmoke;text-align:center;border-radius: 15px 50px;">Clustering</h1>


In [None]:
#Clustering
from sklearn.cluster import KMeans,DBSCAN
DB = DBSCAN(eps=0.5,algorithm='ball_tree',min_samples=15)
c_data = e_data.copy()
DB.fit(e_data[['Magnitude','Depth']])
c_data['Cluster'] = DB.labels_

fig = ex.scatter_3d(c_data,x='Longitude',y='Depth',z='Magnitude',color='Cluster',height=900)
fig.show()
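
The cell above fits DBSCAN on the raw 'Magnitude' and 'Depth' values, which live on very different scales, so a single `eps` is dominated by depth. One possible variation, shown here only as a sketch on synthetic data, is to standardize the features first. The group centers, `eps`, and `min_samples` below are assumptions for illustration, and `StandardScaler` is not used anywhere in the original notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups of (magnitude, depth) pairs: shallow and intermediate.
shallow = np.column_stack([rng.normal(5.6, 0.1, 200), rng.normal(10, 2, 200)])
deeper = np.column_stack([rng.normal(5.8, 0.1, 200), rng.normal(35, 2, 200)])
X = np.vstack([shallow, deeper])

# Put both features on a comparable scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# Label -1 marks points that DBSCAN treats as noise.
labels = DBSCAN(eps=0.5, min_samples=15).fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})

print(f'clusters found: {n_clusters}, noise points: {(labels == -1).sum()}')
```

Whether scaling helps on the real data depends on which feature should drive the clusters, so the result should be compared against the unscaled fit rather than assumed to be better.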