# Time series analysis of earthquake data provided by [Itokiana RAFIDINARIVO](https://www.kaggle.com/itokianarafidinarivo)


## Disclaimer
This is the first of two (or three) notebooks on the earthquake (EQ) dataset, intended as a time series analysis.

## A note for reader
This is my second attempt at a time series analysis. The first one was performed on the [Surface Solar Radiation Dataset](https://www.kaggle.com/saurabhshahane/surface-solar-radiation-dataset) by [Saurabh Shahane](https://www.kaggle.com/saurabhshahane) and can be found [here](https://www.kaggle.com/syedalimohsinbukhari/surface-solar-radiation-time-series-analysis). So, if there was something you wanted to see but did not find, kindly let me know. It will be of great help to me. Suggestions, corrections, and constructive feedback are all welcome.

## Plan for future notebook(s)
This notebook covers only the year 1990 as an exploratory dataset. Future notebook(s) with the complete dataset will include the year 1990 as well.

## Caveat in the code
This code uses `time.ctime`, which converts to local time, so the derived days can shift depending on what time zone you're in. On my local PC, the averages differed from those shown in **DAILY/MONTHLY MEANS OF THE MAG/SIG** here on Kaggle. I tried several times, but I could not get my local system and Kaggle to agree. If you know how to resolve this issue, kindly let me know.
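
One possible way around the time zone dependence is to convert the epoch explicitly in UTC instead of relying on the machine's local time. The sketch below is only an illustration under that assumption: `time_converter_utc` is a hypothetical name and is not the function used in the rest of this notebook, so switching to it could change the averages shown further down.

```python
from datetime import datetime, timezone

# fixed English abbreviations, so the result does not depend on the system locale
DAY_ABBR = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
MONTH_ABBR = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def time_converter_utc(epoch_ms):
    """Hypothetical UTC-only variant: millisecond epoch -> (weekday, month name, day of month)."""
    # tz=timezone.utc makes the conversion independent of the machine's local time zone
    stamp = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return DAY_ABBR[stamp.weekday()], MONTH_ABBR[stamp.month - 1], stamp.day


result = time_converter_utc(633829215390)
print(result)
```

Because no local-time call is involved, the same epoch should map to the same weekday on a local PC and on Kaggle.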

### LIBRARY IMPORTS

In [None]:
%matplotlib inline
import os
import time
import codecs
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm
import numpy.polynomial.polynomial as poly

### CUSTOM FUNCTIONS

In [None]:
'''
time: Time when the event occurred. Times are reported in milliseconds since the epoch (1970-01-01T00:00:00.000Z),
and do not include leap seconds. In certain output formats, the date is formatted for readability; Long Integer

updated: Time when the event was most recently updated. Times are reported in milliseconds since the epoch.
In certain output formats, the date is formatted for readability; Long Integer
'''


def time_converter(epoch):
    """
    This function converts a given millisecond epoch time to GMT+0.0 time and
    returns the weekday name, month name, and day of the month as strings.
    Since I'm in the GMT+5.0 h zone, I am applying the required correction as well
    """
    # taken from https://stackoverflow.com/a/12400556/3212945
    # 18000000 ms (5 hours) is the correction for GMT+5
    # ctime output looks like 'Sun Jun 20 23:21:05 1993'; the third field is the day of the month
    _tt = time.ctime((epoch-18000000)/1000).split()
    day, month, mmonth = _tt[0], _tt[1], _tt[2]

    return day, month, mmonth


def get_average_by(df, group_by, get_average_of, re_index):
    """
    Gives the mean of the specified parameter in a new dataset
    
    PARAMETERS:
    -----------
                df: dataset to be used
          group_by: column name in the dataset by which grouping is required
    get_average_of: column name in the dataset for which mean calculation is required
          re_index: ordered list of group_by labels used to reorder the result
    """
    # reindexing idea taken from https://stackoverflow.com/a/30010004/3212945
    _temp = df.groupby([group_by]).agg(Mean=(get_average_of, 'mean')).reindex(re_index).reset_index()
    return _temp


def plot_hist_normdist(x, nbins, plot_hatch, plot_label):
    """
    Plots histogram with normal and skewnormal distributions overplotted on the data
    
    PARAMETERS:
    -----------
             x: Data for which histogram is required
         nbins: Number of bins for histogram
    plot_hatch: Hatch style for matplotlib histogram
    plot_label: Label to display in plot legend
    """
    h3, l3 = [], []

    x_new = np.arange(x.min(), x.max(), 0.01)
    plt.hist(x, bins=nbins, histtype='step', hatch=plot_hatch, label=plot_label)
    h1, l1 = plt.gca().get_legend_handles_labels()
    plt.xlabel(plot_label)
    plt.ylabel('Counts')
    plt.twinx()
    plt.plot(x_new, norm.pdf(x_new, *norm.fit(x)), label='Normal distribution')
    plt.plot(x_new, skewnorm.pdf(x_new, *skewnorm.fit(x)), label='Skewnormal distribution')
    plt.ylabel('PDF for various normal distributions')
    plt.gca().set_ylim(bottom=0)
    h2, l2 = plt.gca().get_legend_handles_labels()
    for i, j in zip(h1, l1):
        h3.append(i)
        l3.append(j)
    for i, j in zip(h2, l2):
        h3.append(i)
        l3.append(j)
    plt.legend(handles=h3, labels=l3, loc='best')
    plt.xlabel(plot_label)
    

def axis_cheating(plt, x):
    """
    This function changes labels of plots forcibly
    
    PARAMETERS:
    -----------
        plt: matplotlib.pyplot handle
          x: list for ticks
    """
    try:
        x = list(x)
        _t = [i for i in x]
        _t.insert(0, '-1')
        _t.insert(len(_t), '0')
        plt.gca().set_xticklabels(_t)
    except ValueError:
        x = list(x)
        _t = [i for i in x]
        plt.gca().set_xticklabels(_t)


def plot_means(x, y, xlab, xticks):
    """
    This function plots the mean values for given datasets
    
    PARAMETERS:
    -----------
         x: first dataset/dataseries
         y: second dataset/dataseries
      xlab: xlabel for the plot
    xticks: this parameter is used to correct the xlabels using axis_cheating function
    """

    if xlab == 'Days':
        lab = 'Daily'
    elif xlab == 'Months':
        lab = 'Monthly'

    plt.plot(x['Mean'], 'r-o', label='{} mean of EQ magnitude for 1990'.format(lab))
    plt.xlabel(xlab)
    plt.ylabel('Mean magnitude of EQ')

    if xlab=='Months':
        plt.gca().set_xticks(np.arange(0, 12, 1))
    
    axis_cheating(plt, xticks)
    
    h1, l1 = plt.gca().get_legend_handles_labels()

    plt.twinx()
    
    plt.plot(y['Mean'], 'g-o', label='{} mean of EQ significance for 1990'.format(lab))
    plt.ylabel('Mean significance of EQ')
    
    h2, l2 = plt.gca().get_legend_handles_labels()
    h1, h2, l1, l2 = h1[0], h2[0], l1[0], l2[0]
    
    plt.legend([h1, h2], [l1, l2])
    plt.grid(True)
    plt.tight_layout()
    

def plot_boxenplots(df, to_x, to_hue, mag_sig, xlab):
    """
    This function plots sns.boxenplots for given dataset and hues
    
    PARAMETERS:
    -----------
         df: dataframe
       to_x: categorical data for x-axis
     to_hue: categorical data for dividing the data further into packets
    mag_sig: list of string(s) containing column names from df
       xlab: xlabel for the plot
    """

    for i in mag_sig:
        plt.figure(figsize=(40,7), dpi=300)
        f = sns.boxenplot(data=df[[i, to_x, to_hue]], x=to_x, y=i, hue=to_hue, orient='v')
        if to_x == 'day':
            f.invert_xaxis()
        plt.xlabel(xlab)
        if i == 'mag':
            plt.ylabel('Magnitude of EQ')
            f.legend_.set_title(to_hue)
        else:
            plt.ylabel('Significance of EQ')
            f.legend_.set_title(to_hue)
        plt.tight_layout()

### READ THE FILES FOR YEAR 1990

I am adding `month_names`, `month_numbers` to the dataset. These two will be used to categorize the dataset.

In [None]:
#file_path = os.getcwd() + '/'
#file_names = [f for f in os.listdir(os.curdir) if f.startswith('1990') and f.endswith('.csv')]

file_path = '../input/earthquakes-monthly-usgs-updated-monthly/'

day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

month_names = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

frames = []

for month in range(1, 13):
    csv_path = file_path + '1990_{}.csv'.format(month)
    print(csv_path)
    frames.append(pd.read_csv(csv_path))

# DataFrame.append was removed in pandas 2.0, so the monthly frames are concatenated instead
df_temp = pd.concat(frames)

# this will be used in a few plots, trust me
mag_sig = ['mag', 'sig']

### CHECKING NAs

In [None]:
df_temp.isna().sum()

We will not use rows with any 0/NA value.

Also, there are two major categorical columns
* magType: magnitude type of the earthquake event
* type: type of event from which the reading was recorded

As the dataset already contains two closely related columns (`mag`, `sig`), I will keep only the `type` column, which also reduces redundancy.

In [None]:
df_temp2 = df_temp[['mag', 'time', 'updated', 'sig', 'type', 'longitude', 'latitude']]
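
The cell above only selects columns, so the 0/NA filtering mentioned earlier is not applied by it. A minimal sketch of how such a filter could look is given below. It runs on a small made-up frame (`demo`, not the real data), and whether zeros should really be excluded is an assumption on my part.

```python
import numpy as np
import pandas as pd

# made-up rows standing in for the real dataset
demo = pd.DataFrame({
    'mag': [1.2, np.nan, 0.0, 4.5],
    'sig': [22, 35, 0, 312],
    'type': ['earthquake', 'earthquake', 'quarry blast', 'explosion'],
})

# drop rows where either numeric column is missing
cleaned = demo.dropna(subset=['mag', 'sig'])

# keep only rows where both numeric columns are non-zero
cleaned = cleaned[(cleaned[['mag', 'sig']] != 0).all(axis=1)]

print(cleaned)
```

Only the first and last rows of `demo` survive both filters.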

Let's add days to the dataset as well, using the `time_converter` function.

In [None]:
day, month_na, month_nu = [], [], []

for i in df_temp2['time']:
    _d, _m, __m = time_converter(i)
    day.append(_d)
    month_na.append(_m)
    month_nu.append(__m)
    
day_df = pd.DataFrame(day)
day_df.columns = ['day']

month_names_df = pd.DataFrame(month_na)
month_names_df.columns = ['month_names']

month_numbers_df = pd.DataFrame(month_nu)
month_numbers_df.columns = ['month_numbers']

df_temp2 = pd.concat([df_temp2.reset_index().drop('index', axis=1), day_df, month_names_df, month_numbers_df], axis=1, sort=False)
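
The loop above builds three lists and then concatenates them column by column. The same columns can probably be derived in one pass with pandas datetime accessors. The sketch below assumes the `time` column holds millisecond epochs and works in UTC, so its output may differ from the `ctime`-based columns. The helper name `add_calendar_columns` and the column name `day_of_month` are hypothetical.

```python
import pandas as pd


def add_calendar_columns(df, time_col='time'):
    """Return a copy of df with weekday, month-name and day-of-month columns (all in UTC)."""
    stamps = pd.to_datetime(df[time_col], unit='ms', utc=True)
    out = df.copy()
    # day_name()/month_name() return English names by default; keep the first three letters
    out['day'] = stamps.dt.day_name().str[:3]
    out['month_names'] = stamps.dt.month_name().str[:3]
    out['day_of_month'] = stamps.dt.day
    return out


# two example epochs: the one checked later in this notebook, and 1990-01-01T00:00:00Z
demo = pd.DataFrame({'time': [633829215390, 631152000000]})
demo = add_calendar_columns(demo)
print(demo)
```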

In [None]:
df_temp2.head()

In [None]:
# just to check whether the time-day correspondence is correct
# it should be a Wednesday, according to the dataframe
time_converter(633829215390)
# the result can also be checked from https://www.epochconverter.com/

## Visualisation, FINALLY !!

### Let's first see how many detections each event type caused per day and per month for the year 1990.

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(211)
f = sns.histplot(data=df_temp2, x='day', hue='type', multiple='dodge')
f.set_yscale('log')
f.set_xlabel('Days')
plt.tight_layout()

plt.subplot(212)
f = sns.histplot(data=df_temp2, x='month_names', hue='type', multiple='dodge')
f.set_xticklabels([i.capitalize() for i in month_names])
f.set_xlabel('Months')
f.set_yscale('log')

* The numbers of **EQ** and **quarry blast (QB)** events are fairly even across all days/months, except for QB on **Sunday**.
* The number of **explosion** events is also even, except on **Sunday** and in **February**, **March**, and **December**.
* Rock bursts contribute minimally to the weekly data and are mostly seen in **December** only.
* **Sunday/Monday** and **January/February/March** have been quite friendly for the world. No `nuclear explosions` :D

### Now let's focus a bit on the magnitude of EQs and their significance

In [None]:
# me being lazy
x, y = df_temp2['mag'], df_temp2['sig']

plt.figure(figsize=(16, 8))

plt.subplot(221)
plot_hist_normdist(x=x, nbins=64, plot_hatch='/', plot_label='Magnitude of EQ')

plt.subplot(223)
pd.plotting.boxplot(x, vert=False, figsize=(12, 4))
_ = plt.xlabel('Magnitude of EQ')

plt.subplot(222)
plot_hist_normdist(x=y, nbins=64, plot_hatch='\\', plot_label='Significance of EQ')

plt.subplot(224)
pd.plotting.boxplot(y, vert=False, figsize=(12, 4))
_ = plt.xlabel('Significance of EQ')

plt.tight_layout()

* Both the **magnitude** and **significance** data are highly skewed and contain a LOT of outliers
* Both datasets show a bimodal shape (they have two peaks) and fit neither the normal nor the skewnormal curve

### Let's check how far the mean is from the min and max for both of these columns.

In [None]:
for i in ['mag', 'sig']:
    x, y =  round(df_temp2[i].mean() - df_temp2[i].min(), 4), round(df_temp2[i].mean() - df_temp2[i].max(), 4)
    
    print('The difference between mean and min value of {} parameter is {}'.format(i, x))
    print('The difference between mean and max value of {} parameter is {}\n'.format(i, y))

THAT'S A LOT OF SKEWNESS.
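
The mean-to-min and mean-to-max distances give a rough feel for the asymmetry. A single-number summary is the sample skewness, which `scipy.stats.skew` computes. The sketch below uses a small made-up array (`sample`) rather than the real `mag`/`sig` columns, so the printed values are illustrative only.

```python
import numpy as np
from scipy.stats import skew

# made-up, right-skewed values standing in for a real column
sample = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])

mean_minus_min = sample.mean() - sample.min()
max_minus_mean = sample.max() - sample.mean()
sample_skewness = skew(sample)

print('mean - min:', round(mean_minus_min, 4))
print('max - mean:', round(max_minus_mean, 4))
print('skewness  :', round(sample_skewness, 4))
```

A positive skewness together with a `max - mean` much larger than `mean - min` points to a long right tail. On the real data the same call would be `skew(df_temp2['mag'])`, assuming the column has no NA values.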

### DAILY/MONTHLY MEANS OF THE MAG/SIG

Let's get some averages

In [None]:
df_temp2[['mag', 'day']].groupby('day').mean().reindex(day_names).T

In [None]:
df_temp2[['sig', 'day']].groupby('day').mean().reindex(day_names).T

In [None]:
_mn = [i.capitalize() for i in month_names]
df_temp2[['mag', 'month_names']].groupby('month_names').mean().reindex(_mn).T

In [None]:
df_temp2[['sig', 'month_names']].groupby('month_names').mean().reindex(_mn).T
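
The `.reindex(...)` calls in the cells above matter because `groupby` sorts its keys alphabetically by default, which would scramble the calendar order. The toy example below (made-up values in `demo`, not the earthquake data) shows the effect, including the `NaN` that appears for labels with no rows.

```python
import pandas as pd

day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# made-up magnitudes for three weekdays only
demo = pd.DataFrame({
    'day': ['Wed', 'Mon', 'Wed', 'Sun', 'Mon'],
    'mag': [2.0, 1.0, 4.0, 3.0, 2.0],
})

# groupby sorts the group keys alphabetically
alphabetical = demo.groupby('day')['mag'].mean()

# reindex puts the groups in calendar order and inserts NaN for missing days
calendar = alphabetical.reindex(day_names)

print(list(alphabetical.index))
print(calendar)
```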

Now, let's plot these averages.

In [None]:
x1 = get_average_by(df=df_temp2, group_by='day', get_average_of='mag', re_index=day_names)
y1 = get_average_by(df=df_temp2, group_by='day', get_average_of='sig', re_index=day_names)
x2 = get_average_by(df=df_temp2, group_by='month_names', get_average_of='mag', re_index=_mn)
y2 = get_average_by(df=df_temp2, group_by='month_names', get_average_of='sig', re_index=_mn)

In [None]:
plt.figure(figsize=(6, 8))
plt.subplot(211)
plot_means(x1, y1, 'Days', day_names)
plt.subplot(212)
plot_means(x2, y2, 'Months', _mn)

### BOXEN PLOTS, A LOT OF THEM

Now let's see the distribution of `mag` and `sig` values on a `daily` and `monthly` basis, divided by `type` of the event.

In [None]:
plot_boxenplots(df_temp2, 'day', 'type', mag_sig, 'Days')

In [None]:
plot_boxenplots(df_temp2, 'month_names', 'type', mag_sig, 'Months')

In [None]:
plot_boxenplots(df_temp2, 'type', 'day', mag_sig, 'EQ types')

In [None]:
# the x-axis holds the event type here (months are the hue), so label it accordingly
plot_boxenplots(df_temp2, 'type', 'month_names', mag_sig, 'EQ types')

* The EQ/quarry blast/explosion events, although numerous and with large outliers, have lower means than the sparsely occurring nuclear explosions, and even lower means than rock burst events.
* The monthly/daily averages of EQ/QB/E events are almost the same (as also seen in the distribution plot earlier); however, the mean values for NE are not evenly distributed.

### GEO PLOT

In [None]:
px.scatter_geo(df_temp2, 'latitude', 'longitude',
               hover_data=['latitude','longitude', 'time', 'mag', 'sig'], color='type')

This is all for now :)