# QuaSaR: Identifying EEW Rings - MMI Classifier

In an effort to understand the GeoNet datasets and harvest data that can be used in trialing the picking algorithms, we begin with the [GeoNet Strong Motion Database](https://www.geonet.org.nz/data/supplementary/nzsmdb), [rupture  model data](), and [processed recordings]() that are readily available<sup>[1](#ftnote1)</sup>. The idea is to classify the historic data by the various measures made available through the datasets. These measures include the moment magnitude, hypocenter location, recording station locations, tectonic type, rupture length, and total duration.

<a name="ftnote1">[1]</a>: [All GeoNet data and images](https://github.com/GeoNet/data), with updates on Github, are made available free of charge through the GeoNet project to facilitate research into hazards and assessment of risk. GeoNet is sponsored by the New Zealand Government through its agencies: Earthquake Commission (EQC), GNS Science and Land Information New Zealand (LINZ), the National Emergency Management Agency (NEMA) and the Ministry of Business, Innovation and Employment (MBIE).

In [None]:
'''
    WARNING CONTROL to display or ignore all warnings
'''
import warnings; warnings.simplefilter('default')     #switch between 'default' and 'ignore'

import logging
from functools import lru_cache

## Heuristically mining the strong motion database

The flatfiles are described in the [Flatfile specifications](https://static.geonet.org.nz/info/resources/applications_data/earthquake/strong_motion/Flatfiles_ColumnExplanation.pdf). They contain the horizontal and vertical acceleration response spectra, and the horizontal and vertical Fourier amplitude spectra of acceleration.

We begin by mining the data with attention to the following measures:
1. _Mw_: Moment magnitude
1. _Origin_time_: Origin time of the earthquake in UTC
1. _TectClass_: Tectonic class, one of crustal, slab, or interface
1. _Mech_: Focal mechanism, one of strike-slip, normal, reverse, or unknown
1. _LENGTH_km_: Rupture length in kilometers
1. _WIDTH_km_: Rupture width in kilometers
1. _TotalDuration_: Total duration of the earthquake

### Class to load and clean data

In [None]:
#import torch
import pandas as pd
import datetime as dt
#from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

'''
    CSV LOAD into DataFrame and remove unnecessary columns
'''
strong_motion_df = pd.read_csv('../data/flatfiles/NZdatabase_flatfile_Significant_Duration_horizontal.csv', encoding = "UTF-8")

strong_motion_df = strong_motion_df.drop(columns=['CuspID','References','Location','Record'], axis=1)
strong_motion_df = strong_motion_df.replace('>2',float(2.1))
strong_motion_df = strong_motion_df.replace('<0.1',float(0.1))
''' Encode Origin_time as a float of the form YYYYMMDDR, where R is the hour rounded to 0 or 1 day '''
def _encode_origin_time(t_val):
    t_dt = dt.datetime.fromisoformat(t_val[0:19])
    return float(t_dt.year*100000+t_dt.month*1000+t_dt.day*10+(round(t_dt.hour/24)))

# Assign the whole column at once; chained indexing (df[col][idx] = ...) may write to a copy
strong_motion_df['Origin_time'] = strong_motion_df['Origin_time'].map(_encode_origin_time)

_lst_cate_data_cols = [
                        'Origin_time',     # Origin time of earthquake in UTC
                        'TectClass',       # Tectonic classification, either ‘crustal’, ‘interface’, or ‘slab’
                        'Mech',            #
                        'HWFW',            #
                        'SiteCode',        #
                        'SiteClass1170',   #
                        'Vs30Uncert',      #
                        'TsiteUncert',     #
                        'Z1Uncertainty'    #
                      ]
cat_strong_motion_df = strong_motion_df[_lst_cate_data_cols]
num_strong_motion_df = strong_motion_df.drop(_lst_cate_data_cols, axis=1)
print(f"Shape of the Categorical DataFrame: {cat_strong_motion_df.shape}")
print(f"Shape of the Numerical DataFrame: {num_strong_motion_df.shape}")

''' LabelEncoder to convert the categorical data to numerical float64 '''
le = LabelEncoder()
for __cat_col_name in _lst_cate_data_cols:
    strong_motion_df[__cat_col_name] = le.fit_transform(strong_motion_df[__cat_col_name]).astype(float) 
print(f"Shape of the full DataFrame: {strong_motion_df.shape}")
print(f'\nPost label encoding of categorical data \n{strong_motion_df.head(3)}')
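
The origin-time encoding used in the cell above can be checked by hand on a single value. The sketch below is a standalone illustration: the function name `encode_origin_time` and the two timestamps are assumptions made for the example, not values read from the flatfile.

```python
import datetime as dt

def encode_origin_time(iso_string: str) -> float:
    """Encode an ISO timestamp as YYYYMMDDR, where R is the hour rounded to 0 or 1 day."""
    # Keep only 'YYYY-MM-DDTHH:MM:SS'; fractional seconds and any zone suffix are dropped
    t = dt.datetime.fromisoformat(iso_string[0:19])
    # Python rounds halves to even, so hours 0-12 give 0 and hours 13-23 give 1
    day_flag = round(t.hour / 24)
    return float(t.year * 100000 + t.month * 1000 + t.day * 10 + day_flag)

morning = encode_origin_time("2016-11-13T11:02:56.000Z")
evening = encode_origin_time("2016-11-13T18:30:00.000Z")
print(morning, evening)
```

Note that the trailing digit is a flag rather than a calendar roll-over: an evening event keeps its own date and only the last digit changes from 0 to 1.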

### Method to pairwise plot measures

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
'''
    Pairwise density plots for all the column variables
    attribute specs: https://static.geonet.org.nz/info/resources/applications_data/earthquake/strong_motion/Flatfiles_ColumnExplanation.pdf
'''
plot_df = strong_motion_df[[
                            'Mw',          # Moment Magnitude
                            'MwUncert',    # Mw uncertainty class 
                            'Origin_time', # Origin time of earthquake in UTC
                            'TectClass',   # Tectonic classification, either ‘crustal’, ‘interface’, or ‘slab’
                            'Mech',        # Focal mechanism: S→strike-slip, N→normal,R→reverse,U→unknown
                            'PreferredFaultPlane', # 1→one fault plane orientation is preferred, 1→Unknown
                            'Strike',      # Strike angle (degrees)
                            'Dip',         # Dip angle (degrees)
                            'Rake',        # Rake angle (degrees)
                            'HypLat',      # Hypocenter Latitude
                            'HypLon',      # Hypocenter Longitude
                            'StationLat',  # Recording Station Latitude
                            'StationLon',  # Recording Station Longitude
                            'HypN',        # Northing of Hypocenter
                            'HypE',        # Easting of Hypocenter
                            'StationN',    # Northing of Station
                            'StationE',    # Easting of Station
                            'LENGTH_km',   # Inferred rupture Length in Kilometers
                            'WIDTH_km',    # Inferred down-dip rupture Width in Kilometers
                            'TotalDuration'# Total Duration of the earthquake
                           ]]
lst_plot_cols = ['Origin_time','Mw','TectClass','Mech','LENGTH_km','TotalDuration']
print(f"Description of each measure: \n{pd.DataFrame(plot_df[lst_plot_cols].describe(include='all')).T}")
g = sns.pairplot(plot_df[lst_plot_cols], 
             hue='Mw', corner=True,hue_order=None,
             kind='scatter', diag_kind='auto', height=7,markers='d')
g.fig.suptitle("Pair Plots for relevant measures") # y= some height>1


## Applying an NN Classifier
The intent is to use data available from the GeoNet strong motion and felt databases to estimate a Modified Mercalli Intensity (MMI). We use an Artificial Neural Network (NN) to classify the data from the several flatfiles. An NN is chosen because of the large number of covariates in the dataset. The output should therefore deliver an MMI for a new scenario.

In this notebook we investigate the use of PyTorch and tensor products, and their capabilities, to build a model for evaluating Objective II.B. To do so, we need to complete the following steps:
1. Encode or vectorize the data; especially with transforming categorical labels to numerical data
1. [Split the data](https://palikar.github.io/posts/pytorch_datasplit/) to generate training, test, and validation datasets
1. Transform them into tensors as inputs for the model
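
As a minimal sketch of the encoding and tensor steps, the example below label-encodes two categorical columns of a toy table and converts the result to tensors. The table `toy_df` and its values are made up for illustration; only the column names follow the flatfile.

```python
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

# Toy records; the values are illustrative, not taken from the flatfiles
toy_df = pd.DataFrame({
    "Mw":        [5.1, 6.2, 7.8, 5.9],
    "TectClass": ["crustal", "slab", "interface", "crustal"],
    "Mech":      ["S", "N", "R", "S"],
})

# Encode each categorical column as integer class ids, keeping one encoder per column
encoders = {}
for col in ["TectClass", "Mech"]:
    encoders[col] = LabelEncoder()
    toy_df[col] = encoders[col].fit_transform(toy_df[col])

# Inputs become a float32 tensor; the target becomes an int64 tensor of class ids
inputs = torch.tensor(toy_df[["Mw", "TectClass"]].to_numpy(dtype="float32"))
targets = torch.tensor(toy_df["Mech"].to_numpy(dtype="int64"))

print(inputs.shape, targets.tolist())
print(list(encoders["Mech"].classes_))
```

Keeping a separate encoder per column is a design choice of this sketch: it allows the original labels to be recovered later with `inverse_transform`, which a single reused encoder would not.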

### Method to split dataset into train, validation, & test
We borrow from [Train-Validation-Test split in PyTorch](https://palikar.github.io/posts/pytorch_datasplit/#the-datasplit-class)

In [None]:
''' METHOD DATALOADER - build the training dataloader object'''
# Load necessary Pytorch packages
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch import Tensor
import numpy as np

shuffle = True
test_train_split = 0.8
val_train_split = 0.2

dataset = strong_motion_df.drop(['Origin_time','SiteCode'], axis=1).astype(float)

dataset_size = len(dataset)
indices = list(range(dataset_size))
test_split = int(np.floor(test_train_split * dataset_size))

if shuffle:
    np.random.shuffle(indices)

train_indices, test_indices = indices[:test_split], indices[test_split:]
train_size = len(train_indices)
validation_split = int(np.floor((1 - val_train_split) * train_size))
train_indices, val_indices = train_indices[ : validation_split], train_indices[validation_split:]

#_targets = dataset[['TectClass','Mech']]
#_inputs = dataset.drop(['TectClass','Mech'], axis=1)
_targets = dataset['Mech']
_inputs = dataset.drop(['Mech'], axis=1)
print('Classes',set(_targets))
train_inputs = np.array(_inputs.iloc[train_indices], dtype=np.float32)
train_targets= np.array(_targets.iloc[train_indices], dtype=np.float32)

train_data = []
for i in range(len(train_inputs)):
    train_data.append([train_inputs[i], train_targets[i]])
#train_data_ts = torch.tensor([train_data], dtype=torch.float32)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=10, num_workers=0, shuffle=False)
print("Train data loader length", str(len(train_loader)))

'''
for data in train_data:
    print(data)
    break
for i, (data,labels) in enumerate(train_data, 1):
    print(data,type(data))
    print(labels,type(labels))
'''
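
The cell above builds only the training loader. One possible way to obtain validation and test loaders from the same kind of index lists is `SubsetRandomSampler`, as sketched below on synthetic data. The names prefixed with `toy_` and the data sizes are assumptions for the example; the split fractions mirror the ones used above.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Stand-in data: 100 rows, 5 features, 3 classes (all values synthetic)
rng = np.random.default_rng(0)
toy_features = torch.tensor(rng.normal(size=(100, 5)), dtype=torch.float32)
toy_labels = torch.tensor(rng.integers(0, 3, size=100), dtype=torch.int64)
toy_dataset = TensorDataset(toy_features, toy_labels)

# Same arithmetic as above: 80% train+validation, then 20% of that for validation
toy_indices = rng.permutation(len(toy_dataset)).tolist()
toy_test_split = int(np.floor(0.8 * len(toy_indices)))
toy_train_val_idx, toy_test_idx = toy_indices[:toy_test_split], toy_indices[toy_test_split:]
toy_val_split = int(np.floor((1 - 0.2) * len(toy_train_val_idx)))
toy_train_idx, toy_val_idx = toy_train_val_idx[:toy_val_split], toy_train_val_idx[toy_val_split:]

# Each loader draws only from its own index list
toy_train_loader = DataLoader(toy_dataset, batch_size=10, sampler=SubsetRandomSampler(toy_train_idx))
toy_val_loader = DataLoader(toy_dataset, batch_size=10, sampler=SubsetRandomSampler(toy_val_idx))
toy_test_loader = DataLoader(toy_dataset, batch_size=10, sampler=SubsetRandomSampler(toy_test_idx))

print(len(toy_train_idx), len(toy_val_idx), len(toy_test_idx))
```

Because the three index lists are slices of one permutation, no record can appear in more than one loader.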

## The NN Model

#### Input layer
Comprises both numerical and categorical data. We have generated embeddings for all the categorical data.
* ```n_features: int=46``` (of 55)

#### Hidden Layer
The design of the hidden layer makes use of
1. A [Linear transformation](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) that maps ```n_features: int=46``` (of 55) inputs to ```n_out_features: int=2``` (of 5) outputs.

#### Output Layer
The output classifiers are ```output={'Mw','TectClass','Mech','Length_km','Width_km', 'TotalDuration'}```. Therefore, we have ```n_out_features: int=6```.

#### Weights
The weights are randomly generated for the ```2401 x 49``` tensors. 
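
To make the layer dimensions concrete, the sketch below builds a single `nn.Linear` layer and prints the shapes of its randomly initialised parameters. The sizes 46 and 2 are borrowed from the text above purely for illustration, and `toy_layer` and `toy_batch` are names introduced for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                       # fixed seed so the random initialisation is repeatable
toy_layer = nn.Linear(in_features=46, out_features=2)

print(toy_layer.weight.shape)              # the weight matrix is (out_features, in_features)
print(toy_layer.bias.shape)                # one bias term per output feature

toy_batch = torch.randn(10, 46)            # a batch of 10 synthetic records
toy_out = toy_layer(toy_batch)
print(toy_out.shape)                       # one row of outputs per record
```

The weight matrix of a linear layer always has one row per output and one column per input, so its size follows directly from the number of input columns and target classes.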

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    
    def __init__(self, n_in_features: int, n_out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(n_in_features, n_out_features)
        
    def forward(self,x):
        # Pass data through conv1
#        x = self.conv1(x)
        # Use the rectified-linear activation function over x
#        x = F.relu(x)

#        x = self.conv2(x)
#        x = F.relu(x)

        # Run max pooling over x
#        x = F.max_pool2d(x, 2)
        # Pass data through dropout1
#        x = self.dropout1(x)
        # Flatten x with start_dim=1
#        x = torch.flatten(x, 1)
        # Pass data through fc1
        x = self.fc1(x)
#        print(x)
#        x = F.relu(x)
#        x = self.dropout2(x)
#        x = self.fc2(x)

        # Apply log-softmax to x, which returns log-probabilities per class
        output = F.log_softmax(x, dim=1)
        return output
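
`Net.forward` returns log-probabilities, while the notebook later pairs the model with `nn.CrossEntropyLoss`, which applies a log-softmax internally. The sketch below, on synthetic scores, suggests that the extra log-softmax does not change the loss value, and that `nn.NLLLoss` is the loss that expects log-probabilities directly. The tensor names are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_logits = torch.randn(4, 3)                 # 4 synthetic records, 3 classes
toy_classes = torch.tensor([0, 2, 1, 2])       # integer class ids
toy_log_probs = F.log_softmax(toy_logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(toy_logits, toy_classes)
nll_on_log_probs = nn.NLLLoss()(toy_log_probs, toy_classes)
ce_on_log_probs = nn.CrossEntropyLoss()(toy_log_probs, toy_classes)

print(ce_on_logits.item(), nll_on_log_probs.item(), ce_on_log_probs.item())
```

The three values agree because applying log-softmax to values that are already log-probabilities leaves them unchanged. Either returning raw scores from `forward` or switching the loss to `nn.NLLLoss` would remove the redundant step.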


### Method to initialize the NN

In [None]:
import torch.optim as optim

train_loader_input, train_loader_target = next(iter(train_loader))
print(f"Train loader shape {train_loader_input.shape} and target loader shape {train_loader_target.shape}")
n_in_features = train_loader_input.shape[1]     # = 51
n_out_features = len(set(_targets))
#n_out_features = train_loader_target.shape[0]   # = 2
print(f'Building model with {n_in_features} in features and {n_out_features} out features')
model = Net(n_in_features,n_out_features) # On CPU
print(f"\n{model}\nWeights: {model.fc1.weight.shape}\nBias: {model.fc1.bias.shape}")

loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

### Method to train the model

In [None]:
epochs = 300
aggregated_losses = []

#categorical_train_data = train_df[_lst_cate_data_cols]
#numerical_train_data = 

print('Input train tensor data type:',train_loader_input.dtype,  'and', train_loader_input.size())
print('Target train tensor data type:',train_loader_target.dtype,  'and', train_loader_target.size())

print(f"Begin training for {epochs} epochs ...\n")
for i in range(1, epochs + 1):
    for k, (data, labels) in enumerate(train_loader):
        # CrossEntropyLoss expects integer class ids as targets
        labels = labels.type(torch.LongTensor)
#        print(f'{k} Label shape {labels.shape} and tensor looks like\n{labels}')
        optimizer.zero_grad()
        y_pred = model(data)
#        print(f'{k} Output shape {y_pred.shape} and tensor looks like\n{y_pred[0:2]}')
        single_loss = loss_function(y_pred, labels)
        # Update the weights on every batch, not only on the last batch of the epoch
        single_loss.backward()
        optimizer.step()
        # Store the scalar value so that the autograd graph is not kept alive
        aggregated_losses.append(single_loss.item())

    if i%25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
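
The split above sets aside `test_indices`, but the notebook does not yet use them. A minimal sketch of how a trained classifier could be scored on held-out rows follows. The model, inputs, and targets here are hypothetical stand-ins (`toy_model`, `toy_test_inputs`, `toy_test_targets`), so the printed accuracy says nothing about the strong motion data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-ins for the trained model and the held-out rows
toy_model = nn.Linear(5, 3)
toy_test_inputs = torch.randn(20, 5)
toy_test_targets = torch.randint(0, 3, (20,))

toy_model.eval()                               # switch off training-only behaviour such as dropout
with torch.no_grad():                          # gradients are not needed for evaluation
    toy_scores = toy_model(toy_test_inputs)
    toy_predicted = toy_scores.argmax(dim=1)   # most likely class per record

toy_accuracy = (toy_predicted == toy_test_targets).float().mean().item()
print(f"accuracy: {toy_accuracy:.2f}")
```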


In [None]:
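_lst_outputs = ['Origin_time','Mw','TectClass','Mech','LENGTH_km','TotalDuration']
# Assumes the label-encoded strong_motion_df from above; cast to float so torch accepts the array
outputs = torch.tensor(strong_motion_df[_lst_outputs].to_numpy(dtype=float))
print(f"{outputs.shape} \n{outputs}")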

### OneHotEncoding [DEPRECATE?]

In [None]:
'''
    DEPRECATE if the LabelEncoding is sufficient; else,
    TODO: OneHotEncode before building the tensors
'''

import logging
from functools import lru_cache
from sklearn.preprocessing import OneHotEncoder

logging.debug('Preprocessing data with OneHotEncoder')

'''
    OneHotEncoder to create the arrays for training, validation, and testing
'''
train_1hotenc = np.empty_like(train_ts)
for t_indx, t in enumerate(train_ts):
    print(t_indx,t.numpy())
    train_1hotenc[t_indx] = OneHotEncoder(categories='auto', drop=None, sparse=True, dtype='float64', 
                                  handle_unknown='error').fit_transform(t.numpy())
    print(train_1hotenc[t_indx])
train_1hotenc_ts = train_1hotenc
print(f"Shape of the OneHotencoded array: {train_1hotenc.shape}")
print(f"Datatype of the OneHotencoded array: {train_1hotenc.dtype}")
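
For comparison with the label encoding used earlier, the sketch below one-hot encodes a single categorical column. The array `toy_mech` holds illustrative focal-mechanism codes, not records from the flatfile, and the encoder is left at its default output format so that the example does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative focal-mechanism codes, one per record, shaped (n_samples, 1)
toy_mech = np.array([["S"], ["N"], ["R"], ["S"], ["U"]])

toy_encoder = OneHotEncoder(handle_unknown="error")
# The default output is a sparse matrix, so convert it to a dense array for printing
toy_one_hot = toy_encoder.fit_transform(toy_mech).toarray()

print(toy_encoder.categories_[0])   # column order of the encoded array
print(toy_one_hot)
```

Each record becomes a row with a single 1 in the column of its category, which avoids the artificial ordering that integer labels imply, at the cost of one extra column per category.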