<h1 style="text-align:center;">G2Net: End-to-End Pipeline</h1>
<img src='https://i.natgeofe.com/n/8ea26109-20c7-4cac-b099-86a1298957ee/colliding-black-holes.jpg'>

# Introduction
Gravitational waves (GW) are ripples in the fabric of space-time produced by violent astronomical events such as colliding black holes or merging neutron stars. These ripples are unimaginably tiny, and by the time they are captured by the detectors the signals are buried in detector noise.

## Objective:
In this Kaggle challenge, we explore how these noisy signals can be analysed and how GW can be detected in them. We approach the problem as __binary classification__.

## Data:
The training set consists of time series data containing simulated gravitational wave measurements from a network of 3 gravitational wave interferometers (__LIGO Hanford__, __LIGO Livingston__, and __Virgo__). Each time series contains either detector noise or detector noise plus a simulated gravitational wave signal. The task is to identify when a signal is present in the data (`target=1`). The simulated part is the gravitational wave signal, while the detector noise is real.

__Files__
* `train/` - the training set files, one npy file per observation; labels are provided in the file shown below
* `test/` - the test set files; you must predict the probability that the observation contains a gravitational wave
* `training_labels.csv` - target values of whether the associated signal contains a gravitational wave
* `sample_submission.csv` - a sample submission file in the correct format
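
The cells further down build each file path from the first three characters of the observation `id`. A minimal sketch of that directory layout, assuming the same Kaggle input directory used in this notebook; the helper name `id_to_path` and the sample ID are hypothetical and only serve as illustration:

```python
# Root of the competition data, as used in the loading cells of this notebook.
DATA_ROOT = "../input/g2net-gravitational-wave-detection"


def id_to_path(obs_id, split="train"):
    """Return the .npy path for an observation ID (hypothetical helper)."""
    # Files are nested three directories deep, keyed by the first three ID characters.
    return "{}/{}/{}/{}/{}/{}.npy".format(
        DATA_ROOT, split, obs_id[0], obs_id[1], obs_id[2], obs_id
    )


# "abc1234567" is a made-up ID, not one taken from training_labels.csv.
example_path = id_to_path("abc1234567")
print(example_path)
```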

In [None]:
!pip install -q dtaidistance astropy

In [None]:
import os
import json
import random
import collections
from pprint import pprint

import numpy as np
import pandas as pd
from scipy import spatial
from scipy import signal

from IPython.display import display, HTML, Markdown, clear_output
from tqdm.notebook import tqdm

import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.graph_objects as go
from dtaidistance import dtw
from dtaidistance import dtw_visualisation as dtwvis

from astropy.timeseries import LombScargle
import librosa

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, accuracy_score, classification_report, roc_auc_score, roc_curve, auc, plot_roc_curve
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

init_notebook_mode(connected=True)

random.seed(369)

# Data Exploration

## Data Loading and Distribution
First we load the `training_labels.csv` to see the distribution of the signals, check for any missing values and the distribution of the target variable.

In [None]:
train_df = pd.read_csv("../input/g2net-gravitational-wave-detection/training_labels.csv")
display(train_df.head(10))

In [None]:
# finding the number of missing values across the columns
display(Markdown('Display the missing values:'))
display(train_df.isnull().sum())

In [None]:
# finding the frequency of the target variables
freq_df = train_df.groupby('target').count().reset_index()
freq_df['target'] = freq_df['target'].astype(str)
freq_fig = px.bar(freq_df, x='target', y='id', 
                  color='target', title='Target Class Distribution')
freq_fig.update_layout(height=400, template='plotly_white')
freq_fig.add_layout_image(
    dict(
        source='https://img.icons8.com/color/452/black-hole.png',
        x=0.7, y=0.9, sizex=0.25, sizey=0.25,
    )
)
iplot(freq_fig)

Although the target distribution is balanced, we still need to make sure the model does not overfit during training.

As mentioned in the data description, we have 3 detector sites, and each of them has signal values of 2 seconds duration sampled at 2048 Hz. Now, we have to explore this data in depth.
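
The duration and sampling rate above fix the length of every series. A small sketch of the arithmetic, assuming only the two numbers quoted from the data description (the variable names are illustrative):

```python
import numpy as np

SAMPLE_RATE_HZ = 2048   # sampling frequency from the data description
DURATION_S = 2          # length of each observation in seconds

n_samples = SAMPLE_RATE_HZ * DURATION_S          # samples per detector series
time_s = np.arange(n_samples) / SAMPLE_RATE_HZ   # time stamp of each sample
nyquist_hz = SAMPLE_RATE_HZ / 2                  # highest frequency that can be resolved

print(n_samples, time_s[-1], nyquist_hz)
```

Each `.npy` file therefore holds one row of `n_samples` values per detector, which is why later cells index time with `range(4096)`.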

## Histogram

We can start off by filtering the data based on the target variable and taking a sample of data points from each site to calculate a sample histogram distribution of the data. Then we superimpose the results for data with and without a signal at the corresponding sites and see how the data differs.

In [None]:
def hist_plot(data, group_labels, det_colors, title):
    hist_fig = ff.create_distplot(data, group_labels, show_hist=False, colors=det_colors)
    hist_fig.update_layout(height=400, title_text=title, template="plotly_white")
    iplot(hist_fig)

In [None]:
sig_det_df = train_df[train_df["target"] == 1]
det_df = train_df[train_df["target"] == 0]

det1_sd, det2_sd, det3_sd = [], [], []
display(Markdown("For Target 1: Signal + Detector"))
for each_id in tqdm(random.sample(sig_det_df.index.tolist(), 100)):
    img_id = train_df.loc[each_id, "id"]
    target_var = train_df.loc[each_id, "target"]
    file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(
        img_id[0], img_id[1], img_id[2], img_id
    )
    train_arr = np.load(file_path)
    train_arr_df = pd.DataFrame.from_records(train_arr)
    det1_sd.append(train_arr_df.iloc[0, :])
    det2_sd.append(train_arr_df.iloc[1, :])
    det3_sd.append(train_arr_df.iloc[2, :])


det1_d, det2_d, det3_d = [], [], []
display(Markdown("For Target 0: Detector"))
for each_id in tqdm(random.sample(det_df.index.tolist(), 100)):
    img_id = train_df.loc[each_id, "id"]
    target_var = train_df.loc[each_id, "target"]
    file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(
        img_id[0], img_id[1], img_id[2], img_id
    )
    train_arr = np.load(file_path)
    train_arr_df = pd.DataFrame.from_records(train_arr)
    det1_d.append(train_arr_df.iloc[0, :])
    det2_d.append(train_arr_df.iloc[1, :])
    det3_d.append(train_arr_df.iloc[2, :])


det1_sd_df, det2_sd_df, det3_sd_df = pd.DataFrame(det1_sd), pd.DataFrame(det2_sd), pd.DataFrame(det3_sd)
det1_d_df, det2_d_df, det3_d_df = pd.DataFrame(det1_d), pd.DataFrame(det2_d), pd.DataFrame(det3_d)

det1_sd_df_mean, det2_sd_df_mean, det3_sd_df_mean = det1_sd_df.mean(axis=0), det2_sd_df.mean(axis=0), det3_sd_df.mean(axis=0)
det1_d_df_mean, det2_d_df_mean, det3_d_df_mean = det1_d_df.mean(axis=0), det2_d_df.mean(axis=0), det3_d_df.mean(axis=0)

all_det_hist_data = [det1_sd_df_mean, det1_d_df_mean, det2_sd_df_mean, det2_d_df_mean, det3_sd_df_mean, det3_d_df_mean]
all_det_colors = ['red', 'orange', 'purple', 'blue', 'darkgreen', 'lightgreen']

group_labels = ['Site #1 Target 1', 'Site #1 Target 0', 
                'Site #2 Target 1', 'Site #2 Target 0', 
                'Site #3 Target 1', 'Site #3 Target 0']
    
hist_plot(all_det_hist_data, group_labels, all_det_colors, 'Sample Average Histogram of All Site Detection')

## Data Visualization: Time-domain

We first split the data by the 2 classes and observe what individual records look like.

In [None]:
# randomly select 2 training files each with target 1 and 0 to see what the data looks like

def visualize_series(each_id):
    display(Markdown(f"Selected ID: {each_id}"))
    img_id = train_df.loc[each_id, "id"]
    target_var = train_df.loc[each_id, "target"]
    file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(
        img_id[0], img_id[1], img_id[2], img_id
    )
    train_arr = np.load(file_path)
    train_arr_df = pd.DataFrame.from_records(train_arr)
    display(train_arr_df)
    
    if target_var == 1:
        color = ["red", "purple", "darkgreen"]
    else:
        color = ["orange", "blue", "lightgreen"]

    train_fig = go.Figure()

    for idx in list(range(train_arr_df.shape[0])):
        train_fig.add_trace(
            go.Scatter(
                x=list(range(train_arr_df.shape[1])),
                y=train_arr_df.iloc[idx, :],
                mode="lines",
                name=f"Detector_{idx+1}",
                marker_color=color[idx],
                yaxis=["y", "y2", "y3"][idx]
            )
        )

    for idx in list(range(train_arr_df.shape[0])):
        train_fig.add_trace(
            go.Scatter(
                x=list(range(train_arr_df.shape[1])),
                y=train_arr_df.iloc[idx, :],
                mode="lines",
                name=f"Detector_{idx+1}",
                showlegend=False,
                marker_color=color[idx],
                yaxis="y4"
            )
        )

    train_fig.update_layout(
        hovermode="x",
        xaxis=dict(
            autorange=True,
            range=[0, train_arr_df.shape[1]],
            rangeslider=dict(autorange=True, range=[0, train_arr_df.shape[1]]),
        ),
        yaxis=dict(
            anchor="x",
            autorange=True,
            domain=[0, 0.2],
            linecolor="red",
            side="left",
            type="linear",
            zeroline=False,
        ),
        yaxis2=dict(
            anchor="x",
            autorange=True,
            domain=[0.25, 0.45],
            linecolor="blue",
            side="left",
            type="linear",
            zeroline=False,
        ),
        yaxis3=dict(
            anchor="x",
            autorange=True,
            domain=[0.5, 0.7],
            linecolor="green",
            side="left",
            type="linear",
            zeroline=False,
        ),
        yaxis4=dict(
            anchor="x",
            autorange=True,
            domain=[0.75, 0.95],
            side="left",
            type="linear",
            zeroline=False,
        ),
        title_text=f"GW Observation ID: {img_id} and target: {target_var} across 3 centers",
        template="plotly_white"
    )
    iplot(train_fig)


sig_det_df = train_df[train_df["target"] == 1]
det_df = train_df[train_df["target"] == 0]

display(Markdown("For Target 1: Signal + Detector"))
for each_id in random.sample(sig_det_df.index.tolist(), 2):
    visualize_series(each_id)

display(Markdown("For Target 0: Detector"))
for each_id in random.sample(det_df.index.tolist(), 2):
    visualize_series(each_id)

We don't see much difference! This was expected, wasn't it? These waves are so faint that they are difficult to observe in these time series plots. Let's try to understand the data from another perspective.

We randomly select `N` observations that __have the signal__ and, for each observation center, keep the corresponding series. For simplicity, let `S1` be one such series from center 1. We also select `M` random observations that have __no signal__. We then compute a _similarity_ score between `S1` and each of the `M` series from the same center using __cross-correlation__ and keep the closest match. This gives a rough idea of how a record with a signal compares visually with its most similar record without one.

Since the records are randomly selected, we can expect some error in the estimation: how the data was generated is not documented, so the result is difficult to validate. Also, there is no naming convention that would allow a proper comparison.
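
To make the idea of a cross-correlation score concrete, here is a minimal sketch on synthetic data. It is not the scoring function used in this notebook: it standardizes both series and searches over all lags, and the function name `normalized_xcorr_peak` is hypothetical.

```python
import numpy as np


def normalized_xcorr_peak(a, b):
    """Peak absolute value of the normalized cross-correlation of two equal-length series."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    # mode="full" evaluates every lag; dividing by the length makes the
    # zero-lag value of two identical series equal to 1.
    xcorr = np.correlate(a, b, mode="full") / len(a)
    return np.max(np.abs(xcorr))


rng = np.random.default_rng(0)
t = np.arange(512) / 2048
noise_a = rng.normal(size=t.size)
noise_b = rng.normal(size=t.size)
sine = np.sin(2 * np.pi * 60 * t)   # stand-in for a shared waveform

# Two records sharing a waveform score higher than two unrelated noise records.
shared = normalized_xcorr_peak(sine + 0.1 * noise_a, sine + 0.1 * noise_b)
unrelated = normalized_xcorr_peak(noise_a, noise_b)
print(shared, unrelated)
```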

In [None]:
sig_det_df = train_df[train_df["target"] == 1]
det_df = train_df[train_df["target"] == 0]


det1_rec_sd, det1_rec_d_tmp, det1_rec_d = [], [], []
det2_rec_sd, det2_rec_d_tmp, det2_rec_d = [], [], []
det3_rec_sd, det3_rec_d_tmp, det3_rec_d = [], [], []

for sd_id in tqdm(random.sample(sig_det_df.index.tolist(), 10)):
    sd_img_id = sig_det_df.loc[sd_id, 'id']
    file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(sd_img_id[0], sd_img_id[1], sd_img_id[2], sd_img_id)
    train_arr = np.load(file_path)
    train_arr_df = pd.DataFrame.from_records(train_arr)
    
    det1_rec_sd.append(train_arr_df.iloc[0, :])
    det2_rec_sd.append(train_arr_df.iloc[1, :])
    det3_rec_sd.append(train_arr_df.iloc[2, :])

for d_id in tqdm(random.sample(det_df.index.tolist(), 100)):
    d_img_id = det_df.loc[d_id, 'id']
    file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(d_img_id[0], d_img_id[1], d_img_id[2], d_img_id)
    train_arr = np.load(file_path)
    train_arr_df = pd.DataFrame.from_records(train_arr)
    
    det1_rec_d_tmp.append(train_arr_df.iloc[0, :])
    det2_rec_d_tmp.append(train_arr_df.iloc[1, :])
    det3_rec_d_tmp.append(train_arr_df.iloc[2, :])
    
# performing the time-domain, frequency-domain and power similarity

def compute_similarity(ref_rec, input_rec):
    """Return a dissimilarity score between two records (lower means more similar)."""
    ## Time domain similarity
    # np.correlate defaults to mode='valid'; for equal-length inputs this is the zero-lag value
    ref_time = np.correlate(ref_rec, ref_rec)
    inp_time = np.correlate(ref_rec, input_rec)
    diff_time = abs(ref_time - inp_time)

    ## Freq domain similarity
    ref_freq = np.correlate(np.fft.fft(ref_rec), np.fft.fft(ref_rec))
    inp_freq = np.correlate(np.fft.fft(ref_rec), np.fft.fft(input_rec))
    diff_freq = abs(ref_freq - inp_freq)

    ## Power similarity
    ref_power = np.sum(ref_rec**2)
    inp_power = np.sum(input_rec**2)
    diff_power = abs(ref_power - inp_power)

    # average of the three differences
    return (diff_time + diff_freq + diff_power) / 3


for rec_sd1, rec_sd2, rec_sd3 in zip(tqdm(det1_rec_sd), det2_rec_sd, det3_rec_sd):
    diff_rec1, diff_rec2, diff_rec3 = [], [], []
    for rec_d1 in det1_rec_d_tmp:
        sim = compute_similarity(rec_d1, rec_sd1)
        diff_rec1.append(sim)
    
    for rec_d2 in det2_rec_d_tmp:
        sim = compute_similarity(rec_d2, rec_sd2)
        diff_rec2.append(sim)
        
    for rec_d3 in det3_rec_d_tmp:
        sim = compute_similarity(rec_d3, rec_sd3)
        diff_rec3.append(sim)
    
    det1_rec_d.append(det1_rec_d_tmp[diff_rec1.index(min(diff_rec1))])
    det2_rec_d.append(det2_rec_d_tmp[diff_rec2.index(min(diff_rec2))])
    det3_rec_d.append(det3_rec_d_tmp[diff_rec3.index(min(diff_rec3))])

Now that we have matched similar `signal+detector` and `detector`-only records across all 3 sites, let's visualize them using a __heatmap__.

In [None]:
def show_heatmap(title, sd_rec, d_rec):
    hm_det_fig = make_subplots(rows=1, cols=2, subplot_titles=("Signal+Detector", "Detector"))
    hm_det_fig.add_trace(
        go.Heatmap(z=sd_rec, legendgroup='group1'), 
        row=1, col=1)
    hm_det_fig.add_trace(
        go.Heatmap(z=d_rec, legendgroup='group1', showscale=False), 
        row=1, col=2)

    hm_det_fig.update_layout(height=300, showlegend=False, title_text=title)
    iplot(hm_det_fig)
    
show_heatmap("Detector 1 Record Heatmap", det1_rec_sd, det1_rec_d)
show_heatmap("Detector 2 Record Heatmap", det2_rec_sd, det2_rec_d)
show_heatmap("Detector 3 Record Heatmap", det3_rec_sd, det3_rec_d)

Of course, we would expect a different result every time we run this, but we can still estimate how an observation with a signal would look compared with one without.

Now that we have an overall comparison of the signals across sites, we select the signals that are deemed similar at a site and check how similar they are using __Dynamic Time Warping (DTW)__.
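
For readers unfamiliar with DTW, the following is a textbook dynamic-programming sketch in plain NumPy. It is not the `dtaidistance` implementation and has no `window` constraint; the function name `dtw_distance` and the toy series are made up for illustration.

```python
import numpy as np


def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW with a squared-difference step cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])


a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])   # same bump, delayed by one step
c = np.array([2.0, 2.0, 2.0, 2.0, 2.0])        # flat series with no bump

print(dtw_distance(a, b), dtw_distance(a, c))
```

The delayed copy `b` aligns with `a` at no cost because warping absorbs the shift, whereas the flat series `c` cannot be warped onto the bump.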

In [None]:
# visualise similar signal from site 1
d, paths = dtw.warping_paths(det1_rec_sd[0], det1_rec_d[0], window=128)
best_path = dtw.best_path(paths)
display(Markdown('Site #1'))
display(dtwvis.plot_warpingpaths(det1_rec_sd[0], det1_rec_d[0], paths, best_path))

__UPDATE__: These diagrams give a better understanding of how similar the `signal+detector` data and the `detector` data are. _I had to delete some of these diagrams to avoid slowing down the notebook_.

## Data Visualization: Frequency-domain

So far we have covered time-domain visualization. Now let us inspect frequency-domain based analysis.

__Periodograms__

_Power Spectral Density (PSD)_ measures a signal's power content as a function of frequency. The __Lomb-Scargle periodogram__ is a statistical tool designed to detect and test the significance of weak periodic signals in unevenly sampled data, and it is widely used in astronomy.
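
Because the series here are evenly sampled, a plain FFT periodogram is a useful point of comparison. The sketch below is that simpler technique, not Lomb-Scargle; it runs on synthetic data, and the function name and scaling convention are assumptions chosen for illustration.

```python
import numpy as np


def fft_periodogram(y, sample_rate_hz):
    """Periodogram of an evenly sampled series over non-negative frequencies."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                                   # remove the DC offset
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(y.size, d=1.0 / sample_rate_hz)
    # One common scaling; the factor of 2 for one-sided spectra is omitted here.
    power = np.abs(spectrum) ** 2 / (sample_rate_hz * y.size)
    return freqs, power


fs = 2048
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * 100.0 * t) + 0.5 * rng.normal(size=t.size)   # 100 Hz tone in noise

freqs, power = fft_periodogram(y, fs)
peak_hz = freqs[np.argmax(power)]
print(peak_hz)
```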

In [None]:
# performing the Lomb-Scargle PSD
def ls_psd(y):
    freq, pwr = LombScargle(list(range(4096)), 
                            y).autopower(normalization='psd', 
                                         minimum_frequency=0.98, maximum_frequency=1.02)
    return freq, pwr

records = [det1_rec_sd, det1_rec_d, det2_rec_sd, det2_rec_d, det3_rec_sd, det3_rec_d]

ls_psd_res = []
for rec in records:
    frequency, power = ls_psd(rec[1])
    ls_psd_res.append([frequency, power])

psd_fig = make_subplots(rows=3, cols=2, 
                       subplot_titles=['Site 1: Signal+Detector', 'Site 1: Detector',
                                      'Site 2: Signal+Detector', 'Site 2: Detector',
                                      'Site 3: Signal+Detector', 'Site 3: Detector'])

psd_fig.add_trace(go.Scatter(x=ls_psd_res[0][0], y=ls_psd_res[0][1], mode='lines'), row=1, col=1)
psd_fig.add_trace(go.Scatter(x=ls_psd_res[1][0], y=ls_psd_res[1][1], mode='lines'), row=1, col=2)
psd_fig.add_trace(go.Scatter(x=ls_psd_res[2][0], y=ls_psd_res[2][1], mode='lines'), row=2, col=1)
psd_fig.add_trace(go.Scatter(x=ls_psd_res[3][0], y=ls_psd_res[3][1], mode='lines'), row=2, col=2)
psd_fig.add_trace(go.Scatter(x=ls_psd_res[4][0], y=ls_psd_res[4][1], mode='lines'), row=3, col=1)
psd_fig.add_trace(go.Scatter(x=ls_psd_res[5][0], y=ls_psd_res[5][1], mode='lines'), row=3, col=2)
psd_fig.update_layout(showlegend=False, title_text='Lomb-Scargle PSD')
iplot(psd_fig)

As we can see, at all sites the signal power is higher in the `signal+detector` records than in the `detector` records, by a magnitude of 2-3.

## CQT

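The constant-Q transform (CQT) transforms a data series to the frequency domain. It is related to the Fourier transform, but its frequency bins are geometrically spaced so that every bin has the same ratio of center frequency to bandwidth (the Q factor).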
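
The geometric spacing of CQT bins can be sketched without any audio library. The values `fmin_hz=32.70` (roughly the note C1) and 12 bins per octave are assumed here to match the librosa defaults; please check them against the installed version.

```python
import numpy as np


def cqt_center_frequencies(fmin_hz, n_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies of a constant-Q filter bank."""
    k = np.arange(n_bins)
    return fmin_hz * 2.0 ** (k / bins_per_octave)


# n_bins=50 as in the cell below; fmin_hz is an assumed default, not taken from the notebook.
freqs = cqt_center_frequencies(fmin_hz=32.70, n_bins=50)
ratios = freqs[1:] / freqs[:-1]   # constant ratio between neighbouring bins

print(freqs[0], freqs[-1], ratios[0])
```

Under these assumptions the highest bin stays well below the 1024 Hz Nyquist limit of a 2048 Hz series.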

In [None]:
def compute_cqt(data):
    return np.abs(librosa.cqt(np.array(data)/np.max(np.array(data)), sr=2048, n_bins=50, pad_mode='median'))

show_heatmap("Detector 1 CQT Heatmap", compute_cqt(det1_rec_sd[1]), compute_cqt(det1_rec_d[1]))
show_heatmap("Detector 2 CQT Heatmap", compute_cqt(det2_rec_sd[1]), compute_cqt(det2_rec_d[1]))
show_heatmap("Detector 3 CQT Heatmap", compute_cqt(det3_rec_sd[1]), compute_cqt(det3_rec_d[1]))

# Data Preparation and Processing

First, we take a sample of the original dataset and prepare it for the classifier models.

Here, we have 3 detector sites, and each of them has records with a signal (target = 1) and without a signal (target = 0). So instead of splitting the consolidated records of `training_labels.csv` into __train__ and __validation__ sets, we split them per detector site.

We will compute __Mel-frequency cepstral coefficients (MFCC)__ to obtain the frequency components used in the rest of this analysis.
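
It helps to know how many features the flattened MFCC matrix yields per series. The arithmetic below is a sketch that assumes the librosa defaults (20 coefficients, hop length 512, centered frames), none of which the notebook overrides; treat those numbers as assumptions to verify against the installed version.

```python
n_samples = 4096     # samples per detector series (2 s at 2048 Hz)
hop_length = 512     # assumed librosa default hop between frames
n_mfcc = 20          # assumed librosa default number of coefficients

# With centered framing, the number of frames is 1 + n_samples // hop_length.
n_frames = 1 + n_samples // hop_length
n_features = n_mfcc * n_frames   # length of the vector after .flatten()

print(n_frames, n_features)
```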

In [None]:
sig_det_df = train_df[train_df["target"] == 1]
det_df = train_df[train_df["target"] == 0]

det1_total_sd, det2_total_sd, det3_total_sd = {}, {}, {}
det1_total_d, det2_total_d, det3_total_d = {}, {}, {}

for sd_id in tqdm(random.sample(sig_det_df.index.tolist(), 2000)):
    sd_img_id = sig_det_df.loc[sd_id, 'id']
    sd_file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(sd_img_id[0], sd_img_id[1], sd_img_id[2], sd_img_id)
    sd_arr = np.load(sd_file_path, mmap_mode='r')
    sd_arr_df = pd.DataFrame.from_records(sd_arr)
    
    sd1, sd2, sd3 = np.array(sd_arr_df.iloc[0, :]), np.array(sd_arr_df.iloc[1, :]), np.array(sd_arr_df.iloc[2, :])
    
    # pass the series as `y=`; recent librosa versions require keyword arguments here
    det1_total_sd[sd_img_id] = librosa.feature.mfcc(y=sd1/max(sd1), sr=2048).flatten()
    det2_total_sd[sd_img_id] = librosa.feature.mfcc(y=sd2/max(sd2), sr=2048).flatten()
    det3_total_sd[sd_img_id] = librosa.feature.mfcc(y=sd3/max(sd3), sr=2048).flatten()

for d_id in tqdm(random.sample(det_df.index.tolist(), 2000)):
    d_img_id = det_df.loc[d_id, 'id']
    d_file_path = "../input/g2net-gravitational-wave-detection/train/{}/{}/{}/{}.npy".format(d_img_id[0], d_img_id[1], d_img_id[2], d_img_id)
    d_arr = np.load(d_file_path, mmap_mode='r')
    d_arr_df = pd.DataFrame.from_records(d_arr)
    
    d1, d2, d3 = np.array(d_arr_df.iloc[0, :]), np.array(d_arr_df.iloc[1, :]), np.array(d_arr_df.iloc[2, :])
    
    # pass the series as `y=`; recent librosa versions require keyword arguments here
    det1_total_d[d_img_id] = librosa.feature.mfcc(y=d1/max(d1), sr=2048).flatten()
    det2_total_d[d_img_id] = librosa.feature.mfcc(y=d2/max(d2), sr=2048).flatten()
    det3_total_d[d_img_id] = librosa.feature.mfcc(y=d3/max(d3), sr=2048).flatten()

We now have the records of sites 1, 2 and 3 with and without signals stored separately: `det1_total_sd` contains the site 1 data with `s`ignal+`d`etector and `det1_total_d` contains the site 1 data with only `d`etector. Likewise for site 2 and site 3.

Next, we concatenate the site 1 records into a single dataset, i.e. `det1_total_sd + det1_total_d`. Likewise for site 2 and site 3.

In [None]:
# site 1
det1_total_sd = pd.DataFrame.from_dict(det1_total_sd, orient='index')
det1_total_sd['target'] = [1]*det1_total_sd.shape[0]

det1_total_d = pd.DataFrame.from_dict(det1_total_d, orient='index')
det1_total_d['target'] = [0]*det1_total_d.shape[0]

det1_total = pd.concat([det1_total_sd, det1_total_d])

# site 2
det2_total_sd = pd.DataFrame.from_dict(det2_total_sd, orient='index')
det2_total_sd['target'] = [1]*det2_total_sd.shape[0]

det2_total_d = pd.DataFrame.from_dict(det2_total_d, orient='index')
det2_total_d['target'] = [0]*det2_total_d.shape[0]

det2_total = pd.concat([det2_total_sd, det2_total_d])

# site 3
det3_total_sd = pd.DataFrame.from_dict(det3_total_sd, orient='index')
det3_total_sd['target'] = [1]*det3_total_sd.shape[0]

det3_total_d = pd.DataFrame.from_dict(det3_total_d, orient='index')
det3_total_d['target'] = [0]*det3_total_d.shape[0]

det3_total = pd.concat([det3_total_sd, det3_total_d])

Now that we have the records of the respective sites, we split each site's data into train and validation sets in the conventional way. Afterwards we will concatenate them to get the final train and validation datasets.

In [None]:
df_train_1, df_valid_1, y_train_1, y_valid_1 = train_test_split(
    det1_total.iloc[:, :-1], 
    det1_total.iloc[:, -1],
    test_size=0.2, 
    random_state=36, 
    stratify=det1_total.iloc[:, -1]
)

df_train_2, df_valid_2, y_train_2, y_valid_2 = train_test_split(
    det2_total.iloc[:, :-1], 
    det2_total.iloc[:, -1],
    test_size=0.2, 
    random_state=36, 
    stratify=det2_total.iloc[:, -1]
)

df_train_3, df_valid_3, y_train_3, y_valid_3 = train_test_split(
    det3_total.iloc[:, :-1], 
    det3_total.iloc[:, -1],
    test_size=0.2, 
    random_state=36, 
    stratify=det3_total.iloc[:, -1]
)

I took the column-wise mean of the data points for each site and plotted the distributions. The distribution for site 3 differs from the others, so I applied scaling, after which the data across all sites is relatively well aligned.
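
When scaling, the minimum and range are usually learned from the training split and then reused on the validation split, so that both splits share one mapping. A minimal NumPy sketch of that idea on synthetic data (the function names are hypothetical, and this is a stand-in for `MinMaxScaler`, not its implementation):

```python
import numpy as np


def fit_min_max(x_train):
    """Learn the per-column minimum and range from the training split only."""
    col_min = x_train.min(axis=0)
    col_range = x_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # guard against constant columns
    return col_min, col_range


def apply_min_max(x, col_min, col_range):
    """Apply a previously learned min-max mapping to any split."""
    return (x - col_min) / col_range


rng = np.random.default_rng(2)
x_train = rng.normal(loc=5.0, scale=3.0, size=(80, 4))
x_valid = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

col_min, col_range = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, col_min, col_range)
x_valid_scaled = apply_min_max(x_valid, col_min, col_range)   # may fall slightly outside [0, 1]
```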

In [None]:
# perform scaling: fit one scaler per site on the training split only,
# then reuse it on the matching validation split to avoid leakage

sclr_1, sclr_2, sclr_3 = MinMaxScaler(), MinMaxScaler(), MinMaxScaler()
x_train_1 = sclr_1.fit_transform(np.array(df_train_1))
x_train_2 = sclr_2.fit_transform(np.array(df_train_2))
x_train_3 = sclr_3.fit_transform(np.array(df_train_3))

x_valid_1 = sclr_1.transform(np.array(df_valid_1))
x_valid_2 = sclr_2.transform(np.array(df_valid_2))
x_valid_3 = sclr_3.transform(np.array(df_valid_3))

# displaying the data distribution before and after scaling
hist_data = [df_train_1.mean(axis=0), df_train_2.mean(axis=0), df_train_3.mean(axis=0),
             df_valid_1.mean(axis=0), df_valid_2.mean(axis=0), df_valid_3.mean(axis=0)]

scaled_hist_data = [x_train_1.mean(axis=0), x_train_2.mean(axis=0), x_train_3.mean(axis=0),
                    x_valid_1.mean(axis=0), x_valid_2.mean(axis=0), x_valid_3.mean(axis=0)]
det_colors = ['red', 'purple', 'darkgreen', 'orange', 'blue', 'lightgreen']
group_labels = ['Train 1', 'Train 2', 'Train 3', 'Valid 1', 'Valid 2', 'Valid 3']

hist_plot(hist_data, group_labels, det_colors, 'Sample Average Histogram of All Site Detection (Original Data)')
hist_plot(scaled_hist_data, group_labels, det_colors, 'Sample Average Histogram of All Site Detection (Scaled Data)')

We can observe that the scaled data is relatively well distributed compared with the original. Now we concatenate the data points from all these sites.

In [None]:
df_train_final = pd.concat([pd.DataFrame.from_records(x_train_1), 
                            pd.DataFrame.from_records(x_train_2), 
                            pd.DataFrame.from_records(x_train_3)])
y_train_final = pd.concat([y_train_1, y_train_2, y_train_3])
df_valid_final = pd.concat([pd.DataFrame.from_records(x_valid_1), 
                            pd.DataFrame.from_records(x_valid_2), 
                            pd.DataFrame.from_records(x_valid_3)])
y_valid_final = pd.concat([y_valid_1, y_valid_2, y_valid_3])

display(Markdown(f'__Train Dimension:__ {df_train_final.shape}'))
display(Markdown(f'__Valid Dimension:__ {df_valid_final.shape}'))

# Feature Importance and Selection

Feature importance refers to a class of techniques that assign a score to each input feature of a predictive model, indicating the relative importance of that feature when making a prediction.

The scores are useful and can be used in a range of situations in a predictive modeling problem, such as:
* Better understanding the data.
* Better understanding a model.
* Reducing the number of input features.

There are various ways by which this operation can be performed:
* Coefficients as Feature Importance
* Decision Tree Feature Importance
* Permutation Feature Importance

We will proceed with __Decision Tree Feature Importance__, computed with three tree-based models: Random Forest, XGBoost and a single Decision Tree.

In [None]:
def calculate_feature_importance(model, model_name, x_train, y_train):
    model_fe = model.fit(x_train, y_train)

    fe = model_fe.feature_importances_
    fe_sorted = [fe[i] for i in fe.argsort()]
    feature_names = [f'feature_{i}' for i in fe.argsort()]
    
    mean_fe_score = np.mean(fe)
    selected_fe = [i for i, scr in enumerate(fe) if scr>mean_fe_score]

    fe_fig = go.Figure()
    fe_fig.add_trace(go.Bar(x=feature_names, y=fe_sorted))
    fe_fig.update_layout(height=400, title_text=f'{model_name} Feature Importance')
    iplot(fe_fig)
    
    return selected_fe

In [None]:
rf_selected_features = calculate_feature_importance(RandomForestClassifier(random_state=369, n_jobs=-1, class_weight='balanced'),
                                                    'Random Forest', df_train_final, y_train_final)

In [None]:
xgb_selected_features = calculate_feature_importance(xgb.XGBClassifier(objective="binary:logistic", booster='gbtree', eval_metric='error', 
                                                                      random_state=369, use_label_encoder=False, n_jobs=-1),
                                                    'XGBoost', df_train_final, y_train_final)

In [None]:
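# max_features='auto' was removed in newer scikit-learn; 'sqrt' is the equivalent setting for classifiers
dt_selected_features = calculate_feature_importance(DecisionTreeClassifier(random_state=369, class_weight='balanced', max_features='sqrt'),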
                                                    'Decision Tree', df_train_final, y_train_final)

Usually we would keep only the features that all of these methods consider relevant. So we take the intersection of the selected features to obtain the final feature set.
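
The selection rule used above (keep features scoring above the mean importance, then intersect across models) can be sketched on toy numbers. The importance scores below are made up and the helper name `above_mean` is hypothetical:

```python
import numpy as np


def above_mean(importances):
    """Indices of features whose importance exceeds the mean importance."""
    importances = np.asarray(importances)
    return {int(i) for i in np.flatnonzero(importances > importances.mean())}


# Made-up importance scores for five features from three hypothetical models.
model_a = [0.40, 0.30, 0.10, 0.15, 0.05]
model_b = [0.35, 0.05, 0.10, 0.45, 0.05]
model_c = [0.50, 0.25, 0.05, 0.15, 0.05]

# A feature survives only if every model ranks it above its own mean.
common = above_mean(model_a) & above_mean(model_b) & above_mean(model_c)
print(sorted(common))
```

The intersection is deliberately strict: a feature that one model finds unimportant is dropped even if the other two rank it highly.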

In [None]:
display(Markdown('__Number of relevant features__'))
display(Markdown(f'Random Forest: {len(rf_selected_features)} || XGBoost {len(xgb_selected_features)} || Decision Tree {len(dt_selected_features)}'))

# intersection
final_selected_features = list(set(rf_selected_features) & set(xgb_selected_features) & set(dt_selected_features))
display(Markdown('__Number of common relevant features__'))
display(len(final_selected_features))

# Modeling
We will run a range of binary classifiers, tuning each with grid-search cross-validation over a set of classifier-specific parameters, then ensemble them and assess how good the results are.
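
Grid search cost grows quickly with the size of the parameter grid, so it is worth counting the candidates first. A small sketch with `itertools`, using a grid shaped like the Random Forest grid later in this notebook and assuming the default of 5 folds for `StratifiedKFold`:

```python
from itertools import product

# Example grid in the shape of a scikit-learn `param_grid` dictionary.
param_grid = {
    "n_estimators": [100, 150],
    "max_depth": [10, 20],
    "max_leaf_nodes": [20, 40],
}
n_folds = 5  # assumed: the default number of splits of StratifiedKFold

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds   # excludes the final refit on the full training set

print(len(candidates), total_fits)
```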

In [None]:
# defining the utils

#function for cross validation
def cross_valid_result(classifier, params, x_train, y_train):
    skf = StratifiedKFold(shuffle=True, random_state=369)
    grid_search = GridSearchCV(classifier, param_grid=params,
                              cv=skf.split(x_train, y_train),
                              scoring='accuracy', n_jobs=-1, verbose=2)
    model_opt = grid_search.fit(x_train, y_train)

    cv_res_df = pd.DataFrame(data=model_opt.cv_results_)
    cv_res_df.drop(['mean_fit_time', 'std_fit_time', 'mean_score_time', 
                    'std_score_time', 'std_test_score', 'params'], axis=1, inplace=True)
    cv_res_df_cols = list(cv_res_df.columns)
    cv_res_df_cols.insert(0, cv_res_df_cols.pop())
    cv_res_df = cv_res_df.reindex(columns=cv_res_df_cols).sort_values(by='rank_test_score')
    
    opt_params = model_opt.best_params_
    classifier = classifier.set_params(**opt_params)
    
    final_model = classifier.fit(x_train, y_train)
    y_train_pred = final_model.predict(x_train)
    clear_output()
    
    display(Markdown('__Cross Validation Result__'))
    display(cv_res_df.head().round(4))
    
    cv_scores = cross_val_score(classifier, x_train, y_train, scoring='accuracy')
    mean_cv_score = np.mean(cv_scores)
    display(Markdown('__Mean Cross Validation Score__'))
    display(round(mean_cv_score, 4))
    
    display(Markdown('__Best Parameters__'))
    for k,v in opt_params.items():
        display(Markdown(f'__{k}__: {v}'))
    
    return final_model, y_train_pred


# function for roc plot
def roc_plot(x_train, y_train, x_valid, y_valid, model, model_name):
    y_train_pred_probs = model.predict_proba(x_train)[:, 1]
    y_train_auc = roc_auc_score(y_train, y_train_pred_probs)
    yt_fpr, yt_tpr, _ = roc_curve(y_train, y_train_pred_probs, pos_label = 1)

    y_valid_pred_probs = model.predict_proba(x_valid)[:, 1]
    y_valid_auc = roc_auc_score(y_valid, y_valid_pred_probs)
    yv_fpr, yv_tpr, _ = roc_curve(y_valid, y_valid_pred_probs, pos_label = 1)

    roc_fig = make_subplots(rows=1, cols=2, subplot_titles=['Train Data', 'Validation Data'])

    roc_fig.add_trace(go.Scatter(x=yt_fpr, y=yt_tpr, mode='lines', 
                                 name=f'Train AUC = {round(y_train_auc, 4)}', line_shape='linear'), row=1, col=1)
    roc_fig.add_trace(go.Scatter(x=[0,1], y=[0,1], showlegend=False,
                                 line=dict(dash='dash'), line_shape='linear'), row=1, col=1)

    roc_fig.add_trace(go.Scatter(x=yv_fpr, y=yv_tpr, mode='lines', 
                                 name=f'Validation AUC = {round(y_valid_auc, 4)}', line_shape='linear'), row=1, col=2)
    roc_fig.add_trace(go.Scatter(x=[0,1], y=[0,1], showlegend=False,
                                 line=dict(dash='dash'), line_shape='linear'), row=1, col=2)

    roc_fig.update_layout(title=f'ROC of {model_name}', title_x=0.5)

    roc_fig.update_xaxes(title_text="False Positive Rate", row=1, col=1)
    roc_fig.update_xaxes(title_text="False Positive Rate", row=1, col=2)

    roc_fig.update_yaxes(title_text="True Positive Rate", row=1, col=1)
    roc_fig.update_yaxes(title_text="True Positive Rate", row=1, col=2)
    iplot(roc_fig)
    

# function to calculate model metrics
def model_metric(model, y_train, y_pred_train, x_valid, y_valid, model_name):
    classes = [0,1]
    acc_train = accuracy_score(y_train, y_pred_train)*100
    class_report_train = classification_report(y_train, y_pred_train, 
                                               target_names=classes, output_dict=True)
    y_pred_valid = model.predict(x_valid)
    acc_valid = accuracy_score(y_valid, y_pred_valid)*100 
    class_report_valid = classification_report(y_valid, y_pred_valid, 
                                               target_names=classes, output_dict = True)
    
    display(Markdown('__Overall Statistics__'))
    display(Markdown('Accuracy on TRAIN DATA: __{}%__'.format(round(acc_train,4))))
    display(Markdown('Accuracy on VALIDATION DATA: __{}%__'.format(round(acc_valid,4))))
    
    cr_df_train = pd.DataFrame(class_report_train).T 
    cr_df_valid = pd.DataFrame(class_report_valid).T
    
    cm_train = confusion_matrix(y_train,y_pred_train)
    cm_valid = confusion_matrix(y_valid,y_pred_valid)
    
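    # Keep only the 'macro avg' and 'weighted avg' rows (skipping the per-class rows and 'accuracy'),
    # then transpose so that each averaging scheme becomes a column.
    train_avg_metric = cr_df_train[len(classes)+1:].T
    valid_avg_metric = cr_df_valid[len(classes)+1:].T
    # Add the micro average as a (precision, recall, f1-score, support) tuple; support is None here.
    train_avg_metric['micro avg'] = precision_recall_fscore_support(y_train, y_pred_train, average='micro')
    valid_avg_metric['micro avg'] = precision_recall_fscore_support(y_valid, y_pred_valid, average='micro')
    # Drop the trailing 'support' row, which is undefined for the micro average.
    train_avg_metric = train_avg_metric.iloc[:-1,:]
    valid_avg_metric = valid_avg_metric.iloc[:-1,:]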
    
    display(Markdown(f'_Overall Average Metrics of **{model_name}**: **TRAIN DATA**_')) 
    display(train_avg_metric)
    display(Markdown(f'_Overall Average Metrics of **{model_name}**: **VALIDATION DATA**_')) 
    display(valid_avg_metric)
    
    cm_fig_t = ff.create_annotated_heatmap(cm_train, x=classes, y=classes, colorscale='darkmint', showscale=True)
    cm_fig_t.update_layout(title='Confusion Matrix TRAIN DATA', xaxis_title="Predicted", yaxis_title="Reference",
                          height=400, width=400, title_x=0.5)
    iplot(cm_fig_t)
    
    cm_fig_v = ff.create_annotated_heatmap(cm_valid, x=classes, y=classes, colorscale='darkmint', showscale=True)
    cm_fig_v.update_layout(title='Confusion Matrix VALIDATION DATA', xaxis_title="Predicted", yaxis_title="Reference",
                          height=400, width=400, title_x=0.5)
    iplot(cm_fig_v)
    
    return acc_train, acc_valid


models_acc_train, models_acc_valid = [], []
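
A short note on the `'micro avg'` column that `model_metric` adds: for single-label classification like this one, micro-averaged precision and recall pool the counts over both classes, so they coincide with plain accuracy. The sketch below checks this with NumPy on made-up labels (`y_true` and `y_pred` are illustrative values, not notebook data).

```python
import numpy as np

# Made-up labels for illustration only (not taken from the notebook's data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

# Micro averaging pools true positives, false positives and false negatives
# over every class before computing the ratios.
tp = fp = fn = 0
for cls in (0, 1):
    tp += int(np.sum((y_pred == cls) & (y_true == cls)))
    fp += int(np.sum((y_pred == cls) & (y_true != cls)))
    fn += int(np.sum((y_pred != cls) & (y_true == cls)))

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
accuracy = float(np.mean(y_true == y_pred))

print(micro_precision, micro_recall, accuracy)
```

Because every misclassified sample is one false positive for the predicted class and one false negative for the true class, the pooled `fp` and `fn` counts are always equal in this setting.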

## Random Forest

In [None]:
rf_params = {'n_estimators': [100, 150], 
            'max_depth': [10, 20],
            'max_leaf_nodes': [20, 40]}

rf_classifier = RandomForestClassifier(random_state=369, n_jobs=-1, class_weight='balanced')
rf_model, rf_y_train_pred = cross_valid_result(rf_classifier, rf_params, 
                                               df_train_final.iloc[:, final_selected_features], 
                                               y_train_final)

In [None]:
rf_t_acc, rf_v_acc = model_metric(rf_model, y_train_final, rf_y_train_pred, 
                                  df_valid_final.iloc[:, final_selected_features], y_valid_final, 'Random Forest')

models_acc_train.append(rf_t_acc)
models_acc_valid.append(rf_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         rf_model, 'Random Forest')

## Logistic Regression

In [None]:
lr_params = {'max_iter': [100, 200], 
            'tol': [1e-3, 1e-4],
            'C': [1, 2]}

lr_classifier = LogisticRegression(solver='liblinear', random_state=369)
lr_model, lr_y_train_pred = cross_valid_result(lr_classifier, lr_params, 
                                               df_train_final.iloc[:, final_selected_features], 
                                               y_train_final)

In [None]:
lr_t_acc, lr_v_acc = model_metric(lr_model, y_train_final, lr_y_train_pred, 
                                  df_valid_final.iloc[:, final_selected_features], y_valid_final, 'Logistic Regression')

models_acc_train.append(lr_t_acc)
models_acc_valid.append(lr_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         lr_model, 'Logistic Regression')

## Decision Tree

In [None]:
dt_params = {'max_depth': [50, 100],
            'min_samples_leaf': [1, 2]}
# 'sqrt' reproduces the former max_features='auto' behaviour, which scikit-learn 1.3 removed
dt_classifier = DecisionTreeClassifier(random_state=369, class_weight='balanced', max_features='sqrt')

dt_model, dt_y_train_pred = cross_valid_result(dt_classifier, dt_params, 
                                               df_train_final.iloc[:, final_selected_features], 
                                               y_train_final)

In [None]:
dt_t_acc, dt_v_acc = model_metric(dt_model, y_train_final, dt_y_train_pred, 
                                  df_valid_final.iloc[:, final_selected_features], y_valid_final, 'Decision Tree')

models_acc_train.append(dt_t_acc)
models_acc_valid.append(dt_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         dt_model, 'Decision Tree')

## AdaBoost

In [None]:
ab_params = {'n_estimators': [100, 150]}
ab_classifier = AdaBoostClassifier(random_state=369)

ab_model, ab_y_train_pred = cross_valid_result(ab_classifier, ab_params, 
                                               df_train_final.iloc[:, final_selected_features], 
                                               y_train_final)

In [None]:
ab_t_acc, ab_v_acc = model_metric(ab_model, y_train_final, ab_y_train_pred, 
                                  df_valid_final.iloc[:, final_selected_features], y_valid_final, 'AdaBoost')

models_acc_train.append(ab_t_acc)
models_acc_valid.append(ab_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         ab_model, 'AdaBoost')

## Gaussian Naive Bayes

In [None]:
nb_classifier = GaussianNB()
nb_model = nb_classifier.fit(df_train_final.iloc[:, final_selected_features], y_train_final)
nb_y_train_pred = nb_model.predict(df_train_final.iloc[:, final_selected_features])

In [None]:
nb_t_acc, nb_v_acc = model_metric(nb_model, y_train_final, nb_y_train_pred, 
                                  df_valid_final.iloc[:, final_selected_features], y_valid_final, 'Naive Bayes')

models_acc_train.append(nb_t_acc)
models_acc_valid.append(nb_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         nb_model, 'Naive Bayes')

## XGBoost

In [None]:
xgb_classifier = xgb.XGBClassifier(objective="binary:logistic", booster='gbtree',
                                   random_state=369, use_label_encoder=False, n_jobs=-1)

xgb_model = xgb_classifier.fit(df_train_final.iloc[:, final_selected_features], y_train_final)
xgb_y_train_pred = xgb_model.predict(df_train_final.iloc[:, final_selected_features])

In [None]:
xgb_t_acc, xgb_v_acc = model_metric(xgb_model, y_train_final, xgb_y_train_pred, 
                                    df_valid_final.iloc[:, final_selected_features], y_valid_final, 'XG Boost')

models_acc_train.append(xgb_t_acc)
models_acc_valid.append(xgb_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         xgb_model, 'XG Boost')

## Ensemble Method

In [None]:
from sklearn.ensemble import VotingClassifier

eclf = VotingClassifier(estimators=[('rf', rf_model), 
                                    ('lr', lr_model), 
                                    ('nb', nb_model),
                                    ('dt', dt_model),
                                    ('ab', ab_model),
                                    ('xgb', xgb_model)
                                   ],
                        voting='soft', n_jobs=-1)

ens_model = eclf.fit(df_train_final.iloc[:, final_selected_features], y_train_final)
ens_y_train_pred = ens_model.predict(df_train_final.iloc[:, final_selected_features])

In [None]:
ens_t_acc, ens_v_acc = model_metric(ens_model, y_train_final, ens_y_train_pred, 
                                    df_valid_final.iloc[:, final_selected_features], y_valid_final, 'Ensembled')

models_acc_train.append(ens_t_acc)
models_acc_valid.append(ens_v_acc)

roc_plot(df_train_final.iloc[:, final_selected_features], y_train_final, 
         df_valid_final.iloc[:, final_selected_features], y_valid_final, 
         ens_model, 'Ensembled')
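
With `voting='soft'` and no weights, `VotingClassifier` averages the class probabilities predicted by its estimators and picks the class with the highest mean. Below is a minimal NumPy sketch of that rule, contrasted with majority (hard) voting. The probabilities are made up for three hypothetical classifiers and are not outputs of the models above.

```python
import numpy as np

# Hypothetical predict_proba outputs of three classifiers for four samples.
# Shape: (n_classifiers, n_samples, n_classes); columns are P(target=0), P(target=1).
probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45]],
    [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.55, 0.45]],
    [[0.7, 0.3], [0.3, 0.7], [0.6, 0.4], [0.10, 0.90]],
])

# Soft voting: average the probabilities over classifiers, then take the argmax.
avg_probs = probs.mean(axis=0)
soft_pred = avg_probs.argmax(axis=1)

# Hard voting, for comparison: each classifier votes with its own argmax,
# and class 1 wins when more than half of the classifiers vote for it.
votes = probs.argmax(axis=2)  # shape (n_classifiers, n_samples)
hard_pred = (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

print("soft:", soft_pred)
print("hard:", hard_pred)
```

The last sample is where the two rules disagree: one confident classifier outweighs two lukewarm ones under soft voting, while the majority vote goes the other way.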

# Model Comparison

In [None]:
model_name_list = ['RF', 'LR', 'DT', 'AB', 'NB', 'XGB', 'Ensemble']

comp_fig = go.Figure(data=[
    go.Bar(name='Train', x=model_name_list, y=models_acc_train),
    go.Bar(name='Validation', x=model_name_list, y=models_acc_valid)
])

# Change the bar mode
comp_fig.update_layout(barmode='group')
comp_fig.show()

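Which model comes out best can be expected to vary from run to run, so read the ranking off the chart above rather than assuming a fixed winner.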
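
To read the comparison off as numbers rather than bars, the accuracies can be put in a small table sorted by validation accuracy, with the train-validation gap as a rough overfitting indicator. This is only a sketch: `rank_models` is a hypothetical helper, not part of the notebook, and the demo values are placeholders.

```python
import pandas as pd

def rank_models(names, acc_train, acc_valid):
    """Return a table sorted by validation accuracy, with the train-validation gap."""
    table = pd.DataFrame({
        "model": names,
        "train_acc": acc_train,
        "valid_acc": acc_valid,
    })
    # A large positive gap suggests the model fits the training data
    # much better than unseen data.
    table["gap"] = table["train_acc"] - table["valid_acc"]
    return table.sort_values("valid_acc", ascending=False).reset_index(drop=True)

# Placeholder numbers for illustration only, NOT results from this notebook.
demo = rank_models(["A", "B", "C"], [99.0, 71.5, 80.0], [68.0, 70.5, 69.0])
print(demo)
```

In the notebook itself, the call would be `rank_models(model_name_list, models_acc_train, models_acc_valid)`.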

This is not yet the best version. I will continue to explore other methods to get better results, both in dataset creation and in model development.

# To Be Continued...