# G2Net
Solution for the https://www.kaggle.com/c/g2net-gravitational-wave-detection challenge.

### Inspiration
There are a number of other interesting solutions for this task:
- Exploratory Analysis: https://www.kaggle.com/ihelon/g2net-eda-and-modeling
- Baseline: https://www.kaggle.com/yasufuminakama/g2net-efficientnet-b7-baseline-inference
- Training: https://www.kaggle.com/yasufuminakama/g2net-efficientnet-b7-baseline-training

### Library Imports


In [None]:
# Python Library Imports
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from tqdm.notebook import tqdm, trange  # replaces the deprecated tnrange / tqdm_notebook

# might also need:
import torch
import json
import collections
import cv2

# Notes on Kaggle Environment:
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# input folders
path_input = '/kaggle/input/g2net-gravitational-wave-detection/'
folder_train = 'train'
folder_test = 'test'

# reading submission and training label files. 
path_output = '/kaggle/working'
submission_df = pd.read_csv(path_input + 'sample_submission.csv')
train_df = pd.read_csv(path_input + 'training_labels.csv')
train_df

## Data Exploration 
Here, we will explore the form and shape of the training data provided, with a focus on how we can best train a model on it. 

In the G2Net competition, we are provided with a time series dataset containing simulated gravitational wave measurements. These measurements are simulated for three independent gravitational wave interferometers: LIGO Hanford, LIGO Livingston, and Virgo. 

Each time series contains either detector noise, or detector noise with a simulated gravitational wave signal applied on top. Our model must identify when a signal is present in the data (target = 1). 

In [None]:
# visualisation code referenced from: 
# https://www.kaggle.com/ihelon/g2net-eda-and-modeling

def convert_id_to_path(_id):
    return f"{path_input}{folder_train}/{_id[0]}/{_id[1]}/{_id[2]}/{_id}.npy"

def visualize_sample( _id, target, 
    colors=("black", "red", "green"), 
    signal_names=("LIGO Hanford", "LIGO Livingston", "Virgo")):
    path = convert_id_to_path(_id)
    x = np.load(path)
    print(x.shape)
    plt.figure(figsize=(16, 7))
    for i in range(3):
        plt.subplot(4, 1, i + 1)
        plt.plot(x[i], color=colors[i])
        plt.legend([signal_names[i]], fontsize=12, loc="lower right")
        
        plt.subplot(4, 1, 4)
        plt.plot(x[i], color=colors[i])
    
    plt.subplot(4, 1, 4)
    plt.legend(signal_names, fontsize=12, loc="lower right")

    plt.suptitle(f"id: {_id} target: {target}", fontsize=16)
    plt.show()

In [None]:
for i in random.sample(train_df.index.tolist(), 2):
    _id = train_df.iloc[i]["id"]
    target = train_df.iloc[i]["target"]

    visualize_sample(_id, target)

### Correlation Between Features
Visualise the correlation between the three inputs:

In [None]:
# correlation: pairwise scatter plots between the three detectors for two
# random samples, using only the first 100 time steps to keep the plots readable

for i in random.sample(train_df.index.tolist(), 2):
    _id = train_df.iloc[i]["id"]
    target = train_df.iloc[i]["target"]
    print(f"target = {target}")
    path = convert_id_to_path(_id)
    x = np.load(path)  # shape (3, 4096)
    xt = x.transpose()  # shape (4096, 3): one column per detector
    df = pd.DataFrame(data=xt[:100, :], columns=["LIGO Hanford", "LIGO Livingston", "Virgo"])

    sns.pairplot(df)
    plt.show()


In [None]:
from tqdm.notebook import tqdm

# same pair plots as above, but using all 4096 time steps of each sample
for i in tqdm(random.sample(train_df.index.tolist(), 2)):
    _id = train_df.iloc[i]["id"]
    target = train_df.iloc[i]["target"]
    print(f"target = {target}")
    path = convert_id_to_path(_id)
    x = np.load(path)  # shape (3, 4096)
    xt = x.transpose()  # shape (4096, 3): one column per detector
    df = pd.DataFrame(data=xt, columns=["LIGO Hanford", "LIGO Livingston", "Virgo"])

    sns.pairplot(df)
    plt.show()

**Observations:**
- The LIGO detectors correlate more strongly with each other than they do with the Virgo detector. This could be related to their geographical proximity: since the two LIGO detectors are closer together on the curved surface of the Earth, they may respond to an incoming wave in a similar way. If so, correlations between them MIGHT be less likely to be noise and more likely to be related to gravitational waves.
    - Perhaps the model could increase (or decrease) the LIGO weightings when the two signals are similar?
- We could also bring in knowledge from physics/geography to quantify the data from another perspective, which may help when evaluating the raw data. 
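The correlation noted above can also be put into numbers rather than read off the pair plots. The following is a minimal sketch, assuming each sample is an array of shape (3, 4096) as loaded earlier; the helper name `channel_correlations` and the synthetic `x_demo` array are illustrative and not part of the original notebook.

```python
import numpy as np

def channel_correlations(x):
    """Return the 3x3 Pearson correlation matrix for an array of shape (3, n_steps)."""
    # np.corrcoef treats each row as one variable, which matches the (3, 4096) layout.
    return np.corrcoef(x)

# Synthetic stand-in for one sample (assumption, not competition data):
# the first two rows share a common component, the third row is independent.
rng = np.random.default_rng(0)
shared = rng.normal(size=4096)
x_demo = np.stack([
    shared + 0.1 * rng.normal(size=4096),
    shared + 0.1 * rng.normal(size=4096),
    rng.normal(size=4096),
])

corr = channel_correlations(x_demo)
print(np.round(corr, 2))
```

On real data, `x_demo` would be replaced by `np.load(convert_id_to_path(_id))`, and the matrices could be averaged separately for target = 0 and target = 1 to see whether the LIGO pair really stands out.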

### Data Feature Extraction?
We could hand-select and extract features from the data, such as the "frequency", "frequency deviation", "amplitude" and "amplitude deviation".
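As a rough illustration of this idea, the sketch below computes a few hand-picked features per channel with NumPy. It assumes the 4096 points of each series cover 2 seconds (a sample rate of 2048 Hz); the function name `extract_features`, the feature names and the synthetic sine-wave input are all hypothetical choices, not something taken from the original notebook.

```python
import numpy as np

SAMPLE_RATE_HZ = 2048  # assumption: 4096 points spanning 2 seconds

def extract_features(x, sample_rate=SAMPLE_RATE_HZ):
    """Compute simple per-channel features for an array of shape (n_channels, n_steps)."""
    spectrum = np.abs(np.fft.rfft(x, axis=-1))
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / sample_rate)
    spectrum[:, 0] = 0.0  # ignore the constant (0 Hz) component
    return {
        "amplitude_mean": np.mean(np.abs(x), axis=-1),
        "amplitude_std": np.std(x, axis=-1),
        "dominant_freq_hz": freqs[np.argmax(spectrum, axis=-1)],
    }

# Synthetic stand-in: three pure sine waves at known frequencies.
t = np.arange(4096) / SAMPLE_RATE_HZ
x_demo = np.stack([np.sin(2 * np.pi * f * t) for f in (50.0, 100.0, 200.0)])

features = extract_features(x_demo)
print(features["dominant_freq_hz"])
```

Features like these could be stacked into a small table (one row per sample) and fed to a simple model instead of the raw 3 x 4096 values.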

## Baseline: Logistic Regression 
Here, we will create a baseline logistic regression model on a randomly selected portion of the training set. 
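To make the plan concrete, here is a minimal sketch of what such a baseline could look like with scikit-learn. It runs on synthetic stand-in data (short random series with an artificial offset added to the positive class), so the score says nothing about the real task; the names `X_demo`, `y_demo` and the choice of ROC AUC as the validation metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 3 channels, 64 time steps.
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 3, 64))
y_demo = rng.integers(0, 2, size=n_samples)
X_demo[y_demo == 1] += 0.5  # artificial offset so there is something to learn

# Logistic regression expects 2D input, so flatten (channels, time) into one axis.
X_flat = X_demo.reshape(n_samples, -1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_flat, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by the sampled and scaled arrays built in the following sections.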

### Extracting Training Sets
Here, we begin by extracting randomly selected samples into a 3D array of shape (n, 3, 4096), where n is the number of samples (100 in the call below). 

In [None]:
# Returns randomly selected data samples from the training set as a tuple
# (y_data, X_data): labels of shape (training_size,) and signals of shape
# (training_size, 3, 4096).
def sample_data(training_size=50):
    y_data = np.full(training_size, -1)  # -1 marks a label that has not been filled in yet
    X_data = np.zeros((training_size, 3, 4096))

    rows = random.sample(train_df.index.tolist(), training_size)
    for count, i in enumerate(rows):
        img_id = train_df.iloc[i]["id"]
        y_data[count] = train_df.iloc[i]["target"]
        path = convert_id_to_path(img_id)
        X_data[count] = np.load(path)
    return (y_data, X_data)
        

y_train, X_train = sample_data(100)

print(X_train.shape)
print(y_train.shape)
print(X_train)

### Normalising Data
We standardise the data so that it has zero mean and unit variance. Note that `StandardScaler`, used below, does not map the values into the range (0..1). 

In [None]:
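# Standardise the data in X_train (100, 3, 4096).
from sklearn import preprocessing

# one line scale: https://stackoverflow.com/questions/50125844/how-to-standard-scale-a-3d-matrix
# StandardScaler standardises each column. After the reshape to (100 * 3, 4096)
# every column is one time step, so each time step is scaled with statistics
# pooled over all samples and all three channels.
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)
print(X_scaled)
# because the statistics are shared across channels, this scaling preserves
# the differences between channels.
# these differences may be useful?

# but we might want to individually scale channels. an alternate scaling 
# is being developed below: TODO

# one scaler per channel: each time step is still scaled separately, but now
# using statistics from that channel only
scalers = {}
X_scaled_2 = np.zeros_like(X_train)
for i in range(X_train.shape[1]):
    scalers[i] = preprocessing.StandardScaler()
    X_scaled_2[:, i, :] = scalers[i].fit_transform(X_train[:, i, :]) 

print(X_scaled_2)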
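Another option, different from both scalers above, is to standardise every individual time series (each sample and channel on its own) so that no statistics are shared between samples. The sketch below does this with plain NumPy; the function name `standardise_per_channel` and the synthetic `X_demo` array are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def standardise_per_channel(X):
    """Scale each (sample, channel) series of an (n, channels, steps) array
    to zero mean and unit variance along the time axis."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    # Guard against constant series; a fixed epsilon is avoided here because
    # it would dominate the result when the raw values are extremely small.
    std = np.where(std > 0, std, 1.0)
    return (X - mean) / std

# Synthetic stand-in (assumption): 5 samples of very small-magnitude noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=1e-20, size=(5, 3, 4096))

X_std = standardise_per_channel(X_demo)
print(X_std.mean(axis=-1).round(6))
print(X_std.std(axis=-1).round(6))
```

Unlike the `StandardScaler` variants, this needs no fitting step, so the same function can be applied unchanged to the test set.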

### Visualising Scaled Data
Compare the three variants to work out which scaling is better: unscaled (`X_train`), scaled across channels (`X_scaled`), and scaled per channel (`X_scaled_2`).


In [None]:
def visualize_scaled_sample( x, target, colors=("black", "red", "green"), signal_names=("LIGO Hanford", "LIGO Livingston", "Virgo")):    
    plt.figure(figsize=(16, 7))
    for i in range(3):
        plt.subplot(4, 1, i + 1)
        plt.plot(x[i], color=colors[i])
        plt.legend([signal_names[i]], fontsize=12, loc="lower right")
        
        plt.subplot(4, 1, 4)
        plt.plot(x[i], color=colors[i])
    
    plt.subplot(4, 1, 4)
    plt.legend(signal_names, fontsize=12, loc="lower right")

    # the sample id is not passed to this function, so only the target is shown
    plt.suptitle(f"target: {target}", fontsize=16)
    plt.show()

In [None]:
visualize_scaled_sample(X_train[3], y_train[3])
visualize_scaled_sample(X_scaled[3], y_train[3])
visualize_scaled_sample(X_scaled_2[3], y_train[3])

visualize_scaled_sample(X_train[12], y_train[12])
visualize_scaled_sample(X_scaled[12], y_train[12])
visualize_scaled_sample(X_scaled_2[12], y_train[12])