<a href="https://colab.research.google.com/github/s-feinstein/G2Net/blob/dev/G2Net.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [G2Net](https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves/) – Detecting Continuous Gravitational Waves

**Objective:**
Help us detect long-lasting gravitational-wave signals!

The goal of this competition is to find continuous gravitational-wave signals. You will develop a model sensitive enough to detect weak yet long-lasting signals emitted by rapidly-spinning neutron stars within noisy data.

**Secret Objective!!!** Receive excellent marks on this final project for [3253 - Machine Learning at U. Toronto](https://learn.utoronto.ca/programs-courses/courses/3253-machine-learning).

## Authenticate with Secrets

In [4]:
!wget -q -N "https://raw.githubusercontent.com/s-feinstein/G2Net/dev/setup-colab.py"
%run setup-colab.py

## Import the dataset

In [3]:
# !pip install kaggle

!kaggle competitions download g2net-detecting-continuous-gravitational-waves

Traceback (most recent call last):
  File "/opt/conda/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/opt/conda/lib/python3.7/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/opt/conda/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py", line 166, in authenticate
    self.config_file, self.config_dir))
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.
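
The traceback above means the Kaggle CLI found no credentials. The error message mentions "the environment method": the Kaggle API can read the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables instead of `kaggle.json`. The snippet below is a minimal sketch of that approach. The helper name `set_kaggle_credentials` and the placeholder values are hypothetical, and it assumes the variables are set in the kernel before the `!kaggle` command runs.

```python
import os


def set_kaggle_credentials(username, key, env=os.environ):
    """Expose Kaggle credentials through KAGGLE_USERNAME / KAGGLE_KEY.

    `env` defaults to the real process environment; passing a plain dict
    makes it possible to try the helper without touching os.environ.
    """
    if not username or not key:
        raise ValueError("Both a Kaggle username and an API key are required")
    env["KAGGLE_USERNAME"] = username
    env["KAGGLE_KEY"] = key
    return env


# Placeholder values only; real credentials should come from a secrets store.
demo_env = set_kaggle_credentials("my-username", "my-api-key", env={})
print(sorted(demo_env))
```

Writing into a throwaway dict first, as in the demo call, keeps real keys out of the notebook output while checking that the helper behaves as intended.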


## Load the labels

Load the labels and split them into training and test sets.

In [1]:
import pandas as pd
labels = pd.read_csv('../input/g2net-detecting-continuous-gravitational-waves/train_labels.csv')

# Keep only rows with a valid 0/1 target (drop the negative labels)
labels = labels[labels.target >= 0]
# labels.target.value_counts()
# labels.info()

# Hold out 30% of the labeled rows as a test set
from sklearn.model_selection import train_test_split
train_labels, test_labels = train_test_split(labels, test_size=0.3, random_state=42)
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 420 entries, 108 to 102
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      420 non-null    object
 1   target  420 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 9.8+ KB
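
A plain random split does not guarantee that the training and test sets keep the same share of positive labels. One option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (400 positives, 200 negatives) rather than the competition data, so the class ratio shown is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 400 positives and 200 negatives (not the real data)
demo_ids = np.arange(600)
demo_targets = np.array([1] * 400 + [0] * 200)

# stratify=demo_targets keeps the positive/negative ratio in both subsets
ids_train, ids_test, y_train, y_test = train_test_split(
    demo_ids, demo_targets, test_size=0.3, random_state=42, stratify=demo_targets
)

print(len(ids_train), len(ids_test))
print(y_train.mean(), y_test.mean())
```

In the notebook itself, the equivalent change would be passing `stratify=labels.target` to the existing call.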


## Generate list of the training file paths

In [5]:
import os

train_files = []
train_path = "/kaggle/input/g2net-detecting-continuous-gravitational-waves/train"
for file_id in train_labels.loc[:, "id"]:  # avoid shadowing the built-in id()
    filename = file_id + ".hdf5"
    path = os.path.join(train_path, filename)
    train_files.append(path)

train_files[0]

'/kaggle/input/g2net-detecting-continuous-gravitational-waves/train/2688e48bd.hdf5'

## Explore the data

In [6]:
import h5py
file = h5py.File(train_files[0], "r")  # open read-only

print(list(file.keys())[0])
print(file['2688e48bd'].keys())
print(file['2688e48bd']['H1'].keys())
print(file['2688e48bd']['L1'].keys())
print(file['2688e48bd']['frequency_Hz'])
print(file['2688e48bd']['frequency_Hz'][0:4])
print(file['2688e48bd']['H1']['timestamps_GPS'])
print(file['2688e48bd']['H1']['timestamps_GPS'][0], " - ", file['2688e48bd']['H1']['timestamps_GPS'][-1])
print(file['2688e48bd']['H1']['SFTs'])
print(file['2688e48bd']['H1']['SFTs'][0:2])
print(file['2688e48bd']['L1']['timestamps_GPS'])
print(file['2688e48bd']['L1']['timestamps_GPS'][0], " - ", file['2688e48bd']['L1']['timestamps_GPS'][-1])
print(file['2688e48bd']['L1']['SFTs'])
print(file['2688e48bd']['L1']['SFTs'][0:2])

2688e48bd
<KeysViewHDF5 ['H1', 'L1', 'frequency_Hz']>
<KeysViewHDF5 ['SFTs', 'timestamps_GPS']>
<KeysViewHDF5 ['SFTs', 'timestamps_GPS']>
<HDF5 dataset "frequency_Hz": shape (360,), type "<f8">
[306.92055556 306.92111111 306.92166667 306.92222222]
<HDF5 dataset "timestamps_GPS": shape (4564,), type "<i8">
1238170479  -  1248546822
<HDF5 dataset "SFTs": shape (360, 4564), type "<c8">
[[-4.5731778e-23+1.18721092e-22j  2.8206372e-23-1.11619598e-23j
   1.1124856e-22-3.01264718e-23j ... -9.1336872e-23-1.15372359e-22j
   4.9273540e-23+8.50121319e-23j -2.2122217e-22+7.59737612e-23j]
 [-4.5931660e-23-1.34403698e-22j  2.4227980e-23+1.09901059e-22j
   1.7774413e-22-1.22264384e-23j ... -9.4565130e-23+2.65873464e-22j
  -4.0620313e-23-4.09678061e-23j -7.8723440e-23-4.63299556e-23j]]
<HDF5 dataset "timestamps_GPS": shape (4646,), type "<i8">
1238167882  -  1248558232
<HDF5 dataset "SFTs": shape (360, 4646), type "<c8">
[[ 2.2248422e-22+5.95424828e-23j -1.1790475e-22-6.16426640e-23j
   6.1226429e-23+
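
The printed values already say something about the sampling. The sketch below uses only numbers copied from outputs in this notebook: the first four `frequency_Hz` values of `2688e48bd` and the first three `H1` timestamps of `b832e8026`. It checks the frequency spacing and looks for timestamp steps that differ from 1800 s. Treating 1800 s as the nominal step is an assumption based on the spacing seen in the other files.

```python
import numpy as np

# Values copied from the outputs shown in this notebook
frequency_hz = np.array([306.92055556, 306.92111111, 306.92166667, 306.92222222])
timestamps_gps = np.array([1238170165, 1238171306, 1238173106])

# Average spacing between neighbouring frequency bins, in Hz
freq_step = np.diff(frequency_hz).mean()

# Spacing between consecutive SFT timestamps, in seconds
time_steps = np.diff(timestamps_gps)

# Flag steps that differ from the assumed nominal 1800 s
irregular = time_steps != 1800

print(round(1 / freq_step))
print(time_steps)
print(irregular)
```

A bin spacing close to 1/1800 Hz is what one would expect if each SFT covers roughly 1800 s of data. The irregular step suggests the timestamps should not be treated as a uniform grid.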

## Load the data into a dataframe

In [9]:
train_data_array = []
for path in train_files:
    # Files are opened read-only and left open on purpose: the h5py datasets
    # below are lazy handles that read from disk only when they are indexed.
    file = h5py.File(path, "r")
    filename = list(file.keys())[0]  # the single top-level group is the file id
    frequency_Hz = file[filename]['frequency_Hz']
    H1_SFTs = file[filename]['H1']['SFTs']
    H1_timestamps_GPS = file[filename]['H1']['timestamps_GPS']
    L1_SFTs = file[filename]['L1']['SFTs']
    L1_timestamps_GPS = file[filename]['L1']['timestamps_GPS']
    train_data_array.append([filename, frequency_Hz, H1_SFTs, H1_timestamps_GPS, L1_SFTs, L1_timestamps_GPS])

print(len(train_data_array))
train_data = pd.DataFrame(train_data_array, columns=['filename', 'frequency_Hz', 'H1_SFTs', 'H1_timestamps_GPS','L1_SFTs', 'L1_timestamps_GPS'])
print(train_labels.head())
train_data.head()


420
            id  target
108  2688e48bd       1
274  77b9c1867       0
602  ffa1d19c7       1
482  cbff2fdcd       0
439  b832e8026       0


| | filename | frequency_Hz | H1_SFTs | H1_timestamps_GPS | L1_SFTs | L1_timestamps_GPS |
|---|---|---|---|---|---|---|
| 0 | 2688e48bd | [306.92055555555555, 306.92111111111114, 306.9... | [[(-4.5731778e-23+1.1872109e-22j), (2.8206372e... | [1238170479, 1238172279, 1238174079, 123817587... | [[(2.2248422e-22+5.954248e-23j), (-1.1790475e-... | [1238167882, 1238169682, 1238171482, 123817328... |
| 1 | 77b9c1867 | [223.00333333333336, 223.0038888888889, 223.00... | [[(6.121172e-23-6.4529106e-23j), (-2.160171e-2... | [1238168355, 1238170155, 1238171955, 123817375... | [[(-1.3897427e-22+9.1452695e-23j), (1.375147e-... | [1238166700, 1238168500, 1238170300, 123817714... |
| 2 | ffa1d19c7 | [234.37611111111113, 234.3766666666667, 234.37... | [[(-7.681505e-23+9.315671e-23j), (3.471281e-23... | [1238176250, 1238178050, 1238179850, 123818165... | [[(1.2530556e-22-5.2474743e-23j), (5.3851836e-... | [1238180750, 1238182550, 1238184350, 123818615... |
| 3 | cbff2fdcd | [174.19166666666666, 174.19222222222223, 174.1... | [[(4.654238e-23-1.4819893e-22j), (7.379933e-23... | [1238172169, 1238173969, 1238175769, 123817756... | [[(-3.2067044e-23+2.6889726e-22j), (6.3008163e... | [1238186471, 1238188271, 1238190071, 123819187... |
| 4 | b832e8026 | [403.0227777777778, 403.0233333333333, 403.023... | [[(-1.7797844e-22+7.5566185e-23j), (4.1235438e... | [1238170165, 1238171306, 1238173106, 123817490... | [[(3.8408568e-23+3.277592e-23j), (-4.227776e-2... | [1238169343, 1238171143, 1238172943, 123817474... |


## Model Evaluation

A simple evaluation helper, so that every approach we try is tested and compared in the same way.

In [None]:
class Performance:
    def __init__(self, modelOutput, expectedOutput):
        self.truePos = 0
        self.trueNeg = 0
        self.falsePos = 0
        self.falseNeg = 0
        self.precision = 0
        self.recall = 0
        self.harmonicMean = 0
        self.calculate(modelOutput, expectedOutput)

    # Compare each score to tally TP/FP/TN/FN.
    # Minimal placeholder: assumes both inputs are sequences of binary 0/1 labels.
    def compare_scores(self, modelOutput, expectedOutput):
        for predicted, expected in zip(modelOutput, expectedOutput):
            if predicted == 1 and expected == 1:
                self.truePos += 1
            elif predicted == 1 and expected == 0:
                self.falsePos += 1
            elif predicted == 0 and expected == 0:
                self.trueNeg += 1
            elif predicted == 0 and expected == 1:
                self.falseNeg += 1

    # Precision: TP / (TP + FP)
    def calculate_precision(self):
        denominator = self.truePos + self.falsePos
        # Guard against division by zero when nothing was predicted positive
        self.precision = self.truePos / denominator if denominator else 0
        return self.precision

    # Recall: TP / (TP + FN)
    def calculate_recall(self):
        denominator = self.truePos + self.falseNeg
        # Guard against division by zero when there are no actual positives
        self.recall = self.truePos / denominator if denominator else 0
        return self.recall

    # Harmonic Mean: F1 = TP / (TP + ((FN + FP)/ 2) ) = 2 * ( (Precision * Recall) / (Precision + Recall) )
    def calculate_harmonic_mean(self):
        denominator = self.precision + self.recall
        self.harmonicMean = 2 * (self.precision * self.recall) / denominator if denominator else 0
        return self.harmonicMean

    # Print the results
    def get_results(self):
        print("True Positives: ", self.truePos)
        print("True Negatives: ", self.trueNeg)
        print("False Positives: ", self.falsePos)
        print("False Negatives: ", self.falseNeg)
        print("Precision: ", self.precision)
        print("Recall: ", self.recall)
        print("Harmonic Mean: ", self.harmonicMean)

    # Calculate pipeline
    def calculate(self, modelOutput, expectedOutput):
        self.compare_scores(modelOutput, expectedOutput)
        self.calculate_precision()
        self.calculate_recall()
        self.calculate_harmonic_mean()
        self.get_results()
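
The comment on the harmonic mean gives two expressions for F1, one in terms of the raw tallies and one in terms of precision and recall. As a quick sanity check, the standalone sketch below evaluates both on made-up counts (3 true positives, 1 false positive, 2 false negatives). The function name `f1_from_counts` is hypothetical and not part of the class.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 computed two equivalent ways."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 as the harmonic mean of precision and recall
    f1_from_rates = 2 * (precision * recall) / (precision + recall)
    # F1 written directly in terms of the tallies
    f1_from_tallies = true_pos / (true_pos + (false_neg + false_pos) / 2)
    return precision, recall, f1_from_rates, f1_from_tallies


# Made-up counts for illustration
precision, recall, f1_a, f1_b = f1_from_counts(true_pos=3, false_pos=1, false_neg=2)
print(precision, recall, f1_a, f1_b)
```

Both F1 values agree up to floating-point rounding, which is a useful check before relying on either form.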

## Train a basic model for baseline testing
There is a lot of data engineering we can try here: data transformations, data alignment and normalization, selectively excluding data, and so on.
But first, let's take a naive approach and see how a few basic models fare.
That gives us a baseline for judging whether later optimizations work, and how well.

In [16]:
from sklearn.linear_model import SGDClassifier
import numpy as np

# SGD stands for stochastic gradient descent (read more about SGD: https://medium.com/@lachlanmiller_52885/machine-learning-week-1-cost-function-gradient-descent-and-univariate-linear-regression-8f5fe69815fd)
# clf stands for classifier
# tol=None disables the tolerance-based stopping check, so training runs for max_iter epochs
sgd_clf = SGDClassifier(max_iter=5, tol=None, random_state=42)
# sgd_clf.fit(np.array(train_data.loc[:,"H1_SFTs"]), np.array(train_labels.loc[:,"target"]))
np.array(train_data.loc[:,"H1_SFTs"][0])

array([[-4.5731778e-23+1.18721092e-22j,  2.8206372e-23-1.11619598e-23j,
         1.1124856e-22-3.01264718e-23j, ...,
        -9.1336872e-23-1.15372359e-22j,  4.9273540e-23+8.50121319e-23j,
        -2.2122217e-22+7.59737612e-23j],
       [-4.5931660e-23-1.34403698e-22j,  2.4227980e-23+1.09901059e-22j,
         1.7774413e-22-1.22264384e-23j, ...,
        -9.4565130e-23+2.65873464e-22j, -4.0620313e-23-4.09678061e-23j,
        -7.8723440e-23-4.63299556e-23j],
       [ 4.0764129e-23+9.43734433e-23j,  1.3412449e-26-3.09341290e-22j,
        -1.3496165e-23+1.70917974e-22j, ...,
         3.5848973e-23-9.14058874e-23j, -7.9956947e-23+1.27143110e-23j,
        -4.7779929e-23-2.79704707e-23j],
       ...,
       [ 3.2981290e-23+9.77753649e-24j,  3.3722447e-23+1.68749433e-23j,
        -1.9531401e-23-4.58672192e-23j, ...,
         6.6101363e-23-6.45430152e-23j, -1.9212385e-23+4.93538614e-23j,
        -1.2720560e-22+5.94094177e-23j],
       [ 1.3689462e-22-1.29854444e-22j,  6.3243717e-23+1.78008790e-2
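
The `fit` call above is commented out for a reason: `SGDClassifier` expects a real-valued 2-D array of shape `(n_samples, n_features)`, while each `H1_SFTs` entry is a complex 2-D array whose time dimension varies between files (4564 and 4646 columns in the first file). One possible way to get fixed-length real features, shown here only as a sketch and not as the notebook's chosen method, is to average the power `|SFT|^2` over time for each frequency bin. The helper names and the random stand-in data are hypothetical.

```python
import numpy as np


def mean_power_per_frequency(sfts):
    """Collapse a complex (n_freq, n_time) SFT array to n_freq real features."""
    power = np.abs(sfts) ** 2      # real-valued power in every bin
    return power.mean(axis=1)      # average over the time axis


rng = np.random.default_rng(42)


def fake_sfts(n_time, n_freq=360):
    """Random complex noise standing in for real SFT data."""
    real = rng.normal(size=(n_freq, n_time))
    imag = rng.normal(size=(n_freq, n_time))
    return (real + 1j * imag) * 1e-22


# Two stand-in samples with different numbers of timestamps
features = np.stack([mean_power_per_frequency(fake_sfts(n)) for n in (4564, 4646)])
print(features.shape, features.dtype)
```

Because every sample ends up with one value per frequency bin, the stacked result has the shape a scikit-learn estimator accepts, regardless of how many timestamps each file contains. Values near 1e-44 are tiny, so some rescaling would likely be needed before training.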