<a href="https://colab.research.google.com/github/s-feinstein/G2Net/blob/dev/G2Net.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [G2Net](https://www.kaggle.com/competitions/g2net-detecting-continuous-gravitational-waves/) – Detecting Continuous Gravitational Waves

**Objective:**
Help us detect long-lasting gravitational-wave signals!

The goal of this competition is to find continuous gravitational-wave signals. You will develop a model sensitive enough to detect weak yet long-lasting signals emitted by rapidly-spinning neutron stars within noisy data.

**Secret Objective!!!** Receive excellent marks from this final project in [3253 - Machine Learning at U. Toronto](https://learn.utoronto.ca/programs-courses/courses/3253-machine-learning)

## Authenticate with Secrets

In [10]:
!wget -q -N "https://raw.githubusercontent.com/s-feinstein/G2Net/dev/setup-colab.py"
%run setup-colab.py

## Import the dataset

In [11]:
# !pip install kaggle

!kaggle competitions download g2net-detecting-continuous-gravitational-waves

Traceback (most recent call last):
  File "/opt/conda/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/opt/conda/lib/python3.7/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/opt/conda/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py", line 166, in authenticate
    self.config_file, self.config_dir))
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


## Load the labels

Load the labels and split it into train and test

In [12]:
import pandas as pd
labels = pd.read_csv('../input/g2net-detecting-continuous-gravitational-waves/train_labels.csv')

# Removing the negative labels
labels = labels[labels.target>=0]
# train_labels.target.value_counts()
# train_labels.info()

# Split Data
from sklearn.model_selection import train_test_split
#train test split
train_labels, test_labels = train_test_split(labels, test_size=0.3, random_state=42)
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 420 entries, 108 to 102
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      420 non-null    object
 1   target  420 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 9.8+ KB


## Generate list of the training file paths

In [13]:
import os

train_files = []
train_path = "/kaggle/input/g2net-detecting-continuous-gravitational-waves/train"
for id in train_labels.loc[:,"id"]:
    filename = id + ".hdf5"
    path = os.path.join(train_path, filename)
    train_files.append(path)
    
train_files[0]

'/kaggle/input/g2net-detecting-continuous-gravitational-waves/train/2688e48bd.hdf5'

## Explore the data

In [14]:
import h5py
file = h5py.File(train_files[0])

print(list(file.keys())[0])
print(file['2688e48bd'].keys())
print(file['2688e48bd']['H1'].keys())
print(file['2688e48bd']['L1'].keys())
print(file['2688e48bd']['frequency_Hz'])
print(file['2688e48bd']['frequency_Hz'][0:4])
print(file['2688e48bd']['H1']['timestamps_GPS'])
print(file['2688e48bd']['H1']['timestamps_GPS'][0], " - ", file['2688e48bd']['H1']['timestamps_GPS'][-1])
print(file['2688e48bd']['H1']['SFTs'])
print(file['2688e48bd']['H1']['SFTs'][0:2])
print(file['2688e48bd']['L1']['timestamps_GPS'])
print(file['2688e48bd']['L1']['timestamps_GPS'][0], " - ", file['2688e48bd']['L1']['timestamps_GPS'][-1])
print(file['2688e48bd']['L1']['SFTs'])
print(file['2688e48bd']['L1']['SFTs'][0:2])

2688e48bd
<KeysViewHDF5 ['H1', 'L1', 'frequency_Hz']>
<KeysViewHDF5 ['SFTs', 'timestamps_GPS']>
<KeysViewHDF5 ['SFTs', 'timestamps_GPS']>
<HDF5 dataset "frequency_Hz": shape (360,), type "<f8">
[306.92055556 306.92111111 306.92166667 306.92222222]
<HDF5 dataset "timestamps_GPS": shape (4564,), type "<i8">
1238170479  -  1248546822
<HDF5 dataset "SFTs": shape (360, 4564), type "<c8">
[[-4.5731778e-23+1.18721092e-22j  2.8206372e-23-1.11619598e-23j
   1.1124856e-22-3.01264718e-23j ... -9.1336872e-23-1.15372359e-22j
   4.9273540e-23+8.50121319e-23j -2.2122217e-22+7.59737612e-23j]
 [-4.5931660e-23-1.34403698e-22j  2.4227980e-23+1.09901059e-22j
   1.7774413e-22-1.22264384e-23j ... -9.4565130e-23+2.65873464e-22j
  -4.0620313e-23-4.09678061e-23j -7.8723440e-23-4.63299556e-23j]]
<HDF5 dataset "timestamps_GPS": shape (4646,), type "<i8">
1238167882  -  1248558232
<HDF5 dataset "SFTs": shape (360, 4646), type "<c8">
[[ 2.2248422e-22+5.95424828e-23j -1.1790475e-22-6.16426640e-23j
   6.1226429e-23+

## Load the data into a dataframe

In [116]:
# https://www.kaggle.com/code/maharshipandya/g2net-data-and-augmentation

import numpy as np
BASE_DIR = "/kaggle/input/g2net-detecting-continuous-gravitational-waves/train/"

class SFT2Img:
    def __init__(self, labels):
        # labels the dataframe of train labels
        self.labels = labels
            
    
    def __getitem__(self, index):
        # get the file id from dataframe
        lab = self.labels.iloc[index]
        file_id = lab["id"]
        
        # this is our label
        y = np.float32(lab["target"])
        
        # SFT tensor for H1 and L1 observatory (128 columns)
        img = np.empty((2, 360, 128), dtype=np.float32)
        
        filename = f"{file_id}.hdf5"
        with h5py.File(BASE_DIR + filename, 'r') as f:
            group = f[file_id]
            
            for i, obs in enumerate(['H1', 'L1']):
                # scaling the fourier transforms (complex64 in nature)
                sft = group[obs]['SFTs'][:, :4096] * 1e22
                
                # magnitude squared
                mag = sft.real ** 2 + sft.imag ** 2
                
                # normalize and reduce 4096 to 128
                mag /= np.mean(mag)
                mag = np.mean(mag.reshape(360, 128, 32), axis=2)
                
                # 0 for H1 and 1 for L1
                img[i] = mag
        
        return img, y

# Added a minor improvement to flatten images to instance variables
# for easier model training
class Dataset:
    def __init__(self, labels):
        self.sft2img = SFT2Img(labels)
        # labels the dataframe of train labels
        self.labels = sft2img.labels
        self.H1 = np.ndarray(shape=(0, 360*128))
        self.L1 = np.ndarray(shape=(0, 360*128))
        self.build_data(sft2img)
        self.H1L1 = np.vstack((self.H1, self.L1))
        self.labelsH1L1 = np.vstack((self.labels, self.labels))
        
    def build_data(self):
        for i in range(self.sft2img.labels.target.shape[0]):
            img, label = self.sft2img[i]
            H1img = np.array(img[0]).flatten()
            L1img = np.array(img[1]).flatten()
            self.H1 = np.vstack((self.H1, H1img))
            self.L1 = np.vstack((self.L1, L1img))
        

In [117]:
train_dataset = Dataset(train_labels)
test_dataset = Dataset(test_labels)
            
print("Shape of the labels: ", train_dataset.labels.target.shape)
print("Shape of H1: ", train_dataset.H1.shape)
print("Shape of L1: ", train_dataset.L1.shape)

TypeError: 'Dataset' object is not subscriptable

## Model Evaluation

Simple model evaluation for consistent testing and comparison as we try different approaches

In [None]:
class Performance():
    def __init__(self, modelOutput, expectedOutput):
        self.truePos = 0
        self.trueNeg = 0
        self.falsePos = 0
        self.falseNeg = 0
        self.precision = 0
        self.recall = 0
        self.harmonicMean = 0
        self.calculate(modelOutput, expectedOutput)
        
    def compare_scores(modelOutput, expectedOutput):
        return
        # Fill this in later, compare each score to tally TP/FP/TN/FN

    # Precision: TP / (TP + FP)        
    def calculate_precision():
        self.precision = self.truePos / ( self.truePos + self.falsePos)
        return self.precision

    # Recall: TP / (TP + FN)
    def calculate_recall():
        self.recall = self.truePos / ( self.truePos + self.falseNeg)
        return self.recall
    
    # Harmonic Mean: F1 = TP / (TP + ((FN + FP)/ 2) ) = 2 * ( (Precision * Recall) / (Precision + Recall) )
    def calculate_harmonic_mean():
        self.harmonicMean = 2 * ( (self.precision * self.recall) / (self.precision + self.recall) )
        return self.harmonicMean
    
    # Print the results
    def get_results():
        print("True Positives: ", self.truePos)
        print("True Negatives: ", self.trueNeg)
        print("False Positives: ", self.falsePos)
        print("False Negatives: ", self.falseNeg)
        print("Precision: ", self.precision)
        print("Recall: ", self.recall)
        print("Harmonic Mean: ", self.harmonicMean)
        
    # Calculate pipeline
    def calculate(modelOutput, expectedOutput):
        compare_scores(modelOutput, expectedOutput)
        calculate_precision()
        calculate_recall()
        calculate_harmonic_mean()
        get_results()   

## Train a basic model for baseline testing
There's a lot of data engineering we can try with this in terms of data transformations, data alignment and normalization, selectively excluding data, etc.
But first, let's take a naive approach and see how a few basic models fare.
That way we can see if future optimizations work and how well.

In [112]:
from sklearn.linear_model import SGDClassifier
import numpy as np

#sgd stands for stochastic gradient descent (read more about GSD https://medium.com/@lachlanmiller_52885/machine-learning-week-1-cost-function-gradient-descent-and-univariate-linear-regression-8f5fe69815fd)
#clf stands for classifier
sgd_clf = SGDClassifier(max_iter=5, tol=-np.infty, random_state=42)
sgd_clf.fit(train_dataset.H1, train_dataset.labels.target)

pred = sgd_clf.predict([train_H1[11]])
print("Prediction: ", pred[0])
print("Actual: ", train_dataset.labels.target.iloc[11])


Prediction:  0
Actual:  0
