# Project 8: G2Net Gravitational Wave Detection

This work was an attempt to participate in a Kaggle competition: https://www.kaggle.com/c/g2net-gravitational-wave-detection/overview

The aim of the chosen competition is to build a classifier that detects gravitational waves in a signal. This is an opportunity to work on time series and to take part in an effort to improve gravitational wave detection using machine learning. The competition hosts hope that Kaggle participants can outperform the standard approaches.



Kaggle competitions are a great way to learn state-of-the-art techniques. I spent a lot of time reading kernels and comments to find out what works and what doesn't. I joined the competition 7 days before the end, so I had to head straight in the right direction.

I came to the following conclusions:
* Almost everyone transforms the time series into images in order to work with CNNs
* The best transformation for this project appears to be the CQT (constant-Q transform)
* Many high-scoring solutions use transfer learning models; EfficientNetB7 seems to be the best

This notebook was inspired by useful kernels: 

https://www.kaggle.com/coldfir3/cqt-dataset-generator-rgb-jpg
    
https://www.kaggle.com/esratmaria/gravitational-wave-detection-simple-cnn-model



## Plan
* Visualisation
* Preprocessing
* Modelling
* Results

In [None]:
%%capture
!python -m pip install gwpy
!pip install astropy==4.2.1

In [None]:
import os
import shutil
from glob import glob
from tqdm.auto import tqdm
from joblib import Parallel, delayed
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.metrics import AUC
from sklearn.model_selection import train_test_split

from gwpy.timeseries import TimeSeries
from gwpy.plot import Plot
from scipy import signal
from PIL import Image

%matplotlib inline

## Visualisation

In [None]:
train_path = glob('../input/g2net-gravitational-wave-detection/train/*/*/*/*')
test_path = glob('../input/g2net-gravitational-wave-detection/test/*/*/*/*')

In [None]:
print(f'Number of train files : {len(train_path)} which represents {len(train_path)/(len(train_path) + len(test_path))*100:.2f} % of all data')
print(f'Number of test files : {len(test_path)} which represents {len(test_path)/(len(train_path) + len(test_path))*100:.2f} % of all data')

### Loading a file

In [None]:
file_path = '../input/g2net-gravitational-wave-detection/train/0/0/0/000a5b6e5c.npy'
x_series = np.load(file_path)
x_series.shape

Each file contains 3 signals, one per gravitational wave detector.

In [None]:
colors = ['red', 'green', 'blue']
signal_names = ['LIGO Hanford', 'LIGO Livingston', 'Virgo']

plt.figure(figsize=(16, 7))
for i in range(3):
    plt.subplot(3, 1, i+1)
    plt.plot(x_series[i], color=colors[i])
    plt.legend([signal_names[i]], fontsize=12, loc="lower right")


Each signal is a time series lasting 2 seconds, sampled at 2048 Hz.
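As a quick check of these figures, the minimal sketch below computes the number of samples per detector and a matching time axis. The variable names are illustrative and not taken from the notebook; with three detectors, a loaded file should therefore have shape (3, 4096).

```python
import numpy as np

sample_rate = 2048  # Hz, as stated above
duration = 2        # seconds, as stated above

# Number of samples recorded by each detector
n_samples = sample_rate * duration

# Time axis in seconds, one value per sample
t = np.arange(n_samples) / sample_rate

print(n_samples)    # 4096
print(t[0], t[-1])  # first and last time stamps
```

Such a time axis can be passed as the first argument to `plt.plot` so that the x axis reads in seconds rather than in sample indices.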

### Spectrogram

One way of working with time series is to transform the signal from the time domain to the frequency domain. Then we can work on a spectrogram and apply machine learning models designed for image classification.

In this competition, most competitors used a constant-Q transform (CQT), which seems to give the best results. One advantage of the CQT is that the frequency (y) axis is on a log scale. The technique is commonly used for audio signals, where a log scale gives a representation close to human perception.
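To illustrate what a log-scaled frequency axis means, here is a small sketch that builds geometrically spaced centre frequencies. The number of bins is an arbitrary illustrative value, and this is not how gwpy computes its Q-transform internally; it only shows the spacing idea.

```python
import numpy as np

f_min, f_max = 20.0, 500.0  # Hz, the band used later in this notebook
n_bins = 8                  # illustrative value, not from the notebook

# Geometrically spaced centre frequencies: equal steps on a log axis
freqs = np.geomspace(f_min, f_max, n_bins)

# The ratio between neighbouring frequencies is constant
ratios = freqs[1:] / freqs[:-1]

print(np.round(freqs, 1))
print(np.allclose(ratios, ratios[0]))
```

On a linear axis these bins would crowd together at low frequencies; on a log axis they are evenly spread, so low-frequency structure gets as much vertical room as high-frequency structure.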

In [None]:
def sig2rgb(fname, whiten = True, window=0.2, bandpass=True, f_range = (20,500), q_range = (16,32), q_max = 10):
    
    # Load the file 
    data = np.load(fname)
    # Split each chanel and convert to TimeSeries
    data = map(lambda x: TimeSeries(x, sample_rate=2048), data)
    # Whiten the signal and apply a tukey window
    data = map(lambda x: x.whiten(window=("tukey", window)), data)
    # (optional) bandpass filter
    if bandpass:
        data = map(lambda x: x.bandpass(*f_range), data)
    # Q-transform
    data = map(lambda x: x.q_transform(qrange=q_range, frange=f_range, logf=True, whiten=False), data)
    # Convert to RGB image
    img = np.stack(list(data), axis = -1)
    img = np.clip(img, 0, q_max)/q_max * 255
    img = img.astype(np.uint8)
    img = Image.fromarray(img).rotate(90, expand=1)
    img = img.resize((512,512), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    return img

In this notebook I used this function to generate the spectrograms. It is inspired by this kernel: https://www.kaggle.com/coldfir3/cqt-dataset-generator-rgb-jpg

In [None]:
sig2rgb('../input/g2net-gravitational-wave-detection/train/0/0/0/000a5b6e5c.npy')

This RGB spectrogram superimposes the signals from the three gravitational wave interferometers, one per colour channel (red, green, blue). It is a clever idea because it condenses the information, so we can work on smaller images, and it allows a transfer learning approach, since pre-trained networks are trained on RGB images.
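The channel stacking can be sketched on synthetic data as follows. Random arrays stand in for the three Q-transform outputs, and the array sizes are arbitrary; only the stacking, clipping and scaling steps mirror what sig2rgb does.

```python
import numpy as np

rng = np.random.default_rng(0)
q_max = 10  # clipping level, same default as in sig2rgb

# Three synthetic "spectrograms" (time x frequency), one per detector.
# Random values stand in for real Q-transform energies.
channels = [rng.uniform(0, 15, size=(64, 32)) for _ in range(3)]

# Stack along a new last axis so each detector becomes one colour channel
img = np.stack(channels, axis=-1)

# Clip to [0, q_max], rescale to [0, 255] and convert to 8-bit integers
img = (np.clip(img, 0, q_max) / q_max * 255).astype(np.uint8)

print(img.shape, img.dtype)
```

The resulting array has the height x width x 3 layout that `Image.fromarray` interprets as an RGB image.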

## Preprocessing

The idea of this part is to generate a dataset of RGB spectrograms using the sig2rgb function. Two important notes:
* Applying a filter to remove certain frequencies is a key step in filtering out the noise
* According to a discussion in the competition, 20 to 500 Hz is the optimal band for gravitational waves
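To show the effect of such a band-pass filter, here is a minimal sketch on a synthetic signal. It uses a Butterworth filter from scipy.signal rather than the gwpy `bandpass` method called in sig2rgb, and the filter order and test tones are illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs = 2048                   # Hz, sampling rate of the competition data
t = np.arange(2 * fs) / fs  # 2 seconds of samples

# Synthetic input: a 5 Hz tone (outside the band) plus a 100 Hz tone (inside)
low = np.sin(2 * np.pi * 5 * t)
mid = np.sin(2 * np.pi * 100 * t)
x = low + mid

# 4th-order Butterworth band-pass, 20 to 500 Hz, run forward and backward
# so that the filtered signal has no phase shift
sos = signal.butter(4, [20, 500], btype="bandpass", fs=fs, output="sos")
y = signal.sosfiltfilt(sos, x)

# Compare the output with the in-band tone, ignoring the edges
core = slice(fs // 2, -fs // 2)
rms_err = np.sqrt(np.mean((y[core] - mid[core]) ** 2))
print(rms_err)
```

A small error here means the 5 Hz component has been removed while the 100 Hz component passes through almost unchanged.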

In [None]:
def save_img(x, folder_out, **kwargs):
    fname = Path('../input/g2net-gravitational-wave-detection/' + folder_out.split('_')[0] + '/' + '/'.join([x[0], x[1], x[2], x]) + '.npy')
    file_out = folder_out + '/' + fname.with_suffix('.jpg').name
    x = sig2rgb(fname, **kwargs)
    x.save(file_out)
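
The path built in save_img relies on the dataset layout, where the first three characters of a sample id name three nested directories. The sketch below isolates that step; `npy_path` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
from pathlib import Path

def npy_path(sample_id, split="train",
             root="../input/g2net-gravitational-wave-detection"):
    """Build the nested path of one sample file (hypothetical helper)."""
    # The first three characters of the id name the three directory levels
    return (Path(root) / split / sample_id[0] / sample_id[1]
            / sample_id[2] / f"{sample_id}.npy")

print(npy_path("000a5b6e5c").as_posix())
```

For the id `000a5b6e5c` this gives the same path as the file loaded in the Visualisation section.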

In [None]:
fast_sub = False # set this to False to generate the whole dataset
train = False # set this to True to generate the train set
test = False # set this to True to generate the test set

In [None]:
train_df = pd.read_csv('../input/g2net-gravitational-wave-detection/training_labels.csv')
if not os.path.isdir('train_cqt_rgb'):
    if fast_sub: train_ids = train_df['id'][:10000]
    else: train_ids = train_df['id']
    if train:
        os.makedirs('train_cqt_rgb', exist_ok = True)
        o = Parallel(n_jobs=-1)(delayed(save_img)(x, 'train_cqt_rgb') for x in tqdm(train_ids))

In [None]:
test_df = pd.read_csv('../input/g2net-gravitational-wave-detection/sample_submission.csv')
if not os.path.isdir('test_cqt_rgb'):
    if fast_sub: test_ids = test_df['id'][:10000]
    else: test_ids = test_df['id']
    if test:
        os.makedirs('test_cqt_rgb', exist_ok = True)
        o = Parallel(n_jobs=-1)(delayed(save_img)(x, 'test_cqt_rgb') for x in tqdm(test_ids))

## Modelling

In [None]:
results_dataset_1percent_path = '../input/results-second-try/'
results_dataset_100percent_path = '../input/resultsthirdtry/'
os.makedirs(results_dataset_1percent_path, exist_ok=True)
os.makedirs(results_dataset_100percent_path, exist_ok=True)

In [None]:
img_size = (512, 512)
img_shape = (512, 512, 3)
batch_size = 16
img_path = './train_cqt_rgb/'

In [None]:
df = pd.read_csv('../input/g2net-gravitational-wave-detection/training_labels.csv')
# df = df[:10000]

In [None]:
X = df['id']
y = df['target'].astype('int8').values

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X, y, random_state = 42, stratify = y)

In [None]:
def get_filepath(id_, is_train=True):
    path = ''
    if is_train:
        return f'./train_cqt_rgb/{id_}.jpg'
    else:
        return f'./test_cqt_rgb/{id_}.jpg'

In [None]:
def imgFromPath(file_path: tf.Tensor, y=None, input_shape=img_shape):
    file = tf.io.read_file(file_path)
    x = tf.image.decode_image(file)
    x = tf.ensure_shape(x, input_shape)
    if y is None:
        return x
    else:
        return x, y

In [None]:
def makeDataset(x_train, x_valid, batch_size):

    train_dataset = tf.data.Dataset.from_tensor_slices((x_train.apply(get_filepath).values, y_train))
    # shuffle the dataset
    train_dataset = train_dataset.shuffle(len(x_train))
    train_dataset = train_dataset.map(imgFromPath)
    train_dataset = train_dataset.batch(batch_size)
    train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

    valid_dataset = tf.data.Dataset.from_tensor_slices((x_valid.apply(get_filepath).values, y_valid))
    valid_dataset = valid_dataset.map(imgFromPath)
    valid_dataset = valid_dataset.batch(batch_size)
    valid_dataset = valid_dataset.prefetch(tf.data.AUTOTUNE)
   
    return train_dataset, valid_dataset

In [None]:
def makeModel(img_shape):
      
   model_kwargs = dict(
       include_top=False,
       weights='imagenet',
       input_tensor=None,
       input_shape=img_shape,
       pooling=None,
       classes=1000)
   
   base_model = tf.keras.applications.efficientnet.EfficientNetB7(**model_kwargs)
   base_model.trainable = False
   base_model.summary()
      
   data_augmentation = tf.keras.Sequential([
           tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal")])
#            tf.keras.layers.experimental.preprocessing.RandomRotation(0.2), 
#            tf.keras.layers.experimental.preprocessing.RandomZoom(height_factor=(0.2, 0.3), width_factor=(0.2, 0.3)),
#            tf.keras.layers.experimental.preprocessing.RandomTranslation(0.3, 0.3, fill_mode='reflect', interpolation='bilinear')])

   
   global_avg_layer = tf.keras.layers.GlobalAveragePooling2D()
   prediction_layer = tf.keras.layers.Dense(1, activation='sigmoid')
   
   inputs = tf.keras.Input(img_shape)
   prepro = tf.keras.applications.efficientnet.preprocess_input(inputs)
   augmented = data_augmentation(prepro)    
   x = base_model(augmented, training=False)
   x = global_avg_layer(x)
   outputs = prediction_layer(x)
   model = tf.keras.Model(inputs, outputs)
   
   model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
             loss='binary_crossentropy',
             metrics=[AUC(), 'accuracy'])
   
   return model

In [None]:
def evaluateModel(model, train_dataset, valid_dataset, results_path):
    
    os.makedirs(results_path, exist_ok=True)

    if os.path.exists(results_path + 'history.npy'):
        history_dict = np.load(results_path + 'history.npy', allow_pickle=True).item()
    else:
        history = model.fit(
        train_dataset,
        epochs=3,
        validation_data=valid_dataset)

        history_dict = history.history
        np.save(results_path + 'history.npy',history_dict)
        model.save(results_path + 'model.h5')
        
    return history_dict
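
The caching in evaluateModel saves the training history as a Python dict inside a `.npy` file. The sketch below shows that round trip on its own, with made-up loss values and a temporary directory in place of the results folder.

```python
import os
import tempfile
import numpy as np

# Stand-in for history.history; the values are made up for illustration
history_dict = {"loss": [0.69, 0.60], "val_loss": [0.68, 0.62]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.npy")
    # np.save wraps the dict in a 0-dimensional object array
    np.save(path, history_dict)
    # allow_pickle=True is needed to read object arrays; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded == history_dict)
```

Because loading pickled data can execute arbitrary code, this pattern is only appropriate for files you created yourself; a JSON file would be a safer alternative for plain lists of numbers.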

In [None]:
train_dataset, valid_dataset = makeDataset(x_train, x_valid, batch_size)

In [None]:
train_dataset

In [None]:
model = makeModel(img_shape)

In [None]:
history_dataset_1percent = evaluateModel(model, train_dataset, valid_dataset, results_dataset_1percent_path)

In [None]:
history_dataset_100percent = evaluateModel(model, train_dataset, valid_dataset, results_dataset_100percent_path)

## Results

In [None]:
acc = history_dataset_1percent['accuracy']
val_acc = history_dataset_1percent['val_accuracy']

loss = history_dataset_1percent['loss']
val_loss = history_dataset_1percent['val_loss']

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(acc, label='Training Accuracy')
plt.plot(val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')

plt.subplot(2, 1, 2)
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.ylabel('Binary Cross Entropy')
plt.title('Training and Validation Loss')
plt.xlabel('epoch')
plt.show()

In [None]:
acc = history_dataset_100percent['accuracy']
val_acc = history_dataset_100percent['val_accuracy']

auc = history_dataset_100percent['auc_1']
val_auc = history_dataset_100percent['val_auc_1']

loss = history_dataset_100percent['loss']
val_loss = history_dataset_100percent['val_loss']

plt.figure(figsize=(10, 14))
plt.subplot(3, 1, 1)
plt.plot(acc, label='Training Accuracy')
plt.plot(val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')

plt.subplot(3, 1, 2)
plt.plot(auc, label='Training AUC')
plt.plot(val_auc, label='Validation AUC')
plt.legend(loc='upper right')
plt.ylabel('AUC')
plt.title('Training and Validation AUC')

plt.subplot(3, 1, 3)
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.ylabel('Binary Cross Entropy')
plt.title('Training and Validation Loss')
plt.xlabel('epoch')
plt.show()

In [None]:
acc_1p = history_dataset_1percent['accuracy']
val_acc_1p = history_dataset_1percent['val_accuracy']
acc_100p = history_dataset_100percent['accuracy']
val_acc_100p = history_dataset_100percent['val_accuracy']

loss_1p = history_dataset_1percent['loss']
val_loss_1p = history_dataset_1percent['val_loss']
loss_100p = history_dataset_100percent['loss']
val_loss_100p = history_dataset_100percent['val_loss']

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(acc_1p, label='Training dataset 1%')
plt.plot(val_acc_1p, label='Validation dataset 1%')
plt.plot(acc_100p, label='Training dataset 100%')
plt.plot(val_acc_100p, label='Validation dataset 100%')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')

plt.subplot(2, 1, 2)
plt.plot(loss_1p, label='Training dataset 1%')
plt.plot(val_loss_1p, label='Validation dataset 1%')
plt.plot(loss_100p, label='Training dataset 100%')
plt.plot(val_loss_100p, label='Validation dataset 100%')
plt.legend(loc='upper right')
plt.ylabel('Binary Cross Entropy')
plt.title('Training and Validation Loss')
plt.xlabel('epoch')
plt.show()

## Prediction

In [None]:
x_test = pd.read_csv('../input/g2net-gravitational-wave-detection/sample_submission.csv')

In [None]:
# test dataset
test_dataset = tf.data.Dataset.from_tensor_slices((x_test['id'].apply(get_filepath, is_train=False).values))
test_dataset = test_dataset.map(imgFromPath)
test_dataset = test_dataset.batch(batch_size)
test_dataset = test_dataset.prefetch(tf.data.AUTOTUNE)

In [None]:
predict = False

if predict:
    prediction = model.predict(test_dataset)
    submission = pd.DataFrame({'id': x_test.id, 'target': prediction.flatten()})
    submission.to_csv('submission.csv', index=False)