# Intro

Machine learning can help astronomers sort through the large volumes of data recorded during space exploration.

**Gravitational Wave (GW):**

Very simply,
`A gravitational wave is like ripples in space time. It is usually caused by some of the most violent and energetic processes in the Universe.`

They are invisible, yet incredibly fast: they travel at the speed of light.

**Why do we need to detect GWs?**

Detecting and analyzing the information carried by gravitational waves lets us observe the Universe in a way that was never possible before, giving astronomers and other scientists their first glimpses of literally unseeable wonders.
When a gravitational wave passes by Earth, it squeezes and stretches space. LIGO can detect this squeezing and stretching. Each LIGO observatory has two “arms” that are each more than 2 miles (4 kilometers) long.
 
##### **above info collected via some very rough googling**
 
### Goal

A GW was first detected in September 2015, when two black holes merged into one bigger black hole.

In this competition, our goal is to detect GW signals from the mergers of binary black holes.

I am going to document this process, since I am starting with zero knowledge of any of this.

The following three kernels have been my main inspiration for understanding this whole task as well as I could. They are really well explained and worth mentioning.

- [kernel 1](https://www.kaggle.com/pranay1990/pranay-g2net-gw)
- [kernel 2](https://github.com/SiddharthPatel45/gravitational-wave-detection/blob/main/code/gw-detection-modelling.ipynb)
- [kernel 3](https://www.kaggle.com/atamazian/nnaudio-constant-q-transform-demonstration/comments)

Thank you for sharing your work.

# Imports

In [None]:
!pip install -q nnAudio

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
from glob import glob
from tqdm import tqdm
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import GlobalAveragePooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.metrics import AUC

import librosa.display
import torch

# this is used for the Constant-Q Transform (CQT)
from nnAudio.Spectrogram import CQT1992v2
from tensorflow.keras.applications import EfficientNetB0 as efn

## Files

```
train: contains one npy file per observation.

test: one npy file per observation; for each one we have to predict the probability that it contains a gravitational wave.

training_labels: whether the associated signal contains a GW (1) or not (0).
```

The output of a GW detector always contains noise, so researchers need to find out whether a given output is only **noise** or **signal+noise**.


We are provided with a training set of time series data containing simulated gravitational wave measurements from a network of 3 gravitational wave interferometers (LIGO Hanford, LIGO Livingston, and Virgo). 

We therefore treat this as a binary classification problem: either a GW signal is present in the observation or it is not.
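
To make the **noise** versus **signal+noise** distinction concrete, here is a minimal sketch on synthetic data. The "chirp" below is a made-up sinusoid with rising frequency, not a physical waveform model, and the amplitudes and the names `toy_signal`, `noise_only` and `signal_plus_noise` are my own assumptions for illustration.

```python
import numpy as np

SAMPLE_RATE = 2048   # Hz, assumed sampling rate for this toy example
N_SAMPLES = 4096     # 2 seconds of data at that rate

rng = np.random.default_rng(0)
t = np.arange(N_SAMPLES) / SAMPLE_RATE

# Toy "chirp": instantaneous frequency rises linearly from 30 Hz to 130 Hz.
# This is only an illustration, not a real gravitational waveform.
phase = 2 * np.pi * (30 * t + 25 * t**2)
toy_signal = 0.5 * np.sin(phase)

noise = rng.normal(0.0, 1.0, N_SAMPLES)

noise_only = noise                       # would be labelled target = 0
signal_plus_noise = noise + toy_signal   # would be labelled target = 1

print(noise_only.shape, signal_plus_noise.shape)
```

The two arrays look almost identical by eye, which is why a classifier is useful here.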
 

## Data Exploration

In [None]:
train_label_dataset = pd.read_csv("../input/g2net-gravitational-wave-detection/training_labels.csv")
train_label_dataset.head()

In [None]:
train_label_dataset.shape

When `target = 1`, the signal (GW) is present; when `target = 0`, it is not.

In [None]:
sns.countplot(data=train_label_dataset, x="target")

In [None]:
train_label_dataset['target'].value_counts()

Looking for null values:

In [None]:
train_label_dataset.isnull().sum() # no null

In [None]:
train_path = glob('../input/g2net-gravitational-wave-detection/train/*/*/*/*')

There are 560,000 **.npy** files in the `train` set

In [None]:
len(train_path)

Let's take a look at what the data looks like, starting with the file at index 3:

In [None]:
explore_sample_3 = np.load(train_path[3])
explore_sample_3

We can see that the data has 3 rows. They hold the data recorded by the 3 gravitational wave interferometers (LIGO Hanford, LIGO Livingston, and Virgo) respectively.

In [None]:
explore_sample_3.shape

Each row of `explore_sample_3` has **4096** values:

In [None]:
print(len(explore_sample_3[0]), len(explore_sample_3[1]), len(explore_sample_3[2]))
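
As a small worked example, the 4096 values per row can be turned into a time axis. This sketch assumes the 2048 Hz sampling rate that is passed as `sr=2048` elsewhere in this notebook, and uses a zero-filled stand-in array (`toy_sample`, a hypothetical name) instead of a real file.

```python
import numpy as np

SAMPLE_RATE = 2048                 # Hz, assumed (matches sr=2048 used in this notebook)
toy_sample = np.zeros((3, 4096))   # stand-in with the same shape as one .npy file

n_detectors, n_samples = toy_sample.shape
duration = n_samples / SAMPLE_RATE              # length of the recording in seconds
time_axis = np.arange(n_samples) / SAMPLE_RATE  # timestamp of every value

print(n_detectors, n_samples, duration)
```

Under that assumption, each row covers 2 seconds of data.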

In [None]:
# just a tensor representation
tf.convert_to_tensor(explore_sample_3[0])

## Exploring the sample data with Librosa

Librosa is a Python package for music and audio analysis. More about this awesome library can be found [here](https://librosa.org/doc/latest/index.html).

There is a very good kernel [here](https://www.kaggle.com/hinamimi/visualization-gravitational-wave-with-librosa).
It has a really great demonstration of how to use Librosa.

First, I will find the `id` of `explore_sample_3` and look it up in the `training_labels.csv` dataset. After that I can see whether the target is 1 or 0.

- 0 = negative sample
- 1 = positive sample

In [None]:
train_path[3]

The value of `train_path` at index 3 looks like:

'../input/g2net-gravitational-wave-detection/train/7/7/7/77727f6826.npy'

So we know that the id of `explore_sample_3` is **77727f6826**. The following snippet extracts the id from the path.



In [None]:
rind = train_path[3].rindex('/') # last index where the character '/' appeared
extracted_id_for_explore_sample_3 = train_path[3][rind+1:].replace('.npy', '') # replaced .npy
extracted_id_for_explore_sample_3
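
An alternative sketch of the same extraction using `pathlib` from the standard library: the `stem` of a path is the file name without its last extension. The path below is the example shown above, and `extracted_id` is just an illustrative name.

```python
from pathlib import Path

example_path = '../input/g2net-gravitational-wave-detection/train/7/7/7/77727f6826.npy'

# .stem drops the folders and the trailing ".npy"
extracted_id = Path(example_path).stem
print(extracted_id)
```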

Looking it up in the labels, we see that it is a positive sample:

In [None]:
train_label_dataset[train_label_dataset['id']==extracted_id_for_explore_sample_3]['target']

In [None]:
positive_sample = explore_sample_3
# index 1 of train_path has a target of 0, so it is a negative sample.
negative_sample = np.load(train_path[1])
negative_sample

In [None]:
samples = (positive_sample, negative_sample)
targets = (1, 0)

Using `librosa.display.waveshow()` to view the raw waves:

Kernel: https://www.kaggle.com/hinamimi/visualization-gravitational-wave-with-librosa

In [None]:
colors = ("red", "green", "blue")
signal_names = ("LIGO Hanford", "LIGO Livingston", "Virgo")

for x, i in tqdm(zip(samples, targets)):
    figure = plt.figure(figsize=(16, 7))
    figure.suptitle(f'Raw wave (target={i})', fontsize=20)
    # range(3) because each sample has 3 rows, one per interferometer
    for j in range(3):
        axes = figure.add_subplot(3, 1, j+1)
        librosa.display.waveshow(x[j], sr=2048, ax=axes, color=colors[j])
        axes.set_title(signal_names[j], fontsize=12)
        axes.set_xlabel('Time[sec]')
    plt.tight_layout()
    plt.show()

In [None]:
sns.displot(positive_sample[0,:])

## Working with a cleaner dataset by merging `train` and `training_labels` datasets

In [None]:
pd.set_option('display.max_colwidth',None)

In [None]:
ids = []
for files in train_path:
    ids.append(files[files.rindex('/')+1:].replace('.npy',''))
df = pd.DataFrame({"id":ids,"path":train_path})
df = pd.merge(df, train_label_dataset, on='id')

In [None]:
df.head()

In [None]:
df.shape

# Preprocessing

#### Core Idea: 
Check whether any particular frequency is prominent throughout the signal. If it is, the GW we are looking for is probably present.
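
A minimal sketch of this idea on synthetic data: a single tone buried in noise is hard to see in the time domain but shows up as a clear peak in the frequency domain. The 100 Hz tone, the noise level and the plain FFT are my own assumptions for illustration; the notebook itself uses a Constant-Q transform rather than a plain FFT.

```python
import numpy as np

SAMPLE_RATE = 2048   # Hz
N_SAMPLES = 4096
TONE_HZ = 100.0      # hypothetical frequency hidden in the noise

rng = np.random.default_rng(42)
t = np.arange(N_SAMPLES) / SAMPLE_RATE
noisy = np.sin(2 * np.pi * TONE_HZ * t) + rng.normal(0.0, 1.0, N_SAMPLES)

# Magnitude spectrum of the real-valued signal and the matching frequency axis
spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(N_SAMPLES, d=1 / SAMPLE_RATE)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)
```

The strongest bin sits at the frequency of the hidden tone, even though the noise has the larger amplitude.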

Approach:

- convert the original signal  -->  a spectrogram
- i.e. converting from the time domain  -->  the time-frequency domain
    - done using **[Constant Q transformation](https://en.wikipedia.org/wiki/Constant-Q_transform)**
    - **[kernel](https://www.kaggle.com/atamazian/nnaudio-constant-q-transform-demonstration/comments)**


I refer to the kernel [here](https://www.kaggle.com/atamazian/nnaudio-constant-q-transform-demonstration/comments) to define my CQT.

Please have a look. 
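
To give a feel for what "constant Q" means, here is a small sketch of the frequency grid such a transform uses: the centre frequencies are spaced geometrically, so the ratio of centre frequency to bandwidth (the Q factor) is the same for every bin. The starting frequency of 20 Hz and the 12 bins per octave are assumptions for illustration.

```python
import numpy as np

FMIN = 20.0             # Hz, assumed lowest centre frequency
BINS_PER_OCTAVE = 12    # assumption: 12 bins for every doubling of frequency

# One octave of centre frequencies: f_k = FMIN * 2 ** (k / BINS_PER_OCTAVE)
k = np.arange(BINS_PER_OCTAVE + 1)
center_freqs = FMIN * 2.0 ** (k / BINS_PER_OCTAVE)

# Bandwidth of each bin grows with its centre frequency ...
bandwidths = center_freqs * (2.0 ** (1 / BINS_PER_OCTAVE) - 1)
# ... so centre frequency / bandwidth stays constant
q_factor = center_freqs / bandwidths

print(center_freqs.round(2))
print(q_factor.round(2))
```

Low frequencies get narrow bins and high frequencies get wide ones, which suits a signal whose frequency sweeps upwards.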

In [None]:
# CQT
transform = CQT1992v2(sr=2048,        # sample rate
                fmin=20,        # min freq
                fmax=500,      # max freq
                hop_length=64,  # hop length
                verbose=False)

In [None]:
# the CQT preprocessing function:
# loads one observation and turns it into a CQT image
def preprocess_function_cqt(path):
    signal = np.load(path.numpy())
    # there are 3 signals, one per interferometer, as explained before
    for i in range(signal.shape[0]):
        # normalize each signal by its maximum value
        signal[i] /= np.max(signal[i])
    # horizontal stack: join the 3 signals end to end into one 1-D array
    signal = np.hstack(signal)
    # tensor conversion
    signal = torch.from_numpy(signal).float()
    # getting the image from the CQT transform
    image = transform(signal)
    # converting from tensor to array
    image = np.array(image)
    # transpose the image to get the right orientation (channel last)
    image = np.transpose(image,(1,2,0))
    
    # convert the image to a tf.Tensor and return it
    return tf.convert_to_tensor(image)
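
The normalize and stack steps can be tried on their own, without the CQT. This is a sketch on a random stand-in array (`toy_observation`, a hypothetical name); the tiny amplitudes are an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for one observation: 3 detectors x 4096 values with tiny amplitudes
toy_observation = rng.normal(0.0, 1e-20, size=(3, 4096))

scaled = toy_observation.copy()
for i in range(scaled.shape[0]):
    # same scaling as in the function above: divide each row by its own maximum
    scaled[i] /= np.max(scaled[i])

# np.hstack on a 2-D array joins its rows end to end into one 1-D array
stacked = np.hstack(scaled)
print(stacked.shape, scaled.max(axis=1))
```

After scaling, the largest value in every row is 1, and the stacked array has 3 x 4096 = 12288 values. Note that dividing by the maximum does not bound the most negative value at -1.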

In [None]:
image = preprocess_function_cqt(tf.convert_to_tensor(df['path'][2]))
print(image.shape)
plt.imshow(image)

The same for a different path:

In [None]:
image = preprocess_function_cqt(tf.convert_to_tensor(df['path'][5069]))
print(image.shape)
plt.imshow(image)

We can see that the image shape is **(56, 193, 1)**, so that is the shape of our input.

In [None]:
input_shape = (56, 193, 1)
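
One way to account for this shape with a bit of arithmetic. The sketch assumes 12 bins per octave and frames centred on a padded signal; both are assumptions about the transform's defaults rather than something set explicitly in this notebook.

```python
import math

N_SAMPLES = 3 * 4096       # three detector rows stacked end to end
FMIN, FMAX = 20, 500       # Hz, as passed to the CQT above
HOP_LENGTH = 64            # as passed to the CQT above
BINS_PER_OCTAVE = 12       # assumption: not set explicitly in the notebook

# number of frequency bins needed to cover FMIN..FMAX
n_bins = math.ceil(BINS_PER_OCTAVE * math.log2(FMAX / FMIN))
# number of time frames, assuming the signal is padded so frames are centred
n_frames = N_SAMPLES // HOP_LENGTH + 1

print((n_bins, n_frames, 1))
```

Under these assumptions the result is (56, 193, 1), which matches the shape printed above.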

In [None]:
def preprocess_function_parse_tf(path, y=None):
    [x] = tf.py_function(func=preprocess_function_cqt, inp=[path], Tout=[tf.float32])
    x = tf.ensure_shape(x, input_shape)
    if y is None:
        return x
    else:
        return x,y

In [None]:
# preprocess_function_parse_tf(tf.convert_to_tensor(df['path'][5069]))

### I will define the `training` and `validation` dataset from `df`

In [None]:
X = df['id']
y = df['target'].astype('int8').values

In [None]:
y

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X, y, random_state = 42, stratify = y)

In [None]:
batch_size = 250
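
A quick bit of arithmetic on what this split and batch size mean, assuming all 560,000 training files end up in `df` and that `train_test_split` uses its default validation share of 25% (no `test_size` is passed above).

```python
import math

N_FILES = 560_000    # number of training files reported earlier
TEST_SIZE = 0.25     # scikit-learn's default when test_size is not given
BATCH_SIZE = 250

n_valid = math.ceil(N_FILES * TEST_SIZE)
n_train = N_FILES - n_valid

# number of batches per epoch for each dataset
steps_train = math.ceil(n_train / BATCH_SIZE)
steps_valid = math.ceil(n_valid / BATCH_SIZE)

print(n_train, n_valid, steps_train, steps_valid)
```

So one epoch is roughly 1680 training batches followed by 560 validation batches.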

In [None]:
def get_npy_filepath(id_, is_train=True):
    # files are nested in folders named after the first three characters of the id
    folder = 'train' if is_train else 'test'
    return f'../input/g2net-gravitational-wave-detection/{folder}/{id_[0]}/{id_[1]}/{id_[2]}/{id_}.npy'
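
A quick sanity check of this folder layout, using the id found earlier. `toy_npy_filepath` is a hypothetical stand-alone copy of the same idea, written here only so the example runs on its own.

```python
def toy_npy_filepath(id_, folder='train'):
    # hypothetical helper: the first three characters of the id
    # name the three nested folders
    root = '../input/g2net-gravitational-wave-detection'
    return f'{root}/{folder}/{id_[0]}/{id_[1]}/{id_[2]}/{id_}.npy'

example = toy_npy_filepath('77727f6826')
print(example)
```

This reproduces the path that `glob` returned for index 3 in the exploration section.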

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train.apply(get_npy_filepath).values, y_train))
# shuffle the dataset
train_dataset = train_dataset.shuffle(len(x_train))
train_dataset = train_dataset.map(preprocess_function_parse_tf, num_parallel_calls=tf.data.AUTOTUNE)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

In [None]:
valid_dataset = tf.data.Dataset.from_tensor_slices((x_valid.apply(get_npy_filepath).values, y_valid))
valid_dataset = valid_dataset.map(preprocess_function_parse_tf, num_parallel_calls=tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(batch_size)
valid_dataset = valid_dataset.prefetch(tf.data.AUTOTUNE)

In [None]:
train_dataset

In [None]:
valid_dataset

# Creating the Model

In [None]:
train_dataset.take(1)

Model from [here](https://github.com/SiddharthPatel45/gravitational-wave-detection/blob/main/code/gw-detection-modelling.ipynb) ~

In [None]:
# Instantiate the Sequential model
model_cnn = Sequential(name='CNN_model')

# Add the first Conv2D layer w/ input_shape, followed by a MaxPooling2D layer
model_cnn.add(Conv2D(filters=16,
                     kernel_size=3,
                     input_shape=input_shape,
                     activation='relu',
                     name='Conv_01'))
model_cnn.add(MaxPooling2D(pool_size=2, name='Pool_01'))

# Second pair of Conv2D and MaxPooling2D layers
model_cnn.add(Conv2D(filters=32,
                     kernel_size=3,
                     input_shape=input_shape,
                     activation='relu',
                     name='Conv_02'))
model_cnn.add(MaxPooling2D(pool_size=2, name='Pool_02'))

# Third pair of Conv2D and MaxPooling2D layers
model_cnn.add(Conv2D(filters=64,
                     kernel_size=3,
                     input_shape=input_shape,
                     activation='relu',
                     name='Conv_03'))
model_cnn.add(MaxPooling2D(pool_size=2, name='Pool_03'))

# Add the Flatten layer
model_cnn.add(Flatten(name='Flatten'))

# Add the Dense layers
model_cnn.add(Dense(units=512,
                activation='relu',
                name='Dense_01'))
model_cnn.add(Dense(units=64,
                activation='relu',
                name='Dense_02'))

# Add the final Output layer
model_cnn.add(Dense(1, activation='sigmoid', name='Output'))

In [None]:
model_cnn.summary()
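
To help read the summary, here is a plain-Python sketch that traces the tensor shape through the layers and counts the parameters by hand. It assumes the Keras defaults used above (3x3 convolutions with no padding and stride 1, 2x2 pooling); the helper names `conv_valid`, `max_pool` and `dense` are made up for this example.

```python
def conv_valid(shape, filters, kernel=3):
    # unpadded convolution: each side shrinks by kernel - 1
    h, w, c = shape
    params = filters * (kernel * kernel * c) + filters  # weights + biases
    return (h - kernel + 1, w - kernel + 1, filters), params

def max_pool(shape, pool=2):
    # pooling halves height and width (rounding down), no parameters
    h, w, c = shape
    return (h // pool, w // pool, c)

def dense(n_in, units):
    return units, n_in * units + units  # weights + biases

shape = (56, 193, 1)
total = 0
for filters in (16, 32, 64):
    shape, p = conv_valid(shape, filters)
    total += p
    shape = max_pool(shape)

flat = shape[0] * shape[1] * shape[2]   # size after the Flatten layer
n = flat
for units in (512, 64, 1):
    n, p = dense(n, units)
    total += p

print(shape, flat, total)
```

Almost all of the parameters sit in the first Dense layer, because the flattened feature map is still 7040 values wide.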

In [None]:
model_cnn.compile(optimizer=Adam(learning_rate=0.0001),
                  loss='binary_crossentropy',
                  metrics=[AUC(), 'accuracy'])
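
For intuition about the AUC metric: it can be read as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. Below is a small NumPy sketch of that definition. It is not how Keras computes `AUC()` internally (Keras approximates the area using a fixed set of thresholds), and `roc_auc` is a name made up for this example.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In this toy example three of the four pairs are ordered correctly, so the AUC is 0.75. A model that guesses at random scores about 0.5.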

In [None]:
# Fit the data (the datasets are already batched, so batch_size is not passed here)
history_cnn = model_cnn.fit(x=train_dataset,
                            epochs=3,
                            validation_data=valid_dataset,
                            verbose=1)

Saving the model after training is complete:

In [None]:
model_cnn.save('./model/cnn_model.h5')

In [None]:
ls -a ./

# Preprocessing Test 

In [None]:
ls -a ../input/g2net-gravitational-wave-detection/sample_submission.csv

Assigning the submission ids to the test set so that we can make predictions on them:

In [None]:
sub = pd.read_csv('../input/g2net-gravitational-wave-detection/sample_submission.csv')
x_test = sub[['id']]

In [None]:
x_test.tail()

In [None]:
# test dataset
test_dataset = tf.data.Dataset.from_tensor_slices((x_test['id'].apply(get_npy_filepath, is_train=False).values))
test_dataset = test_dataset.map(preprocess_function_parse_tf, num_parallel_calls=tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(batch_size)
test_dataset = test_dataset.prefetch(tf.data.AUTOTUNE)

In [None]:
test_dataset

# Prediction

Now we will load the CNN model that we saved after training, to make predictions on `test_dataset`.

In [None]:
ls -a ./model/

In [None]:
saved_cnn_model = tf.keras.models.load_model('./model/cnn_model.h5')

In [None]:
saved_cnn_model

Training the saved model further on `valid_dataset`, so that it has seen the validation data as well:

> previously we set x = train_dataset

In [None]:
saved_cnn_model.fit(x=valid_dataset, epochs=3, verbose=1)

Now saving the full model after this extra training, to make predictions on `test_dataset`:

In [None]:
saved_cnn_model.save('./model/full_cnn_model.h5')

In [None]:
full_cnn_model = tf.keras.models.load_model('./model/full_cnn_model.h5')

In [None]:
prediction = full_cnn_model.predict(test_dataset)

In [None]:
prediction

In [None]:
prediction = prediction.flatten()
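
Why the flatten step is needed, shown on a tiny made-up array: a model with a single sigmoid output returns one column per row, i.e. shape `(N, 1)`, while a DataFrame column needs a 1-D array of shape `(N,)`. The values in `toy_prediction` are invented for illustration.

```python
import numpy as np

# stand-in for the model output: one probability per row, shape (3, 1)
toy_prediction = np.array([[0.12], [0.87], [0.55]])

# flatten() returns a 1-D copy, shape (3,)
flat = toy_prediction.flatten()

print(toy_prediction.shape, flat.shape)
```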

# Preparing to Submit

In [None]:
submission = pd.DataFrame({'id': x_test.id, 'target': prediction})

In [None]:
submission.shape

In [None]:
submission.head()

In [None]:
submission.to_csv('./submission.csv', index=False)

I started with zero idea and ended up learning a lot of new things. I am very thankful to all these resources, which helped me increase my knowledge and gave me more insight as I work to improve my skills on my coding journey.

I tried to reference as much as I could.