<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home" align='center'>Table of Contents</h3>
</div>

* [About the Competition](#section-one)
* [Data Provided](#section-two)
* [Import Libraries](#section-three)  
* [Load Data](#section-four)     
* [Class Distribution](#section-five)
* [Data Analysis](#section-six)     

<a id="section-one"></a>
# About the Competition

`G2Net is a network of Gravitational Wave, Geophysics and Machine Learning.` 

Via an Action from COST (European Cooperation in Science and Technology), a funding agency for research and innovation networks, G2Net aims to create a broad network of scientists. From four different areas of expertise, namely GW physics, Geophysics, Computing Science and Robotics, these scientists have agreed on a common goal of tackling challenges in data analysis and noise characterization for GW detectors.

`In this competition, we aim to detect GW signals from the mergers of binary black holes.` 

Specifically, we will build a model to analyze simulated GW time-series data from a network of Earth-based detectors.

The series of images below is taken from the [paper](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.061102) announcing the discovery of gravitational waves from a pair of merging black holes, first observed in 2015.

<img src="https://storage.googleapis.com/kaggle-media/competitions/G2Net-gravitational-waves/800px-LIGO_measurement_of_gravitational_waves.svg.png"/>

**What are we predicting?**

We need to predict the probability that a given observation contains a gravitational wave.

<a id="section-two"></a>
# Data Provided

In this competition we are provided with a training set of time series data containing simulated gravitational wave measurements from a network of 3 gravitational wave interferometers:

1. LIGO Hanford
2. LIGO Livingston
3. Virgo


Each time series contains either detector noise or detector noise plus a simulated gravitational wave signal. 

`The task is to identify when a signal is present in the data (target=1).`

Each data sample (an npy file) contains 3 time series, one per detector. Each series spans 2 seconds and is sampled at 2,048 Hz, which gives 4,096 points per series.
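
As a quick check of these numbers, a 2 s window sampled at 2,048 Hz should hold 4,096 points per detector. The sketch below works that out and builds a matching time axis; the variable names are illustrative and are not part of the competition data.

```python
import numpy as np

SAMPLE_RATE_HZ = 2048   # sampling rate stated in the data description
DURATION_S = 2          # length of each time series in seconds
N_DETECTORS = 3         # LIGO Hanford, LIGO Livingston, Virgo

# number of samples per detector in one file
n_samples = SAMPLE_RATE_HZ * DURATION_S

# time stamps (in seconds) for each sample, starting at 0
time_axis = np.arange(n_samples) / SAMPLE_RATE_HZ

# expected shape of one loaded .npy file
expected_shape = (N_DETECTORS, n_samples)
print(expected_shape, time_axis[0], time_axis[-1])
```

The same `time_axis` could be passed as the x values when plotting a series, so that the horizontal axis reads in seconds rather than sample index.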

##### Files
`train folder` - the training set files, one npy file per observation; labels are provided in the `training_labels.csv` file listed below

`test folder` - the test set files; you must predict the probability that the observation contains a gravitational wave

`training_labels.csv` - target values of whether the associated signal contains a gravitational wave

`sample_submission.csv` - a sample submission file in the correct format

<a id="section-three"></a>
# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import random
from colorama import Fore, Back, Style
# Setting plot styling.
plt.style.use('ggplot')
import warnings
warnings.filterwarnings("ignore")

<a id="section-four"></a>
# Load Data

In [None]:
labels = pd.read_csv("../input/g2net-gravitational-wave-detection/training_labels.csv")

print(Fore.BLUE + "Dataset has ",Style.RESET_ALL + "{} Observations".format(labels.shape[0]))

print(Fore.GREEN + "First 5 Observations:",Style.RESET_ALL)
display(labels.head())

In [None]:
# build a training dataframe for all the available .npy files along with their path

# path of the files
paths = glob("../input/g2net-gravitational-wave-detection/train/*/*/*/*")

# list of ids of .npy files 
ids = [path.split("/")[-1].split(".")[0] for path in paths]

# data frame containing paths and ids of .npy files 
path_df = pd.DataFrame({"path":paths,"id":ids})

# merge the dataframe built above with the dataset having target
train_df = pd.merge(left=labels,right=path_df,on="id")

# this is a comprehensive df that includes "id", "target" and "path" for each .npy file in the train folder
display(train_df.head())

# let's confirm that the dataframe built above has the expected number of rows (560000)
print(Fore.BLUE + "No.of rows in the merged dataframe:",train_df.shape[0],Style.RESET_ALL)
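
One way to guard a merge like this is to let pandas verify that every id appears once on each side, and to list labels that have no matching file. The sketch below uses a tiny made-up frame (the ids and paths are hypothetical) rather than the real competition files.

```python
import pandas as pd

# toy stand-ins for `labels` and `path_df`; ids and paths are made up
toy_labels = pd.DataFrame({"id": ["aaa111", "bbb222", "ccc333"],
                           "target": [1, 0, 1]})
toy_paths = pd.DataFrame({"id": ["aaa111", "bbb222"],
                          "path": ["train/a/a/a/aaa111.npy",
                                   "train/b/b/b/bbb222.npy"]})

# validate="one_to_one" raises if an id is duplicated on either side;
# how="left" with indicator=True keeps labels that have no matching file
merged = pd.merge(toy_labels, toy_paths, on="id", how="left",
                  validate="one_to_one", indicator=True)

# ids that have a label but no .npy file
missing = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print("labels without a file:", missing)
```

An empty `missing` list together with the expected row count would suggest that every label found its file.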

In [None]:
# segregate dataframes for individual classes
target_1 = train_df[train_df.target==1]
target_0 = train_df[train_df.target==0]

<a id="section-five"></a>
# Class Distribution

**The two labels (target=0, target=1) are about equally represented in the dataset.**

In [None]:
print("Class Distribution:\n",labels.target.value_counts())

In [None]:
# visualize class distribution
sns.countplot(x="target", data=labels)
plt.title("Class Distribution")

<a id="section-six"></a>
# Data Analysis (Let's pick up a random file and analyze it)

**Note**
1. We will refer to the 3 different series as SITE-1, SITE-2 & SITE-3 in this notebook.

In [None]:
# visualize the randomly selected series
def plot_series(series,plot,target):
    if plot == "box" or plot == "kde":
        plt.figure(figsize = (20,2))    
    else:
        plt.figure(figsize = (15,12))    
    
    for idx in range(3):
        if plot == "box":
            plt.subplot(1,3,idx+1)            
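            # pass the 1-D series explicitly as x so seaborn draws a single box
            sns.boxplot(x=series[idx],color = 'b')  
            
        elif plot == "kde":
            plt.subplot(1,3,idx+1)            
            # fill=True replaces the deprecated shade=True
            sns.kdeplot(series[idx],color = 'r', fill=True,lw=2, alpha=0.5)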
        else:
            plt.subplot(3,1,idx+1)            
            plt.plot(series[idx:idx+1].T,color = 'g')
            plt.title("\nSite-" + str(idx+1))        
            
    if plot == "box":    
        plt.suptitle("Box Plots(target = " + target + ")")
    elif plot == "kde":    
        plt.suptitle("Probability Distribution Plots(target = " + target + ")")
    else:    
        plt.suptitle("Time Distribution of Signals - Spans 2 sec, Sampled at 2,048 Hz(target = " + target + ")")

        
    plt.show()

In [None]:
# pick a random series(target=1)
target_1 = target_1.sample(1).path.values[0]

pos = np.load(target_1)

print(Fore.BLUE + "Shape of the selected signal:",pos.shape,Style.RESET_ALL)
print("\n\n")

plot_series(pos,"time","1")

In [None]:
# pick a random series(target=0)
target_0 = target_0.sample(1).path.values[0]

neg = np.load(target_0)
print(Fore.BLUE + "Shape of the selected signal:",neg.shape,Style.RESET_ALL)
print("\n\n")

plot_series(neg,"time","0")

**Points to Note:**
1. We have 560000 files, each of dimension 3 * 4096, which adds up to roughly 6.9 billion data points across the training set.

2. The plots for the two randomly picked files differ in the size of their fluctuations: one shows bigger swings, while the other shows smaller, more consistent ones.
   Since each plot comes from a single random file, this difference changes from run to run and should not be read as a property of the class.
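
Visual impressions from a single plot are easy to over-read, so it can help to put numbers on the fluctuations. The sketch below computes the standard deviation and peak-to-peak range per detector. It uses synthetic noise in place of `pos` and `neg`, and the scale values are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins with the same (3, 4096) shape as a real file;
# the 1e-20 scale is an arbitrary assumption, not a measured value
sample_a = rng.normal(0.0, 1e-20, size=(3, 4096))
sample_b = rng.normal(0.0, 2e-20, size=(3, 4096))

def fluctuation_summary(series):
    """Return per-detector standard deviation and peak-to-peak range."""
    return {
        "std": series.std(axis=1),
        "peak_to_peak": np.ptp(series, axis=1),
    }

for name, sample in [("sample_a", sample_a), ("sample_b", sample_b)]:
    stats = fluctuation_summary(sample)
    for site in range(3):
        print(f"{name} Site-{site + 1}: std={stats['std'][site]:.3e}, "
              f"ptp={stats['peak_to_peak'][site]:.3e}")
```

Running the same summary over many files per class, instead of one, would show whether any difference survives averaging.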

In [None]:
# Box plots for target == 1 (Signal is present)
plot_series(pos,"box","1")

In [None]:
# Box plots for target == 0 (Signal is missing)
plot_series(neg,"box","0")

**Points to Note:**
1. The 3 sites have fairly similar distributions for both classes.
2. The only visible difference is at the third site, whose outliers differ from those of the other sites.
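
The outliers drawn by a box plot follow the 1.5 × IQR rule by default, so the difference between sites can be counted instead of eyeballed. A minimal sketch, again on synthetic data rather than the competition files; `count_outliers` is a hypothetical helper.

```python
import numpy as np

def count_outliers(values, whis=1.5):
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((values < lower) | (values > upper)))

rng = np.random.default_rng(1)
demo = rng.normal(size=(3, 4096))   # synthetic (3, 4096) array
demo[2, :10] = 25.0                 # inject 10 extreme values into the third row

for site in range(3):
    print(f"Site-{site + 1}: {count_outliers(demo[site])} outliers")
```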

In [None]:
# Probability Distribution plots for target == 1 (Signal is present)
plot_series(pos,"kde","1")

In [None]:
# Probability Distribution plots for target == 0 (Signal is missing)
plot_series(neg,"kde","0")

**Note**
1. The KDE plots for the two classes look almost identical for Site-3.
2. Site-2 shows slightly more variation for target=0.
3. Site-1 shows more variation for target=1.

In [None]:
from sklearn.model_selection import train_test_split

from keras.utils import Sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv1D, MaxPool1D, BatchNormalization
from keras.optimizers import RMSprop,Adam

import warnings
warnings.filterwarnings("ignore")

# Data generator
The generator below feeds data to the Keras model in real time, loading each batch of files from disk as it is needed.

In [None]:
class DataGenerator(Sequence):
    def __init__(self, path, list_IDs, data, batch_size):
        self.path = path
        self.list_IDs = list_IDs
        self.data = data
        self.batch_size = batch_size
        self.indexes = np.arange(len(self.list_IDs))
        
    def __len__(self):
        len_ = int(len(self.list_IDs)/self.batch_size)
        if len_*self.batch_size < len(self.list_IDs):
            len_ += 1
        return len_
    
    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        X, y = self.__data_generation(list_IDs_temp)
        return X, y
    
    def __data_generation(self, list_IDs_temp):
        # size the arrays to the actual batch so a shorter last batch is not zero-padded
        n = len(list_IDs_temp)
        X = np.zeros((n, 3, 4096))
        y = np.zeros((n, 1))
        for i, ID in enumerate(list_IDs_temp):
            id_ = self.data.loc[ID, 'id']
            file = id_+'.npy'
            path_in = '/'.join([self.path, id_[0], id_[1], id_[2]])+'/'
            data_array = np.load(path_in+file)
            data_array = (data_array-data_array.mean())/data_array.std()
            X[i, ] = data_array
            y[i, ] = self.data.loc[ID, 'target']
        return X, y
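
The batching logic in `__len__` and `__getitem__` can be tried out without Keras or the data on disk. The sketch below is a plain-Python stand-in (`ToyBatcher` is a hypothetical name) that shows how many batches an index list yields and that the final batch may be shorter than `batch_size`.

```python
import math

class ToyBatcher:
    """Minimal stand-in for the index handling of a Keras Sequence."""

    def __init__(self, ids, batch_size):
        self.ids = list(ids)
        self.batch_size = batch_size

    def __len__(self):
        # number of batches, counting a final partial batch
        return math.ceil(len(self.ids) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.ids[start:start + self.batch_size]

batcher = ToyBatcher(range(10), batch_size=4)
batches = [batcher[i] for i in range(len(batcher))]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```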

In [None]:
sample_submission = pd.read_csv('../input/g2net-gravitational-wave-detection/sample_submission.csv')
train_idx =  labels['id'].values
y = labels['target'].values
test_idx = sample_submission['id'].values


In [None]:
train_idx, train_Valx = train_test_split(list(labels.index), test_size=0.33, random_state=2021)
test_idx = list(sample_submission.index)

In [None]:
train_generator = DataGenerator('/kaggle/input/g2net-gravitational-wave-detection/train/', train_idx, labels, 64)
val_generator = DataGenerator('/kaggle/input/g2net-gravitational-wave-detection/train/', train_Valx, labels, 64)
test_generator = DataGenerator('/kaggle/input/g2net-gravitational-wave-detection/test/', test_idx, sample_submission, 64)

In [None]:
model = Sequential()
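# Keras Conv1D reads its input as (steps, channels), so (3, 4096) is treated as 3 steps with 4096 channels
model.add(Conv1D(64, input_shape=(3, 4096,), kernel_size=3, activation='relu'))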
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
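
Keras `Conv1D` uses the `channels_last` layout by default and reads its input as `(batch, steps, channels)`. If the convolution is meant to slide along the 4,096 time samples with the three detectors as channels, each batch can be transposed before it reaches the model. The sketch below shows that transposition on a dummy batch with NumPy only; wiring it into the generator and changing `input_shape` to `(4096, 3)` is an assumption about how one might adapt this notebook, not something the cells here do.

```python
import numpy as np

# dummy batch in the layout produced by the generator: (batch, detectors, samples)
batch = np.zeros((64, 3, 4096), dtype=np.float32)

# swap the last two axes to get (batch, samples, detectors)
batch_channels_last = np.transpose(batch, (0, 2, 1))
print(batch_channels_last.shape)  # (64, 4096, 3)
```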

In [None]:
# learning_rate replaces the deprecated lr argument
model.compile(optimizer = Adam(learning_rate=2e-4),loss='binary_crossentropy',metrics=['acc'])

In [None]:
model.summary()


In [None]:
history = model.fit_generator(generator=train_generator, validation_data=val_generator, epochs = 1, workers=4)

As you can see, we called the model's `fit_generator` method instead of `fit` and passed the training generator as one of the arguments; Keras takes care of the rest. In recent Keras versions `fit_generator` is deprecated, and `fit` accepts a `Sequence` directly.

In [None]:
predict = model.predict_generator(test_generator, verbose=1)

In [None]:
# flatten the (N, 1) predictions to a 1-D column
sample_submission['target'] = predict[:len(sample_submission)].ravel()


In [None]:
sample_submission.to_csv('submission.csv', index=False)
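
Before uploading, it can be worth checking that the file has one row per test id and that every prediction is a valid probability. A small sketch on made-up values (the ids and predictions below are hypothetical):

```python
import numpy as np
import pandas as pd

# toy stand-ins for `sample_submission` and `predict`
toy_submission = pd.DataFrame({"id": ["t1", "t2", "t3"], "target": 0.5})
toy_predict = np.array([[0.9], [0.2], [0.6], [0.0]])  # shape (N, 1); may hold extra rows

# keep one prediction per submission row and flatten to 1-D
toy_submission["target"] = toy_predict[:len(toy_submission)].ravel()

# basic sanity checks on the submission frame
ids_unique = toy_submission["id"].is_unique
in_range = toy_submission["target"].between(0, 1).all()
print(ids_unique, in_range)
```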