## Data preparation and EDA

### Importing training labels file

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
df_train_label=pd.read_csv('../input/g2net-gravitational-wave-detection/training_labels.csv')
df_train_label.head()

In [None]:
df_train_label.shape

### Importing libraries

In [None]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob

#### Importing the paths to the files

In [None]:
# path of the files
paths_files = glob("../input/g2net-gravitational-wave-detection/train/*/*/*/*")
#paths_files

In [None]:
len(paths_files)

There are about 560000 **.npy** files in the list. Now we look at one particular .npy file, as shown below.

In [None]:
# Loading the first .npy data
data=np.load(paths_files[0])
data

In [None]:
data.shape

#### From the above output and the dataset description we can conclude the following:
1. The sampling rate is 2048 Hz, meaning there are 2048 samples per second. This is stated in the dataset description.
2. The three rows of the **data** variable correspond to the 3 sites mentioned in the data description: LIGO Hanford (SITE1), LIGO Livingston (SITE2), and Virgo (SITE3).
3. The **data** variable has $4096=2048\times 2$ columns, i.e. the samples recorded over a span of 2 seconds.
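The arithmetic in the list above can be sketched as follows. The zero array is only a stand-in with the same shape as one file, and the constant names are chosen here for illustration.

```python
import numpy as np

SAMPLE_RATE_HZ = 2048          # from the dataset description
N_SITES, N_SAMPLES = 3, 4096   # shape of one .npy file

# Stand-in array with the same shape as one file (not real data)
signal = np.zeros((N_SITES, N_SAMPLES))

# Duration covered by one file, in seconds
duration_s = signal.shape[1] / SAMPLE_RATE_HZ

# Time stamp of every sample, usable as the x axis of a plot
time_s = np.arange(signal.shape[1]) / SAMPLE_RATE_HZ

print(duration_s)
```

Plotting against `time_s` instead of the sample index would label the x axis in seconds.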


In [None]:
print(np.min(data),np.max(data))

In [None]:
# Checking whether there are any missing values
df_train_label.isnull().sum()

In [None]:
df_train_label['target'].hist()

In [None]:
df_train_label['target'].value_counts()

Almost balanced data.
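To put a number on "almost balanced", one option is to compute the class fractions directly. This is a minimal sketch on made-up labels, not the competition data.

```python
import numpy as np

# Made-up labels for illustration only
targets = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 1])

# Number of samples per class (index 0 -> target 0, index 1 -> target 1)
counts = np.bincount(targets, minlength=2)

# Fraction of positive samples; 0.5 would be perfectly balanced
positive_fraction = counts[1] / counts.sum()

print(counts, positive_fraction)
```

On the real labels, `df_train_label['target'].value_counts(normalize=True)` gives the same fractions.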

#### Motivated by the compact dataset representation in the Kaggle notebook [here](https://www.kaggle.com/rawaaelghali/g2net-gravitational-starter-eda), we build a similar compact dataframe, as shown below.

In [None]:
# Use each file name without its .npy extension as the id
ids = [os.path.splitext(os.path.basename(p))[0] for p in paths_files]

# data frame containing paths and ids of .npy files 
path_df = pd.DataFrame({"id":ids,"path":paths_files})
path_df.head()

In [None]:
path_df.shape

### We now join both dataframes into a resulting dataframe **df**

In [None]:
df=pd.merge(path_df,df_train_label,on='id')
del path_df, df_train_label  # free memory; only df is used from here on
df.head()

In [None]:
df.shape
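An inner join silently drops ids that appear in only one of the two frames. The sketch below shows one way to check for that, using tiny made-up frames with hypothetical ids and paths. The `validate` and `indicator` arguments are standard `pd.merge` options.

```python
import pandas as pd

# Tiny made-up frames; the ids and paths are hypothetical
paths = pd.DataFrame({"id": ["a1", "b2", "c3"],
                      "path": ["train/a/a1.npy", "train/b/b2.npy", "train/c/c3.npy"]})
labels = pd.DataFrame({"id": ["a1", "b2", "d4"], "target": [1, 0, 1]})

# The outer join keeps every id; validate raises if an id is duplicated on either side
merged = pd.merge(paths, labels, on="id", how="outer",
                  validate="one_to_one", indicator=True)

# Rows present in only one frame would be lost by an inner join
unmatched = merged[merged["_merge"] != "both"]
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(len(matched), sorted(unmatched["id"]))
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.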

In [None]:
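# Use .iloc[0] for the first matching row; a plain [0] looks up index label 0, which may have target 0
df[df['target']==1]['path'].iloc[0]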

#### Visualizing individual .npy files: first the file loaded earlier, then one with target=1 and one with target=0

In [None]:
for i in range(0,data.shape[0]):
    plt.plot(np.arange(0, data.shape[1], 1),data[i,:])
    plt.show()

#### First we look into the case when target=1

In [None]:
data1=np.load(df[df['target']==1]['path'].iloc[0])
data1

In [None]:
for i in range(0,data1.shape[0]): 
    plt.figure(figsize=(14,2))
    plt.plot(np.arange(0, data1.shape[1], 1),data1[i,:])
    # naming the x axis
    plt.xlabel('sample')
    # naming the y axis
    plt.ylabel('output')
    # naming the title
    plt.title('SITE'+str(i+1)+'(target=1)')
    plt.xlim(0,4096)
    plt.show()

#### Now we look at the case when target=0

In [None]:
data0=np.load(df[df['target']==0]['path'].iloc[0])
data0

In [None]:
for i in range(0,data0.shape[0]): 
    plt.figure(figsize=(14,2))
    plt.plot(np.arange(0, data0.shape[1], 1),data0[i,:])
    # naming the x axis
    plt.xlabel('sample')
    # naming the y axis
    plt.ylabel('output')
    # naming the title
    plt.title('SITE'+str(i+1)+'(target=0)')
    plt.xlim(0,4096)
    plt.show()

In [None]:
df[df['target']==0]['path'].iloc[0]

In [None]:
sns.displot(data1[0,:])

In [None]:
fig, axes = plt.subplots(3, 2, sharex=True, figsize=(14,12))
fig.suptitle('Distribution plots')
for i in range(0,data1.shape[0]):
    sns.histplot(ax=axes[i, 0], data=data1[i,:])
    axes[i,0].set_title('SITE'+str(i+1)+'(target=1)')
    sns.histplot(ax=axes[i, 1], data=data0[i,:])
    axes[i,1].set_title('SITE'+str(i+1)+'(target=0)')

#### For this pair of files, the target=1 signal appears to have a higher spread at SITE1, and the target=0 signal a higher spread at SITE3. Two files are too few to generalize from.
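The visual impression of spread can be backed by a number such as the per-site standard deviation. This is a minimal sketch on synthetic arrays that stand in for `data1` and `data0`; the scales are arbitrary and not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for data1 and data0, each of shape (3, 4096)
site_scales = np.array([[1.0], [2.0], [3.0]])
sample_a = rng.normal(size=(3, 4096)) * site_scales
sample_b = rng.normal(size=(3, 4096)) * site_scales[::-1]

# One standard deviation per site (row)
std_a = sample_a.std(axis=1)
std_b = sample_b.std(axis=1)

# A ratio above 1 means sample_a has the wider spread at that site
spread_ratio = std_a / std_b
print(spread_ratio)
```

Applying the same two `std(axis=1)` calls to `data1` and `data0` would give the comparison for the real files.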

#### Creating Training and validation set

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_val= train_test_split(df, test_size=0.2, random_state=0)

In [None]:
df_train.head()

In [None]:
df_val.head()

#### To build a custom data generator, refer to this article [ref.(1)](https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c); it is extremely useful.

In [None]:
import tensorflow as tf
from tensorflow.keras.utils import Sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv1D, MaxPool1D, BatchNormalization
from tensorflow.keras.optimizers import RMSprop,Adam


Training our model directly on the **.npy** files is slow, because loading the data takes much longer than the ML computations themselves. The ML computations mostly run on the GPU, while the data is loaded by the CPU. Older data pipelines made the GPU wait for the CPU to load the data, which hurt performance [ref.(2)](https://cs230.stanford.edu/blog/datapipeline/). The **tf.data** API addresses this by letting you build complex input pipelines from simple, reusable pieces [ref.(3)](https://www.tensorflow.org/guide/data). However, **tf.data** has no built-in reader for **.npy** files, which is a problem when the dataset does not fit in memory. I found a solution on Stack Overflow, linked [here](https://stackoverflow.com/questions/48889482/feeding-npy-numpy-files-into-tensorflow-data-pipeline).

#### It is possible to read NPY files directly with TensorFlow, without converting them to TFRecords. The key pieces are [**tf.data.FixedLengthRecordDataset**](https://www.tensorflow.org/api_docs/python/tf/data/FixedLengthRecordDataset) and [**tf.io.decode_raw**](https://www.tensorflow.org/api_docs/python/tf/io/decode_raw), together with the documentation of the [**.npy**](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format. An **.npy** file is a binary file with a small header followed by the raw array data (object arrays are different, but here we only deal with numbers). The Stack Overflow answer assumes a **float32** array of shape (N, K) whose dtype and number of features K are known beforehand; here each file holds an array of shape (3, 4096), which the cells below decode as *float64*. You can find the size of the header with a function like this:

In [None]:
def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()
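The `header_len` line in the function above decodes a little-endian unsigned integer byte by byte (2 bytes in format version 1, 4 bytes in version 2). This sketch shows the same decoding with two standard-library alternatives; the two raw bytes are an arbitrary example value.

```python
import struct

# Two arbitrary header-length bytes, as stored in a version 1 file
raw = bytes([0x76, 0x00])

# Byte-by-byte decoding, as in npy_header_offset
manual = sum(b << (8 * i) for i, b in enumerate(raw))

# Equivalent standard-library forms
builtin = int.from_bytes(raw, "little")
via_struct = struct.unpack("<H", raw)[0]   # "<H": little-endian unsigned 16-bit

print(manual, builtin, via_struct)
```

For a version 2 file the `struct` format would be `"<I"` (unsigned 32-bit) instead.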

In [None]:
header_size = npy_header_offset(df['path'].iloc[0])
header_size

In [None]:
file_length = os.path.getsize(df['path'].iloc[0])
file_length

In [None]:
file_length-header_size

In [None]:
3*4096*tf.float64.size
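The size check above can be reproduced without TensorFlow. This is a minimal sketch that writes a small toy array to a temporary file, lets `numpy.lib.format` parse the header, and decodes the remaining bytes by hand. The toy shape (3, 8) is an assumption for illustration and is much smaller than the real files.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format

# Toy array standing in for one (3, 4096) float64 file
arr = np.arange(3 * 8, dtype=np.float64).reshape(3, 8)

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "toy.npy")
    np.save(npy_path, arr)
    file_size = os.path.getsize(npy_path)

    with open(npy_path, "rb") as f:
        # Read the magic string and format version, then the header itself
        version = npy_format.read_magic(f)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        else:
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
        header_offset = f.tell()   # same quantity npy_header_offset returns
        payload = f.read()         # raw array bytes after the header

# Everything after the header is the raw data: n_elements * itemsize bytes
payload_bytes = file_size - header_offset

# Decoding the raw bytes by hand recovers the original array
decoded = np.frombuffer(payload, dtype=dtype).reshape(shape)

print(payload_bytes, arr.size * arr.itemsize)
```

This is the same idea that `tf.io.decode_raw` followed by `tf.reshape` applies to each record in the cells that follow.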

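The documentation for tf.data.FixedLengthRecordDataset is [here](https://www.tensorflow.org/api_docs/python/tf/data/FixedLengthRecordDataset). Note that the cell below reuses the `header_size` measured on a single file, so it assumes that every file has a header of the same length.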

In [None]:
tf_data_train=tf.data.FixedLengthRecordDataset( df_train['path'], 3*4096*tf.float64.size,\
                                         header_bytes=header_size, num_parallel_reads=4)
tf_data_val=tf.data.FixedLengthRecordDataset( df_val['path'], 3*4096*tf.float64.size,\
                                         header_bytes=header_size, num_parallel_reads=4)

In [None]:
tf_data_train

In [None]:
tf_data_train = tf_data_train.map(lambda s: tf.reshape(\
                                                       tf.io.decode_raw(s, tf.float64),\
                                                       (3,4096)))
tf_data_val = tf_data_val.map(lambda s: tf.reshape(\
                                                       tf.io.decode_raw(s, tf.float64),\
                                                       (3,4096)))
tf_data_train

In [None]:
for i in tf_data_train.take(3):
    print(i)

In [None]:
for i in tf_data_val.take(3):
    print(i)

#### I am loving it.

### Now we zip the target column with the TensorFlow dataset

In [None]:
tf_data_train= tf.data.Dataset.zip((tf_data_train,\
                             tf.data.Dataset.from_tensor_slices(df_train['target']))) 

In [None]:
i=0
for data, target in tf_data_train.take(3):
    print("tf_data_train")
    print(data.numpy(),target.numpy())
    print("df_train")
    print(np.load(df_train['path'].iloc[i]),df_train['target'].iloc[i])
    i=i+1

In [None]:
tf_data_val= tf.data.Dataset.zip((tf_data_val,\
                             tf.data.Dataset.from_tensor_slices(df_val['target']))) 
for data, target in tf_data_val.take(3):
    print(data.numpy(),target.numpy())

In [None]:
train_data = tf_data_train.batch(32).prefetch(buffer_size=64)
train_data

In [None]:
val_data = tf_data_val.batch(32).prefetch(buffer_size=64)
val_data

In [None]:
 
model = Sequential()
model.add(Conv1D(64, input_shape=(3, 4096,), kernel_size=3, activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
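Keras `Conv1D` reads its input as (steps, channels), so with `input_shape=(3, 4096)` the layer above slides over the 3 sites and treats the 4096 samples as channels. The sketch below works out how many output steps that leaves with the default `'valid'` padding. It also shows, as one possible alternative that this notebook does not use, the shape obtained by transposing each batch to (4096, 3). The helper name is hypothetical.

```python
import numpy as np

def conv1d_valid_output_steps(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (steps - kernel_size) // stride + 1

# Stand-in batch shaped like the pipeline output: (batch, sites, samples)
batch = np.zeros((32, 3, 4096))

# As written: 3 steps, kernel_size=3 -> a single output step
steps_as_written = conv1d_valid_output_steps(batch.shape[1], kernel_size=3)

# Alternative layout: time along the steps axis, sites as channels
channels_last = np.transpose(batch, (0, 2, 1))
steps_transposed = conv1d_valid_output_steps(channels_last.shape[1], kernel_size=3)

print(steps_as_written, steps_transposed)
```

Whether the transposed layout trains better is not something this notebook tests.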

In [None]:
model.compile(optimizer = Adam(learning_rate=2e-4),loss='binary_crossentropy',metrics=['AUC'])

In [None]:
model.summary()

In [None]:
history = model.fit(train_data, validation_data=val_data, epochs = 2)

In [None]:
# path of the files
test_files = glob("../input/g2net-gravitational-wave-detection/test/*/*/*/*")
#paths_files

In [None]:
# Use each file name without its .npy extension as the id
ids = [os.path.splitext(os.path.basename(p))[0] for p in test_files]

# data frame containing paths and ids of .npy files 
test_df = pd.DataFrame({"id":ids,"path":test_files})
test_df.head()

In [None]:
tf_data_test=tf.data.FixedLengthRecordDataset( test_df['path'], 3*4096*tf.float64.size,\
                                         header_bytes=header_size, num_parallel_reads=4)

In [None]:
tf_data_test = tf_data_test.map(lambda s: tf.reshape(\
                                                       tf.io.decode_raw(s, tf.float64),\
                                                       (3,4096)))

In [None]:
test_data = tf_data_test.batch(32).prefetch(buffer_size=64)
y_pred=model.predict(test_data)

In [None]:
y_pred.flatten()

In [None]:
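# The training labels use a lowercase 'id' column; the submission is assumed to follow the same convention
output = pd.DataFrame({'id': test_df.id, 'target': y_pred.flatten()})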
output.head()

In [None]:
output.to_csv('./testing_submission.csv', index=False)
print("Your submission was successfully saved!")