# Welcome to (one more) Exploratory Data Analysis of the G2Net Dataset. 
<img src="https://i.ytimg.com/vi/TWqhUANNFXw/maxresdefault.jpg" alt="title">
The image shows a chirp pattern of gravitational waves detected by LIGO on September 14, 2015.
Credit: LIGO (http://www.ligo.org)



Undoubtedly, one of the brightest breakthroughs in science in the past decade was the first detection of gravitational waves in 2015. These waves are tiny ripples in the fabric of space-time produced by collisions of extremely massive objects, such as black holes or neutron stars, and were predicted by Einstein 100 years earlier. The waves can travel billions of light years before reaching ultra-sensitive instruments called interferometers. Building such a device took immense work by many scientists, engineers, and information science experts from all over the world. 

The first discovery of gravitational waves was widely publicized in the scientific community as well as in the general media. Since that first event, more than 50 similar mergers or candidates have been reported. As more and more data is collected, GW signals need to be detected more accurately, which in turn helps build a more complete picture of our universe.

In the competition, you have a training set containing simulated time series data coming from three different locations (LIGO Hanford and LIGO Livingston, both in the US, and Virgo, in Italy). Each data file (in .npy format) represents either instrument noise or noise with a simulated gravitational wave signal, and is labeled 0 or 1, respectively. These labels are stored in a separate .csv file. The task is to build a system capable of identifying time series instances in which a GW signal is present. So, this is a pure **binary classification problem**, well known in the ML community. 

## Importing libraries

First we need to import some libraries required to load and process the data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option("display.max_colwidth", None) # setting the maximum width in characters when displaying pandas column. "None" value means unlimited.

import matplotlib.pyplot as plt  # plotting
from glob import glob     # pathname management

import random    # generating (pseudo)-random numbers

import matplotlib.mlab as mlab  # numerical helpers with MATLAB-like names (psd is used below)
from scipy.interpolate import interp1d  # interpolating a 1-D function

# Importing data. General analysis

In [None]:
training_labels_path = '../input/g2net-gravitational-wave-detection/training_labels.csv'
training_labels = pd.read_csv(training_labels_path)

In [None]:
training_labels.head(3)

In [None]:
training_labels['target'].value_counts()

We can conclude that the training dataset is approximately balanced between the two classes.
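
The balance check can also be expressed as class fractions rather than raw counts. The sketch below is only an illustration on a made-up table: `toy_labels` and its ids are hypothetical stand-ins for `training_labels`, not real competition data.

```python
import pandas as pd

# Toy stand-in for training_labels (hypothetical ids and targets, not real data)
toy_labels = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "target": [0, 1, 1, 0, 1, 0],
})

# Fraction of rows per class; values near 0.5 indicate a balanced binary problem
class_share = toy_labels["target"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same call on `training_labels['target']` would give the share of each class directly.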

In [None]:
training_paths = glob("../input/g2net-gravitational-wave-detection/train/*/*/*/*")
print("The total number of files in the training set:", len(training_paths))

It is useful to merge the labels and file paths on their ids, so that each row holds the id, the target, and the path of one file.

In [None]:
ids = [path.split("/")[-1].split(".")[0] for path in training_paths]
paths_df = pd.DataFrame({"path":training_paths, "id": ids})
train_data = pd.merge(left=training_labels, right=paths_df, on="id")

In [None]:
train_data.head(3)
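
A merge like the one above can silently drop or duplicate rows if ids do not line up. The following is a minimal sketch of one way to guard against that, assuming each id should appear exactly once on both sides. The paths and ids are invented for illustration, and `toy_labels`, `toy_paths_df` and `toy_merged` are hypothetical names.

```python
from pathlib import Path
import pandas as pd

# Hypothetical file paths and labels standing in for the real competition files
toy_paths = ["train/a/a/a/aaaaaaaaaa.npy", "train/b/b/b/bbbbbbbbbb.npy"]
toy_labels = pd.DataFrame({"id": ["aaaaaaaaaa", "bbbbbbbbbb"], "target": [1, 0]})

# Path.stem drops both the directories and the ".npy" suffix
toy_paths_df = pd.DataFrame({"path": toy_paths,
                             "id": [Path(p).stem for p in toy_paths]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side;
# how="left" keeps every label row even when no file matches
toy_merged = pd.merge(toy_labels, toy_paths_df, on="id",
                      how="left", validate="one_to_one")

# Labels without a matching file would show up as missing paths
n_missing = toy_merged["path"].isna().sum()
print(toy_merged, n_missing)
```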

To load a random data sample, we can make a helper function.

In [None]:
def load_random_file(signal = None):
    """Selecting a random file from the training dataset. 
    
    Args:
        signal: bool
            optional flag defining whether to select pure detector 
            noise (False) or detector noise plus simulated signal (True).
            If skipped, the flag is chosen randomly.
    Returns:
        file_id: str
            unique id of the selected file
        target: int
            0 or 1, target value
        data: numpy.ndarray
            numpy array in the shape (3, 4096), where 3 is the number
            of detectors, 4096 is number of data points (each time series
            instance spans over 2 seconds and is sampled at 2048 Hz)
        
    """    
    if signal is None:
        signal = random.choice([True, False])
        
    filtered = train_data["target"]==signal   # filtering dataframe based on the target value
    
    index = random.choice(train_data[filtered].index)   # random index 
    
    file_id = train_data['id'].at[index]
    target = train_data['target'].at[index]
    path = train_data['path'].at[index]
    
    data = np.load(path)
    
    return file_id, target, data

# Plotting the raw data in time domain

In [None]:
file_id, target, data = load_random_file()
ylim = 1.1*np.max(np.abs(data))   # symmetric limit that also covers the most negative sample

plt.style.use('ggplot')

fig, axs = plt.subplots(ncols=1, nrows=3, figsize=(10, 5))

for i in range(3):
    ax = axs[i]
    ax.plot(data[i])
    ax.margins(0)
    ax.set_title(f"Detector {i+1}", loc='center')
    ax.set_ylabel("Amplitude")
    ax.set_ylim([-ylim, ylim])
    
axs[0].xaxis.set_visible(False)
axs[1].xaxis.set_visible(False)

axs[2].set_xlabel("Time stamp")
fig.suptitle(f"Raw data visualization. ID: {file_id}. Target: {target}.")
plt.show()
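
The horizontal axis in the plot above is the sample index. Since the docstring of `load_random_file` states that each series holds 4096 points sampled at 2048 Hz, a time axis in seconds can be built as in this small sketch (the names `n_samples` and `t` are introduced here for illustration only).

```python
import numpy as np

fs = 2048                      # sampling rate in Hz, as stated in the docstring
n_samples = 4096               # samples per detector in one file
t = np.arange(n_samples) / fs  # time of each sample in seconds, starting at 0

print(t[0], t[-1], n_samples / fs)
```

Passing `t` as the first argument to `ax.plot` would then label the axis in seconds instead of sample numbers.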

# Plotting the data in frequency domain

One way to explore the frequency content of the data is to plot the amplitude spectral density (ASD). To read more on this topic, please refer to the following link: https://www.gw-openscience.org/GW150914data/LOSC_Event_tutorial_GW150914.html#Whitening

In [None]:
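fs = 2048      # sampling rate, Hz
NFFT = 4*fs    # FFT segment length in samples (longer than the 4096-sample series, so mlab.psd zero-pads it)
f_min = 20.    # lower frequency limit of the plot, Hz
f_max = fs/2   # the Nyquist frequency, Hz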

In [None]:
_, target, data = load_random_file(True)

strain1, strain2, strain3 = data[0], data[1], data[2]

Pxx_1, freqs = mlab.psd(strain1, Fs = fs, NFFT = NFFT)
Pxx_2, freqs = mlab.psd(strain2, Fs = fs, NFFT = NFFT)
Pxx_3, freqs = mlab.psd(strain3, Fs = fs, NFFT = NFFT)

psd_1 = interp1d(freqs, Pxx_1)
psd_2 = interp1d(freqs, Pxx_2)
psd_3 = interp1d(freqs, Pxx_3)

fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10, 5))
ax.loglog(freqs, np.sqrt(Pxx_1),"g",label="Detector 1")
ax.loglog(freqs, np.sqrt(Pxx_2),"r",label="Detector 2")
ax.loglog(freqs, np.sqrt(Pxx_3),"b",label="Detector 3")

ax.set_xlim([f_min, f_max])
ax.set_ylabel(r"ASD (strain/$\sqrt{Hz}$)")   # raw string, so "\s" is not treated as an escape sequence
ax.set_xlabel("Frequency (Hz)")
ax.legend()

plt.show()
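
For comparison, here is a minimal sketch of an ASD estimate on synthetic data. It swaps in `scipy.signal.welch` in place of `mlab.psd`, and uses segments shorter than the record so that several of them are averaged. The 100 Hz tone, the noise level and the `toy_*` names are assumptions made purely for illustration.

```python
import numpy as np
from scipy.signal import welch

fs = 2048                                   # sampling rate in Hz
rng = np.random.default_rng(0)              # fixed seed so the sketch is repeatable
t = np.arange(2 * fs) / fs                  # 2 s of samples, like one detector series

# Synthetic stand-in for a strain series: a 100 Hz tone plus weak white noise
toy_strain = np.sin(2 * np.pi * 100 * t) + 0.1 * rng.standard_normal(t.size)

# Welch PSD with 0.5 s segments, giving a 2 Hz frequency resolution
toy_freqs, toy_psd = welch(toy_strain, fs=fs, nperseg=fs // 2)
toy_asd = np.sqrt(toy_psd)                  # amplitude spectral density

peak_freq = toy_freqs[np.argmax(toy_asd)]
print(f"ASD peaks at {peak_freq:.1f} Hz")
```

The trade-off is the usual one: shorter segments give a smoother but coarser spectrum, while longer segments resolve narrow lines at the cost of more variance.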

# Constant Q-Transform

Another very common way to visualize a GW signal is to perform a constant Q-transform (CQT). This is a time-frequency representation that is also widely used in processing musical data. To compute the Q-transform quickly, we are going to use the PyCBC library (the docs are available [here](http://pycbc.org/pycbc/latest/html/)).
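
To give a feel for what "constant Q" means, the sketch below builds a logarithmically spaced frequency grid and the bandwidth each frequency would get when the quality factor Q (centre frequency divided by bandwidth) is held fixed. The numbers mirror the `qtransform` arguments used in this notebook, but the code only illustrates the definition and is not a description of PyCBC internals.

```python
import numpy as np

q = 10                     # quality factor, mirroring qrange=(10, 10)
f_low, f_high = 20, 512    # frequency range in Hz, mirroring frange=(20, 512)
n_steps = 100              # number of frequencies, mirroring logfsteps=100

# Logarithmically spaced centre frequencies
centre_freqs = np.logspace(np.log10(f_low), np.log10(f_high), n_steps)

# With constant Q, the bandwidth grows in proportion to the centre frequency
bandwidths = centre_freqs / q

print(bandwidths[0], bandwidths[-1])
```

In other words, low frequencies get fine frequency resolution and high frequencies get fine time resolution, which suits signals whose frequency sweeps quickly.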

In [None]:
!pip -q install pycbc
import pycbc.types   # also binds the top-level pycbc name; TimeSeries lives in pycbc.types

We can prepare some helper functions to generate and visualize Q-transforms. Some useful demos with PyCBC methods can be found [here](https://github.com/gwastro/PyCBC-Tutorials).

In [None]:
def generate_qtransform(data, fs):
    """Function for generating constant Q-transform. 
    
    Args:
        data: numpy.ndarray
            numpy array in the shape (3, 4096), where 3 is the number
            of detectors, 4096 is number of data points (each time series
            instance spans over 2 seconds and is sampled at 2048 Hz)
        fs: int
            sampling frequency
    Returns:
        times: numpy.ndarray
            array of time bins
        freqs: numpy.ndarray
            array of frequency bins
        qplanes: list
            list with 3 elements, one per detector in the raw
            data file. Each element is a 2-D array of the power in each 
            time-frequency bin
    """    
    
    qplanes = []
    for i in range(len(data)):
        
        # converting data into PyCBC Time Series format
        ts = pycbc.types.TimeSeries(data[i, :], epoch=0, delta_t=1.0/fs)   
        
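        # whitening the data: the PSD is estimated from 0.125 s segments and
        # the whitening filter is truncated to 0.125 s
        ts = ts.whiten(0.125, 0.125) 
        
        # calculating CQT values on a 2 ms time grid, with 100 log-spaced
        # frequencies between 20 and 512 Hz and Q fixed at 10
        times, freqs, qplane = ts.qtransform(.002, logfsteps=100, qrange=(10, 10), frange=(20, 512))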

        qplanes.append(qplane)
        
    return times, freqs, qplanes 
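
The whitening step in the function above flattens the noise spectrum so that no frequency band dominates. As a rough sketch of the idea (not PyCBC's implementation), one can divide the Fourier transform of a series by an ASD estimated from the same series. The helpers `whiten_sketch` and `low_to_high_power_ratio` and the AR(1) toy noise are all assumptions introduced for this illustration.

```python
import numpy as np
from scipy.signal import lfilter, welch

def whiten_sketch(x, fs, nperseg=256):
    """Divide the spectrum of x by an ASD estimated from x itself (hypothetical helper)."""
    # detrend=False keeps the DC bin of the PSD from being pushed close to zero
    psd_freqs, psd = welch(x, fs=fs, nperseg=nperseg, detrend=False)
    spectrum = np.fft.rfft(x)
    fft_freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    asd = np.sqrt(np.interp(fft_freqs, psd_freqs, psd))  # ASD on the FFT grid
    return np.fft.irfft(spectrum / asd, n=len(x))

def low_to_high_power_ratio(x, fs):
    """Mean PSD in the lowest quarter of the band over mean PSD in the highest quarter."""
    _, psd = welch(x, fs=fs, nperseg=256, detrend=False)
    quarter = len(psd) // 4
    return psd[:quarter].mean() / psd[-quarter:].mean()

fs = 2048
rng = np.random.default_rng(0)
# Coloured toy noise: an AR(1) process with much more power at low frequencies
coloured = lfilter([1.0], [1.0, -0.9], rng.standard_normal(2 * fs))
whitened = whiten_sketch(coloured, fs)

ratio_before = low_to_high_power_ratio(coloured, fs)
ratio_after = low_to_high_power_ratio(whitened, fs)
print(f"low/high power ratio: {ratio_before:.1f} before, {ratio_after:.2f} after")
```

A ratio near 1 after whitening means the low and high ends of the band carry comparable power, which is what makes a weak transient easier to see in a time-frequency map.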

In [None]:
def plot_qtransform(file_id, target, data):
    """Plotting constant Q-transform data.
    
    Args:
        file_id: str
            unique id of the selected file
        target: int
            0 or 1, target value
        data: numpy.ndarray
            numpy array in the shape (3, 4096), where 3 is the number
            of detectors, 4096 is number of data points (each time series
            instance spans over 2 seconds and is sampled at 2048 Hz)
    """
    
    times, freqs, qplanes = generate_qtransform(data, fs=fs)
    
    fig, axs = plt.subplots(ncols=1, nrows=3, figsize=(12, 8))

    for i in range(3):

        axs[i].pcolormesh(times, freqs, qplanes[i], shading = 'auto')
        axs[i].set_yscale('log')
        axs[i].set_ylabel('Frequency (Hz)')
        axs[i].set_xlabel('Time (s)')
        axs[i].set_title(f"Detector {i+1}", loc='left')
        axs[i].grid(False)

    axs[0].xaxis.set_visible(False)
    axs[1].xaxis.set_visible(False)

    fig.suptitle(f"Q transform visualization. ID: {file_id}. Target: {target}.", fontsize=16)
    plt.show()

Now we can select a random data file, perform CQT and plot the results.

In [None]:
file_id, target, data = load_random_file()
plot_qtransform(file_id, target, data)

Here we have a sample with a strong GW signal, characterized by a frequency chirp on the CQT spectrogram:

In [None]:
file_id = '7945e449f3'
target = 1
data  = np.load(train_data[train_data['id']==file_id]['path'].values[0])

plot_qtransform(file_id, target, data)
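
For intuition about what a chirp looks like in a time-frequency map, the sketch below generates a synthetic sweep and tracks its strongest frequency over time. It uses `scipy.signal.chirp` and `scipy.signal.spectrogram` with arbitrary illustrative values (30 Hz to 300 Hz over 2 s). It is not a physical gravitational-wave waveform, and `toy_chirp` and `peak_track` are hypothetical names.

```python
import numpy as np
from scipy.signal import chirp, spectrogram

fs = 2048
t = np.arange(2 * fs) / fs

# Toy chirp sweeping from 30 Hz to 300 Hz over 2 s (arbitrary illustrative values)
toy_chirp = chirp(t, f0=30, t1=2.0, f1=300, method="quadratic")

# Short-time spectrum: rows are frequencies, columns are time bins
freqs, times, power = spectrogram(toy_chirp, fs=fs, nperseg=256)

# Frequency of the strongest bin in each time slice
peak_track = freqs[np.argmax(power, axis=0)]
print(peak_track[0], peak_track[-1])
```

The rising `peak_track` is the same kind of upward-sweeping trace that appears in the Q-transform of a sample with a strong signal.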