In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from IPython.display import HTML

# Signal, where are you?

Hey Kaggler! This is going to be a difficult competition. Why? You can't just create a spectrogram and feed it to a machine learning model for classification. You need good preprocessing, and that will be difficult. We are looking for a needle in a haystack: our data contains not only the signal plus some noise but also many signals from other sources, such as the instruments used during the experiments. Even though the gravitational wave signals are simulated, we can expect them to be hidden!


In [None]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/FlDtXIBrAYE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Have fun!

# Sources

In this notebook I collect information to get myself started with this competition, and I hope you find it useful too. My work builds upon the work of fantastic people, and here you can find the sources:

* https://www.youtube.com/channel/UC4oFlSYpDywInX0lxpiBPwA
* https://arxiv.org/pdf/1908.11170.pdf
* https://www.gw-openscience.org/LVT151012data/LOSC_Event_tutorial_LVT151012.html

# Table of contents

To find novel approaches to extract the signal of gravitational waves we need to understand the data itself and what kind of methods are usually used. Consequently my notebook is mainly about preprocessing and data understanding. ;-) 

* [Preparation](#preparation)
* [Understanding the data](#understanding)
    * What does the data look like?
    * What kind of signals can be found?
* [What's our goal for data preprocessing?](#preprocessing_goals)
* [What kind of preprocessing methods are usually used?](#preprocessing_steps)
* [Why are we using transformations like Fourier or Q-transform?](#transformations)
    * [Waves](#waves)
    * [Interference of waves](#inference)
    * [Fourier series](#fourier_series)
* [What happens during each preprocessing step?](#whathappens)
    * [Apply a window function](#windowing)
    * [Whitening](#whitening)
    * [Bandpass filtering](#bandpass_filtering)
* [Searching for a signal](#searching_signal)
* [Understanding the sources of gravitational waves](#gw_sources)

Let's go!

# Preparation <a class="anchor" id="preparation"></a>

Ok, we need to load the packages...

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
plt.rcParams["axes.grid"] = False

import matplotlib.mlab as mlab
from scipy import signal
from scipy.interpolate import interp1d
from scipy.signal import butter, filtfilt, iirdesign, zpk2tf, freqz

from IPython.display import HTML

... and I would like to select an example to try out the preprocessing methods that are usually used. Let's load the data together with the path of the npy files:

In [None]:
train_labels = pd.read_csv("../input/g2net-trainlabelpaths/training_labels_with_paths.csv")
train_labels.target.value_counts()

As I'm trying to find a signal, let's use a hot example:

In [None]:
example = 1

example_strain = np.load(train_labels[train_labels.target==1].iloc[example].filepath)

# Understanding the data <a class="anchor" id="understanding"></a>

## What does the data look like?

In [None]:
plt.figure(figsize=(20,5))

plt.plot(example_strain[0,:], c="firebrick", label="detector 1")
plt.plot(example_strain[1,:], c="mediumseagreen", label="detector 2")
plt.plot(example_strain[2,:], c="slateblue", label="detector 3")
plt.title("Example");
plt.legend();

### Insights

Ok, the three signals originating from different detectors all look a bit different. How was this data generated? We are given...

* detector noise from three real detectors (LIGO Hanford, LIGO Livingston, and Virgo). As far as I understand this noise is not simulated. 
* A simulated gravitational wave signal hidden in this noisy data in the case of hot targets. You can't see it with your eyes! 

## What kind of signals can be found?

We can see a lot of waves in the data above. What's their origin? The data we observe was collected using a large-scale Michelson interferometer. Watch this great video to see how stretching of the interferometer arms causes the waves we found:

https://www.ligo.caltech.edu/video/IFO-response 

The interferometer is sensitive to gravitational waves but unfortunately also to terrestrial forces and displacements, including vibrations of the instruments themselves. These forces stretch the interferometer arms, which changes how the two laser beams interfere, and this produces the waves we can see above.

This video is also great for understanding how stretching of the arms changes the interference:

In [None]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tQ_teIUb3tE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

You can see that the waves can cancel each other out or add up their amplitudes. As there are always some forces acting on the arms, we end up with data showing the waves above. But be careful, they do not obviously show our signal!
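
To put numbers on this cancelling and adding up, here is a minimal sketch of an idealized Michelson interferometer. It assumes two equally bright beams and ignores everything that makes a real detector complicated. The function name `output_intensity` is made up for this illustration; the 1064 nm value is the laser wavelength LIGO uses.

```python
import numpy as np

LASER_WAVELENGTH = 1064e-9  # metres; wavelength of the LIGO laser


def output_intensity(arm_length_difference, wavelength=LASER_WAVELENGTH):
    """Relative output intensity (0..1) of an idealized Michelson interferometer.

    The light travels each arm twice (there and back), so a length
    difference dL between the arms gives a path difference of 2 * dL.
    """
    phase = 2 * np.pi * (2 * arm_length_difference) / wavelength
    # two equal beams with phase difference `phase`: I = (1 + cos(phase)) / 2
    return 0.5 * (1 + np.cos(phase))


for fraction in (0.0, 0.125, 0.25, 0.5):
    d_l = fraction * LASER_WAVELENGTH
    print(f"dL = {fraction:5.3f} wavelengths -> intensity {output_intensity(d_l):.3f}")
```

Equal arms give full constructive interference, while a difference of a quarter wavelength already cancels the light completely. That is why tiny length changes are visible in the output at all.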

# What's our goal for data preprocessing? <a class="anchor" id="preprocessing_goals"></a>

We are looking for something that is hidden in noise and may be difficult to extract. Take a look at, and listen to, some well-extracted real (not simulated) signals:

In [None]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/QyDcTbR-kEA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

* The data shown in the video was measured by the LIGO Hanford and LIGO Livingston detectors, and both show the same event starting at roughly 0.9 s.
* You can see the signal as a bright stripe in the spectrogram (time-frequency domain) and you can hear it as a chirp.
* The amplitudes of these waves do not look like our data above because they were already preprocessed. ;-)

Even though this event can be seen nicely in the spectrogram, we can't expect to always see our simulated waves this clearly.
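
To get a feeling for what such a bright stripe is, here is a small sketch with a toy chirp. Everything in it is an assumption for illustration: the sampling rate, the frequency range from 30 Hz to 300 Hz and the noise-free signal are invented and have nothing to do with a real waveform. It uses `scipy.signal.chirp` and `scipy.signal.spectrogram` and follows the loudest frequency in every time slice.

```python
import numpy as np
from scipy.signal import chirp, spectrogram

fs = 2048                      # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs     # two seconds of time stamps

# toy signal whose frequency rises from 30 Hz to 300 Hz
toy_chirp = chirp(t, f0=30, t1=t[-1], f1=300, method="quadratic")

# time-frequency representation: power[i, j] belongs to freqs[i] and times[j]
freqs, times, power = spectrogram(toy_chirp, fs=fs, nperseg=256, noverlap=192)

# the "bright stripe": the frequency with the most power in each time slice
ridge = freqs[np.argmax(power, axis=0)]
print("ridge at the start:", ridge[0], "Hz")
print("ridge at the end:  ", ridge[-1], "Hz")
```

With real data the stripe is buried in noise, which is exactly why the preprocessing steps in this notebook are needed before a spectrogram becomes useful.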

# What kind of preprocessing methods are usually used? <a class="anchor" id="preprocessing_steps"></a>

Take a look at these two notebooks:

* https://colab.research.google.com/github/losc-tutorial/quickview/blob/master/index.ipynb (with a clearly visible signal in the spectrogram)
* https://www.gw-openscience.org/LVT151012data/LOSC_Event_tutorial_LVT151012.html (with a signal that is present but not clearly visible in the spectrogram!)

Basically we can see that the following preprocessing steps are applied:

* **Whitening using the Amplitude Spectral Density (ASD)** to down-weight the noise that dominates the ASD at low and high frequencies as well as close to spectral lines. I hope that we (me included) will understand this better in the following chapters.
* Using a **bandpass filter** to suppress the remaining signals outside the frequency band of interest, especially at low frequencies.

Before we try to understand what these steps are doing, I need to refresh my knowledge! ;-)

# Why are we using transformations like Fourier or Q-transform? <a class="anchor" id="transformations"></a>

## Waves <a class="anchor" id="waves"></a>

What is meant by low-frequency noise? :-) In case you haven't come across this yet, let's talk a bit about oscillations, for example by looking at a sine wave:

In [None]:
x = np.arange(0,2*np.pi, step=0.1)
y = np.sin(x)

In [None]:
plt.figure(figsize=(20,5))
plt.plot(x, y, '-o', c="tomato")  # plot against x so the axis really shows the time
plt.ylabel("Elongation y")
plt.xlabel("Time t")
plt.title("A sine wave");

* In the range from 0 to $2\pi$ the sine wave shows us one oscillation, like a pendulum going from the middle to one side, back through the middle to the other side, and back to the middle.
* The time that passes during this oscillation is called the period $T$.
* The inverse of the period is called the frequency of the wave: $f=\frac{1}{T}$
* Further, we can define the length of the wave by $\lambda = c \cdot T = \frac{c}{f}$. This is called the wavelength, and $c$ is the speed at which the wave travels (the speed of light for gravitational waves).
* There is one more special feature of waves: the elongation $y$, which shows us how far the pendulum (for example) is away from the middle. Furthermore, we can define the amplitude $\hat{y}$ as the maximum of the elongation.

**What is meant now by a low frequency wave?**

A low frequency means a long wavelength. Consequently, when we say that we want to eliminate the low-frequency parts of our data, we mean that we want to get rid of the big, slow oscillations with long wavelengths that we can see very clearly in our data! ;-)
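
To make the relations between period, frequency and wavelength concrete, here is a tiny sketch. The three frequencies are just illustrative values, the helper names `period` and `wavelength` are made up, and the wave is assumed to travel at the speed of light.

```python
C = 299_792_458.0  # speed of light in m/s


def period(frequency_hz):
    """Period T = 1 / f in seconds."""
    return 1.0 / frequency_hz


def wavelength(frequency_hz, speed=C):
    """Wavelength lambda = c / f in metres."""
    return speed / frequency_hz


for f in (10.0, 100.0, 1000.0):
    print(f"f = {f:6.0f} Hz   T = {period(f):.4f} s   "
          f"wavelength = {wavelength(f) / 1000:9.1f} km")
```

The lower the frequency, the longer the period and the wavelength. A 10 Hz wave is a hundred times longer than a 1000 Hz wave.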

## Interference of waves <a class="anchor" id="inference"></a>

We can add waves like the sine wave above to create new waves that are superpositions of the original ones. This is called interference (not to be confused with the inference you do with an ML model). Let's try out a few examples!

We need a "time":

In [None]:
t = np.linspace(0, 10, 5000)

And frequencies:

In [None]:
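f1 = 30  # Hz, the fast wave
f2 = 2   # Hz, the slow wave (t is sampled at roughly 500 Hz, so frequencies must stay below ~250 Hz to avoid aliasing)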

And the sine to compute the elongations y for our waves:

In [None]:
y1 = 0.1 * np.sin(2*np.pi*f1*t)
y2 = np.sin(2*np.pi*f2*t)

Don't be surprised, the expression inside the sine is a bit more complicated than what we used above. Instead of $y=\sin(t)$ we have $y=r \cdot \sin(\omega t)$ with $\omega=2 \pi f$ being [the angular frequency](https://en.wikipedia.org/wiki/Angular_frequency) and $r$ being the radius that we can set to adjust the amplitude $\hat{y}$ (max elongation). You can see that $\omega$ is just a constant times the frequency, so there is nothing new to learn. A higher frequency means more ups and downs of our wave within a fixed time span compared to a low frequency.
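
One caveat when picking frequencies: a sampled wave only shows its true frequency if that frequency stays below half the sampling rate (the Nyquist frequency). The sketch below uses the same kind of time grid as above (5000 samples over 10 s) and an illustrative frequency of 10000 Hz, which is far too fast for this grid. On these sample points it is indistinguishable from a slow 2 Hz wave.

```python
import numpy as np

t = np.linspace(0, 10, 5000)

# sampling rate of this grid and the highest frequency it can represent
sampling_rate = (len(t) - 1) / (t[-1] - t[0])
nyquist = sampling_rate / 2

too_fast = np.sin(2 * np.pi * 10000 * t)   # far above the Nyquist frequency
slow = np.sin(2 * np.pi * 2 * t)           # a 2 Hz wave

max_difference = np.max(np.abs(too_fast - slow))
print(f"sampling rate: {sampling_rate:.1f} Hz, Nyquist frequency: {nyquist:.1f} Hz")
print(f"largest difference between the two sampled waves: {max_difference:.2e}")
```

This effect is called aliasing. It is worth keeping in mind whenever a plotted wave looks much slower than the frequency that was put in.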

In [None]:
fig, ax = plt.subplots(2,1,figsize=(20,8))
ax[0].plot(t, y1, label="high frequency")
ax[0].plot(t, y2, label="low frequency")
ax[1].plot(t, y1+y2)
ax[0].legend();

for n in range(2):
    ax[n].set_xlabel("time t")
    ax[n].set_ylabel("elongation y")

Ok, with these values we obtain one wave with high frequency and small radius and another wave with low frequency and larger radius. Adding both up, we see the same kind of "problem" as above: a "big wave" with a small one "hidden" in it.
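
This is where the Fourier transform becomes handy: it can pull such a superposition apart again. The following sketch builds a "big slow wave plus small fast wave" signal and reads the two frequencies and amplitudes off the spectrum. The sampling rate of 500 Hz and the frequencies of 2 Hz and 30 Hz are illustrative assumptions.

```python
import numpy as np

fs = 500                        # assumed sampling rate in Hz
n = 5000                        # 10 seconds of samples
t = np.arange(n) / fs

small_fast = 0.1 * np.sin(2 * np.pi * 30 * t)
big_slow = 1.0 * np.sin(2 * np.pi * 2 * t)
y = small_fast + big_slow

# real FFT, rescaled so that a sine of amplitude A shows up with height A
amplitude = 2 * np.abs(np.fft.rfft(y)) / n
freqs = np.fft.rfftfreq(n, d=1 / fs)

# the two strongest frequency bins
top = np.sort(np.argsort(amplitude)[-2:])
for idx in top:
    print(f"{freqs[idx]:5.1f} Hz with amplitude {amplitude[idx]:.2f}")
```

In the time domain the small wave is hard to see, but in the frequency domain both waves are cleanly separated peaks.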

------------ Work in progress ---------------

## Fourier series <a class="anchor" id="fourier_series"></a>

To understand what's going on during preprocessing, I like to go back to Fourier series. With a Fourier series we can approximate a periodic function $f(x)$ with period $2l$ using sine and cosine waves. This function can be our data, composed of the signals caused by terrestrial forces and perhaps also the signal of a simulated gravitational wave. But it can also be the function we have built above using two simple sine waves:

$$ f(x) = \frac{a_{0}}{2} + \sum_{k=1}^{n} \left[ a_{k} \cos \left(\frac{k \pi}{l} x\right) + b_{k} \sin \left(\frac{k \pi}{l} x\right) \right] + R_{n}(x)$$

The part without the residual $R_{n}(x)$ is the $n$-th partial sum of the Fourier series, and letting $n \to \infty$ gives the full series. We have already seen that we can build a similar kind of "hidden signal travelling on a big wave" by adding up only two sine waves. So I think it's a nice idea that we can build unknown periodic functions from an infinite sum of sine and cosine waves.
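
To see the residual shrink, here is a small sketch that approximates a square wave with more and more sine and cosine terms. The square wave, the half period `l = 1` and the helper names are assumptions chosen for illustration. The coefficients are computed numerically from $a_k = \frac{1}{l}\int_{-l}^{l} f(x) \cos\left(\frac{k\pi}{l}x\right) dx$ (and likewise $b_k$ with the sine), using a simple average over one period.

```python
import numpy as np

l = 1.0                                             # half period, period is 2*l
x = np.linspace(-l, l, 20000, endpoint=False)       # one full period
f = np.sign(np.sin(np.pi * x / l))                  # square wave


def coefficients(k):
    """Numerical Fourier coefficients a_k and b_k of f."""
    # (1/l) * integral over a period of length 2*l equals 2 * mean
    a_k = 2 * np.mean(f * np.cos(k * np.pi * x / l))
    b_k = 2 * np.mean(f * np.sin(k * np.pi * x / l))
    return a_k, b_k


def partial_sum(n):
    """Fourier series of f truncated after n terms."""
    total = np.full_like(x, np.mean(f))             # the a_0 / 2 term
    for k in range(1, n + 1):
        a_k, b_k = coefficients(k)
        total += a_k * np.cos(k * np.pi * x / l) + b_k * np.sin(k * np.pi * x / l)
    return total


def rms_residual(n):
    """Root mean square of the residual R_n(x)."""
    return np.sqrt(np.mean((f - partial_sum(n)) ** 2))


for n in (1, 3, 9, 49):
    print(f"n = {n:2d}  rms residual = {rms_residual(n):.3f}")
```

The residual gets smaller with every added term, although the sharp jumps of the square wave need many terms to be captured well.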

# Apply a window function <a class="anchor" id="windowing"></a>

* Suppress spectral leakage and spurious correlations in the phases between bins
* https://en.wikipedia.org/wiki/Window_function
* https://en.wikipedia.org/wiki/Welch%27s_method
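
What spectral leakage means can be shown with a single sine wave. In this sketch the tone at 100.5 Hz does not fit an integer number of times into the analysed segment, so its power smears over the whole spectrum. A Tukey window reduces that smearing a lot. The segment length, the sampling rate and the tone frequency are assumptions made up for the illustration.

```python
import numpy as np
from scipy.signal import windows

fs = 4096                         # assumed sampling rate in Hz
n = 4096                          # one second of samples
t = np.arange(n) / fs

# 100.5 cycles fit into the segment, so start and end do not match up
tone = np.sin(2 * np.pi * 100.5 * t)
window = windows.tukey(n, alpha=0.125)


def leakage(x):
    """Mean spectral magnitude far away from the tone, relative to the peak."""
    magnitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    return magnitude[freqs > 500].mean() / magnitude.max()


leak_plain = leakage(tone)
leak_windowed = leakage(tone * window)
print(f"leakage without window:    {leak_plain:.2e}")
print(f"leakage with Tukey window: {leak_windowed:.2e}")
```

The window smoothly fades the data to zero at both ends, which removes the artificial jump that the Fourier transform would otherwise see between the end and the start of the segment.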

In [None]:
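dt = 0.000244140625  # sample spacing in seconds: 1/4096 s, i.e. a sampling rate of 4096 Hz is assumed
# NOTE: the competition's data description states 2048 Hz (2 s per sample); in that case dt would be 1/2048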
strain_len = example_strain.shape[1]

In [None]:
hp_window = 1
hp_tukey_alpha = 0.125
#NFFT = 1*strain_len # why 16?
NFFT = 2 * strain_len           # FFT length for the PSD estimate; longer than the strain, so mlab.psd zero-pads the data
NOVL = NFFT // 2                # number of points of overlap between segments used in Welch averaging (must be an integer)
fband = [35.0, 150.0]

In [None]:
channel = 0
strain = example_strain[channel,:]

In [None]:
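# window functions live in scipy.signal.windows (the aliases in scipy.signal were removed in newer SciPy versions)
blackman_window = signal.windows.blackman(int(strain_len*hp_window)) #signal.tukey(strain, alpha=1./8)
tukey_window = signal.windows.tukey(int(strain_len*hp_window), hp_tukey_alpha)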

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))

ax[0].plot(blackman_window, c="black")
ax[0].set_title("Blackman window")

ax[1].plot(tukey_window, c="purple")
ax[1].set_title("Tukey window")

for n in range(2):
    ax[n].set_ylabel("Amplitude")
    ax[n].set_xlabel("Sample")

Let's take a look at how the window functions change our data:

In [None]:
fig, ax = plt.subplots(1,3,figsize=(20,5))

ax[0].plot(strain)
ax[0].set_title("Original data")

ax[1].plot(strain*blackman_window)
ax[1].set_title("With blackman window applied")

ax[2].plot(strain*tukey_window)
ax[2].set_title("With tukey window applied");

Hmm... what if our signal can be found at the beginning or end of the data? Then windowing would be very bad... wouldn't it?!

In [None]:
windowed_strain = strain*tukey_window

# Whitening <a class="anchor" id="whitening"></a>

* The data is dominated by low-frequency noise (the large oscillations we can see! ;-))
* Divide the Fourier coefficients by an estimate of the amplitude spectral density of the noise
* This way we down-weight the frequencies where the noise is loud
* To return to the time domain, apply the inverse Fourier transform
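
The steps above can be tried out on synthetic data where we know what should happen. This is only a sketch under several assumptions: the "detector noise" is white noise that I colored with a low-pass filter, the sampling rate is invented, the PSD is estimated with `scipy.signal.welch` instead of `mlab.psd`, and the helper names `whiten_sketch` and `band_ratio` are made up.

```python
import numpy as np
from scipy.signal import butter, lfilter, welch

rng = np.random.default_rng(0)
fs = 2048                                   # assumed sampling rate in Hz
n = 16 * fs                                 # 16 seconds of synthetic noise

# colored noise: loud at low frequencies, quiet at high frequencies
white = rng.standard_normal(n)
b, a = butter(2, 50, btype="low", fs=fs)
colored = lfilter(b, a, white) + 0.05 * white


def whiten_sketch(x, fs):
    """Divide the Fourier coefficients by an estimate of the ASD."""
    freqs_psd, psd = welch(x, fs=fs, nperseg=fs)        # one-sided PSD estimate
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    psd_at_bins = np.interp(freqs, freqs_psd, psd)      # PSD at every FFT bin
    x_f = np.fft.rfft(x)
    # the factor fs / 2 makes white noise come out with unit variance
    white_f = x_f / np.sqrt(psd_at_bins * fs / 2)
    white_f[0] = 0                                      # drop the constant offset
    return np.fft.irfft(white_f, n=len(x))              # back to the time domain


def band_ratio(x, fs):
    """Average power at 5-40 Hz divided by average power at 500-1000 Hz."""
    f, p = welch(x, fs=fs, nperseg=fs)
    low = p[(f >= 5) & (f <= 40)].mean()
    high = p[(f >= 500) & (f <= 1000)].mean()
    return low / high


whitened = whiten_sketch(colored, fs)
print(f"low/high power before whitening: {band_ratio(colored, fs):8.1f}")
print(f"low/high power after whitening:  {band_ratio(whitened, fs):8.2f}")
print(f"standard deviation after whitening: {whitened.std():.2f}")
```

Before whitening the low frequencies are hundreds of times louder than the high ones. Afterwards all frequencies contribute roughly equally, so a weak signal no longer has to compete with the loud low-frequency noise.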

The amplitude spectral density is the square root of the power spectral density. Consequently we can use mlab.psd to continue:

In [None]:
psd_window = signal.windows.tukey(NFFT, alpha=1./4)
Pxx_strain, freqs = mlab.psd(windowed_strain, Fs = strain_len, NFFT = NFFT,
                               window=psd_window,
                               noverlap=int(NOVL))

In [None]:
#Pxx_strain, freqs = mlab.psd(windowed_strain, Fs = strain_len, NFFT = NFFT)
PSD = interp1d(freqs, Pxx_strain)

In [None]:
def whiten(strain, interp_psd, dt, phase_shift=0, time_shift=0):
    """Whitens strain data given the psd and sample rate, also applying a phase
    shift and time shift.
    Args:
        strain (ndarray): strain data
        interp_psd (interpolating function): function to take in freqs and output 
            the average power at that freq 
        dt (float): sample time interval of data
        phase_shift (float, optional): phase shift to apply to whitened data
        time_shift (float, optional): time shift to apply to whitened data (s)
    
    Returns:
        ndarray: array of whitened strain data
    """
    Nt = len(strain)
    # frequencies that belong to the bins of the real FFT
    freqs = np.fft.rfftfreq(Nt, dt)

    # whitening: transform to freq domain, divide by square root of psd, then
    # transform back, taking care to get normalization right.
    hf = np.fft.rfft(strain)
    
    # apply time and phase shift
    hf = hf * np.exp(-1.j * 2 * np.pi * time_shift * freqs - 1.j * phase_shift)
    norm = 1./np.sqrt(1./(dt*2))
    white_hf = hf / np.sqrt(interp_psd(freqs)) * norm
    white_ht = np.fft.irfft(white_hf, n=Nt)
    return white_ht

In [None]:
strain_whitened = whiten(windowed_strain, 
                         PSD, dt)

# Apply a bandpass filter <a class="anchor" id="bandpass_filtering"></a>

* https://en.wikipedia.org/wiki/Butterworth_filter
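
Before applying the filter to the strain, here is a sketch of what a Butterworth bandpass does to a signal we fully understand. The two tones (a big 10 Hz wave and a small 100 Hz wave) and the sampling rate are assumptions for illustration. I use the second-order-sections form with `scipy.signal.sosfiltfilt`, which is a numerically more robust variant of the `butter` plus `filtfilt` combination.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 2048                                   # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                  # two seconds of time stamps

slow = 5.0 * np.sin(2 * np.pi * 10 * t)     # big wave below the pass band
fast = 1.0 * np.sin(2 * np.pi * 100 * t)    # small wave inside the pass band
mixed = slow + fast

# 4th order Butterworth bandpass that keeps 35 Hz to 150 Hz
sos = butter(4, [35.0, 150.0], btype="band", fs=fs, output="sos")
filtered = sosfiltfilt(sos, mixed)          # forward and backward, so no phase shift

# compare away from the edges, where filters always misbehave a little
inner = slice(fs // 4, -fs // 4)
residual = np.std(filtered[inner] - fast[inner])
print(f"std of the mixed signal:      {mixed[inner].std():.3f}")
print(f"std of the filtered signal:   {filtered[inner].std():.3f}")
print(f"difference to the 100 Hz wave: {residual:.4f}")
```

The big slow wave is gone and what remains is almost exactly the small 100 Hz wave.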

In [None]:
def bandpass(strain, fband, fs):
    """Bandpasses strain data using a butterworth filter.
    
    Args:
        strain (ndarray): strain data to bandpass
        fband (ndarray): low and high-pass filter values to use
        fs (float): sample rate of data
    
    Returns:
        ndarray: array of bandpassed strain data
    """
    bb, ab = butter(4, [fband[0]*2./fs, fband[1]*2./fs], btype='band')
    normalization = np.sqrt((fband[1]-fband[0])/(fs/2))
    strain_bp = filtfilt(bb, ab, strain) / normalization
    return strain_bp

In [None]:
bandpassed_strain = bandpass(strain_whitened, fband, strain_len)

In [None]:
fig, ax = plt.subplots(4,1,figsize=(20,15))
ax[0].plot(strain)
ax[0].set_title("Original data")
ax[1].plot(windowed_strain)
ax[1].set_title("Windowed data")
ax[2].plot(strain_whitened)
ax[2].set_title("Whitened data")
ax[3].plot(bandpassed_strain)
ax[3].set_title("Whitened and bandpassed data");

# Searching for the signal <a class="anchor" id="searching_signal"></a>

In [None]:
hp_window = 1
hp_tukey_alpha = 0.125
#NFFT = 1*strain_len # why 16?
NFFT = 4 * strain_len           # FFT length for the PSD estimate; longer than the strain, so mlab.psd zero-pads the data
NOVL = NFFT // 2                # number of points of overlap between segments used in Welch averaging (must be an integer)
fband = [15.0, 350.0]
dt = 0.000244140625             # sample spacing in seconds (1/4096 s)

In [None]:
fig, ax = plt.subplots(5,2,figsize=(20,20))

for m in range(5):
    for k in range(2):
        # load the m-th example of target k once
        example_strain = np.load(train_labels[train_labels.target==k].iloc[m].filepath)

        for channel in [0]:  # use range(3) to overlay all three detectors
            strain = example_strain[channel,:] / 2

            tukey_window = signal.windows.tukey(int(strain_len*hp_window), hp_tukey_alpha)
            windowed_strain = strain*tukey_window

            psd_window = signal.windows.tukey(NFFT, alpha=1./4)
            Pxx_strain, freqs = mlab.psd(windowed_strain, Fs = strain_len, NFFT = NFFT,
                                         window=psd_window, noverlap=int(NOVL))
            PSD = interp1d(freqs, Pxx_strain)

            strain_whitened = whiten(windowed_strain, PSD, dt)
            bandpassed_strain = bandpass(strain_whitened, fband, strain_len)
            ax[m,k].plot(bandpassed_strain)
            ax[m,k].set_ylim([-10,10])

        ax[m,k].set_title("Target {}".format(k))

Phew... can you see anything?! I also tried 2D spectrograms for visualization, but they show mainly noise and it's hard to say whether there is a signal or not.

# Understanding the sources of gravitational waves <a class="anchor" id="gw_sources"></a>

In [None]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/p__5MuhvWK0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')