<img src="virgo_logo.png" alt="Title" style="width:300px;"/>


# Virgo Gravitational Wave Data Exercise

The aim of this exercise is to build an ML tool that compresses and cleans a dataset as much as possible.

**The autoencoder coded in this exercise is only used as an opportunity to acquire some insight into ML** 

In this exercise you will:

- explore the content of a dataset from the **International Gravitational-Wave Observatory Network** containing the gravitational strain as a function of time
- try to plot quantities and understand which features could help in separating signal from noise
- train a network to perform the job, and test its abilities

## **First of all, since this is a workgroup, we disable autosave in order not to mess up the notebooks; the output should be 0**


In [None]:
%%javascript
Jupyter.notebook.set_autosave_interval(0)
element.text(Jupyter.notebook.autosave_interval);

## Gravitational Wave Data

For an introduction on what GW and interferometers are please refer to https://confluence.infn.it/display/MLINFN/7.+Virgo+Autoencoder+tutorial.

For a brief vocabulary see https://labcit.ligo.caltech.edu/~ll_news/0607a_news/LIGO_Vocabulary.htm.

<div class="alert alert-info">
    <b>IGWN</b> stands for International Gravitational-Wave Observatory Network; it is a community made up of three collaborations: LIGO, Virgo and KAGRA.<br />
    <b>Strain</b> is the effect that an incoming gravitational wave produces on our detectors, which, in the case of interferometric detectors, consists of a differential variation of the arm length. This is also the observable quantity measured by the detectors.<br />
    <b>GWF</b> stands for Gravitational Wave Frame and is the file format IGWN uses to store (not only!) interferometer strain.
</div>

All data we will use in this tutorial is freely accessible and published by **GWOSC** (https://www.gw-openscience.org/about/). GWOSC, the Gravitational Wave Open Science Center, provides data from gravitational-wave observatories, along with tutorials and software tools. You can find a lot of information there if you are interested.

## Data Access

To ease the reading and decoding of the data files we will rely on a Python package named `gwdama`, which needs to be imported (and installed, because it is not available by default on the system).

[gwdama](https://pypi.org/project/gwdama/) aims at providing a unified and easy to use interface to access Gravitational Wave (GW) data and output some well organised datasets, ready to be used for Machine Learning projects or Data Analysis purposes (source properties, noise studies, etc.)

In [None]:
#Uncomment the following line if you get an error like "No module named 'gwdama'" below
#!pip install gwdama

In [None]:
from gwdama.io import GwDataManager
import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline  

Datasets made available through the `gwdama` interface are organized by time and detector. Each data entry includes the strain recorded by the detector as a function of time. Datasets can be selected by indicating a time range and the detector. For this exercise we will select events in the gps time range 118674… corresponding to signal events acquired in August 2017.

<div class="alert alert-info">
    <b>Observing runs</b>, generally called "O" with a number to distinguish between many of them (for example <b>O3</b>), are our scientific and acquisition runs.<br />
    <b>gps time</b> is a timestamp counted from an arbitrary reference event (6 January 1980); internally we use our tools to convert it to the usual epoch time. Not to be confused with the <em>Unix time</em>
</div>
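To make the relation between gps time and calendar time concrete, here is a minimal sketch of the conversion using only the standard library. The helper `gps_to_utc` and the two constants are hypothetical names introduced for illustration, and the 18 s leap-second offset is an assumption that is only valid for dates from 1 January 2017 onward; the tools used in this notebook handle this for you.

```python
from datetime import datetime, timedelta, timezone

# GPS time counts seconds from 1980-01-06 00:00:00 UTC and ignores leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
# Assumption: GPS is 18 s ahead of UTC (the offset in force since 2017-01-01).
GPS_UTC_OFFSET_S = 18

def gps_to_utc(gps_seconds):
    """Convert a GPS timestamp to a UTC datetime (hypothetical helper)."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - GPS_UTC_OFFSET_S)

# GPS time used later in this notebook
print(gps_to_utc(1186746618))
```

For earlier observing runs the offset would be different, so a real analysis should rely on a maintained library rather than a hard-coded constant.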

At https://www.gw-openscience.org/events/ you can find a list of events with their gps time and a brief description of the results.

Firstly, let's initialize the main "container" class to store and pre-process data. Let's label this instance "data":

In [None]:
dm = GwDataManager("data")

At the moment, this is just an empty container, with just some metadata (attributes) as a record of when it was created. The object `dm` returned by `GwDataManager` is a smart container for various datasets. Each dataset is identified by a unique *key*. We can list the keys defined within a container by simply printing it to standard output. Let's import some data, such as that between two gps times: 1186746568 and 1186746628. We can use the `.read_gwdata` method of the `GwDataManager` class to import data from GWOSC. As an alternative to the gps times, you could have passed a UTC time as a string in the format "YYYY-MM-DD hh:mm(:ss.)", such as `"2017-08-14 12:00"`.
Also, you can specify a name (key) for the dataset you are going to import by passing the parameter `dts_key`; this will shortly be useful but for now let's keep things simple.

*Behind the scenes, read_gwdata downloads data from the openscience portal, hence executing the next cell may require a few seconds.*

In [None]:
event_gps = 1186746618  
dm = GwDataManager("data")
# Data from Virgo interferometer (ifo), labeled "V1". Other options are
# L1 (LIGO Livingston) and H1 (LIGO Hanford)
print("This may take some time... ",end='')
dm.read_gwdata(event_gps - 50, event_gps +10, ifo='V1')
print("done!")

Now, if you print the `GwDataManager` instance you will notice that this container has a dataset labelled with the key `strain` (default).

In [None]:
print(dm)

### Plotting
To acquire some confidence with data we can examine it with a plot. `gwdama` has a couple of useful methods for this purpose. Let's start with the most immediate one.

But first, let's carry out some preparatory operations:

In [None]:
# Let's associate the dataset to a variable, for convenience
dts=dm['strain']

# and see its attributes/metadata
dts.show_attrs

Gravitational wave strain is dimensionless. We can however stress what this quantity represents by changing the unit attribute. Also, let's specify the name of this channel:

In [None]:
dts.attrs['unit'] = 'strain'
dts.attrs['channel'] = 'Virgo strain data'

# and check the result
dts.show_attrs

Now, it is time to plot this time series:

In [None]:
dts.plot()

It seems there is a recurring oscillation pattern in this data. We will explore this aspect in more detail in a moment.

<div class="alert alert-danger">
    <b>Exercise:</b> read multiple files and plot them with a single for loop. Hint: notice that you don't need to initialize multiple `GwDataManager` instances. One can contain all of your data. Just remember to label each dataset with a different key by passing the `dts_key` parameter when reading a new set of data.
</div>

### Histograms

Histograms are another way to get information about the distribution of values in a time series. `gwdama` provides the method `.hist`, which plots a histogram of the values in a dataset. Let's create one and compare it with a Gaussian (or Normal) distribution.

<div class="alert alert-info">
    <b>norm</b> is a scipy class to work with a normal distribution (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html). In this case we will use the fit method to estimate the mean and standard deviation of the input signal. If a <b>gw*</b> is not present then we can assume, for the scope of this exercise, that the signal is normally distributed (<em>which is not the case in real life!</em>)
   </div>
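As a rough illustration of what fitting a normal distribution means, the sketch below uses synthetic data as a stand-in for the strain values (an assumption, no detector data is involved). For a Gaussian, the maximum-likelihood estimates are simply the sample mean and the (biased) sample standard deviation, and the fitted density can be compared with a normalized histogram. The function `gaussian_pdf` is a hypothetical helper written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the strain values (assumption: truly Gaussian data)
samples = rng.normal(loc=0.0, scale=2.0, size=50_000)

# Maximum-likelihood estimates of a Gaussian: sample mean and biased standard deviation
mu = samples.mean()
std = samples.std()

def gaussian_pdf(x, mu, std):
    """Probability density of a normal distribution N(mu, std**2)."""
    return np.exp(-0.5 * ((x - mu) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Compare the normalized histogram with the fitted density at the bin centres
density, edges = np.histogram(samples, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
max_abs_diff = np.max(np.abs(density - gaussian_pdf(centres, mu, std)))
print(f"mu={mu:.3f}, std={std:.3f}, max |hist - pdf|={max_abs_diff:.4f}")
```

With real strain data the agreement is usually worse in the tails, which is exactly what the logarithmic scale of the histogram helps to spot.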

In [None]:
# it is convenient to specify the number of bins and tell the method to plot
# the occupancy of each of them in logarithmic scale
myplot = dts.hist(bins=50, log=True)
# Fit with a normal distribution
mu, std = norm.fit(dts.data)

Let's customize the standard plot:

<div class="alert alert-info">
    <b>gca</b> method gets the current Axes, creating one if necessary<br />
    <b>get_xlim</b> returns the min and max values on the x axis<br />
    <b>linspace</b> returns evenly spaced numbers over an interval<br />
    <b>pdf</b> returns the probability density function (see norm before)<br />
    <b>plot</b> plots x,y data. 'r--' is used to control the line style (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)<br />
    <b>legend</b> adds a legend to the plot<br />
    <b>show</b> is used to show the plot
   </div>

In [None]:
ax = myplot.gca()                     # Get the axis related to this plot (figure)
ax.patches[-1].set_label('Data')      # Set label for the histogram
# Plot the fit 
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
ax.plot(x, p, 'r--', linewidth=1, label='Gaussian fit')
ax.legend()

myplot.reshow()                       # New method to re-show closed figures

It seems that the values of our time series are well modeled by a Gaussian distribution, with possibly some deviations on the tails.

<div class="alert alert-danger"><b>Exercise:</b> do you think a different distribution can be more appropriate? Try for example the Student's distribution. The syntax is similar to the one used for the Gaussian/Normal distribution. Just import <code>t</code> from the <code>scipy.stats</code> module.
</div>

### Spectral estimations: the Power Spectral Density

When studying random time series, it is interesting to look for recurring patterns and periodicities in the values assumed by the data. This is commonly done in the *frequency domain*, looking at the intensity of the various frequency components (spectrum). The mathematical means to quantify this is the so called [Power Spectral Density (PSD)](https://en.wikipedia.org/wiki/Spectral_density). This is related to the squared modulus of the Fourier Transform of the signal, and measures the signal's power content versus frequency. A PSD is typically used to characterize broadband random signals. 

To estimate the PSD we must decide a duration for the Fast Fourier Transform and the overlap between the consecutive segments where it is estimated. These are the two parameters to pass to the `.psd` method of the dataset class of `gwdama`.
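To see what these two parameters control, here is a minimal NumPy sketch of an averaged-periodogram (Welch-style) estimate on synthetic data. It is not the code used by `gwdama`: the function `welch_psd`, its arguments and the Hann window are assumptions chosen for illustration.

```python
import numpy as np

def welch_psd(x, fs, fft_seconds, overlap_seconds):
    """One-sided PSD from averaged Hann-windowed periodograms (hypothetical helper)."""
    nper = int(fft_seconds * fs)                 # samples per segment
    step = nper - int(overlap_seconds * fs)      # hop between segment starts
    window = np.hanning(nper)
    scale = fs * np.sum(window ** 2)             # density normalization
    starts = range(0, len(x) - nper + 1, step)
    spectra = [np.abs(np.fft.rfft(window * x[s:s + nper])) ** 2 / scale for s in starts]
    psd = np.mean(spectra, axis=0)
    psd[1:-1] *= 2.0                             # fold negative frequencies (assumes nper even)
    freqs = np.fft.rfftfreq(nper, d=1.0 / fs)
    return freqs, psd

fs = 4096
t = np.arange(0, 60, 1.0 / fs)
rng = np.random.default_rng(1)
# Synthetic test signal: a 50 Hz line buried in white noise of unit variance
x = np.sin(2 * np.pi * 50.0 * t) + rng.normal(0.0, 1.0, t.size)

freqs, psd = welch_psd(x, fs, fft_seconds=2, overlap_seconds=1)
print("frequency resolution:", freqs[1], "Hz")
print("strongest line at", freqs[np.argmax(psd)], "Hz")
```

A longer FFT duration gives a finer frequency resolution but fewer segments to average, hence a noisier estimate; the overlap recovers part of the information lost to the window.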

In [None]:
# 2 seconds long fft's, with one second of overlap
dts_psd = dts.psd(2,1)

print(dm)

> **Notice** The previous cell creates a new dataset within our `dm` manager. If you execute the cell twice you will get the error `Unable to create link (name already exists)`, because an object named `strain_psd` was already defined in the container `dm`.

You can delete an existing key and the associated dataset with the command

    del dm[key]

for instance
    
    del dm['strain_psd']

A new dataset containing the values of this PSD has now been added to our `GwDataManager` object; its key is that of the original dataset with the `_psd` string appended.

<div class="alert alert-info">Notice that, as with all methods that create new datasets, you could have chosen the name of the PSD dataset by passing the optional argument <code>dts_key</code> to the previous <code>.psd</code> method.<br/>
Notice also that this new dataset is not a time series. In fact, it is saved as a <em>structured array</em>, as we will see in a moment.
</div>

Now let's plot the PSD.

In [None]:
fig, ax = plt.subplots()
# It is convenient to use "log-log plot"
ax.loglog('freq', 'PSD', data=dts_psd, alpha=.7)

ax.set(xlabel='Frequency [Hz]', ylabel='PSD [1/Hz]', xlim=(10,1500), ylim=(1e-47,1e-37), title=dts.attrs['channel'])
plt.show()

As you can see, the intensity of the various frequency components spans several orders of magnitude. The vertical structures are called *spectral lines*, and are usually associated with resonances of the apparatus, such as those of electrical origin (50 Hz mains and harmonics) or mechanical origin (suspensions).

<div class="alert alert-danger"><b>Exercise:</b> try to modify the parameters passed to the method to estimate the PSD and check how different the result looks. If you want to know more, read about the <a href="https://en.wikipedia.org/wiki/Welch%27s_method">Welch's method</a>.
</div>

### Whitening

It is often convenient to work with signals normalized so that all of their frequency components have the same intensity. In the *time domain*, this means that the signal is uncorrelated. This operation is done by a so-called *whitening* transform. There are several advantages in data analysis for doing this.

For convenience, `gwdama` provides a simple function that whitens the signal. In practice, it computes the Fourier transform of the data, divides it by the square root of its PSD, and returns to the time domain. The optional arguments of this method are indeed the same as those of `psd`.
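The three steps just described can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions, not the `gwdama` implementation: the functions `whiten` and `lag1_correlation`, the non-overlapping Hann-windowed PSD estimate and the synthetic coloured noise are all introduced here for the example.

```python
import numpy as np

def whiten(x, fs, seg_seconds=1.0):
    """Divide the spectrum of x by the square root of its estimated PSD."""
    nseg = int(seg_seconds * fs)
    window = np.hanning(nseg)
    n_segments = len(x) // nseg
    segments = x[: n_segments * nseg].reshape(n_segments, nseg) * window
    # Averaged periodogram (two-sided density, units of x**2 / Hz)
    psd = np.mean(np.abs(np.fft.rfft(segments, axis=1)) ** 2, axis=0) / (fs * np.sum(window ** 2))
    # Interpolate the coarse PSD onto the frequency grid of the full series
    psd_full = np.interp(np.fft.rfftfreq(len(x), 1.0 / fs), np.fft.rfftfreq(nseg, 1.0 / fs), psd)
    # Dividing by sqrt(PSD * fs) gives a time series with roughly unit variance
    return np.fft.irfft(np.fft.rfft(x) / np.sqrt(psd_full * fs), n=len(x))

def lag1_correlation(y):
    """Correlation between each sample and the next one."""
    return np.corrcoef(y[:-1], y[1:])[0, 1]

fs = 4096
rng = np.random.default_rng(2)
white = rng.normal(0.0, 1.0, 32 * fs)
# Coloured noise: white noise smoothed by an exponentially decaying kernel
coloured = np.convolve(white, 0.9 ** np.arange(100), mode="same")

whitened = whiten(coloured, fs)
print("lag-1 correlation before:", round(lag1_correlation(coloured), 3))
print("lag-1 correlation after: ", round(lag1_correlation(whitened), 3))
print("std after whitening:     ", round(whitened.std(), 3))
```

The correlation between neighbouring samples should drop close to zero after whitening, which is the time-domain counterpart of a flat spectrum.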

In [None]:
if 'strain_whiten' in dm.keys(): 
    del dm['strain_whiten']  ## Delete previously defined datasets with the same name
# del dm['strain_whiten']
whiten = dts.whiten(1)

If you print `dm` you will notice that a new dataset has been added.

In [None]:
# let's plot it
whiten.plot()

A **whitened** signal is (usually) normalized such that the distribution of its values is described by a standard Normal distribution with zero mean and unit variance. Sometimes the normalization is fixed in the frequency domain instead, and the time series is then scaled by the number of samples.

<div class="alert alert-danger"><b>Exercise:</b> test the previous claim making a histogram of the whitened dataset.
</div>

As a quick check (not valid as an answer to the previous question!):

In [None]:
print("Mean:     {:.3f}\nStd.dev.: {:.3f}".format(*norm.fit(whiten.data)))

### Resampling

This data is sampled at 4096 Hz, but our interferometers act like a band-pass filter for the gravitational wave signal. Indeed, looking at the plot of the PSD you can notice that the "bucket" where they achieve their best sensitivity is limited to a band between 30 and about 1000 Hz. For this reason, it can be useful to resample the data in order to work with smaller files and remove some high-frequency noise.

**decimate_recursive** resamples an input signal keeping only 1 element every n

**resample** resamples an input signal to a target frequency. In this case we will use 512 Hz, so we keep only 1 value every 8, since our original signal is sampled at 4096 Hz
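Keeping 1 sample every n is only safe if the signal has no content above the new Nyquist frequency; otherwise that content is aliased into the band of interest. The sketch below compares naive decimation with a Fourier-domain resampling on a synthetic signal. The function `fft_resample` is a hypothetical helper and is not claimed to be what `gwdama` does internally.

```python
import numpy as np

def fft_resample(x, fs_in, fs_out):
    """Resample by truncating the spectrum (drops content above fs_out / 2)."""
    n_in = len(x)
    n_out = int(round(n_in * fs_out / fs_in))
    spectrum = np.fft.rfft(x)[: n_out // 2 + 1]
    # The factor n_out / n_in keeps the amplitude of the time series unchanged
    return np.fft.irfft(spectrum, n=n_out) * (n_out / n_in)

fs_in, fs_out = 4096, 512
t_in = np.arange(0, 1, 1.0 / fs_in)
# 10 Hz component we want to keep plus a 1000 Hz component above the new Nyquist (256 Hz)
x = np.sin(2 * np.pi * 10 * t_in) + 0.5 * np.sin(2 * np.pi * 1000 * t_in)

naive = x[:: fs_in // fs_out]             # keep 1 sample every 8: the 1000 Hz term aliases
clean = fft_resample(x, fs_in, fs_out)    # the 1000 Hz term is removed before downsampling

t_out = np.arange(0, 1, 1.0 / fs_out)
reference = np.sin(2 * np.pi * 10 * t_out)
naive_error = np.max(np.abs(naive - reference))
clean_error = np.max(np.abs(clean - reference))
print("naive decimation error:", naive_error)
print("fft resampling error:  ", clean_error)
```

The example uses a signal that is exactly periodic over the analysed interval; on real data the edges of the segment need some care (windowing or padding).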

In [None]:
if 'strain_whiten_r512' in dm.keys(): 
    del dm['strain_whiten_r512']
wres = whiten.resample(512)

In [None]:
print(dm)

In [None]:
# let's plot it
dm['strain_whiten_r512'].plot()

Just for fun try to resample at 128, 256, 1024, too

In [None]:
for key in [k for k in dm.keys() if 'strain_whiten_r' in k]: 
    del dm[key]  ## Delete all resamplings
dm['strain_whiten'].resample(128)
dm['strain_whiten'].resample(256)
dm['strain_whiten'].resample(512)
dm['strain_whiten'].resample(1024)
print(dm)

If you prefer to work directly with the usual list/array of values, simply get them with:

    values = dm[name_record].data
    times = dm[name_record].times

**Remark**: Throughout all exercises, we will assume a constant `sample_rate` and chunks of fixed length
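As an illustration of what fixed-length chunks means in practice, the following sketch cuts a one-dimensional array into rows of equal length and discards the samples that do not fill a whole chunk. The helper `split_into_chunks` and the numbers used are assumptions made for the example.

```python
import numpy as np

def split_into_chunks(values, chunk_len):
    """Return a 2-D array of shape (n_chunks, chunk_len), dropping leftover samples."""
    values = np.asarray(values)
    n_chunks = len(values) // chunk_len
    return values[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)

sample_rate = 512                                          # Hz, as after resampling
values = np.arange(60 * sample_rate + 100, dtype=float)    # 60 s plus 100 leftover samples
chunks = split_into_chunks(values, chunk_len=128)
print(chunks.shape)
```

Each row can then be fed to a network whose input layer has `chunk_len` units.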

If you want more information on the data and the cleaning process, here is a useful link: https://github.com/losc-tutorial/Data_Guide/blob/master/Guide_Notebook.ipynb

Once data is ready we can try to encode with an **autoencoder**

> [from wiki]: An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learned, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name

some examples of autoencoders in keras can be found at https://blog.keras.io/building-autoencoders-in-keras.html

The idea is to encode the signal and decode it back to the original dimension, searching for the set of parameters that makes the reconstructed signal as similar as possible to the original one. The encoded signal retains the needed information in a reduced size

To make our statement more concise we will summarize some ideas on Artificial Neural Networks and Deep Learning

## Artificial Neural networks 

**Artificial neural networks (ANNs)** are computing systems vaguely inspired by the biological neural networks that constitute animal brains.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times and along different paths.

Artificial networks can be built from many layers, each one connected to the others. There are many types of networks. For example:

A **feedforward neural network** is an artificial neural network wherein connections between nodes do not form a cycle; layers are connected starting from the top (input) to the bottom (output) and the activation flows exclusively from the top to the bottom

A **recurrent neural network (RNN)** is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. Typically part of the output of one layer is fed as input to the same layer at a different time

[https://en.wikipedia.org/wiki/Artificial_neural_network]

## Layers

**Layers** are groups of nodes that mimic some concept. The nodes inside a layer share the same input, feed the same output and use the same activation function

## Activation function

The **activation function** of a node defines the output of that node given an input or set of inputs. Experience shows that only nonlinear activation functions allow networks to compute nontrivial problems using only a small number of nodes

- **rectifier or ReLU** activation function is an activation function defined as the positive part of its argument

- **Sigmoid** activation function applies a sigmoid to the input. The assumption here is that we are mostly interested in intermediate values of the input, which are treated in an approximately linear way; values that are large in modulus matter less, because they are extreme and the output shows little variation there
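The two activation functions listed above are simple enough to write by hand. The NumPy sketch below is only meant to show their shape; in the rest of the notebook the versions built into Keras are used.

```python
import numpy as np

def relu(x):
    """Positive part of the argument."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic function, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("relu:   ", relu(x))
print("sigmoid:", np.round(sigmoid(x), 4))
```

Note that the output of the sigmoid never leaves the interval between 0 and 1, while the ReLU is unbounded for positive inputs.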

[ https://en.wikipedia.org/wiki/Activation_function ]

## Loss function
Loss is nothing but a prediction error of Neural Network. And the method to calculate the loss is called Loss Function. The output of keras contains two similar terms, val_loss and loss. *val_loss is the value of cost function for your cross-validation data and loss is the value of cost function for your training data.*

## Optimizer

The **optimizer** is used to explore the parameter space, searching for the values that correspond to a minimum of the loss function
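The interplay between loss function and optimizer can be illustrated with plain gradient descent on a mean squared error. This is a deliberately small sketch on synthetic data (a straight line with noise, an assumption of the example); optimizers such as Adam or Nadam refine the same idea with adaptive step sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 200)    # synthetic data: slope 3, intercept 0.5

def mse(w, b):
    """Loss function: mean squared error of the linear model w * x + b."""
    return np.mean((w * x + b - y) ** 2)

w, b, learning_rate = 0.0, 0.0, 0.1
for step in range(500):
    residual = w * x + b - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(residual * x)
    grad_b = 2.0 * np.mean(residual)
    # Move the parameters a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, loss={mse(w, b):.4f}")
```

The final loss cannot go below the variance of the noise added to the data, which is why a loss that stops decreasing is not necessarily a sign of a bad optimizer.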

## Learning 

**Supervised learning** is the machine learning task of learning a function that maps an input to an output based on example input-output pairs

**Unsupervised learning** is the task of learning structure in the data without example outputs (labels), guided only by some cost function

## Overfitting

Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study

## Regularization

Regularization in machine learning is the process of constraining the parameters of a model, shrinking the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, reducing the risk of overfitting
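A small numerical experiment can make the shrinking effect visible. The sketch below fits a high-degree polynomial to a few noisy points with and without an L2 penalty (ridge regression). The data, the degree and the penalty strength are arbitrary choices made for this illustration; the Keras regularizers used later apply the same kind of penalty to the weights of a layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)   # few noisy points

degree = 10
A = np.vander(x, degree + 1)           # polynomial design matrix

def fit(A, y, l2):
    """Least squares with an L2 penalty l2 * ||coef||^2 (ridge regression)."""
    return np.linalg.solve(A.T @ A + l2 * np.eye(A.shape[1]), A.T @ y)

coef_plain = fit(A, y, l2=0.0)
coef_ridge = fit(A, y, l2=1e-2)
print("|coef| without penalty:", np.linalg.norm(coef_plain))
print("|coef| with L2 penalty:", np.linalg.norm(coef_ridge))
```

The unpenalized fit is free to use large coefficients to chase the noise, while the penalized one pays a price for them.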

## Restatement of the task

An autoencoder has an input layer, an output layer and one or more hidden layers connecting them. The output layer has the same number of units as the input layer. Its purpose is to reconstruct its own inputs. Therefore, autoencoders are unsupervised learning models. We want our autoencoder to encode the data efficiently using unsupervised learning

<div class="alert alert-danger"><b>Question:</b> Could an autoencoder have in total only a single layer? Two layers? Three?
</div>

<div class="alert alert-info"><b>Remark:</b> The loss function is used to optimize your model. This is the function that will get minimized by the optimizer. A metric is used to judge the performance of your model!
</div>

We will use keras ( https://keras.io/ )
"Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. It also has extensive documentation and developer guides."

Some useful info:

- https://blog.keras.io/building-autoencoders-in-keras.html

- https://keras.io/api/models/

In the following we will use a slightly different approach from the usual one: instead of using a single command, we use sklearn as an interface to call other functions. The idea is to learn how to wrap Keras models for use in scikit-learn and how to use grid search. "GridSearchCV is a library function that is a member of sklearn's model_selection package. It helps *to loop through predefined hyperparameters* and fit your estimator (model) on your training set. So, in the end, you can select the best parameters from the listed hyperparameters" 
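Conceptually, a grid search is just a loop over all combinations of the listed hyperparameters, keeping the combination with the best score. The sketch below spells this out with plain Python and NumPy on a toy polynomial fit. It is a simplified stand-in: GridSearchCV uses cross-validation rather than the single train/validation split assumed here, and the model, the grid and the helper `fit_predict` are made up for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, x.size)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def fit_predict(degree, l2):
    """Ridge polynomial fit on the training set; returns validation predictions."""
    A = np.vander(x_train, degree + 1)
    coef = np.linalg.solve(A.T @ A + l2 * np.eye(degree + 1), A.T @ y_train)
    return np.vander(x_val, degree + 1) @ coef

# Hyperparameters to explore: every combination is tried once
param_grid = {"degree": [1, 3, 5, 7], "l2": [1e-6, 1e-3, 1.0]}

results = []
for degree, l2 in itertools.product(param_grid["degree"], param_grid["l2"]):
    val_mse = np.mean((fit_predict(degree, l2) - y_val) ** 2)
    results.append((val_mse, {"degree": degree, "l2": l2}))

best_mse, best_params = min(results, key=lambda r: r[0])
print("best parameters:", best_params, "validation MSE:", round(best_mse, 4))
```

The cost grows with the product of the number of values per hyperparameter, so the grid should be kept small when each fit is a neural network training.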

Please **Remember**: Keras is a high-level API built on Tensorflow. Scikit Learn is a general machine learning library built on top of NumPy

[https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html ] 

[https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/ ]

Some layers used in keras:

- **Dense** layer makes output=activation(dot(input, kernel) + bias) where kernel is the matrix of parameters and activation is the activation function. Each node is fed by the whole input! 

- **Dropout** layer randomly sets input units to 0 with a frequency of rate at each step during training time. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged
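The two formulas above can be reproduced in a few lines of NumPy, which may help to see what Keras does under the hood. The functions `dense` and `dropout` below are hand-written illustrations with made-up sizes, not the Keras implementations.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias)"""
    return activation(inputs @ kernel + bias)

def dropout(inputs, rate, rng):
    """Zero a random fraction `rate` of the inputs and rescale the survivors."""
    keep = rng.random(inputs.shape) >= rate
    return inputs * keep / (1.0 - rate)

batch = rng.normal(size=(4, 128))          # 4 samples of 128 values each
kernel = rng.normal(size=(128, 32)) * 0.1  # weights of a layer with 32 units
bias = np.zeros(32)

# Every one of the 32 units sees all 128 input values
hidden = dense(batch, kernel, bias, activation=lambda z: np.maximum(0.0, z))
# With rate=0.2 the surviving inputs are multiplied by 1 / (1 - 0.2) = 1.25
dropped = dropout(np.ones((1000, 100)), rate=0.2, rng=rng)
print(hidden.shape, round(dropped.mean(), 3))
```

The mean of the dropped-out array stays close to 1, which is the sense in which the sum over all inputs is preserved on average.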

*Please remember: before processing data with machine learning techniques it is really important to normalize all the data and resample it at a fixed rate. That's why we explained how to do it in the introduction*

## Starter Example
Write an autoencoder with keras only that takes as input a 1D signal

In [None]:
import keras
from keras import layers
from keras.utils.vis_utils import plot_model

# This is the size of our encoded representations
original_dim = 128
encoding_dim = 32  # 32 floats -> compression of factor 4.0, assuming the input is 128 floats

# This is our input signal
input_signal = keras.Input(shape=(original_dim,))
# "encoded" is the encoded representation of the input
encoded = layers.Dense(encoding_dim, activation='relu')(input_signal)
# "decoded" is the lossy reconstruction of the input
decoded = layers.Dense(original_dim, activation='sigmoid')(encoded)

# This model maps an input to its reconstruction
autoencoder = keras.Model(input_signal, decoded)

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

# for fun, let's create a random input signal (mean=0, std=3) of 128*1000 elements. Why is it a multiple of 128?
noise = np.random.normal(0, 3, original_dim*1000)

# a 2d-array of shape (1000, original_dim): one row per chunk
x_signal = np.stack(np.split(noise,1000))
x_signal.shape

# first 900 rows to use as train
x_train=x_signal[0:900,:]
# last 100 rows to use as test
x_test = x_signal[900:,:]

# train with batches of 30 rows (how many iterations per epoch?) for 15 epochs...
autoencoder.fit(x_train, x_train, epochs=15, batch_size=30, shuffle=True, validation_data=(x_test, x_test))

## Starter example II
Add some layers for fun and acquire some confidence with keras.

The workflow will be the following:
```
original -> convolution -> encoded size 32 -> encoded size 16 -> decoded size 32 -> decoded size original
```
It is convenient to play a bit with layers that expect a different number of dimensions, as in this case

Conv1D expects a 3d signal; for this reason we add a singleton dimension with

*shape=(1,original_dim)*

please note the "1" added to the list of dimensions

In [None]:
import keras
from keras import layers

# This is the size of our encoded representations
original_dim = 128
encoding_dim = 32  # 32 floats -> compression of factor 4.0, assuming the input is 128 floats
encoding_dim_1 = 16  # 16 floats -> compression of factor 8.0, assuming the input is 128 floats

# This is our input signal
input_signal = keras.Input(shape=(1,original_dim))
# "encoded" is the encoded representation of the input
convolved = layers.Conv1D(filters=64, kernel_size=3, strides=1, padding="causal", activation='relu', input_shape=(1,original_dim))(input_signal)
encoded = layers.Dense(encoding_dim, activation='relu')(convolved)
encoded_1 = layers.Dense(encoding_dim_1, activation='relu')(encoded)
# "decoded" is the lossy reconstruction of the input
decoded_1 = layers.Dense(encoding_dim, activation='sigmoid')(encoded_1)
decoded = layers.Dense(original_dim, activation='sigmoid')(decoded_1)

# This model maps an input to its reconstruction
autoencoder = keras.Model(input_signal, decoded)

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()

# for fun, let's create a random input signal (mean=0, std=3) of 128*1000 elements. Why is it a multiple of 128?
noise = np.random.normal(0, 3, original_dim*1000)

# reshape to fit data
# the model expects input of shape (N, C, X) where
# N - number of samples
# C - singleton dimension
# X - sample size (original_dim)

X_signal = noise.reshape((1000, 1, original_dim))
x_train = X_signal[0:900]
x_test = X_signal[900:]
# train with batches of 30 rows (how many iterations per epoch?) for 15 epochs...
autoencoder.fit(x_train, x_train, epochs=15, batch_size=30, shuffle=True, validation_data=(x_test, x_test))
X_signal.shape

Sometimes you need to see what an intermediate layer is doing.

Here is a simple skeleton: get the outputs of all the layers used, build a second keras model that maps the same input to those outputs, then call it with data

    # all layer outputs, skipping the input layer itself
    layer_outputs = [layer.output for layer in autoencoder.layers[1:]]
    # evaluation model sharing the weights of the trained autoencoder
    probe = keras.Model(inputs=autoencoder.input, outputs=layer_outputs)

    # Testing: one random sample with the input shape of Starter example II
    test = np.random.random((1, 1, original_dim))
    layer_outs = probe.predict(test)
    for layer, out in zip(autoencoder.layers[1:], layer_outs):
        print(layer.name, out.shape)

## Starter example III

Now we will try to denoise a simple sine wave using an autoencoder; the idea is that the noise is small (and at higher frequencies) compared to the signal.

Autoencoding removes some data; maybe it could remove part of the noise in this very simple case...

Here you can see what happens! Feel free to change parameters like the signal offset, num_array, interval, epochs and so on

In [None]:
def gen_time(tmin=0,tmax=np.pi*10,step=0.05):
    time = np.arange(tmin, tmax, step)
    return time, len(time)

def gen_signal(time, amp=1):
    # Amplitude of the sine wave is sine of a variable like time.
    amplitude = amp* np.sin(time)
    return amplitude

def gen_noise(npoints=1000, sigma=3, mean=0, amplitude=0.1):
    noise = amplitude * np.random.normal(mean, sigma, npoints)
    return noise

def gen_signal_array(time, amp_signal, amp_noise, sigma, mean, nelems=1000, offset=0):
    npoints = len(time)
    arr = np.array([])
    for el in range(0,nelems):
        signal = gen_signal(time, amp_signal) + offset
        noise = gen_noise(npoints, sigma, mean, amp_noise)
        wave = signal + noise
        arr = np.concatenate((arr, wave))
    return arr

time, npoints = gen_time()

# This is the size of our encoded representations (layer sizes must be integers)
original_dim = npoints
encoding_dim = npoints // 2
encoding_dim_1 = npoints // 20
num_array = 10

arr = gen_signal_array(time, amp_signal=0.5, amp_noise=0.05, sigma=3, mean=0, nelems=num_array, offset=0.5)

# reshape to fit data
# the model expects input of shape (N, C, X) where
# N - number of samples
# C - singleton dimension
# X - sample size (original_dim)

X_signal = arr.reshape((num_array, 1, original_dim))
n_train = round(0.9*num_array)
x_train = X_signal[0:n_train]
if n_train == num_array:
    x_test = x_train
else:
    x_test = X_signal[n_train:]

def autoencoder_func(x_train, x_test, epochs=150):
    # This is our input signal
    input_signal = keras.Input(shape=(1,original_dim))
    # "encoded" is the encoded representation of the input
    convolved = layers.Conv1D(filters=314, kernel_size=3, strides=1, padding="causal", activation='relu', input_shape=(1,original_dim))(input_signal)
    encoded = layers.Dense(encoding_dim, activation='relu')(convolved)
    encoded_1 = layers.Dense(encoding_dim_1, activation='relu')(encoded)
    # "decoded" is the lossy reconstruction of the input
    decoded_1 = layers.Dense(encoding_dim, activation='sigmoid')(encoded_1)
    decoded = layers.Dense(original_dim, activation='sigmoid')(decoded_1)

    # This model maps an input to its reconstruction
    autoencoder = keras.Model(input_signal, decoded)

    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.summary()
    history = autoencoder.fit(x_train, x_train, epochs=epochs, batch_size=10, shuffle=True, validation_data=(x_test, x_test))
    decoded = autoencoder.predict(x_test)
    return history, decoded, autoencoder

history, decoded, model = autoencoder_func(x_train, x_test, epochs=20)

# plot results: noisy test input against its reconstruction
plt.plot(time, x_test[0].reshape(-1), label='input')
plt.plot(time, decoded[0].reshape(-1), label='decoded')
plt.legend()
plt.show()


**Query**: Sometimes the decoded signal gets clamped; are you able to understand why?

**Query2**: Do you remember the difference between loss and val_loss?

**Task**: Are you able to make a plot of loss vs epochs?

**Query3**: Does the training time increase or decrease when changing the batch_size? And how about the loss?

Draw the model!

In [None]:
plot_model(model, show_shapes=True, show_layer_names=True)

**Remark** There are many parameters; you have to choose them to obtain the minimum of val_loss/loss

It is a matter of choosing the right parameters, and you have to explore them; that's why we introduce a grid search

That function is in sklearn, let's jump straight to the next example

## Simple example
In this example we will use a different API from the one used before; it is easy to switch between one format and the other

A skeleton autoencoder will take the input, import relevant functions, define a model, optimize it and test it

Let's start!

In [None]:
# input
# we will split our data in chunks of n elements and feed our network with all these chunks
n_features = 100 # is the size of a single chunk of data
n_encoded = n_features

# add a normally distributed dataset to the data manager
if 'random_n' in dm.keys(): 
    del dm['random_n']
dm.create_dataset('random_n', data=np.random.normal(0, 1, (10000,)))

# access its data and store in variable input_data
input_data = dm['random_n'].data

# split data into chunks of 100 elements each and stack the chunks vertically
chunks = np.stack(np.split(input_data,100))
chunks.shape
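
The chunking step above can be sketched with plain NumPy. This minimal example assumes a synthetic array in place of `dm['random_n'].data` (the names `data`, `chunks_split` and `chunks_reshape` are illustrative only) and suggests that `np.split` followed by `np.stack` gives the same result as a single `reshape`:

```python
import numpy as np

# Synthetic stand-in for the data stored in the container (assumption)
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 10000)

# Split into 100 equal pieces and stack them as rows
chunks_split = np.stack(np.split(data, 100))

# Equivalent one-step version: 100 rows (samples) of 100 values (features)
chunks_reshape = data.reshape(100, -1)

print(chunks_split.shape)
print(np.array_equal(chunks_split, chunks_reshape))
```

Note that `np.split` raises an error if the array length is not an exact multiple of the number of pieces, which is why the data length has to be chosen accordingly.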

In [None]:
import math

from keras import optimizers
from keras.optimizers import Nadam

from keras import regularizers

from keras.models import Sequential

from keras.layers import Dense, Activation, Dropout, TimeDistributed

from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error

from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

import joblib
import logging
import tensorflow as tf
tf.get_logger().setLevel('ERROR')

In [None]:
# a very simple model, feedforward neural network with two layers and with two different activation functions
def baseline_model(learning_rate=5e-6, activation='sigmoid', bias1=1e-9, bias2=1e-9, ker1=1e-9, ker2=1e-9):
  
    model = Sequential()
    model.add(Dense(n_encoded, activation=activation, input_shape=(n_features,), bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model.add(Dense(n_features, activation='relu', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    
    model.compile(optimizer=Nadam(lr=learning_rate), loss='mse', metrics=['mse'])

    return model

We have defined a function (with six "hyperparameters"!) that builds a very minimalistic model.

**Sequential** initializes the model

The **add** method is used to add layers to your model

The **compile** method is used to configure the model for training https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile

Some references for its parameters:
- https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
- https://www.tensorflow.org/api_docs/python/tf/keras/losses
- https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric

You can get useful debugging information with **model.summary()**

In [None]:
# machine learning
# epochs is the number of passes over the whole training set; one pass is enough for this first test
# batch_size is the number of samples (chunks) processed before each weight update
#mlp = KerasRegressor(build_fn=baseline_model, epochs=1, batch_size=n_features, verbose=0)
mlp = KerasRegressor(build_fn=baseline_model, verbose=True)

**KerasRegressor** is a wrapper to use keras from sklearn. It takes as parameters your model-building function, the number of epochs and the batch size. See for example https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn/KerasRegressor

The **batch size** is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

The number of **epochs** is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset. See for example https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
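
To make the relation between the two definitions concrete, here is a minimal sketch that counts how many weight updates a training run performs. The helper `count_updates` is hypothetical (it is not part of keras), and it assumes that the last, possibly smaller, batch of each epoch is also used:

```python
import math

def count_updates(n_samples, batch_size, epochs):
    """Number of weight updates performed during training."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# 100 samples in batches of 10 -> 10 updates per epoch, repeated for 20 epochs
print(count_updates(n_samples=100, batch_size=10, epochs=20))  # 200
```

A smaller batch size therefore means more updates per epoch, while more epochs means more passes over the same data.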

In [None]:
# parameter space to explore when building our model
# in the next examples we will use a dictionary like the following one (a single value for each entry), searching over the parameters that regularise the network
# param_distr = dict(bias1 = [1e-9], bias2 = [1e-9], ker2 = [1e-9], ker1 = [1e-9])
param_distr = dict(learning_rate=[1e-6, 5e-6, 10e-6], activation=['relu', 'sigmoid'])

param_distr is a dictionary of parameters to try; the grid search will explore all their combinations.

**Query**: How many combinations are there in the general case? And in this case?
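
As a way to check your answer, the combinations of a grid can be enumerated with the standard library. The sketch below uses a hypothetical `toy_grid`, deliberately different from the one used in this notebook:

```python
from itertools import product

# Hypothetical grid, not the one used in this notebook
toy_grid = dict(dropout=[0.1, 0.5], batch_size=[10, 20, 40], epochs=[10, 50])

# One dictionary per combination of values
names = list(toy_grid)
combinations = [dict(zip(names, values)) for values in product(*toy_grid.values())]

print(len(combinations))
print(combinations[0])
```

Each of these dictionaries corresponds to one model configuration that a grid search would have to train and evaluate.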

In [None]:
# build our model all together
grid_search = GridSearchCV(estimator=mlp, param_grid=param_distr, cv=3)

**GridSearchCV** performs an exhaustive search over the specified parameter values for an estimator. So we are trying all combinations! https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In particular, see the `fit` method: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit

Its input X must be an array-like of shape (n_samples, n_features).
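
With `cv=3`, every parameter combination is fitted three times, each time holding out a different third of the samples for scoring. The following is a minimal NumPy sketch of that splitting, assuming contiguous, unshuffled folds (which, to our understanding, is what scikit-learn does by default for regressors); the array `X` is a placeholder, not real data:

```python
import numpy as np

X = np.zeros((100, 49))          # placeholder data: 100 samples, 49 features
indices = np.arange(X.shape[0])
folds = np.array_split(indices, 3)

for k, test_idx in enumerate(folds):
    # everything that is not in the held-out fold is used for training
    train_idx = np.setdiff1d(indices, test_idx)
    print(k, X[train_idx].shape, X[test_idx].shape)
```

The splitting acts on the first axis only, which is why the samples must be arranged along the rows.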

In [None]:
# fit it
# tune all nodes for each combination in param_distr to make the output "near" to the input. "Near" here is measured by the distance function 'mse'
grid_search.fit(chunks, chunks)

In [None]:
# get some info
print("Best: %f using %s" % (grid_search.best_score_, grid_search.best_params_))
means  = grid_search.cv_results_['mean_test_score']
stds   = grid_search.cv_results_['std_test_score']
params = grid_search.cv_results_['params']

<div class="alert alert-danger"><b>Question:</b> Is the output data only positive? Why?
</div>

## Exercise 1: Alter the model by changing the activation functions

Redo all the steps, but use a different activation function in the model function.

In [None]:
# a very simple model, feedforward neural network with two layers and with two different activation functions
def baseline_model1(learning_rate=5e-6, activation='sigmoid', bias1=1e-9, bias2=1e-9, ker1=1e-9, ker2=1e-9):
  
    model1 = Sequential()
    model1.add(Dense(n_features, activation=activation, input_shape=(n_features,), bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model1.add(Dense(n_encoded, activation='sigmoid', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    
    model1.compile(optimizer=Nadam(lr=learning_rate), loss='mse', metrics=['mse'])

    return model1


mlp1 = KerasRegressor(build_fn=baseline_model1, epochs=1, batch_size=n_features, verbose=0)
param_distr1 = dict(learning_rate=[1e-5], activation=['sigmoid'], bias1 = [1e-9], bias2 = [1e-9], ker2 = [1e-9], ker1 = [1e-9])
grid_search1 = GridSearchCV(estimator=mlp1, param_grid=param_distr1, cv=3)
grid_search1.fit(chunks, chunks)
print("Best: %f using %s" % (grid_search1.best_score_, grid_search1.best_params_))
means  = grid_search1.cv_results_['mean_test_score']
stds   = grid_search1.cv_results_['std_test_score']
params = grid_search1.cv_results_['params']

## Exercise 2: example with real data

We need to follow exactly what we did before, but using our GW data instead of our simulated data.

Recall that to get data from dama you can use `dm['random_n'].data`.

In real life, before starting we would have to decide
- our chunk size
- our metric

using the frequency and other values as a starting point.

We can however start with a (nearly!) arbitrary value and see what happens.



In [None]:
print(dm['strain_whiten_r512'])
input_data=np.array(dm['strain_whiten_r512'].data)
input_data.shape

Take a number of points that is a multiple of 100; for example, we will use the first 49 * 100 elements.

In [None]:
input_data[0:4900].shape

In [None]:
chunks = np.stack(np.split(input_data[0:4900],100))
chunks.shape

As already said, we need an array-like of shape (n_samples, n_features),
so here we have 100 samples of 49 features each.

To make it easier to tweak the model, we define the number of features and the number of encoded features as hyperparameters of the `baseline_model2` function.

We can now proceed as usual

In [None]:
# a very simple model, feedforward neural network with two layers and with two different activation functions
def baseline_model2(n_features=49, n_encoded=100, learning_rate=5e-6, activation='sigmoid', bias1=1e-9, bias2=1e-9, ker1=1e-9, ker2=1e-9):
  
    model2 = Sequential()
    model2.add(Dense(n_encoded, activation=activation, input_shape=(n_features,), bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model2.add(Dense(n_features, activation='sigmoid', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    
    model2.compile(optimizer=Nadam(lr=learning_rate), loss='mse', metrics=['mse'])
    return model2


mlp2 = KerasRegressor(build_fn=baseline_model2, epochs=100, batch_size=10, verbose=0)
param_distr2 = dict(learning_rate=[1e-5], activation=['sigmoid'], bias1 = [1e-9], bias2 = [1e-9], ker2 = [1e-9], ker1 = [1e-9])
grid_search2 = GridSearchCV(estimator=mlp2, param_grid=param_distr2, cv=3)
grid_search2.fit(chunks, chunks)
print("Best: %f using %s" % (grid_search2.best_score_, grid_search2.best_params_))
means  = grid_search2.cv_results_['mean_test_score']
stds   = grid_search2.cv_results_['std_test_score']
params = grid_search2.cv_results_['params']

**Note**: Could you alter the code to have 49 elements (samples) of 100 features each?

## Exercise 3: Alter model, adding a dropout layer and a new dense layer

Let's build a more complex example, adding new dense layers and dropout layers, and asking for a compression factor of 4.

Dropout randomly switches off a fraction of the neurons at each training step, which acts as a regularisation and is meant here to mitigate the problem of "sleeping neurons". You could also add a convolution layer for fun if you want.
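
To give an idea of what a dropout layer does during training, here is a minimal NumPy sketch of the usual "inverted dropout" scheme. The function `dropout_forward` is a hypothetical illustration, not the keras implementation; it assumes that the surviving activations are rescaled by `1 / (1 - rate)` so that their expected sum stays the same:

```python
import numpy as np

def dropout_forward(x, rate, rng):
    """Zero a random fraction `rate` of the entries and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout_forward(x, rate=0.5, rng=rng)
print(y)  # entries are either 0 (dropped) or 2 (kept and rescaled)
```

At prediction time dropout is switched off and the layer passes its input through unchanged.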

In [None]:
n_features=49

# a deeper feedforward network: three dense encoding layers (each followed by dropout) and three dense decoding layers
def baseline_model3(dropout=0.1, bias1=1e-9, bias2=1e-9, ker1=1e-9, ker2=1e-9):
  
    model3 = Sequential()
    
    model3.add(Dense(n_features, activation='relu', input_shape=(n_features,), bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model3.add(Dropout(dropout))
    model3.add(Dense(int(n_features/2), activation='relu', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model3.add(Dropout(dropout))
    model3.add(Dense(int(n_features/4), activation='relu', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model3.add(Dropout(dropout))

    model3.add(Dense(int(n_features/4), activation='sigmoid', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model3.add(Dense(int(n_features/2), activation='sigmoid', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    model3.add(Dense(n_features, activation='sigmoid', bias_regularizer=regularizers.l1_l2(l1=bias1, l2=bias2), kernel_regularizer=regularizers.l1_l2(l1=ker1, l2=ker2)))
    
    model3.compile(optimizer=Nadam(lr=5e-6), loss='mse', metrics=['mse'])

    return model3


mlp3 = KerasRegressor(build_fn=baseline_model3, epochs=1, batch_size=100, verbose=0)
param_distr3 = dict(dropout=[0.1, 0.5, 0.9], bias1 = [1e-9], bias2 = [1e-9], ker2 = [1e-9], ker1 = [1e-9])
grid_search3 = GridSearchCV(estimator=mlp3, param_grid=param_distr3, cv=3)
grid_search3.fit(chunks, chunks)
print("Best: %f using %s" % (grid_search3.best_score_, grid_search3.best_params_))
means  = grid_search3.cv_results_['mean_test_score']
stds   = grid_search3.cv_results_['std_test_score']
params = grid_search3.cv_results_['params']

We altered the number of neurons: is the metric better or worse compared to the other exercises? Was this expected?

## Exercise 4: Tune parameters

**Add to param_distr both Batch size and Epochs, retrain and take best values**

...same model as before, wrapped in a KerasRegressor called mlp4...

batch_size = [10, 20, 40, 60, 80, 100]

epochs = [10, 50, 100]

param_distr4 = dict(bias1 = [1e-9], bias2 = [1e-9], ker2 = [1e-9], ker1 = [1e-9], batch_size=batch_size, epochs=epochs )

grid_search4 = GridSearchCV(estimator=mlp4, param_grid=param_distr4, cv=3)

grid_result=grid_search4.fit(chunks, chunks)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']

stds = grid_result.cv_results_['std_test_score']

params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):

    print("%f (%f) with: %r" % (mean, stdev, param))

**Same as before, but now optimize the optimizer**

optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']

param_grid = dict(optimizer=optimizer)

**With this approach you can optimize any hyperparameter, but the search only returns the best point among those on the grid. Nothing guarantees that this point is the global minimum: the true optimum may lie between grid points or outside the explored ranges!**

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
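
One way to see the limitation mentioned above is that a grid search can only return one of the points it was given. The sketch below uses a hypothetical one-dimensional loss curve `toy_loss` (an assumption for illustration, not a real training loss) whose minimum falls between two grid points:

```python
import numpy as np

def toy_loss(lr):
    # Hypothetical loss curve with its minimum at lr = 0.3
    return (lr - 0.3) ** 2

grid = np.array([0.0, 0.5, 1.0])
best_on_grid = grid[np.argmin(toy_loss(grid))]
print(best_on_grid)  # 0.5: the best grid point, not the true minimum at 0.3
```

Refining the grid around the best point found is a common way to get closer to the actual optimum.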

## Exercise 5: Plot output data and compare with input signal

Let's put all together and try to visualize the effect of the compression.

Define the best autoencoder that you can, possibly using the indications obtained from the GridSearchCV.
Then, try to increase the number of nodes of the internal hidden layer until you manage to reproduce exactly the input.
In this case you have no compression and no loss in terms of quality.
You can then try to decrease the number of encoded features and superpose the input and the predicted waveforms. The fewer nodes the hidden layer has, the more the two waveforms differ, but the higher the compression rate you reach.


In [None]:
grid_search2.predict(chunks).shape
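# single-hidden-layer autoencoder; `hidden_layers` is the number of nodes in the hidden (encoded) layer
def baseline_model4(hidden_layers, n_features=49):
    model4 = Sequential()
    # encoder: n_features -> hidden_layers nodes
    model4.add(Dense(hidden_layers, activation='tanh', input_shape=(n_features,), kernel_initializer='he_normal' ))
    # decoder: hidden_layers nodes -> n_features, linear output
    model4.add(Dense(n_features, activation='linear', kernel_initializer='he_normal'))
    model4.compile(optimizer=Nadam(lr=3e-3), loss='mse', metrics=['mse'])
    return model4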

In [None]:
# Retrieve the data
n_features = 49
print(dm['strain_whiten_r512'])
from sklearn.preprocessing import StandardScaler

input_data=np.array(dm['strain_whiten_r512'].data)
q = StandardScaler ()
chunks = np.reshape(input_data[:(len(input_data)//n_features)*n_features], (-1, n_features))
chunks = q.fit_transform (chunks)
print (chunks.shape)

There are many ways to get the encoded signal, for example re-training the model using as parameters the ones found by GridSearchCV. In this case:

Best: -0.280207 using **{'bias1': 1e-09, 'bias2': 1e-09, 'ker1': 1e-09, 'ker2': 1e-09}**

In [None]:
# rebuild a model
model=baseline_model4(hidden_layers=49, n_features=49)
model.summary()
# fit it
history = model.fit(chunks,chunks, epochs=100, validation_split=0.1, verbose=True)
plt.plot (history.history['loss'], label = "loss")
plt.plot (history.history['val_loss'], label = "val_loss")
plt.legend()
plt.yscale('log')

In this case there is no compression and we expect a rather good agreement

In [None]:
%matplotlib inline
predicted_chunks = model.predict(chunks)
for original, compressed in zip (chunks, predicted_chunks[:5]):
    plt.plot(q.inverse_transform([original])[0], alpha=0.5, label = "Original")
    plt.plot(q.inverse_transform([compressed])[0], label = "Compressed")
    plt.legend()
    plt.ylabel ("Normalized strain")
    plt.xlabel ("Sample")
    plt.show()

In [None]:
# rebuild a model
model_25=baseline_model4(hidden_layers=25, n_features=49)
model_25.summary()
# fit it
history = model_25.fit(chunks,chunks, epochs=100, validation_split=0.1, verbose=True)
plt.plot (history.history['loss'], label = "loss")
plt.plot (history.history['val_loss'], label = "val_loss")
plt.legend()

plt.yscale('log')

In [None]:
%matplotlib inline
predicted_chunks = model_25.predict(chunks)
for original, compressed in zip (chunks, predicted_chunks[:5]):
    plt.plot(q.inverse_transform([original])[0], alpha=0.5, label = "Original")
    plt.plot(q.inverse_transform([compressed])[0], label = "Compressed")
    plt.legend()
    plt.ylabel ("Normalized strain")
    plt.xlabel ("Sample")
    plt.show()

**Note**: Compression at work! 25 Features instead of 49
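
To quantify the trade-off, one can compute the compression ratio together with the reconstruction error. The helpers below are hypothetical names introduced for illustration, and the arrays are synthetic stand-ins for `chunks` and `model_25.predict(chunks)`:

```python
import numpy as np

def compression_ratio(n_features, n_encoded):
    """How many input values are represented by each encoded value."""
    return n_features / n_encoded

def reconstruction_mse(original, reconstructed):
    """Mean squared difference between input and reconstructed chunks."""
    diff = np.asarray(original) - np.asarray(reconstructed)
    return float(np.mean(diff ** 2))

print(compression_ratio(49, 25))

# Synthetic arrays standing in for the real chunks and their reconstruction
rng = np.random.default_rng(0)
original = rng.normal(size=(10, 49))
reconstructed = original + 0.1   # constant offset, so the MSE is about 0.01
print(reconstruction_mse(original, reconstructed))
```

Repeating the measurement for several hidden-layer sizes gives a simple curve of reconstruction error against compression ratio.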

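You can access the weights using

**model.get_weights()** to get the list of all weight arrays (**model.get_weights()[0]** is the first one, the kernel of the first layer)

**model.layers[0].get_weights()[0]** to get the kernel of a single layer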

In [None]:
# for example
# build a model with only the encoding layer
encoder = Sequential()

encoder.add(model_25.layers[0])

encoded = encoder.predict(chunks)
print(encoded)
encoded.shape

# for example
model.get_weights()[0] 
model.layers[0].get_weights()[0]

**We took only 4900 elements in the first exercises: restart with a larger dataset and use the remaining elements as a test set, i.e. try to predict them and see how much the output differs from the input**

## Exercise 6: Save output files and compute their size and their entropy

In [None]:
# to save our model
model.save("test.h5")
# to save our encoded file
# don't forget to delete the items we no longer need before saving to disk
dm.write_gwdama('outputfile')
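
For the second half of this exercise, a possible starting point is sketched below. It assumes the encoded signal is available as a NumPy array (a synthetic `encoded_demo` is used here), saves it in `.npy` format to a temporary directory, and estimates the Shannon entropy from a histogram of the values; the function `shannon_entropy` and the choice of 256 bins are assumptions made for illustration:

```python
import os
import tempfile
import numpy as np

def shannon_entropy(values, bins=256):
    """Shannon entropy in bits of the histogram of `values`."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Synthetic stand-in for the encoded signal: 100 chunks of 25 features
rng = np.random.default_rng(0)
encoded_demo = rng.normal(size=(100, 25)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoded_demo.npy")
    np.save(path, encoded_demo)
    size_bytes = os.path.getsize(path)   # size on disk, header included

print(size_bytes, shannon_entropy(encoded_demo))
```

The same two functions can be applied to the original chunks to compare size and entropy before and after the encoding.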