# MLP Lesson 2

## Google Cloud Storage Boilerplate

The first cell contains boilerplate that connects the Google Cloud Storage bucket holding the data for this tutorial to the Google Colab environment.

To access the data for this workshop, run this cell, follow the link when prompted, and copy the Google SDK token into the prompt. If everything works correctly, a new folder called `data` should appear in the file browser on the left.

In [0]:
%tensorflow_version 2.x
from google.colab import auth
auth.authenticate_user()

project_id = 'sciml-workshop'
bucket_name = 'sciml-workshop'

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

!gcloud config set project {project_id}

!mkdir -p /content/data
!gcsfuse  --implicit-dirs --limit-bytes-per-sec -1 --limit-ops-per-sec -1 {bucket_name} /content/data

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.utils import resample
from tensorflow.keras import Sequential
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Other styles are listed in plt.style.available
plt.style.use('ggplot')

## Bigger data - deeper networks

The last exercise introduced how to build a neural network and tune some of its parameters. Now we move to an example with more data. Generally, as the amount of training data increases, we can make use of deeper networks, with more layers, to give more accurate predictions.

This time we load the data from `ag-muon-data-tight.pkl`.

You can load it in the same way as in the previous notebook. We take only the first 90k examples to speed up training.

In [0]:
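# Keep only the first 90,000 rows to speed up training
df = pd.read_pickle('/content/data/muon/ag-muon-data-tight.pkl').iloc[:90000]
# Column 3 is used as the network input X, column 1 as the target y
X = np.array(df[3].to_list())
y = np.array(df[1].to_list())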

## Class imbalance

Don't forget to take care of class balance in your dataset. Do the same checks and use the `resample` function as you did in the previous notebook. Plot a histogram of the class balance.
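
As a reminder of what such a check might look like, here is a minimal sketch that counts the examples per class and downsamples every class to the size of the smallest one using `resample`. The helper name `balance_by_downsampling` and the synthetic `X_demo`/`y_demo` arrays are made up for illustration, and downsampling (rather than upsampling) is an assumption; use whichever approach you used in the previous notebook.

```python
import numpy as np
from sklearn.utils import resample


def balance_by_downsampling(X, y, seed=0):
    """Downsample each class to the size of the smallest class, then shuffle.

    Hypothetical helper for illustration; not part of the workshop code.
    """
    classes, counts = np.unique(y, return_counts=True)
    n_min = int(counts.min())
    X_parts, y_parts = [], []
    for c in classes:
        mask = (y == c)
        # Draw n_min examples of this class without replacement
        X_c, y_c = resample(X[mask], y[mask], replace=False,
                            n_samples=n_min, random_state=seed)
        X_parts.append(X_c)
        y_parts.append(y_c)
    X_bal = np.concatenate(X_parts)
    y_bal = np.concatenate(y_parts)
    # Shuffle so that the classes are interleaved rather than stacked in blocks
    order = np.random.default_rng(seed).permutation(len(y_bal))
    return X_bal[order], y_bal[order]


# Synthetic stand-in data with roughly 20% positives (not the muon data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.2).astype(int)

counts_before = np.bincount(y_demo)
X_bal, y_bal = balance_by_downsampling(X_demo, y_demo)
counts_after = np.bincount(y_bal)

print("examples per class before:", counts_before)
print("examples per class after: ", counts_after)
```

The per-class counts can be passed to `plt.bar` to draw the histogram. The final shuffle matters here: Keras takes the `validation_split` fraction from the end of the arrays before any shuffling, so class blocks stacked one after another would leave the validation set containing a single class.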

### Initial network

We now have more data, so we can think about making a deeper network. Try out the architecture below. **Note:** this network needs to run for more epochs, as it takes some time to equilibrate, so a little patience is needed. If you like, you can stop it earlier (by running fewer epochs) and then plot the training and validation loss to see how it is doing.

In [0]:
model = Sequential()
model.add(Dense(64, input_dim=1000, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
ad = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss='binary_crossentropy', optimizer=ad, metrics=['accuracy'])
history_bn = model.fit(X, y, epochs=200, batch_size=64, validation_split=0.2)
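
To get a feel for the size of this network, the parameter count can be tallied by hand. The sketch below assumes the layer widths used above (1000 inputs, then 64, 32, 16, 8 and 1 units) and the Keras default `BatchNormalization` settings, where each normalized feature has a trainable scale and offset plus a non-trainable moving mean and variance. The names `dense_params` and `layer_sizes` are introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the width of each Dense layer
layer_sizes = [1000, 64, 32, 16, 8, 1]
dense_total = sum(dense_params(a, b)
                  for a, b in zip(layer_sizes, layer_sizes[1:]))

# Batch normalization acts on the 64 outputs of the first Dense layer
bn_trainable = 2 * 64      # scale (gamma) and offset (beta)
bn_non_trainable = 2 * 64  # moving mean and moving variance

trainable = dense_total + bn_trainable
total = trainable + bn_non_trainable

print("Dense parameters:    ", dense_total)
print("Trainable parameters:", trainable)
print("Total parameters:    ", total)
```

Almost all of the parameters sit in the first layer, because it maps 1000 inputs onto 64 units. Calling `model.summary()` should report the same totals.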

## Plot your training results
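
One possible way to do this is sketched below. The helper `plot_history` and the numbers in `demo_history` are made up for illustration; the helper expects a dictionary of per-epoch values such as the `history` attribute of the object returned by `model.fit`.

```python
import matplotlib.pyplot as plt


def plot_history(history, metric="loss"):
    """Plot a training metric and, if present, its validation counterpart.

    Hypothetical helper for illustration; `history` maps metric names
    to lists with one value per epoch.
    """
    fig, ax = plt.subplots()
    ax.plot(history[metric], label=f"training {metric}")
    val_key = f"val_{metric}"
    if val_key in history:
        ax.plot(history[val_key], label=f"validation {metric}")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
    return ax


# Made-up values standing in for a real training run
demo_history = {
    "loss": [0.69, 0.60, 0.52, 0.47],
    "val_loss": [0.70, 0.63, 0.58, 0.57],
}
ax = plot_history(demo_history)
```

With the model trained above, this would be called as `plot_history(history_bn.history)`, or with `metric="accuracy"` to look at the accuracy instead. A validation curve that flattens or rises while the training curve keeps falling is a sign of overfitting.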