## Lighthouse Labs - Synaptive Medical

### W6D6 Neural Networks

Instructor: Socorro Dominguez  
January 08, 2021

**Agenda:**
- Review: What is DL, comparison to classic ML
- Backpropagation
- Neural network - Demo
    - Epochs and batches

**Deep Learning** is a subfield of *machine learning* concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain.

![img](img/neural_nets.png)

### Why Deep Learning?

![img](img/unstructured_data.png)

**Review: What is a Neural Network?**

We define the function recursively:

$$ x^{(l+1)} = h\left( W^{(l)} x^{(l)} + b^{(l)}\right) $$

where $W^{(l)}$ is a matrix of parameters and $b^{(l)}$ is a vector of parameters.

So what is $x^{(l)}$?
 * $x^{(0)}$ are the inputs
 * $x^{(L)}$ are the outputs, so we can say $\hat{y}=x^{(L)}$
 * we refer to $L-1$ as the _number of hidden layers_

Also: 
 - the $W^{(l)}$ do _not_ need to be square. 
 - the $x^{(l)}$ for $0<l<L$ are "intermediate states"
   - these are called _hidden units_ or _hidden neurons_
   - the _values_ of these units are called _activations_
 - we often refer to the elements of $W$ as "weights" and the elements of $b$ as "biases"
 





![](https://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg)

In the diagrams above, circles are states and arrows carry weights.

Important note: neural nets map from $\mathbb{R}^d\rightarrow \mathbb{R}^k$ for some arbitrary $d$ and $k$. The outputs do not have to be scalars.
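The recurrence above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random parameters, and the choice of ReLU for $h$ are assumptions made for the example, not part of the notes.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU, used here as an example choice of h."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases, h=relu):
    """Apply x^(l+1) = h(W^(l) x^(l) + b^(l)) layer by layer; return every state."""
    states = [x]
    for W, b in zip(weights, biases):
        states.append(h(W @ states[-1] + b))
    return states

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # d = 4 inputs, two hidden layers (5 and 3 units), k = 2 outputs

# W^(l) has shape (units in layer l+1, units in layer l), so it need not be square.
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

states = forward(rng.normal(size=4), weights, biases)
y_hat = states[-1]       # x^(L), a vector in R^2
hidden = states[1:-1]    # the L - 1 = 2 hidden layers and their activations
```

Here $L = 3$, so there are two hidden layers, and the output is a vector rather than a scalar.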

## Activation functions

 - $h$ is called the _activation function_. 
 - Question: why do we need $h$ at all?
 - Answer: without $h$, we are composing a bunch of linear functions, which just leaves us with another linear function.
 - Insight: if $h$ is nonlinear, then increasing the number of "layers" increases the complexity of the overall function. 
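The answer above can be checked numerically. The sketch below, with randomly chosen (assumed) parameters, shows that two layers without an activation function give exactly the same output as one linear layer with $W = W^{(1)}W^{(0)}$ and $b = W^{(1)}b^{(0)} + b^{(1)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
W0, b0 = rng.normal(size=(5, 4)), rng.normal(size=5)
W1, b1 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=4)

# Two layers with no activation function between them ...
two_layer = W1 @ (W0 @ x + b0) + b1

# ... are the same as a single linear layer with merged parameters.
W = W1 @ W0
b = W1 @ b0 + b1
one_layer = W @ x + b

collapses = bool(np.allclose(two_layer, one_layer))  # True
```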
 
In neural networks, we choose $h$ to be an _elementwise_ nonlinear function. i.e.

$$h(x)\equiv\left[\begin{array}{c}h(x_1)\\h(x_2)\\ \vdots \\ h(x_d)  \end{array}\right]$$

Activation functions tend to be continuous, but are [not always smooth or monotonic](https://arxiv.org/pdf/1710.05941.pdf).
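As a small illustration of "elementwise", here are two commonly used activation functions written in NumPy. ReLU and the sigmoid are picked here only as familiar examples.

```python
import numpy as np

def relu(x):
    """ReLU: continuous, but not smooth at 0."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Logistic sigmoid: smooth and monotonic."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])

relu_out = relu(x)        # array([0., 0., 3.])
sigmoid_out = sigmoid(x)  # the middle entry is exactly 0.5

# Elementwise means entry i of h(x) depends only on x[i].
same = all(relu(x)[i] == relu(x[i]) for i in range(len(x)))
```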

#### Batch

What is a batch?

The *batch size* is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

Think of a batch as a for-loop iterating over one or more samples and making predictions. At the end of the batch, the predictions are compared to the expected output variables and an error is calculated. From this error, the update algorithm is used to improve the model, e.g. move down along the error gradient.

A training dataset can be divided into one or more batches.

**Batch Gradient Descent.** Batch Size = Size of Training Set  
**Stochastic Gradient Descent.** Batch Size = 1  
**Mini-Batch Gradient Descent.** 1 < Batch Size < Size of Training Set  
>    In mini-batch GD, popular batch sizes are 32, 64, and 128 samples.
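One way to picture the three cases is a generator that shuffles the training set and yields it one batch at a time. The helper name `iterate_minibatches` and the random data are hypothetical, chosen only for this sketch.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in shuffled order."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Mini-batch: 100 samples with batch_size=32 gives batches of 32, 32, 32 and 4.
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, 32, rng)]

# Batch gradient descent: one batch per pass. Stochastic: one sample per batch.
n_full = sum(1 for _ in iterate_minibatches(X, y, len(X), rng))  # 1
n_sgd = sum(1 for _ in iterate_minibatches(X, y, 1, rng))        # 100
```

In a training loop, the model parameters would be updated once per yielded batch.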

#### Epochs

What are epochs?

- An epoch is an entire pass through the training set.
- With a minibatch size of 1, an epoch is `n` iterations, where `n` is the number of training examples.
- With a general minibatch size, the number of epochs completed is

$$\text{epochs} = \frac{\text{iterations} \times \text{batch size}}{n}$$

Example: if the dataset has $100{,}000$ examples and your minibatch size is $1000$, then an epoch is $100$ iterations of mini-batch gradient descent.
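The arithmetic can be written out as two small helpers. The function names are hypothetical, and rounding up assumes that a final, smaller batch still counts as one iteration.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

def epochs_completed(iterations, n_samples, batch_size):
    """Passes through the data after a given number of iterations."""
    return iterations * batch_size / n_samples

per_epoch = iterations_per_epoch(100_000, 1000)  # 100
done = epochs_completed(100, 100_000, 1000)      # 1.0
uneven = iterations_per_epoch(100, 32)           # 4 (three full batches plus one of 4)
```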

### Differences between Batches and Epochs

* The number of epochs is the number of complete passes through the training dataset.

* The size of a batch must be more than or equal to one and less than or equal to the number of samples in the training dataset.

* The number of epochs can be set to an integer value between one and infinity. You can run the algorithm for as long as you like and even stop it using other criteria besides a fixed number of epochs, such as a change (or lack of change) in model error over time.

* Both are integer values and they are both hyperparameters for the learning algorithm, i.e. parameters for the learning process, not internal model parameters found by the learning process.

* You must specify the batch size and number of epochs for a learning algorithm.

* There are no magic rules for how to configure these parameters. You must try different values and see what works best for your problem.
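One common stopping criterion other than a fixed number of epochs is to stop once the validation error has not improved for a set number of epochs (the "patience"). The sketch below illustrates that idea with made-up loss values. It is a simplified illustration, not the implementation used by any particular library.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
val_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.4]

stop_at = stopping_epoch(val_losses, patience=3)    # 6: epochs 4, 5 and 6 never beat 3.5
never = stopping_epoch(val_losses, patience=4)      # None: epoch 7 improves in time
```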

![img](img/lossfunc.png)

![img](img/GD.png)

![img](img/GD2.png)

# Backpropagation


The squared loss (a not-uncommon choice for regression) is

$$f\left(\{W^{(l)}\}\right)= \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2  $$

Let's digest this...

- by $\{W^{(l)}\}$ I mean the set of all $W$'s and all their elements
- by $\hat{y}_i$ I mean our prediction for example $x_i$, which we get from applying our recurrence relation $L$ times.

We need the gradient of $f$ with respect to the weights of every layer:

$$\frac{\partial f}{\partial W^{(l)}}$$

This is done via the chain rule. But we need to be careful not to _recompute_ things (remember dynamic programming? It was all about not recomputing things).

We can draw a graph of what depends on what. Consider $\frac{\partial f}{\partial W^{(0)}_{11}}$ and $\frac{\partial f}{\partial W^{(0)}_{12}}$. These two derivatives have a lot in common. Both have the form

$$ \frac{\partial f}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial x^{(L-1)}} \cdots \frac{\partial x^{(2)}}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial W^{(0)}}$$

and only the last factor, $\frac{\partial x^{(1)}}{\partial W^{(0)}}$, is different.

- The method for applying the chain rule and not re-computing anything is called **backpropagation** or backprop for short. 
- Backprop is reverse-mode automatic differentiation, so packages like Autograd do it "for free".
- Once we have the gradient, we can train with (stochastic) gradient descent. 
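To make the reuse concrete, here is a hand-written sketch of backprop for the recurrence above, checked against finite differences. Several things are assumptions made for the example: $h = \tanh$ at every layer (chosen because it is smooth), a single training example, random parameters, and bias gradients computed alongside the weight gradients.

```python
import numpy as np

def forward(x, params):
    """Return the states x^(0), ..., x^(L), using h = tanh at every layer."""
    states = [x]
    for W, b in params:
        states.append(np.tanh(W @ states[-1] + b))
    return states

def loss(x, y, params):
    """Squared loss for one example."""
    return np.sum((y - forward(x, params)[-1]) ** 2)

def backprop(x, y, params):
    """Gradients of the loss w.r.t. every (W, b), computed in one backward sweep."""
    states = forward(x, params)
    grad_state = -2.0 * (y - states[-1])  # df/dx^(L)
    grads = []
    layers = zip(reversed(params), reversed(states[:-1]), reversed(states[1:]))
    for (W, b), x_in, x_out in layers:
        # x_out = tanh(z) with z = W x_in + b, so dx_out/dz = 1 - x_out**2.
        grad_z = grad_state * (1.0 - x_out ** 2)
        grads.append((np.outer(grad_z, x_in), grad_z))  # (df/dW, df/db)
        # df/dx_in is computed once and shared by all earlier layers.
        grad_state = W.T @ grad_z
    return grads[::-1]

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
params = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=3), rng.normal(size=2)

grads = backprop(x, y, params)

def numerical_grad_W0(i, j, eps=1e-6):
    """Central finite-difference estimate of df/dW^(0)[i, j]."""
    W0, b0 = params[0]
    W_plus, W_minus = W0.copy(), W0.copy()
    W_plus[i, j] += eps
    W_minus[i, j] -= eps
    f_plus = loss(x, y, [(W_plus, b0)] + params[1:])
    f_minus = loss(x, y, [(W_minus, b0)] + params[1:])
    return (f_plus - f_minus) / (2 * eps)

# The two derivatives discussed above (indices are 0-based in code).
check_11 = bool(np.isclose(grads[0][0][0, 0], numerical_grad_W0(0, 0), atol=1e-6))
check_12 = bool(np.isclose(grads[0][0][0, 1], numerical_grad_W0(0, 1), atol=1e-6))
```

The finite-difference check needs two full forward passes per weight, whereas the backward sweep gets every gradient from one forward and one backward pass.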


![img](img/GD3.png)

![img](img/Backprop.png)

### Demo

# Prepare the data

```python
import pandas as pd

df = pd.read_csv('data/hourly_wages_data.csv')
df.head()
```

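| Unnamed: 0 | wage_per_hour | union | education_yrs | experience_yrs | age | female | marr | south | manufacturing | construction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 0 | 8 | 21 | 35 | 1 | 1 | 0 | 1 | 0 |
| 1 | 4.95 | 0 | 9 | 42 | 57 | 1 | 1 | 0 | 1 | 0 |
| 2 | 6.67 | 0 | 12 | 1 | 19 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4.0 | 0 | 12 | 4 | 22 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7.5 | 0 | 12 | 17 | 35 | 0 | 1 | 0 | 0 | 0 |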


```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['wage_per_hour'])
y = df[['wage_per_hour']]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=27)
```

---
# Train a multi-layer perceptron 

```python
from tensorflow.keras.models import Sequential        # Helper to build a network from a sequence of layers
from tensorflow.keras.layers import Dense             # Fully-connected layer
from tensorflow.keras.callbacks import EarlyStopping  # To stop training early if val loss stops decreasing

# Create the model: two hidden layers of 10 ReLU units and one linear output unit
model = Sequential()
model.add(Dense(10, activation='relu', input_shape=(X.shape[1],)))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))

# Train the model
model.compile(optimizer='adam', loss='mean_squared_error')    # Configure the optimizer and the loss function
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=300, batch_size=32,
          callbacks=[EarlyStopping(patience=3)], verbose=1)
```

```
Epoch 1/300
...
Epoch 96/300
<tensorflow.python.keras.callbacks.History at 0x7fd426077690>
```

---
# Comparison to linear regression

```python
# Create the model: a single linear unit with no hidden layers
regression = Sequential()
regression.add(Dense(1, input_shape=(X.shape[1],)))

# Train the model
regression.compile(optimizer='adam', loss='mean_squared_error')
regression.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=300, batch_size=32,
               callbacks=[EarlyStopping(patience=3)], verbose=1)
```

```
Epoch 1/300
...
Epoch 104/300
<tensorflow.python.keras.callbacks.History at 0x7fd425c07e50>
```