# Hyperparameter tuning, Batch Normalization and Programming Frameworks

## Hyperparameter tuning

### Tuning process

**Hyperparameters**
- learning rate $\alpha$
- momentum term $\beta$
- mini-batch size
- no. of hidden units
- no. of layers
- learning rate decay
- Adam parameters $\beta_1, \beta_2, \epsilon$

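Sample hyperparameter combinations at random rather than on a regular grid, and plot the results to see which regions of the search space perform well.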

### Using an appropriate scale to pick hyperparameters

**Appropriate scale for Learning rate $\alpha$**

$
\text{Sample on a logarithmic scale: } r = \log_{10}(\alpha) \\
\alpha \in [0.0001, 1] \\
r \in [-4, 0] \\
\rightarrow \alpha = 10^r \text{ with } r \text{ drawn uniformly from } [-4, 0]
$
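A minimal sketch of this sampling in Python, using only the standard library. The function name `sample_learning_rate`, the seed, and the sample count are illustrative choices, not part of the original notes.

```python
import random

def sample_learning_rate(rng, low_exp=-4.0, high_exp=0.0):
    """Draw alpha so that log10(alpha) is uniform on [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp)  # r in [-4, 0]
    return 10.0 ** r                    # alpha in [0.0001, 1]

rng = random.Random(0)  # fixed seed, only for reproducibility
alphas = [sample_learning_rate(rng) for _ in range(1000)]

# On a log scale each decade receives roughly a quarter of the samples,
# whereas uniform sampling on [0.0001, 1] would put about 90% in [0.1, 1].
share_small = sum(a < 0.001 for a in alphas) / len(alphas)
print(f"share of samples in [0.0001, 0.001): {share_small:.2f}")
```

Sampling the exponent rather than $\alpha$ itself is what spreads the search evenly across orders of magnitude.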

**Hyperparameters for exponentially weighted averages $\beta$**

$
\beta \in [0.9, 0.999] \\
(1 - \beta) \in [0.001, 0.1] \\
r \in [-3, -1] \\
\rightarrow (1-\beta) = 10^r \rightarrow \beta = 1 - 10^r
$
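The same idea for $\beta$ can be sketched as follows, again with the standard library only. The helper name `sample_beta` and the seed are assumptions made for illustration. The printed quantity $1 / (1 - \beta)$ is the approximate number of past values an exponentially weighted average spans, which is why values of $\beta$ close to 1 need finer sampling.

```python
import random

def sample_beta(rng, low_exp=-3.0, high_exp=-1.0):
    """Draw beta so that log10(1 - beta) is uniform on [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp)  # r in [-3, -1]
    return 1.0 - 10.0 ** r              # beta in [0.9, 0.999]

rng = random.Random(0)  # fixed seed, only for reproducibility
betas = [sample_beta(rng) for _ in range(5)]
for beta in betas:
    # 1 / (1 - beta) is roughly how many past values the average spans
    print(f"beta = {beta:.4f}, averages over ~{1.0 / (1.0 - beta):.0f} values")
```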

### Hyperparameters tuning in practice: Pandas vs. Caviar

- Re-test hyperparameters occasionally
- Pandas approach: babysit one model at a time, adjusting its hyperparameters day by day as it trains; used when there is not enough computational capacity to train many models at once
- Caviar approach: train many models in parallel and pick the one that works best

## Batch Normalization

### Normalizing activations in a network

**Input normalization in logistic regression**

Subtract mean:

$
\mu = \frac{1}{m}\sum_i X^{(i)} \\
X = X - \mu
$

Normalize variance (computed on the mean-subtracted $X$; the square is element-wise):

$
\sigma^2 = \frac{1}{m}\sum_i (X^{(i)})^2 \\
X = \frac{X}{\sigma}
$
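A small NumPy sketch of both steps. It assumes the layout with one column per example, so `X` has shape `(n_x, m)`; the synthetic data and its means and scales are made up for the demonstration. The sketch divides by the standard deviation $\sigma$, which is what gives each feature unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed layout: one column per example, shape (n_x, m) = (2, 1000)
X = rng.normal(loc=[[5.0], [-2.0]], scale=[[3.0], [0.5]], size=(2, 1000))

mu = X.mean(axis=1, keepdims=True)                      # per-feature mean, shape (n_x, 1)
X_centered = X - mu
sigma2 = (X_centered ** 2).mean(axis=1, keepdims=True)  # per-feature variance
X_norm = X_centered / np.sqrt(sigma2)                   # divide by the standard deviation

print(X_norm.mean(axis=1))  # approximately 0 for each feature
print(X_norm.var(axis=1))   # approximately 1 for each feature
```

The same `mu` and `sigma2` computed on the training set should be reused to normalize the test set, so both sets go through an identical transformation.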

**Normalization for hidden layers**

Either $Z$ or $A$ can be normalized, but in practice normalizing $Z$ is much more common.

**Batch Norm**

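Given $z^{(1)}, z^{(2)}, ..., z^{(m)}$ in layer $l$ of the network (the layer index $[l]$ is omitted for brevity):

$
\mu = \frac{1}{m}\sum_i z^{(i)} \\
\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2 \\
z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
\tilde{z^{(i)}} = \gamma z^{(i)}_{norm} + \beta
$

where $\gamma, \beta$ are learnable parameters of the model (this $\beta$ is unrelated to the momentum hyperparameter $\beta$). They let the network choose the mean and variance of the hidden unit values. In particular, batch norm can represent the identity:

$
\text{if: } \gamma = \sqrt{\sigma^2 + \epsilon} \text{ and } \beta = \mu \\
\text{then: } \tilde{z^{(i)}} = z^{(i)}
$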
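The equations above can be sketched in NumPy as follows. The function name `batch_norm_forward`, the `(n_units, m)` layout, and the value of `eps` are assumptions made for this illustration. The last lines check the identity case numerically.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit of Z over the mini-batch, then scale and shift.

    Z has shape (n_units, m); gamma and beta have shape (n_units, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    sigma2 = ((Z - mu) ** 2).mean(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)
    return gamma * Z_norm + beta, mu, sigma2

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
eps = 1e-8

# With gamma = 1 and beta = 0 every unit has mean ~0 and variance ~1.
Z_tilde, mu, sigma2 = batch_norm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)), eps)

# With gamma = sqrt(sigma^2 + eps) and beta = mu the layer is the identity.
Z_identity, _, _ = batch_norm_forward(Z, np.sqrt(sigma2 + eps), mu, eps)
print(np.allclose(Z_identity, Z))
```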

### Fitting Batch Norm into a neural network

$
X \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[Batch \ Norm \ (BN)]{\beta^{[1]}, \gamma^{[1]}} \tilde{Z^{[1]}} \rightarrow A^{[1]} = g^{[1]}(\tilde{Z^{[1]}}) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[Batch \ Norm \ (BN)]{\beta^{[2]}, \gamma^{[2]}} \tilde{Z^{[2]}} \rightarrow A^{[2]} = g^{[2]}(\tilde{Z^{[2]}}) \rightarrow ...
$

Parameters: $W, b, \beta, \gamma$

Using TensorFlow

```python
# using tensorflow
# arguments: x, mean, variance, offset (beta), scale (gamma), variance_epsilon
tf.nn.batch_normalization
```

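Batch norm is normally applied per mini-batch: $\mu$ and $\sigma^2$ are computed from the examples in the current mini-batch.

When using batch norm, we can omit $b^{[l]}$: subtracting the mean cancels any constant added to $z^{[l]}$, and $\beta^{[l]}$ takes over the role of the bias.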

$
z^{[l]} = W^{[l]}a^{[l-1]}
$

The dimension of $z^{[l]}$ is $(n^{[l]} \times 1)$, so $\beta^{[l]}$ and $\gamma^{[l]}$ also have dimension $(n^{[l]} \times 1)$.
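A quick numerical check of why the bias can be dropped, as a NumPy sketch. The layer sizes, batch size, and the helper name `normalize` are hypothetical. Normalizing $Wa + b$ and $Wa$ gives the same result, and parameters of shape $(n^{[l]} \times 1)$ broadcast across the $m$ columns of the mini-batch.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_l, m = 3, 5, 32            # hypothetical layer sizes and batch size
W = rng.normal(size=(n_l, n_prev))
b = rng.normal(size=(n_l, 1))
A_prev = rng.normal(size=(n_prev, m))

def normalize(Z, eps=1e-8):
    """Zero-mean, unit-variance normalization of each row over the batch."""
    mu = Z.mean(axis=1, keepdims=True)
    sigma2 = Z.var(axis=1, keepdims=True)  # population variance (1/m)
    return (Z - mu) / np.sqrt(sigma2 + eps)

with_bias = normalize(W @ A_prev + b)
without_bias = normalize(W @ A_prev)
print(np.allclose(with_bias, without_bias))  # the bias has no effect

gamma = np.ones((n_l, 1))   # shape (n_l, 1), broadcast over the m columns
beta = np.zeros((n_l, 1))
Z_tilde = gamma * without_bias + beta
print(Z_tilde.shape)
```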

**Implementing gradient descent**

$
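\text{for } t = 1, ..., \text{no. of mini-batches:} \\
\quad \text{Compute forward prop on } X^{\{t\}} \\
\quad \quad \text{In each hidden layer, use BN to replace } Z^{[l]} \text{ with } \tilde{Z^{[l]}} \\
\quad \text{Use backprop to compute } dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]} \ (db^{[l]} \text{ is not needed once } b^{[l]} \text{ is omitted}) \\
\quad \text{Update parameters: } W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}, \ \beta^{[l]} := \beta^{[l]} - \alpha \, d\beta^{[l]}, \ \gamma^{[l]} := \gamma^{[l]} - \alpha \, d\gamma^{[l]}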
$

The same procedure also works with momentum, RMSProp, Adam, etc. in place of the plain gradient descent update.

### Why does Batch Norm work?

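- Covariate shift: batch norm limits how much the distribution of a hidden layer's inputs shifts when the parameters of earlier layers are updated, which makes later layers easier to train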

### Batch Norm at test time

## Multi-class classification

### Softmax regression

$
C = no. \ of \ classes \\
n^{[L]} = C
$

**Softmax layer**

$
z^{[L]} = W^{[L]}a^{[L-1]} + b^{[L]} \\
\text{Activation function:} \\
t = e^{(z^{[L]})} \quad \text{(element-wise)} \\
a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j} = \frac{e^{(z^{[L]}_i)}}{\sum_{j=1}^{C} e^{(z^{[L]}_j)}}
$
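A minimal NumPy sketch of the softmax activation. The input vector is an illustrative $z^{[L]}$ with $C = 4$. Note one deliberate addition that is not in the formula above: the maximum is subtracted before exponentiating, which leaves the result unchanged but avoids overflow for large inputs.

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis (axis 0) of z, shape (C,) or (C, m)."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    t = np.exp(z - z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)

z = np.array([5.0, 2.0, -1.0, 3.0])  # illustrative z^[L] with C = 4
a = softmax(z)
print(a)        # four probabilities, largest for the first class
print(a.sum())  # sums to 1 up to rounding
```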

### Training a softmax classifier
...

## Programming frameworks

### Deep learning frameworks
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- Tensorflow
- Theano
- Torch/PyTorch

**Choosing deep learning frameworks:**

- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)

### Tensorflow

**Motivating problem**

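We want to minimize a cost function such as $J(w) = w^2 - 10w + 25 = (w - 5)^2$. The final version of the code below feeds the coefficients in through a placeholder and uses $[1, -20, 100]$, i.e. $J(w) = w^2 - 20w + 100 = (w - 10)^2$, whose minimum is at $w = 10$.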

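```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)
```

```python
# Coefficients of the quadratic cost: J(w) = 1*w^2 - 20*w + 100 = (w - 10)^2
coefficients = np.array([[1.], [-20.], [100.]])

w = tf.Variable(0, dtype=tf.float32)    # parameter to optimize
x = tf.placeholder(tf.float32, [3, 1])  # coefficients are fed in at run time

# Equivalent ways to write the fixed cost (w - 5)^2:
# cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
# cost = w**2 - 10*w + 25
# cost = (w-5)**2
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]

train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))
```

```
0.0
```

```python
# only forward prop is defined; TensorFlow derives backprop automatically
session.run(train, feed_dict={x: coefficients})
print(session.run(w))
```

Output as recorded in the notebook (a single step from $w = 0$ with learning rate 0.01 would give $w = 0.2$; the recorded value suggests the cell was re-run after the training loop below):

```
9.99998
```

```python
for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
print(session.run(w))
```

```
9.99998
```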
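As a cross-check that does not depend on the TensorFlow 1.x session API, the same updates can be written by hand in plain Python. This is only a sketch: the derivative is worked out manually here, whereas TensorFlow computes it automatically, and the helper name `grad` is introduced for illustration.

```python
def grad(w, coefficients=(1.0, -20.0, 100.0)):
    """dJ/dw for J(w) = a*w**2 + b*w + c."""
    a, b, _ = coefficients
    return 2.0 * a * w + b

learning_rate = 0.01
w = 0.0

w -= learning_rate * grad(w)  # one step, like a single run of `train`
w_after_one_step = w
print(w_after_one_step)

for _ in range(1000):         # 1000 more steps, like the loop
    w -= learning_rate * grad(w)
print(w)
```

Each step shrinks the distance to the minimum at $w = 10$ by a factor of $0.98$, so the iterates converge towards 10.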
