# Tuning Process

There are many hyperparameters in training a neural network (NN).
Examples:
- $\alpha$, the learning rate
- $\beta$, the momentum term
- $\beta_1$, $\beta_2$, and $\epsilon$ from the Adam algorithm
- number of layers
- number of hidden units
- learning rate decay
- ...
- mini-batch size

The most important is $\alpha$, followed by $\beta$, the mini-batch size, and the number of hidden units.

#### Setting values for hyperparameters
Early on, uniform sampling of the hyperparameter space on a regular grid was used.  
Nowadays, random sampling of the hyperparameter space is advised.  
Then select a smaller region around the best points and sample it again more densely.
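
The coarse-to-fine random search described above can be sketched in plain Python. This is only an illustration under stated assumptions: the hyperparameter names `h1` and `h2`, the `toy_score` function (a stand-in for a real validation score), and the shrink factor of 0.25 are all hypothetical and do not come from the notes.

```python
import random

def sample_points(bounds, n, rng):
    """Draw n random points; `bounds` maps a name to its (low, high) range."""
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n)
    ]

def shrink_bounds(bounds, center, factor=0.25):
    """Return a box `factor` times as wide around `center`, clipped to `bounds`."""
    new_bounds = {}
    for name, (low, high) in bounds.items():
        half_width = 0.5 * factor * (high - low)
        new_bounds[name] = (
            max(low, center[name] - half_width),
            min(high, center[name] + half_width),
        )
    return new_bounds

def toy_score(point):
    """Hypothetical stand-in for a validation score (higher is better)."""
    return -((point["h1"] - 0.3) ** 2 + (point["h2"] - 0.7) ** 2)

rng = random.Random(0)
bounds = {"h1": (0.0, 1.0), "h2": (0.0, 1.0)}

# Step 1: coarse random search over the full space.
coarse = sample_points(bounds, 25, rng)
best_coarse = max(coarse, key=toy_score)

# Step 2: zoom in on the best point and sample the smaller region again.
fine_bounds = shrink_bounds(bounds, best_coarse)
fine = sample_points(fine_bounds, 25, rng)
best_fine = max(fine + [best_coarse], key=toy_score)
```

In a real search, `toy_score` would be replaced by training a model and evaluating it on a validation set.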

## Appropriate scale to choose hyperparameters

For a small range, uniform sampling is sufficient.  
For a range spanning several orders of magnitude, e.g., $\alpha\in(10^{-5},1)$, sampling on a logarithmic scale is better.  
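
To see why the logarithmic scale matters for a range such as $(10^{-5},1)$, here is a minimal sketch that compares the two ways of sampling. The helper name `sample_log_uniform` and the sample count are assumptions made for this example.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample a value whose log10 is uniform on [log10(low), log10(high)]."""
    a = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** a

rng = random.Random(0)
n = 10_000

# Log-scale sampling: each decade gets roughly the same number of samples.
alphas = [sample_log_uniform(1e-5, 1.0, rng) for _ in range(n)]
share_log = sum(a < 1e-4 for a in alphas) / n

# Plain uniform sampling: almost nothing lands in the smallest decade.
uniform = [rng.uniform(1e-5, 1.0) for _ in range(n)]
share_uniform = sum(u < 1e-4 for u in uniform) / n
```

With five decades in the range, about one fifth of the log-scale samples fall below $10^{-4}$, whereas uniform sampling spends about 90% of its samples between 0.1 and 1.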

#### Hyperparameters for exponentially weighted averages
$\beta \in (0.9, 0.99999)$. Consider $1-\beta$ instead and sample it on a logarithmic scale: draw the exponent $a$ uniformly at random, e.g. $a \in [-5,-1]$, and set $1-\beta=10^a$.
> Note the large sensitivity to $\beta$ as $\beta\rightarrow1$, so denser sampling there is preferred.
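
A small sketch of this sampling scheme, together with a rough illustration of the sensitivity near 1. The function names are hypothetical, and `averaging_window` uses the common rule of thumb that an exponentially weighted average covers about $1/(1-\beta)$ steps.

```python
import random

def sample_beta(rng, exp_low=-5.0, exp_high=-1.0):
    """Sample beta so that 1 - beta is log-uniform on [10**exp_low, 10**exp_high]."""
    a = rng.uniform(exp_low, exp_high)  # uniform exponent
    return 1.0 - 10.0 ** a

def averaging_window(beta):
    """Rough number of past steps an exponentially weighted average covers."""
    return 1.0 / (1.0 - beta)

rng = random.Random(0)
betas = [sample_beta(rng) for _ in range(1000)]

# The same change of 0.0005 in beta has very different effects:
window_low = (averaging_window(0.9000), averaging_window(0.9005))   # about 10 in both cases
window_high = (averaging_window(0.9990), averaging_window(0.9995))  # about 1000 vs about 2000
```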

#### Re-test hyperparameters occasionally

- Babysitting one model (change parameters after each epoch/day)
- Training many models in parallel

Which approach to use depends on the available computational resources.


## Batch normalization 

Recall that normalizing the input features can speed up learning:

$X := (X-\mu)/\sigma$

For a deep NN, the intermediate activations $a^{[i]}$ can also be normalized for faster training.

`Batch normalization` normalizes $Z^{[i]}$ for **some** units in **some** hidden layers:

$$
z_{\rm norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}
$$

This gives the normalized values mean $\mu=0$ and unit variance $\sigma^2=1$.
> However, the hidden units should not always have mean 0 and variance 1.
> For some activation functions it is better to have variance $\neq1$.

When a different distribution is desirable, use
$$
\tilde{z}^{(i)} = \gamma z_{\rm norm}^{(i)} + \beta
$$
where $\gamma$ and $\beta$ are learnable parameters of the model, trained via e.g. gradient descent (this $\beta$ is unrelated to the momentum $\beta$).  
These parameters allow the mean and variance of $\tilde{z}$ to take any value.
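
The two formulas above can be sketched for a single hidden unit over one mini-batch. This is a minimal illustration, not a full layer implementation: the function name, the example values, and the default `eps` are assumptions.

```python
import math

def batch_norm_unit(z, gamma, beta, eps=1e-8):
    """Normalize the values z of ONE hidden unit over a mini-batch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m                                # mini-batch mean
    var = sum((v - mu) ** 2 for v in z) / m        # mini-batch variance
    z_norm = [(v - mu) / math.sqrt(var + eps) for v in z]
    z_tilde = [gamma * v + beta for v in z_norm]   # learnable scale and shift
    return z_norm, z_tilde

def mean(values):
    return sum(values) / len(values)

def variance(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

z = [1.0, 2.0, 4.0, 9.0]
z_norm, z_tilde = batch_norm_unit(z, gamma=3.0, beta=-2.0)
# z_norm has mean ~0 and variance ~1;
# z_tilde has mean ~beta = -2 and variance ~gamma**2 = 9.
```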

### Fitting Batch Norm into a NN

Recall that each unit of a NN first computes $z_{i}^{[l]}$ and then applies the activation function to obtain $a_{i}^{[l]}$.  

With batch norm (BN) applied, the forward pass becomes

$$
\begin{aligned}
X &\overbrace{\rightarrow}^{w^{[1]},b^{[1]}} Z^{[1]} \underbrace{\overbrace{\rightarrow}^{\beta^{[1]},\gamma^{[1]}}}_{\text{BN}} \tilde{Z}^{[1]} \rightarrow a^{[1]}=g^{[1]}(\tilde{z}^{[1]}) \\
&\overbrace{\rightarrow}^{w^{[2]},b^{[2]}} Z^{[2]} \underbrace{\overbrace{\rightarrow}^{\beta^{[2]},\gamma^{[2]}}}_{\text{BN}} \tilde{Z}^{[2]} \rightarrow a^{[2]}=g^{[2]}(\tilde{z}^{[2]})
\end{aligned}
$$

using normalized values.  
The parameters of the NN are $W^{[l]}$, $b^{[l]}$, $\beta^{[l]}$, $\gamma^{[l]}$, all learned using gradient descent.  

Usually batch norm is applied with mini-batches, as shown above: for each mini-batch separately, the mean and variance, and hence $\tilde{z}^{[l]}$, are computed from that mini-batch alone.

Note that because the mean of $z^{[l]}$ is subtracted during normalization, the constant $b^{[l]}$ is removed automatically (its role is taken over by $\beta^{[l]}$). So the actual list of parameters is
$W^{[l]}$, $\beta^{[l]}$, $\gamma^{[l]}$

The dimensions of these parameters are $\beta^{[l]}: (n^{[l]},1)$ and $\gamma^{[l]}: (n^{[l]},1)$.
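
The claim that $b^{[l]}$ drops out can be checked numerically for one unit. In this sketch the weights, the previous-layer activations, and the two bias values are made-up illustration data.

```python
import math

def pre_activation(w, a_prev_batch, b):
    """z = w . a_prev + b for one unit, for each example in the mini-batch."""
    return [sum(w_k * a_k for w_k, a_k in zip(w, a_prev)) + b for a_prev in a_prev_batch]

def normalize(z, eps=1e-8):
    """Subtract the mini-batch mean and divide by the mini-batch standard deviation."""
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var + eps) for v in z]

w = [0.5, -1.0]
a_prev_batch = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]

z_norm_without_b = normalize(pre_activation(w, a_prev_batch, b=0.0))
z_norm_with_b = normalize(pre_activation(w, a_prev_batch, b=5.0))

# Any constant b shifts z and its mean by the same amount, so it cancels.
max_gap = max(abs(p - q) for p, q in zip(z_norm_without_b, z_norm_with_b))
```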

### Why does batch norm work

Batch norm normalizes the inputs to the hidden units.  
This makes the performance of deeper layers more robust to changes in earlier layers.  

> `Covariate shift`: a change in the distribution of the input data (it requires retraining the model)

A given hidden layer gets $a^{[i]}$ as input. However, unlike the first layer of the NN, which takes the **constant** $X$ as input, the $a^{[i]}$ values **always change** during training. This introduces **covariate shift**. 

> _Batch norm_ helps reduce the covariate shift for deeper layers of a NN: the distribution of hidden values does not change as much. 

**Overall**, the NN becomes more stable. Each layer can learn more independently of the other layers. 

> Batch norm has a small regularization effect

This is because each mini-batch is scaled by the mean/variance computed on just that mini-batch. 
This adds some noise to the values $z^{[l]}$ within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations, which leads to a regularization effect. 

**However**, it should not be relied on as a regularizer!






### Batch norm at test time

**Note** Batch norm processes training data __one mini-batch at a time__. At test time we sometimes process __one example at a time__. 

For computing $\mu$ and $\sigma^2$, all training examples in a given mini-batch are used. This is, however, not possible at test time, when only one example enters. 

Other estimates of $\mu$ and $\sigma^2$ are needed.  
One option is an **exponentially weighted average** across all mini-batches: each mini-batch yields its own *training* $\mu$ and $\sigma^2$, and a running average of these is kept for each layer during training and then used at test time. 
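
A minimal sketch of such a running average for one hidden unit. The class name, the `decay` value of 0.9, and the synthetic Gaussian mini-batches are assumptions for illustration; real frameworks differ in details such as initialization and bias correction.

```python
import math
import random

class RunningBatchStats:
    """Exponentially weighted average of mini-batch mean and variance for one unit."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.mu = 0.0
        self.var = 1.0

    def update(self, z_batch):
        """Call once per training mini-batch."""
        m = len(z_batch)
        mu = sum(z_batch) / m
        var = sum((v - mu) ** 2 for v in z_batch) / m
        self.mu = self.decay * self.mu + (1.0 - self.decay) * mu
        self.var = self.decay * self.var + (1.0 - self.decay) * var

    def normalize_one(self, z, gamma=1.0, beta=0.0, eps=1e-8):
        """Batch norm for a single test example, using the running estimates."""
        z_norm = (z - self.mu) / math.sqrt(self.var + eps)
        return gamma * z_norm + beta

rng = random.Random(0)
stats = RunningBatchStats(decay=0.9)
for _ in range(500):
    # Synthetic mini-batch of 32 values with mean 5 and standard deviation 2.
    z_batch = [rng.gauss(5.0, 2.0) for _ in range(32)]
    stats.update(z_batch)

z_tilde_test = stats.normalize_one(7.0)  # a single test example
```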


# Multiclass classification

### Soft-max regression

When there are **many possible classes** to identify, **soft-max regression** can be used.

> Softmax is a generalization of logistic regression to the case of more than two classes

Let $C$ be the number of classes. The output layer then has dimension $C$. 

In this final layer we compute the usual $z^{[L]} = w^{[L]} a^{[L-1]} + b^{[L]}$.  

The activation function is

$a_i^{[L]}=\frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C}t_j}$

where $t_j = e^{z_j^{[L]}}$.

This is essentially a normalization across the possible outcomes. It takes a vector of shape $(C,1)$ and returns a vector of shape $(C,1)$. 
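
As a sketch, the softmax activation for a single example with $C=4$ can be written in plain Python. The example scores are illustrative, and subtracting the maximum before exponentiating is an added numerical-stability step that is not part of the notes; it does not change the result.

```python
import math

def softmax(z):
    """Softmax of a list of C scores; returns C probabilities that sum to 1."""
    z_max = max(z)                          # stability trick, cancels in the ratio
    t = [math.exp(v - z_max) for v in z]    # t_j proportional to e^{z_j}
    total = sum(t)
    return [t_j / total for t_j in t]

z_L = [5.0, 2.0, -1.0, 3.0]
a_L = softmax(z_L)
# a_L is approximately [0.842, 0.042, 0.002, 0.114]
```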

### Training a softmax classifier

**Note** there is also a _hard max_ function which, just like one-hot encoding, returns for 4 classes a vector such as $[1,0,0,0]^T$ (1 at the position of the largest entry, 0 elsewhere). The _soft max_ instead allows values $\in(0,1)$, the probabilities, for each entry. 

> If $C=2$ the softmax reduces to logistic regression
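
This reduction can be checked numerically: for two classes, the first softmax output equals the logistic sigmoid of the difference of the two scores. The test scores below are arbitrary illustration values.

```python
import math

def softmax(z):
    """Softmax of a list of scores (max subtracted for numerical stability)."""
    z_max = max(z)
    t = [math.exp(v - z_max) for v in z]
    total = sum(t)
    return [t_j / total for t_j in t]

def sigmoid(x):
    """Logistic function used in binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-x))

pairs = [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0)]
max_gap = max(abs(softmax([z1, z2])[0] - sigmoid(z1 - z2)) for z1, z2 in pairs)
```

Since only the difference $z_1 - z_2$ matters, one of the two outputs is redundant, which is why logistic regression needs a single output unit.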

**Loss function** for soft-max is 

$$
\mathcal{L}(\hat{y},y) = -\sum_{j=1}^C y_j\log(\hat{y}_j)
$$

It looks at the ground truth in your dataset and tries to make the corresponding probability as large as possible.  
This is equivalent to **maximum likelihood estimation**.  
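
A short sketch of this loss for one example with a one-hot label. The two prediction vectors are made-up values chosen to show that the loss only depends on the probability assigned to the true class.

```python
import math

def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 contribute nothing."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

y = [0, 1, 0, 0]                                        # true class is the second one
loss_poor = cross_entropy([0.3, 0.2, 0.1, 0.4], y)      # -log(0.2), about 1.61
loss_good = cross_entropy([0.05, 0.9, 0.03, 0.02], y)   # -log(0.9), about 0.105
```

Minimizing the loss therefore pushes the predicted probability of the true class towards 1.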

The cost function is just the average of the loss over the training examples, as before.  

In the forward step, this layer applies the softmax activation.  
In backprop, the gradient with respect to $z^{[L]}$ is

$$
dz^{[L]} = \hat{y} - y
$$
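
The expression $dz^{[L]} = \hat{y} - y$ can be sanity-checked against a finite-difference gradient of the loss. This is a sketch for one example; the scores, the label, and the step size `h` are assumptions for illustration.

```python
import math

def softmax(z):
    z_max = max(z)
    t = [math.exp(v - z_max) for v in z]
    total = sum(t)
    return [t_j / total for t_j in t]

def cross_entropy(y_hat, y):
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

def loss_from_scores(z, y):
    """Loss as a function of the pre-activation scores z^{[L]}."""
    return cross_entropy(softmax(z), y)

z = [5.0, 2.0, -1.0, 3.0]
y = [0, 1, 0, 0]

# Analytic gradient from the formula above.
analytic = [p_j - y_j for p_j, y_j in zip(softmax(z), y)]

# Numerical gradient by central differences.
h = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[j] += h
    z_minus[j] -= h
    numeric.append((loss_from_scores(z_plus, y) - loss_from_scores(z_minus, y)) / (2 * h))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```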


# Deep learning frameworks

Currently, the following frameworks exist:
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- TensorFlow
- Theano
- Torch

### TensorFlow

TensorFlow takes the cost function and constructs a **computation graph** for forward prop. From this graph it automatically derives the backprop steps.

