# Mini-batch Gradient Descent

Training on large data sets is slow, so faster optimization algorithms are needed.

Vectorization is one of the optimization tools.

Processing the entire data set on each pass of gradient descent is slow for very large training sets. The training set can instead be split into `mini-batches`, with $X^{\{i\}}$ holding the training data and $Y^{\{i\}}$ the labels of the $i$-th mini-batch.  
Recall the index notation:
- $(i)$ is for training example 
- $[i]$ is for layer
- $\{i\}$ is for mini-batch 

**Batch gradient descent** processes the entire data set at each step.  
**Mini-batch gradient descent** processes one mini-batch at each step.

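In the algorithm, we loop over the mini-batches. For each mini-batch we run a vectorized forward pass over its examples, compute the cost function $J$ (with regularization) for that mini-batch, run backprop, and update the weights. Because each update starts from the weights left by the previous mini-batch, the mini-batches are processed one after another; the parallelism comes from vectorizing the computation within a mini-batch. One full pass over all mini-batches constitutes one `epoch` of training.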

**Note** with the data set split into $N$ mini-batches, one epoch performs $N$ gradient descent steps, whereas batch gradient descent performs only one step per epoch.
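
A minimal sketch of the splitting step, assuming NumPy arrays with one training example per column. The names `make_mini_batches`, `batch_size` and `seed` are illustrative, and shuffling before the split is an assumption (a common practice, not something stated in these notes).

```python
import numpy as np


def make_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the columns of X and Y together and split them into mini-batches.

    Assumes X has shape (n_features, m) and Y has shape (1, m),
    i.e. one training example per column.
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)          # same shuffle for X and Y
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]

    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size       # slicing clips the last, smaller batch
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches


# Toy data: 3 features, m = 10 examples, mini-batches of 4 -> sizes 4, 4, 2.
X = np.arange(30, dtype=float).reshape(3, 10)
Y = np.arange(10, dtype=float).reshape(1, 10)
batches = make_mini_batches(X, Y, batch_size=4)

# One epoch = one pass over all mini-batches = len(batches) parameter updates.
steps_per_epoch = len(batches)
```

In a training loop, each `(X_batch, Y_batch)` pair would go through forward prop, cost, backprop and a weight update before the next pair is touched.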

In _batch gradient descent_ the overall cost is expected to go down on every iteration.  
In _mini-batch gradient descent_ each iteration trains on a **different mini-batch**, so the cost may increase or decrease from one iteration to the next. But it **should trend downwards**.

### Choosing mini-batch size

If size = 1 (one example) this is `stochastic gradient descent`.  
Its path is very noisy and it never fully converges; it keeps oscillating around the minimum.

In practice the size is $>1$ and $<m$, with $m$ being the data set size. 

In stochastic gradient descent the speed-up from vectorization is lost, because each step processes a single example.

For a small training set, there is no need to use mini-batch gradient descent.  

For a large training set, it is good to use powers of two, e.g., 64, 128, 512, ..., to get a possible speed-up due to memory layout.

Make sure that each mini-batch fits in CPU/GPU memory; otherwise performance degrades drastically.

### Exponentially Weighted (moving) Averages 

Exponentially weighted averages are a building block of the optimization algorithms that follow.

Consider time-series data $\theta_t$. The exponentially weighted average $v_t$ is given by

$$
v_t = \beta v_{t-1} + (1-\beta)\theta_t
$$

where $\beta\sim0.9$. 

The interpretation here is that $v_t$ averages over approximately $1/(1-\beta)$ values of $\theta$. So for $\beta = 0.9$ it averages over the last $10$ values.

The closer $\beta$ is to $1$, the larger the number of values being averaged and the more the resulting curve is shifted relative to the data. It introduces __latency__.

### Understanding Exponentially Weighted (moving) Averages

Note that $v_m$ for the last element depends recursively on every previous value. Unrolled, it is a weighted sum of the $\theta_i$ whose weights are successive powers of $\beta$, i.e. an _exponentially decaying function_ of how far back in time a value lies. The most recent term is the most important and older terms have _exponentially decreasing_ importance. The decay time follows from $(1-\epsilon)^{1/\epsilon} \approx 1/e$, where $\epsilon=1-\beta$. This is where the **exponential** part comes from: after $1/(1-\beta)$ timesteps, the weight has decayed by a factor of about $e$ (_one e-fold_).
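
As a worked expansion of the recursion (taking $v_0 = 0$, which is an assumption about the initialization), three steps and the general term read:

$$
v_3 = (1-\beta)\theta_3 + \beta(1-\beta)\theta_2 + \beta^2(1-\beta)\theta_1, \qquad
v_t = (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\,\theta_{t-k}
$$

For $\beta = 0.9$ the weight of a value ten steps back is scaled by $0.9^{10}\approx 0.35$, close to $1/e\approx 0.37$.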

The implementation is **very** computationally simple, as the same variable can be overwritten at every step. This is very memory efficient: only one value is stored. An explicit window average, by contrast, has to keep the whole window in memory.
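
A minimal sketch of the running update in plain Python, assuming the average starts from zero; the function name `ewa` is illustrative.

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average of a sequence, starting from v = 0."""
    v = 0.0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # overwrite the single running value
        averages.append(v)
    return averages


# Constant series of ones: the average climbs towards 1.0 from its zero start.
smoothed = ewa([1.0] * 50, beta=0.9)
```

The list `averages` is kept only to inspect the curve; the update itself needs just the single variable `v`.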

### Bias correction in EWA 

Bias is introduced by the initialization $v_0 = 0$: the first few averages are weighted towards zero and underestimate the data (this is a weighted moving average, not a window average). This can be addressed.

Solution: normalize, i.e. use $v_t/(1-\beta^t)$ instead of $v_t$.

> Bias correction is especially important for early-time data (when the weighted average has not had time to 'warm up').

It is rarely implemented in practice, as after several iterations the algorithm has seen enough data for the bias to become negligible.
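
A small sketch of the correction, assuming a zero start for the running average; `ewa_bias_corrected` is an illustrative name. The corrected value is computed separately, so the running average itself stays uncorrected.

```python
def ewa_bias_corrected(values, beta=0.9):
    """Exponentially weighted average with the 1 / (1 - beta**t) correction."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta        # uncorrected running average
        corrected.append(v / (1 - beta ** t))    # correction applied to a copy
    return corrected


# Without correction the first average of a series of 5.0s is only about 0.5.
raw_first = (1 - 0.9) * 5.0
# With correction every entry stays at about 5.0, even at t = 1.
corrected = ewa_bias_corrected([5.0, 5.0, 5.0], beta=0.9)
```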

### Gradient descent with momentum

> Idea: compute the exponentially weighted average of gradients and use it to update the weights instead. 

The topology of the cost function may favour motion in a specific direction, while the gradient oscillates in the others. Averaging the gradients smooths out these oscillations.

For `mini-batch` gradient descent, the update for each batch looks like:
$$
V_{dw} = \underbrace{\beta}_{\text{friction}} \underbrace{V_{dw}}_{\text{velocity}} + (1-\beta) \underbrace{dw}_{\text{acceleration}} \\
V_{db} = \beta V_{db} + (1-\beta) db \\
w := w - \alpha V_{dw} \\
b := b-\alpha V_{db}
$$

This algorithm allows the descent to follow a _more straightforward_ path.

The derivatives here _provide acceleration_, the $V$ terms play the role of _velocity_, and $\beta$ acts like _friction_.

> Consider an analogy of a ball rolling down the hill with acceleration and momentum and friction. 

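Two hyperparameters are introduced: $\beta$ (typically $0.9$) and $\alpha$.  
Sometimes the factor $1-\beta$ is dropped from $V_{db} = \beta V_{db} + (1-\beta) db$, giving $V_{db} = \beta V_{db} + db$; the learning rate $\alpha$ then has to absorb it, i.e. be rescaled by $1-\beta$.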
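
A minimal sketch of the momentum update, using the form that keeps the $(1-\beta)$ factor. The helper `momentum_step` and the quadratic test cost are hypothetical and serve only as an illustration; in a network, `dw` would come from backprop on the current mini-batch.

```python
import numpy as np


def momentum_step(w, v, dw, alpha=0.1, beta=0.9):
    """One momentum update; returns the new parameters and velocity."""
    v = beta * v + (1 - beta) * dw   # exponentially weighted average of gradients
    w = w - alpha * v                # step along the averaged gradient
    return w, v


# Hypothetical test problem: J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2),
# an elongated bowl whose gradient is [w[0], 25 * w[1]].
curvature = np.array([1.0, 25.0])
w = np.array([10.0, 1.0])
v = np.zeros_like(w)                 # velocity starts at zero
for _ in range(500):
    dw = curvature * w
    w, v = momentum_step(w, v, dw)
```

The same function applies unchanged to $b$ and $V_{db}$, since the update is element-wise.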




### Root mean square propagation (RMSprop)

This is another algorithm to speed up gradient descent. It damps the steps along directions in which the gradient oscillates strongly, so that the descent progresses more quickly along the direction that leads towards the minimum.

In this algorithm, 

$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2) dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2) db^2 \\
w := w - \alpha \frac{dw}{\sqrt{S_{dw}}+\epsilon} \\
b := b- \alpha \frac{db}{\sqrt{S_{db}}+\epsilon}
$$

where $^2$ is done element-wise and $\epsilon$ is added for numerical stability.  
In the illustrative case where the oscillations are along $b$, $\sqrt{S_{dw}}$ is expected to be small, while $\sqrt{S_{db}}$ is big and will **slow down** the updates of $b$. This expectation comes from the fact that derivatives are larger in the direction of a larger slope. So the algorithm **slows down** in the direction of a larger slope.

This also allows the use of a _larger learning rate_.
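
A minimal sketch of one RMSprop step. The helper `rmsprop_step` and its default values are illustrative assumptions, not values fixed by these notes. The example feeds in a gradient that is 100 times larger in the second coordinate, to show how dividing by $\sqrt{S}$ evens out the step sizes.

```python
import numpy as np


def rmsprop_step(w, s, dw, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop update; returns the new parameters and squared-gradient average."""
    s = beta2 * s + (1 - beta2) * dw ** 2        # element-wise square
    w = w - alpha * dw / (np.sqrt(s) + eps)      # large-gradient directions are scaled down
    return w, s


# Gradient that is 100x larger in the second coordinate.
w = np.array([1.0, 1.0])
s = np.zeros_like(w)
dw = np.array([0.1, 10.0])
w_new, s_new = rmsprop_step(w, s, dw)
step = w - w_new                                 # nearly equal in both coordinates
```

With plain gradient descent the two steps would differ by the same factor of 100 as the gradients.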

### Adaptive moment estimation (Adam) optimization algorithm

Adam is a combination of RMSprop and gradient descent with momentum.

Initialization includes $V_{dw}=0$, $V_{db}=0$, $S_{dw}=0$ and $S_{db}=0$.

At each iteration, compute $dw$ and $db$ on the current __mini-batch__, and then do the momentum-like update with $\beta_1$ and the RMS-like update with $\beta_2$, **including** the bias correction $V_{dw}^{\rm corrected} = V_{dw} / (1-\beta_1^t)$. The full set of equations starts with the momentum-like update with $\beta_1$:

$$
V_{dw} = \beta_1 V_{dw} + (1-\beta_1) dw \\
V_{db} = \beta_1 V_{db} + (1-\beta_1) db \\
V_{dw}^{\rm corrected} = V_{dw} / (1-\beta_1^t) \\
V_{db}^{\rm corrected} = V_{db} / (1-\beta_1^t)
$$

RMS-like update with $\beta_2$:
$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2) dw^2 \\
S_{db} = \beta_2 S_{db} + (1-\beta_2) db^2 \\
S_{dw}^{\rm corrected} = S_{dw} / (1-\beta_2^t) \\
S_{db}^{\rm corrected} = S_{db} / (1-\beta_2^t)
$$

And the final update is
$$
w := w - \alpha \frac{V_{dw}^{\rm corrected}}{\sqrt{S_{dw}^{\rm corrected}}+\epsilon} \\
b := b- \alpha \frac{V_{db}^{\rm corrected}}{\sqrt{S_{db}^{\rm corrected}}+\epsilon}
$$

The hyperparameters: $\alpha$ needs to be tuned, $\beta_1\approx0.9$, $\beta_2\approx0.999$ and $\epsilon\approx10^{-8}$.
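
A minimal sketch of one Adam step that follows the equations above. The helper `adam_step` and the quadratic test cost are hypothetical, and the learning rate of `0.1` in the example is chosen only for this toy problem. The bias-corrected values are kept in separate variables so that the running averages are not overwritten.

```python
import numpy as np


def adam_step(w, v, s, dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw             # momentum-like average
    s = beta2 * s + (1 - beta2) * dw ** 2        # RMSprop-like average
    v_corrected = v / (1 - beta1 ** t)           # bias correction; v itself stays uncorrected
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return w, v, s


# Hypothetical test problem: minimise J(w) = 0.5 * sum(w**2), whose gradient is w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
s = np.zeros_like(w)

# At t = 1 the corrected ratio is about sign(dw), so each weight moves by about alpha.
w_first, _, _ = adam_step(w, v, s, dw=w, t=1, alpha=0.1)

for t in range(1, 501):
    w, v, s = adam_step(w, v, s, dw=w, t=t, alpha=0.1)
```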


### Learning-rate decay

Learning-rate decay is especially useful with mini-batch learning, where the iterates can keep oscillating around the minimum instead of converging.

$$ \alpha = \frac{\alpha_0}{1+\text{decayRate}\times\text{epochNumber}} $$

where $\alpha_0$ is the initial learning rate.

Other options:
- Exponential decay
- Power-law decay
- Mini-batch dependent decay
- Manual decay
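
A small sketch of two of these schedules, assuming an initial learning rate `alpha0` that multiplies the decay factor. The function names and the `base` parameter are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)


def exponential_decay(alpha0, base, epoch):
    """Exponential decay: alpha0 * base**epoch, with 0 < base < 1."""
    return alpha0 * base ** epoch


# With alpha0 = 0.2 and decay_rate = 1 the rate falls as
# roughly 0.2, 0.1, 0.067, 0.05 over the first four epochs.
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
```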

## The problem of local optima

Most points with zero gradient are saddle points rather than local optima, especially in a high-dimensional space.

Plateaus can slow down learning significantly.

Adam can help in moving quickly out of a plateau region.

(see the exercise for an implementation of these methods)
