### Mini-batch gradient descent
- Vectorization allows you to efficiently compute on m examples.
- Training a NN on a large dataset is slow, so it is worth finding an optimization algorithm that runs faster.
- Suppose we have m = 50 million. Batch gradient descent has to process all 50 million examples before it can take a single step, which takes a huge amount of processing time.
   - 50 million examples also won't fit in memory at once, so extra processing is needed just to iterate over them.
- It turns out you can get a faster algorithm if gradient descent starts making progress before it has finished processing all 50 million examples.
- To do that, split the training set into smaller ("baby") training sets; these are called mini-batches.
- Suppose we split the m examples into mini-batches of size 1000.
  - X{1} = examples 1 ... 1000
  - X{2} = examples 1001 ... 2000
  - ....
  - X{bs} = ...
- We split Y in the same way as X.
- So mini-batch t is defined as the pair X{t}, Y{t}.
- In batch gradient descent, each gradient descent step uses the whole dataset.
- In mini-batch gradient descent, each gradient descent step uses a single mini-batch.
- Mini-Batch algorithm pseudo code:
<img src="img/Screen%20Shot%202019-02-01%20at%2022.33.14.png">
- The computation for each mini-batch inside an epoch should be vectorized.
- Mini-batch gradient descent runs much faster than batch gradient descent on large datasets.
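
The loop structure described above can be sketched in NumPy. This is only an illustration under stated assumptions: the linear model with a squared-error cost stands in for the network, and the names `make_mini_batches` and `train_linear_model` are hypothetical, not taken from these notes. Examples are stored one per column, and the data is split in order without shuffling.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split X (n_features, m) and Y (1, m) column-wise into (X{t}, Y{t}) pairs."""
    m = X.shape[1]
    return [
        (X[:, start:start + batch_size], Y[:, start:start + batch_size])
        for start in range(0, m, batch_size)
    ]

def train_linear_model(X, Y, batch_size=1000, learning_rate=0.1, num_epochs=20):
    """Mini-batch gradient descent on a linear model (assumed stand-in for a NN)."""
    W = np.zeros((1, X.shape[0]))
    b = 0.0
    for epoch in range(num_epochs):              # one epoch = one pass over all m examples
        for X_t, Y_t in make_mini_batches(X, Y, batch_size):
            m_t = X_t.shape[1]                   # the last mini-batch may be smaller
            error = W @ X_t + b - Y_t            # vectorized forward pass on the mini-batch
            dW = error @ X_t.T / m_t             # gradient of the cost w.r.t. W
            db = float(error.sum()) / m_t        # gradient of the cost w.r.t. b
            W = W - learning_rate * dW           # one gradient descent step per mini-batch
            b = b - learning_rate * db
    return W, b
```

With m = 1000 and `batch_size=64`, one epoch takes 16 gradient steps instead of the single step batch gradient descent would take. Shuffling the examples before each epoch is a common addition that this sketch leaves out.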

### Understanding mini-batch gradient descent
- In the mini-batch algorithm, the cost won't necessarily go down on every step as it does in the batch algorithm. It may have some ups and downs, but overall it should trend downward (unlike batch gradient descent, where the cost function decreases on each iteration).
- Mini-batch size:
   - (mini batch size = m) ==> Batch gradient descent
   - (mini batch size = 1) ==> Stochastic gradient descent (SGD)
   - (mini batch size = between 1 and m) ==> Mini-batch gradient descent
- Batch gradient descent:
  - each iteration takes too long, because one iteration is a full pass over the data (an epoch)
- Stochastic gradient descent:
    - too noisy regarding cost minimization (the noise can be reduced by using a smaller learning rate)
    - never settles exactly at the minimum cost; it keeps wandering around it
    - loses the speedup from vectorization
- Mini-batch gradient descent:
   1. faster learning:
        - you have the vectorization advantage
        - make progress without waiting to process the entire training set
   2. doesn't always exactly converge (oscillates in a very small region, but you can reduce the learning rate)
- Mini-batch size is a hyperparameter.
- Guidelines for choosing mini-batch size:
   1. If the training set is small (< 2000 examples), use batch gradient descent.
   2. Otherwise, prefer a power of 2 (because of the way computer memory is laid out and accessed, your code sometimes runs faster if the mini-batch size is a power of 2): 64, 128, 256, 512, 1024, ...
   3. Make sure that a mini-batch fits in CPU/GPU memory.

### Exponentially weighted averages

There are optimization algorithms that are faster than gradient descent. The ones covered below are built on exponentially weighted averages.

General equation: v(t) = beta * v(t-1) + (1-beta) * theta(t)

If we plot v(t), it approximately represents an average over the last ~ (1 / (1 - beta)) entries:
   - beta = 0.9 averages roughly the last 10 entries
   - beta = 0.98 averages roughly the last 50 entries (1 / (1 - 0.98))
   - beta = 0.5 averages roughly the last 2 entries

### Understanding exponentially weighted averages
Algorithm is very simple:
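
A minimal Python sketch of that loop, assuming the running value v is initialized to 0 and `thetas` is the sequence of observed values. The function name `exponentially_weighted_average` is hypothetical.

```python
def exponentially_weighted_average(thetas, beta=0.9):
    """Return the running value v(t) after each entry of thetas."""
    v = 0.0                                  # assumed starting value
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta    # v(t) = beta * v(t-1) + (1-beta) * theta(t)
        averages.append(v)
    return averages
```

Only a single number v is kept in memory and overwritten on each step, which is why this is cheaper than storing a window of recent values and averaging them explicitly.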

### Bias correction in exponentially weighted averages
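- Because v starts at 0, the first few values of the exponentially weighted average are biased toward zero. Bias correction makes these early values more accurate.
- v_corrected(t) = v(t) / (1 - beta^t), where v(t) = beta * v(t-1) + (1-beta) * theta(t)
- As t grows, beta^t approaches 0, so the correction has almost no effect later on.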
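
As a small illustration, the sketch below applies the correction at read-out time only: the uncorrected running value is what gets carried to the next step, and the division by (1 - beta^t) produces the reported estimate. It assumes v starts at 0 and t counts from 1; the function name `bias_corrected_average` is hypothetical.

```python
def bias_corrected_average(thetas, beta=0.9):
    """Return the bias-corrected exponentially weighted average after each entry."""
    v = 0.0                                      # assumed starting value
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta        # uncorrected running average
        corrected.append(v / (1 - beta ** t))    # corrected estimate; v itself is unchanged
    return corrected
```

For a constant input the corrected estimate equals that constant from the very first step, whereas the uncorrected value would start at only (1 - beta) times it.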

### Gradient descent with momentum
- The momentum algorithm almost always works faster than standard gradient descent.
- The simple idea is to calculate the exponentially weighted averages for your gradients and then update your weights with the new values.
- Pseudo code:
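
A minimal sketch of one momentum update in NumPy, following the description above. The function name `momentum_step`, the default learning rate, and the choice to start `vdW` at zeros are assumptions for illustration; the same update would be applied to b with its own `vdb`.

```python
import numpy as np

def momentum_step(W, dW, vdW, learning_rate=0.01, beta=0.9):
    """One gradient descent step with momentum for a parameter array W."""
    vdW = beta * vdW + (1 - beta) * dW    # exponentially weighted average of the gradients
    W = W - learning_rate * vdW           # update with the averaged gradient, not the raw dW
    return W, vdW
```

On each iteration you would compute dW on the current mini-batch, call `momentum_step`, and carry the returned `vdW` into the next iteration.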

- Momentum helps the cost function reach the minimum faster and with fewer oscillations.
- beta is another hyperparameter. beta = 0.9 is very common and works very well in most cases.
- In practice people don't bother implementing bias correction.

### RMSprop
RMSprop (Root Mean Square prop) also speeds up gradient descent.

- RMSprop makes the updates move more slowly in the vertical direction (b) and faster in the horizontal direction (w).
- sdW is the exponentially weighted average of the squared gradients dW^2. Make sure the update never divides by zero by adding a small value epsilon (e.g. epsilon = 10^-8) to the denominator:
    -  W = W - learning_rate * dW / (sqrt(sdW) + epsilon)
- With RMSprop you can use a larger learning rate.
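
A minimal sketch of one RMSprop update in NumPy, using the update rule given above. The function name `rmsprop_step` and the default values for `learning_rate` and `beta` are assumptions for illustration; `sdW` is assumed to start at zeros, and the same update applies to b with its own `sdb`.

```python
import numpy as np

def rmsprop_step(W, dW, sdW, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    """One RMSprop step for a parameter array W."""
    sdW = beta * sdW + (1 - beta) * dW ** 2                  # average of squared gradients (element-wise)
    W = W - learning_rate * dW / (np.sqrt(sdW) + epsilon)    # large-gradient directions get scaled down
    return W, sdW
```

Because each component of dW is divided by the root of its own running squared magnitude, a direction with consistently large gradients takes steps of about the same size as a direction with small gradients, which is what damps the oscillations.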

### Adam optimization algorithm
- Stands for Adaptive Moment Estimation.
- Adam optimization simply puts RMSprop and momentum together!

- Hyperparameters for Adam:
     - Learning rate: needs to be tuned.
     - beta1: parameter of the momentum - 0.9 is recommended by default.
     - beta2: parameter of the RMSprop - 0.999 is recommended by default.
     - epsilon: 10^-8 is recommended by default.
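
A minimal sketch of one Adam update in NumPy, combining the momentum average and the RMSprop average and applying bias correction to both. The function name `adam_step` and the default `learning_rate` are assumptions for illustration (the learning rate still has to be tuned); `vdW` and `sdW` are assumed to start at zeros and t counts iterations from 1.

```python
import numpy as np

def adam_step(W, dW, vdW, sdW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a parameter array W at iteration t (t >= 1)."""
    vdW = beta1 * vdW + (1 - beta1) * dW          # momentum part: average of gradients
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # RMSprop part: average of squared gradients
    vdW_corrected = vdW / (1 - beta1 ** t)        # bias correction for both averages
    sdW_corrected = sdW / (1 - beta2 ** t)
    W = W - learning_rate * vdW_corrected / (np.sqrt(sdW_corrected) + epsilon)
    return W, vdW, sdW
```

The uncorrected `vdW` and `sdW` are returned and carried to the next iteration; the corrected values are only used for the update itself.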

### Learning rate decay
- Slowly reduce the learning rate over the course of training.
- One technique is: learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0
- epoch_num counts passes over all the data (not single mini-batches).
- Other learning rate decay methods (continuous):
     - learning_rate = (0.95 ^ epoch_num) * learning_rate_0
     - learning_rate = (k / sqrt(epoch_num)) * learning_rate_0
- Some people adjust the learning rate manually.
- decay_rate is another hyperparameter.
- For Andrew Ng, learning rate decay is lower priority than the other hyperparameters to tune.
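
The three schedules above can be written as small Python functions. The function names are hypothetical, and the default values for `base` and `k` are only placeholders.

```python
import math

def inverse_time_decay(learning_rate_0, decay_rate, epoch_num):
    """learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0"""
    return learning_rate_0 / (1 + decay_rate * epoch_num)

def exponential_decay(learning_rate_0, epoch_num, base=0.95):
    """learning_rate = (base ^ epoch_num) * learning_rate_0"""
    return (base ** epoch_num) * learning_rate_0

def inverse_sqrt_decay(learning_rate_0, epoch_num, k=1.0):
    """learning_rate = (k / sqrt(epoch_num)) * learning_rate_0, for epoch_num >= 1"""
    return (k / math.sqrt(epoch_num)) * learning_rate_0
```

For example, with learning_rate_0 = 0.2 and decay_rate = 1, the first schedule gives 0.1 at epoch 1 and 0.05 at epoch 3.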