## Types of Gradient Descent

1. **Batch Gradient Descent**: Uses the entire dataset to compute the gradient at each step. It provides accurate updates but can be computationally expensive for large datasets.
2. **Stochastic Gradient Descent (SGD)**: Uses one data point at a time to compute the gradient. Each update is much cheaper to compute, and the noise can help it escape shallow local minima, but the parameter updates are noisier.
3. **Mini-Batch Gradient Descent**: A compromise between batch and stochastic gradient descent. It uses a small random subset of the data (mini-batch) to compute the gradient. It balances the efficiency and accuracy of updates.

In batch gradient descent, a whole epoch (one complete pass of the data through the model) yields only one gradient descent step. In mini-batch gradient descent, every training step uses a subset of the data and yields its own gradient descent step, so one epoch produces as many steps as there are mini-batches. This usually helps the model train faster.
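The step counts described above can be checked with a small sketch. It assumes a toy linear regression problem with a mean squared error cost; the function name `minibatch_gradient_descent`, the learning rates, and the synthetic data are illustrative choices, not part of the original notes. Setting `batch_size` to the dataset size gives batch gradient descent, and setting it to 1 gives SGD.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, lr=0.1, epochs=1, seed=0):
    """Fit a linear model with MSE cost; return the weights and the number of steps taken."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    steps = 0
    for _ in range(epochs):
        order = rng.permutation(m)                      # reshuffle the data every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on this batch only
            w -= lr * grad
            steps += 1                                  # one gradient descent step per batch
    return w, steps

# Synthetic, noise-free data (an assumption made for the demo).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w_batch, steps_batch = minibatch_gradient_descent(X, y, batch_size=1000)
w_mini, steps_mini = minibatch_gradient_descent(X, y, batch_size=64)
w_sgd, steps_sgd = minibatch_gradient_descent(X, y, batch_size=1, lr=0.01)

print(steps_batch, steps_mini, steps_sgd)  # 1 16 1000
print("error after one epoch (batch):     ", np.linalg.norm(w_batch - true_w))
print("error after one epoch (mini-batch):", np.linalg.norm(w_mini - true_w))
```

With 1000 examples and a mini-batch size of 64, one epoch gives 16 steps instead of 1, which is why the mini-batch run ends the epoch closer to the true weights in this toy setting.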

Typical mini-batch sizes are powers of 2, such as 32, 64, 256, or 512.


In the illustration below, the cost under batch gradient descent decreases much more smoothly, because every step uses the same data. In mini-batch gradient descent, each training step sees a different subset of the data, so the cost fluctuates from step to step. Some mini-batches are also simply easier or harder to fit than others.




![image.png](attachment:image.png)

Source: https://www.coursera.org/learn/deep-neural-network/lecture/lBXu8/understanding-mini-batch-gradient-descent
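The smooth versus noisy behaviour can be reproduced with a minimal sketch. It assumes a linear model with an MSE cost and noisy synthetic labels; all names and hyperparameters here are illustrative. The batch cost is measured on the full dataset at every step, while the mini-batch cost is measured on whichever mini-batch is currently being used.

```python
import numpy as np

def mse(X, y, w):
    """Mean squared error of a linear model."""
    return float(np.mean((X @ w - y) ** 2))

def mse_grad(X, y, w):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=1024)  # noisy labels

lr, batch_size, epochs = 0.05, 64, 10

# Batch gradient descent: every step uses all 1024 examples.
w = np.zeros(3)
batch_costs = []
for _ in range(epochs * (len(y) // batch_size)):  # same number of steps as the loop below
    batch_costs.append(mse(X, y, w))
    w -= lr * mse_grad(X, y, w)

# Mini-batch gradient descent: every step uses a different 64-example subset.
w = np.zeros(3)
mini_costs = []
for _ in range(epochs):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        mini_costs.append(mse(X[idx], y[idx], w))  # cost on the current mini-batch
        w -= lr * mse_grad(X[idx], y[idx], w)

# Count how often the cost went up from one step to the next.
batch_increases = int(np.sum(np.diff(batch_costs) > 1e-9))
mini_increases = int(np.sum(np.diff(mini_costs) > 1e-9))
print("steps where the cost increased (batch):     ", batch_increases)
print("steps where the cost increased (mini-batch):", mini_increases)
```

In this setup the batch cost never increases, while the mini-batch cost goes up on many steps even though it trends downward overall.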

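In stochastic gradient descent, a step is taken after every single training example. Because the examples are processed one at a time, the speed-up from vectorization is lost, so training can be slow, and the path toward the minimum is noisy.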


![image.png](attachment:image.png)

Source: https://www.coursera.org/learn/deep-neural-network/lecture/lBXu8/understanding-mini-batch-gradient-descent

## Gradient Descent with Momentum

The aim of this modification to the standard gradient descent algorithm is to reach the minimum faster by moving more directly toward it. The idea is to compute an exponentially weighted average of the gradients and then use that average, instead of the raw gradient, to update the weights.


![image-2.png](attachment:image-2.png)

Imagine a contour plot representing the cost function you're trying to minimize. The red dot signifies the minimum point.
If you begin gradient descent at a specific point and take one iteration, you might end up at a different location on the plot. With each step, gradient descent moves closer to the minimum, but this process can involve slow, oscillating movements. Gradient descent can take many steps to converge due to oscillations, especially in the vertical direction. These oscillations slow down the process and prevent the use of a larger learning rate. Using a larger learning rate can cause the algorithm to overshoot the minimum, leading to divergence.
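The oscillation and the learning-rate limit described above can be seen on a toy cost function. The sketch below assumes a hypothetical elongated bowl, `J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)`, where `w[0]` plays the role of the horizontal direction and `w[1]` the steep vertical one; the starting point and learning rates are illustrative.

```python
import numpy as np

# Hypothetical elongated bowl: J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
CURVATURE = np.array([1.0, 25.0])

def grad(w):
    """Gradient of the toy cost function."""
    return CURVATURE * w

def gradient_descent(w0, lr, steps):
    """Plain gradient descent; returns every visited point, one row per step."""
    path = [np.array(w0, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] - lr * grad(path[-1]))
    return np.array(path)

slow = gradient_descent([-10.0, 1.0], lr=0.07, steps=30)
too_large = gradient_descent([-10.0, 1.0], lr=0.09, steps=30)

# With lr=0.07 the vertical coordinate flips sign on every step (oscillation),
# while the horizontal coordinate creeps toward 0 slowly.
sign_flips = int(np.sum(np.sign(slow[1:, 1]) != np.sign(slow[:-1, 1])))
print("vertical sign flips in 30 steps:", sign_flips)
print("horizontal position after 30 steps:", slow[-1, 0])

# With the slightly larger lr=0.09 the vertical oscillation grows instead of shrinking.
print("vertical position after 30 steps, lr=0.09:", too_large[-1, 1])
```

The steep direction caps the usable learning rate, and that cap is what keeps progress in the shallow direction slow.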



### Benefits of Momentum

- **Damping Oscillations**: Momentum reduces oscillations by updating the weights with averaged gradients. In the vertical direction, successive gradients point in opposite directions, so their positive and negative values largely cancel out in the average. In the horizontal direction, the gradients all point the same way, so the average keeps pointing that way and movement becomes faster. Both effects lead to smoother steps towards the minimum.
- **Directional Movement**: In the horizontal direction, the gradients align more consistently, allowing for faster movement toward the minimum.
- **Efficient Path**: The algorithm takes a more direct path to the minimum, reducing unnecessary oscillations.

The red line shows the route gradient descent takes when using the momentum method, that is, when an exponentially weighted average of the gradients is used to update the weights.

![image.png](attachment:image.png)

The 'v' terms are exponentially weighted averages (EWA) of the gradients, and their initial value is 0. Each new 'v' weights the previous average by beta and the current gradient by (1 - beta), which smooths out the sequence of gradients. Because 'v' is used to update the weights instead of the raw gradients, the weight changes are smoother and less erratic.
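A minimal sketch of the momentum update follows, assuming the common form `v = beta * v + (1 - beta) * dw` and `w = w - alpha * v`, with `beta = 0.9`. The toy cost function, the helper names `ewa` and `momentum_descent`, and the learning rate are assumptions made for illustration. The first part shows the cancellation effect on alternating gradients; the second part runs the update on an elongated bowl.

```python
import numpy as np

def ewa(gradients, beta=0.9):
    """Exponentially weighted average of a gradient sequence, starting from v = 0."""
    v, history = 0.0, []
    for g in gradients:
        v = beta * v + (1.0 - beta) * g
        history.append(v)
    return np.array(history)

# Alternating gradients (like the vertical direction) mostly cancel in the average;
# consistent gradients (like the horizontal direction) are preserved.
v_vertical = ewa([1.0, -1.0] * 25)
v_horizontal = ewa([1.0] * 50)
print("final |v| for alternating gradients:", abs(v_vertical[-1]))
print("final  v  for consistent gradients: ", v_horizontal[-1])

# Hypothetical elongated bowl: J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
CURVATURE = np.array([1.0, 25.0])

def momentum_descent(w0, lr, beta, steps):
    """Gradient descent with momentum; beta=0 reduces to plain gradient descent."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                  # v starts at 0
    for _ in range(steps):
        g = CURVATURE * w                 # gradient of the toy cost
        v = beta * v + (1.0 - beta) * g   # smoothed gradient
        w = w - lr * v                    # update with v instead of g
    return w

w_plain = momentum_descent([-10.0, 1.0], lr=0.09, beta=0.0, steps=300)
w_momentum = momentum_descent([-10.0, 1.0], lr=0.09, beta=0.9, steps=300)
print("distance from minimum, plain GD:", np.linalg.norm(w_plain))
print("distance from minimum, momentum:", np.linalg.norm(w_momentum))
```

On this toy problem a learning rate of 0.09 makes plain gradient descent diverge in the steep direction, while the momentum version converges with the same learning rate.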

## RMSProp

RMSprop is a gradient-based optimization technique used in training neural networks, introduced by Geoffrey Hinton, who is also known for his work on back-propagation. Neural networks often face the challenge of gradients either vanishing or exploding as they are propagated back through the network (known as the vanishing and exploding gradients problem). RMSprop was created as a stochastic technique for mini-batch learning to address this issue.

RMSprop handles the problem by using a moving average of squared gradients to normalize the gradient. This normalization adjusts the effective step size, reducing it for large gradients to prevent exploding gradients and increasing it for small gradients to prevent vanishing gradients.

In essence, RMSprop employs an adaptive learning rate rather than treating the learning rate as a fixed hyperparameter. This means the effective learning rate changes dynamically over time.
The equations below show why RMSprop is said to use an adaptive learning rate:

![image.png](attachment:image.png)


Source: https://medium.com/analytics-vidhya/a-complete-guide-to-adam-and-rmsprop-optimizer-75f4502d83be

We can see that this update is almost the same as plain gradient descent (w = w - alpha * dw), except for the denominators. We can read each denominator as dividing alpha, so the effective learning rate changes with the recent size of the gradients. Hence the adaptive learning rate.
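The update described above can be sketched as follows. This assumes the commonly used form of RMSprop, `s = beta * s + (1 - beta) * dw**2` followed by `w = w - alpha * dw / (sqrt(s) + eps)`, which may differ in notation from the pictured equations. The function name `rmsprop_step`, the defaults `beta=0.9` and `eps=1e-8`, and the constant toy gradient are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update; s is the moving average of squared gradients."""
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)  # alpha is divided by sqrt(s) + eps
    return w, s

# Toy setting: one parameter sees a very large gradient, the other a very small one.
w = np.zeros(2)
s = np.zeros(2)
grad = np.array([100.0, 0.01])

for _ in range(100):
    w_prev = w
    w, s = rmsprop_step(w, grad, s)

step_sizes = np.abs(w - w_prev)             # size of the last update per parameter
effective_lr = 0.001 / (np.sqrt(s) + 1e-8)  # alpha divided by the denominator

print("last step sizes:        ", step_sizes)
print("effective learning rate:", effective_lr)
```

Although the two gradients differ by a factor of 10,000, both parameters end up moving by roughly `lr` per step, because the effective learning rate is scaled down for the large gradient and scaled up for the small one.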