## Optimization Algorithms in Deep Learning Networks
 
Optimization algorithms are methods used to minimize (or maximize) an objective function, such as the loss function in deep learning models. They play a crucial role in training neural networks by updating the model's parameters to improve performance.
 
## Types of Optimization Algorithms
 
 1. **Gradient Descent**: The most basic optimization algorithm, which updates parameters in the direction of the negative gradient of the loss function.
     - *Variants*: Batch Gradient Descent, Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent
 2. **Momentum**: Accelerates gradient descent by considering the past gradients to smooth out updates.
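 3. **Adagrad**: Adapts the learning rate for each parameter by dividing it by the square root of that parameter's accumulated squared gradients.
 4. **RMSprop**: Modifies Adagrad by replacing the ever-growing sum of squared gradients with an exponentially decaying moving average, so the effective learning rate does not shrink toward zero. This tends to work better in non-convex settings.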
 5. **Adam**: Combines ideas from Momentum and RMSprop, maintaining both a moving average of the gradients and their squared values.
 
 Compared with plain gradient descent, the momentum-based and adaptive methods often converge in fewer iterations, although the best choice depends on the model and the data.
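
To make the differences between items 2 to 5 concrete, here is a minimal NumPy sketch of one update step for each method. The function names, argument names, and default hyperparameters are illustrative assumptions, not taken from any particular library, and the momentum rule shown is only one of several common formulations.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Keep a decaying sum of past gradients and step along it.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, sq_sum, lr=0.01, eps=1e-8):
    # Accumulate all squared gradients; larger history means smaller steps.
    sq_sum = sq_sum + grad ** 2
    return w - lr * grad / (np.sqrt(sq_sum) + eps), sq_sum

def rmsprop_step(w, grad, sq_avg, lr=0.01, rho=0.9, eps=1e-8):
    # Same idea as Adagrad, but with a moving average of squared gradients.
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and of its square (v),
    # each corrected for their zero initialisation; t starts at 1.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One step of each method on L(w) = w^2, whose gradient is 2w.
w0 = np.array([1.0])
g0 = 2.0 * w0
w_momentum, _ = momentum_step(w0, g0, velocity=np.zeros(1))
w_adagrad, _ = adagrad_step(w0, g0, sq_sum=np.zeros(1))
w_rmsprop, _ = rmsprop_step(w0, g0, sq_avg=np.zeros(1))
w_adam, _, _ = adam_step(w0, g0, m=np.zeros(1), v=np.zeros(1), t=1)
```

Each function returns the updated weights together with the optimizer state, which the caller passes back in on the next step.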



## Optimization functions

Optimizers usually compute the gradient, i.e. the partial derivatives of the loss function with respect to the weights, and then adjust the weights in the direction opposite to that gradient. This cycle is repeated until the loss function reaches a minimum.

![Untitled picture.png](<attachment:Untitled picture.png>)


## Gradient Descent
 
Gradient Descent is an optimization algorithm used to minimize the loss function in deep learning models. It works by calculating the gradient (partial derivatives) of the loss function with respect to the model's parameters and updating the parameters in the opposite direction of the gradient. This process is repeated iteratively until the loss function reaches a minimum value.
 
![alt text](<Untitled picture.png>)
 
The basic update rule for Gradient Descent is:
 
 $$
 w := w - \eta \frac{\partial L}{\partial w}
 $$
 
 where:
 - $w$ is the parameter (weight) to be updated,
 - $\eta$ is the learning rate,
 - $\frac{\partial L}{\partial w}$ is the gradient of the loss function $L$ with respect to $w$.
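
The update rule above can be sketched in a few lines of Python. The example loss $L(w) = (w - 3)^2$, the helper name `gradient_descent`, and the chosen learning rate and step count are assumptions made for illustration only.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly apply w := w - lr * dL/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# L(w) = (w - 3)^2 has gradient dL/dw = 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)  # very close to 3.0
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 at every step; a learning rate above 1.0 would make the iterates diverge for this particular loss.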
 
 There are different variants of Gradient Descent:
 - **Batch Gradient Descent**: Uses the entire dataset to compute the gradient at each step.
 - **Stochastic Gradient Descent (SGD)**: Uses a single data point to compute the gradient at each step.
 - **Mini-batch Gradient Descent**: Uses a small batch of data points to compute the gradient at each step.


## Batch Gradient Descent
 
Batch Gradient Descent is an optimization algorithm where the gradient of the loss function is computed using the entire training dataset. At each iteration, all training examples are used to calculate the gradients and update the model parameters. This gives the exact gradient of the training loss, so the updates are stable, but each update can be computationally expensive for large datasets.
 
The update rule for Batch Gradient Descent is:
 
 $$
 w := w - \eta \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L^{(i)}}{\partial w}
 $$
 
 where:
 - $w$ is the parameter (weight) to be updated,
 - $\eta$ is the learning rate,
 - $N$ is the number of training examples,
 - $\frac{\partial L^{(i)}}{\partial w}$ is the gradient of the loss for the $i$-th example with respect to $w$.
 
 **Pros:** Provides a precise direction for parameter updates and can converge smoothly.
 
 **Cons:** Can be slow and memory-intensive for large datasets, as it requires processing the entire dataset for each update.
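
As a rough illustration, the sketch below applies the batch update rule to linear regression with the per-example loss $L^{(i)} = \tfrac{1}{2}(x_i \cdot w - y_i)^2$. The model, the synthetic noise-free data, and the name `batch_gradient_descent` are assumptions chosen to keep the example small.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=500):
    """Fit linear weights using the full dataset for every update."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        residual = X @ w - y                # prediction error for all N examples
        grad = X.T @ residual / n_samples   # average of the N per-example gradients
        w = w - lr * grad
    return w

# Synthetic data generated from known weights (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_hat = batch_gradient_descent(X, y)
print(w_hat)  # close to true_w
```

Note that every one of the 500 updates touches all 200 rows of `X`, which is the cost the "Cons" above refers to.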


## Stochastic Gradient Descent (SGD)
 
Stochastic Gradient Descent is an optimization algorithm where the gradient of the loss function is computed using a single randomly selected training example at each iteration. Because each update needs only one example instead of the entire dataset, SGD updates the model parameters far more frequently, which can lead to faster convergence, especially for large datasets.
 
![alt text](<Untitled picture-1.png>)

The update rule for Stochastic Gradient Descent is:
 
 $$
 w := w - \eta \frac{\partial L^{(i)}}{\partial w}
 $$
 
 where:
 - $w$ is the parameter (weight) to be updated,
 - $\eta$ is the learning rate,
 - $\frac{\partial L^{(i)}}{\partial w}$ is the gradient of the loss for the $i$-th randomly selected example with respect to $w$.
 
 **Pros:** Can handle large datasets efficiently, allows for online learning, and its noisy updates can help it escape shallow local minima.
 
 **Cons:** The parameter updates can be noisy, which may cause the loss function to fluctuate rather than decrease smoothly.
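
A minimal sketch of the SGD rule for the same kind of linear regression problem, with per-example loss $\tfrac{1}{2}(x_i \cdot w - y_i)^2$, is shown below. The function name `sgd`, the synthetic noise-free data, and the hyperparameters are illustrative assumptions. The example index is drawn uniformly at random with replacement at every step; shuffling the data once per epoch is a common alternative.

```python
import numpy as np

def sgd(X, y, lr=0.01, steps=10_000, seed=0):
    """Fit linear weights using one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(steps):
        i = rng.integers(n_samples)          # pick a single example at random
        grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of that example's loss
        w = w - lr * grad_i
    return w

# Synthetic data generated from known weights (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_hat = sgd(X, y)
print(w_hat)  # close to true_w
```

Each update here costs a single row of `X`. On noisy real data the iterates would keep fluctuating around the minimum unless the learning rate is decayed over time.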


## Mini-Batch Gradient Descent
 
Mini-Batch Gradient Descent is an optimization algorithm that combines the advantages of both Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset (as in batch) or a single example (as in stochastic), it computes the gradient using a small, randomly selected subset of the training data called a "mini-batch" at each iteration.
 
The update rule for Mini-Batch Gradient Descent is:
 
 $$
 w := w - \eta \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L^{(j)}}{\partial w}
 $$
 
 where:
 - $w$ is the parameter (weight) to be updated,
 - $\eta$ is the learning rate,
 - $m$ is the mini-batch size,
 - $\frac{\partial L^{(j)}}{\partial w}$ is the gradient of the loss for the $j$-th example in the mini-batch with respect to $w$.
 
**Pros:** Achieves a balance between the efficiency of SGD and the stability of Batch Gradient Descent. It can make better use of vectorized operations and hardware acceleration, and often leads to faster convergence in practice.
 
**Cons:** The choice of mini-batch size can affect performance and convergence. A batch that is too small gives noisy updates (like SGD), while one that is too large makes each update slow (like Batch Gradient Descent).
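
The mini-batch rule can be sketched as follows, again for linear regression with per-example loss $\tfrac{1}{2}(x_j \cdot w - y_j)^2$. The function name, the default batch size of 32, and the synthetic noise-free data are assumptions for illustration. The data is reshuffled at the start of every epoch and then consumed in consecutive slices, which is one common way to form random mini-batches.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=32, lr=0.1, epochs=100, seed=0):
    """Fit linear weights using a small random subset of examples per update."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]     # last batch may be smaller
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)     # average over the mini-batch
            w = w - lr * grad
    return w

# Synthetic data generated from known weights (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_hat = minibatch_gradient_descent(X, y)
print(w_hat)  # close to true_w
```

Setting `batch_size=1` recovers an SGD-style update and `batch_size=len(X)` recovers Batch Gradient Descent, so the batch size acts as the dial between the two extremes.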
