# **Optimizers**
An optimizer is an algorithm that adjusts the attributes of a neural network, such as its weights and learning rate, in order to reduce the overall loss and improve accuracy.

## **Types of optimizers**
- Batch GD
- Mini Batch GD
- Stochastic GD
- Momentum
- AdaGrad
- NAG
- Adam
- RMSProp
### Why do we need other optimizers beyond plain gradient descent?

**Challenges**
- Learning rate (Solution: LR scheduling)
- You have to minimize the loss over multiple weights. With 10 weights you have 10 directions to move in, and we cannot have a separate LR for each weight and bias.
- Local Minima
- Saddle point
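
The learning-rate challenge is often handled with a schedule that shrinks the step size as training progresses. The snippet below is a minimal sketch of one such schedule, exponential decay. The helper name `exponential_decay` and the values of `initial_lr`, `decay_rate`, and `decay_steps` are illustrative assumptions, not values from these notes.

```python
def exponential_decay(step, initial_lr=0.1, decay_rate=0.5, decay_steps=10):
    """Return the learning rate at a given training step (hypothetical helper)."""
    # The learning rate is multiplied by decay_rate every decay_steps steps.
    return initial_lr * decay_rate ** (step / decay_steps)


for step in (0, 10, 20, 30):
    print(step, exponential_decay(step))
```

With these assumed values the learning rate halves every 10 steps, so early updates are large and later updates become progressively finer.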


## Exponentially Weighted Moving Average

It is a technique for finding trends in time series data.

Used in:
- Stock prices
- Time series forecasting
- Signal processing

**How does it work?**

Every new data point gets more weight than the points before it, and the weight of every point decays over time.

$$ V_t = \beta V_{t-1} + (1-\beta)\theta_t $$

Here $\beta$ is between 0 and 1, $\theta_t$ is the data point at time $t$, and $V_t$ is the moving average at time $t$.


| Day | Temp |
|-----|------|
| 1 | 23 |
| 2 | 12 |
| 3 | 17 |
| 4 | 19 |

Suppose $\beta = 0.9$.

Initialize $V_0$ with the first observation, so $V_0 = 23$.

For $V_1$: $0.9 \times 23 + 0.1 \times 23 = 23$

For $V_2$: $0.9 \times 23 + 0.1 \times 12 = 21.9$

For $V_3$: $0.9 \times 21.9 + 0.1 \times 17 = 21.41$

For $V_4$: $0.9 \times 21.41 + 0.1 \times 19 = 21.169 \approx 21.17$
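
The recurrence can be checked with a few lines of Python. This is a minimal sketch that assumes, as in the worked example, that the average is initialized with the first observation. The function name `ewma` is chosen here for illustration.

```python
def ewma(values, beta=0.9):
    """Exponentially weighted moving average of a sequence of numbers."""
    v = values[0]  # assumption: V_0 is initialized with the first observation
    averages = []
    for theta in values:
        # V_t = beta * V_{t-1} + (1 - beta) * theta_t
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


temps = [23, 12, 17, 19]  # temperatures from the table
print([round(v, 3) for v in ewma(temps)])
```

Because each step multiplies the previous average by $\beta$, the sharp drop on day 2 pulls the average down only slightly.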

<br><br>

**Impact of Beta value**

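If $\beta = 0.9$, the result behaves roughly like the average of the last ten days, because the effective window is about $\frac{1}{1-\beta}$ points:

$$ \beta = 0.9 \rightarrow \frac{1}{1-\beta} = 10 $$

If $\beta = 0.2$, the window is only $\frac{1}{1-0.2} = 1.25$ days, so the average follows the most recent value closely.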
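
The rule of thumb $\frac{1}{1-\beta}$ can be tabulated for a few values of beta. This is a small sketch; the helper name `effective_window` and the chosen beta values are illustrative.

```python
def effective_window(beta):
    """Approximate number of recent points an EWMA averages over."""
    return 1.0 / (1.0 - beta)


for beta in (0.2, 0.5, 0.9, 0.98):
    print(f"beta={beta}: about {effective_window(beta):.2f} points")
```

A larger beta gives a smoother curve that reacts slowly to change, while a smaller beta gives a noisier curve that tracks recent values.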


# **SGD with Momentum**
## **The Why**

Plain gradient descent has three problems:
- Noisy gradients
- Consistent gradients
- High curvature

Momentum addresses all three problems.

## **The What**

Suppose you are moving from point A to point B but you don't know exactly where point B is. You ask people where it is. If every person you ask points in the same direction, you speed up in that direction. If people give mixed directions, you move more slowly. This is the whole idea of the momentum optimizer.

![alr](https://files.codingninjas.in/article_images/nesterov-accelerated-gradient-3-1644853109.webp)

[link](https://www.codingninjas.com/studio/library/nesterov-accelerated-gradient)

## **The How (Maths)**

In a normal optimizer we update the weights with the following equation:
$$w_{t+1} = w_t - \eta \nabla w_t$$

In momentum we introduce a term $v$, called velocity, and update with the velocity instead of the raw gradient:
$$w_{t+1} = w_t - v_t$$

Here $v_t$ is calculated as:
$$v_t = \beta v_{t-1} + \eta \nabla w_t$$

Here $\beta$ is between 0 and 1, $v_{t-1}$ is the velocity at the previous step, and the second term is the learning rate times the current gradient.


The history of past velocities is what acts as the momentum.
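
The update rule above can be sketched in plain Python. The test function $f(w) = w^2$, the starting point, and the hyperparameter values are assumptions chosen for illustration, and `momentum_descent` is a hypothetical helper name.

```python
def momentum_descent(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """Minimize a 1-D function using gradient descent with momentum."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad_fn(w)  # v_t = beta * v_{t-1} + lr * gradient
        w = w - v                       # w_{t+1} = w_t - v_t
    return w


# Assumed test function f(w) = w^2, whose gradient is 2w and minimum is at w = 0.
grad = lambda w: 2.0 * w

print(momentum_descent(grad, w0=5.0))            # with momentum
print(momentum_descent(grad, w0=5.0, beta=0.0))  # beta = 0: velocity is just lr * gradient
```

Setting `beta=0.0` removes the history term, so each step is the ordinary gradient descent step.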

### ***Effect of Beta***
Beta is called the decay factor.
If beta is zero, momentum behaves like plain SGD.


If beta is one, there is no decay, so past gradients never lose their influence.


## **Problem with momentum**

It can be slower than the optimizers covered next, but it is faster than plain SGD.


# **Nesterov Accelerated Gradient (NAG)**

![alr](https://files.codingninjas.in/article_images/nesterov-accelerated-gradient-4-1644853109.webp)

[link](https://www.codingninjas.com/studio/library/nesterov-accelerated-gradient)

$$w_{t+1} = w_t - v_t$$
While calculating $v_t$, we use the look-ahead gradient $\nabla w_{la}$:
$$v_t = \beta v_{t-1} + \eta \nabla w_{la}$$

$\nabla w_{la}$ is the gradient evaluated at the look-ahead point $w_{la}$, which is calculated as:
$$w_{la} = w_t - \beta v_{t-1}$$

Using this look-ahead gradient in the update helps prevent overshooting.
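
The only change from plain momentum is where the gradient is evaluated. The sketch below makes that explicit. The test function $f(w) = w^2$ and the hyperparameter values are assumptions for illustration, and `nag_descent` is a hypothetical helper name.

```python
def nag_descent(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """Minimize a 1-D function using Nesterov accelerated gradient."""
    w, v = w0, 0.0
    for _ in range(steps):
        w_lookahead = w - beta * v                # w_la = w_t - beta * v_{t-1}
        v = beta * v + lr * grad_fn(w_lookahead)  # gradient taken at the look-ahead point
        w = w - v                                 # w_{t+1} = w_t - v_t
    return w


# Assumed test function f(w) = w^2, whose gradient is 2w and minimum is at w = 0.
grad = lambda w: 2.0 * w

print(nag_descent(grad, w0=5.0))
```

Evaluating the gradient at `w_lookahead` lets the update react to where the velocity is about to carry the weight, which is what damps the overshoot.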


**Disadvantage**
- You can get trapped in local minima



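```python
import tensorflow as tf

# Keras SGD covers plain SGD, momentum, and NAG through two arguments.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.0,     # values above 0 (e.g. 0.9) enable momentum
    nesterov=False,   # True switches to Nesterov accelerated gradient
    name="SGD",
)
```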



# **AdaGrad**

- Adagrad is an optimization algorithm commonly used in training neural networks. It adapts the learning rate for each parameter based on the historical gradients, allowing for larger updates for infrequent parameters and smaller updates for frequent parameters. This adaptive learning rate can help improve convergence and training stability.

## AdaGrad works better in the following cases:
- Input features have different scale
- Features are sparse
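
These notes do not spell out the AdaGrad update, so the sketch below assumes one commonly used form: accumulate the squared gradient of each parameter and divide the step by the square root of that sum. The helper name `adagrad_step`, the gradient values, and the hyperparameters are illustrative assumptions.

```python
import math


def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update for lists of parameters, gradients, and accumulators."""
    new_w, new_accum = [], []
    for w_i, g_i, a_i in zip(w, g, accum):
        a_i = a_i + g_i ** 2  # running sum of squared gradients for this parameter
        new_w.append(w_i - lr * g_i / (math.sqrt(a_i) + eps))  # per-parameter scaled step
        new_accum.append(a_i)
    return new_w, new_accum


# Assumed gradients with very different scales, as with unscaled input features.
w, accum = adagrad_step([1.0, 1.0], [100.0, 0.01], accum=[0.0, 0.0])
print(w)
```

Although the two gradients differ by a factor of 10,000, the first update moves both parameters by roughly the same amount, because each gradient is divided by its own history.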


**Elongated Bowl Problem**

In the context of machine learning and neural network training, the Elongated Bowl Problem typically arises when the optimization landscape has steep gradients in some directions and shallow gradients in others. This can occur due to various factors such as:

**High Condition Number**: The condition number of the Hessian matrix (second-order derivatives of the loss function) is high, indicating that the curvature of the loss function varies significantly across different dimensions.

**Correlated Features**: If the input features are highly correlated, it can lead to elongated contours in the optimization landscape.

**Overparameterization**: Having a large number of parameters in the model can also contribute to the Elongated Bowl Problem, as it increases the complexity of the optimization space.

**Irrelevant Features**: Including irrelevant features in the model can create flat regions in the optimization landscape, exacerbating the elongated bowl shape.

The Elongated Bowl Problem can lead to difficulties in convergence and slow training of machine learning models. Gradient-based optimization algorithms may struggle to navigate the narrow valley of the loss function, resulting in slow progress or convergence to suboptimal solutions.
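
The slow progress can be reproduced on a toy problem. The sketch below runs plain gradient descent on an assumed quadratic bowl $f(x, y) = \frac{1}{2}(a x^2 + b y^2)$ with $a = 100$ and $b = 1$, which gives a condition number of 100. The function name and all values are illustrative.

```python
def gradient_descent(a, b, lr, steps, start=(1.0, 1.0)):
    """Plain gradient descent on f(x, y) = 0.5 * (a * x**2 + b * y**2)."""
    x, y = start
    for _ in range(steps):
        x -= lr * a * x  # gradient along x is a * x (steep direction)
        y -= lr * b * y  # gradient along y is b * y (shallow direction)
    return x, y


# The learning rate must stay below 2 / a = 0.02, or the steep direction diverges.
x, y = gradient_descent(a=100.0, b=1.0, lr=0.019, steps=100)
print(f"x = {x:.6f}, y = {y:.6f}")
```

The steep direction forces a small learning rate, and with that learning rate the shallow direction has barely moved after 100 steps. A per-parameter learning rate, as in AdaGrad, is one way to address this.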

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adagrad
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Create a neural network model (an explicit Input layer avoids the input_shape warning)
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])

# Compile the model with Adagrad optimizer
optimizer = Adagrad(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)

# Evaluate the model on test data
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.4f}')
```

Output: `accuracy: 0.9667 - loss: 0.2201`, `Test Accuracy: 0.9667`


# **Adam**

Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used for training deep learning models. It combines the advantages of both AdaGrad and RMSProp algorithms by adapting the learning rates for each parameter based on the first and second moments of the gradients. Adam is known for its efficiency, robustness, and ability to handle various optimization landscapes effectively.

Here's how Adam works and how you can implement it in a neural network using Keras with TensorFlow:

**Adaptive Learning Rates**: Adam adapts the learning rates for each parameter by considering two main factors:

- **First Moment (Mean)**: It calculates the exponentially decaying average of past gradients.
- **Second Moment (Variance)**: It calculates the exponentially decaying average of past squared gradients.

**Bias Correction**: Adam incorporates bias correction to account for the initialization bias of the first and second moment estimates, especially in the early stages of training.

**Update Rule**: The update rule for Adam is defined as follows:


\begin{align*}
m_t & = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t & = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\hat{m}_t & = \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t & = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\end{align*}

where:
   
- $m_t$ and $v_t$ are the estimates of the first and second moments of the gradients at time step $t$, and $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected versions.
- $\beta_1$ and $\beta_2$ are exponential decay rates for the first and second moments (typically set to 0.9 and 0.999, respectively).
- $g_t$ is the gradient at time step $t$.
- $\theta_t$ is the parameter value at time step $t$.
- $\eta$ is the learning rate.
- $\epsilon$ is a small constant (e.g., $10^{-7}$) to prevent division by zero.
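
The five equations translate almost line for line into code. The sketch below applies them to a single scalar parameter. The test function $f(\theta) = \theta^2$ and the starting point are assumptions for illustration, and `adam_step` is a hypothetical helper name.

```python
import math


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Assumed test function f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
    print(t, theta)
```

On the first step the bias correction makes $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude.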




```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Reuses X_train, y_train, X_test, y_test from the AdaGrad example above

# Create a neural network model (an explicit Input layer avoids the input_shape warning)
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])

# Compile the model with Adam optimizer
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)

# Evaluate the model on test data
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.4f}')
```

Output: `accuracy: 1.0000 - loss: 0.0781`, `Test Accuracy: 1.0000`
