# 1. Batch Gradient Descent (Full Gradient Descent)
Batch Gradient Descent computes the gradient of the loss function with respect to the parameters for the entire training dataset and updates the parameters in the direction of the negative gradient.

![image.png](attachment:image.png)
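
To make the update rule concrete, here is a minimal pure-Python sketch of batch gradient descent on a toy 1-D linear regression. The function name, the four-point dataset, and the hyperparameter values are illustrative assumptions, not part of the original notebook.

```python
# Batch gradient descent for y ≈ w*x + b with a mean squared error loss.
def batch_gradient_descent(xs, ys, learning_rate=0.1, epochs=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the loss averaged over the ENTIRE dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Exactly one parameter update per pass over the data.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(w, b)  # should be close to 2 and 1
```

Because the loss here is convex, every step moves along the exact gradient and the parameters settle at the true slope and intercept.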

## Advantages:
- Convergence is more stable because it uses the full dataset to calculate gradients.
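- Each step follows the exact gradient of the training loss, so on convex problems (such as linear regression) it converges to the global minimum given a suitable learning rate.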

## Disadvantages:
- Can be very slow and computationally expensive for large datasets.
- Requires more memory, because the gradient is computed over the whole dataset at once.

## Use Case: 
##### Suitable for small to medium-sized datasets that fit in memory. Because every update requires a full pass over the data, it becomes slow and resource-intensive for very large datasets.

## Example: 
##### Linear regression on a dataset with a few thousand samples.

![image.png](attachment:image.png)

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),  # input_shape must be a tuple, hence the trailing comma
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Batch Gradient Descent
optimizer = SGD(learning_rate=0.01)
# learning_rate: Determines the step size for each iteration. Common values range from 0.01 to 0.1.

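# binary_crossentropy matches the single sigmoid output unit
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
# batch_size=len(X_train): each epoch performs one update computed from the whole dataset
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=len(X_train))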

# 2. Stochastic Gradient Descent (SGD)
SGD updates the weights by computing the gradient of the loss with respect to the weights for a single data point at a time.

![image.png](attachment:image.png)
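
As a rough illustration, the sketch below applies one update per training sample to the same kind of toy linear regression. The function name, dataset, seed, and learning rate are assumptions chosen for the example.

```python
import random

# Stochastic gradient descent for y ≈ w*x + b: one update per sample.
def stochastic_gradient_descent(xs, ys, learning_rate=0.05, epochs=300, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = w * xs[i] + b - ys[i]
            # Gradient of the squared error for this ONE sample only.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = stochastic_gradient_descent(xs, ys)
print(w, b)  # should be close to 2 and 1
```

The toy data is noise-free, so the per-sample updates agree on a single solution. On real data the parameters would keep jittering around the minimum unless the learning rate is decayed.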

## Advantages:
- Simplicity and ease of implementation.
- Memory efficiency since it uses only one sample at a time.

## Disadvantages:
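- Updates are noisy, so the loss fluctuates from step to step and convergence can be slow.
- Highly sensitive to the learning rate.
- Can still get stuck in local minima, although the noise in its updates makes this less likely than with batch gradient descent.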

## Use Case: 
##### Useful for large datasets and online learning where the model is updated incrementally with each training example. It introduces noise into the optimization process, which can help escape local minima.

## Example:
##### Real-time recommendation systems where the model needs to be updated frequently with new user data.


![image.png](attachment:image.png)

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Stochastic Gradient Descent
optimizer = SGD(learning_rate=0.01)
# learning_rate: Determines the step size for each iteration. Common values range from 0.01 to 0.1.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
# batch_size=1: the weights are updated after every individual sample
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=1)

# 3. Mini-Batch Gradient Descent
Mini-batch gradient descent is a variant of SGD in which each update uses the gradient computed on a small batch of samples.

![image.png](attachment:image.png)
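
The following is a minimal sketch of the idea, assuming a toy linear regression problem. The function name, the batch size of 4, the dataset, and the seed are illustrative choices only.

```python
import random

# Mini-batch gradient descent for y ≈ w*x + b: one update per small batch.
def minibatch_gradient_descent(xs, ys, batch_size=4, learning_rate=0.05,
                               epochs=2000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
            # Average the gradient over the samples in this batch only.
            grad_w = sum(2 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2 * e for e, _ in errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [0.5 * i for i in range(8)]   # 0.0, 0.5, ..., 3.5
ys = [2.0 * x + 1.0 for x in xs]   # toy data generated from y = 2x + 1
w, b = minibatch_gradient_descent(xs, ys)
print(w, b)  # should be close to 2 and 1
```

Averaging over a batch smooths out the per-sample noise of SGD while still giving several updates per epoch.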

## Advantages:
- Balances the efficiency of SGD and the stability of batch gradient descent.
- Reduces variance in the weight updates, leading to more stable convergence.

## Disadvantages:
- Requires tuning of the batch size.
- Still sensitive to learning rate.

## Use Case: 
##### Balances between batch gradient descent and SGD by updating the model using small batches of data. This improves computational efficiency and can provide more stable convergence.

## Example: 
##### Training deep neural networks on large datasets where a batch size of 32, 64, or 128 is common.

![image.png](attachment:image.png)

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Mini-Batch Gradient Descent
optimizer = SGD(learning_rate=0.01)
# learning_rate: Determines the step size for each iteration. Common values range from 0.01 to 0.1.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 4. Momentum
Momentum accelerates SGD by adding a fraction of the previous update to the current update.

![image.png](attachment:image.png)
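
A small sketch can show the effect. It assumes a toy loss f(x, y) = 0.5 * (x² + 25y²), a narrow valley, and the velocity form `v = momentum * v - learning_rate * grad`. The helper names and starting point are hypothetical.

```python
def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def sgd_momentum(w, learning_rate=0.02, momentum=0.9, steps=200):
    w = list(w)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        for i in range(len(w)):
            # Keep a fraction of the previous update, then add the new gradient step.
            velocity[i] = momentum * velocity[i] - learning_rate * g[i]
            w[i] += velocity[i]
    return w

def distance_to_minimum(w):
    return sum(v * v for v in w) ** 0.5

start = [5.0, 5.0]
plain = sgd_momentum(start, momentum=0.0)          # momentum=0 is plain gradient descent
with_momentum = sgd_momentum(start, momentum=0.9)
print(distance_to_minimum(plain), distance_to_minimum(with_momentum))
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum at (0, 0), because the velocity builds up along the shallow direction of the valley.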

## Advantages:
- Can help accelerate convergence and reduce oscillations.
- Accumulated velocity can carry the parameters through shallow local minima and flat regions.

## Disadvantages:
- Requires tuning of the momentum parameter.
- Can overshoot the minima if not properly tuned.

## Use Case: 
##### Helps accelerate SGD in relevant directions and dampens oscillations. Useful in cases where the optimization landscape has high curvature, small but consistent gradients, or noisy gradients.

## Example: 
##### Training convolutional neural networks for image classification tasks.

![image.png](attachment:image.png)

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
# learning_rate: Determines the step size for each iteration. Common values range from 0.01 to 0.1.
# momentum: Factor for the momentum term (0.0 to 1.0, commonly 0.9).

# Note: Keras has no separate Momentum optimizer class; momentum is enabled through the SGD argument above.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 5. Nesterov Accelerated Gradient (NAG)
An improvement over momentum, NAG evaluates the gradient at the look-ahead position that the momentum step is about to reach, rather than at the current parameters, and adjusts the update accordingly.

![image.png](attachment:image.png)
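
The look-ahead step can be sketched as follows, using the classic formulation in which the gradient is evaluated at `w + momentum * velocity`. The toy loss, helper names, and starting point are assumptions for illustration, and library implementations may use an algebraically rearranged form.

```python
def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def nesterov_momentum(w, learning_rate=0.02, momentum=0.9, steps=200):
    w = list(w)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        # Look ahead to where the momentum step alone would take the parameters.
        lookahead = [w[i] + momentum * velocity[i] for i in range(len(w))]
        g = gradient(lookahead)  # gradient at the look-ahead point, not at w
        for i in range(len(w)):
            velocity[i] = momentum * velocity[i] - learning_rate * g[i]
            w[i] += velocity[i]
    return w

w = nesterov_momentum([5.0, 5.0])
distance = sum(v * v for v in w) ** 0.5
print(distance)  # distance to the minimum at (0, 0)
```

Evaluating the gradient at the look-ahead point lets the update react before the momentum step overshoots.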

## Advantages:
- Provides a more accurate update direction.
- Can lead to faster convergence than standard momentum.

## Disadvantages:
- More complex to implement.
- Requires careful tuning of parameters.

## Use Case: 
##### An improvement over momentum that looks ahead to the future position of the parameters. Effective in scenarios where momentum struggles with too much overshooting.

## Example: 
##### Training recurrent neural networks for time-series prediction.

![image.png](attachment:image.png)

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Nesterov Accelerated Gradient
optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
# learning_rate: Determines the step size for each iteration. Common values range from 0.01 to 0.1.
# momentum: Factor for the momentum term (0.0 to 1.0, commonly 0.9).

# nesterov=True switches SGD from standard momentum to Nesterov momentum; Keras has no separate Nesterov optimizer class.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 6. Adagrad (Adaptive Gradient Algorithm)
Adagrad adapts the learning rate for each parameter based on the historical gradients.

![image.png](attachment:image.png)
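
A minimal sketch of the per-parameter scaling, assuming a toy quadratic loss and an accumulator that starts at zero (library defaults may differ). The helper names and hyperparameter values are illustrative.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def adagrad(w, learning_rate=1.0, epsilon=1e-7, steps=200):
    w = list(w)
    accumulator = [0.0] * len(w)  # running SUM of squared gradients, per parameter
    for _ in range(steps):
        g = gradient(w)
        for i in range(len(w)):
            accumulator[i] += g[i] ** 2
            # Each parameter's step is divided by the root of its own history.
            w[i] -= learning_rate * g[i] / (math.sqrt(accumulator[i]) + epsilon)
    effective_rates = [learning_rate / (math.sqrt(a) + epsilon) for a in accumulator]
    return w, effective_rates

w, rates = adagrad([5.0, 5.0])
print(w, rates)
```

The parameter with the larger gradients (the second one) ends up with the smaller effective learning rate, and both rates only ever shrink because the accumulator never decreases.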

## Advantages:
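- Needs less manual tuning of the learning rate, since each parameter's step size adapts automatically (an initial learning rate must still be set).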
- Effective for sparse data.

## Disadvantages:
- The accumulated squared gradients only grow, so the effective learning rate keeps shrinking and can become too small, leading to slow convergence.

## Use Case: 
##### Adapts the learning rate for each parameter based on the frequency of updates, making it suitable for dealing with sparse data and feature-specific learning rates.

## Example: 
##### Natural language processing tasks like text classification, where certain features (words) are sparse.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adagrad

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Adagrad
optimizer = Adagrad(learning_rate=0.01)
# learning_rate: Common values are around 0.01.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 7. RMSprop (Root Mean Square Propagation)
RMSprop addresses the diminishing learning rate problem in Adagrad by using a moving average of squared gradients.

![image.png](attachment:image.png)
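
The moving-average update can be sketched like this, assuming a toy quadratic loss. The helper names, starting point, and step count are hypothetical, and the moving average starts at zero here.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def rmsprop(w, learning_rate=0.01, rho=0.9, epsilon=1e-7, steps=1000):
    w = list(w)
    avg_sq = [0.0] * len(w)  # exponentially decaying AVERAGE of squared gradients
    for _ in range(steps):
        g = gradient(w)
        for i in range(len(w)):
            # Old gradients fade out, so the denominator does not grow without bound.
            avg_sq[i] = rho * avg_sq[i] + (1 - rho) * g[i] ** 2
            w[i] -= learning_rate * g[i] / (math.sqrt(avg_sq[i]) + epsilon)
    return w

w = rmsprop([1.0, -2.0])
print(w)  # both coordinates end up near 0
```

Because the step is normalized by the recent gradient magnitude, both coordinates move at a similar pace despite the 25x difference in curvature. With a constant learning rate the parameters hover within a small neighborhood of the minimum rather than converging exactly.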

## Advantages:
- Effective for non-stationary objectives.
- Maintains a good balance between convergence speed and stability.

## Disadvantages:
- Requires tuning of decay rate.
- Can be sensitive to initial learning rate.

## Use Case: 
##### Maintains a moving average of squared gradients to normalize the gradient, suitable for non-stationary problems and mini-batch learning. Addresses the diminishing learning rates problem in Adagrad.

## Example: 
##### Training deep reinforcement learning models.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with RMSprop
optimizer = RMSprop(learning_rate=0.001, rho=0.9)
# learning_rate: Typically around 0.001.
# rho: Decay rate for moving average (usually around 0.9).

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 8. Adam (Adaptive Moment Estimation)
Adam combines the ideas of momentum and RMSprop, maintaining a moving average of both the gradients and their squares.

![image.png](attachment:image.png)
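
As an illustration of how the two moving averages interact, here is a minimal sketch of the Adam update with bias correction. It assumes a toy quadratic loss, and the helper names and starting point are not from the original notebook.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def adam(w, learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-7, steps=2000):
    w = list(w)
    m = [0.0] * len(w)  # moving average of gradients (first moment)
    v = [0.0] * len(w)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta_1 * m[i] + (1 - beta_1) * g[i]
            v[i] = beta_2 * v[i] + (1 - beta_2) * g[i] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1 - beta_1 ** t)
            v_hat = v[i] / (1 - beta_2 ** t)
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

w = adam([1.0, -2.0])
print(w)  # both coordinates end up near 0
```

The first moment plays the role of momentum, while dividing by the root of the second moment gives each parameter its own step size, as in RMSprop.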

## Advantages:
- Works well in practice for a wide range of problems.
- Requires less tuning of the learning rate.

## Disadvantages:
- Stores two moving averages per parameter, so it needs more memory and computation per step than plain SGD.
- Can sometimes lead to poor generalization.

## Use Case: 
##### Combines the advantages of both RMSprop and momentum by maintaining moving averages of both the gradients and their second moments. Widely used due to its robustness and efficiency.

## Example: 
##### General-purpose optimizer for training deep neural networks in various domains such as computer vision, NLP, and more.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Adam
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
# learning_rate: Typically around 0.001.
# beta_1: Exponential decay rate for the first moment (usually 0.9).
# beta_2: Exponential decay rate for the second moment (usually 0.999).
# epsilon: Small value to prevent division by zero (e.g., 1e-07).

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 9. AdaDelta
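AdaDelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate by averaging squared gradients over a decaying window instead of summing them over all past steps.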

![image.png](attachment:image.png)
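
A rough sketch of the update, assuming a toy quadratic loss. It follows the common formulation in which the step is the ratio of two running RMS values. The helper names, `epsilon=1e-6`, and the step count are assumptions made for the example.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def adadelta(w, learning_rate=1.0, rho=0.95, epsilon=1e-6, steps=200):
    w = list(w)
    avg_sq_grad = [0.0] * len(w)   # decaying average of squared gradients
    avg_sq_delta = [0.0] * len(w)  # decaying average of squared updates
    for _ in range(steps):
        g = gradient(w)
        for i in range(len(w)):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g[i] ** 2
            # Step size = RMS of past updates / RMS of recent gradients.
            delta = -(math.sqrt(avg_sq_delta[i] + epsilon)
                      / math.sqrt(avg_sq_grad[i] + epsilon)) * g[i]
            avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta ** 2
            w[i] += learning_rate * delta
    return w

start = [1.0, -2.0]
w = adadelta(start)
print(loss(start), loss(w))  # the loss after training is lower
```

The updates start out very small (on the order of the square root of `epsilon`) and grow as the average of past updates builds up, which is why AdaDelta can be slow in its first steps.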

## Advantages:
- Reduces the need to set an initial learning rate.
- Adapts the learning rate based on a moving window of gradient updates.

## Disadvantages:
- More complex than Adagrad.
- May not perform as well as Adam on some problems.

## Use Case: 
##### Suitable for models that need adaptive per-parameter learning rates without Adagrad's continual learning-rate decay.

## Example: 
##### Training models with adaptive learning rate requirements without needing to set a default learning rate.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adadelta

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with AdaDelta
optimizer = Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-07)
# learning_rate: Usually 1.0 (Adadelta adjusts this internally).
# rho: Decay rate for moving average (typically around 0.95).
# epsilon: Small value to prevent division by zero (e.g., 1e-07).

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 10. Adamax
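Adamax is a variant of Adam that replaces the moving average of squared gradients with an exponentially weighted infinity norm, that is, a decaying running maximum of gradient magnitudes.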

![image.png](attachment:image.png)
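
The difference from Adam lies in the denominator, as the sketch below tries to show. It assumes a toy quadratic loss, and the helper names, learning rate, and starting point are illustrative.

```python
def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def adamax(w, learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-7, steps=2000):
    w = list(w)
    m = [0.0] * len(w)  # moving average of gradients (first moment)
    u = [0.0] * len(w)  # exponentially weighted infinity norm of gradients
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta_1 * m[i] + (1 - beta_1) * g[i]
            # Running maximum with decay, instead of an average of squares.
            u[i] = max(beta_2 * u[i], abs(g[i]))
            # Only the first moment needs bias correction.
            w[i] -= (learning_rate / (1 - beta_1 ** t)) * m[i] / (u[i] + epsilon)
    return w

w = adamax([1.0, -2.0])
print(w)  # both coordinates end up near 0
```

Because `u` is a maximum rather than an average, a single large gradient sets the scale of the denominator and then fades only slowly.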

## Advantages:
- Sometimes more stable than Adam.
- Effective for large models.

## Disadvantages:
- Less commonly used, less empirical evidence on its performance.

## Use Case: 
##### Works well when dealing with sparse gradients or when the model has large parameter magnitudes.

## Example: 
##### Deep learning applications with very sparse updates or extremely large parameter values.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adamax

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Adamax
optimizer = Adamax(learning_rate=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
# learning_rate: Usually around 0.002.
# beta_1: Exponential decay rate for the first moment (usually 0.9).
# beta_2: Exponential decay rate for the second moment (usually 0.999).
# epsilon: Small value to prevent division by zero (e.g., 1e-07).

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# 11. Nadam
Nadam combines Adam and NAG by incorporating Nesterov momentum into Adam.

![image.png](attachment:image.png)
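
A minimal sketch of one common simplified form of the Nadam update, with a constant `beta_1`. Library implementations may add a momentum schedule, so treat this as an approximation. The toy loss, helper names, and starting point are assumptions.

```python
import math

def gradient(w):
    """Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return [w[0], 25.0 * w[1]]

def nadam(w, learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-7, steps=2000):
    w = list(w)
    m = [0.0] * len(w)  # moving average of gradients (first moment)
    v = [0.0] * len(w)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta_1 * m[i] + (1 - beta_1) * g[i]
            v[i] = beta_2 * v[i] + (1 - beta_2) * g[i] ** 2
            m_hat = m[i] / (1 - beta_1 ** t)
            v_hat = v[i] / (1 - beta_2 ** t)
            # Nesterov-style numerator: blend the momentum estimate with the current gradient.
            nesterov_m = beta_1 * m_hat + (1 - beta_1) * g[i] / (1 - beta_1 ** t)
            w[i] -= learning_rate * nesterov_m / (math.sqrt(v_hat) + epsilon)
    return w

w = nadam([1.0, -2.0])
print(w)  # both coordinates end up near 0
```

Compared with Adam, the only change in this form is the numerator, which gives the current gradient extra weight on top of the accumulated momentum.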

## Advantages:
- Can provide better convergence properties than Adam.
- Effective for many types of neural networks.

## Disadvantages:
- More complex to implement.
- Requires careful parameter tuning.

## Use Case: 
##### Combines NAG with Adam, replacing Adam's standard momentum with Nesterov momentum while keeping its adaptive learning rate. Useful for achieving faster convergence in deep learning models.

## Example: 
##### Complex neural network architectures where both acceleration and adaptive learning rates are beneficial, such as training GANs.

## Example Code:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Nadam

# Sample data
X_train, y_train = ...  # Replace with actual training data
X_val, y_val = ...  # Replace with actual validation data

# Sample model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile and train with Nadam
optimizer = Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
# learning_rate: Typically around 0.001.
# beta_1: Exponential decay rate for the first moment (usually 0.9).
# beta_2: Exponential decay rate for the second moment (usually 0.999).
# epsilon: Small value to prevent division by zero (e.g., 1e-07).

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)  # Example mini-batch size of 32

# Choosing an Optimizer
The choice of optimizer depends on the specific problem, the architecture of the neural network, and the characteristics of the data. Here are some guidelines:
- SGD is often used when training a simple model on a very large dataset.
- Momentum and NAG are good choices when SGD is too slow.
- Adagrad and RMSprop are useful for dealing with sparse data or problems with noisy gradients.
- Adam and its variants (Adamax, Nadam) are popular choices for most deep learning applications due to their adaptive nature and good performance across various tasks.
- AdaDelta can be used when Adagrad's diminishing learning rate becomes a problem.