# Gradient Descent Optimization

**Note:** The PyTorch code snippets provided in this document are illustrative examples. They demonstrate how to use specific PyTorch components in a simplified context and may require additional definitions (e.g., model architecture, data loading, complete training loops) to be fully runnable.

## 2. Gradient Descent Optimization

Finding the best settings (weights and biases) for a neural network is complex. We can't just solve an equation. Instead, we use iterative numerical methods.

* **Goal:** Find the weights $w$ that minimize an error function $E(w)$.
* **Process:**
    1.  Start with an initial guess for weights: $w^{(0)}$.
    2.  Update weights in steps: $w^{(\tau)} = w^{(\tau-1)} + \Delta w^{(\tau-1)}$
        * $\tau$ is the iteration step.
        * Different algorithms choose $\Delta w^{(\tau)}$ differently.
* The initial choice of $w^{(0)}$ can affect the solution. It's often good to try multiple random starting points.

### 2.1 Use of Gradient Information

* The **gradient** ($\nabla E$) tells us the direction of the steepest increase in error.
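* Using gradient information makes the search for a minimum far cheaper than relying on error-function evaluations alone.
    * Without gradients, locating the minimum of a local quadratic approximation costs on the order of $\mathcal{O}(W^3)$ operations (W = number of parameters): about $\mathcal{O}(W^2)$ function evaluations, each costing $\mathcal{O}(W)$.
    * With gradients (computed by backpropagation), each evaluation yields $W$ pieces of information, so the total cost drops to roughly $\mathcal{O}(W^2)$ operations.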
* This efficiency is why gradient-based methods are standard for training neural networks.

### 2.2 Batch Gradient Descent

* **Idea:** Take a small step in the direction of the negative gradient (steepest decrease in error).
* **Update rule:** $w^{(\tau)} = w^{(\tau-1)} - \eta \nabla E(w^{(\tau-1)})$
    * $\eta$ (eta) is the **learning rate** (a small positive number controlling step size).
* The gradient $\nabla E$ is calculated using the **entire training dataset** at each step.
* This is called a "batch" method.
* **PyTorch Note:** Full batch gradient descent is less common in deep learning with large datasets. Optimizers like SGD are typically used with mini-batches.
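
The update rule above can be sketched in plain Python on a toy one-parameter least-squares problem. The data, the learning rate `eta`, and the helper name `grad_E` are illustrative assumptions, not part of the source material.

```python
# Toy data generated by y = 3x; the model is y_hat = w * x (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def grad_E(w):
    """Gradient of E(w) = sum_n 0.5 * (w * x_n - y_n)^2 over the ENTIRE dataset."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

eta = 0.01   # learning rate
w = 0.0      # initial guess w^(0)
for tau in range(200):
    w = w - eta * grad_E(w)   # w^(tau) = w^(tau-1) - eta * grad E(w^(tau-1))

print(w)   # approaches the true slope 3
```

Every step touches all four data points, which is what makes this a batch method.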

### 2.3 Stochastic Gradient Descent (SGD)

* Batch methods are slow for very large datasets.
* **Idea:** Update weights based on **one data point at a time**.
* The error function is a sum of errors for each data point: $E(w) = \sum_{n=1}^{N} E_n(w)$.
* **Update rule:** $w^{(\tau)} = w^{(\tau-1)} - \eta \nabla E_n(w^{(\tau-1)})$
    * Cycle through the data points.
    * One full pass through all data = one **training epoch**.
* **Advantages:**
    * More efficient for large, redundant datasets.
    * Can help escape poor local minima because the error surface for a single point is different from the overall error surface.
* PyTorch `optim.SGD` docs: [https://pytorch.org/docs/stable/generated/torch.optim.SGD.html](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) (Used with mini-batches as shown below)
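
For comparison with the batch rule, here is a minimal plain-Python sketch of the per-data-point update on a toy least-squares problem. The data, `eta`, and the helper name `grad_E_n` are assumptions made for illustration.

```python
# Toy data generated by y = 3x; the model is y_hat = w * x (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def grad_E_n(w, x, y):
    """Gradient of the single-point error E_n(w) = 0.5 * (w * x - y)^2."""
    return (w * x - y) * x

eta = 0.01
w = 0.0
num_updates = 0
for epoch in range(100):            # one epoch = one full pass through the data
    for x, y in zip(xs, ys):        # cycle through the data points
        w = w - eta * grad_E_n(w, x, y)
        num_updates += 1

print(w, num_updates)
```

Each epoch performs one weight update per data point, rather than a single update for the whole dataset.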

### 2.4 Mini-batches

* **Problem with SGD:** Gradients from single data points are noisy estimates of the true gradient.
* **Idea:** Use a small subset of data points (a **mini-batch**) to calculate the gradient at each step.
    * A compromise between batch and pure stochastic gradient descent.
* **Mini-batch size:**
    * The standard error of the mini-batch gradient estimate scales as $\sigma/\sqrt{N}$ (here N = batch size), so quadrupling the batch size only halves the noise. Very large batches therefore give diminishing returns.
    * Often chosen based on hardware efficiency (e.g., powers of 2 like 32, 64, 128).
* **Important:** Randomly shuffle data before forming mini-batches to avoid correlations. Reshuffle between epochs.
* Often still called "SGD" even when using mini-batches.

* **PyTorch Example (Using `DataLoader` for Mini-batches with SGD):**
    ```python
    import torch
    import torch.optim as optim
    import torch.nn as nn
    from torch.utils.data import TensorDataset, DataLoader

    # Assume X_train and y_train are your training data and labels as tensors
    X_train = torch.randn(100, 10) # 100 samples, 10 features
    y_train = torch.randn(100, 1)  # 100 labels
    
    # Example model
    model = nn.Linear(10, 1) 
    
    # SGD Optimizer (typically used with mini-batches)
    optimizer_sgd = optim.SGD(model.parameters(), lr=0.01) 

    batch_size = 32
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    criterion = nn.MSELoss()  # Example loss for this regression setup

    # Minimal training loop over mini-batches
    num_epochs = 10
    for epoch in range(num_epochs):
        for inputs, labels in train_loader:  # inputs and labels form one mini-batch
            optimizer_sgd.zero_grad()        # Clear gradients from the previous step
            outputs = model(inputs)          # Forward pass
            loss = criterion(outputs, labels)
            loss.backward()                  # Backpropagate
            optimizer_sgd.step()             # Update the weights
    ```
    * PyTorch `DataLoader` docs: [https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
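
The $\sigma/\sqrt{N}$ scaling of the mini-batch estimate can be checked with a small simulation. This is a sketch in plain Python: the "per-example gradients" are hypothetical noisy scalars, and the helper name `spread_of_batch_means` is an assumption.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-example "gradients": noisy scalars with mean 1.0 and sigma 2.0.
population = [random.gauss(1.0, 2.0) for _ in range(20000)]

def spread_of_batch_means(batch_size, trials=2000):
    """Standard deviation of the mini-batch mean over many random mini-batches."""
    means = [statistics.fmean(random.sample(population, batch_size))
             for _ in range(trials)]
    return statistics.stdev(means)

spreads = {b: spread_of_batch_means(b) for b in (4, 16, 64)}
for batch_size, spread in spreads.items():
    print(batch_size, round(spread, 3))
```

Each fourfold increase in batch size should roughly halve the spread, which is why very large batches buy little extra accuracy per example processed.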

### 2.5 Parameter Initialization

* Iterative algorithms need a starting point for weights $w^{(0)}$.
* **Symmetry Breaking:** If all weights connected to a set of hidden units are initialized the same (e.g., to zero), those units will always learn the same thing, making them redundant.
    * **Solution:** Initialize weights randomly (e.g., from a uniform or Gaussian distribution).
* **Distribution choice:**
    * Uniform: $[-\epsilon, \epsilon]$
    * Gaussian: $\mathcal{N}(0, \epsilon^2)$
* **He Initialization:** A common heuristic for choosing $\epsilon$, especially with ReLU activation functions. Aims to keep the variance of activations stable across layers. For a unit with M inputs: $\epsilon = \sqrt{2/M}$.
* **Biases:** Often initialized to small positive values (e.g., 0.01 or 0.1) to ensure ReLU units are active initially.
* **Transfer Learning:** Initialize with weights from a network trained on a different (but related) task.

* **PyTorch Example (Default Initialization):**
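    PyTorch layers (like `nn.Linear`, `nn.Conv2d`) have default initializations. For instance, current versions of `nn.Linear` draw both weights and biases from a uniform distribution on $[-1/\sqrt{M}, 1/\sqrt{M}]$, where $M$ is the number of inputs (`in_features`). The weights are set via `kaiming_uniform_` with `a=sqrt(5)`, which is not the standard He setting for ReLU, and the layer has no knowledge of the activation that follows it. For explicit He initialization and other schemes, see `torch.nn.init` (e.g., `kaiming_normal_`).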
    ```python
    import torch.nn as nn
    
    # Default initialization when creating a layer
    linear_layer = nn.Linear(10, 5) # Weights and biases are initialized automatically
    # The specific default depends on the PyTorch version and layer type.
    # print(linear_layer.weight)
    # print(linear_layer.bias)
    ```
    * PyTorch `torch.nn.init` docs: [https://pytorch.org/docs/stable/nn.init.html](https://pytorch.org/docs/stable/nn.init.html)
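
To see what "keeping the variance of activations stable" means in practice, the following plain-Python sketch pushes a random input through a stack of ReLU layers whose weights use the He scale $\epsilon = \sqrt{2/M}$. The layer width, the depth, zero biases, and the helper names are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def he_normal(num_inputs, num_units):
    """Weights drawn from N(0, eps^2) with eps = sqrt(2 / M), M = num_inputs."""
    eps = math.sqrt(2.0 / num_inputs)
    return [[random.gauss(0.0, eps) for _ in range(num_inputs)]
            for _ in range(num_units)]

def relu_layer(weights, inputs):
    """Fully connected layer (zero biases, for simplicity) followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def mean_square(values):
    return sum(v * v for v in values) / len(values)

M = 256
z = [random.gauss(0.0, 1.0) for _ in range(M)]
scale_before = mean_square(z)
for _ in range(5):                      # five stacked ReLU layers
    z = relu_layer(he_normal(M, M), z)
scale_after = mean_square(z)

print(scale_before, scale_after)        # same order of magnitude
```

With a much smaller or larger $\epsilon$, the mean-square activation would shrink or grow geometrically with depth instead of staying at a similar scale.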

## 3. Convergence

How fast and reliably does our algorithm find the minimum?

* **The "Valley" Problem:** If the error surface has very different curvatures in different directions (like a long, narrow valley), standard gradient descent can be slow.
    * The negative gradient often doesn't point directly to the minimum.
    * It might oscillate across the narrow part of the valley, making slow progress along the length of the valley.
    * <img src="image/Figure_3.png" width="60%">
* **Learning Rate $\eta$:**
    * Too small: Very slow learning.
    * Too large: Oscillations can become unstable and diverge.
* **Mathematical Insight (Quadratic Approximation near minimum):**
    * The distance to the minimum along each principal direction (eigenvector $u_i$ of the Hessian matrix) changes by a factor $(1 - \eta \lambda_i)$ at each step, where $\lambda_i$ is the eigenvalue.
    * For convergence: $|1 - \eta \lambda_i| < 1$. This limits $\eta < 2/\lambda_{max}$.
    * Convergence rate is often limited by the smallest eigenvalue $\lambda_{min}$ (slowest direction).
    * If $\lambda_{min} / \lambda_{max}$ is very small (ill-conditioned Hessian), progress is slow.
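
The factor $(1 - \eta \lambda_i)$ can be explored directly. The sketch below uses hypothetical eigenvalues for an ill-conditioned quadratic and tracks the distance to the minimum along each eigen-direction; the helper name `distances_after` is an assumption.

```python
# Hypothetical Hessian eigenvalues for a long, narrow valley.
lam_min, lam_max = 0.1, 10.0

def distances_after(eta, steps, start=1.0):
    """Distance to the minimum along (lam_min, lam_max) directions after `steps` updates."""
    return tuple(abs(start * (1.0 - eta * lam) ** steps)
                 for lam in (lam_min, lam_max))

stable = distances_after(eta=0.19, steps=100)     # eta < 2 / lam_max = 0.2
unstable = distances_after(eta=0.21, steps=100)   # eta > 2 / lam_max

print(stable)     # both shrink, but the lam_min direction shrinks slowly
print(unstable)   # the lam_max direction grows without bound
```

Even at the largest stable learning rate, the low-curvature direction contracts by only $(1 - 0.19 \times 0.1) \approx 0.98$ per step, which is the slow progress along the valley floor described above.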

### 3.1 Momentum

* **Idea:** Add "inertia" to the movement through weight space to smooth out oscillations and speed up progress in consistent directions.
* **Update rule:** $\Delta w^{(\tau-1)} = -\eta \nabla E(w^{(\tau-1)}) + \mu \Delta w^{(\tau-2)}$
    * $\mu$ (mu) is the **momentum parameter** (e.g., 0.9).
    * The current update includes a fraction of the previous update.
* **Effect:**
    * In flat regions (low curvature), effective learning rate becomes $\eta / (1-\mu)$, speeding up.
        * <img src="image/Figure_4.png" width="60%">
    * In oscillatory regions (high curvature), momentum terms tend to cancel, damping oscillations.
        * <img src="image/Figure_5.png" width="60%">
    * Overall, faster convergence without divergent oscillations.
        * <img src="image/Figure_6.png" width="60%">
* Introduces another hyperparameter $\mu$ to tune ($0 \le \mu < 1$).

* **PyTorch Example (SGD with Momentum):**
    ```python
    import torch.optim as optim
    import torch.nn as nn
    model = nn.Linear(10,1) # Example model
    optimizer_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    ```
    * (Momentum is a parameter of `optim.SGD`. See [PyTorch optim.SGD docs](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html))

* **Nesterov Momentum:** A slight modification. First, make a step based on previous momentum, *then* calculate the gradient at this new "lookahead" position.
    * Update: $\Delta w^{(\tau-1)} = -\eta \nabla E(w^{(\tau-1)} + \mu \Delta w^{(\tau-2)}) + \mu \Delta w^{(\tau-2)}$
    * Can improve convergence for batch gradient descent; less clear for SGD.
* **PyTorch Example (SGD with Nesterov Momentum):**
    ```python
    import torch.optim as optim
    import torch.nn as nn
    model = nn.Linear(10,1) # Example model
    optimizer_nesterov = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
    ```
    * (Nesterov is a parameter of `optim.SGD`. See [PyTorch optim.SGD docs](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html))
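
The effective learning rate $\eta / (1-\mu)$ in regions of constant slope can be checked with a few lines of plain Python. The constant gradient `g` and the parameter values are assumptions for illustration.

```python
eta, mu = 0.1, 0.9
g = 1.0          # constant gradient, i.e. a region of constant slope (assumed)

delta_w = 0.0    # previous update, Delta w^(tau-2)
for tau in range(200):
    delta_w = -eta * g + mu * delta_w   # momentum update rule

effective_rate = -delta_w / g
print(effective_rate, eta / (1 - mu))   # the two values agree closely
```

The update settles where $\Delta w = -\eta g + \mu \Delta w$, which rearranges to $\Delta w = -\frac{\eta}{1-\mu} g$: a tenfold larger step than plain gradient descent when $\mu = 0.9$.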


### 3.2 Learning Rate Schedule

* **Idea:** Change the learning rate $\eta$ during training.
    * Start with a larger $\eta$ for faster initial progress.
    * Reduce $\eta$ over time to allow finer adjustments as we get closer to the minimum.
* **Update rule with schedule:** $w^{(\tau)} = w^{(\tau-1)} - \eta^{(\tau-1)} \nabla E_n(w^{(\tau-1)})$
* **Examples of schedules:**
    * Linear decay: $\eta^{(\tau)} = (1-\tau/K)\eta^{(0)} + (\tau/K)\eta^{(K)}$ for $\tau \le K$, with $\eta$ held at $\eta^{(K)}$ afterwards.
    * Power law decay: $\eta^{(\tau)} = \eta^{(0)}(1+\tau/s)^{-c}$, with $c > 0$.
    * Exponential decay: $\eta^{(\tau)} = \eta^{(0)}c^{\tau/s}$, with $0 < c < 1$.
* Hyperparameters ($\eta^{(0)}$, K, s, c) need empirical tuning. Monitoring the learning curve (error vs. iterations) is helpful.

* **PyTorch Example (Learning Rate Scheduler `StepLR`):**
    ```python
    import torch.optim as optim
    from torch.optim.lr_scheduler import StepLR
    import torch.nn as nn

    model = nn.Linear(10,1) # Example model
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    # Multiplies the LR by gamma (0.1) every step_size (30) epochs
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1) 

    # In a training loop (after optimizer.step()):
    # num_epochs = 100
    # for epoch in range(num_epochs):
    #     # ... train for one epoch: zero_grad, forward, loss, backward, optimizer.step() ...
    #     scheduler.step() # Update the learning rate
    ```
    * PyTorch `lr_scheduler` docs: [https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)
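
The three schedule formulas listed above translate directly into small functions. This is a plain-Python sketch: the function names and the sample hyperparameter values are assumptions, and holding the linear schedule constant after $K$ steps is one common convention.

```python
def linear_decay(tau, eta0, etaK, K):
    """Linear interpolation from eta0 to etaK over K steps, constant afterwards."""
    t = min(tau, K) / K
    return (1 - t) * eta0 + t * etaK

def power_law_decay(tau, eta0, s, c):
    """eta0 * (1 + tau / s) ** (-c), with c > 0."""
    return eta0 * (1 + tau / s) ** (-c)

def exponential_decay(tau, eta0, s, c):
    """eta0 * c ** (tau / s), with 0 < c < 1."""
    return eta0 * c ** (tau / s)

for tau in (0, 50, 100, 200):
    print(tau,
          linear_decay(tau, eta0=0.1, etaK=0.001, K=100),
          power_law_decay(tau, eta0=0.1, s=10, c=1.0),
          exponential_decay(tau, eta0=0.1, s=10, c=0.5))
```

Printing or plotting the values for a few candidate settings is a cheap way to check that the rate stays in a sensible range over the planned number of steps.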

### 3.3 RMSProp and Adam

These algorithms adapt the learning rate for *each parameter individually*.

* **AdaGrad (Adaptive Gradient):**
    * Reduces learning rates for parameters that have had large gradients historically.
    * Accumulates squared gradients: $r_i^{(\tau)} = r_i^{(\tau-1)} + (\frac{\partial E}{\partial w_i})^2$
    * Update: $w_i^{(\tau)} = w_i^{(\tau-1)} - \frac{\eta}{\sqrt{r_i^{(\tau)}} + \delta} \frac{\partial E}{\partial w_i}$
        * $\delta$ is a small constant for numerical stability.
    * **Problem:** Learning rates can become too small, prematurely stopping learning.

* **PyTorch Example (AdaGrad Optimizer):**
    ```python
    import torch.optim as optim
    import torch.nn as nn
    model = nn.Linear(10,1) # Example model
    optimizer_adagrad = optim.Adagrad(model.parameters(), lr=0.01)
    ```
    * PyTorch `optim.Adagrad` docs: [https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
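
The AdaGrad equations can be written out in a few lines of plain Python. This sketch uses two parameters with hypothetical constant gradients that differ by a factor of 100; the function name `adagrad_step` and the values of `eta` and `delta` are assumptions.

```python
import math

def adagrad_step(w, grad, r, eta=0.1, delta=1e-8):
    """One AdaGrad update for a list of parameters; returns the new (w, r)."""
    r = [r_i + g_i ** 2 for r_i, g_i in zip(r, grad)]           # accumulate squares
    w = [w_i - eta / (math.sqrt(r_i) + delta) * g_i
         for w_i, r_i, g_i in zip(w, r, grad)]
    return w, r

w, r = [0.0, 0.0], [0.0, 0.0]
step_sizes = []
for tau in range(3):
    w_new, r = adagrad_step(w, [100.0, 1.0], r)
    step_sizes.append([abs(a - b) for a, b in zip(w_new, w)])
    w = w_new

for sizes in step_sizes:
    print(sizes)
```

Both parameters move by the same amount despite the very different gradient magnitudes, and the steps shrink as $1/\sqrt{\tau}$ because $r_i$ only ever grows. That shrinkage is the premature slowdown noted above.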


* **RMSProp (Root Mean Square Propagation):**
    * Addresses AdaGrad's problem by using an exponentially weighted moving average of squared gradients, instead of accumulating them indefinitely.
    * Moving average: $r_i^{(\tau)} = \beta r_i^{(\tau-1)} + (1-\beta)(\frac{\partial E}{\partial w_i})^2$ (typical $\beta=0.9$)
    * Update: $w_i^{(\tau)} = w_i^{(\tau-1)} - \frac{\eta}{\sqrt{r_i^{(\tau)}} + \delta} \frac{\partial E}{\partial w_i}$

* **PyTorch Example (RMSprop Optimizer):**
    ```python
    import torch.optim as optim
    import torch.nn as nn
    model = nn.Linear(10,1) # Example model
    # alpha plays the role of beta in the formula (PyTorch's default is 0.99; the text's typical value is 0.9)
    optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99) 
    ```
    * PyTorch `optim.RMSprop` docs: [https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
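
A plain-Python sketch of the RMSProp equations for a single parameter shows how the moving average keeps the step size from decaying to zero. The constant gradient and the function name `rmsprop_step` are assumptions for illustration.

```python
import math

def rmsprop_step(w, grad, r, eta=0.01, beta=0.9, delta=1e-8):
    """One RMSProp update for a scalar parameter; returns the new (w, r)."""
    r = beta * r + (1 - beta) * grad ** 2          # moving average of squared gradients
    w = w - eta / (math.sqrt(r) + delta) * grad
    return w, r

w, r = 0.0, 0.0
steps = []
for tau in range(100):
    w_new, r = rmsprop_step(w, 2.0, r)             # constant gradient of 2.0 (assumed)
    steps.append(abs(w_new - w))
    w = w_new

print(steps[0], steps[-1])
```

With a constant gradient, $r$ settles near the squared gradient, so the step size settles near $\eta$ instead of shrinking indefinitely as in AdaGrad. The first few steps are larger than $\eta$ because $r$ starts at zero, which is the start-up bias that Adam's correction terms address.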


* **Adam (Adaptive Moments):**
    * Combines RMSProp with momentum.
    * Stores exponentially weighted moving averages of both the gradients (1st moment, like momentum) and squared gradients (2nd moment, like RMSProp).
        * **Why per-parameter momentum?** Allows each parameter to have its own "velocity" or "inertia". Parameters with consistent gradient directions accelerate, while those with oscillating gradients have their updates dampened. This adapts the update strength individually for each parameter.
    * 1st moment: $s_i^{(\tau)} = \beta_1 s_i^{(\tau-1)} + (1-\beta_1)\frac{\partial E}{\partial w_i}$
    * 2nd moment: $r_i^{(\tau)} = \beta_2 r_i^{(\tau-1)} + (1-\beta_2)(\frac{\partial E}{\partial w_i})^2$
    * Bias correction (important early in training as $s, r$ start at 0):
        * $\hat{s}_i^{(\tau)} = s_i^{(\tau)} / (1-\beta_1^\tau)$
        * $\hat{r}_i^{(\tau)} = r_i^{(\tau)} / (1-\beta_2^\tau)$
    * Update: $w_i^{(\tau)} = w_i^{(\tau-1)} - \eta \frac{\hat{s}_i^{(\tau)}}{\sqrt{\hat{r}_i^{(\tau)}} + \delta}$
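    * Typical values: $\beta_1=0.9$, $\beta_2=0.99$ (the original Adam paper and PyTorch's `optim.Adam` default to $\beta_2=0.999$, as used in the example below).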
    * Adam is a very popular and often default choice for deep learning optimization.

* **PyTorch Example (Adam Optimizer):**
    ```python
    import torch.optim as optim
    import torch.nn as nn
    model = nn.Linear(10,1) # Example model
    optimizer_adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    ```
    * PyTorch `optim.Adam` docs: [https://pytorch.org/docs/stable/generated/torch.optim.Adam.html](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)
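
The Adam equations, including bias correction, can be sketched for a single parameter in plain Python. The function name `adam_step` and the two hypothetical gradient values are assumptions; the hyperparameter defaults mirror the PyTorch example above.

```python
import math

def adam_step(w, grad, s, r, tau, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update for a scalar parameter; tau counts steps starting at 1."""
    s = beta1 * s + (1 - beta1) * grad              # 1st moment
    r = beta2 * r + (1 - beta2) * grad ** 2         # 2nd moment
    s_hat = s / (1 - beta1 ** tau)                  # bias-corrected moments
    r_hat = r / (1 - beta2 ** tau)
    w = w - eta * s_hat / (math.sqrt(r_hat) + delta)
    return w, s, r

first_steps = []
for g in (50.0, 0.02):                              # very different gradient scales
    w1, _, _ = adam_step(0.0, g, s=0.0, r=0.0, tau=1)
    first_steps.append(w1)
    print(g, w1)
```

On the very first step the bias-corrected moments equal the gradient and its square, so the update has magnitude close to $\eta$ regardless of the gradient scale. Without the correction, the first update would be about $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2$ times larger with these settings.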


## 4. Normalization

Adjusting the scale of variables can make training more effective and stable.

### 4.1 Data Normalization

* **Problem:** Input variables might have vastly different ranges (e.g., height in meters vs. platelet count in 100,000s).
    * This can lead to an error surface with very different curvatures (like the "valley" problem), making training hard.
    * <img src="image/Figure_7.png" width="60%">
* **Solution:** Rescale input features before training.
    1.  Calculate mean ($\mu_i$) and standard deviation ($\sigma_i$) for each feature $i$ across the training set.
        * $\mu_i = \frac{1}{N} \sum_{n=1}^N x_{ni}$
        * $\sigma_i^2 = \frac{1}{N} \sum_{n=1}^N (x_{ni} - \mu_i)^2$
    2.  Normalize: $\tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i}$
        * Normalized features have zero mean and unit variance.
* **Important:** Use the *same* $\mu_i$ and $\sigma_i$ (calculated from training data) to normalize validation and test data.

* **PyTorch Example (Data Normalization with `transforms` for image data):**
    This is typically done as a preprocessing step, often when loading image data.
    ```python
    import torchvision.transforms as transforms
    import torch
    from PIL import Image # For image loading example

    # Example for image data (e.g., 3 channels: R, G, B)
    # These are typical means and stds for ImageNet dataset
    means = [0.485, 0.456, 0.406] 
    stds = [0.229, 0.224, 0.225]
    
    data_transform = transforms.Compose([
        transforms.Resize(256), # Example transform
        transforms.CenterCrop(224), # Example transform
        transforms.ToTensor(), # Converts PIL image or numpy.ndarray to tensor
        transforms.Normalize(mean=means, std=stds)
    ])
    
    # Apply this transform when creating your Dataset or loading an image
    # Assuming you have an image file "path_to_image.jpg"
    # try:
    #     img = Image.open("path_to_image.jpg")
    #     normalized_img_tensor = data_transform(img)
    # except FileNotFoundError:
    #     print("Image file not found, using dummy tensor for demonstration.")
    #     normalized_img_tensor = torch.randn(3, 224, 224)


    # For general tensor data (not images):
    X = torch.randn(100, 5) # 100 samples, 5 features
    mean_vals = X.mean(dim=0, keepdim=True) # Mean per feature
    std_vals = ((X - mean_vals) ** 2).mean(dim=0, keepdim=True).sqrt() # Std per feature (1/N form, matching the formula above)
    X_normalized = (X - mean_vals) / (std_vals + 1e-5) # Small constant avoids division by zero
    ```
    * PyTorch `transforms.Normalize` docs: [https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html](https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html)
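
The rule about reusing training statistics is easy to get wrong, so here is a minimal plain-Python sketch of it. The data values and the helper names `fit_standardizer` and `standardize` are hypothetical.

```python
import math

def fit_standardizer(rows):
    """Per-feature mean and standard deviation (1/N convention) from training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def standardize(rows, means, stds):
    """Apply (x - mu_i) / sigma_i to every row using the given statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: height in metres, platelet count per microlitre.
train = [[1.60, 250000.0], [1.75, 310000.0], [1.82, 190000.0], [1.68, 400000.0]]
test = [[1.70, 280000.0]]

means, stds = fit_standardizer(train)          # statistics from TRAINING data only
train_norm = standardize(train, means, stds)
test_norm = standardize(test, means, stds)     # reuse the same mu_i and sigma_i
print(test_norm)
```

After normalization both training features have zero mean and unit variance, even though their raw scales differ by about five orders of magnitude.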

### 4.2 Batch Normalization (Batch Norm)

* **Idea:** Apply normalization to the activations *within* hidden layers of the network, not just inputs.
* **Motivation:**
    * **Internal Covariate Shift:** The distribution of inputs to deeper layers changes as the parameters of earlier layers are updated. Batch Norm aims to reduce this.
    * **Vanishing/Exploding Gradients:** In very deep networks, gradients can become extremely small or large as they are backpropagated. Batch Norm helps stabilize gradients.
* **How it works (for pre-activations $a_i$ in a layer, per mini-batch of size K):**
    1.  Calculate mini-batch mean: $\mu_i = \frac{1}{K} \sum_{n=1}^K a_{ni}$
    2.  Calculate mini-batch variance: $\sigma_i^2 = \frac{1}{K} \sum_{n=1}^K (a_{ni} - \mu_i)^2$
    3.  Normalize: $\hat{a}_{ni} = \frac{a_{ni} - \mu_i}{\sqrt{\sigma_i^2 + \delta}}$
    4.  **Scale and Shift:** Introduce learnable parameters $\gamma_i$ (scale) and $\beta_i$ (shift) for each hidden unit $i$.
        * $\tilde{a}_{ni} = \gamma_i \hat{a}_{ni} + \beta_i$
        * This allows the network to learn the optimal mean and variance for each unit, if needed. $\gamma_i$ and $\beta_i$ are learned along with weights.
* **During Inference (Testing):**
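    * Mini-batch statistics may be unavailable or unreliable (e.g., when predicting for a single example).
    * Use moving averages of $\mu_i$ and $\sigma_i^2$ (and the learned $\gamma_i, \beta_i$) collected during training. PyTorch's batch norm layers (`nn.BatchNorm1d`, `nn.BatchNorm2d`, ...) switch between the two behaviors according to `model.train()` or `model.eval()` mode.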
* **Why it works well is still debated:** Original motivation was reducing internal covariate shift, but newer studies suggest it makes the optimization landscape smoother.
* Illustration: <img src="image/Figure_8_a.png" width="60%"> 

* **PyTorch Example (Batch Normalization Layer):**
    ```python
    import torch.nn as nn
    import torch

    # For 1D data (e.g., after a Linear layer before activation)
    num_features_1d = 64 # Number of features from the previous layer
    batch_norm_1d = nn.BatchNorm1d(num_features_1d)

    # For 2D data (e.g., after a Conv2d layer, num_features is num_channels)
    num_channels_2d = 16
    batch_norm_2d = nn.BatchNorm2d(num_channels_2d)

    # Example usage in a model:
    model_bn = nn.Sequential(
        nn.Linear(10, num_features_1d),
        nn.BatchNorm1d(num_features_1d), # Apply batch norm
        nn.ReLU(),
        nn.Linear(num_features_1d, 1)
    )
    
    # Dummy input for demonstration
    # input_tensor = torch.randn(32, 10) # Batch of 32, 10 features
    # model_bn.train() # Set model to training mode
    # output_train = model_bn(input_tensor)
    # model_bn.eval()  # Set model to evaluation mode (uses running stats)
    # output_eval = model_bn(input_tensor)
    ```
    * PyTorch `nn.BatchNorm1d` docs: [https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html)
    * PyTorch `nn.BatchNorm2d` docs: [https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
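
The four batch norm steps can be written out by hand in plain Python to make the "per unit, across the mini-batch" direction explicit. The function names and data are assumptions. For brevity, the inference call reuses the statistics of a single mini-batch, where a real layer would use moving averages over many batches.

```python
import math

def batch_norm_train(batch, gamma, beta, delta=1e-5):
    """Normalise each unit i over the K examples of a mini-batch, then scale and shift."""
    K = len(batch)
    columns = list(zip(*batch))
    mus = [sum(col) / K for col in columns]
    variances = [sum((a - mu) ** 2 for a in col) / K for col, mu in zip(columns, mus)]
    out = [[g * (a - mu) / math.sqrt(var + delta) + b
            for a, mu, var, g, b in zip(row, mus, variances, gamma, beta)]
           for row in batch]
    return out, mus, variances

def batch_norm_eval(row, stored_mu, stored_var, gamma, beta, delta=1e-5):
    """Inference for one example, using stored statistics instead of batch statistics."""
    return [g * (a - mu) / math.sqrt(var + delta) + b
            for a, mu, var, g, b in zip(row, stored_mu, stored_var, gamma, beta)]

batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]  # K = 4 examples, 2 units
gamma, beta = [1.0, 1.0], [0.0, 0.0]                              # initial scale and shift
normalized, mus, variances = batch_norm_train(batch, gamma, beta)
single = batch_norm_eval([2.5, 500.0], mus, variances, gamma, beta)
print(normalized)
print(single)
```

Both units end up with zero mean and roughly unit variance across the batch, and a single example can still be processed at inference time because the statistics no longer depend on the examples it happens to be batched with.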

### 4.3 Layer Normalization (Layer Norm)

* **Alternative to Batch Norm.**
* **Motivation:**
    * Batch Norm can be tricky if mini-batch size is very small (noisy stats).
    * Hard to apply in some architectures (e.g., Recurrent Neural Networks), where the activation statistics change at each time step. Layer norm is also the usual choice in Transformers.
* **How it works (for pre-activations $a_{ni}$ for a single data point $n$, across M hidden units in a layer):**
    1.  Calculate mean across units for data point $n$: $\mu_n = \frac{1}{M} \sum_{i=1}^M a_{ni}$
    2.  Calculate variance across units for data point $n$: $\sigma_n^2 = \frac{1}{M} \sum_{i=1}^M (a_{ni} - \mu_n)^2$
    3.  Normalize: $\hat{a}_{ni} = \frac{a_{ni} - \mu_n}{\sqrt{\sigma_n^2 + \delta}}$
    4.  **Scale and Shift:** Also uses learnable $\gamma_i$ and $\beta_i$ for each unit $i$: $\tilde{a}_{ni} = \gamma_i \hat{a}_{ni} + \beta_i$.
* **Key Difference from Batch Norm:** Normalization is done *across features/hidden units for each data point independently*, instead of across data points in a batch for each feature/hidden unit independently.
* The same normalization function can be used during training and inference (no need for moving averages).
* Illustration: <img src="image/Figure_8_b.png" width="40%">

* **PyTorch Example (Layer Normalization Layer):**
    ```python
    import torch.nn as nn
    import torch

    # normalized_shape is the shape of the input tensor part to be normalized.
    # If input is (N, C, H, W), normalized_shape can be [C, H, W] to normalize over C, H, W.
    # If input is (N, L, D) for sequences, normalized_shape can be [D] (normalize last dim - common in Transformers)
    # or [L,D] (normalize last two dims).
    
    # Example: Input tensor of shape (batch_size, num_features)
    num_features = 64
    # normalized_shape is [num_features]
    layer_norm_fc = nn.LayerNorm(num_features) 

    # Example: Input tensor for sequences (batch_size, sequence_length, embedding_dim)
    embedding_dim = 128
    # This will normalize over the embedding_dim for each element in the sequence
    layer_norm_seq = nn.LayerNorm(embedding_dim) 
    
    # Example usage in a model:
    model_ln = nn.Sequential(
        nn.Linear(10, num_features),
        nn.LayerNorm(num_features), # Apply layer norm
        nn.ReLU(),
        nn.Linear(num_features, 1)
    )
    dummy_input = torch.randn(32, 10) # Batch of 32, 10 features
    output = model_ln(dummy_input)
    ```
    * PyTorch `nn.LayerNorm` docs: [https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
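
A hand-written version in plain Python makes the key difference from batch norm concrete: the statistics come from a single data point, so the result does not depend on what else is in the batch. The function name `layer_norm` and the data are assumptions.

```python
import math

def layer_norm(row, gamma, beta, delta=1e-5):
    """Normalise one data point across its M hidden units, then scale and shift per unit."""
    M = len(row)
    mu = sum(row) / M
    var = sum((a - mu) ** 2 for a in row) / M
    return [g * (a - mu) / math.sqrt(var + delta) + b
            for a, g, b in zip(row, gamma, beta)]

gamma, beta = [1.0] * 4, [0.0] * 4
a = [1.0, 2.0, 3.0, 4.0]

out_alone = layer_norm(a, gamma, beta)
# The same data point processed as part of a batch with an unrelated example.
out_in_batch = [layer_norm(row, gamma, beta)
                for row in ([10.0, 0.0, -5.0, 2.0], a)][1]

print(out_alone)
print(out_alone == out_in_batch)
```

Because nothing is shared across examples, the same function serves for training and inference, and no moving averages need to be stored.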

---
**Reference:**
This summary is based on concepts primarily discussed in:
* Bishop, C. M. (2024). *Deep Learning: Foundations and Concepts*. Springer. (Chapter 7: Gradient Descent).
