# Fitting Models

In the first few chapters we learnt how to pass data forward through a (shallow or deep) network. <br>
In the previous chapter we discussed how to measure the mismatch between the network predictions and the ground truth for a training set.

These are crucial but partial aspects of model creation: after receiving a value from the loss function, we must correct the model so that it improves. <br>
This is known as **fitting** or **training** the model.

This is a process where we improve our parameters by:

1. Computing the derivatives of the loss with respect to the parameters.
2. Adjusting the parameters based on these gradients to reduce the loss.

Note that in Chapter 1 we were able to find a closed-form solution for the linear regression problem; however, when the network becomes too complex, this is no longer a viable option. <br>
Instead we'll follow an iterative approach which, under certain general conditions, produces a good approximation.

Here we focus on step (2) in the process of improving the parameters.

---

<div  align="center">

### Optimization Problem

</div>


 $$\boxed{\hat{\phi} = \text{argmin}_{\phi}\big[L[\phi]\big]}$$

## Gradient Descent

$\hspace{7cm} \textbf{Step 0: Initialise parameters to some values: }$
$$ \phi = [\phi_0, \phi_1, \dots, \phi_k]^T$$


$$\boxed{\begin{aligned}
&\textbf{Step 1: Up-hill - Compute the derivatives of the loss with respect to the parameters} \\
&\hspace{6cm} \frac{\partial L}{\partial \phi} = \begin{bmatrix} \frac{\partial L}{\partial \phi_0} \\ \frac{\partial L}{\partial \phi_1} \\ \vdots \\ \frac{\partial L}{\partial \phi_k} \end{bmatrix} \\[1em]
&\textbf{Step 2: Down-Hill - Update the parameters according to the rule} \\
&\hspace{6cm} \phi = \phi - \alpha \cdot \frac{\partial L}{\partial \phi}
\end{aligned}}$$
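
The two steps above can be sketched as a short loop. This is a minimal illustration rather than a definitive implementation: the helper name `gradient_descent`, the toy quadratic loss, and the values of `alpha` and `n_steps` are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad_fn, phi_init, alpha=0.1, n_steps=100):
    """Repeat Step 1 (compute the gradient) and Step 2 (move against it)."""
    phi = np.asarray(phi_init, dtype=float).copy()
    for _ in range(n_steps):
        grad = grad_fn(phi)       # Step 1: dL/dphi at the current parameters
        phi = phi - alpha * grad  # Step 2: step downhill
    return phi

# Hypothetical toy loss L[phi] = (phi_0 - 3)^2 + (phi_1 + 1)^2, minimised at [3, -1].
target = np.array([3.0, -1.0])

def toy_grad(phi):
    return 2.0 * (phi - target)

phi_hat = gradient_descent(toy_grad, phi_init=[0.0, 0.0])
print(phi_hat)
```

Because this toy loss is a convex bowl, the loop ends up very close to the known minimiser regardless of the starting point.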

### 1D Linear Regression Example 

In the first chapter we showed the closed-form solution; now we'll present the iterative solution to this problem.

$$\boxed{\text{model: } f[x, \phi] \qquad \text{parameters:} \phi = [\phi_0, \phi_1]^T \qquad \text{Input: }x \in \mathbb{R}}$$

$$\boxed{y = f[x, \phi] = \phi_0 + \phi_1 x}$$

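We know that our loss is the least-squares loss (the sum of squared errors, which differs from the mean squared error only by a constant factor $1/N$):

$$L[\phi] = \sum_{i=1}^N l_i = \sum_{i=1}^N (f[x_i, \phi] - y_i)^2 =  \sum_{i=1}^N (\phi_0 + \phi_1 x_i - y_i)^2$$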





**Step 1** 

$$\frac{\partial L[\phi]}{\partial \phi} = \frac{\partial }{\partial \phi}\sum_{i=1}^N l_i \underset{\text{linearity }}{=} \sum_{i=1}^N\frac{\partial l_i}{\partial \phi} = \sum_{i=1}^N \begin{bmatrix} \frac{\partial l_i}{\partial \phi_0} \\ \frac{\partial l_i}{\partial \phi_1} \end{bmatrix} = \sum_{i=1}^N \begin{bmatrix} 2\cdot 1(\phi_0 + \phi_1x_i - y_i) \\ 2 x_i(\phi_0 + \phi_1x_i - y_i) \end{bmatrix}$$

**Step 2**

$$ \phi = \phi - \alpha \frac{\partial L}{\partial \phi} $$
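
As a sanity check, the sketch below runs Steps 1 and 2 on a small synthetic dataset and compares the result with the closed-form least-squares solution. The data, the learning rate `alpha`, and the number of iterations are assumptions made for illustration and do not come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=20)
y = 1.0 + 0.5 * x + 0.05 * rng.normal(size=20)  # synthetic data (assumption)

def gradient(phi):
    """Sum over examples of [2 * r_i, 2 * x_i * r_i], with r_i the residual."""
    residual = phi[0] + phi[1] * x - y
    return np.array([np.sum(2.0 * residual), np.sum(2.0 * x * residual)])

phi = np.zeros(2)  # Step 0: initialise
alpha = 0.005      # small enough for this dataset; too large a value diverges
for _ in range(5000):
    phi = phi - alpha * gradient(phi)  # Steps 1 and 2

# Closed-form least-squares solution for comparison.
A = np.column_stack([np.ones_like(x), x])
phi_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

print(phi, phi_closed)
```

Since the gradient here is a sum over all examples, its magnitude grows with the dataset size, so a workable `alpha` shrinks as $N$ grows.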

<div align="center">

**Gradient Descent Visualization:**

<img src="./images/chap5/1dgradDescent_reg.gif" alt="Gradient Descent Animation" width="500" />
<img src="./images/chap5/lossSpace.png" alt="Loss Space" width="390" />

*If the animation doesn't display, [click here to view the video](./images/chap5/1dgradDescent_reg.gif)*

</div>

---
---

### Gabor model Example 




The loss function for a linear regression problem is **convex**: it is bowl-shaped, so any local minimum is also the global minimum. <br>
The advantage of this property is that gradient descent, with a suitably small learning rate, is guaranteed to converge to this minimum. <br>

However, the loss functions for most non-linear models are **non-convex**, meaning the loss landscape isn't "bowl-shaped".<br> This means the algorithm may get stuck in local minima and not necessarily find the global minimum. <br>

**Analogy:** Imagine you're hiking down a mountainous region at night, surrounded by dense forest and vegetation, trying to reach the lowest point in the valley. The only tool you have is your sense of the ground beneath your feet, which tells you the local slope (steepness) at your current position. 

- In a **convex landscape** (linear regression), there's one clear valley, and walking downhill always leads you to the bottom.
- In a **non-convex landscape** (deep networks), there are multiple valleys, hills, and plateaus. <br> Following the steepest descent might lead you to a nearby valley, but not necessarily the deepest one. You might get trapped in a local dip, unaware that a much deeper valley exists elsewhere.

This is the fundamental challenge of training deep neural networks.

We'll look at a simple nonlinear model with two parameters to understand the properties of non-convex loss functions:

$$f[x, \phi] = \sin[\phi_0 + 0.006 \cdot \phi_1 x] \cdot \exp\left(-\frac{(\phi_0 + 0.006 \cdot \phi_1 x)^2}{32.0}\right)$$

The loss function is the least-squares loss (sum of squared errors): $$L[\phi] = \sum_{i=1}^N(f[x_i, \phi] - y_i)^2$$

**Intuition:**

1. **Sine Component:** $\sin[\phi_0 + 0.006 \cdot \phi_1 x]$ 
   - Creates oscillations (waves) in the output
   - $\phi_0$ shifts the wave horizontally (phase shift)
   - $\phi_1$ controls the frequency (how quickly the wave oscillates)

2. **Exponential Component:** $\exp\left(-\frac{(\phi_0 + 0.006 \cdot \phi_1 x)^2}{32.0}\right)$
   - Acts as a Gaussian envelope that attenuates the sine wave
   - Equals 1 where the linear term is zero and decays as we move away from that point
   - The same linear term $(\phi_0 + 0.006 \cdot \phi_1 x)$ controls the rate of decay

3. **Combined Effect:**
   - Produces a sinusoidal function whose **amplitude decays** away from the centre of the envelope, so $f \to 0$ as $|x| \to \infty$
   - Both the oscillation frequency and the envelope width are set by the same linear term, so they change together
   - The parameters $\phi_0$ and $\phi_1$ jointly determine:
     - Where the pattern is centered
     - How rapidly it oscillates and how wide the envelope is
   - This creates a **non-convex loss landscape** with many local minima, making optimization challenging
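
The non-convexity can be checked numerically. The sketch below assumes the standard Gabor form, where the exponent is negative so that the Gaussian envelope decays, and uses noise-free synthetic data generated from hypothetical "true" parameters. A convex loss must satisfy $L[\tfrac{1}{2}(a+b)] \le \tfrac{1}{2}(L[a] + L[b])$ for every pair of points; here we exhibit one pair that violates it.

```python
import numpy as np

def gabor(x, phi):
    """Sinusoid multiplied by a decaying Gaussian envelope (assumed sign)."""
    z = phi[0] + 0.006 * phi[1] * x
    return np.sin(z) * np.exp(-(z ** 2) / 32.0)

def loss(phi, x, y):
    return np.sum((gabor(x, phi) - y) ** 2)

# Hypothetical training set generated from known parameters, without noise.
x = np.linspace(-15.0, 15.0, 30)
phi_true = np.array([0.0, 46.0])
y = gabor(x, phi_true)

phi_a = phi_true                  # global minimum, loss is zero
phi_b = np.array([40.0, 46.0])    # far away, the model output is almost flat
phi_mid = 0.5 * (phi_a + phi_b)   # midpoint of the segment between them

loss_a, loss_mid, loss_b = (loss(p, x, y) for p in (phi_a, phi_mid, phi_b))
print(loss_a, loss_mid, loss_b)
print("convexity violated:", loss_mid > 0.5 * (loss_a + loss_b))
```

The midpoint sits on a plateau where the model output is nearly zero, so its loss is about as large as at `phi_b`, which a bowl-shaped surface would not allow.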

<div align="center">

**Visualisation**

<style>
.zoom-img {
    transition: transform 0.3s ease;
    cursor: pointer;
}
.zoom-img:hover {
    transform: scale(1.5);
}
</style>

<table>
  <tr>
    <td align="center">
      <img src="./images/chap5/Gfunc1.png" alt="Gabor Function 1" width="400" class="zoom-img"/><br/>
      φ₀ = -50, φ₁ = 46
    </td>
    <td align="center">
      <img src="./images/chap5/Gfunc2.png" alt="Gabor Function 2" width="400" class="zoom-img"/><br/>
      φ₀ = 100, φ₁ = 46
    </td>
  </tr>
  <tr>
    <td align="center">
      <img src="./images/chap5/Gfunc3.png" alt="Gabor Function 3" width="400" class="zoom-img"/><br/>
      φ₀ = -10, φ₁ = 46
    </td>
    <td align="center">
      <img src="./images/chap5/Gfunc4.png" alt="Gabor Function 4" width="400" class="zoom-img"/><br/>
      φ₀ = -10, φ₁ = 24
    </td>
  </tr>
</table>

*Different parameter combinations showing how the wave pattern and its envelope change*

</div>





Now that we have a better understanding of the Gabor function construction, let's explore what happens when we try to fit it to data.

Suppose we're provided with a training set $\{x_i, y_i\}_{i=1}^N$

### Local Minima and Saddle Points

The images below show points in the loss space alongside the model fits that the corresponding parameter values produce in the data space, illustrating the challenges of non-convex optimization.

**Key observations:**

1. **Multiple Local Minima:** The loss landscape contains many points where the gradient is zero. When gradient descent reaches such a point, it halts, even though it may not be the global minimum.

2. **Ambiguity Problem:** When the algorithm stops, we have no way to determine whether we've reached a local minimum or the global minimum—they look identical from the gradient's perspective.

3. **Saddle Points:** In practice, algorithms stop once the gradient is close to zero rather than exactly zero, so we may stop at a **saddle point**: a point where the gradient is zero but which is a minimum along some directions and a maximum along others (like the top of a mountain pass). Looking at the gradient alone, a saddle point is indistinguishable from a minimum.
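
A small numerical illustration of the third observation, using a toy surface that is not part of the chapter: $g(a, b) = a^2 - b^2$. The gradient vanishes at the origin, yet the second derivatives (the Hessian, which goes slightly beyond what these notes cover) reveal that the point is a saddle.

```python
import numpy as np

# Toy surface g(a, b) = a^2 - b^2: a minimum along a, a maximum along b.
def grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

# The matrix of second derivatives is constant for this surface.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.zeros(2)
grad_norm = np.linalg.norm(grad(origin))   # zero, exactly as at a minimum
eigenvalues = np.linalg.eigvalsh(hessian)  # curvature along each principal axis

print(grad_norm)
print(eigenvalues)  # mixed signs indicate a saddle; all positive would be a minimum
```

Gradient descent only ever looks at `grad`, which is why it cannot tell the two cases apart.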

<div align="center">
 <img src="./images/chap5/minimasSpoint.png" alt="Gabor Function 4" width="500"/><br/>
</div>







## Stochastic Gradient Descent




### Motivation: Escaping Local Minima

Two potential approaches to address the local minima problem would be to either: 
1. Exhaustively try out **all** possible parameter combinations
2. Initialize the weights at multiple different starting positions

However, the number of parameters and potential minima are extremely large when dealing with Deep Neural Networks, making these approaches computationally infeasible. 

**Stochastic Gradient Descent** (SGD) tries to remedy this problem by adding **controlled randomness** to the algorithm at each step.

---

### Three Variants of Gradient Descent

#### 1. Full Batch Gradient Descent

So far in our algorithm, we've computed the gradient over the **entire dataset** before each parameter update. This is known as $\textcolor{lightblue}{\text{Full Batch}}$ or $\textcolor{lightblue}{\text{Batch}}$ Gradient Descent.

$$\boxed{\phi_{t+1} = \phi_{t} - \alpha \cdot \frac{\partial L[\phi_t]}{\partial \phi} = \phi_{t} - \alpha \cdot \sum_{i=1}^{N} \frac{\partial l_i[\phi_t]}{\partial \phi}}$$

**Characteristics:**
- ✅ Stable, smooth convergence
- ✅ Guaranteed to find minimum in convex problems
- ❌ Computationally expensive for large datasets
- ❌ Can get stuck in local minima (non-convex problems)
- ❌ Memory intensive

---

#### 2. Stochastic Gradient Descent (SGD) - Single Sample

Instead of using all data, we can update parameters based on a **single training example** at each iteration:

$$\boxed{\phi_{t+1} = \phi_{t} - \alpha \cdot \frac{\partial l_i[\phi_t]}{\partial \phi}}$$

$$\text{Where } l_i \text{ is the loss for a single randomly selected training example}$$

**Why "Stochastic"?** The term means "random" or "probabilistic"—we randomly select which training example to use at each step.

**Characteristics:**
- ✅ Very fast updates
- ✅ Can escape local minima (due to noise)
- ✅ Low memory requirements
- ❌ Very noisy gradient estimates
- ❌ Erratic convergence path
- ❌ May never fully converge

The high noise level arises because each update sees only one example, so the model never obtains the global view of the data that full batch gradient descent has.

---

#### 3. Mini-Batch SGD (The Practical Compromise)

To add a **moderate amount of randomness**, we choose a random subset of the training data and compute the gradient from this subset alone. This is known as $\textcolor{lightblue}{\text{mini-batch}}$ gradient descent; in practice the term $\textcolor{lightblue}{\text{SGD}}$ usually refers to this variant.

$$\boxed{\phi_{t+1} = \phi_{t} - \alpha \cdot \frac{1}{|B_t|}\sum_{i \in B_t} \frac{\partial l_i[\phi_t]}{\partial \phi}}$$

$$\text{Where } B_t \text{ is our current batch containing } |B_t| \text{ input-output pairs}$$

**Sampling Strategy:**
- We usually sample **without replacement** within an epoch
- Once we've used a batch for a parameter update, we select a different unseen batch
- Once we've iterated through the **entire training set**, this completes one $\textcolor{lightblue}{\text{Epoch}}$
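
The sampling strategy above can be sketched as follows: reshuffle the indices once per epoch, then walk through them in consecutive slices so that every example is used exactly once before the next epoch starts. The synthetic noise-free data, the helper `batch_gradient`, and the hyperparameter values are assumptions for illustration. Setting `batch_size = 1` would give the single-sample variant.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=100)
y = 1.0 + 0.5 * x  # noise-free synthetic targets (assumption)

def batch_gradient(phi, xb, yb):
    """Mean gradient of the squared error over one mini-batch."""
    residual = phi[0] + phi[1] * xb - yb
    return np.array([np.mean(2.0 * residual), np.mean(2.0 * xb * residual)])

phi = np.zeros(2)
alpha, batch_size, n_epochs = 0.1, 10, 200

for epoch in range(n_epochs):
    order = rng.permutation(len(x))            # new random order each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # a batch not yet seen this epoch
        phi = phi - alpha * batch_gradient(phi, x[idx], y[idx])

print(phi)
```

Because the targets contain no noise, every mini-batch agrees on the same minimiser and the iterates settle near `[1.0, 0.5]`; with noisy data they would keep jittering around the minimum instead.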

**Characteristics:**
- ✅ Balanced trade-off between speed and stability
- ✅ Can still escape local minima
- ✅ Efficient GPU utilization
- ✅ Moderate memory requirements
- ✅ Most commonly used in practice

---

### Properties of Mini-Batch SGD

1. **Sensible Updates:** Even though we're adding randomness to the learning trajectory, the algorithm still improves the fit on average at each iteration, so the updates remain meaningful.

2. **Fair Representation:** Since we're iterating through the training examples without replacement, each training example contributes equally to the optimization trajectory within an epoch.

3. **Computational Efficiency:** Processing smaller batches is less computationally expensive than processing the entire dataset, allowing for more frequent parameter updates.

4. **Escape Mechanism:** The randomness introduced by mini-batching provides a mechanism to escape local minima and saddle points that would trap standard gradient descent.
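
Properties 1 and 2 can be illustrated numerically. In the sketch below (random data and arbitrary parameter values, chosen only for the demonstration), the training set is split into equal-sized batches without replacement. Individual batch gradients differ from one another, but their average over the epoch equals the full-batch mean gradient evaluated at the same parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=120)
y = rng.normal(size=120)
phi = np.array([0.3, -0.7])  # arbitrary fixed parameters

def mean_gradient(phi, xb, yb):
    """Mean gradient of the squared error for the linear model."""
    residual = phi[0] + phi[1] * xb - yb
    return np.array([np.mean(2.0 * residual), np.mean(2.0 * xb * residual)])

order = rng.permutation(len(x))
batches = order.reshape(-1, 20)  # 6 disjoint batches covering every example once

batch_grads = np.array([mean_gradient(phi, x[b], y[b]) for b in batches])
full_grad = mean_gradient(phi, x, y)

print(batch_grads)               # noisy individual estimates
print(batch_grads.mean(axis=0))  # their average over the epoch
print(full_grad)                 # full-batch gradient at the same phi
```

During real training the parameters change between batches, so this equality holds only approximately, but it explains why the noisy updates still point in a sensible direction on average.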

---

### Comparison Summary
<div align="center">

| Method | Batch Size | Speed | Convergence | Memory | Use Case |
|--------|-----------|-------|-------------|---------|----------|
| **Full Batch** | $N$ (all data) | Slowest | Smooth | High | Small datasets, convex problems |
| **SGD** | 1 | Fastest | Noisy | Lowest | Online learning, very large datasets |
| **Mini-Batch** | 32 to 256 | Fast | Balanced | Moderate | **Most deep learning applications** |

</div>

---
---

### Implementation: How many Batches and Epochs?

Choosing the right batch size and number of epochs is crucial for effective training. Here are the key considerations:


#### Batch Size Selection 

| **Small batches (1-32):** | **Large batches (256-1024+):** | 
|---------------------------|--------------------------------|
| - ✅ More noise → better exploration of loss landscape <br> - ✅ Can escape local minima more easily <br> - ✅ Less memory required <br> - ❌ Slower training (more updates needed) <br> - ❌ Noisy gradient estimates | - ✅ Faster training (fewer updates) <br> - ✅ More stable gradient estimates <br> - ✅ Better hardware utilization (GPUs) <br> - ❌ May get stuck in sharp minima <br> - ❌ Requires more memory <br> - ❌ Less exploration|

**Common practice:** Batch sizes of 32, 64, 128, or 256 provide a good balance.

#### Number of Epochs

An **epoch** is one complete pass through the entire training dataset.

**How many epochs?**
- Too few → **underfitting** (model hasn't learned enough)
- Too many → **overfitting** (model memorizes training data)

**Practical approach:**
1. Monitor training and validation loss
2. Stop when validation loss stops improving (**early stopping**)
3. Typical range: 10-100+ epochs depending on problem complexity
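
One common way to make step 2 concrete is a "patience" counter: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The function name, the `patience` parameter, and the loss values below are hypothetical and serve only to illustrate the idea.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training would stop.

    Scans the validation losses in order and stops after `patience`
    consecutive epochs without a new best value.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # give up: no improvement for `patience` epochs
    return best_epoch

# Made-up validation curve: improves, then stalls from epoch 4 onwards.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40]
best = early_stopping_epoch(val_losses, patience=3)
print(best)
```

Note the trade-off: with `patience=3` the loop stops before reaching the final value of 0.40, so a larger patience costs more epochs but is less likely to stop on a temporary plateau.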

#### Example Calculation

Given:
- Training set size: $N = 1000$ examples
- Batch size: $B = 100$
- Number of epochs: $E = 50$

**Number of batches per epoch:** $\frac{N}{B} = \frac{1000}{100} = 10$ batches

**Total parameter updates:** $10 \times 50 = 500$ updates
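
The same calculation as a small helper. One assumption beyond the text: when $N$ is not divisible by $B$, the final smaller batch is kept rather than dropped, hence the ceiling.

```python
import math

def training_schedule(n_examples, batch_size, n_epochs):
    """Return (batches per epoch, total parameter updates)."""
    batches_per_epoch = math.ceil(n_examples / batch_size)  # keep the partial batch
    return batches_per_epoch, batches_per_epoch * n_epochs

print(training_schedule(1000, 100, 50))  # (10, 500), matching the example above
print(training_schedule(1050, 100, 50))  # (11, 550), the last batch holds 50 examples
```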

## Momentum

The main problems with SGD are:

1. **Narrow ravines:** The loss function is steep vertically but shallow horizontally
2. **Zero gradients:** The gradient is zero at local minima and saddle points
3. **Noisy gradients:** These are more prominent with smaller batches or single instances

A modification to SGD is to add a **momentum** term, where the parameters are updated with a weighted combination of the gradient computed from the current batch and the direction moved in the previous step:

$$\begin{aligned}
m_{t+1} &= \beta \cdot m_t + (1 - \beta)\sum_{i \in B_t} \frac{\partial l_i[\phi_t]}{\partial \phi} \\
\phi_{t+1} &= \phi_{t} - \alpha \cdot m_{t+1}
\end{aligned}$$

$ \beta \in [0, 1): \text{ Controls the degree to which the gradient is smoothed over time}$ <br>
$ \alpha \text{ : The learning rate}$

**Effect**

The effective learning rate increases if the gradients are aligned over multiple iterations but decreases when the gradient direction changes repeatedly. <br> This leads to a smoother trajectory.
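
A minimal sketch of the momentum update, applied to a hypothetical "narrow ravine" loss $L = \tfrac{1}{2}(\phi_0^2 + 50\,\phi_1^2)$ that is fifty times steeper in one direction than the other. The function names, the toy loss, and the values of `alpha` and `beta` are assumptions for illustration. The gradient is passed in directly, standing in for the sum over the current batch.

```python
import numpy as np

def sgd_momentum_step(phi, m, grad, alpha, beta):
    """One momentum update: smooth the gradient, then step along the average."""
    m_next = beta * m + (1.0 - beta) * grad
    phi_next = phi - alpha * m_next
    return phi_next, m_next

def ravine_grad(phi):
    """Gradient of the toy loss 0.5 * (phi_0^2 + 50 * phi_1^2)."""
    return np.array([1.0, 50.0]) * phi

phi = np.array([10.0, 1.0])
m = np.zeros(2)  # no accumulated momentum before the first step
for _ in range(500):
    phi, m = sgd_momentum_step(phi, m, ravine_grad(phi), alpha=0.03, beta=0.9)

print(phi)
```

With `beta = 0` the update reduces to plain gradient descent; larger values of `beta` average over more past gradients.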

### Nesterov Accelerated Momentum

This improves on standard momentum by computing the gradient at the predicted point $\phi_t - \alpha \beta \cdot m_t$ rather than at the current point $\phi_t$.

$$\begin{aligned}
m_{t+1} &= \beta \cdot m_t + (1 - \beta)\sum_{i \in B_t} \frac{\partial l_i[\phi_{t} - \alpha \beta \cdot m_{t}]}{\partial \phi} \\
\phi_{t+1} &= \phi_{t} - \alpha \cdot m_{t+1}
\end{aligned}$$
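
The Nesterov update differs from standard momentum only in where the gradient is evaluated. The sketch below takes a gradient function rather than a gradient value so that it can query the look-ahead point. The toy ravine loss $L = \tfrac{1}{2}(\phi_0^2 + 50\,\phi_1^2)$ and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def nesterov_step(phi, m, grad_fn, alpha, beta):
    """One Nesterov momentum update, following the equations above."""
    lookahead = phi - alpha * beta * m  # where the momentum alone would take us
    m_next = beta * m + (1.0 - beta) * grad_fn(lookahead)
    phi_next = phi - alpha * m_next
    return phi_next, m_next

def ravine_grad(phi):
    """Gradient of the toy loss 0.5 * (phi_0^2 + 50 * phi_1^2)."""
    return np.array([1.0, 50.0]) * phi

phi = np.array([10.0, 1.0])
m = np.zeros(2)
for _ in range(500):
    phi, m = nesterov_step(phi, m, ravine_grad, alpha=0.03, beta=0.9)

print(phi)
```

On the very first step `m` is zero, so the look-ahead point coincides with the current point and the update matches standard momentum.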

<div align="center">
<img  src="images/chap5/momentum.png" alt="Momentum vs None " width="500" />
<img  src="images/chap5/nesterov.png" alt="Nesterov Momentum " width="270" />
</div>