# **Perceptron: Mathematical Batched Forward Pass (PyTorch):**

**`Batching`** is a fundamental technique in deep learning training that involves dividing the entire training dataset into smaller, manageable subsets of data called **`batches`** (or **`mini-batches`**).

The **`Batch Size`** is a crucial hyperparameter that specifies the exact number of training examples contained in one batch. This batch of data is then processed through the neural network together in a single iteration **(one forward pass and one backward pass)**.
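
As a rough illustration of the definition above, the sketch below cuts a toy dataset into batches. The dataset values and the batch size of 4 are arbitrary assumptions chosen for the example, not taken from the text.

```python
import torch

# Hypothetical toy dataset: 10 samples, 3 features each (values are arbitrary).
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
batch_size = 4  # assumed batch size for this sketch

# torch.split cuts along dim 0 into chunks of `batch_size` rows;
# the last chunk is smaller when N is not a multiple of batch_size.
batches = torch.split(X, batch_size, dim=0)

batch_shapes = [tuple(b.shape) for b in batches]
print(batch_shapes)  # [(4, 3), (4, 3), (2, 3)]
```

In real training code this slicing is usually delegated to `torch.utils.data.DataLoader`, which can also shuffle the samples each epoch.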

### **Why Do We Use Batching?**

Batching is essential because it strikes a practical and computational balance between the two extremes of using a single data point or the entire dataset at once. 

**The main reasons for batching are:**

1. **`Memory Efficiency (The biggest reason)`:** Large, modern deep learning datasets (millions of images, terabytes of text) often cannot fit entirely into the memory of a single GPU or CPU. Batching allows you to process the data in chunks that fit within the hardware's capacity.

2. **`Computational Efficiency & Parallelism`:** Modern hardware accelerators like GPUs are optimized for parallel computation. Processing a small batch of inputs simultaneously (in parallel) is vastly more efficient than processing one example at a time. This significantly reduces the total training time.

3. **`Gradient Stability vs. Noise`:**  
   * Using the entire dataset gives a very stable, accurate gradient, but each update is slow.
   * Using a single example makes each update fast, but the gradient is very noisy and unstable.
   * Batching uses a small group of samples (e.g., $32, 64, 128$) to compute a **`better, less noisy estimate of the true gradient`** than a single example, while remaining computationally efficient.
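
The stability trade-off in point 3 can be checked on a toy problem. The sketch below assumes a hypothetical 1-D model $\hat{y} = w x$ with squared error loss and made-up data. It measures how far gradients computed from batches of size 1, 4, and 8 stray from the full-dataset gradient.

```python
# Hypothetical 1-D data, roughly y = 2x with a small alternating offset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5, 14.5, 15.5]
w = 0.0  # assumed current weight

def sample_grad(w, x, y):
    # d/dw of (w*x - y)^2 for a single sample
    return 2.0 * x * (w * x - y)

def batch_grads(w, xs, ys, batch_size):
    # Average gradient of each consecutive batch of `batch_size` samples
    grads = []
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        grads.append(sum(sample_grad(w, x, y) for x, y in zip(xb, yb)) / len(xb))
    return grads

def spread(grads, reference):
    # Mean squared deviation of the batch gradients from the reference gradient
    return sum((g - reference) ** 2 for g in grads) / len(grads)

full_grad = batch_grads(w, xs, ys, len(xs))[0]
spread_single = spread(batch_grads(w, xs, ys, 1), full_grad)
spread_mini = spread(batch_grads(w, xs, ys, 4), full_grad)
spread_full = spread(batch_grads(w, xs, ys, 8), full_grad)

print(spread_single, spread_mini, spread_full)  # 6963.25 5184.0 0.0
```

The data here is sorted and not shuffled, so the two mini-batches still differ noticeably. With shuffling, mini-batch gradients typically sit closer to the full gradient.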

### **Types of Batching and Gradient Descent:**

The choice of batch size directly defines the type of **`Gradient Descent`** optimization algorithm being used:

| Batching Type | Batch Size | Gradient Calculation | Update Frequency (per Epoch) | Trade-offs |
| --- | --- | --- | --- | --- |
| **1. Batch Gradient Descent ($BGD$)** | **Entire Dataset ($N$)** | Uses the average loss/gradient from **all** N samples. | **$1$** | **Pros:** Very stable, true gradient direction. **Cons:** Extremely slow for large N, high memory usage, can get stuck in sharp local minima. |
| **2. Stochastic Gradient Descent ($SGD$)** | **$1$** | Uses the loss/gradient from **one single sample** at a time. | **$N$** (Number of training samples) | **Pros:** Very fast iterations, low memory, gradient noise helps escape local minima. **Cons:** Highly noisy and unstable convergence path, low computational efficiency due to poor GPU utilization. |
| **3. Mini-Batch Gradient Descent ($MBGD$)** | **$1 < \text{Batch Size} < N$** (e.g., $32, 64, 128$) | Uses the average loss/gradient from the small subset of samples in the mini-batch. | **$\lceil N / \text{Batch Size} \rceil$** | **Pros:** Excellent balance of stability and speed, highest computational efficiency, and is the **standard method** used in deep learning. |
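
The update counts in the table can be reproduced with a few lines of arithmetic. The helper name `updates_per_epoch` and the dataset size of 1000 are assumptions for illustration. The sketch also assumes the final, smaller batch is kept rather than dropped.

```python
import math

def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    # One weight update per batch; a trailing partial batch still counts.
    return math.ceil(n_samples / batch_size)

N = 1000  # hypothetical dataset size
counts = {
    "BGD": updates_per_epoch(N, N),    # whole dataset in one batch
    "SGD": updates_per_epoch(N, 1),    # one sample per batch
    "MBGD": updates_per_epoch(N, 32),  # mini-batches of 32
}
print(counts)  # {'BGD': 1, 'SGD': 1000, 'MBGD': 32}
```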

### **Does Gradient Descent Depend on the Type of Batching?**

**Yes.** 

The type of batching (i.e., the batch size) is what fundamentally defines which of the three primary **`Gradient Descent`** variants you are using.

The gradient descent algorithm is the process of updating the model's weights ($\theta$) using the following formula:

> $$\theta \leftarrow \theta - \eta \, \nabla J(\theta)$$

**Where**:   
   * $\theta$: The model's weights and biases.
   * $\eta$: The learning rate.
   * $\nabla J(\theta)$: The **`Gradient`** of the loss function J with respect to the weights.
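
A single update step can be sketched with PyTorch autograd. The loss $J(\theta) = \sum_i \theta_i^2$, the starting values, and the learning rate below are assumptions picked so the result is easy to verify by hand.

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)  # assumed starting weights
eta = 0.1                                              # assumed learning rate

# Toy loss J(theta) = sum(theta^2), whose gradient is 2 * theta
loss = (theta ** 2).sum()
loss.backward()  # fills theta.grad with [2.0, -4.0]

# Apply the update rule without tracking it in the autograd graph
with torch.no_grad():
    theta_new = theta - eta * theta.grad

print(theta_new)  # approximately [0.8, -1.6]
```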

The core difference is in **how the gradient ($\nabla J(\theta)$) is calculated:**

* **$BGD$** computes the gradient using **all** samples, giving a highly accurate, deterministic direction.

* **$SGD$** computes the gradient using **one** sample, giving a highly noisy, stochastic (random-like) direction.

* **$MBGD$** computes the gradient using a **subset** of samples, providing a fast and reasonable approximation of the true gradient.
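
The three variants can share one loop in which only the batch size changes. The sketch below is a minimal illustration, assuming a linear model with mean squared error and random data. The function name `run_epoch` and all values are hypothetical.

```python
import torch

def run_epoch(X, y, w, lr, batch_size):
    """Run one epoch of gradient descent; return the new weights and update count."""
    n_updates = 0
    for start in range(0, X.shape[0], batch_size):
        Xb = X[start:start + batch_size]           # (b, features)
        yb = y[start:start + batch_size]           # (b, 1)
        err = Xb @ w - yb                          # prediction error per sample
        grad = 2.0 * (Xb.T @ err) / Xb.shape[0]    # gradient of the batch MSE
        w = w - lr * grad                          # one weight update per batch
        n_updates += 1
    return w, n_updates

torch.manual_seed(0)
X = torch.randn(8, 2)   # hypothetical data: 8 samples, 2 features
y = torch.randn(8, 1)
w0 = torch.zeros(2, 1)

_, updates_bgd = run_epoch(X, y, w0, lr=0.1, batch_size=8)   # BGD
_, updates_sgd = run_epoch(X, y, w0, lr=0.1, batch_size=1)   # SGD
w_mb, updates_mbgd = run_epoch(X, y, w0, lr=0.1, batch_size=4)  # MBGD
print(updates_bgd, updates_sgd, updates_mbgd)  # 1 8 2
```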

The choice of `batch size` is a critical hyperparameter that dictates the `computational efficiency`, the `memory footprint`, and the `stability` of the entire training process.

---------

**Dataset Representation ($7$ samples, $3$ features each):** 

```py
        X = [x₁₁  x₁₂  x₁₃]
            [x₂₁  x₂₂  x₂₃]
            [x₃₁  x₃₂  x₃₃]
            [x₄₁  x₄₂  x₄₃]
            [x₅₁  x₅₂  x₅₃]
            [x₆₁  x₆₂  x₆₃]
            [x₇₁  x₇₂  x₇₃]
```
Shape: $(7, 3)$ = `(batch_size, features)`

**True Labels (Binary Output):** 

```py 
        Y_true = [y₁]
                 [y₂]
                 [y₃]
                 [y₄]
                 [y₅]
                 [y₆]
                 [y₇]
```
Shape: $(7, 1)$ = `(batch_size, 1)`
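
For experimenting with these shapes, placeholder tensors can be created as follows. The random values and the seed are assumptions. The text only fixes the shapes and the binary nature of the labels.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the placeholder data is repeatable

# 7 samples, 3 features each (random placeholder values)
X = torch.randn(7, 3)

# 7 binary labels stored as a column of floats (0.0 or 1.0)
Y_true = torch.randint(0, 2, (7, 1)).float()

print(X.shape, Y_true.shape)  # torch.Size([7, 3]) torch.Size([7, 1])
```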

**Network Architecture:**  
   - **Input layer**: $3$ features
   - **Output layer**: $1$ neuron
   - **Activation function**: $\text{ReLU}(z) = \max(0, z)$
   - **Loss function**: `Squared Error Loss`

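**Weight Representation (PyTorch Convention):**  
   - **Weight matrix W**: shape $(1, 4)$ = `(out_features, in_features_with_bias)`. This follows PyTorch's `(out_features, in_features)` layout with the bias folded in as an extra first column. Note that `torch.nn.Linear` itself stores the weight as `(out_features, in_features)` and keeps the bias in a separate tensor.
   - $W = [w₀, w₁, w₂, w₃]$
     - $w₀ =$ bias term
     - $w₁, w₂, w₃ =$ weights for features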
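
The relationship between this $(1, 4)$ matrix and a standard PyTorch layer can be sketched as below. `torch.nn.Linear` keeps a $(1, 3)$ weight and a separate bias. Concatenating them with the bias first gives the layout used here. The random input is an assumption for the check.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=3, out_features=1)
print(layer.weight.shape)  # torch.Size([1, 3])
print(layer.bias.shape)    # torch.Size([1])

# Fold the bias into one (1, 4) matrix: [w0 (bias), w1, w2, w3]
W = torch.cat([layer.bias.detach().reshape(1, 1), layer.weight.detach()], dim=1)

# Check on placeholder input that both forms give the same output
X = torch.randn(7, 3)
X_aug = torch.cat([torch.ones(7, 1), X], dim=1)
same = torch.allclose(X_aug @ W.T, layer(X).detach(), atol=1e-6)
print(W.shape, same)  # torch.Size([1, 4]) True
```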

**Batched Forward Propagation (One Epoch, Batch Size $= 7$):**

**`Step 1`: Prepare Batch Input:**    
Original batch input matrix:
```py
        X = [x₁₁  x₁₂  x₁₃]
            [x₂₁  x₂₂  x₂₃]
            [x₃₁  x₃₂  x₃₃]
            [x₄₁  x₄₂  x₄₃]
            [x₅₁  x₅₂  x₅₃]
            [x₆₁  x₆₂  x₆₃]
            [x₇₁  x₇₂  x₇₃]
```
Shape: $(7, 3)$

**`Step 2`: Bias Augmentation (Prepend 1 as First Column):**    
Augmented batch input:
```py 
         X_aug = [1  x₁₁  x₁₂  x₁₃]
                 [1  x₂₁  x₂₂  x₂₃]
                 [1  x₃₁  x₃₂  x₃₃]
                 [1  x₄₁  x₄₂  x₄₃]
                 [1  x₅₁  x₅₂  x₅₃]
                 [1  x₆₁  x₆₂  x₆₃]
                 [1  x₇₁  x₇₂  x₇₃]
```
Shape: $(7, 4)$
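
In PyTorch, the augmentation step amounts to concatenating a column of ones in front of the input. The input values below are placeholders chosen only to make the result easy to inspect.

```python
import torch

# Placeholder input: 7 samples, 3 features (values 0..20 for readability)
X = torch.arange(21, dtype=torch.float32).reshape(7, 3)

# Column of ones, one per sample, prepended as the first column
ones = torch.ones(X.shape[0], 1)
X_aug = torch.cat([ones, X], dim=1)

print(X_aug.shape)  # torch.Size([7, 4])
print(X_aug[0])     # first row: 1 followed by the three features 0, 1, 2
```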

**`Step 3`: Linear Transformation (Batch Matrix Multiplication):**

**Weight Matrix:**  
 
> $W = [w₀ , w₁,  w₂ , w₃]$ 

Shape: $(1, 4)$

**Transpose for Multiplication:**
```py
         W^T = [w₀]
               [w₁]
               [w₂]
               [w₃]
```
Shape: $(4, 1)$

**Batch Computation:**
```py 
         Z = X_aug @ W^T
```
Shape: $(7, 4) \times (4, 1) \rightarrow (7, 1)$

**Expanded Calculation:**
```py 
         Z = [1  x₁₁  x₁₂  x₁₃]   [w₀]
             [1  x₂₁  x₂₂  x₂₃]   [w₁]
             [1  x₃₁  x₃₂  x₃₃] @ [w₂]
             [1  x₄₁  x₄₂  x₄₃]   [w₃]
             [1  x₅₁  x₅₂  x₅₃]
             [1  x₆₁  x₆₂  x₆₃]
             [1  x₇₁  x₇₂  x₇₃]
```

**Result:**
```py 
         Z = [z₁]   [w₀ + w₁x₁₁ + w₂x₁₂ + w₃x₁₃]
             [z₂]   [w₀ + w₁x₂₁ + w₂x₂₂ + w₃x₂₃]
             [z₃] = [w₀ + w₁x₃₁ + w₂x₃₂ + w₃x₃₃]
             [z₄]   [w₀ + w₁x₄₁ + w₂x₄₂ + w₃x₄₃]
             [z₅]   [w₀ + w₁x₅₁ + w₂x₅₂ + w₃x₅₃]
             [z₆]   [w₀ + w₁x₆₁ + w₂x₆₂ + w₃x₆₃]
             [z₇]   [w₀ + w₁x₇₁ + w₂x₇₂ + w₃x₇₃]
```

Where:
- $z₁ = w₀ + w₁x₁₁ + w₂x₁₂ + w₃x₁₃$
- $z₂ = w₀ + w₁x₂₁ + w₂x₂₂ + w₃x₂₃$
- $z₃ = w₀ + w₁x₃₁ + w₂x₃₂ + w₃x₃₃$
- $z₄ = w₀ + w₁x₄₁ + w₂x₄₂ + w₃x₄₃$
- $z₅ = w₀ + w₁x₅₁ + w₂x₅₂ + w₃x₅₃$
- $z₆ = w₀ + w₁x₆₁ + w₂x₆₂ + w₃x₆₃$
- $z₇ = w₀ + w₁x₇₁ + w₂x₇₂ + w₃x₇₃$
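
The matrix product can be checked against the per-sample formula with concrete numbers. The inputs and weights below are assumed values, chosen so that each $z_i$ is easy to compute by hand.

```python
import torch

# Placeholder data: rows are [0,1,2], [3,4,5], ..., [18,19,20]
X = torch.arange(21, dtype=torch.float32).reshape(7, 3)
X_aug = torch.cat([torch.ones(7, 1), X], dim=1)  # (7, 4)

# Assumed weights [w0, w1, w2, w3] with w0 as the bias
W = torch.tensor([[0.5, 1.0, -1.0, 2.0]])        # (1, 4)

# Batched linear transform: all 7 samples in one matrix product
Z = X_aug @ W.T                                   # (7, 1)

# Same computation one sample at a time: z_i = w0 + w1*x_i1 + w2*x_i2 + w3*x_i3
Z_loop = torch.stack([
    W[0, 0] + W[0, 1] * x[0] + W[0, 2] * x[1] + W[0, 3] * x[2] for x in X
]).reshape(7, 1)

print(Z.flatten().tolist())  # [3.5, 9.5, 15.5, 21.5, 27.5, 33.5, 39.5]
```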

**`Step 4`: Apply $ReLU$ Activation (Element-wise):**

**`ReLU` Function:** $\text{ReLU}(z) = \max(0, z)$

```py 
         Y_pred = ReLU(Z) = [ReLU(z₁)]   [max(0, z₁)]
                            [ReLU(z₂)]   [max(0, z₂)]
                            [ReLU(z₃)] = [max(0, z₃)]
                            [ReLU(z₄)]   [max(0, z₄)]
                            [ReLU(z₅)]   [max(0, z₅)]
                            [ReLU(z₆)]   [max(0, z₆)]
                            [ReLU(z₇)]   [max(0, z₇)]
```

**Predictions:**
```py 
                   Y_pred = [ŷ₁]
                            [ŷ₂]
                            [ŷ₃]
                            [ŷ₄]
                            [ŷ₅]
                            [ŷ₆]
                            [ŷ₇]
```
**Shape**: $(7, 1)$

Where:
   - $ŷ₁ = max(0, z₁)$
   - $ŷ₂ = max(0, z₂)$
   - $ŷ₃ = max(0, z₃)$
   - $ŷ₄ = max(0, z₄)$
   - $ŷ₅ = max(0, z₅)$
   - $ŷ₆ = max(0, z₆)$
   - $ŷ₇ = max(0, z₇)$
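
The element-wise activation maps directly onto `torch.relu`. The pre-activation values below are assumptions chosen to include negative, zero, and positive entries.

```python
import torch

# Assumed pre-activations for 7 samples, shape (7, 1)
Z = torch.tensor([[-2.0], [-0.5], [0.0], [0.5], [1.0], [3.0], [-1.0]])

# ReLU applied element-wise: negative entries become 0, others pass through
Y_pred = torch.relu(Z)

# Equivalent formulation as an explicit max(0, z)
Y_pred_max = torch.maximum(Z, torch.zeros_like(Z))

print(Y_pred.flatten().tolist())  # [0.0, 0.0, 0.0, 0.5, 1.0, 3.0, 0.0]
```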

### **Loss Calculation:**

**`Step 5`: Calculate Squared Error for Each Sample:**

**Squared Error Formula:** $SE = (y_{\text{true}} - y_{\text{pred}})^2$

For each sample:
```py 
         SE₁ = (y₁ - ŷ₁)²
         SE₂ = (y₂ - ŷ₂)²
         SE₃ = (y₃ - ŷ₃)²
         SE₄ = (y₄ - ŷ₄)²
         SE₅ = (y₅ - ŷ₅)²
         SE₆ = (y₆ - ŷ₆)²
         SE₇ = (y₇ - ŷ₇)²
```

**In Vector Form:**
```py 
         Errors = Y_true - Y_pred = [y₁ - ŷ₁]   [e₁]
                                    [y₂ - ŷ₂]   [e₂]
                                    [y₃ - ŷ₃] = [e₃]
                                    [y₄ - ŷ₄]   [e₄]
                                    [y₅ - ŷ₅]   [e₅]
                                    [y₆ - ŷ₆]   [e₆]
                                    [y₇ - ŷ₇]   [e₇]
```

```py 
         Squared_Errors = [e₁²]   [(y₁ - ŷ₁)²]
                          [e₂²]   [(y₂ - ŷ₂)²]
                          [e₃²] = [(y₃ - ŷ₃)²]
                          [e₄²]   [(y₄ - ŷ₄)²]
                          [e₅²]   [(y₅ - ŷ₅)²]
                          [e₆²]   [(y₆ - ŷ₆)²]
                          [e₇²]   [(y₇ - ŷ₇)²]
```

**`Step 6`: Calculate Total Loss (Sum of Squared Errors):**

**Total Loss:**
```py 
      Loss_total = SE₁ + SE₂ + SE₃ + SE₄ + SE₅ + SE₆ + SE₇

      Loss_total = (y₁ - ŷ₁)² + (y₂ - ŷ₂)² + (y₃ - ŷ₃)² + (y₄ - ŷ₄)² 
                  + (y₅ - ŷ₅)² + (y₆ - ŷ₆)² + (y₇ - ŷ₇)²
```

**Compact Notation:**
```py 
         Loss_total = Σᵢ₌₁⁷ (yᵢ - ŷᵢ)²
```
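
Steps 5 and 6 can be sketched with tensors as follows. The labels and predictions are assumed values, chosen so the squared errors are easy to add up by hand.

```python
import torch

# Assumed labels and predictions for 7 samples, each of shape (7, 1)
Y_true = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0], [0.0], [1.0]])
Y_pred = torch.tensor([[0.5], [0.0], [2.0], [1.0], [1.0], [0.5], [0.0]])

errors = Y_true - Y_pred          # e_i = y_i - y_hat_i, shape (7, 1)
squared_errors = errors ** 2      # element-wise square, shape (7, 1)
loss_total = squared_errors.sum() # scalar: sum over the batch

print(squared_errors.flatten().tolist())  # [0.25, 0.0, 1.0, 0.0, 1.0, 0.25, 1.0]
print(loss_total.item())                  # 3.5
```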

**`Step 7`: Calculate Loss Per Epoch (Mean Squared Error):**

**Loss Per Epoch (Average Loss):**
```py 
      Loss_per_epoch = Loss_total / batch_size

      Loss_per_epoch = [(y₁ - ŷ₁)² + (y₂ - ŷ₂)² + (y₃ - ŷ₃)² + (y₄ - ŷ₄)² 
                     + (y₅ - ŷ₅)² + (y₆ - ŷ₆)² + (y₇ - ŷ₇)²] / 7
```

**Compact Notation:**

> $$\text{Loss}_{\text{per epoch}} = \frac{1}{7} \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$$

This is also known as **Mean Squared Error ($MSE$)**.
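
The averaged loss can be compared against PyTorch's built-in MSE. The labels and predictions are assumed values. `torch.nn.functional.mse_loss` averages over all elements by default and sums them with `reduction="sum"`.

```python
import torch
import torch.nn.functional as F

# Assumed labels and predictions for 7 samples, each of shape (7, 1)
Y_true = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0], [0.0], [1.0]])
Y_pred = torch.tensor([[0.5], [0.0], [2.0], [1.0], [1.0], [0.5], [0.0]])

batch_size = Y_true.shape[0]
loss_total = ((Y_true - Y_pred) ** 2).sum()   # sum of squared errors
loss_per_epoch = loss_total / batch_size      # manual mean squared error

mse_builtin = F.mse_loss(Y_pred, Y_true)                   # mean reduction (default)
sse_builtin = F.mse_loss(Y_pred, Y_true, reduction="sum")  # sum reduction

print(loss_per_epoch.item(), mse_builtin.item())  # 0.5 0.5
```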

| Step | Operation | Input Shape | Output Shape | Result |
|------|-----------|-------------|--------------|--------|
| 1 | Original Input | $(7, 3)$ | $(7, 3)$ | $X$ |
| 2 | Bias Augmentation | $(7, 3)$ | $(7, 4)$ | $X_{aug}$ |
| 3 | Linear Transform | $(7, 4) \times (4, 1)$ | $(7, 1)$ | $Z$ |
| 4 | ReLU Activation | $(7, 1)$ | $(7, 1)$ | $Y_{pred}$ |
| 5 | Error Calculation | $(7, 1) - (7, 1)$ | $(7, 1)$ | Errors |
| 5 | Squared Errors | $(7, 1)$ | $(7, 1)$ | $SE$ |
| 6 | Total Loss | sum of $(7, 1)$ | scalar | $Loss\_total$ |
| 7 | Loss Per Epoch | $Loss\_total / 7$ | scalar | $MSE$ |
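
The rows of this table can be strung together into one function. The sketch below is a minimal illustration under stated assumptions: the function name `forward_pass`, the random inputs, and the random weights are all hypothetical.

```python
import torch
import torch.nn.functional as F

def forward_pass(X, Y_true, W):
    """Batched forward pass: augment, linear transform, ReLU, squared-error loss."""
    ones = torch.ones(X.shape[0], 1)
    X_aug = torch.cat([ones, X], dim=1)       # (N, 3) -> (N, 4)
    Z = X_aug @ W.T                           # (N, 4) @ (4, 1) -> (N, 1)
    Y_pred = torch.relu(Z)                    # (N, 1)
    squared_errors = (Y_true - Y_pred) ** 2   # (N, 1)
    loss_total = squared_errors.sum()         # scalar
    mse = loss_total / X.shape[0]             # scalar
    return Y_pred, loss_total, mse

torch.manual_seed(0)
X = torch.randn(7, 3)                          # placeholder inputs
Y_true = torch.randint(0, 2, (7, 1)).float()   # placeholder binary labels
W = torch.randn(1, 4)                          # placeholder weights, bias first

Y_pred, loss_total, mse = forward_pass(X, Y_true, W)
print(Y_pred.shape, loss_total.dim(), mse.dim())  # torch.Size([7, 1]) 0 0
```

Keeping every intermediate as a 2-D column avoids accidental broadcasting between shapes such as `(7,)` and `(7, 1)`, which would silently produce a `(7, 7)` error matrix.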

## **Key Concepts:**

1. **`Batched Processing`**: All 7 samples are processed simultaneously in one matrix operation, which is much more efficient than processing one at a time.

2. **`Matrix Dimensions`**:
   - **`Augmented Input`**: $(7, 4)$ represents 7 samples with 4 columns (the bias column of ones plus 3 features)
   - **`Weight Transpose`**: $(4, 1)$ represents 4 weights (the bias plus 3 feature weights) for 1 output
   - **`Output`**: $(7, 1)$ represents 7 predictions

3. **`ReLU Activation`**: 
   - Outputs the input if positive, otherwise outputs 0
   - Applied element-wise to each prediction

4. **`Loss Metrics`**:
   - **Total Loss**: Sum of all squared errors across the batch
   - **Loss Per Epoch ($MSE$)**: Average squared error, useful for comparing across different batch sizes

5. **`Vectorization`**: Using matrix operations allows the entire batch to be processed in parallel, which is the foundation of efficient deep learning.

6. **`One Forward Pass = One Epoch`**: Since our batch size equals the dataset size, processing the batch once completes one full epoch. In general, one epoch takes $\lceil N / \text{Batch Size} \rceil$ forward passes.