# 📌 Neural Networks (Basics) - A Complete Guide

This notebook provides a **detailed breakdown** of fundamental neural network concepts, including:  
✔ **Perceptron & Multi-layer Perceptron (MLP)**  
✔ **Activation Functions & Their Derivatives**  
✔ **Forward & Backpropagation with Detailed Math**  
✔ **Gradient Descent for Neural Networks**  
✔ **Implementation in PyTorch & Keras**

---

## 📖 1. Introduction to Neural Networks

Neural Networks are inspired by the **human brain** and consist of **layers of neurons** that transform input data through weighted connections.  

### 🔹 Why Use Neural Networks?
✔ **Can model complex relationships** between inputs and outputs.  
✔ **Can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (Universal Approximation Theorem)**.  
✔ **Used in deep learning applications like NLP, Vision, and Reinforcement Learning**.  

---

## 🧮 2. The Perceptron: The Basic Building Block

### 🔹 1. The Perceptron Model  
A perceptron is a **single-layer neural network** that performs **binary classification**:

$$
y = f(W \cdot X + b)
$$

where:
- **$X$** = Input vector
- **$W$** = Weights
- **$b$** = Bias
- **$f$** = Activation function

---

### 🔹 2. Perceptron Learning Rule

1. Initialize weights randomly.  
2. Compute the **weighted sum** of inputs:  

   $$
   z = W \cdot X + b
   $$

3. Apply an **activation function** (e.g., step function):

   $$
   y = \text{sign}(z)
   $$

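4. **Update weights and bias** using the perceptron update rule, where **$\alpha$** is the learning rate:

   $$
   W := W + \alpha (y_{\text{true}} - y_{\text{pred}}) X, \qquad b := b + \alpha (y_{\text{true}} - y_{\text{pred}})
   $$

5. Repeat until no sample is misclassified (convergence is guaranteed only when the data is linearly separable).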

✅ **Limitations**:  
- Can only solve **linearly separable** problems (e.g., AND, OR).
- **Cannot solve XOR** → Requires Multi-Layer Perceptrons (MLPs).
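
The learning rule and its limitation can be sketched in plain Python. This is a minimal illustration under a few assumptions that differ from the steps above: labels are 0/1 with a threshold step instead of `sign`, weights start at zero rather than random values, and the learning rate is 1.0. The names `train_perceptron` and `predict` are hypothetical helpers.

```python
def step(z):
    """Threshold activation: 1 if z >= 0, else 0 (assumed 0/1 labels)."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Apply W := W + lr * (y_true - y_pred) * X, and the same rule for b."""
    w = [0.0] * len(samples[0])  # zero init for reproducibility (assumption)
    b = 0.0
    for _ in range(epochs):
        for x, y_true in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = y_true - step(z)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


def predict(w, b, x):
    """Classify one sample with the trained weights."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]
xor_labels = [0, 1, 1, 0]

w_and, b_and = train_perceptron(inputs, and_labels)
and_preds = [predict(w_and, b_and, x) for x in inputs]

w_xor, b_xor = train_perceptron(inputs, xor_labels)
xor_preds = [predict(w_xor, b_xor, x) for x in inputs]

print("AND predictions:", and_preds)
print("XOR predictions:", xor_preds, "(never matches", xor_labels, ")")
```

AND is linearly separable, so the rule settles on a separating line. No single line separates XOR, so the XOR predictions cannot match the labels however long training runs.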

---

## 📖 3. Multi-Layer Perceptron (MLP)

A **Multi-Layer Perceptron (MLP)** consists of:
- **Input Layer**
- **Hidden Layers** (with non-linear activations)
- **Output Layer**

Each neuron in layer **$l$** receives input **$a^{(l-1)}$** from the previous layer:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$

$$
a^{(l)} = f(z^{(l)})
$$

where:
- **$W^{(l)}$** = Weights of layer **$l$**  
- **$b^{(l)}$** = Bias  
- **$f$** = Activation function  

✅ **MLPs can learn complex patterns, including XOR**.
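
To make the layer equations concrete, here is a small sketch of a forward pass through a 2-2-1 network that computes XOR. The weights are hand-picked for illustration (not learned), the hidden activation is ReLU, and the output layer is assumed to be linear. `dense` and `forward` are hypothetical helper names.

```python
def relu(z):
    """Element-wise ReLU on a list of pre-activations."""
    return [max(0.0, v) for v in z]


def dense(W, a, b):
    """Compute z = W a + b for one layer (W is a list of rows)."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


# Hand-picked weights (an assumption for illustration, not a trained model)
W1 = [[1.0, 1.0], [1.0, 1.0]]  # hidden layer: 2 inputs -> 2 units
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]             # output layer: 2 hidden units -> 1 output
b2 = [0.0]


def forward(x):
    """a1 = relu(W1 x + b1), then z2 = W2 a1 + b2 with a linear output."""
    a1 = relu(dense(W1, x, b1))
    z2 = dense(W2, a1, b2)
    return z2[0]


xor_outputs = [forward(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(xor_outputs)
```

The hidden layer is what makes this possible: the second hidden unit only activates when both inputs are 1, and the output layer subtracts it to cancel that case.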

---

## 📖 4. Activation Functions & Their Derivatives

### 🔹 1. Sigmoid Function
Used in binary classification:

$$
f(z) = \frac{1}{1 + e^{-z}}
$$

Derivative:

$$
f'(z) = f(z) (1 - f(z))
$$

✅ **Smooth, differentiable**  
⚠ **Suffers from vanishing gradients**  
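
A quick way to check the derivative formula is to compare it with a finite-difference estimate. The sketch below does that in plain Python. `numeric_grad` is a hypothetical helper, and the sample points are arbitrary.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  analytic={sigmoid_grad(z):.6f}  "
          f"numeric={numeric_grad(sigmoid, z):.6f}")
```

The derivative peaks at 0.25 for $z = 0$ and shrinks toward zero for large $|z|$. Multiplying many such factors through deep layers is what produces vanishing gradients.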

---

### 🔹 2. ReLU (Rectified Linear Unit)
Most popular in deep networks:

$$
f(z) = \max(0, z)
$$

Derivative:

$$
f'(z) =
\begin{cases}
1, & z > 0 \\
0, & z \leq 0
\end{cases}
$$

✅ **Faster convergence** (the gradient does not saturate for positive inputs)  
⚠ **Can cause dead neurons (the dying ReLU problem)**  
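
The dead-neuron issue can be illustrated with a single unit. In this sketch the weight and bias are hypothetical values chosen so that the pre-activation is negative for every input in the toy dataset.

```python
def relu(z):
    """f(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """f'(z) = 1 for z > 0, else 0 (using 0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


# Hypothetical unit whose bias pushes every pre-activation below zero
w, b = 0.5, -3.0
inputs = [0.0, 1.0, 2.0, 3.0]

pre_activations = [w * x + b for x in inputs]
local_grads = [relu_grad(z) for z in pre_activations]

print("pre-activations:", pre_activations)
print("local gradients:", local_grads)
```

Every local gradient is zero, so no gradient flows back to `w` or `b` and the unit cannot recover through gradient descent on this data.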

---

### 🔹 3. Softmax Function
Used in multi-class classification:

$$
f(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

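Gradient (a Jacobian, since every output depends on every input):

$$
\frac{\partial f(z_i)}{\partial z_j} = f(z_i) \left( \delta_{ij} - f(z_j) \right)
$$

where **$\delta_{ij}$** is 1 if $i = j$ and 0 otherwise. For $i = j$ this reduces to $f(z_i)(1 - f(z_i))$.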
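
Because each softmax output depends on all inputs, its derivative is a full matrix with entries $f(z_i)(\delta_{ij} - f(z_j))$. The sketch below computes it in plain Python. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and an addition here, and `softmax_jacobian` is a hypothetical helper name.

```python
import math


def softmax(z):
    """Softmax with the max logit subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
jac = softmax_jacobian(logits)

print("probabilities:", [round(p, 4) for p in probs])
for row in jac:
    print([round(v, 4) for v in row])
```

Each row of the Jacobian sums to zero, because the probabilities always sum to 1: raising one output necessarily lowers the others.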

---

## 📖 5. Backpropagation: Training Neural Networks

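To train an MLP, we use **Backpropagation**, which computes the gradient of the loss with respect to every weight using the **chain rule of calculus**. Gradient descent then uses these gradients to update the weights.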

### 🔹 1. Compute Loss Function
For **binary classification**, we use **Binary Cross-Entropy**:

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

For **multi-class classification**, we use **Categorical Cross-Entropy**:

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij})
$$
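
Both losses can be written in a few lines of plain Python. This sketch averages over the $m$ samples in both cases (for the categorical loss that is an assumption; a plain sum differs only by the factor $m$) and clips probabilities with a small `eps` to avoid `log(0)`, which is an added safeguard.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum_j y_ij log(p_ij), with one-hot y."""
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / len(y_true)


bce_uniform = binary_cross_entropy([1, 0], [0.5, 0.5])
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
cce_example = categorical_cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]])

print(f"BCE uniform={bce_uniform:.4f} good={bce_good:.4f} bad={bce_bad:.4f}")
print(f"CCE example={cce_example:.4f}")
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.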

---

### 🔹 2. Compute Gradients Using Chain Rule

$$
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}
$$
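
As a worked instance of the chain rule, consider a single sigmoid neuron trained with binary cross-entropy on one sample. The sketch multiplies the three factors explicitly and compares the result with a finite-difference estimate. The weights, input, and helper names are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Binary cross-entropy for one sample through a single sigmoid neuron."""
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def gradients(w, b, x, y):
    """dJ/dW = dJ/da * da/dz * dz/dW, computed factor by factor."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    dJ_da = -(y / a) + (1 - y) / (1 - a)
    da_dz = a * (1 - a)
    dJ_dz = dJ_da * da_dz            # algebraically equal to a - y
    grad_w = [dJ_dz * xi for xi in x]  # dz/dW = x
    grad_b = dJ_dz                     # dz/db = 1
    return grad_w, grad_b


w, b = [0.5, -0.3], 0.1
x, y = [1.0, 2.0], 1

grad_w, grad_b = gradients(w, b, x, y)

# Finite-difference check on the first weight
eps = 1e-6
numeric_w0 = (loss([w[0] + eps, w[1]], b, x, y)
              - loss([w[0] - eps, w[1]], b, x, y)) / (2 * eps)

print("analytic grad_w:", grad_w, "grad_b:", grad_b)
print("numeric dJ/dw0 :", numeric_w0)
```

For this sigmoid and cross-entropy pairing, the product of the first two factors simplifies to $a - y$, which is why the pairing is so common in practice.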

---

## 🚀 6. Implementing Neural Networks in PyTorch


In [None]:
# 📦 Import required libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# ✅ Generate synthetic data (XOR problem)
X = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X)
y_tensor = torch.tensor(y)

# Define MLP model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.activation(self.layer2(x))
        return x

# Initialize model, loss function, and optimizer
pt_model = MLP()
criterion = nn.BCELoss()  # Binary Cross Entropy Loss
optimizer = optim.Adam(pt_model.parameters(), lr=0.01)

# Train model
epochs = 1000
for epoch in range(epochs):
    optimizer.zero_grad()
    output = pt_model(X_tensor)
    loss = criterion(output, y_tensor)
    loss.backward()
    optimizer.step()

# Evaluate model
print(f"Final loss: {loss.item():.4f}")




## 🚀 7. Implementing Neural Networks in Keras


In [91]:
# 📦 Import required libraries
import tensorflow as tf
from tensorflow import keras

# Define MLP model in Keras
keras_model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(2,)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
keras_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
keras_model.fit(X, y, epochs=100, verbose=1)

# Evaluate model
loss, accuracy = keras_model.evaluate(X, y)
print(f"Final Loss: {loss:.4f}, Accuracy: {accuracy:.4f}")


Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step - accuracy: 0.5000 - loss: 0.6974
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 0.5000 - loss: 0.6972
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.5000 - loss: 0.6970
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step - accuracy: 0.5000 - loss: 0.6967
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - accuracy: 0.5000 - loss: 0.6965
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - accuracy: 0.5000 - loss: 0.6962
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step - accuracy: 0.5000 - loss: 0.6960
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step - accuracy: 0.5000 - loss: 0.6957
Epoch 9/100

KeyboardInterrupt: 

# 📌 Advanced Topics in Neural Networks - Dropout, Batch Normalization & More

Now that we've covered the basics of Neural Networks, let's dive into **advanced techniques** to improve performance:
✔ **Dropout: Regularization to prevent overfitting**  
✔ **Batch Normalization: Stabilizing training & speeding up convergence**  
✔ **Weight Initialization: Preventing vanishing/exploding gradients**  
✔ **Gradient Clipping: Handling exploding gradients**  
✔ **Learning Rate Scheduling: Adaptive learning rates**  

---

## 📖 8. Dropout - Preventing Overfitting

### 🔹 1. What is Dropout?

**Dropout** is a **regularization technique** used to **prevent overfitting** by randomly dropping units (neurons) during training.

### 🔹 2. How Dropout Works
1. At each training step, randomly **drop neurons** with probability **$p$**.
2. Forward pass continues **without these neurons**.
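3. At test time, **all neurons are active**. In the original formulation their outputs are scaled by **$1 - p$**; with inverted dropout (used by PyTorch and Keras), the surviving activations are instead scaled by **$\frac{1}{1 - p}$** during training, so no test-time scaling is needed.

### 🔹 3. Mathematical Formulation
If **$h_i$** is the output of neuron **$i$** and **$m_i \sim \text{Bernoulli}(1 - p)$** is its random keep mask, then with inverted dropout:

$$
h_i^{\text{drop}} = \frac{m_i \, h_i}{1 - p} \quad \text{(during training)}, \qquad h_i^{\text{drop}} = h_i \quad \text{(at test time)}
$$

where **$p$** is the dropout probability. Dividing by **$1 - p$** keeps the expected activation equal to **$h_i$**.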

✅ **Why use Dropout?**
- Reduces **co-adaptation** between neurons.
- Acts as **ensemble learning** (averaging different sub-networks).
- Helps **generalization** by forcing the model to be more robust.
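
The mechanics can be sketched in plain Python using inverted dropout, the variant that PyTorch's `nn.Dropout` and Keras's `Dropout` layer implement: kept activations are scaled by $1/(1-p)$ during training and the layer is an identity at inference. The `dropout` function and its `rng` argument are hypothetical names for this illustration.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p, scale the rest."""
    if not training or p == 0.0:
        return list(activations)  # identity at inference
    keep = 1.0 - p
    return [h / keep if rng.random() < keep else 0.0 for h in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
h = [1.0] * 10000

dropped = dropout(h, p=0.5, training=True, rng=rng)
train_mean = sum(dropped) / len(dropped)
eval_out = dropout(h, p=0.5, training=False, rng=rng)

print(f"fraction dropped: {dropped.count(0.0) / len(dropped):.3f}")
print(f"mean activation during training: {train_mean:.3f}")
print(f"mean activation at inference:    {sum(eval_out) / len(eval_out):.3f}")
```

The mean activation stays close to 1.0 in both modes, which is the point of the rescaling: downstream layers see the same expected input during training and inference.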

---

## 🚀 Implementing Dropout in PyTorch


In [80]:
import torch.nn.functional as F

# Define MLP model with Dropout
class MLP_Dropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.dropout = nn.Dropout(p=0.5)  # 50% neurons dropped
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = self.dropout(x)  # Apply dropout
        x = self.activation(self.layer2(x))
        return x

# Initialize model
model_dropout = MLP_Dropout()


## 🚀 Implementing Dropout in Keras

In [81]:
from tensorflow.keras.layers import Dropout

# Define MLP model with Dropout
model_dropout = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(2,)),
    keras.layers.Dropout(0.5),  # Dropout layer with 50% probability
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
model_dropout.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


## 🚀 Implementing Batch Normalization in PyTorch

In [82]:
# Define MLP model with BatchNorm
class MLP_BatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.batchnorm = nn.BatchNorm1d(4)  # Batch Normalization layer
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.batchnorm(x)  # Apply batch normalization
        x = F.relu(x)
        x = self.activation(self.layer2(x))
        return x

# Initialize model
model_batchnorm = MLP_BatchNorm()


## 🚀 Implementing Batch Normalization in Keras

In [83]:
from tensorflow.keras.layers import BatchNormalization

# Define MLP model with Batch Normalization
model_batchnorm = keras.Sequential([
    keras.layers.Dense(4, input_shape=(2,), activation='relu'),
    keras.layers.BatchNormalization(),  # Batch Normalization layer
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
model_batchnorm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 📖 9. Other Advanced Topics in Neural Networks

In this section, we explore **advanced techniques** that improve training efficiency and stability:

✔ **Weight Initialization:** Preventing vanishing/exploding gradients  
✔ **Gradient Clipping:** Handling unstable gradients  
✔ **Learning Rate Scheduling:** Adapting learning rates dynamically  

---

## 🔹 1. Weight Initialization

### 🔹 Why is Weight Initialization Important?
- **Prevents vanishing/exploding gradients**  
- **Speeds up convergence**  
- **Ensures stable flow of activations through the network**  

### 🔹 Xavier (Glorot) Initialization
Used for **sigmoid & tanh activations**, keeps the variance of activations and gradients stable across layers:

$$
W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
$$

### 🔹 He Initialization
Used for **ReLU & Leaky ReLU activations**, accounts for rectification:

$$
W \sim \mathcal{N}(0, \frac{2}{n_{\text{in}}})
$$
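
The effect of the initialization scale can be seen by pushing a random input through a stack of ReLU layers and tracking the mean squared activation. The sketch below compares He initialization with a small fixed standard deviation of 0.01. The width, depth, seed, and helper names are arbitrary choices for illustration, and biases are assumed to be zero.

```python
import math
import random


def relu_layer(a, std, width, rng):
    """One dense ReLU layer with weights drawn from N(0, std^2), zero bias."""
    out = []
    for _ in range(width):
        z = sum(rng.gauss(0.0, std) * a_j for a_j in a)
        out.append(max(0.0, z))
    return out


def mean_square(v):
    return sum(x * x for x in v) / len(v)


def propagate(std_fn, width=256, depth=5, seed=0):
    """Return (mean square after `depth` layers) / (mean square of the input)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(a)
    for _ in range(depth):
        a = relu_layer(a, std_fn(width), width, rng)
    return mean_square(a) / start


he_ratio = propagate(lambda n_in: math.sqrt(2.0 / n_in))  # He: Var = 2 / n_in
small_ratio = propagate(lambda n_in: 0.01)                # too-small weights

print(f"He init:        signal ratio after 5 layers = {he_ratio:.3f}")
print(f"std=0.01 init:  signal ratio after 5 layers = {small_ratio:.3e}")
```

With He initialization the signal magnitude stays on the same order as the input, while the small fixed scale shrinks it by orders of magnitude. The same shrinkage applies to gradients on the backward pass.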


In [84]:
# ✅ Implementing Xavier & He Initialization in PyTorch

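# Xavier (Glorot) normal initialization: Var(W) = 2 / (fan_in + fan_out)
# Assumes shape = (fan_out, fan_in), matching the layout of nn.Linear.weight
def xavier_init(shape):
    fan_out, fan_in = shape
    return torch.randn(shape) * (2.0 / (fan_in + fan_out)) ** 0.5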

# He Initialization (for ReLU networks)
def he_init(shape):
    return torch.randn(shape) * torch.sqrt(torch.tensor(2.0) / shape[1])


In [85]:
# ✅ Implementing Xavier & He Initialization in Keras
from tensorflow.keras.initializers import GlorotUniform, HeNormal

# Xavier Initialization (Glorot)
xavier_initializer = GlorotUniform()

# He Initialization (for ReLU)
he_initializer = HeNormal()


## 🔹 2. Gradient Clipping

### 🔹 Why Use Gradient Clipping?
- **Prevents exploding gradients**, especially in deep networks or RNNs  
- **Ensures stability** in weight updates  

### 🔹 Gradient Clipping Formula

If the L2 norm of the full gradient vector, **$||\nabla w||$**, exceeds a threshold **$c$**, rescale every component **$\nabla w_i$** by the same factor:

$$
\nabla w_i \leftarrow \frac{c}{||\nabla w||} \nabla w_i
$$
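
Norm-based clipping can be sketched in plain Python on a flat list of gradient values. The function name `clip_by_global_norm` and the example gradients are hypothetical. A real training loop would apply the same idea across all parameter tensors at once.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print("norm before clipping:", norm_before)
print("clipped gradients:   ", clipped)
print("small gradients kept:", unchanged)
```

All components are scaled by the same factor, so the direction of the update is preserved and only its length is capped.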


In [86]:
# ✅ Implementing Gradient Clipping in PyTorch


# Define a simple MLP model with Batch Normalization
class MLP_BatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.batchnorm = nn.BatchNorm1d(4)  # Ensure this works correctly
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.batchnorm(x)  # Apply BatchNorm
        x = torch.relu(x)
        x = self.activation(self.layer2(x))
        return x

# Initialize model
torch_model = MLP_BatchNorm()

# Define optimizer
optimizer = optim.Adam(torch_model.parameters(), lr=0.01)

# ✅ Generate dummy batch data (Batch size = 2 or more)
X_batch = torch.tensor([[0.5, 0.2], [0.1, 0.7]])  # Ensure batch size > 1

# Training loop with Gradient Clipping
for epoch in range(10):
    torch_model.train()  # Ensure model is in training mode
    optimizer.zero_grad()
    
    output = torch_model(X_batch)  # Pass batch-sized input
    loss = output.mean()  # Example loss function
    
    loss.backward()
    
    # ✅ Apply Gradient Clipping
    nn.utils.clip_grad_norm_(torch_model.parameters(), max_norm=1.0)
    
    optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")



Epoch 1, Loss: 0.2987
Epoch 2, Loss: 0.2903
Epoch 3, Loss: 0.2819
Epoch 4, Loss: 0.2737
Epoch 5, Loss: 0.2655
Epoch 6, Loss: 0.2574
Epoch 7, Loss: 0.2495
Epoch 8, Loss: 0.2417
Epoch 9, Loss: 0.2339
Epoch 10, Loss: 0.2264


In [87]:
# ✅ Implementing Gradient Clipping in Keras
from tensorflow.keras.optimizers import Adam

# Define optimizer with gradient clipping
optimizer = Adam(learning_rate=0.01, clipnorm=1.0)  # clipnorm prevents exploding gradients

# Compile model with clipped gradients
keras_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])


## 🔹 3. Learning Rate Scheduling

### 🔹 Why Use Learning Rate Scheduling?
- **Adjusts learning rate dynamically** to **improve convergence**  
- **Speeds up training** while avoiding overshooting  

### 🔹 Common Learning Rate Schedulers

#### **1. Exponential Decay**
Gradually reduces the learning rate:

$$
\alpha_t = \alpha_0 \times e^{-kt}
$$

#### **2. Step Decay**
Reduces the learning rate **after a fixed number of steps**:

$$
\alpha_t = \alpha_0 \times 0.5^{\lfloor t/k \rfloor}
$$
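
Both schedules are simple functions of the step index $t$. The sketch below evaluates them in plain Python. The starting rate, the decay constant `k`, and the `factor` parameter (0.5 in the formula above) are illustrative values.

```python
import math


def exponential_decay(alpha0, k, t):
    """alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)


def step_decay(alpha0, k, t, factor=0.5):
    """alpha_t = alpha0 * factor ** floor(t / k)."""
    return alpha0 * factor ** (t // k)


for t in (0, 5, 10, 20, 25):
    print(f"t={t:2d}  exponential={exponential_decay(0.1, 0.1, t):.5f}  "
          f"step={step_decay(0.1, 10, t):.5f}")
```

Exponential decay shrinks the rate a little at every step, while step decay holds it constant and then halves it every `k` steps.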


In [None]:
# ✅ Implementing Learning Rate Scheduling in PyTorch

# Define MLP model with Batch Normalization
class MLP_BatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.batchnorm = nn.BatchNorm1d(4)  # Ensure batch size > 1
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.batchnorm(x)  # Apply BatchNorm
        x = torch.relu(x)
        x = self.activation(self.layer2(x))
        return x

# Initialize model
torch_model = MLP_BatchNorm()

# Define optimizer
optimizer = optim.Adam(torch_model.parameters(), lr=0.01)

# Exponential Decay
scheduler_exp = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# Step Decay
scheduler_step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# ✅ Generate dummy batch data (Batch size = 2 or more)
X_batch = torch.tensor([[0.5, 0.2], [0.1, 0.7]])  # Ensure batch size > 1

# Training loop with Learning Rate Scheduling
for epoch in range(20):
    torch_model.train()  # Ensure model is in training mode
    optimizer.zero_grad()
    
    output = torch_model(X_batch)  # Pass batch-sized input
    loss = output.mean()  # Example loss function
    
    loss.backward()
    
    # ✅ Apply Gradient Clipping
    nn.utils.clip_grad_norm_(torch_model.parameters(), max_norm=1.0)
    
    optimizer.step()

    # ✅ Update learning rate
    scheduler_exp.step()
    
    print(f"Epoch {epoch+1}, Learning Rate: {scheduler_exp.get_last_lr()[0]:.6f}")



Epoch 1, Learning Rate: 0.009000
Epoch 2, Learning Rate: 0.008100
Epoch 3, Learning Rate: 0.007290
Epoch 4, Learning Rate: 0.006561
Epoch 5, Learning Rate: 0.005905
Epoch 6, Learning Rate: 0.005314
Epoch 7, Learning Rate: 0.004783
Epoch 8, Learning Rate: 0.004305
Epoch 9, Learning Rate: 0.003874
Epoch 10, Learning Rate: 0.003487
Epoch 11, Learning Rate: 0.003138
Epoch 12, Learning Rate: 0.002824
Epoch 13, Learning Rate: 0.002542
Epoch 14, Learning Rate: 0.002288
Epoch 15, Learning Rate: 0.002059
Epoch 16, Learning Rate: 0.001853
Epoch 17, Learning Rate: 0.001668
Epoch 18, Learning Rate: 0.001501
Epoch 19, Learning Rate: 0.001351
Epoch 20, Learning Rate: 0.001216


In [89]:
# ✅ Implementing Learning Rate Scheduling in Keras
from tensorflow.keras.callbacks import LearningRateScheduler

# Define exponential decay function
def exponential_decay(epoch, lr):
    return lr * np.exp(-0.1)

# Define step decay function
def step_decay(epoch, lr):
    # Halve the learning rate every 10 epochs, but not at epoch 0
    if epoch > 0 and epoch % 10 == 0:
        return lr * 0.5
    return lr

# Apply scheduler callbacks
exp_decay_callback = LearningRateScheduler(exponential_decay)
step_decay_callback = LearningRateScheduler(step_decay)

# Train model with learning rate scheduler
keras_model.fit(X, y, epochs=20, callbacks=[exp_decay_callback])


Epoch 1/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - accuracy: 0.7500 - loss: 0.6492 - learning_rate: 0.0090
Epoch 2/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - accuracy: 0.7500 - loss: 0.6471 - learning_rate: 0.0082
Epoch 3/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 0.7500 - loss: 0.6434 - learning_rate: 0.0074
Epoch 4/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7500 - loss: 0.6405 - learning_rate: 0.0067
Epoch 5/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7500 - loss: 0.6384 - learning_rate: 0.0061
Epoch 6/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step - accuracy: 0.7500 - loss: 0.6364 - learning_rate: 0.0055
Epoch 7/20
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step - accuracy: 0.7500 - loss: 0.6346 - learning_rate: 0.0050
Epoch 8/20

<keras.src.callbacks.history.History at 0x201cc518230>

## 💡 10. Interview Questions

### 🔹 **Dropout Questions**
1️⃣ **Why does Dropout help prevent overfitting?**  
   - Forces the network to learn redundant representations.  
   
2️⃣ **How does Dropout affect training vs. inference?**  
   - During training, neurons are randomly dropped with probability **p**.  
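   - During inference, all neurons are active. With inverted dropout (used by PyTorch and Keras), activations are scaled by **1 / (1 - p)** during training, so no scaling is needed at inference; the original formulation instead scales outputs by **(1 - p)** at inference.  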

---

### 🔹 **Batch Normalization Questions**
3️⃣ **Why does BatchNorm improve training?**  
   - It stabilizes learning and allows for higher learning rates; it was originally motivated as a way to reduce internal covariate shift.  

4️⃣ **How does BatchNorm act as a regularization technique?**  
   - It introduces slight noise to the activations, similar to Dropout, which reduces overfitting.  

---

### 🔹 **Optimization Questions**
5️⃣ **Why use Xavier (Glorot) initialization?**  
   - It ensures that the variance of activations remains stable across layers, preventing vanishing/exploding gradients.  

6️⃣ **Why use He initialization for ReLU networks?**  
   - He initialization scales weights to maintain variance in ReLU-based networks, improving convergence.  

7️⃣ **What is the purpose of Gradient Clipping?**  
   - It prevents exploding gradients by capping their magnitude during backpropagation.  

8️⃣ **When should you use Learning Rate Scheduling?**  
   - When training loss plateaus, reducing the learning rate helps the model converge better.  

9️⃣ **What are the different types of Learning Rate Schedulers?**  
   - **Exponential Decay:** Decreases learning rate exponentially over time.  
   - **Step Decay:** Reduces learning rate after fixed steps.  
   - **Adaptive optimizers (Adam, RMSprop):** Not schedulers in the strict sense; they scale each parameter's step size based on past gradients and can be combined with a scheduler.  

---

### 🔹 **Weight Initialization Questions**
1️⃣ **Why is weight initialization important?**  
   - Prevents vanishing/exploding gradients.  

2️⃣ **When should you use Xavier initialization?**  
   - For **sigmoid and tanh** activation functions.  

3️⃣ **When should you use He initialization?**  
   - For **ReLU and Leaky ReLU** activations.  

---

### 🔹 **Gradient Clipping Questions**
4️⃣ **Why do we clip gradients?**  
   - To prevent **exploding gradients**, especially in deep networks and RNNs.  

5️⃣ **What is the difference between `clipnorm` and `clipvalue`?**  
   - **`clipnorm`** rescales gradients if their **L2 norm exceeds a threshold**.  
   - **`clipvalue`** clips each gradient component **individually**.  

---

### 🔹 **Learning Rate Scheduling Questions**
6️⃣ **Why use a dynamic learning rate instead of a fixed one?**  
   - **Higher learning rates** speed up training initially.  
   - **Lower learning rates** improve convergence at later stages.  

7️⃣ **What’s the difference between Exponential Decay and Step Decay?**  
   - **Exponential Decay** gradually decreases learning rate every step.  
   - **Step Decay** reduces learning rate **at fixed intervals**.  

8️⃣ **When should you decay the learning rate?**  
   - When the loss **plateaus** and is no longer improving.  

---