###### __Continuation of Yesterday's Class by Haribabu Sir__:

# __Comparison of the Four Gradient Descent Methods:__  

### 1. **Standard (Batch) Gradient Descent**  
- **Definition**: Computes the gradient using the entire dataset.  
- **Update Rule**:  
Formula:  
$$ w := w - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred}}^{(i)} - y_{\text{true}}^{(i)}) \cdot x^{(i)} $$  

#### Terms:
1. **$w$**: Weight (parameter) being optimized.
2. **$\eta$ (Learning Rate)**: Controls the size of the step in the direction of the negative gradient.
3. **$n$**: Total number of training samples in the dataset.
4. **$(y_{\text{pred}}^{(i)} - y_{\text{true}}^{(i)})$**: The difference between the predicted and true value for sample $i$ (error term).
5. **$x^{(i)}$**: Input feature corresponding to sample $i$.
6. **$\sum_{i=1}^{n}$**: Summation over all $n$ samples in the dataset.

**Key Property:** Uses the entire dataset for each update, ensuring accurate but slow updates.

- **Pros**: Accurate updates.  
- **Cons**: Slow for large datasets.  
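
As a minimal sketch of the batch update rule above, assume a one-weight linear model $y_{\text{pred}} = w \cdot x$ and a tiny noise-free dataset. The function name `batch_gradient_step` and the data are illustrative assumptions, not part of the lecture.

```python
def batch_gradient_step(w, xs, ys, eta):
    """One batch update for the assumed one-weight model y_pred = w * x."""
    n = len(xs)
    # Average of (y_pred - y_true) * x over the ENTIRE dataset.
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - eta * grad

# Hypothetical data generated by y = 2x, so the ideal weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
for _ in range(100):
    w = batch_gradient_step(w, xs, ys, eta=0.1)

print(round(w, 4))
```

Every step touches all $n$ samples, which is why the cost per update grows with the dataset size.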

---

### 2. **Stochastic Gradient Descent (SGD)**  
- **Definition**: Uses a single random data point for each update.  
- **Update Rule**:  
Formula:  
$$ w := w - \eta \cdot (y_{\text{pred}} - y_{\text{true}}) \cdot x $$  

#### Terms:
1. **$w$**: Weight (parameter) being updated.
2. **$\eta$**: Learning rate.
3. **$(y_{\text{pred}} - y_{\text{true}})$**: Error term for a single sample.
4. **$x$**: Input feature for the single sample being used for the update.

**Key Property:** Updates weights for each individual sample, making it faster but noisier.

- **Pros**: Fast updates, good for large datasets.  
- **Cons**: Noisy convergence.  
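
The same toy problem can illustrate the SGD rule above. This is a sketch under the same assumptions (model $y_{\text{pred}} = w \cdot x$, hypothetical data from $y = 2x$); `sgd_step` is an illustrative name.

```python
import random

def sgd_step(w, x, y, eta):
    """One SGD update using a single sample (x, y)."""
    return w - eta * (w * x - y) * x

random.seed(0)  # fixed seed so the sketch is reproducible
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
for _ in range(200):
    i = random.randrange(len(xs))      # pick one random sample
    w = sgd_step(w, xs[i], ys[i], eta=0.05)

print(round(w, 4))
```

Because this toy data has no noise, the updates settle on $w = 2$. With noisy real data the weight would keep jittering around the minimum, which is the "noisy convergence" listed under Cons.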

---

### 3. **Mini-Batch Gradient Descent**  
- **Definition**: Uses small batches of data (size $b$) for each update.  
- **Update Rule**:  
Formula:  
$$ w := w - \eta \cdot \frac{1}{b} \sum_{j=1}^{b} (y_{\text{pred}}^{(j)} - y_{\text{true}}^{(j)}) \cdot x^{(j)} $$  

#### Terms:
1. **$b$**: Batch size (number of samples in a mini-batch).  
2. **$\sum_{j=1}^{b}$**: Summation over all $b$ samples in the mini-batch.  
3. Other terms ($w$, $\eta$, $x$, $y_{\text{pred}}$, $y_{\text{true}}$) are as defined above.

**Key Property:** Uses a subset of data (mini-batch) for each update, balancing stability and efficiency.

- **Pros**: Balances stability and speed, hardware-efficient.  
- **Cons**: Requires choosing a proper batch size.  
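
A minimal sketch of the mini-batch rule, again assuming the one-weight model $y_{\text{pred}} = w \cdot x$. The dataset, the batch size of 4 and the name `minibatch_step` are assumptions chosen only for illustration.

```python
import random

def minibatch_step(w, batch, eta):
    """One update from a list of (x, y) pairs of size b."""
    b = len(batch)
    grad = sum((w * x - y) * x for x, y in batch) / b
    return w - eta * grad

random.seed(0)
# Hypothetical data from y = 2x for x = 1..8.
data = [(float(x), 2.0 * x) for x in range(1, 9)]

w = 0.0
batch_size = 4
for epoch in range(50):
    random.shuffle(data)                          # new batches every epoch
    for start in range(0, len(data), batch_size):
        w = minibatch_step(w, data[start:start + batch_size], eta=0.01)

print(round(w, 4))
```

Each epoch performs `len(data) / batch_size` updates, which sits between one update per epoch (batch) and one update per sample (SGD).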

---

### 4. **Momentum Gradient Descent**  
- **Definition**: Incorporates the past update direction for smoother and faster convergence.  
- **Update Rule**:  
Formulas:  
1. Velocity update:  
   $$ v_t := \gamma \cdot v_{t-1} + \eta \cdot \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred}}^{(i)} - y_{\text{true}}^{(i)}) \cdot x^{(i)} $$  
2. Weight update:  
   $$ w := w - v_t $$  

#### Terms:
1. **$v_t$**: Velocity at time step $t$, which incorporates past gradients.
2. **$\gamma$ (Momentum Coefficient)**: Determines the contribution of past gradients (e.g., $\gamma = 0.9$).
3. **$v_{t-1}$**: Velocity from the previous update step.
4. Other terms ($w$, $\eta$, $x$, $y_{\text{pred}}$, $y_{\text{true}}$) are as defined above.

**Key Property:** Introduces a momentum term to accelerate convergence, particularly in regions with oscillations.

- **Pros**: Reduces oscillations, accelerates convergence.  
- **Cons**: Requires tuning $\gamma$.  
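
The two momentum formulas above can be sketched as follows, assuming the same one-weight model $y_{\text{pred}} = w \cdot x$ and hypothetical data. The name `momentum_step` and the values $\eta = 0.05$, $\gamma = 0.9$ are illustrative choices.

```python
def momentum_step(w, v, xs, ys, eta, gamma):
    """Return the updated (w, v) after one momentum step."""
    n = len(xs)
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
    v = gamma * v + eta * grad    # velocity update
    return w - v, v               # weight update

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # hypothetical data from y = 2x

w, v = 0.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, xs, ys, eta=0.05, gamma=0.9)

print(round(w, 4))
```

Setting `gamma=0.0` reduces this to plain batch gradient descent, which is a quick way to check the implementation.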


---

### Key Differences:  
- **Batch**: Stable but slow.  
- **SGD**: Fast but noisy.  
- **Mini-Batch**: Balanced approach.  
- **Momentum**: Faster convergence by leveraging direction history.  

# __Differences Between _Loss Function_ and _Gradient Descent_:__  

### **Loss Function**
- **Definition**: A mathematical function that measures the error between predicted values and actual values.  
- **Purpose**: Quantifies how well the model is performing.  
- **Examples**:  
  - Mean Squared Error (MSE): $$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$  
  - Cross-Entropy Loss: $$L = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right)$$  
- **Role**: Determines *what* to minimize.  
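
The two example losses can be written directly from their formulas. This is a plain-Python sketch; the clipping constant `eps` is an added assumption to avoid `log(0)`, and the sample values are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy with the natural logarithm."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.6]
print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))
```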

---

### **Gradient Descent**
- **Definition**: An optimization algorithm used to minimize the loss function by updating model parameters.  
- **Purpose**: Finds the values of model parameters ($\theta$) that minimize the loss.  
- **Update Rule**:  
  $$\theta = \theta - \alpha \nabla J(\theta)$$  
  where $\alpha$ is the learning rate.  
- **Role**: Provides the *how* to minimize the loss.  

---

### Key Difference:  
- **Loss Function**: Defines the objective to minimize (error).  
- **Gradient Descent**: Provides the mechanism to minimize the loss function.

__Optimization:__

__Step - 1__: _Gradient Calculation: the slope of the loss, including its direction (sign)._

__Step - 2__: _Updating the parameters._

__Step - 3__: _Repeat the process from Step 3 of the previous lecture until the defined number of epochs is reached or the loss has decreased sufficiently._

_Note: Remember, this is a single perceptron._

Now, in classification problems we use activation functions such as the sigmoid, which maps inputs tending towards $+\infty$ to values approaching one and inputs tending towards $-\infty$ to values approaching zero.
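
The optimization steps and the sigmoid activation can be sketched together for a single perceptron. This assumes one input feature, a bias term `b`, and the cross-entropy loss, whose gradient for a sigmoid output is $(y_{\text{pred}} - y_{\text{true}}) \cdot x$. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real z into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_perceptron(xs, ys, eta=0.5, epochs=200):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):                       # Step 3: repeat for each epoch
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n   # Step 1: gradient
        grad_b = sum(errors) / n
        w -= eta * grad_w                         # Step 2: update the parameters
        b -= eta * grad_b
    return w, b

# Hypothetical 1-D data: negative inputs are class 0, positive inputs are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = train_perceptron(xs, ys)
preds = [sigmoid(w * x + b) for x in xs]
print([round(p, 3) for p in preds])
```

After training, the predictions for negative inputs fall below 0.5 and those for positive inputs rise above it.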

# __Why not MSE in Classification Problems:__
We don't typically use **Mean Squared Error (MSE)** for classification tasks because:  

### 1. **Inefficiency in Capturing Probabilities**  
- MSE calculates the squared difference between predicted and true labels.  
- Classification often involves probabilities (e.g., from softmax or sigmoid), and MSE is not ideal for measuring the divergence between distributions.

### 2. **Gradient Saturation**  
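- For models with a sigmoid output, such as logistic regression or neural networks, the MSE gradient contains the factor $\hat{y}(1-\hat{y})$, which becomes very small whenever the output is near 0 or 1, even when the prediction is confidently wrong. This slows down learning.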
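
A quick numeric check of the saturation effect, assuming a sigmoid output $\hat{y} = \sigma(z)$ and a single sample. The gradients are taken with respect to the pre-activation $z$, and the value $z = -6$ is a hypothetical confidently wrong prediction for a true label of 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_z(z, y):
    """d/dz of (sigmoid(z) - y)^2 = 2 (p - y) p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def ce_grad_wrt_z(z, y):
    """d/dz of the binary cross-entropy with a sigmoid output = p - y."""
    return sigmoid(z) - y

z, y = -6.0, 1          # the model is confident in the wrong class
mse_g = mse_grad_wrt_z(z, y)
ce_g = ce_grad_wrt_z(z, y)
print(f"MSE gradient: {mse_g:.5f}, cross-entropy gradient: {ce_g:.5f}")
```

The MSE gradient is close to zero even though the prediction is badly wrong, while the cross-entropy gradient stays close to $-1$.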

### 3. **Better Alternatives Exist**  
- **Cross-Entropy Loss** aligns better with probabilistic models, directly penalizing incorrect predictions by comparing predicted probabilities to true labels:
  $$L = -\frac{1}{n} \sum_{i=1}^n \left( y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right)$$  

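__Additional points__: With a sigmoid output, the MSE loss surface is non-convex in the weights, so gradient descent can stall at saddle points or in flat regions. Also, because the errors lie between zero and one, squaring makes them even smaller, so the impact of each error on the loss is minimized.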

### Summary:  
MSE is less effective for classification because it doesn't handle probabilities well, leads to slow convergence, and lacks the efficiency of cross-entropy for penalizing wrong predictions.


In classification tasks, predictions typically involve probabilities—values between 0 and 1—that represent the likelihood of a data point belonging to a specific class. For example, in binary classification, a model might output $0.9$ for "class 1" and $0.1$ for "class 0."

Now, let's understand **why MSE is not ideal** for this setup:  

---

### 1. **Probabilities Are Not Linear**  
- MSE penalizes a prediction by its squared distance from the label, so the penalty is bounded. For instance, if the true label is 1, a prediction of $0.5$ costs $0.25$, and even a confidently wrong prediction of $0.01$ costs only about $0.98$.  
- However, in classification, a confident wrong prediction should be penalized far more heavily than an uncertain one, so this bounded treatment is not suitable.  


---

__The most suitable quantity in this case is _entropy_. Entropy is the degree of surprise; the less entropy, the better. We use cross-entropy because there are multiple values.__

### 2. **Cross-Entropy Fits Better for Probabilities**  
- **Cross-Entropy Loss** directly compares the predicted probability distribution to the true label's one-hot encoded distribution.  
- It penalizes incorrect confident predictions more harshly and rewards correct confident predictions effectively.  

For example:  
- True label: $y = 1$  
- Prediction: $\hat{y} = 0.9$ (confidence is high and correct).  
    - Cross-Entropy Loss: Very small penalty.  
    - MSE: Also a very small penalty, because $(1 - 0.9)^2 = 0.01$.  
- Prediction: $\hat{y} = 0.5$ (low confidence).  
    - Cross-Entropy Loss: Noticeably larger penalty, $-\log(0.5) \approx 0.693$.  
    - MSE: Moderate penalty of $0.25$, which doesn't push the model as strongly to improve.

---

### 3. **Divergence Between Distributions**  
- Classification is about matching probability distributions (true labels vs. predicted probabilities).  
- Cross-Entropy Loss measures the divergence between these distributions (how far apart they are). MSE fails to do this effectively because it isn't designed for probabilistic outputs.

---

### Example Comparison (Binary Classification)  
True label: $y = 1$  
Predicted probabilities: $\hat{y} = 0.9$ vs. $\hat{y} = 0.5$  

| Prediction | Cross-Entropy Loss | MSE       |  
|------------|---------------------|-----------|  
| $\hat{y} = 0.9$ | $-1 \cdot \log(0.9) = 0.105$ | $(1 - 0.9)^2 = 0.01$ |  
| $\hat{y} = 0.5$ | $-1 \cdot \log(0.5) = 0.693$ | $(1 - 0.5)^2 = 0.25$ |  

- Cross-Entropy Loss sharply distinguishes the predictions, pushing the model to get closer to confident probabilities.  
- MSE provides less sharp gradients, making it harder for the model to improve efficiently.  
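
The table entries can be reproduced with a few lines of Python, using the natural logarithm. The helper name `losses_when_true_label_is_one` is an illustrative assumption.

```python
import math

def losses_when_true_label_is_one(p):
    """Return (cross_entropy, squared_error) for y = 1 and prediction p."""
    return -math.log(p), (1.0 - p) ** 2

for p in (0.9, 0.5):
    ce, se = losses_when_true_label_is_one(p)
    print(f"y_hat={p}: cross-entropy={ce:.3f}, squared error={se:.2f}")
```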

---

### Conclusion  
MSE is not ideal for classification because its penalty is bounded, it doesn't effectively penalize confident wrong predictions, and it fails to measure the divergence between probability distributions. Cross-Entropy Loss, on the other hand, is better suited for this task.


In the context of classification tasks, **divergence between probability distributions** refers to how different the predicted probabilities of a model are from the true probabilities (or true labels in one-hot encoding). Let's break this down:

---

### 1. **Probability Distributions in Classification**
- In classification, the model predicts a probability distribution over possible classes.  
  For example, in binary classification:  
  - True label: $y = [1, 0]$ (class 1 is true).  
  - Model prediction: $\hat{y} = [0.9, 0.1]$ (90% confidence for class 1, 10% for class 2).  

- The goal is to make the predicted distribution ($\hat{y}$) as close as possible to the true distribution ($y$).  

---

### 2. **Why MSE Fails Here**
- **MSE** computes the squared difference between each element of the two distributions:  
  $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$  

- This approach doesn't effectively capture the relationship between probabilities: each squared difference is at most 1, so even a confidently wrong prediction receives a bounded penalty. It minimizes squared errors rather than the "distance" between the distributions.  

---

### 3. **How Cross-Entropy Works**
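- **Cross-Entropy Loss** is based on a concept from information theory and measures the "distance" between two probability distributions. It computes how much information is needed, on average, to describe outcomes of the true distribution using the predicted distribution (in nats with the natural logarithm, as in the examples here, or in bits with base-2 logarithms).  
  $$L = -\sum_{i=1}^{n} y_i \cdot \log(\hat{y}_i)$$  
  Here the sum runs over the $n$ classes of a single sample.  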

- In classification:  
  - If $\hat{y}_i$ (predicted probability for the true class) is close to $1$, the loss is small.  
  - If $\hat{y}_i$ is far from $1$, the loss increases rapidly, penalizing confident but wrong predictions.  
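
A minimal sketch of this formula for a one-hot label, using the natural logarithm. The clipping constant `eps` and the second prediction `[0.2, 0.8]` are assumptions added for illustration.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) over the classes of one sample."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

y_true = [1, 0]                                   # class 1 is the true class
close_to_one = cross_entropy(y_true, [0.9, 0.1])  # true-class probability near 1
far_from_one = cross_entropy(y_true, [0.2, 0.8])  # true-class probability far from 1
print(round(close_to_one, 3), round(far_from_one, 3))
```

Only the predicted probability of the true class contributes, because every other $y_i$ is zero.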

---

### 4. **Example**
Let’s take two cases for a binary classification:  
- True label: $y = [1, 0]$  
- Predictions:  
  1. $\hat{y} = [0.9, 0.1]$ (close to the true label).  
  2. $\hat{y} = [0.5, 0.5]$ (maximally uncertain between the two classes).  

#### Loss Calculation:  
1. **MSE**:  
   $$\text{MSE} = \frac{1}{2} \left[(1 - 0.9)^2 + (0 - 0.1)^2\right] = \frac{1}{2}(0.01 + 0.01) = 0.01$$  
   $$\text{MSE} = \frac{1}{2} \left[(1 - 0.5)^2 + (0 - 0.5)^2\right] = \frac{1}{2}(0.25 + 0.25) = 0.25$$  

2. **Cross-Entropy Loss**:  
   $$L = -\left[1 \cdot \log(0.9) + 0 \cdot \log(0.1)\right] = -\log(0.9) = 0.105$$  
   $$L = -\left[1 \cdot \log(0.5) + 0 \cdot \log(0.5)\right] = -\log(0.5) = 0.693$$  

#### Observations:  
- **MSE** penalizes both predictions but doesn't emphasize the confident wrong predictions strongly enough.  
- **Cross-Entropy Loss** penalizes the less confident prediction ($\hat{y} = 0.5$) much more heavily, encouraging the model to improve probabilistic accuracy.  
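
To see how the two losses treat a confidently wrong prediction, the sketch below adds a hypothetical third case, $\hat{y} = [0.01, 0.99]$, which is not part of the example above. MSE is averaged over the two elements and cross-entropy uses the natural logarithm.

```python
import math

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred):
    # Only classes with y > 0 contribute for a one-hot label.
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y_true = [1, 0]
y_wrong = [0.01, 0.99]          # hypothetical confidently wrong prediction

mse_wrong = mse(y_true, y_wrong)
ce_wrong = cross_entropy(y_true, y_wrong)
print(f"MSE={mse_wrong:.4f}, cross-entropy={ce_wrong:.3f}")
```

The MSE can never exceed 1 here, whereas the cross-entropy keeps growing as the probability assigned to the true class approaches zero.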

---

### 5. **Summary of Divergence**
- MSE minimizes the squared differences, which doesn't directly address how "far apart" two distributions are.  
- Cross-Entropy Loss measures the divergence, prioritizing confident, correct predictions and penalizing confident wrong predictions harshly.  
- This makes **Cross-Entropy Loss** the preferred choice for classification tasks involving probabilities.