In [1]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">üëã Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

In [1]:
import pandas as pd
import numpy as np

## **1. Predicting Test Scores**

Imagine you're a teacher who wants to predict student test scores based on **Hours studied**

You collect data from previous students and notice a pattern: students who study more tend to score higher. But you want to **quantify** this relationship!

Like for example: $$TestScore = w * HoursStudied + b$$

Let's say that the data you collected was:

| Hours Studied | Test Score |
|---------------|------------|
| 2             | 75         |
| 4             | 80         |
| 6             | 90         |
| 8             | 94         |
| 10            | 98         |

**Our task**:
- Find the values of $w$
- Build a model that learns the relationship between study time and test scores, then use it to predict scores for new students!

---

## **2. Linear Regression**

The model we are about to build is called **Linear Regression**

- **Linear** because we are trying to model a straight-line (linear) relationship between the input variable and output variable.
- **Regression** because we are predicting a continuous numerical value, not a category or label.

Linear regression is one of the simplest and most widely used machine learning models.

Let‚Äôs rewrite our earlier equation in a general form:

$$y = w x + b$$

Where:

- $y$ -> the predicted output (Test Score)
- $x$ -> Hours Studied
- $w$ -> Weight (coefficient) that measure how strongly the input affects the output
- $b$ -> Bias (intercept), which adjusts the baseline prediction

<div align="center">
  <img src="https://github.com/sshariqali/mnist_pretrained_model/blob/main/gd_v1.png?raw=true" 
       style="width: 1000px; clip-path: inset(0 50% 0 0); margin-right: -500px;" 
       alt="Linear Relationship">
  <p><i>Linear Relationship between X and Y</i></p>
</div>

---

## **3. Problem Framing**

Now we need to find the **best values** for our weights ($w$) and bias ($b$) that minimize the difference between our predictions and actual values.

The **best values** will give us what's called the **Line of Best Fit** that would have the **minimum average difference** between the actual values and predictions.

But how do we know what **best** means?

---

## **4. The Loss Function**

We measure how wrong our predictions are using a **loss function**. The most common one for regression is **Mean Squared Error (MSE)**:

$$\text{MSE (loss)} = \frac{1}{n}\sum_{i=1}^{n}(y_{\text{predicted}} - y_{\text{actual}})^2$$

- This calculates the average of **squared differences** between predictions and actual values
- Squaring ensures errors are always positive and penalizes larger errors more

**Our goal**: Minimize this loss! But how?

---

## **5. Enter Calculus: The Power of Derivatives**

To minimize the loss, we need to know **which direction** to adjust our parameters ($w$, $b$) and **by how much**. This is where **derivatives** (also called **gradients**) come in!

- A derivative tells us the **rate of change** of a function
- In our case: *"If I change $w_1$ by a tiny amount, how much does the loss change?"*
- Mathematically: $\frac{\partial \text{Loss}}{\partial w}$ means "the partial derivative of loss with respect to $w$"

Recall that the **Partial Derivative** $\frac{\partial \text{Loss}}{\partial w}$ is computed Mathematicaly:

$$
\boxed{
\frac{\partial \text{Loss}}{\partial w}
= \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_{i}
}
$$

**Sign of a derivative:**
1. **Positive derivative** ‚Üí Loss increases as parameter increases ‚Üí We should **decrease** the parameter
2. **Negative derivative** ‚Üí Loss decreases as parameter increases ‚Üí We should **increase** the parameter
3. **Zero derivative** ‚Üí We're at a minimum (or maximum)!

**The optimization rule:**
$$w_{\text{new}} = w_{\text{old}} - \alpha \frac{\partial \text{Loss}}{\partial w}$$

Where:
- $\alpha$ is the **learning rate** (how big a step we take)
- We **subtract** the derivative to move in the direction that reduces loss

## **6. Gradient Descent!**

When we **repeat** this process of:
1. Using derivatives to find which direction reduces loss
2. Taking a small step (learning rate) in that direction
3. Updating our parameters

Over and over until we reach the minimum loss ‚Äî this becomes the **Gradient Descent Algorithm**!

Here are the steps:

1. **Initialize**: Start with random values for $w$, and $b$

2. **Forward Pass**: Calculate predictions using current parameters
   $$\hat{y} = w x + b$$

3. **Calculate Loss**: Measure how wrong our predictions are using MSE
   $$\text{Loss} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

4. **Compute Gradients**: Calculate the derivatives (how much each parameter affects the loss)
   $$\frac{\partial \text{Loss}}{\partial w}, \quad \frac{\partial \text{Loss}}{\partial b}$$

5. **Update Parameters**: Take a step in the direction that reduces loss
   $$w = w - \alpha \frac{\partial \text{Loss}}{\partial w}$$
   $$b = b - \alpha \frac{\partial \text{Loss}}{\partial b}$$

6. **Repeat**: Go back to step 2 and repeat until the loss stops decreasing (convergence)

This iterative process gradually moves our parameters toward the optimal values that minimize the loss function!

<div align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/631731_P7z2BKhd0R-9uyn9ThDasA.webp" width="700"/>
  <p><i>Gradient Descent Curve</i></p>
</div>

---

## **7. PyTorch Implementation**

In [None]:
import torch

<torch._C.Generator at 0x1d4418105d0>

In [3]:
# --- Step 1: Prepare the data as PyTorch tensors ---

x = torch.tensor([2, 4, 6, 8, 10], dtype = torch.float32)
y = torch.tensor([75, 80, 90, 94, 98], dtype = torch.float32)

print("Input features (x):")
print(x)
print(x.shape)
print("\nTarget values (y):")
print(y)
print(y.shape)

Input features (x):
tensor([ 2.,  4.,  6.,  8., 10.])
torch.Size([5])

Target values (y):
tensor([75., 80., 90., 94., 98.])
torch.Size([5])


In [4]:
# --- Step 2: Initialize parameters randomly ---

torch.manual_seed(42)
w = torch.randn(1, dtype = torch.float32)  # 1 weight (w)
b = torch.randn(1, dtype = torch.float32)  # 1 bias

print(f"Initial weights:\n{w}")
print(f"Weights Shape: {w.shape}")
print(f"\nInitial bias: {b.item():.4f}")
print(f"Bias Shape: {b.shape}")

Initial weights:
tensor([0.3367])
Weights Shape: torch.Size([1])

Initial bias: 0.1288
Bias Shape: torch.Size([1])


In [5]:
# --- Step 3: Setup Hyperparameters ---

learning_rate = 0.01
epochs = 2000 # Number of iterations for gradient descent
n = len(x)  # Number of data points

In [6]:
# --- Step 4: Implement Gradient Descent with Manual Gradients ---

# Store loss history for visualization
loss_history = []
    
for epoch in range(epochs):
    
    # Forward pass: compute predictions
    y_pred = x * w + b

    # Compute loss (MSE)
    loss = torch.mean((y_pred - y) ** 2)
    loss_history.append(loss.item())

    # --- Manually compute gradients ---
    # Remember: dL/dw = (2/n) * X^T @ (y_pred - y)
    #           dL/db = (2/n) * sum(y_pred - y)

    error = y_pred - y

    # Gradient for weights
    dw = (2 / n) * torch.sum(x * error)

    # Gradient for bias
    db = (2 / n) * torch.sum(error)

    # Update parameters using gradient descent
    w = w - learning_rate * dw
    b = b - learning_rate * db

    # Print progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

print("\n‚úÖ Training complete!")

Epoch [100/2000], Loss: 412.0172
Epoch [200/2000], Loss: 202.6555
Epoch [300/2000], Loss: 100.2602
Epoch [400/2000], Loss: 50.1801
Epoch [500/2000], Loss: 25.6868
Epoch [600/2000], Loss: 13.7074
Epoch [700/2000], Loss: 7.8486
Epoch [800/2000], Loss: 4.9831
Epoch [900/2000], Loss: 3.5816
Epoch [1000/2000], Loss: 2.8962
Epoch [1100/2000], Loss: 2.5609
Epoch [1200/2000], Loss: 2.3970
Epoch [1300/2000], Loss: 2.3168
Epoch [1400/2000], Loss: 2.2775
Epoch [1500/2000], Loss: 2.2584
Epoch [1600/2000], Loss: 2.2490
Epoch [1700/2000], Loss: 2.2444
Epoch [1800/2000], Loss: 2.2421
Epoch [1900/2000], Loss: 2.2411
Epoch [2000/2000], Loss: 2.2405

‚úÖ Training complete!


In [None]:
# --- Step 5: Check final parameters ---

print(f"Trained weight: {w.item()}")
print(f"Trained bias: {b.item():.4f}")

Trained weight: 3.007255792617798
Trained bias: 69.3470


Now that we have trained our model and found the optimal values for **weight ($w$)** and **bias ($b$)**, we can use them to predict the test score for any number of hours studied!

Our final model equation is:

$$\text{Predicted Score} = w \times \text{Hours Studied} + b$$

For example, if a student studies for **5 hours**, we simply plug that value into our equation:

$$\text{Predicted Score} = (3.01 \times 5) + 69.35 \approx 84.40$$

This allows us to estimate performance for new students based on the patterns the model learned from the previous data.

---

In [8]:
# --- Step 6: Use the trained model to make predictions ---

x_new = torch.tensor([5, 7, 9], dtype = torch.float32)

new_predictions =  x_new * w + b

new_predictions

tensor([84.3833, 90.3978, 96.4123])

<div align="center">
  <img src="https://github.com/sshariqali/mnist_pretrained_model/blob/main/gd_3.png?raw=true" width="700"/>
  <p><i>Final Predictions</i></p>
</div>

## **8. PyTorch (Auto Grad)**

In the previous section, we manually calculated gradients using the formulas we derived. But imagine if we had:
- **Hundreds of parameters** instead of just 3
- **Complex neural networks** with multiple layers
- **Different activation functions** and architectures

Manually computing gradients would become extremely tedious and error-prone! üò∞

This is where **PyTorch's Autograd** (Automatic Differentiation) comes to the rescue! üéâ

**Autograd** automatically computes gradients for us by:
1. Tracking all operations on tensors that have `requires_grad=True`
2. Building a computational graph in the background
3. Using the chain rule to compute gradients automatically when we call `.backward()`

Let's implement the **exact same** linear regression model, but this time letting PyTorch handle all the gradient calculations!

In [9]:
# --- Step 1: Initialize parameters with requires_grad = True ---

torch.manual_seed(42)
w_auto = torch.randn(1, dtype = torch.float32, requires_grad = True)  # 1 weight (w)
b_auto = torch.randn(1, dtype = torch.float32, requires_grad = True)  # 1 bias

print(f"Initial weights:\n{w_auto}")
print(f"Weights Shape: {w_auto.shape}")
print(f"\nInitial bias: {b_auto.item():.4f}")
print(f"Bias Shape: {b_auto.shape}")

Initial weights:
tensor([0.3367], requires_grad=True)
Weights Shape: torch.Size([1])

Initial bias: 0.1288
Bias Shape: torch.Size([1])


In [10]:
# --- Step 2: Setup Hyperparameters ---

learning_rate = 0.01
epochs = 2000 # Number of iterations for gradient descent
n = len(x)  # Number of data points

In [11]:
# --- Step 3: Gradient Descent with Autograd ---

loss_history_auto = []

for epoch in range(epochs):

    # Forward pass
    y_pred = x * w_auto + b_auto

    # Compute loss
    loss = torch.mean((y_pred - y) ** 2)
    loss_history_auto.append(loss.item())

    # Compute gradients automatically!
    loss.backward()  # This computes dL/dw and dL/db for us!

    # Update parameters (we need to disable gradient tracking for this update)
    with torch.no_grad():
        w_auto -= learning_rate * w_auto.grad
        b_auto -= learning_rate * b_auto.grad

        # Zero out gradients for next iteration
        w_auto.grad.zero_()
        b_auto.grad.zero_()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

print("\n‚úÖ Training complete with Autograd!")

Epoch [100/2000], Loss: 412.0172
Epoch [200/2000], Loss: 202.6555
Epoch [300/2000], Loss: 100.2602
Epoch [400/2000], Loss: 50.1801
Epoch [500/2000], Loss: 25.6868
Epoch [600/2000], Loss: 13.7074
Epoch [700/2000], Loss: 7.8486
Epoch [800/2000], Loss: 4.9831
Epoch [900/2000], Loss: 3.5816
Epoch [1000/2000], Loss: 2.8962
Epoch [1100/2000], Loss: 2.5609
Epoch [1200/2000], Loss: 2.3970
Epoch [1300/2000], Loss: 2.3168
Epoch [1400/2000], Loss: 2.2776
Epoch [1500/2000], Loss: 2.2584
Epoch [1600/2000], Loss: 2.2490
Epoch [1700/2000], Loss: 2.2444
Epoch [1800/2000], Loss: 2.2422
Epoch [1900/2000], Loss: 2.2411
Epoch [2000/2000], Loss: 2.2405

‚úÖ Training complete with Autograd!


**Key Autograd Concepts**

*   **`requires_grad=True`**: This signals PyTorch to start building the "Forward Pass" graph so it knows how to apply the Chain Rule later during backpropagation.

*   **`torch.no_grad()`**: When updating the weights ($w = w - \alpha \cdot \text{grad}$), you don't want PyTorch to track that specific subtraction as part of the gradient calculation. It is a manual update step that shouldn't be part of the computational graph.

*   **In-place Updates**: We use `w_auto -= learning_rate * w_auto.grad` instead of `w_auto = w_auto - learning_rate * w_auto.grad` to modify the tensor data directly without creating a new tensor that would break the gradient tracking.

*   **`.zero_()`**: PyTorch accumulates gradients (adds them together) every time you call `.backward()`. If you don't zero them out, the current gradient will be added to the previous one, leading to incorrect updates.

In [12]:
# --- Step 4: Check final parameters ---

print(f"Trained weight: {w_auto.item()}")
print(f"Trained bias: {b_auto.item():.4f}")

Trained weight: 3.007255792617798
Trained bias: 69.3470


## **9. Multiple Linear Regression**


Suppose our dataset now has an additional column `Hours Slept` which will also be used in modelling `Test Score`

| Hours Studied | Hours Slept | Test Score |
|---------------|-------------|------------|
| 2             | 6           | 75         |
| 4             | 7           | 80         |
| 6             | 8           | 90         |
| 8             | 7.5         | 94         |
| 10            | 8           | 98         |

New Equation:

$$TestScore = w_1 * HoursStudied + w_2 * HoursSlept + b$$
$$y = w_1x_1 + w_2x_2 + b$$

**Your task**:
- Find the values of $w_1$ and $w_2$

In [14]:
# --- Step 1: Prepare the data as PyTorch tensors ---

x = torch.tensor([[2, 4, 6, 8, 10], [6, 7, 8, 7.5, 8]], dtype = torch.float32).reshape(5 , 2)
y = torch.tensor([75, 80, 90, 94, 98], dtype = torch.float32)

print("Input features (x):")
print(x)
print(x.shape)
print("\nTarget values (y):")
print(y)
print(y.shape)

Input features (x):
tensor([[ 2.0000,  4.0000],
        [ 6.0000,  8.0000],
        [10.0000,  6.0000],
        [ 7.0000,  8.0000],
        [ 7.5000,  8.0000]])
torch.Size([5, 2])

Target values (y):
tensor([75., 80., 90., 94., 98.])
torch.Size([5])


In [15]:
# --- Step 2: Initialize parameters with requires_grad = True ---

torch.manual_seed(42)
w_auto = torch.randn(2, 1, dtype = torch.float32, requires_grad = True)  # 2 weights (w1, w2)
b_auto = torch.randn(1, dtype = torch.float32, requires_grad = True)  # 1 bias

print(f"Initial weights:\n{w_auto}")
print(f"Weights Shape: {w_auto.shape}")
print(f"\nInitial bias: {b_auto.item():.4f}")
print(f"Bias Shape: {b_auto.shape}")

Initial weights:
tensor([[0.3367],
        [0.1288]], requires_grad=True)
Weights Shape: torch.Size([2, 1])

Initial bias: 0.2345
Bias Shape: torch.Size([1])


In [16]:
# --- Step 3: Setup Hyperparameters ---

learning_rate = 0.01
epochs = 5000 # Number of iterations for gradient descent
n = len(x)  # Number of data points

In [None]:
# --- Step 3: Gradient Descent with Autograd ---

loss_history_auto = []

for epoch in range(epochs):

    # Forward pass
    y_pred = x @ w_auto + b_auto # Matrix multiplication (5x2 @ 2x1 = 5x1)

    # Compute loss
    loss = torch.mean((y_pred.squeeze() - y) ** 2)
    loss_history_auto.append(loss.item())

    # Compute gradients automatically!
    loss.backward()  # This computes dL/dw and dL/db for us!

    # Update parameters (we need to disable gradient tracking for this update)
    with torch.no_grad():
        w_auto -= learning_rate * w_auto.grad
        b_auto -= learning_rate * b_auto.grad

        # Zero out gradients for next iteration
        w_auto.grad.zero_()
        b_auto.grad.zero_()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

print("\n‚úÖ Training complete with Autograd!")

Epoch [100/5000], Loss: 186.2492
Epoch [200/5000], Loss: 156.7641
Epoch [300/5000], Loss: 132.7553
Epoch [400/5000], Loss: 113.2028
Epoch [500/5000], Loss: 97.2796
Epoch [600/5000], Loss: 84.3120
Epoch [700/5000], Loss: 73.7513
Epoch [800/5000], Loss: 65.1508
Epoch [900/5000], Loss: 58.1467
Epoch [1000/5000], Loss: 52.4427
Epoch [1100/5000], Loss: 47.7974
Epoch [1200/5000], Loss: 44.0143
Epoch [1300/5000], Loss: 40.9335
Epoch [1400/5000], Loss: 38.4245
Epoch [1500/5000], Loss: 36.3812
Epoch [1600/5000], Loss: 34.7171
Epoch [1700/5000], Loss: 33.3619
Epoch [1800/5000], Loss: 32.2583
Epoch [1900/5000], Loss: 31.3596
Epoch [2000/5000], Loss: 30.6277
Epoch [2100/5000], Loss: 30.0316
Epoch [2200/5000], Loss: 29.5461
Epoch [2300/5000], Loss: 29.1508
Epoch [2400/5000], Loss: 28.8288
Epoch [2500/5000], Loss: 28.5666
Epoch [2600/5000], Loss: 28.3530
Epoch [2700/5000], Loss: 28.1791
Epoch [2800/5000], Loss: 28.0375
Epoch [2900/5000], Loss: 27.9222
Epoch [3000/5000], Loss: 27.8283
Epoch [3100/500

In [20]:
# --- Step 4: Check final parameters ---

print(f"Trained weights: {w_auto[0].item():.4f}, {w_auto[1].item():.4f}")
print(f"Trained bias: {b_auto.item():.4f}")

Trained weights: 1.8086, 1.9646
Trained bias: 62.2669


## **10. Reading Material**

### Partial Derivative of Loss - Mathematical Derivation!

To minimize the loss, we need to know **which direction** to adjust our parameters ($w_1$, $w_2$, $b$) and **by how much**. This is where **derivatives** (also called **gradients**) come in!

- A derivative tells us the **rate of change** of a function
- In our case: *"If I change $w_1$ by a tiny amount, how much does the loss change?"*
- Mathematically: $\frac{\partial \text{Loss}}{\partial w_1}$ means "the partial derivative of loss with respect to $w_1$"

Now lets see how this **Partial Derivative** $\frac{\partial \text{Loss}}{\partial w_1}$ is computed Mathematicaly:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
$$

where the predicted value is given by the linear model:

$$
\hat{y}_i = w_1 x_{1i} + w_2 x_{2i} + w_0
$$

where $i$ means a student

---

1Ô∏è‚É£ **Substitute the prediction formula**
$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} ((w_1 x_{1i} + w_2 x_{2i} + w_0) - y_i)^2
$$

---

2Ô∏è‚É£ **Take the partial derivative of Loss with respect to $w_1$**
$$
\frac{\partial \text{Loss}}{\partial w_1}
= \frac{\partial}{\partial w_1}
\left( \frac{1}{n} \sum_{i=1}^{n} ((w_1 x_{1i} + w_2 x_{2i} + w_0) - y_i)^2 \right)
$$

Bring the constant $\frac{1}{n}$ outside:

$$
\frac{\partial \text{Loss}}{\partial w_1}
= \frac{1}{n} \sum_{i=1}^{n}
\frac{\partial}{\partial w_1}
((w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i)^2)
$$

---

3Ô∏è‚É£ **Apply the chain rule**
$$
\frac{\partial}{\partial w_1} ((w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i)^2)
= 2(w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i) \cdot \frac{\partial}{\partial w_1}(w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i)
$$

$$
\frac{\partial}{\partial w_1}(w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i) = x_{1i}
$$

So,

$$
\frac{\partial}{\partial w_1} ((w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i)^2)
= 2 x_{1i} (w_1 x_{1i} + w_2 x_{2i} + w_0 - y_i)
$$

---

4Ô∏è‚É£ **Substitute back**
$$
\frac{\partial \text{Loss}}{\partial w_1}
= \frac{2}{n} \sum_{i=1}^{n} ((w_1 x_{1i} + w_2 x_{2i} + w_0) - y_i) x_{1i}
$$

---

5Ô∏è‚É£ **Simplify**
$$
\boxed{
\frac{\partial \text{Loss}}{\partial w_1}
= \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_{1i}
}
$$

---

**Why derivatives help us minimize loss:**
1. **Positive derivative** ‚Üí Loss increases as parameter increases ‚Üí We should **decrease** the parameter
2. **Negative derivative** ‚Üí Loss decreases as parameter increases ‚Üí We should **increase** the parameter
3. **Zero derivative** ‚Üí We're at a minimum (or maximum)!

**The optimization rule:**
$$w_{\text{new}} = w_{\text{old}} - \alpha \frac{\partial \text{Loss}}{\partial w}$$

Where:
- $\alpha$ is the **learning rate** (how big a step we take)
- We **subtract** the derivative to move in the direction that reduces loss

Think of it like hiking down a mountain in fog:
- The derivative tells you which way is downhill (steepest descent)
- The learning rate determines how big your steps are
- You keep taking steps downhill until you reach the bottom (minimum loss)!

<div align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/631731_P7z2BKhd0R-9uyn9ThDasA.webp" width="700"/>
  <p><i>Gradient Descent Curve</i></p>
</div>

---

---

### **Conclusion: What We Learned**

Congratulations! You've just built a complete linear regression model from scratch using PyTorch! üéâ

**Key Concepts Covered:**

1. **Linear Regression Basics**
   - Modeling relationships between inputs and outputs
   - Understanding weights, bias, and predictions

2. **Loss Functions**
   - Mean Squared Error (MSE) to measure prediction quality
   - The goal: minimize loss to improve our model

3. **Calculus & Gradients**
   - Derivatives tell us how to improve parameters
   - Chain rule for computing gradients
   - Partial derivatives for multi-variable optimization

4. **Gradient Descent Algorithm**
   - Iteratively updating parameters to minimize loss
   - Learning rate controls step size
   - Forward pass ‚Üí compute loss ‚Üí compute gradients ‚Üí update parameters

5. **Manual Implementation**
   - Coded gradient formulas ourselves
   - Understood the math behind the optimization

6. **PyTorch Autograd**
   - Automatic differentiation eliminates manual gradient coding
   - `.backward()` does the calculus for us
   - Same results with less code and no errors!

**What's Next?**

This simple linear regression is the foundation for:
- **Deep Neural Networks** (stack multiple layers)
- **Convolutional Neural Networks** (for images)
- **Recurrent Neural Networks** (for sequences)
- **Transformers** (for language models like GPT)

The principles remain the same:
1. Define a model
2. Define a loss function
3. Use gradients to optimize
4. Let autograd handle the calculus!

---

**You now understand the core of deep learning!** üöÄ

Every complex neural network is just this same process scaled up with more layers, parameters, and data. The mathematics of gradient descent and backpropagation powers everything from ChatGPT to self-driving cars!

Keep exploring, and happy learning! üìö‚ú®