<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/ml/extremely_simple_linear_regression_manually_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Now for the training step. Here’s how it works:

1. loss.backward() runs backpropagation starting from the loss output node and populates the .grad attribute of every leaf tensor created with requires_grad=True that took part in computing the loss. tensor.grad holds the gradient of the loss with respect to that tensor.

2. We read the .grad attribute to get the gradients of the loss with respect to W and b.

3. We update W and b using those gradients. Because these updates are not meant to be part of the backward pass, we do them inside a torch.no_grad() scope, which disables gradient tracking for everything inside it.

4. We reset the .grad attribute of W and b by setting it to None. If we didn’t do this, gradients would accumulate (be summed) across successive calls to run_training(), and each update would use the wrong values.
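Step 4 is easy to overlook, so here is a minimal sketch of what accumulation looks like. It uses a hypothetical one-parameter example (the tensor `w` and the loss `w ** 2` are made up for illustration and are not part of the model below):

```python
import torch

# Hypothetical single parameter, separate from the notebook's W and b.
w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 2.0
(w ** 2).backward()
first = w.grad.item()

# Second backward pass WITHOUT resetting: the new gradient is added to the old one.
(w ** 2).backward()
accumulated = w.grad.item()

# After resetting .grad, a backward pass gives the single-pass gradient again.
w.grad = None
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 2.0 4.0 2.0
```

The second value is double the first only because the gradient was summed, not because the function changed.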

## 1. Theoretical Derivative

**You are optimizing the parameters $ W $ and $ b $ in the model:**

$$
\text{prediction} = W \cdot x + b
$$

with a squared-error loss (the mean squared error for a single sample):

$$
\text{loss} = (y - \text{prediction})^2
$$

---

## 2. Which Gradient Is Actually Calculated?

You are calculating the gradient of the loss **with respect to the parameters** $ W $ and $ b $, **not with respect to $ x $**.

Let’s compute the gradient with respect to $ W $:

Given:
- $ x = 1 $
- $ y = 2 $
- $ \text{prediction} = W \cdot x + b $
- $ \text{loss} = (y - \text{prediction})^2 $

The derivative of the loss with respect to $ W $ is:

$$
\frac{d(\text{loss})}{dW} = 2 \cdot (y - \text{prediction}) \cdot (-x)
$$

So, the gradient depends on the current values of $ W $ and $ b $.
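
The formula above follows from the chain rule. As a short derivation sketch, write $ e = y - \text{prediction} $ (a helper symbol introduced here for convenience, not part of the original notation), so that $ \text{loss} = e^2 $:

$$
\frac{d(\text{loss})}{dW} = \frac{d(e^2)}{de} \cdot \frac{de}{dW} = 2e \cdot (-x) = 2 \cdot (y - \text{prediction}) \cdot (-x)
$$

The same steps give the gradient with respect to $ b $, since $ \frac{de}{db} = -1 $:

$$
\frac{d(\text{loss})}{db} = 2 \cdot (y - \text{prediction}) \cdot (-1)
$$

With $ x = 1 $ the two gradients are equal, which is why the training output later in this notebook prints the same value for the W gradient and the b gradient.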

---

## 3. Example Calculation

Suppose:
- $ W = 0.5 $
- $ b = 0 $
- $ x = 1 $
- $ y = 2 $

Then:
- $ \text{prediction} = 0.5 \times 1 + 0 = 0.5 $
- $ \text{loss} = (2 - 0.5)^2 = 2.25 $
- Gradient w.r.t. $ W $:

$$
\frac{d(\text{loss})}{dW} = 2 \cdot (2 - 0.5) \cdot (-1) = 2 \cdot 1.5 \cdot (-1) = -3
$$
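
The arithmetic above can be checked numerically. The sketch below uses plain Python and a central finite difference; the helper name `loss_fn` and the step size `eps` are illustrative choices, not part of the notebook:

```python
def loss_fn(W, b, x, y):
    """Squared-error loss for the model prediction = W * x + b."""
    prediction = W * x + b
    return (y - prediction) ** 2

# Values from the worked example
W, b, x, y = 0.5, 0.0, 1.0, 2.0

# Analytic gradient from the formula: 2 * (y - prediction) * (-x)
analytic = 2 * (y - (W * x + b)) * (-x)

# Numerical estimate: nudge W slightly in both directions and measure the slope
eps = 1e-6
numeric = (loss_fn(W + eps, b, x, y) - loss_fn(W - eps, b, x, y)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

Both values should agree with the hand-computed gradient of -3 up to floating-point rounding.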






In [155]:
import torch

epochs = 5

learning_rate = 0.1

# Data
X = torch.tensor([[1.0]])
y = 2 * X  + 1

# Parameters
W = torch.tensor([[2.3]], requires_grad=True)
b = torch.tensor(1.4, requires_grad=True)


In [156]:

def run_training(X, y, W, b):
  # Forward pass
  predictions = torch.matmul(X, W) + b
  loss = torch.square(y - predictions).sum()  # Ensure scalar loss
  print("\nloss", loss)
  # Backward pass
  loss.backward()

  # Update weights
  with torch.no_grad():
    W -= W.grad * learning_rate
    b -= b.grad * learning_rate

  print("W gradient:", W.grad.item())
  print("b gradient:", b.grad.item())
  print("W after:", W.item())
  print("b after:", b.item())

  # Reset gradients; otherwise the next backward() call would add to them
  W.grad = None
  b.grad = None



In [154]:
for i in range(epochs):
  run_training(X, y, W, b)


loss tensor(0.4900, grad_fn=<SumBackward0>)
W gradient: 1.3999996185302734
b gradient: 1.3999996185302734
W after: 2.1600000858306885
b after: 1.2599999904632568

loss tensor(0.1764, grad_fn=<SumBackward0>)
W gradient: 0.8400001525878906
b gradient: 0.8400001525878906
W after: 2.0759999752044678
b after: 1.1759999990463257

loss tensor(0.0635, grad_fn=<SumBackward0>)
W gradient: 0.5039997100830078
b gradient: 0.5039997100830078
W after: 2.025599956512451
b after: 1.125599980354309

loss tensor(0.0229, grad_fn=<SumBackward0>)
W gradient: 0.3023996353149414
b gradient: 0.3023996353149414
W after: 1.995360016822815
b after: 1.0953600406646729

loss tensor(0.0082, grad_fn=<SumBackward0>)
W gradient: 0.1814403533935547
b gradient: 0.1814403533935547
W after: 1.9772160053253174
b after: 1.0772160291671753

loss tensor(0.0030, grad_fn=<SumBackward0>)
W gradient: 0.10886383056640625
b gradient: 0.10886383056640625
W after: 1.966329574584961
b after: 1.0663295984268188

loss tensor(0.0011, gra