In [None]:
import numpy as np

# Chapter 2: Visualizing Gradient Descent

Now that you've learned how gradient descent works, it's time to put your knowledge into action :-)

We're generating a new synthetic dataset using *b = 0.5* and *w = -3* for a **linear regression with a single feature (x)**:

$$
\Large
y = b + w x
$$

You'll implement the **five steps** of gradient descent in order to **learn these parameters** from the data.

## Data Generation

In [None]:
true_b = .5
true_w = -3
N = 100

# Data Generation
np.random.seed(42)
x = np.random.rand(N, 1)
epsilon = (.1 * np.random.randn(N, 1))
y = true_b + true_w * x + epsilon

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

## Step 0: Random Initialization

The first step - actually, the zeroth step - is the *random initialization* of the parameters. Using NumPy's `random.randn` function, you should write code to initialize both *b* and *w*:

### Answer

In [None]:
# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)

b = np.random.randn(1)
w = np.random.randn(1)

print(b, w)

## Step 1: Compute Model's Predictions

The first step (for real) is the **forward pass**, that is, the **predictions** of the model. Our model is a linear regression with a single feature (x), and its parameters are *b* and *w*. You should write code to generate predictions (yhat):

### Answer

In [None]:
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train
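
The forward pass relies on NumPy broadcasting: the parameters have shape `(1,)`, while the features have shape `(80, 1)`. Here is a minimal sketch of the shapes involved. The names `x_demo`, `b_demo`, `w_demo` and `yhat_demo` are stand-ins introduced for illustration, and `x_demo` is random data rather than the actual training split:

```python
import numpy as np

np.random.seed(42)

# Stand-in for the training features: 80 points, one feature each (assumed)
x_demo = np.random.rand(80, 1)

# Parameters drawn the same way as in Step 0
b_demo = np.random.randn(1)
w_demo = np.random.randn(1)

# Broadcasting stretches the (1,) parameters across all 80 rows
yhat_demo = b_demo + w_demo * x_demo

print(b_demo.shape, x_demo.shape, yhat_demo.shape)
# (1,) (80, 1) (80, 1)
```

Each row of `yhat_demo` is the prediction for the corresponding row of `x_demo`, so no loop is needed.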

## Step 2: Compute the Mean Squared Error (MSE) Loss

Since our model is a linear regression, the appropriate loss is the **Mean Squared Error (MSE)** loss:

$$
\Large
error_i = \hat{y}_i - y_i
\\
\Large
loss = \frac{1}{N}\sum_{i=1}^N{error_i^2}
$$

For each data point (i) in our training set, you should write code to compute the difference between the model's predictions (yhat) and the actual values (y_train), and use the errors of all N data points to compute the loss:

Note: DO NOT use loops!

### Answer

In [None]:
# Step 2 - Computing the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train)

# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
print(loss)
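
To see that the vectorized expression matches the MSE formula term by term, the sketch below compares it against an explicit sum over the data points. The loop is there for illustration only. The data and the values of `b_demo` and `w_demo` are made-up stand-ins, not the ones used in this notebook:

```python
import numpy as np

np.random.seed(42)

# Made-up data with the same structure as the training set (assumed)
x_demo = np.random.rand(80, 1)
y_demo = 0.5 - 3 * x_demo + 0.1 * np.random.randn(80, 1)

# Arbitrary parameter values, chosen only for this comparison
b_demo, w_demo = 0.1, 0.2

yhat_demo = b_demo + w_demo * x_demo
error_demo = yhat_demo - y_demo

# Vectorized MSE, as in the answer above
loss_vectorized = (error_demo ** 2).mean()

# The same quantity written as the explicit sum from the formula
n_points = len(x_demo)
loss_loop = 0.0
for i in range(n_points):
    loss_loop += (yhat_demo[i, 0] - y_demo[i, 0]) ** 2
loss_loop /= n_points

print(np.isclose(loss_vectorized, loss_loop))
```

Both versions compute the same number; the vectorized one is shorter and much faster on large arrays.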

## Step 3: Compute the Gradients

PyTorch's autograd will take care of that later on, so you don't have to compute any derivatives yourself! So, no need to manually implement this step.

You *still* should understand what the gradients *mean*, though.
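
For reference, applying the chain rule to the MSE loss from Step 2, with $\hat{y}_i = b + w x_i$, gives the two partial derivatives:

$$
\Large
\frac{\partial loss}{\partial b} = \frac{1}{N}\sum_{i=1}^N{2 (b + w x_i - y_i)} = 2 \frac{1}{N}\sum_{i=1}^N{error_i}
\\
\Large
\frac{\partial loss}{\partial w} = \frac{1}{N}\sum_{i=1}^N{2 (b + w x_i - y_i) x_i} = 2 \frac{1}{N}\sum_{i=1}^N{x_i \cdot error_i}
$$

These are the quantities that `2 * error.mean()` and `2 * (x_train * error).mean()` compute.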

In [None]:
# Step 3 - Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)

The gradients above indicate that:
- for a tiny increase in the value of the parameter *b*, the loss will increase roughly 2.7 times as much
- for a tiny increase in the value of the parameter *w*, the loss will increase roughly 1.8 times as much
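
One way to check this interpretation is to nudge each parameter by a tiny amount and measure how the loss changes (a finite-difference approximation). The sketch below assumes the whole synthetic dataset is used (the train/validation split is skipped for brevity) and starts from arbitrary parameter values. The names `mse_loss`, `b_demo`, `w_demo` and `delta` are introduced here for illustration:

```python
import numpy as np

np.random.seed(42)

# Same data-generating recipe as above, without the train/validation split
x_demo = np.random.rand(100, 1)
y_demo = 0.5 - 3 * x_demo + 0.1 * np.random.randn(100, 1)

def mse_loss(b, w):
    """MSE loss of the linear model for given parameter values."""
    return ((b + w * x_demo - y_demo) ** 2).mean()

# Arbitrary starting point, chosen only for this check
b_demo, w_demo = 0.5, -0.1

# Analytical gradients, same formulas as in Step 3
error_demo = b_demo + w_demo * x_demo - y_demo
b_grad_demo = 2 * error_demo.mean()
w_grad_demo = 2 * (x_demo * error_demo).mean()

# Central finite differences: change in loss per unit change in a parameter
delta = 1e-6
b_grad_fd = (mse_loss(b_demo + delta, w_demo) - mse_loss(b_demo - delta, w_demo)) / (2 * delta)
w_grad_fd = (mse_loss(b_demo, w_demo + delta) - mse_loss(b_demo, w_demo - delta)) / (2 * delta)

print(b_grad_demo, b_grad_fd)
print(w_grad_demo, w_grad_fd)
```

The analytical and finite-difference values should agree to several decimal places, which is what "the loss changes by roughly the gradient times the tiny increase" means in practice.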

## Step 4: Update the Parameters

The fourth step is the **parameter update** - you should write code that uses the gradients and a learning rate (set to 0.1) to update the parameters:

### Answer

In [None]:
# Sets learning rate - this is "eta", the Greek letter that looks like an "n"
lr = 0.1

# Step 4 - Updates parameters using gradients and the 
# learning rate
b = b - lr * b_grad
w = w - lr * w_grad

print(b, w)
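
A quick sanity check on the update is to confirm that the loss goes down after a single step. The sketch below is self-contained under a few assumptions: it uses the whole synthetic dataset instead of the training split, and the names `mse_loss`, `loss_before` and `loss_after` are introduced for illustration:

```python
import numpy as np

np.random.seed(42)

# Same data-generating recipe as above, without the train/validation split
x_demo = np.random.rand(100, 1)
y_demo = 0.5 - 3 * x_demo + 0.1 * np.random.randn(100, 1)

# Random initialization, as in Step 0
np.random.seed(42)
b_demo = np.random.randn(1)
w_demo = np.random.randn(1)
lr_demo = 0.1

def mse_loss(b, w):
    """MSE loss of the linear model for given parameter values."""
    return ((b + w * x_demo - y_demo) ** 2).mean()

loss_before = mse_loss(b_demo, w_demo)

# Steps 1 to 3: forward pass, errors, gradients
error_demo = b_demo + w_demo * x_demo - y_demo
b_grad_demo = 2 * error_demo.mean()
w_grad_demo = 2 * (x_demo * error_demo).mean()

# Step 4: move each parameter against its gradient
b_new = b_demo - lr_demo * b_grad_demo
w_new = w_demo - lr_demo * w_grad_demo

loss_after = mse_loss(b_new, w_new)
print(loss_before, loss_after)
```

With this learning rate the second number should be smaller than the first. A learning rate that is too large can make the loss go up instead.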

## Step 5: Rinse and Repeat!

The last step consists of putting the other steps together and organizing them inside a loop. Write code to fill in the blanks in the loop below:

### Answer

In [None]:
# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)

b = np.random.randn(1)
w = np.random.randn(1)

lr = 0.1

# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1: Forward pass
    yhat = b + w * x_train
    
    # Step 2: Compute MSE loss
    error = (yhat - y_train)
    loss = (error ** 2).mean()
    
    # Step 3: Compute the gradients
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()

    # Step 4: Update the parameters
    b = b - lr * b_grad
    w = w - lr * w_grad
    
print(b, w)
print(loss)

Congratulations! Your model learned values for *b* and *w* that are **really close** to their true values. They will never be a perfect match, though, because of the *noise* we added to the synthetic data (and noise is always present in real-world data!).
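
As an optional cross-check, gradient descent on a linear regression should land very close to the ordinary least-squares solution. The sketch below is not part of the exercise: it swaps in `np.polyfit` as an independent least-squares fit and, as a simplifying assumption, uses the whole synthetic dataset instead of the training split. The `_demo`, `_gd` and `_ls` names are introduced for illustration:

```python
import numpy as np

np.random.seed(42)

# Same data-generating recipe as above, without the train/validation split
x_demo = np.random.rand(100, 1)
y_demo = 0.5 - 3 * x_demo + 0.1 * np.random.randn(100, 1)

# Gradient descent, same five steps as in the loop above
np.random.seed(42)
b_gd = np.random.randn(1)
w_gd = np.random.randn(1)
lr_demo = 0.1

for _ in range(1000):
    error_demo = b_gd + w_gd * x_demo - y_demo
    b_gd = b_gd - lr_demo * 2 * error_demo.mean()
    w_gd = w_gd - lr_demo * 2 * (x_demo * error_demo).mean()

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
w_ls, b_ls = np.polyfit(x_demo.ravel(), y_demo.ravel(), deg=1)

print(b_gd, w_gd)
print(b_ls, w_ls)
```

Both approaches should agree with each other closely while still differing slightly from the true values of 0.5 and -3, which is the effect of the noise described above.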