# Linear Regression

## From Scratch

This section walks through building a complete linear regression implementation from scratch, covering the model, loss function, minibatch SGD optimizer, and training loop, then applying it to synthetic data. Although deep learning frameworks can automate these steps, implementing them manually builds a deeper understanding that is essential for customizing models in the future.

In [None]:
%matplotlib inline
import torch
from d2l import torch as d2l

### Model

Here, the model’s weights are initialized by sampling from a normal distribution with mean 0
and standard deviation 0.01 (the default value of `sigma`), and the bias is initialized to 0.


In [None]:
class LinearRegressionScratch(d2l.Module):  #@save
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = torch.zeros(1, requires_grad=True)

Next, we define the model, specifying how the inputs and parameters are mathematically combined to produce the output.

In [None]:
@d2l.add_to_class(LinearRegressionScratch)  #@save
def forward(self, X):
    return torch.matmul(X, self.w) + self.b
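
To make the shapes concrete, the sketch below repeats the same computation outside the class. The tensors `X`, `w`, and `b` and their values are illustrative assumptions, not taken from the training run. An input of shape `(n, d)` times a weight of shape `(d, 1)` gives an output of shape `(n, 1)`, and the one-element bias is broadcast over all rows.

```python
import torch

# Illustrative values only (assumed for this sketch).
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])       # 3 examples, 2 features
w = torch.tensor([[2.0], [-3.4]])   # weight, shape (2, 1)
b = torch.tensor([4.2])             # bias, shape (1,), broadcast over rows

y_hat = torch.matmul(X, w) + b      # same expression as in forward
print(y_hat.shape)                  # torch.Size([3, 1])
```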

### Loss Function

Here we use the squared loss: half the squared difference between prediction and label for each example, averaged over the minibatch.

In [None]:
@d2l.add_to_class(LinearRegressionScratch)  #@save
def loss(self, y_hat, y):
    l = (y_hat - y) ** 2 / 2
    return l.mean()
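
As a small worked example of this formula, the sketch below evaluates the loss on three made-up predictions and labels (the values are assumptions for illustration). Note that `y_hat` and `y` should have the same shape; subtracting a tensor of shape `(n,)` from one of shape `(n, 1)` would broadcast to `(n, n)` and silently give the wrong loss.

```python
import torch

# Made-up predictions and labels, both of shape (3, 1).
y_hat = torch.tensor([[2.5], [0.0], [1.0]])
y = torch.tensor([[3.0], [0.0], [-1.0]])

per_example = (y_hat - y) ** 2 / 2    # 0.125, 0.0, 2.0
minibatch_loss = per_example.mean()   # (0.125 + 0.0 + 2.0) / 3
print(minibatch_loss.item())
```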

### Optimization Algorithm

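We define an `SGD` class that implements minibatch stochastic gradient descent, then create an instance of it in the model's `configure_optimizers` method.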

In [None]:
class SGD(d2l.HyperParameters):  #@save
    """Minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.save_hyperparameters()

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()

In [None]:
@d2l.add_to_class(LinearRegressionScratch)  #@save
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)
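
The update in `step` modifies tensors with `requires_grad=True` in place, which PyTorch only permits while gradient tracking is disabled; in this implementation the training loop provides that `torch.no_grad()` context. The following is a minimal sketch of a single update on a toy objective, where the parameter values, the learning rate, and the objective itself are assumptions chosen so the result is easy to check by hand.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

loss = (w ** 2).sum()    # toy objective; its gradient is 2 * w
loss.backward()          # fills w.grad with [2.0, -4.0]

with torch.no_grad():    # the in-place update must not be tracked
    w -= lr * w.grad     # w <- w - lr * grad
w.grad.zero_()           # clear the gradient before the next step

print(w)
```

After the step, `w` has moved from `[1.0, -2.0]` toward the minimum at zero by `lr` times the gradient.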

### Training

With the parameters, loss function, model, and optimizer defined, we can now implement the main training loop to fit the model to the data.

In [None]:
@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def fit_epoch(self):
    self.model.train()
    for batch in self.train_dataloader:
        loss = self.model.training_step(self.prepare_batch(batch))
        self.optim.zero_grad()
        with torch.no_grad():
            loss.backward()
            if self.gradient_clip_val > 0:  # To be discussed later
                self.clip_gradients(self.gradient_clip_val, self.model)
            self.optim.step()
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    self.model.eval()
    for batch in self.val_dataloader:
        with torch.no_grad():
            self.model.validation_step(self.prepare_batch(batch))
        self.val_batch_idx += 1
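
Stripped of the `d2l.Trainer` bookkeeping, each training iteration reduces to four steps: forward pass, loss, backward pass, and parameter update. The sketch below runs them without `d2l` on synthetic data. The data-generation code, noise level, batch size, seed, and epoch count are assumptions chosen for illustration and are not necessarily the settings used by `d2l.SyntheticRegressionData`.

```python
import torch

torch.manual_seed(0)

# Synthetic data y = Xw + b + noise (assumed setup for this sketch).
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
X = torch.randn(1000, 2)
y = X @ true_w.reshape(-1, 1) + true_b + 0.01 * torch.randn(1000, 1)

w = torch.normal(0, 0.01, (2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr, batch_size = 0.03, 32

for epoch in range(3):
    perm = torch.randperm(len(X))             # reshuffle every epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        y_hat = X[idx] @ w + b                 # forward pass
        loss = ((y_hat - y[idx]) ** 2 / 2).mean()
        loss.backward()                        # compute gradients
        with torch.no_grad():
            for param in (w, b):
                param -= lr * param.grad       # SGD update
                param.grad.zero_()             # reset for the next minibatch
    with torch.no_grad():
        epoch_loss = ((X @ w + b - y) ** 2 / 2).mean()
    print(f'epoch {epoch + 1}, loss {epoch_loss.item():.6f}')
```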

Note that both the number of epochs and the learning rate are hyperparameters. Setting hyperparameters is generally tricky, and we will usually want a three-way split of the data: one set for training, a second for hyperparameter selection, and a third reserved for the final evaluation.
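
One possible way to produce such a three-way split is `torch.utils.data.random_split`. The sketch below is an assumption about how the split might be done, not part of the `d2l` data classes; the random tensors, the 800/100/100 sizes, and the seed are placeholders.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder data standing in for a real dataset.
X = torch.randn(1000, 2)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# Training / validation (hyperparameter selection) / test (final evaluation).
train_set, val_set, test_set = random_split(
    dataset, [800, 100, 100],
    generator=torch.Generator().manual_seed(0))  # fixed seed for repeatability

print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```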

In [None]:
model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)

In [None]:
with torch.no_grad():
    print(f'error in estimating w: {data.w - model.w.reshape(data.w.shape)}')
    print(f'error in estimating b: {data.b - model.b}')

## Exercises

1. Experiment using different learning rates to find out how quickly the loss function value drops. Can you reduce the error by increasing the number of epochs of training?

2. If the number of examples is not divisible by the batch size, what happens to `data_iter` at the end of an epoch?

## Concise Implementation

In this section, we demonstrate a concise implementation of the linear regression model using high-level deep learning APIs. These abstractions streamline the code while preserving the same structure and logic as the from-scratch version.

In [None]:
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

### Model

Now we use a framework’s predefined layers, enabling us to focus on selecting and arranging the model’s layers without dealing with their low-level implementation details.

In [None]:
class LinearRegression(d2l.Module):  #@save
    """The linear regression model implemented with high-level APIs."""
    def __init__(self, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.LazyLinear(1)
        self.net.weight.data.normal_(0, 0.01)
        self.net.bias.data.fill_(0)

In [None]:
@d2l.add_to_class(LinearRegression)  #@save
def forward(self, X):
    return self.net(X)
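
Unlike the from-scratch model, this class is never told the number of input features: `nn.LazyLinear` only takes the output size and infers the input size from the first batch it sees. A minimal sketch with a made-up batch (the tensor `X` and its shape are assumptions for illustration):

```python
import torch
from torch import nn

net = nn.LazyLinear(1)      # output size 1; input size not yet known
X = torch.randn(4, 2)       # 4 examples with 2 features each
y_hat = net(X)              # the first call fixes the input size to 2

print(net.weight.shape)     # torch.Size([1, 2])
print(y_hat.shape)          # torch.Size([4, 1])
```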

### Loss Function

Again, we use a predefined loss function, `nn.MSELoss`, which by default averages the squared errors over the minibatch.

In [None]:
@d2l.add_to_class(LinearRegression)  #@save
def loss(self, y_hat, y):
    fn = nn.MSELoss()
    return fn(y_hat, y)
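
One difference from the from-scratch version is worth keeping in mind: `nn.MSELoss` does not include the factor of 1/2, so for the same inputs it returns twice the value of the earlier `loss` method. The sketch below checks this on made-up values (the tensors are assumptions for illustration).

```python
import torch
from torch import nn

y_hat = torch.tensor([[2.5], [0.0], [1.0]])
y = torch.tensor([[3.0], [0.0], [-1.0]])

scratch = ((y_hat - y) ** 2 / 2).mean()   # loss from the scratch version
builtin = nn.MSELoss()(y_hat, y)          # mean of squared errors, no 1/2

print(scratch.item(), builtin.item())
print(torch.allclose(builtin, 2 * scratch))  # True
```

Because the gradients scale by the same factor, a given learning rate takes steps twice as large here as in the from-scratch version.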

### Optimization Algorithm

Minibatch SGD is a common optimization method for training neural networks, and PyTorch’s optim module provides built-in support for it along with several variations of the algorithm.

In [None]:
@d2l.add_to_class(LinearRegression)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), self.lr)
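
With its default settings (no momentum or weight decay), `torch.optim.SGD` applies the same update as the hand-written `step` method. The sketch below compares the two on a toy objective; the parameter values, learning rate, and objective are assumptions for illustration.

```python
import torch

w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_builtin = w_manual.detach().clone().requires_grad_(True)
lr = 0.1

# Hand-written update: w <- w - lr * grad
(w_manual ** 2).sum().backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same step using the built-in optimizer
optimizer = torch.optim.SGD([w_builtin], lr=lr)
optimizer.zero_grad()
(w_builtin ** 2).sum().backward()
optimizer.step()

print(torch.allclose(w_manual, w_builtin))  # True
```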

### Training

Now that we have all the basic pieces in place, the training loop itself is the same as the one we implemented from scratch.

In [None]:
model = LinearRegression(lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)

In [None]:
@d2l.add_to_class(LinearRegression)  #@save
def get_w_b(self):
    return (self.net.weight.data, self.net.bias.data)
w, b = model.get_w_b()

print(f'error in estimating w: {data.w - w.reshape(data.w.shape)}')
print(f'error in estimating b: {data.b - b}')

## Exercises

Ex. 3:

Consider the following definitions and then answer the question:
- __Aggregate loss__ (sum): the minibatch loss is defined by the sum of the individual sample losses.
- __Average loss__ (mean): the minibatch loss is defined as the mean of the sample losses (that is, the sum divided by the minibatch size).

How would you need to change the learning rate if you replace the aggregate loss with an average loss?

Ex. 4:

How do you access the gradient of the weights of the model?

Ex. 5:

Replace the squared loss with Huber’s robust loss function and run the training again. You can uncomment the line below to read more about `nn.HuberLoss`, which is available in PyTorch.

In [None]:
# nn.HuberLoss?