# Linear Neural Networks for Regression

Before we worry about making our neural networks **deep**, it will be helpful to implement some **shallow** ones, for which the inputs connect directly to the outputs. This will prove important for a few reasons. 
- First, rather than getting distracted by complicated architectures, we can focus on the basics of neural network training, including **parametrizing** the output layer, handling data, specifying a loss function, and training the model.
- Second, this class of shallow networks happens to comprise the set of linear models, which subsumes many classical methods of statistical prediction, including **linear** and **softmax regression**.

Understanding these classical tools is pivotal because they are widely used in many contexts and we will often need to use them as baselines when justifying the use of fancier architectures.

## Linear Regression

Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for patients in the hospital), forecasting demand (for retail sales), among numerous others. Not every prediction problem is one of classical regression. Later on, we will introduce classification problems, where the goal is to predict membership among a set of categories.

As a running example, suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To develop a model for predicting house prices, we need to get our hands on data, including the sales price, area, and age for each home. In the terminology of machine learning, the dataset is called a **training dataset** or **training set**, and each row (containing the data corresponding to one sale) is called an **example** (or data point, instance, sample). The thing we are trying to predict (price) is called a **label** (or **target**). The variables (age and area) upon which the predictions are based are called **features** (or **covariates**).
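
To make the terminology concrete, here is a minimal sketch of how such a training set might be laid out in NumPy. The three sales below are made-up numbers chosen purely for illustration; they are not real data.

```python
import numpy as np

# Hypothetical training set: each row is one example (one sale).
# Columns are the features: area (square feet) and age (years).
X = np.array([
    [1400.0, 10.0],
    [2000.0,  3.0],
    [ 850.0, 40.0],
])

# Labels (targets): the sale price in dollars for each example.
y = np.array([240_000.0, 410_000.0, 120_000.0])

n, d = X.shape  # n examples, d features
print(f"{n} examples, {d} features, {y.shape[0]} labels")
```

Storing the features as an $n \times d$ matrix and the labels as a length-$n$ vector matches the matrix notation used in the formulas later in this section.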

```python
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l
```

## Basics

<img src="assets/lr1.png" />

## Model

<img src="assets/model1.png" />

<img src="assets/model2.png" />

## Loss Function

<img src="assets/loss fn.png" />

### Loss Function: $L(\mathbf{w}, b)$

$$
L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right)^2
$$

**Explanation**:
- **$L(\mathbf{w}, b)$**: This is the overall loss function, a function of the weights $ \mathbf{w} $ and bias $ b $.
- **$n$**: The number of training examples in the dataset.
- **$l^{(i)}(\mathbf{w}, b)$**: The loss for a single training example $ i $.
- **$\mathbf{w}^\top \mathbf{x}^{(i)} + b$**: The predicted value $\hat{y}^{(i)}$ for the $i$-th training example: the inner product of the weight vector $\mathbf{w}$ with the feature vector $\mathbf{x}^{(i)}$, plus the bias $b$.
- **$y^{(i)}$**: The actual target value for the $i$-th training example.
- **$\frac{1}{2} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right)^2$**: This is the squared error for the $i$-th training example, divided by 2 for mathematical convenience (it simplifies the derivative calculations).
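
As a rough illustration, the loss can be computed in a few lines of NumPy. The function name `squared_loss` and the toy inputs are assumptions made for this sketch, not part of the original notebook.

```python
import numpy as np

def squared_loss(w, b, X, y):
    """Average over examples of 0.5 * (w^T x + b - y)^2."""
    y_hat = X @ w + b                 # predictions for all n examples at once
    return 0.5 * np.mean((y_hat - y) ** 2)

# Toy inputs (hypothetical values): three examples with two features each.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0.0, 0.0, 0.0])
w = np.array([1.0, -1.0])
b = 0.5

# Every prediction is -0.5, so each per-example loss is 0.5 * 0.25 = 0.125.
print(squared_loss(w, b, X, y))
```

Computing `X @ w + b` for the whole dataset in one call is the vectorized counterpart of evaluating $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for each $i$ separately.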

### Optimization Goal: $ \mathbf{w}^*, b^* = \arg \min_{\mathbf{w}, b} L(\mathbf{w}, b) $

$$
\mathbf{w}^*, b^* = \arg \min_{\mathbf{w}, b} L(\mathbf{w}, b)
$$

**Explanation**:
- **$ \mathbf{w}^*, b^* $**: These are the optimal values for the weights and bias that minimize the loss function.
- **$ \arg \min_{\mathbf{w}, b} $**: This notation means "the values of $ \mathbf{w} $ and $ b $ that minimize" the loss function $ L(\mathbf{w}, b) $.
- **Minimization**: The goal of training is to adjust the weights and bias to minimize the loss function, which in turn means the model’s predictions are as close as possible to the actual target values.

<img src="assets/loss fn 2.png" />

### Derivative of the Loss Function
$$
\frac{\partial}{\partial \mathbf{w}} \left\|\mathbf{y} - \mathbf{Xw}\right\|^2 = 2\mathbf{X}^\top (\mathbf{Xw} - \mathbf{y}) = 0
$$

**Explanation**:
- **Purpose**: This expression is the derivative of the squared loss with respect to the weight vector $\mathbf{w}$. In linear regression, setting this derivative to zero locates the minimum of the loss function, which gives the optimal weights.
- **Derivative of Squared Loss**: The term $\left\|\mathbf{y} - \mathbf{Xw}\right\|^2$ is the sum of squared differences between the actual values $\mathbf{y}$ and the predicted values $\mathbf{Xw}$. The bias does not appear separately because it can be absorbed into $\mathbf{w}$ by appending a column of ones to $\mathbf{X}$, and the constant factor $\frac{1}{2n}$ from $L(\mathbf{w}, b)$ is dropped because it does not change the minimizer.
- **Result**: Setting the derivative to zero and rearranging gives $\mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{Xw}$, often called the normal equations. Solving this linear system for $\mathbf{w}$ yields the optimal weights.
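
One way to gain confidence in the derivative formula is to compare it against a finite-difference approximation. The sketch below does this on random data; the sizes, the seed, and the helper name `f` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 examples, 3 features (arbitrary sizes)
y = rng.normal(size=20)
w = rng.normal(size=3)

def f(w):
    """Squared norm ||y - Xw||^2."""
    r = y - X @ w
    return r @ r

# Analytic gradient from the formula: 2 X^T (Xw - y).
analytic = 2 * X.T @ (X @ w - y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w)
    e[j] = eps
    numeric[j] = (f(w + e) - f(w - e)) / (2 * eps)

# The largest discrepancy should be tiny, limited only by rounding.
print(np.max(np.abs(analytic - numeric)))
```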

**Pronunciation**:
- "The derivative of the loss function with respect to the weight vector $ \mathbf{w} $ equals two times X-transpose multiplied by the difference between Xw and y, set to zero."

### Analytic Solution for $ \mathbf{w}^* $
$$
\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$

**Explanation**:
- **Purpose**: This formula provides the optimal weights $\mathbf{w}^*$ that minimize the loss function in a linear regression problem. It is obtained by solving the zero-derivative equation $\mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{Xw}$ for $\mathbf{w}$.
- **Matrix Inversion**: $(\mathbf{X}^\top \mathbf{X})^{-1}$ is the inverse of the matrix $\mathbf{X}^\top \mathbf{X}$. This inverse exists only if $\mathbf{X}^\top \mathbf{X}$ is invertible, meaning the columns of $\mathbf{X}$ must be linearly independent.
- **Optimal Weights**: The result $\mathbf{w}^*$ gives the weights that minimize the squared error on the training data, so the predictions are as close as possible to the actual values in the least-squares sense.
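
The analytic solution can be tried out on synthetic data. In the sketch below, the "true" parameters `true_w` and `true_b`, the noise level, and the dataset size are all assumed values chosen for illustration. The bias is handled by appending a column of ones to the feature matrix, and the linear system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth used only to generate synthetic data.
n, d = 200, 2
true_w = np.array([2.0, -3.4])
true_b = 4.2
features = rng.normal(size=(n, d))
y = features @ true_w + true_b + 0.01 * rng.normal(size=n)

# Absorb the bias into the weights by appending a column of ones.
X = np.hstack([features, np.ones((n, 1))])

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", w_star)   # last entry is the estimated bias
print("lstsq:           ", w_lstsq)
```

Both approaches should recover values close to the assumed `true_w` and `true_b`, with small differences due to the added noise.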

**Pronunciation**:
- "The optimal weights $ \mathbf{w}^* $ are equal to the inverse of X-transpose times X, multiplied by X-transpose times y."

### Minibatch Stochastic Gradient Descent

Even when a model cannot be solved analytically, it can often still be trained effectively in practice, and many of these hard-to-optimize models **deliver superior results**, making the effort to train them worthwhile. The core technique used for optimization in deep learning is **gradient descent**, where **model parameters** are **updated iteratively** in the direction that reduces the loss function.

Gradient descent can be **slow** when applied to the entire dataset at once, because every single update requires a full pass over all examples. Alternatives such as **stochastic gradient descent (SGD)**, which updates based on a **single sample**, avoid this cost. However, SGD has its own drawbacks: processing one sample at a time makes poor use of vectorized hardware, and single-sample gradients are noisy estimates of the full gradient. To balance efficiency and effectiveness, **minibatch SGD** is employed, where updates are made using small batches of samples, typically between **32** and **256** observations.

Minibatch SGD involves randomly sampling a minibatch, computing the gradient of the loss function with respect to the model parameters, and updating the parameters in the **direction that reduces the loss**. This process is repeated iteratively, with the minibatch size and learning rate being key hyperparameters that can be tuned.
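
The sampling step might look like the following sketch. It assumes one common variant, in which the data is shuffled once per pass and then cut into consecutive minibatches; the helper name `minibatches` and the toy arrays are hypothetical.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering a random permutation of the data."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Toy data: 10 examples with 2 features each.
rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)

for X_batch, y_batch in minibatches(X, y, batch_size=4, rng=rng):
    print(X_batch.shape, y_batch.shape)
```

With 10 examples and a batch size of 4, the last minibatch holds only the 2 leftover examples. Whether to keep or drop such a short final batch is a design choice.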

Although the algorithm doesn't always find the exact minimizers of the loss function, it generally leads to parameters that perform well on training data. The true challenge lies in achieving good generalization, where the model makes accurate predictions on unseen data.

### Explanation of the Formula in Minibatch Stochastic Gradient Descent (SGD)

The formula below is the update rule of the minibatch stochastic gradient descent (SGD) algorithm. Let's break it down and explain how it works step by step.

#### The Formula:
$$
(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w}, b)} \ell^{(i)} (\mathbf{w}, b)
$$

#### How to Pronounce/Read the Formula in Natural Language:
- The pair of parameters $(\mathbf{w}, b)$ is updated to a new value.
- This update is performed by subtracting a quantity from the current values of $(\mathbf{w}, b)$.
- The quantity subtracted is the learning rate $\eta$ multiplied by the average gradient over the minibatch, that is, the sum of the gradients of the per-example loss $\ell^{(i)}$ with respect to $(\mathbf{w}, b)$ over all samples $i$ in the minibatch $\mathcal{B}_t$, divided by the minibatch size $|\mathcal{B}_t|$.

#### How the Formula is Working:
1. **Initialization**: 
   - Before the iteration starts, the model parameters $\mathbf{w}$ (weights) and $b$ (bias) are initialized. They can be initialized randomly or using other techniques.

2. **Minibatch Sampling**:
   - At each iteration $t$, a minibatch $\mathcal{B}_t$ of training examples is randomly sampled from the training data. The minibatch $\mathcal{B}_t$ contains a fixed number of examples, denoted by $|\mathcal{B}_t|$.

3. **Compute Gradients**:
   - For each example $i$ in the minibatch $\mathcal{B}_t$, compute the gradient of the loss function $\ell^{(i)}$ with respect to the model parameters $\mathbf{w}$ and $b$. The gradient $\nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w}, b)$ points in the direction in which the loss increases most steeply.

4. **Average the Gradients**:
   - The gradients for all examples in the minibatch are averaged. This averaging is done by summing up the individual gradients and dividing by the number of examples in the minibatch $|\mathcal{B}_t|$.

5. **Update the Parameters**:
   - The parameters $\mathbf{w}$ and $b$ are updated by moving them in the direction opposite to the averaged gradient. The amount by which we move is controlled by the learning rate $\eta$. The learning rate is a small positive number that determines the step size for each update.

6. **Repeat**:
   - This process is repeated for many iterations, sampling a new minibatch $\mathcal{B}_t$ each time, until the algorithm converges (i.e., the loss function stops decreasing significantly).
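
The six steps above can be sketched as a short NumPy training loop. Everything concrete here is an assumption made for illustration: the synthetic data, the "true" parameters, the learning rate, the minibatch size, and the fixed number of iterations used in place of a convergence test. The per-example gradients are computed one by one so that the code mirrors steps 3 and 4 literally; a practical implementation would vectorize this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data for a linear model with two features.
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Step 1: initialize the parameters.
w = rng.normal(scale=0.01, size=d)
b = 0.0
eta, batch_size, num_steps = 0.1, 32, 300  # assumed hyperparameter values

for t in range(num_steps):
    # Step 2: sample a minibatch of example indices without replacement.
    batch = rng.choice(n, size=batch_size, replace=False)

    # Step 3: gradient of 0.5 * (w^T x + b - y)^2 for each example.
    grads_w, grads_b = [], []
    for i in batch:
        err = X[i] @ w + b - y[i]
        grads_w.append(err * X[i])
        grads_b.append(err)

    # Step 4: average the per-example gradients.
    g_w = np.mean(grads_w, axis=0)
    g_b = np.mean(grads_b)

    # Step 5: move against the averaged gradient, scaled by eta.
    w = w - eta * g_w
    b = b - eta * g_b

# Step 6 is the loop itself; it stops after a fixed number of iterations.
print("estimated w:", w, "estimated b:", b)
```

On this toy problem the estimates should end up close to the assumed `true_w` and `true_b`, though not exactly equal because of the label noise and the randomness of the minibatches.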

#### What Each Part is Doing:
- **$\mathbf{w}$ and $b$**: These are the parameters (weights and biases) of your model that you're trying to optimize.
- **$\eta$ (eta)**: This is the learning rate, a small positive scalar that controls the size of the step we take in the direction of the negative gradient.
- **$|\mathcal{B}_t|$**: This denotes the size of the minibatch $\mathcal{B}_t$, i.e., the number of training examples in the minibatch.
- **$\nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w}, b)$**: This is the gradient of the loss function $\ell^{(i)}$ with respect to the parameters $(\mathbf{w}, b)$ for a single training example $i$. It tells us how much the loss will change if we slightly change the parameters $(\mathbf{w}, b)$.

#### Closed-Form Expansion:
For quadratic loss functions and affine transformations, the gradient in the update can be written out explicitly. The expansion shows how the weight vector $\mathbf{w}$ and bias $b$ are updated using the input features $\mathbf{x}^{(i)}$ and the actual output $y^{(i)}$.

This expansion gives the specific update rules for weights and bias:
$$
\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)
$$
$$
b \leftarrow b - \frac{\eta}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)
$$
- **$\mathbf{x}^{(i)}$**: The feature vector of the $i$-th training example.
- **$y^{(i)}$**: The actual target value for the $i$-th training example.
- **$\mathbf{w}^\top \mathbf{x}^{(i)} + b$**: The predicted output for the $i$-th training example based on the current model parameters.
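
These two update rules translate almost directly into vectorized NumPy. The sketch below performs a single update on a tiny hand-made minibatch; the function name `sgd_step` and the numbers are hypothetical. Note that the error term is computed once from the current parameters and then used for both updates, so $\mathbf{w}$ and $b$ are updated simultaneously.

```python
import numpy as np

def sgd_step(w, b, X_batch, y_batch, eta):
    """One minibatch SGD update for squared loss with an affine model."""
    err = X_batch @ w + b - y_batch                      # w^T x + b - y per example
    w_new = w - eta * (X_batch.T @ err) / len(y_batch)   # averages x * err over the batch
    b_new = b - eta * err.mean()                         # averages err over the batch
    return w_new, b_new

# A minibatch of two examples with two features each (made-up values).
X_batch = np.array([[1.0, 0.0], [0.0, 1.0]])
y_batch = np.array([1.0, 2.0])
w, b = np.zeros(2), 0.0

w, b = sgd_step(w, b, X_batch, y_batch, eta=0.1)
print(w, b)
```

Starting from zero parameters, the errors are $-1$ and $-2$, so the step moves $\mathbf{w}$ to approximately $(0.05, 0.1)$ and $b$ to approximately $0.15$, each nudged toward reducing the error on this minibatch.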

In summary, minibatch SGD optimizes the model parameters by iteratively updating them in the direction that reduces the average loss over a small, randomly sampled subset (minibatch) of the training data.



---