## Simple linear regression

Watch the 9-minute video below for a visual explanation of simple linear regression as a line-fitting problem.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/PaFPbb66DxQ?start=24" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> 

[Explaining Linear Regression, by StatQuest](https://www.youtube.com/embed/PaFPbb66DxQ?start=24)
```


Then study the following sections to learn more about simple linear regression, with examples from the textbook.

### Import libraries and load data

Get ready by importing the required APIs from their respective libraries.
<!-- Firstly, we will import the required libraries and load the `Advertising` dataset. -->

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

from statsmodels.formula.api import ols

%matplotlib inline

Load the [Advertising dataset](https://github.com/pykale/transparentML/raw/main/data/Advertising.csv) and display the column names, counts and data types.

<!-- Datasets available on https://www.statlearning.com/resources-first-edition -->

In [None]:
data_url = "https://github.com/pykale/transparentML/raw/main/data/Advertising.csv"

advertising_df = pd.read_csv(data_url, header=0, index_col=0)
advertising_df.info()

Display the first 5 rows for inspection.

In [None]:
advertising_df.head()

Simple linear regression assumes that there is an approximately linear relationship between a quantitative response $y$ and a single predictor $x$. Mathematically, the linear relationship can be expressed as

\begin{equation}
y \approx \beta_0 + \beta_1 x
\end{equation}

where $\beta_0$ and $\beta_1$ are two unknown constants that represent the *intercept* and *slope* of the linear model, also known as the *bias* and *weight*, respectively. Together, $\beta_0$ and $\beta_1$ are called the *model coefficients*, or *parameters*. We can describe the linear relationship as *regressing* $y$ onto $x$. For example, $x$ may represent `TV` advertising and $y$ may represent `sales`. Then we can regress sales onto TV by fitting the model

<!-- \begin{equation} -->
$$
\textrm{sales} \approx \beta_0 + \beta_1 \times \textrm{TV}
$$
<!-- \end{equation} -->



### Estimating the coefficients

The goal of simple linear regression is to estimate the unknown parameters $\beta_0$ and $\beta_1$ from the data. Let 

$$
(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)
$$

be the $N$ observations in the dataset, and let $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimates of $\beta_0$ and $\beta_1$, respectively. The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ can be obtained by minimising the *residual sum of squares* (RSS)


\begin{equation}
\text{RSS} = \sum_{i=1}^N (y_i - \hat{y}_i)^2 = \sum_{i=1}^N (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,
\end{equation}


where $y_i$ is the $i\text{th}$ observation of the response $y$, $\hat{y}_i$ is the predicted response $\hat{y}$ for the $i\text{th}$ observation, and $x_i$ is the $i\text{th}$ observation of the predictor $x$. The least squares estimates of $\beta_0$ and $\beta_1$ are given by


\begin{equation}
\hat{\beta}_1 = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2},
\end{equation}

\begin{equation}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\end{equation}

where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively. The resulting least squares line is given by


\begin{equation}
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
\end{equation}

The least squares line minimises the sum of squared residuals and is also known as the *regression line*.
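
The closed-form estimates above can be checked numerically. The sketch below is a minimal illustration on synthetic data (the data-generating values `7.0`, `0.05` and the noise level are assumptions chosen for this example, not values from the `Advertising` dataset). It computes $\hat{\beta}_1$ and $\hat{\beta}_0$ directly from the formulas and cross-checks them against NumPy's degree-1 polynomial fit.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form least squares estimates
x_bar, y_bar = x.mean(), y.mean()
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

# Cross-check: np.polyfit returns the highest-degree coefficient first
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print("closed form:", beta_0_hat, beta_1_hat)
print("np.polyfit: ", intercept_np, slope_np)
```

Both approaches should agree up to floating-point error, since `np.polyfit` with `deg=1` also solves a least squares problem.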

Run the code below to plot the least squares fit of `sales` onto `TV` for the `Advertising` dataset.

In [None]:
sns.regplot(
    x=advertising_df.TV,
    y=advertising_df.Sales,
    order=1,
    ci=None,
    scatter_kws={"color": "r", "s": 9},
)
plt.xlim(-10, 310)
plt.ylim(bottom=0)  # `ymin` is no longer accepted by recent Matplotlib versions
plt.show()

This figure shows the least squares fit of `sales` onto `TV` for the `Advertising` dataset. The objective is to minimise the sum of squared residuals, which is the sum of the squared vertical distances between the data points (red dots) and the least squares line (blue line).

**Example: fitting a linear regression model and visualising regression coefficients using `scikit-learn`**

The `LinearRegression` class in `scikit-learn` is used to fit a linear regression model. Its `.fit()` method takes two arguments: the predictor variable(s) first and the response variable second. It returns the fitted estimator itself, which stores the estimated coefficients. The `.intercept_` and `.coef_` attributes of the fitted model give the estimated intercept ($\hat{\beta}_0$) and slope ($\hat{\beta}_1$) of the regression line.
<!-- 
Note that the text in the book describes the coefficients based on unnormalised data, whereas the plot shows the model based on normalised data. The latter is visually more appealing for explaining the concept of a minimum RSS. I think that, in order not to confuse the reader, the values on the axis of the `beta_0` coefficients have been changed to correspond with the text. The axes on the plots below are unaltered. -->

Firstly, fit a linear regression model.

In [None]:
# Regression coefficients (Ordinary Least Squares)
regr = LinearRegression()

X = scale(advertising_df.TV, with_mean=True, with_std=False).reshape(-1, 1)
y = advertising_df.Sales

regr.fit(X, y)
print(regr.intercept_)
print(regr.coef_)
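
Note that the cell above centres the predictor (`scale` with `with_mean=True, with_std=False`) before fitting. Centring leaves the slope unchanged but moves the intercept to the sample mean of the response, because $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\bar{x} = 0$ after centring. The sketch below illustrates this on synthetic data (an assumption for this example, not the `Advertising` dataset), using `np.polyfit` rather than `scikit-learn` to keep it dependency-light.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(4)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Fit on the raw predictor and on the mean-centred predictor
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)
x_centred = x - x.mean()
slope_centred, intercept_centred = np.polyfit(x_centred, y, deg=1)

print("raw:    ", intercept_raw, slope_raw)
print("centred:", intercept_centred, slope_centred)
print("mean of y:", y.mean())
```

The two slopes should match, while the centred intercept should equal the mean of `y`.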

Create grid coordinates for plotting and compute the minimum of RSS.

In [None]:
# Grids of candidate coefficients centred on the least squares estimates
beta_0 = np.linspace(regr.intercept_ - 2, regr.intercept_ + 2, 50)
beta_1 = np.linspace(regr.coef_[0] - 0.02, regr.coef_[0] + 0.02, 50)
xx, yy = np.meshgrid(beta_0, beta_1, indexing="xy")
Z = np.zeros((beta_0.size, beta_1.size))

# Calculate Z-values (RSS) based on grid of coefficients
for (i, j), v in np.ndenumerate(Z):
    Z[i, j] = ((y - (xx[i, j] + X.ravel() * yy[i, j])) ** 2).sum() / 1000

# minimised RSS
min_RSS = r"$\beta_0$, $\beta_1$ for minimised RSS"
min_rss = (
    np.sum((regr.intercept_ + regr.coef_ * X - y.values.reshape(-1, 1)) ** 2) / 1000
)
min_rss

Plot the RSS with corresponding $\beta_0$ and $\beta_1$ in 2D and 3D.

In [None]:
fig = plt.figure(figsize=(15, 6))
fig.suptitle("RSS - Regression coefficients", fontsize=20)

ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122, projection="3d")

# Left plot
CS = ax1.contour(xx, yy, Z, cmap=plt.cm.Set1, levels=[2.15, 2.2, 2.3, 2.5, 3])
ax1.scatter(regr.intercept_, regr.coef_[0], c="r", label=min_RSS)
ax1.clabel(CS, inline=True, fontsize=10, fmt="%1.1f")

# Right plot
ax2.plot_surface(xx, yy, Z, rstride=3, cstride=3, alpha=0.3)
ax2.contour(
    xx,
    yy,
    Z,
    zdir="z",
    offset=Z.min(),
    cmap=plt.cm.Set1,
    alpha=0.4,
    levels=[2.15, 2.2, 2.3, 2.5, 3],
)
ax2.scatter3D(regr.intercept_, regr.coef_[0], min_rss, c="r", label=min_RSS)
ax2.set_zlabel("RSS")
ax2.set_zlim(Z.min(), Z.max())
ax2.set_ylim(0.02, 0.07)

# settings common to both plots
for ax in fig.axes:
    ax.set_xlabel(r"$\beta_0$", fontsize=17)
    ax.set_ylabel(r"$\beta_1$", fontsize=17)
    ax.set_yticks([0.03, 0.04, 0.05, 0.06])
    ax.legend()

### Assessing the accuracy of the model

The quality of a linear regression fit is commonly assessed by comparing the variability left unexplained by the model, measured by the RSS, with the total variability of the response $y$. The coefficient of determination, denoted by $R^2$, is defined as

\begin{equation}
R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}
\end{equation}

where $\text{TSS} = \sum_{i=1}^N (y_i - \bar{y})^2$ is the *total sum of squares*. Dividing the RSS by the number of training samples $N$ gives the *mean squared error* (MSE) of the model, which is the average squared difference between the observed response values and the response values predicted by the model. The MSE is also known as the *mean squared prediction error* (MSPE). The MSE is defined as

\begin{equation}
\text{MSE} = \frac{1}{N} \text{RSS} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.
\end{equation}

The coefficient of determination measures the proportion of the variability in the response that is explained by the linear model, and is therefore a measure of the goodness of fit. For a least squares fit with an intercept, evaluated on the training data, $R^2$ always lies between 0 and 1. An $R^2$ close to 0 indicates that the regression line explains almost none of the variability in the response, while an $R^2$ of 1 indicates that the regression line fits the data perfectly. The coefficient of determination is also known as the *coefficient of multiple determination*.
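
To make these quantities concrete, the following sketch computes the RSS, TSS, $R^2$ and MSE by hand on synthetic data (the data-generating values are assumptions for illustration, not the `Advertising` dataset). It also checks a known property of simple linear regression: $R^2$ equals the squared sample correlation between $x$ and $y$.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Least squares fit and predictions
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# Sums of squares and derived metrics
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss
mse = rss / len(y)

# In simple linear regression, R^2 equals the squared correlation of x and y
r = np.corrcoef(x, y)[0, 1]

print("R2:", r2, "squared correlation:", r**2)
print("MSE:", mse)
```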

Watch the video below to learn more about $R^2$.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/2AQKmw14mHM?start=16" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> 

[Explaining $R^2$, by StatQuest](https://www.youtube.com/embed/2AQKmw14mHM?start=16)
```

Run the following code to fit a simple linear regression model, and then evaluate the learnt model with $R^2$ using `scikit-learn` (Tables 3.1 and 3.2 of the textbook).

In [None]:
regr = LinearRegression()

X = advertising_df.TV.values.reshape(-1, 1)
y = advertising_df.Sales

regr.fit(X, y)
print(regr.intercept_)
print(regr.coef_)

In [None]:
sales_pred = regr.predict(X)
print("R2 score:", r2_score(y, sales_pred))
print("Mean squared error: ", mean_squared_error(y, sales_pred))

### Assessing the accuracy of the coefficient estimates (Advanced)

The linear relationship between $x$ and $y$ can be written in an equation as

\begin{equation}
y = \beta_0 + \beta_1 x + \epsilon,
\end{equation}

where $\epsilon$ is a random error term that represents the difference between the observed response $y$ and the true response $\beta_0 + \beta_1 x$. The error term $\epsilon$ is assumed to be normally distributed with mean zero and constant variance $\sigma^2$. The coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are only estimates of the true coefficients $\beta_0$ and $\beta_1$. We can quantify the accuracy of the estimates by computing their *standard errors*. The following formulas can be used to compute the standard errors associated with $\hat{\beta}_1$ and $\hat{\beta}_0$:

\begin{equation}
\text{SE}(\hat{\beta}_{1})^2 = \frac{\sigma^2}{\sum_{i=1}^N (x_i - \bar{x})^2}, \quad \text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^N (x_i - \bar{x})^2} \right],
\end{equation}

where $\sigma^2$ is the variance of the error term $\epsilon = y - (\beta_0 + \beta_1 x)$, $x_i$ is the $i\text{th}$ observation of the predictor $x$, and $\bar{x}$ is the sample mean of $x$. In general, $\sigma^2$ is unknown, but $\sigma$ can be estimated from the data using the predicted responses $\hat{y}_i$. This estimate is called the *residual standard error* (RSE)

\begin{equation}
\text{RSE} = \sqrt{\frac{1}{N-2}\text{RSS}} = \sqrt{\frac{1}{N-2} \sum_{i=1}^N (y_i - \hat{y}_i)^2}.
\end{equation}


The standard errors of $\hat{\beta}_1$ and $\hat{\beta}_0$ measure how much these estimates vary from sample to sample due to sampling error. The RSE is also known as the *standard error of the regression*.
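
The meaning of the standard error formulas can be illustrated by simulation. The sketch below assumes hypothetical "true" values for $\beta_0$, $\beta_1$ and $\sigma$ (chosen for this example only), repeatedly draws new responses from the same true line, refits the model each time, and compares the empirical spread of the estimates with the values given by the formulas. It also reports the average RSE, which should be close to the assumed $\sigma$.

```python
import numpy as np

# Assumed "true" parameters and a fixed set of predictor values
# (hypothetical values for illustration only)
rng = np.random.default_rng(2)
beta_0, beta_1, sigma = 7.0, 0.05, 3.0
x = rng.uniform(0, 300, size=100)
n = x.size
sxx = np.sum((x - x.mean()) ** 2)

# Standard errors from the formulas, using the known sigma
se_beta_1 = sigma / np.sqrt(sxx)
se_beta_0 = sigma * np.sqrt(1 / n + x.mean() ** 2 / sxx)

# Refit the model on many simulated responses drawn from the same true line
slopes, intercepts, rses = [], [], []
for _ in range(2000):
    y = beta_0 + beta_1 * x + rng.normal(0, sigma, size=n)
    b1, b0 = np.polyfit(x, y, deg=1)
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    slopes.append(b1)
    intercepts.append(b0)
    rses.append(np.sqrt(rss / (n - 2)))

print("SE(beta_1): formula", se_beta_1, "simulated", np.std(slopes))
print("SE(beta_0): formula", se_beta_0, "simulated", np.std(intercepts))
print("mean RSE:", np.mean(rses), "assumed sigma:", sigma)
```

In practice only one sample is available, so $\sigma$ is replaced by the RSE computed from that sample.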

Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that, with 95% probability, the range contains the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data. For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form

\begin{equation}
\hat{\beta}_1 \pm 2 \times \text{SE}(\hat{\beta}_1).
\end{equation}
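
As a minimal sketch of how this interval is obtained, the code below works on synthetic data (an assumption for illustration, not the `Advertising` dataset). It estimates the coefficients, plugs the RSE in place of the unknown $\sigma$ in the standard error formula, and forms the approximate 95% interval $\hat{\beta}_1 \pm 2 \times \text{SE}(\hat{\beta}_1)$.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(3)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)
n = x.size

# Least squares estimates
sxx = np.sum((x - x.mean()) ** 2)
beta_1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta_0_hat = y.mean() - beta_1_hat * x.mean()

# Residual standard error, used as the estimate of sigma
rss = np.sum((y - beta_0_hat - beta_1_hat * x) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard error of the slope and the approximate 95% confidence interval
se_beta_1 = rse / np.sqrt(sxx)
ci_lower = beta_1_hat - 2 * se_beta_1
ci_upper = beta_1_hat + 2 * se_beta_1

print("beta_1_hat:", beta_1_hat)
print("approx. 95% CI:", (ci_lower, ci_upper))
```

The factor 2 is an approximation. Libraries such as `statsmodels` use the exact quantile of the $t$-distribution with $N-2$ degrees of freedom, so their reported intervals may differ slightly.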

Run the following code to compute the statistics, including the confidence intervals, of the learnt model using `statsmodels` (page 67 and Tables 3.1 and 3.2 of the textbook).

In [None]:
est = ols("Sales ~ TV", advertising_df).fit()
est.summary().tables[1]

In [None]:
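# RSS (in thousands) computed from the fitted regression coefficients.
# Label-based access avoids deprecated positional indexing on a pandas Series.
intercept_hat = est.params["Intercept"]
slope_hat = est.params["TV"]
((advertising_df.Sales - (intercept_hat + slope_hat * advertising_df.TV)) ** 2).sum() / 1000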