## Multiple linear regression

We have examined the relationship between `Sales` and `TV` in the `Advertising` dataset using simple linear regression. The dataset contains two more predictor variables, `Radio` and `Newspaper`. How can we account for the effect of these two variables in the model?

Watch the 5-minute video below for a visual explanation of multiple linear regression.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/zITIFTsivN8?start=21" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> 

[Explaining Multiple Linear Regression, by StatQuest](https://www.youtube.com/embed/zITIFTsivN8?start=21)
```

Then study the following sections to learn more about multiple linear regression, with examples from the textbook.

### Import the required libraries and load the dataset.

In [None]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols

%matplotlib inline

**Load Datasets**

Load the `Advertising` dataset.

In [None]:
data_url = "https://github.com/pykale/transparentML/raw/main/data/Advertising.csv"
advertising_df = pd.read_csv(data_url, header=0, index_col=0)

To accommodate multiple predictor variables, one option is to run simple linear regression separately for each predictor variable. 

The following code fits two separate simple linear regression models using `statsmodels`, regressing `Sales` onto `Radio` and then `Sales` onto `Newspaper` (Table 3.3 in the textbook).

In [None]:
est = ols("Sales ~ Radio", advertising_df).fit()
est.summary().tables[1]

In [None]:
est = ols("Sales ~ Newspaper", advertising_df).fit()
est.summary().tables[1]

However, fitting a separate simple linear regression model for each predictor is not entirely satisfactory. It is unclear how to make a single prediction of `Sales` from all three advertising budgets, and each model ignores the other two predictors when estimating its coefficient. A better approach is to use multiple linear regression, an extension of simple linear regression that allows us to predict a quantitative response using more than one predictor variable. The equation for a multiple linear regression model with $D$ predictor variables is given by:

\begin{equation}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_D x_D + \epsilon,
\end{equation}

where $y$ is the response, $x_1, x_2, ..., x_D$ are the predictors, $D$ is the total number of features (predictor variables), and $\epsilon$ is the error term. The $\beta$'s are called the regression coefficients: $\beta_0$ is the bias (intercept), and $\beta_1, \beta_2, ..., \beta_D$ are the weights (slopes). The equation can be written in matrix form as:

\begin{equation}
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
\end{equation}

where $\mathbf{Y}$ is an $N \times 1$ vector of responses, $\mathbf{X}$ is an $N \times (D+1)$ matrix of predictors, $\boldsymbol{\beta}$ is a $(D+1) \times 1$ vector of regression coefficients, and $\boldsymbol{\epsilon}$ is an $N \times 1$ vector of errors. The first column of $\mathbf{X}$ is a column of 1s to account for the intercept. Correspondingly, $\boldsymbol{\beta}$ contains the intercept in the first position, followed by the slopes for the $D$ predictors, and $\boldsymbol{\epsilon}$ contains the error term for each observation. Using the `Advertising` dataset as an example, we can fit a multiple linear regression model:

\begin{equation}
\text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio} + \beta_3 \times \text{Newspaper} + \epsilon.
\end{equation}

### Estimating the regression coefficients

Similar to simple linear regression, we can estimate the regression coefficients using least squares. The least squares estimates are the values $\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_D$ that minimise the residual sum of squares (RSS):


\begin{align}
\begin{aligned}
\text{RSS} = & \sum_{i=1}^N (y_i - \hat{y}_i)^2 \\
= & \sum_{i=1}^N (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - ... - \hat{\beta}_D x_{iD})^2,
\end{aligned}
\end{align}

where $y_i$ is the $i$th response, $\hat{y}_i$ is the $i$th predicted response, $\hat{\beta}_0$ is the estimated intercept, $\hat{\beta}_1$ is the estimated slope for $x_{i1}$, $\hat{\beta}_2$ is the estimated slope for $x_{i2}$, and so on. The $\hat{\beta}$'s that minimise the RSS are the least squares estimates of the regression coefficients. Provided that $\mathbf{X}^\top \mathbf{X}$ is invertible, they are given in matrix form by:

\begin{equation}
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.
\end{equation}
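
As a rough illustration of the matrix formula, the sketch below builds a design matrix with a leading column of 1s and solves the normal equations with NumPy. The data are simulated, and the names `features`, `true_beta`, and `beta_lstsq` are assumptions made for this example only; they are not part of the `Advertising` analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): N observations, D predictors.
N, D = 200, 3
features = rng.normal(size=(N, D))
true_beta = np.array([3.0, 0.05, 0.2, 0.0])  # intercept followed by D slopes
noise = rng.normal(scale=0.5, size=N)

# Design matrix X: a leading column of 1s for the intercept, then the predictors.
X = np.column_stack([np.ones(N), features])
y = X @ true_beta + noise

# Solve the normal equations (X^T X) beta = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimises the same RSS and is usually more robust numerically.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq:           ", beta_lstsq)
```

Both routes should agree up to floating-point error. In practice, libraries such as `statsmodels` and `scikit-learn` handle the intercept column and the numerical details for us.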

The following code fits a multiple linear regression model of `Sales` onto `TV`, `Radio`, and `Newspaper` using `statsmodels`, and displays the learnt coefficients (Table 3.4 in the textbook).

In [None]:
est = ols("Sales ~ TV + Radio + Newspaper", advertising_df).fit()
est.summary()

```{admonition} How to interpret the results? 
:class: tip, dropdown
We interpret these results as follows: 
- For a given amount of `TV` and `newspaper` advertising, spending an additional \$1,000 on `radio` advertising is associated with approximately 189 units of additional `sales`. 
- Comparing these coefficients to the estimates in simple linear regression, we notice that the multiple regression coefficient estimates for `TV` and `radio` are pretty similar to the simple linear regression coefficient estimates. However, while the `newspaper` regression coefficient estimate in simple linear regression was significantly non-zero, the coefficient estimate for `newspaper` in the multiple regression model is close to zero, and the corresponding p-value is no longer significant, with a value around 0.86. 
- This illustrates that the simple and multiple regression coefficients can be quite different. This difference stems from the fact that in the simple regression case, the slope term represents the average increase in product `sales` associated with a \$1,000 increase in `newspaper` advertising, ignoring other predictors such as `TV` and `radio`. By contrast, in the multiple regression setting, the coefficient for `newspaper` represents the average increase in product `sales` associated with increasing `newspaper` spending by \$1,000 while holding `TV` and `radio` fixed.
```

Why does the relationship between `Sales` and `Newspaper` differ between simple linear regression and multiple linear regression? The following code displays the correlation matrix of the `Advertising` dataset for further analysis.

In [None]:
advertising_df.corr()

```{admonition} How to interpret the results?
:class: tip, dropdown
The correlation matrix shows that `radio` and `newspaper` are positively correlated (about 0.35), which indicates that markets with high `newspaper` advertising tend to also have high `radio` advertising. Now suppose that the multiple regression is correct and `newspaper` advertising is not associated with `sales`, but `radio` advertising is associated with `sales`. Then in markets where we spend more on `radio` our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on `newspaper` advertising in those same markets. Hence, in a simple linear regression which only examines `sales` versus `newspaper`, we will observe that higher values of `newspaper` tend to be associated with higher values of `sales`, even though `newspaper` advertising is not directly associated with `sales`. So `newspaper` advertising is a surrogate for `radio` advertising; `newspaper` gets “credit” for the association between `radio` and `sales`.
```
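
The surrogate effect can be reproduced on simulated data. The sketch below is only an illustration under stated assumptions: the budgets are drawn at random, `newspaper` is constructed to be correlated with `radio`, and `sales` depends on `radio` alone. The helper `fit_ols` is a hypothetical name introduced here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000

# Simulated budgets (assumption): newspaper is positively correlated with radio.
radio = rng.normal(size=N)
newspaper = 0.5 * radio + rng.normal(size=N)

# Sales depend on radio only; the true newspaper coefficient is exactly zero.
sales = 2.0 * radio + rng.normal(size=N)


def fit_ols(y, *predictors):
    """Least squares fit with an intercept; returns [intercept, slope_1, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Slope for newspaper when it is the only predictor.
simple_slope = fit_ols(sales, newspaper)[1]
# Slope for newspaper when radio is also in the model.
multiple_slope = fit_ols(sales, radio, newspaper)[2]

print(f"newspaper slope, simple regression:   {simple_slope:.3f}")
print(f"newspaper slope, multiple regression: {multiple_slope:.3f}")
```

In this setup the simple regression slope for `newspaper` should come out clearly positive, while the multiple regression slope should be close to zero, mirroring the pattern seen in the `Advertising` data.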

### Important questions in multiple linear regression

#### Is at least one of the predictors $x_1, x_2, ..., x_D$ useful in predicting the response?

We can answer this question by testing the null hypothesis that all the slope coefficients are zero, i.e.

$$
H_0: \beta_1 = \beta_2 = ... = \beta_D = 0.
$$

versus the alternative hypothesis:

$$
H_a: \text{at least one of the regression coefficients is non-zero}. 
$$ 

This hypothesis test is performed by computing the $F$-statistic, which is defined as:

\begin{equation}
F = \frac{(\text{TSS} - \text{RSS})/D}{\text{RSS}/(N-D-1)},
\end{equation}

where $\text{TSS} = \sum(y_i - \bar{y})^2$ and $\text{RSS} = \sum(y_i - \hat{y}_i)^2$. When there is no relationship between the response and the predictors, we expect the $F$-statistic to take on a value close to 1. On the other hand, if $H_a$ is true, we expect $F$ to be greater than 1.

The $F$-statistic for the multiple linear regression model obtained by regressing `sales` onto `radio`, `TV`, and `newspaper` is 570.3 (displayed above). Since this is far larger than 1, it provides compelling evidence against the null hypothesis $H_0$. In other words, the large $F$-statistic suggests that at least one of the advertising media must be related to `sales`.
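
To make the formula concrete, the following sketch computes the $F$-statistic directly from TSS and RSS on simulated data, once when the response is unrelated to the predictors and once when it depends on one of them. The function name `f_statistic` and the simulated arrays are assumptions for this example; with the `statsmodels` fit above, the same quantity is reported in the summary.

```python
import numpy as np


def f_statistic(X, y):
    """F-statistic for H0: all slopes are zero.

    X has shape (N, D) and does not include the column of 1s.
    """
    N, D = X.shape
    design = np.column_stack([np.ones(N), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta_hat
    tss = np.sum((y - y.mean()) ** 2)
    rss = np.sum((y - y_hat) ** 2)
    return ((tss - rss) / D) / (rss / (N - D - 1))


rng = np.random.default_rng(2)
N, D = 200, 3
X = rng.normal(size=(N, D))

# Case 1: the response is unrelated to the predictors (H0 holds).
y_null = rng.normal(size=N)
# Case 2: the response depends on the first predictor (H_a holds).
y_alt = 1.5 * X[:, 0] + rng.normal(size=N)

f_null = f_statistic(X, y_null)
f_alt = f_statistic(X, y_alt)
print(f"F when H0 holds:  {f_null:.2f}")
print(f"F when H_a holds: {f_alt:.2f}")
```

The first value should be of order 1, while the second should be far larger, which is the behaviour described above.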

Watch the following video to learn more about the $F$-statistic.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/nk2CQITm_eo?start=969&end=1525" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> 

[Calculating a p-value for $R^2$, by StatQuest](https://www.youtube.com/embed/nk2CQITm_eo?start=969&end=1525)
```

<!-- then $E\{\text{TSS} - \text{RSS}/p\} > \sigma^2$, where $\sigma$ is the standard deviation of the error term, so we expect $F$ to be greater than 1. -->

<!-- The numerator of the F-statistic is the ratio of the total sum of squares to the residual sum of squares, and the denominator is the ratio of the residual sum of squares to the degrees of freedom. The F-statistic has an $F$ distribution with $p$ and $n-p-1$ degrees of freedom. The null hypothesis is that all the regression coefficients are zero, and the alternative hypothesis is that at least one of the regression coefficients is non-zero.  -->


#### Do all the predictors help to explain $y$, or is only a subset of the predictors useful?

More often, the response is associated with only a subset of the predictors. The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as *variable selection* (or *feature selection*). This will be discussed extensively in {doc}`Feature Selection and Shrinkage <../06-ftr-select-shrink/overview>`.

There are three common approaches to variable selection:

- Forward selection. We begin with a model containing no predictors (only the intercept). At each step, we add the single predictor whose inclusion improves the fit the most, for example the one giving the lowest RSS, and repeat until a stopping rule is satisfied. This is a greedy algorithm, and it is not guaranteed to find the best model containing a subset of the predictors.
- Backward selection. We begin with a model containing all predictors. At each step, we remove the least useful predictor, for example the one with the largest p-value, and repeat until a stopping rule is satisfied. This is also a greedy algorithm, and it is not guaranteed to find the best model containing a subset of the predictors.
- Mixed selection. We begin with some initial model, typically one containing no predictors, and add predictors one at a time as in forward selection. After each addition, we remove any predictor whose p-value has risen above a chosen threshold. We alternate these forward and backward steps until no addition or removal improves the model. This is also a greedy algorithm, and it is not guaranteed to find the best model containing a subset of the predictors.
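
Forward selection, the first of these approaches, can be sketched in a few lines. The version below is a minimal illustration, assuming RSS as the selection criterion and a fixed number of predictors (`max_predictors`, a hypothetical parameter) as the stopping rule; other criteria and stopping rules are equally valid. The data are simulated.

```python
import numpy as np


def rss_of_fit(X, y, columns):
    """RSS of the least squares fit of y on an intercept plus the chosen columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta_hat) ** 2))


def forward_selection(X, y, max_predictors):
    """Greedily add the predictor that lowers the RSS the most at each step."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_predictors:
        best = min(remaining, key=lambda j: rss_of_fit(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
# Only columns 0 and 2 influence the response in this simulated example.
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=300)

selected = forward_selection(X, y, max_predictors=2)
print("selected columns:", selected)
```

Because each step only looks one predictor ahead, the procedure is cheap but greedy: it never revisits an earlier choice, which is why it can miss the best subset.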

#### How well does the model fit the data?

Similar to simple linear regression, we can answer this question by computing the $R^2$ and residual standard error (RSE) statistics. The $R^2$ for multiple linear regression is defined as:

\begin{equation}
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}.
\end{equation}

The $R^2$ statistic provides an indication of the proportion of the variance in the response that is predictable from the predictors. In this case, the $R^2$ statistic is 0.897, which indicates that 89.7% of the variance in `sales` is predictable from `TV`, `radio`, and `newspaper`.

The RSE for multiple linear regression is defined as:

\begin{equation}
\text{RSE} = \sqrt{\frac{\text{RSS}}{N-D-1}} = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{N-D-1}}.
\end{equation}
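
Both statistics follow directly from the residuals. The sketch below evaluates the two formulas on a tiny made-up example; the function name `r_squared_and_rse` and the numbers are assumptions for illustration, and the fitted values are hypothetical rather than the output of an actual fit.

```python
import numpy as np


def r_squared_and_rse(y, y_hat, num_predictors):
    """Return (R^2, RSE) given responses, fitted values, and the number of predictors D."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    rse = np.sqrt(rss / (len(y) - num_predictors - 1))
    return r2, rse


# Made-up responses and hypothetical fitted values from a two-predictor model.
y = [10.0, 12.0, 9.0, 15.0, 14.0]
y_hat = [10.5, 11.5, 9.5, 14.5, 14.0]

r2, rse = r_squared_and_rse(y, y_hat, num_predictors=2)
print(f"R^2 = {r2:.3f}, RSE = {rse:.3f}")
```

Here RSS is 1 and TSS is 26, so $R^2 = 1 - 1/26$ and $\text{RSE} = \sqrt{1/2}$. For a fitted `statsmodels` model such as `est`, the attributes `est.rsquared` and `np.sqrt(est.mse_resid)` should report the same two quantities.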

Run the following code to fit a multiple linear regression model using `scikit-learn` and visualise the fitted plane.

First, fit a linear regression to `sales` using `radio` and `TV` as predictors.

In [None]:
regr = LinearRegression()

X = advertising_df[["Radio", "TV"]].values
y = advertising_df.Sales

regr.fit(X, y)
print(regr.coef_)
print(regr.intercept_)

Show the min/max values of Radio & TV to set up the range of the plot.

In [None]:
# Summary statistics (including min/max) of Radio and TV.
# Use these values to set up the grid for plotting.
advertising_df[["Radio", "TV"]].describe()

Create a coordinate grid

In [None]:
radio = np.arange(0, 50)
tv = np.arange(0, 300)

# Grids of Radio and TV budgets at which to evaluate the fitted plane.
radio_grid, tv_grid = np.meshgrid(radio, tv, indexing="xy")

# Predicted sales at every grid point; regr.coef_ follows the column order [Radio, TV].
Z = regr.intercept_ + regr.coef_[0] * radio_grid + regr.coef_[1] * tv_grid

Create 3D plot of `sales` vs `TV` and `radio`

In [None]:
# Create plot
fig = plt.figure(figsize=(10, 6))
fig.suptitle("Regression: Sales ~ Radio + TV Advertising", fontsize=20)

ax = axes3d.Axes3D(fig, auto_add_to_figure=False)
fig.add_axes(ax)

# Fitted regression plane (surface) and observed data (red points).
ax.plot_surface(radio_grid, tv_grid, Z, rstride=10, cstride=5, alpha=0.4)
ax.scatter3D(advertising_df.Radio, advertising_df.TV, advertising_df.Sales, c="r")

ax.set_xlabel("Radio")
ax.set_xlim(0, 50)
ax.set_ylabel("TV")
ax.set_ylim(ymin=0)
ax.set_zlabel("Sales")
plt.show()

#### Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Once we have fit a multiple linear regression model, we can use the estimated coefficients to predict the response for a given set of predictor values:

\begin{equation}
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... + \hat{\beta}_D x_D.
\end{equation}

However, we must be careful when making predictions. The coefficient estimates are only estimates of the true coefficients, and predictor values far from those used to fit the model require extrapolation, so the prediction may not be very accurate. We use a confidence interval to quantify the uncertainty in the average response at the given predictor values, and a wider prediction interval to quantify the uncertainty in an individual response.
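
The difference between uncertainty in the average response and uncertainty in an individual response can be sketched with NumPy. This is a rough illustration on simulated data, not the textbook's procedure: it uses the standard least squares standard-error formulas and replaces the exact $t$ quantile with the normal value 1.96, an approximation that is reasonable only for large $N$. All variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated training data (assumption): an intercept plus two predictors.
N, D = 200, 2
features = rng.uniform(0, 10, size=(N, D))
X = np.column_stack([np.ones(N), features])
y = X @ np.array([3.0, 0.5, 1.0]) + rng.normal(scale=1.0, size=N)

# Least squares fit and estimated error variance (the RSE squared).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (N - D - 1)

# New point at which to predict (leading 1 for the intercept).
x_new = np.array([1.0, 5.0, 5.0])
y_pred = x_new @ beta_hat

# h = x_new^T (X^T X)^{-1} x_new measures how far x_new is from the training data.
h = x_new @ np.linalg.solve(X.T @ X, x_new)
se_mean = np.sqrt(sigma2_hat * h)        # uncertainty in the average response
se_obs = np.sqrt(sigma2_hat * (1 + h))   # also includes the irreducible error

z = 1.96  # large-sample approximation to the 95% t quantile
ci = (y_pred - z * se_mean, y_pred + z * se_mean)
pi = (y_pred - z * se_obs, y_pred + z * se_obs)

print(f"prediction: {y_pred:.2f}")
print(f"approx. 95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"approx. 95% prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always the wider of the two because it adds the variance of the error term $\epsilon$ to the uncertainty in the fitted plane. With a fitted `statsmodels` model, `est.get_prediction(new_df).summary_frame(alpha=0.05)` should report both intervals using exact $t$ quantiles, where `new_df` is a hypothetical data frame holding the new predictor values.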

