### Linear Regression

Recall the *Advertising* data from Chapter 2. Figure 2.1 displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation? Here are a few important questions that we might seek to address:

  1. *Is there a relationship between advertising budget and sales?* Our first goal should be to determine whether the data provide evidence of an association between advertising expenditure and sales. If the evidence is weak, then one might argue that no money should be spent on advertising!

  2. *How strong is the relationship between advertising budget and sales?* Assuming that there is a relationship between advertising and sales, we would like to know the strength of this relationship. Does knowledge of the advertising budget provide a lot of information about product sales?

  3. *Which media are associated with sales?*
 Are all three media—TV, radio, and newspaper—associated with sales, or are just one or two of the media associated? To answer this question, we must find a way to separate out the individual contribution of each medium to sales when we have spent money on all three media.

 4. *How large is the association between each medium and sales?*
 For every dollar spent on advertising in a particular medium, by
 what amount will sales increase? How accurately can we predict this
 amount of increase?

 5. *How accurately can we predict future sales?*
 For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?

 6. *Is the relationship linear?*
 If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.

 7. *Is there synergy among the advertising media?*
 Perhaps spending \$50,000 on television advertising and \$50,000 on radio advertising is associated with higher sales than allocating \$100,000 to either television or radio individually. In marketing, this is known as a *synergy* effect, while in statistics it is called an *interaction* effect.



#### Simple Linear Regression

*Simple linear regression* lives up to its name: it is a very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this linear relationship as

\begin{equation}
Y \approx \beta_0 + \beta_1 X.
\tag{3.1}
\end{equation}

You might read “$\approx$” as “is approximately modeled as”. We will sometimes describe (3.1) by saying that we are regressing $Y$ on $X$ (or $Y$ onto $X$).

For example, $X$ may represent TV advertising and $Y$ may represent sales. Then we can regress sales onto TV by fitting the model

$$
\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.
$$


In Equation 3.1, $\beta_0$ and $\beta_1$ are two unknown constants that represent the *intercept* and *slope* terms in the linear model. Together, $\beta_0$ and $\beta_1$ are known as the model *coefficients* or *parameters*. Once we have used our training data to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing

\begin{equation}
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,
\tag{3.2}
\end{equation}

where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$. Here we use a hat symbol, $\hat{~}$, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.



#### Estimating the Coefficients

In practice, $\beta_0$ and $\beta_1$ are unknown. So before we can use (3.1) to make predictions, we must use data to estimate the coefficients. Let

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$. In the *Advertising* example, this data set consists of the TV advertising budget and product sales in $n = 200$ different markets. (Recall that these data are displayed in Figure 2.1.) Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model (3.1) fits the available data well, that is, so that $y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, \ldots, n$. In other words, we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the $n = 200$ data points. There are a number of ways of measuring *closeness*. However, by far the most common approach involves minimizing the *least squares* criterion, and we take that approach in this chapter.

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th *residual*: the difference between the $i$th observed response value and the $i$th response value that is predicted by our linear model. We define the *residual sum of squares* (RSS) as

$$RSS = e_1^2 + e_2^2 + \cdots + e_n^2,$$

or equivalently as

\begin{equation}
\text{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2.
\tag{3.3}
\end{equation}

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. Using some calculus, one can show that the minimizers are

$$
\begin{aligned}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x},
\end{aligned}
\tag{3.4}
$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ are the sample means. In other words, (3.4) defines the *least squares coefficient estimates* for simple linear regression.
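As a minimal sketch of the closed-form estimates in (3.4), the following Python snippet computes the slope and intercept for a small made-up data set. The function name `least_squares` and the five toy points are illustrative assumptions; they are not the *Advertising* data.

```python
def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) using the closed-form estimates (3.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical toy data (not the Advertising data)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

beta0_hat, beta1_hat = least_squares(x, y)
print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")
```

For these points $\bar{x} = 2$ and $\bar{y} = 5$, the numerator of the slope is 21 and the denominator is 10, so the slope estimate is $21/10 = 2.1$ and the intercept estimate is $5 - 2.1 \times 2 = 0.8$.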

Figure 3.1 displays the simple linear regression fit to the *Advertising* data, where $\hat{\beta}_0 = 7.03$ and $\hat{\beta}_1 = 0.047$. In other words, according to this approximation, an additional \$1,000 spent on TV advertising is associated with selling approximately 47 additional units of the product. In Figure 3.2, we have computed RSS for a number of values of $\hat{\beta}_0$ and $\hat{\beta}_1$, using the advertising data with sales as the response and TV as the predictor. In each plot, the red dot represents the pair of least squares estimates $(\hat{\beta}_0, \hat{\beta}_1)$ given by (3.4). These values clearly minimize the RSS.
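The idea of evaluating RSS over a range of candidate coefficients can be sketched numerically. The snippet below is an illustration on a hypothetical toy data set (not the *Advertising* data): it evaluates the RSS from (3.3) at the least squares estimates and at a few shifted coefficient pairs, so that we can check that no shifted pair gives a smaller RSS. The grid of shifts is an arbitrary choice made for this example.

```python
def rss(beta0, beta1, x, y):
    """Residual sum of squares (3.3) for a candidate intercept and slope."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not the Advertising data)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

# Least squares estimates from the closed-form expressions (3.4)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar
rss_min = rss(beta0_hat, beta1_hat, x, y)

# RSS at coefficient pairs shifted away from the least squares estimates
offsets = (-0.5, 0.0, 0.5)
rss_grid = {
    (d0, d1): rss(beta0_hat + d0, beta1_hat + d1, x, y)
    for d0 in offsets
    for d1 in offsets
}

for (d0, d1), value in sorted(rss_grid.items()):
    print(f"shift = ({d0:+.1f}, {d1:+.1f})  RSS = {value:.3f}")
```

The unshifted pair attains the smallest value in the grid, which is the numerical counterpart of the red dot sitting at the bottom of the RSS surface.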

#### Assessing the Accuracy of the Coefficient Estimates

Recall from (2.1) that we assume that the *true* relationship between $X$ and $Y$ takes the form $Y = f(X) + \epsilon$ for some unknown function $f$, where $\epsilon$ is a mean-zero random error term. If $f$ is to be approximated by a linear function, then we can write this relationship as

\begin{equation}
Y = \beta_0 + \beta_1 X + \epsilon.
\tag{3.5}
\end{equation}

Here $\beta_0$ is the intercept term, that is, the expected value of $Y$ when $X = 0$, and $\beta_1$ is the slope, which quantifies the association between $X$ and $Y$. As we saw in the previous section, we use data to produce estimates of $\beta_0$ and $\beta_1$ via the *least squares* approach. The error term is a catch-all for what this simple model misses: the true relationship is probably not linear, there may be other variables that cause variation in $Y$, and there may be measurement error. We typically assume that the error term is independent of $X$.

The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of *bias*.

If we use the sample mean $\hat{\mu} = \bar{y}$ to estimate the population mean $\mu$ of a random variable $Y$, this estimate is **unbiased**, in the sense that on average we expect $\hat{\mu}$ to equal $\mu$.

What exactly does this mean? It means that on the basis of one particular set of observations $y_1, \dots, y_n$, $\hat{\mu}$ might overestimate $\mu$, and on the basis of another set of observations, $\hat{\mu}$ might underestimate $\mu$. 

But if we could average a huge number of estimates of $\mu$ obtained from a huge number of sets of observations, then this average would exactly equal $\mu$. Hence, an **unbiased estimator** does not systematically over- or under-estimate the true parameter. 

The property of unbiasedness holds for the least squares coefficient estimates given by Equation (3.4) as well: if we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates won’t be exactly equal to $\beta_0$ and $\beta_1$. 

But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on! In fact, we can see from the right-hand panel of Figure 3.3 that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.
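The averaging argument above can be checked with a small simulation. The sketch below assumes a population line $Y = 2 + 3X + \epsilon$ with normally distributed errors; the coefficient values, the noise level, the range of $X$, and the number of simulated data sets are all choices made for this illustration rather than values taken from the text.

```python
import random


def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) using the closed-form estimates (3.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    return y_bar - beta1_hat * x_bar, beta1_hat


random.seed(0)  # fixed seed so the run is reproducible

# Assumed population regression line and noise level (illustrative only)
TRUE_BETA0, TRUE_BETA1, SIGMA = 2.0, 3.0, 2.0
N_DATASETS, N_OBS = 1000, 100

intercepts, slopes = [], []
for _ in range(N_DATASETS):
    x = [random.uniform(-2.0, 2.0) for _ in range(N_OBS)]
    y = [TRUE_BETA0 + TRUE_BETA1 * xi + random.gauss(0.0, SIGMA) for xi in x]
    b0, b1 = least_squares(x, y)
    intercepts.append(b0)
    slopes.append(b1)

mean_b0 = sum(intercepts) / N_DATASETS
mean_b1 = sum(slopes) / N_DATASETS
print(f"average intercept estimate: {mean_b0:.3f} (true value {TRUE_BETA0})")
print(f"average slope estimate:     {mean_b1:.3f} (true value {TRUE_BETA1})")
print(f"slope estimates range from {min(slopes):.3f} to {max(slopes):.3f}")
```

Individual estimates scatter noticeably around the true values, while their averages land very close to them, which is what unbiasedness predicts.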

We continue the analogy with the estimation of the population mean $\mu$ of a random variable $Y$. A natural question is as follows: how accurate is the sample mean $\bar{y}$ as an estimate of $\mu$? We have established that the average of $\bar{y}$'s over many data sets will be very close to $\mu$, but that a single estimate $\bar{y}$ may be a substantial underestimate or overestimate of $\mu$. How far off will that single estimate of $\mu$ be? In general, we answer this question by computing the *standard error* of $\bar{y}$, written as $\mathrm{SE}(\bar{y})$. We have the well-known formula

\begin{equation}
\mathrm{Var}(\bar{y}) = \mathrm{SE}(\bar{y})^2 = \frac{\sigma^2}{n},
\tag{3.7}
\end{equation}

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$. Roughly speaking, the standard error tells us the average amount that this estimate $\bar{y}$ differs from the actual value of $\mu$. Equation 3.7 also tells us how this deviation shrinks with $n$—the more observations we have, the smaller the standard error of $\bar{y}$. In a similar vein, we can wonder how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values $\beta_0$ and $\beta_1$. To compute the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the following formulas:

$$
\mathrm{SE}(\hat{\beta}_0)^2 
= \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right]
\quad
\mathrm{SE}(\hat{\beta}_1)^2 
= \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}
\tag{3.8}
$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$. For these formulas to be strictly valid, we need to assume that the errors $\epsilon_i$ for each observation have common variance $\sigma^2$ and are uncorrelated. This is clearly not true in Figure 3.1, but the formula still turns out to be a good approximation. Notice in the formula that $\mathrm{SE}(\hat{\beta}_1)$ is smaller when the $x_i$ are more spread out; intuitively we have more *leverage* to estimate a slope when this is the case. We also see that $\mathrm{SE}(\hat{\beta}_0)$ would be the same as $\mathrm{SE}(\bar{y})$ if $\bar{x}$ were zero (in which case $\hat{\beta}_0$ would be equal to $\bar{y}$). In general, $\sigma^2$ is not known, but can be estimated from the data. This estimate of $\sigma$ is known as the *residual standard error*, and is given by the formula

$$
\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n e_i^2} = \sqrt{\frac{\text{RSS}}{n-2}}.
$$

Strictly speaking, when $\sigma^2$ is estimated from the data we should write $\widehat{\mathrm{SE}}(\hat{\beta}_1)$ to indicate that an estimate has been made, but for simplicity of notation we will drop this extra “hat”.
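To make the standard error formulas concrete, here is a short sketch that plugs the residual standard error into (3.8). It uses a hypothetical five-point data set rather than the *Advertising* data, and the variable names are our own.

```python
import math

# Hypothetical toy data (not the Advertising data)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]
n = len(x)

# Least squares fit via (3.4)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual standard error: the estimate of sigma
rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(rss / (n - 2))

# Standard errors from (3.8), with sigma replaced by its estimate
se_beta0 = sigma_hat * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)
se_beta1 = sigma_hat / math.sqrt(s_xx)

print(f"sigma_hat = {sigma_hat:.4f}")
print(f"SE(beta0_hat) = {se_beta0:.4f}, SE(beta1_hat) = {se_beta1:.4f}")
```

Here $\text{RSS} = 1.9$, so $\hat{\sigma}^2 = 1.9/3 \approx 0.633$, which gives $\mathrm{SE}(\hat{\beta}_1) \approx 0.252$ and $\mathrm{SE}(\hat{\beta}_0) \approx 0.616$.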

Standard errors can be used to compute *confidence intervals*. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. The standard error comes into play in the formula for the upper and lower endpoints of the interval. A 95% confidence interval has the following property: if we take repeated samples and construct the confidence interval for each sample, 95% of the intervals constructed in this way will contain the true unknown value. For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form

$$
\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1).
\tag{3.9}
$$

That is, there is approximately a 95% chance that the interval

\begin{equation}
\left[ \hat{\beta}_1 - 2 \cdot \mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot \mathrm{SE}(\hat{\beta}_1)\right]
\tag{3.10}
\end{equation}

will contain the true value of $\beta_1$. Similarly, a confidence interval for $\beta_0$ approximately takes the form

\begin{equation}
\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0).
\tag{3.11}
\end{equation}

In the case of the advertising data, the 95% confidence interval for $\beta_0$ is $[6.130, 7.935]$ and the 95% confidence interval for $\beta_1$ is $[0.042, 0.053]$. Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units. Furthermore, for each \$1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
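As a quick arithmetic check, we can apply the "estimate plus or minus two standard errors" rule to the coefficient estimates and standard errors reported for the TV regression (7.0325 with standard error 0.4578 for the intercept, and 0.0475 with standard error 0.0027 for the slope). The helper name `approx_ci` is ours.

```python
def approx_ci(estimate, std_error, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- multiplier * SE."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Values reported for the regression of sales on TV
ci_beta0 = approx_ci(7.0325, 0.4578)
ci_beta1 = approx_ci(0.0475, 0.0027)

print(f"beta0: [{ci_beta0[0]:.4f}, {ci_beta0[1]:.4f}]")
print(f"beta1: [{ci_beta1[0]:.4f}, {ci_beta1[1]:.4f}]")
```

This gives roughly $[6.117, 7.948]$ for $\beta_0$ and $[0.0421, 0.0529]$ for $\beta_1$, slightly wider than the intervals quoted above. A plausible explanation is that the quoted intervals use an exact quantile of the *t*-distribution, which is a little below 2 for $n = 200$, whereas the factor of 2 is a convenient approximation.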

Standard errors can also be used to perform *hypothesis tests* on the coefficients. The most common hypothesis test involves testing the *null hypothesis* of

\begin{equation}
H_0: \text{There is no relationship between $X$ and $Y$}
\tag{3.12}
\end{equation}

versus the *alternative hypothesis*

\begin{equation}
H_a: \text{There is some relationship between $X$ and $Y$}
\tag{3.13}
\end{equation}

Mathematically, this corresponds to testing

$$
H_0 : \beta_1 = 0
$$

versus

$$
H_a : \beta_1 \neq 0,
$$

since if $\beta_1 = 0$ then the model (3.5) reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$. To test the null hypothesis, we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. How far is far enough? This of course depends on the accuracy of $\hat{\beta}_1$—that is, it depends on $\mathrm{SE}(\hat{\beta}_1)$. If $\mathrm{SE}(\hat{\beta}_1)$ is small, then even relatively small values of $\hat{\beta}_1$ may provide strong evidence that $\beta_1 \neq 0$, and hence that there is a relationship between $X$ and $Y$. In contrast, if $\mathrm{SE}(\hat{\beta}_1)$ is large, then $\hat{\beta}_1$ must be large in absolute value in order for us to reject the null hypothesis. In practice, we compute a *t*-statistic, given by

\begin{equation}
t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}
\tag{3.14}
\end{equation}



|               | Coefficient | Std. error | t-statistic | p-value |
|---------------|-------------|------------|-------------|---------|
| Intercept     | 7.0325      | 0.4578     | 15.36       | < 0.0001|
| TV            | 0.0475      | 0.0027     | 17.67       | < 0.0001|

Table 3.1: For the *Advertising* data, coefficients of the least squares model for the regression of number of units sold on TV advertising budget. An increase of \$1,000 in the TV advertising budget is associated with an increase in sales by approximately 47.5 units. (Recall that in this data set, sales are measured in thousands of units, and the TV variable is in thousands of dollars.)


The *t*-statistic (3.14) measures the number of standard deviations that $\hat{\beta}_1$ is away from 0. If there really is no relationship between $X$ and $Y$, then we expect that (3.14) will have a *t*-distribution with $n - 2$ degrees of freedom. The *t*-distribution has a bell shape and for values of $n$ greater than approximately 30 it is quite similar to the standard normal distribution. Consequently, it is a simple matter to compute the probability of observing any number equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the *p-value*.

Roughly speaking, we interpret the p-value as follows: a small p-value indicates 
that it is unlikely to observe such a substantial association between the predictor 
and the response due to chance, in the absence of any real association between the 
predictor and the response. Hence, if we see a small p-value, then we can infer that 
there is an association between the predictor and the response. We reject the 
*null hypothesis*—that is, we declare a relationship to exist between predictor 
and response—if the p-value is small enough. Typical p-value cutoffs for rejecting 
the null hypothesis are 5% or 1%, although this topic will be explored later in much 
greater detail. When $n = 30$, these correspond to t-statistics (3.14) 
of around 2 and 2.75, respectively. 

Table 3.1 above provides details of the least squares model for the regression of number of units sold on TV advertising budget for the *Advertising* data. Notice that the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ are very large relative to their standard errors, so the *t*-statistics are also large; the probabilities of seeing such values if $H_0$ is true are virtually zero. Hence we can conclude that $\beta_0 \neq 0$ and $\beta_1 \neq 0$.
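The *t*-statistics in the coefficient table can be reproduced, up to rounding, by dividing each estimate by its standard error as in (3.14). The sketch below also computes two-sided p-values using a standard normal approximation, which is an assumption made here to stay within the Python standard library; the text itself refers to a *t*-distribution with $n - 2$ degrees of freedom, to which the normal is close when $n = 200$.

```python
import math


def t_statistic(estimate, std_error):
    """t-statistic (3.14): number of standard errors the estimate is from 0."""
    return estimate / std_error


def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Values reported for the regression of sales on TV
t_intercept = t_statistic(7.0325, 0.4578)
t_tv = t_statistic(0.0475, 0.0027)
p_intercept = two_sided_p_normal(t_intercept)
p_tv = two_sided_p_normal(t_tv)

# For comparison: the p-value attached to a t-statistic of exactly 2
p_at_2 = two_sided_p_normal(2.0)

print(f"t (intercept) = {t_intercept:.2f}, p = {p_intercept:.3g}")
print(f"t (TV)        = {t_tv:.2f}, p = {p_tv:.3g}")
print(f"p-value at |t| = 2: {p_at_2:.4f}")
```

The intercept ratio matches the tabulated 15.36. The TV ratio comes out near 17.59 rather than the tabulated 17.67, presumably because the table's coefficient and standard error are rounded. Both p-values are far below 0.0001, and a *t*-statistic of 2 corresponds to a p-value just under 5% in this approximation.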


#### Assessing the Accuracy of the Model

If we reject the null hypothesis (3.12) in favor of the alternative hypothesis (3.13), 
it is natural to want to quantify the *extent* to which the regression model fits 
the data. The quality of a linear regression fit is typically assessed using two 
related quantities: the *residual standard error* (RSE) and the $R^2$ statistic.

| Quantity                  | Value |
|---------------------------|-------|
| Residual standard error   | 3.26  |
| $R^2$                     | 0.612 |
| F-statistic               | 312.1 |

Table 3.2: For the *Advertising* data, more information about the least squares model for the regression of number of units sold on TV advertising budget.

Table 3.2 displays the RSE, the $R^2$ statistic, and the *F*-statistic (to be described in Section 3.2.2) for the linear regression of number of units sold on TV advertising budget.


##### Residual Standard Error

Recall from the model (3.5) that associated with each observation is an error term $\epsilon$. 
Due to the presence of these error terms, even if we knew the true regression line 
(i.e. even if $\beta_0$ and $\beta_1$ were known), we would not be able to perfectly 
predict $Y$ from $X$. The RSE is an estimate of the standard deviation of $\epsilon$. 

Roughly speaking, it is the average amount that the response will deviate from the 
true regression line. It is computed using the formula

$$
\text{RSE} = \sqrt{\frac{1}{n-2}\,\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2}.
\tag{3.15}
$$

Note that RSS was defined in Section 3.1.1, and is given by the formula

$$
\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2.
\tag{3.16}
$$


In the case of the advertising data, we see from the linear regression output in Table 3.2 that the RSE is 3.26. Because sales are measured in thousands of units, this means that actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients $\beta_0$ and $\beta_1$ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Of course, whether or not 3,260 units is an acceptable prediction error depends on the application. In this case, the average sales are around 14,000 units, so 3,260 units correspond to a percentage error of $3{,}260/14{,}000 \approx 23\%$.


The RSE is considered a measure of the *lack of fit* of the model (3.5) to the data. 
If the predictions obtained using the model are very close to the true outcomes 
— that is, if $\hat{y}_i$ is very close to $y_i$ for $i = 1, \dots, n$ — then 
(3.15) will be small, and we can conclude that the model fits the data very well. 
On the other hand, if $\hat{y}_i$ is far from $y_i$ for one or more observations, 
then the RSE may be quite large, indicating that the model doesn’t fit the data well.


##### $R^2$ Statistic

The RSE provides an absolute measure of lack of fit of the model (3.5) to the data. But since it is measured in the units of $Y$, it is not always clear what constitutes a good RSE. The $R^2$ statistic provides an alternative measure of fit.

It takes the form of a proportion—the proportion of variance explained—and so it always takes on a value between 0 and 1, and is independent of the scale of $Y$.

To calculate $R^2$, we use the formula

$$
R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}}
= 1 - \frac{\text{RSS}}{\text{TSS}}
\tag{3.17}
$$

where 
$$
\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2
$$

is the **total sum of squares**, and RSS is defined in Equation (3.16).  

TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. 

Hence, $\text{TSS} - \text{RSS}$ measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.  

An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, the error variance $\sigma^2$ is high, or both.
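The two measures of fit can be computed side by side. The following sketch evaluates the RSE from (3.15) and $R^2$ from (3.17) on a hypothetical five-point data set (not the *Advertising* data); the variable names are our own.

```python
import math

# Hypothetical toy data (not the Advertising data)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]
n = len(x)

# Least squares fit via (3.4)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

# Unexplained variability (RSS) and total variability (TSS)
rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
tss = sum((yi - y_bar) ** 2 for yi in y)

rse = math.sqrt(rss / (n - 2))   # absolute measure, in the units of y
r_squared = 1.0 - rss / tss      # proportion of variability explained

print(f"RSS = {rss:.2f}, TSS = {tss:.2f}")
print(f"RSE = {rse:.4f}, R^2 = {r_squared:.4f}")
```

With $\text{RSS} = 1.9$ and $\text{TSS} = 46$, the RSE is about 0.80 in the units of $y$, while $R^2 \approx 0.959$ is unit-free. Rescaling $y$ would change the RSE but leave $R^2$ unchanged.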

The $R^2$ statistic is a measure of the linear relationship between $X$ and $Y$. 

Recall that correlation, defined as

$$
\text{Cor}(X, Y) \;=\; 
\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
\tag{3.18}
$$

is also a measure of the linear relationship between $X$ and $Y$.  

This suggests that we might be able to use $r = \text{Cor}(X, Y)$ instead of $R^2$ in order to assess the fit of the linear model.


Simple regression of sales on radio

|             | Coefficient | Std. error | t-statistic | p-value  |
|-------------|-------------|------------|-------------|----------|
| Intercept   | 9.312       | 0.563      | 16.54       | < 0.0001 |
| radio       | 0.203       | 0.020      | 9.92        | < 0.0001 |


Simple regression of sales on newspaper

|             | Coefficient | Std. error | t-statistic | p-value  |
|-------------|-------------|------------|-------------|----------|
| Intercept   | 12.351      | 0.621      | 19.88       | < 0.0001 |
| newspaper   | 0.055       | 0.017      | 3.30        | 0.00115  |

Table 3.3: More simple linear regression models for the *Advertising* data. Coefficients of the simple linear regression model for number of units sold on radio advertising budget (top) and newspaper advertising budget (bottom). A \$1,000 increase in spending on radio advertising is associated with an average increase in sales by around 203 units, while the same increase in spending on newspaper advertising is associated with an average increase in sales by around 55 units. (Note that the sales variable is in thousands of units, and the radio and newspaper variables are in thousands of dollars.)


In fact, it can be shown that in the simple linear regression setting, we have

$$
R^2 = r^2
$$

In other words, the squared correlation and the $R^2$ statistic are identical.
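This identity is easy to check numerically for a regression with a single predictor. The sketch below computes the sample correlation from (3.18) and $R^2$ from (3.17) on a hypothetical toy data set and compares the two; the data and names are illustrative assumptions.

```python
import math

# Hypothetical toy data (not the Advertising data)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # this is the TSS

# Sample correlation (3.18)
r = s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

# R^2 from the simple linear regression fit, via (3.4) and (3.17)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
r_squared = 1.0 - rss / s_yy

print(f"r = {r:.6f}, r^2 = {r ** 2:.6f}, R^2 = {r_squared:.6f}")
```

Both quantities equal $441/460 \approx 0.9587$ for these points.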



#### Multiple Linear Regression

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. 

For example, in the **Advertising** data, we have examined the relationship between sales and TV advertising. We also have data for the amount of money spent advertising on the radio and in newspapers, and we may want to know whether either of these two media is associated with sales. How can we extend our analysis of the advertising data in order to accommodate these two additional predictors?

One option is to run three separate simple linear regressions, each of which uses a different advertising medium as a predictor. For instance, we can fit a simple linear regression to predict sales on the basis of the amount spent on radio advertisements. Results are shown in **Table 3.3 (top table)**. We find that a \$1,000 increase in spending on radio advertising is associated with an increase in sales of around **203 units**. **Table 3.3 (bottom table)** contains the least squares coefficients for a simple linear regression of sales onto newspaper advertising budget. A \$1,000 increase in newspaper advertising budget is associated with an increase in sales of approximately **55 units**.


However, the approach of fitting a separate simple linear regression model for each predictor is not entirely satisfactory.  

- First of all, it is unclear how to make a single prediction of sales given the three advertising media budgets, since each of the budgets is associated with a separate regression equation.  
- Second, each of the three regression equations ignores the other two media in forming estimates for the regression coefficients.  

We will see shortly that if the media budgets are correlated with each other in the 200 markets in our data set, then this can lead to very misleading estimates of the association between each media budget and sales.

Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model (3.5) so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model.

In general, suppose that we have $p$ distinct predictors. Then the multiple linear regression model takes the form

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon,
\tag{3.19}
$$

where $X_j$ represents the $j$-th predictor and $\beta_j$ quantifies the association between that variable and the response. We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.

In the advertising example, (3.19) becomes:

$$
\text{sales} = \beta_0 + \beta_1 \, \text{TV} + \beta_2 \, \text{radio} + \beta_3 \, \text{newspaper} + \epsilon.
\tag{3.20}
$$



##### Estimating the Regression Coefficients

As was the case in the simple linear regression setting, the regression coefficients $\beta_0, \beta_1, \dots, \beta_p$ in (3.19) are unknown, and must be estimated. 

Given estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$, we can make predictions using the formula

$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\tag{3.21}
$$

The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression. We choose $\beta_0, \beta_1, \dots, \beta_p$ to minimize the sum of squared residuals:

$$
\text{RSS} = \sum_{i=1}^n \Big( y_i - \hat{y}_i \Big)^2 
= \sum_{i=1}^n \Big( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \Big)^2
\tag{3.22}
$$

The values $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ that minimize (3.22) are the **multiple least squares regression coefficient estimates**. Unlike the simple linear regression coefficient estimates given in (3.4), the multiple regression coefficient estimates have somewhat complicated forms that are most easily represented using matrix algebra.
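One standard matrix representation, which is not spelled out in the text above, is the set of normal equations $(\mathbf{X}^T \mathbf{X}) \hat{\beta} = \mathbf{X}^T \mathbf{y}$, where $\mathbf{X}$ is the matrix of predictors with a leading column of ones. The sketch below solves these equations in plain Python using Gaussian elimination. The helper names and the noise-free toy data are assumptions made for illustration; in practice a numerical linear algebra library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Assumes a is square and non-singular; inputs are not modified.
    """
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - known) / m[r][r]
    return z


def multiple_least_squares(rows, y):
    """Return [beta0_hat, beta1_hat, ..., betap_hat] minimizing (3.22)."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors (x1, x2) and a noise-free response
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
        [4.0, 3.0], [5.0, 6.0], [6.0, 4.0]]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in rows]

coefs = multiple_least_squares(rows, y)
print([round(c, 6) for c in coefs])
```

Because the toy response is an exact linear function of the two predictors, the least squares estimates recover the generating coefficients $1$, $2$, and $-3$ up to floating-point error.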

Table 3.4 below displays the multiple regression coefficient estimates when TV, radio, and newspaper advertising budgets are used to predict product sales using the *Advertising* data. We interpret these results as follows: for a given amount of TV and newspaper advertising, spending an additional \$1,000 on radio advertising is associated with approximately 189 units of additional sales. Comparing these coefficient estimates to those displayed in Tables 3.1 and 3.3, we notice that the multiple regression coefficient estimates for TV and radio are pretty similar to the simple linear regression coefficient estimates. However, while the newspaper regression coefficient estimate in Table 3.3 was significantly non-zero, the coefficient estimate for newspaper in the multiple regression model is close to zero, and the corresponding *p*-value is no longer significant, with a value around **0.86**.

This illustrates that the simple and multiple regression coefficients can be quite different. This difference stems from the fact that in the simple regression case, the slope term represents the average increase in product sales associated with a \$1,000 increase in newspaper advertising, **ignoring other predictors such as TV and radio**.

By contrast, in the multiple regression setting, the coefficient for newspaper represents the average increase in product sales associated with increasing newspaper spending by \$1,000 **while holding TV and radio fixed**.


Multiple Regression Coefficient Estimates

|             | Coefficient | Std. error | t-statistic | p-value  |
|-------------|-------------|------------|-------------|----------|
| Intercept   | 2.939       | 0.3119     | 9.42        | < 0.0001 |
| TV          | 0.046       | 0.0014     | 32.81       | < 0.0001 |
| radio       | 0.189       | 0.0086     | 21.89       | < 0.0001 |
| newspaper   | 0.001       | 0.0059     | 0.18        | 0.8599   |

*Table 3.4: For the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on TV, radio, and newspaper advertising budgets.*


Does it make sense for the multiple regression to suggest no relationship
between sales and newspaper while the simple linear regression implies the
opposite?  

In fact it does. Consider the **correlation matrix** for the three predictor
variables and response variable, displayed in Table 3.5. Notice that the
correlation between radio and newspaper is **0.35**. This indicates
that markets with high newspaper advertising tend to also have high radio
advertising.


Table 3.5: Correlation Matrix (TV, radio, newspaper, sales)

|             | TV     | radio  | newspaper | sales  |
|-------------|--------|--------|-----------|--------|
| TV          | 1.0000 | 0.0548 | 0.0567    | 0.7822 |
| radio       | 0.0548 | 1.0000 | 0.3541    | 0.5762 |
| newspaper   | 0.0567 | 0.3541 | 1.0000    | 0.2283 |
| sales       | 0.7822 | 0.5762 | 0.2283    | 1.0000 |


Now suppose that the multiple regression is correct and
**newspaper advertising is not associated with sales, but radio advertising
is associated with sales.** Then in markets where we spend more on radio
our sales will tend to be higher, and as our correlation matrix shows, we
also tend to spend more on newspaper advertising in those same markets.  

Hence, in a simple linear regression which only examines sales versus
newspaper, we will observe that higher values of newspaper tend to be
associated with higher values of sales, even though newspaper advertising is
not directly associated with sales.  

So newspaper advertising is a **surrogate** for radio advertising; newspaper
gets “credit” for the association between radio and sales.
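The surrogate effect can be reproduced in a small simulation. This is a sketch on synthetic data, not the Advertising data: the variables `radio`, `newspaper`, and `sales` below are stand-ins generated so that newspaper is correlated with radio but has no effect of its own on sales.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic stand-ins (assumptions): "radio" drives "sales"; "newspaper" is
# correlated with radio but has no effect of its own on sales.
radio = rng.normal(size=n)
newspaper = 0.5 * radio + np.sqrt(0.75) * rng.normal(size=n)
sales = 2.0 * radio + rng.normal(size=n)

def ols(predictors, y):
    """Least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

simple = ols([newspaper], sales)           # sales on newspaper only
multiple = ols([radio, newspaper], sales)  # sales on radio and newspaper

print("corr(radio, newspaper):", np.corrcoef(radio, newspaper)[0, 1])
print("simple newspaper slope:", simple[1])
print("multiple newspaper coefficient:", multiple[2])
```

The simple regression slope for newspaper should come out clearly positive, while the multiple regression coefficient should be close to zero once radio is in the model.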


This slightly counterintuitive result is very common in many real-life
situations. Consider an absurd example to illustrate the point:  

- Running a regression of **shark attacks versus ice cream sales** for data collected at
a given beach community over a period of time would show a positive
relationship, similar to that seen between sales and newspaper.  
- Of course no one has (yet) suggested that ice creams should be banned at beaches
to reduce shark attacks.  

In reality:  
- Higher temperatures cause more people to visit the beach, which in turn 
results in more ice cream sales and more shark attacks.  
- A **multiple regression** of shark attacks onto ice cream sales **and temperature**
reveals that, as intuition implies, ice cream sales is no longer a significant
predictor after adjusting for temperature.



##### Some Important Questions
When we perform multiple linear regression, we usually are interested in
answering a few important questions.
 1. Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting
 the response?
 2. Do all the predictors help to explain $Y$, or is only a subset of the
 predictors useful?
 3. How well does the model fit the data?
 4. Given a set of predictor values, what response value should we predict,
 and how accurate is our prediction?

##### One: Is There a Relationship Between the Response and Predictors?

Recall that in the simple linear regression setting, in order to determine  
whether there is a relationship between the response and the predictor, we  
can simply check whether $\beta_1 = 0$.  

In the multiple regression setting with $p$ predictors, we need to ask whether  
all of the regression coefficients are zero, i.e.:  

$$ 
H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0
$$  

versus the alternative  

$$ 
H_a : \text{at least one } \beta_j \neq 0
$$  

This hypothesis test is performed by computing the **F-statistic**:  

$$
F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \tag{3.23}
$$  

where:  

- $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ (Total Sum of Squares)  
- $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (Residual Sum of Squares)  

If the linear model assumptions are correct, then:  

$$
E\left\{\frac{RSS}{n - p - 1}\right\} = \sigma^2
$$  

and, provided $H_0$ is true,  

$$
E\left\{\frac{TSS - RSS}{p}\right\} = \sigma^2
$$  

- When there is **no relationship** between the response and predictors,  
  the F-statistic should be close to 1.  

- When $H_a$ is true, then  

$$
E\left\{\frac{TSS - RSS}{p}\right\} > \sigma^2
$$  

so we expect $F > 1$.  
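As a rough check of these expectations, the F-statistic in (3.23) can be computed directly from TSS and RSS. The sketch below uses synthetic data and a hypothetical helper `f_statistic`; it compares a response that is pure noise (so $H_0$ holds) with one that depends on the first predictor.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic (3.23) for a least squares fit of y on X.

    X is an (n, p) array of predictors without an intercept column.
    """
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))

# Under Ha: the first predictor has a real effect on the response.
y_alt = 1.0 * X[:, 0] + rng.normal(size=n)
f_alt = f_statistic(X, y_alt)

# Under H0: repeat with many pure-noise responses and average the F-statistics.
f_null_sims = [f_statistic(X, rng.normal(size=n)) for _ in range(2000)]
f_null_mean = np.mean(f_null_sims)

print("F under Ha:", f_alt)
print("mean F under H0:", f_null_mean)
```

The average under $H_0$ should sit near 1, while the value under $H_a$ should be much larger.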

For example, the F-statistic for the multiple linear regression model obtained  
by regressing **sales** onto **radio, TV, and newspaper** is:  

$$
F = 570
$$  

Since this is far larger than 1, it provides compelling evidence against the  
null hypothesis $H_0$.  
The large F-statistic suggests that at least one of the advertising media  
must be related to sales. However, what if the F-statistic had been closer to 1? How large does the F-statistic need to be before we can reject $H_0$ and conclude that there is a relationship?

It turns out that the answer depends on the values of $n$ and $p$.  

- When $n$ is large, an F-statistic that is just a little larger than 1 might still provide evidence against $H_0$.  
- In contrast, a larger F-statistic is needed to reject $H_0$ if $n$ is small.  

| **Quantity**               | **Value** |
|-----------------------------|-----------|
| Residual standard error     | 1.69      |
| \(R^2\)                     | 0.897     |
| F-statistic                 | 570       |

*Table 3.6. More information about the least squares model for the regression
of number of units sold on TV, newspaper, and radio advertising budgets in the
Advertising data. Other information about this model was displayed in the Multiple Regression Coefficient Estimates table.*

When $H_0$ is true and the errors $\epsilon_i$ have a normal distribution, the F-statistic follows an **F-distribution**.  
For any given value of $n$ and $p$, statistical software can be used to compute the **p-value** associated with the F-statistic.  
Based on this p-value, we can determine whether or not to reject $H_0$.  
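One way to obtain such a p-value is sketched below. It assumes SciPy is available and takes the reference distribution to be $F$ with $p$ and $n - p - 1$ degrees of freedom, matching the numerator and denominator of (3.23). The helper name `f_test_p_value` is hypothetical, and $n = 200$ is assumed to be the number of markets in the Advertising data, which this section does not state.

```python
from scipy import stats

def f_test_p_value(f_stat, n, p):
    """Upper-tail probability of an F(p, n - p - 1) distribution at f_stat."""
    return stats.f.sf(f_stat, p, n - p - 1)

# F = 570 and p = 3 come from the text; n = 200 is an assumption about
# the size of the Advertising data.
p_value = f_test_p_value(570, n=200, p=3)
print(p_value)

# 5% critical values for p = 3: the threshold for rejection falls as n grows.
critical_values = {n: stats.f.ppf(0.95, 3, n - 3 - 1) for n in (10, 50, 1000)}
for n, crit in critical_values.items():
    print(n, crit)
```

The second loop illustrates the earlier remark that a larger F-statistic is needed to reject $H_0$ when $n$ is small.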

For the advertising data, the p-value associated with the F-statistic in Table 3.6 is essentially zero,  
so we have extremely strong evidence that at least one of the media is associated with increased sales.  


In (3.23) we are testing $H_0$ that **all the coefficients are zero**.  
Sometimes, we want to test that a particular subset of $q$ of the coefficients are zero.  
This corresponds to the null hypothesis:  

$$
H_0 : \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0
$$  

where (for convenience) the variables chosen for omission are listed at the end.  

In this case we fit a second model that uses all the variables except those last $q$.  
Suppose the residual sum of squares for that reduced model is $RSS_0$.  
Then the appropriate F-statistic is:  

$$
F = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)} \tag{3.24}
$$  


Notice that in Table 3.4, for each individual predictor a **t-statistic** and a **p-value** were reported.  
These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors.  

It turns out that each of these is exactly equivalent to the F-test that omits that single variable from the model (i.e. $q=1$ in (3.24)).  
So it reports the **partial effect** of adding that variable to the model.  
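The equivalence between the t-test and the $q = 1$ case of (3.24) can be checked numerically: the square of the t-statistic should equal the partial F-statistic. The sketch below uses synthetic data and the standard least squares formula for coefficient standard errors, $\hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}$, which is assumed here rather than derived in this section.

```python
import numpy as np

def fit_rss(X_design, y):
    """Return least squares coefficients and RSS for a design matrix."""
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    return beta, rss

rng = np.random.default_rng(3)
n, p = 120, 3
X = rng.normal(size=(n, p))
y = 1.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), X])
beta, rss = fit_rss(X_full, y)

# Partial F-statistic (3.24) with q = 1: drop the last predictor.
q = 1
_, rss0 = fit_rss(X_full[:, :-1], y)
f_partial = ((rss0 - rss) / q) / (rss / (n - p - 1))

# t-statistic for the same coefficient in the full model.
sigma2_hat = rss / (n - p - 1)
cov_beta = sigma2_hat * np.linalg.inv(X_full.T @ X_full)
t_last = beta[-1] / np.sqrt(cov_beta[-1, -1])

print(f_partial, t_last ** 2)
```

The two printed values should match up to floating-point error.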

For example, as discussed earlier:  
- The p-values indicate that **TV** and **radio** are related to sales.  
- There is no evidence that **newspaper** is associated with sales, when TV and radio are held fixed.   

Given the individual p-values, it may seem that if any one of them is small, then at least one predictor is related to the response.  
However, this reasoning is flawed, especially when the number of predictors $p$ is large.  

- Suppose $p=100$ and $H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0$ is true.  
- In this case, no variable is truly associated with the response.  
- However, by chance, about 5% of the p-values will fall below 0.05.  
- So we expect around five small p-values even in the absence of any real associations.  
- In fact, it is very likely that we will observe at least one p-value below 0.05 by chance!  
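The arithmetic behind these statements is short. The sketch below assumes, for simplicity, that the 100 tests are independent; the t-tests in a single regression are generally not exactly independent, so the second number is only indicative.

```python
# Assumption: 100 independent tests, each with a 5% false-positive rate.
num_tests = 100
alpha = 0.05

# Expected number of p-values below alpha when H0 is true for every predictor.
expected_false_positives = num_tests * alpha

# Probability that at least one p-value falls below alpha by chance.
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(expected_false_positives)
print(round(prob_at_least_one, 3))
```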

Hence, if we rely only on individual t-tests, we are very likely to **incorrectly conclude** that there is a relationship.  

The **F-statistic** avoids this problem, because it adjusts for the number of predictors.  
If $H_0$ is true, there is only a 5% chance that the F-statistic will produce a p-value below 0.05,  
regardless of the number of predictors or the number of observations.  

**Limitations of the F-statistic**  

The F-statistic is useful when $p$ is relatively small compared to $n$.  
However:  

- If $p > n$, there are more coefficients $\beta_j$ to estimate than there are observations.  
- In this case, we cannot fit the multiple linear regression model using least squares.  
- As a result, the F-statistic (and many related concepts) cannot be used.  
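A small synthetic example shows what goes wrong when $p > n$. The numbers below are illustrative only: with more columns than rows, the design matrix cannot have full column rank, the least squares solution is not unique, and the residual degrees of freedom $n - p - 1$ in (3.23) become negative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 10  # more predictors than observations

X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # pure noise, unrelated to X
X_design = np.column_stack([np.ones(n), X])

# The design matrix has p + 1 = 11 columns but rank at most n = 5,
# so X^T X is singular and the least squares solution is not unique.
rank = np.linalg.matrix_rank(X_design)

# lstsq returns the minimum-norm solution, which fits the noise exactly.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
rss = np.sum((y - X_design @ beta) ** 2)

print("rank:", rank, "columns:", X_design.shape[1])
print("RSS:", rss)
print("residual degrees of freedom:", n - p - 1)
```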

When $p$ is large, alternative approaches—such as **forward selection** and other model selection techniques—are required.  



## Two: Deciding on Important Variables

As discussed in the previous section, the first step in a multiple regression
analysis is to compute the F-statistic and to examine the associated *p*-value.  
If we conclude on the basis of that *p*-value that at least one of the
predictors is related to the response, then it is natural to wonder **which are
the guilty ones!** We could look at the individual *p*-values as in Table 3.4,
but as discussed (and as further explored in Chapter 13), if \(p\) is large we
are likely to make some false discoveries.

It is possible that all of the predictors are associated with the response,
but it is more often the case that the response is only associated with
a subset of the predictors. The task of determining which predictors are
associated with the response, in order to fit a single model involving only
those predictors, is referred to as **variable selection**. The variable selection
problem is studied extensively in Chapter 6, and so here we will provide
only a brief outline of some classical approaches.

---

### Considering Subsets of Models

Ideally, we would like to perform variable selection by trying out a lot of
different models, each containing a different subset of the predictors.  

For instance, if \(p = 2\), then we can consider four models:  
1. A model containing no variables.  
2. A model containing \(X_1\) only.  
3. A model containing \(X_2\) only.  
4. A model containing both \(X_1\) and \(X_2\).  

We can then select the best model out of all of the models that we have
considered.  

How do we determine which model is best? Various statistics can be used to
judge the quality of a model. These include:  
- **Mallows' \(C_p\)**  
- **Akaike Information Criterion (AIC)**  
- **Bayesian Information Criterion (BIC)**  
- **Adjusted \(R^2\)**  

These are discussed in more detail in Chapter 6. We can also determine which
model is best by plotting various model outputs, such as the residuals, in order
to search for patterns.

---

### The Problem of Too Many Models

Unfortunately, there are a total of \(2^p\) models that contain subsets of \(p\)
variables. This means that even for moderate \(p\), trying out every possible
subset of the predictors is infeasible.  

- If \(p = 2\), then there are \(2^2 = 4\) models to consider.  
- If \(p = 30\), then we must consider \(2^{30} = 1,073,741,824\) models!  

This is not practical. Therefore, unless \(p\) is very small, we cannot consider
all \(2^p\) models, and instead we need an automated and efficient approach to
choose a smaller set of models to consider.
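The count of \(2^p\) candidate models can be made concrete by enumerating subsets. The sketch below uses the three Advertising predictors as labels only; the helper `all_subsets` is hypothetical, and no models are actually fitted.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every subset of the predictors, from the null model to the full model."""
    for k in range(len(predictors) + 1):
        yield from combinations(predictors, k)

subsets = list(all_subsets(["TV", "radio", "newspaper"]))
print(len(subsets))  # number of candidate models for p = 3
print(subsets)

# The number of candidate models grows as 2**p.
model_counts = {p: 2 ** p for p in (2, 10, 30)}
print(model_counts)
```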

---

### Classical Approaches to Variable Selection

There are three classical approaches for this task:

#### Forward Selection
We begin with the **null model**—a model that contains an intercept but no
predictors.  
- Fit \(p\) simple linear regressions.  
- Add to the null model the variable that results in the lowest RSS.  
- Add to that model the variable that results in the lowest RSS for the new
two-variable model.  
- Continue this process until some stopping rule is satisfied.
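The forward selection steps above can be sketched as a greedy loop. This is a minimal illustration on synthetic data, not a reference implementation: the helper names are hypothetical, and the stopping rule used here (a fixed maximum number of variables, `max_vars`) is just one possible choice.

```python
import numpy as np

def rss_for(X, y, columns):
    """RSS of the least squares fit of y on an intercept plus the given columns."""
    X_design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return np.sum((y - X_design @ beta) ** 2)

def forward_selection(X, y, max_vars):
    """Greedily add the variable that gives the lowest RSS, up to max_vars variables."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_for(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
# Only columns 2 and 0 matter, with column 2 the stronger of the two.
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(size=300)

selected = forward_selection(X, y, max_vars=2)
print(selected)
```

Each step fits one model per remaining variable, so the procedure fits on the order of \(p^2\) models in total rather than \(2^p\).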

---

#### Backward Selection
We start with **all variables** in the model.  
- Remove the variable with the largest *p*-value (least statistically significant).  
- Fit the new \((p-1)\)-variable model.  
- Again remove the variable with the largest *p*-value.  
- Continue this procedure until a stopping rule is reached (e.g., all remaining
variables have a *p*-value below some threshold).  
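Backward selection can be sketched in a similar way. Within a single fitted model every coefficient shares the same degrees of freedom, so the variable with the largest *p*-value is the one with the smallest absolute t-statistic; the sketch below uses that fact to avoid computing *p*-values. The helper names are hypothetical, the data are synthetic, and the cutoff `t_threshold=2.0` is an illustrative stand-in for a *p*-value threshold near 0.05 when \(n\) is large.

```python
import numpy as np

def t_statistics(X_design, y):
    """t-statistics for each coefficient of a least squares fit."""
    n, k = X_design.shape  # k counts the intercept column as well
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta) ** 2)
    cov = rss / (n - k) * np.linalg.inv(X_design.T @ X_design)
    return beta / np.sqrt(np.diag(cov))

def backward_selection(X, y, t_threshold=2.0):
    """Drop the predictor with the smallest |t| until all exceed t_threshold."""
    kept = list(range(X.shape[1]))
    while kept:
        X_design = np.column_stack([np.ones(len(y)), X[:, kept]])
        t_abs = np.abs(t_statistics(X_design, y))[1:]  # skip the intercept
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] >= t_threshold:
            break
        kept.pop(weakest)
    return kept

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
# Only columns 0 and 2 are related to the response.
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(size=300)

kept = backward_selection(X, y)
print(kept)
```

The two informative columns should survive; a noise column can occasionally survive as well by chance, which is the false-discovery issue discussed earlier.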

---

#### Mixed Selection
A combination of forward and backward selection.  
- Start with no variables in the model.  
- Add the variable that provides the best fit (forward step).  
- At each step, check if any variable in the model has a *p*-value above a
threshold. If so, remove it (backward step).  
- Continue until:  
  - All variables in the model have sufficiently low *p*-values, and  
  - All variables outside the model would have a large *p*-value if added.  

---

### Summary
- **Backward selection** cannot be used if \(p > n\).  
- **Forward selection** can always be used, but it is greedy and might include
variables that later become redundant.  
- **Mixed selection** can remedy this by combining the strengths of both
approaches.


```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from ISLP import load_data
```