### Linear Regression

Recall the *Advertising* data from Chapter 2. Figure 2.1 displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of these data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation? Here are a few important questions that we might seek to address:

  1. *Is there a relationship between advertising budget and sales?* Our first goal should be to determine whether the data provide evidence of an association between advertising expenditure and sales. If the evidence is weak, then one might argue that no money should be spent on advertising!

  2. *How strong is the relationship between advertising budget and sales?* Assuming that there is a relationship between advertising and sales, we would like to know the strength of this relationship. Does knowledge of the advertising budget provide a lot of information about product sales?

  3. *Which media are associated with sales?* Are all three media (TV, radio, and newspaper) associated with sales, or are just one or two of the media associated? To answer this question, we must find a way to separate out the individual contribution of each medium to sales when we have spent money on all three media.

  4. *How large is the association between each medium and sales?* For every dollar spent on advertising in a particular medium, by what amount will sales increase? How accurately can we predict this amount of increase?

  5. *How accurately can we predict future sales?* For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?

  6. *Is the relationship linear?* If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.

  7. *Is there synergy among the advertising media?* Perhaps spending \$50,000 on television advertising and \$50,000 on radio advertising is associated with higher sales than allocating \$100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.



#### Simple Linear Regression

*Simple linear regression* lives up to its name: it is a very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this linear relationship as

\begin{equation}
Y \approx \beta_0 + \beta_1 X.
\tag{3.1}
\end{equation}

You might read “$\approx$” as “is approximately modeled as”. We will sometimes describe (3.1) by saying that we are regressing $Y$ on $X$ (or $Y$ onto $X$).

For example, $X$ may represent TV advertising and $Y$ may represent sales. Then we can regress sales onto TV by fitting the model

$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.$$


In Equation 3.1, $\beta_0$ and $\beta_1$ are two unknown constants that represent the *intercept* and *slope* terms in the linear model. Together, $\beta_0$ and $\beta_1$ are known as the model *coefficients* or *parameters*. Once we have used our training data to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing

\begin{equation}
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,
\tag{3.2}
\end{equation}

where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$. Here we use a hat symbol, $\hat{~}$, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.
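As a minimal sketch of the prediction rule (3.2), the snippet below plugs in the least squares estimates that this chapter reports for the *Advertising* data ($\hat{\beta}_0 = 7.03$, $\hat{\beta}_1 = 0.047$). The function name `predict_sales` and the example budgets are hypothetical choices made for illustration.

```python
# Coefficient estimates quoted in this chapter for the Advertising data
# (sales in thousands of units, TV budget in thousands of dollars).
BETA0_HAT = 7.03
BETA1_HAT = 0.047


def predict_sales(tv_budget, beta0_hat=BETA0_HAT, beta1_hat=BETA1_HAT):
    """Return y_hat = beta0_hat + beta1_hat * x, as in (3.2)."""
    return beta0_hat + beta1_hat * tv_budget


# Hypothetical budgets, in thousands of dollars (illustrative values only).
for tv in (0, 50, 100):
    print(f"TV = {tv:>3} -> predicted sales = {predict_sales(tv):.2f} thousand units")
```

The prediction at a budget of zero is simply the intercept estimate, and each additional unit of $x$ moves the prediction by the slope estimate.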



#### Estimating the Coefficients

In practice, $\beta_0$ and $\beta_1$ are unknown. So before we can use (3.1) to make predictions, we must use data to estimate the coefficients. Let

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$. In the *Advertising* example, this data set consists of the TV advertising budget and product sales in $n = 200$ different markets. (Recall that these data are displayed in Figure 2.1.) Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model (3.1) fits the available data well; that is, so that $y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, \ldots, n$. In other words, we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the $n = 200$ data points. There are a number of ways of measuring *closeness*. However, by far the most common approach involves minimizing the *least squares criterion*, and we take that approach in this chapter.

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th *residual*: this is the difference between the $i$th observed response value and the $i$th response value that is predicted by our linear model. We define the *residual sum of squares* (RSS) as

$$\text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2,$$

or equivalently as

\begin{equation}
\text{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2.
\tag{3.3}
\end{equation}
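The definition (3.3) translates directly into code. The sketch below assumes NumPy is available; the helper name `rss` and the four toy data points are made up for illustration and are not the *Advertising* data.

```python
import numpy as np


def rss(beta0_hat, beta1_hat, x, y):
    """Residual sum of squares (3.3) for the line beta0_hat + beta1_hat * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta0_hat + beta1_hat * x)  # e_i = y_i - y_hat_i
    return float(np.sum(residuals ** 2))


# Toy data (made up for illustration; not the Advertising data).
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [2.0, 4.1, 5.9, 8.2]

print(rss(0.0, 2.0, x_toy, y_toy))  # a line close to the points: small RSS
print(rss(5.0, 0.0, x_toy, y_toy))  # a flat line: much larger RSS
```

A candidate line that passes near the points gives a small RSS, while a poorly chosen line gives a large one, which is what makes RSS usable as a measure of closeness.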

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. Using some calculus, one can show that the minimizers are

\begin{equation}
\begin{aligned}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x},
\end{aligned}
\tag{3.4}
\end{equation}

where $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ are the sample means. In other words, (3.4) defines the *least squares coefficient estimates* for simple linear regression.
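The closed-form estimates in (3.4) can be sketched in a few lines. The example below assumes NumPy and uses synthetic data generated from an arbitrary line ($y = 1 + 2x$ plus noise), chosen only for illustration. The function name `least_squares_fit` is hypothetical, and `np.polyfit` serves as an independent cross-check.

```python
import numpy as np


def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form expressions (3.4)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return float(beta0_hat), float(beta1_hat)


# Synthetic data (assumed for illustration): y = 1 + 2x plus small noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(scale=0.5, size=50)

b0, b1 = least_squares_fit(x_demo, y_demo)

# Cross-check against NumPy's degree-1 polynomial fit, which also
# minimizes the sum of squared residuals (returns slope, then intercept).
slope_np, intercept_np = np.polyfit(x_demo, y_demo, deg=1)
print("closed form:", b0, b1)
print("np.polyfit :", intercept_np, slope_np)
```

The two sets of numbers should agree up to floating-point rounding, since both minimize the same criterion.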

Figure 3.1 displays the simple linear regression fit to the *Advertising* data, where $\hat{\beta}_0 = 7.03$ and $\hat{\beta}_1 = 0.047$. In other words, according to this approximation, an additional \$1,000 spent on TV advertising is associated with selling approximately 47 additional units of the product. In Figure 3.2, we have computed RSS for a number of values of $\beta_0$ and $\beta_1$, using the advertising data with sales as the response and TV as the predictor. In each plot, the red dot represents the pair of least squares estimates $(\hat{\beta}_0, \hat{\beta}_1)$ given by (3.4). These values clearly minimize the RSS.
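The idea behind a plot of RSS over candidate coefficient values can be sketched numerically. The snippet below is an illustration under stated assumptions: it uses synthetic data (not the *Advertising* data), an arbitrary grid centred on the least squares estimates, and a hypothetical helper `rss`. It then checks where on the grid the RSS is smallest.

```python
import numpy as np

# Synthetic data (assumed for illustration; not the Advertising data).
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(scale=3.0, size=200)

# Closed-form least squares estimates, as in (3.4).
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar


def rss(b0, b1):
    """Residual sum of squares for the candidate line b0 + b1 * x."""
    return np.sum((y - b0 - b1 * x) ** 2)


# Evaluate RSS on a grid of candidate (b0, b1) pairs centred on the estimates.
b0_grid = np.linspace(beta0_hat - 2.0, beta0_hat + 2.0, 41)
b1_grid = np.linspace(beta1_hat - 0.02, beta1_hat + 0.02, 41)
rss_grid = np.array([[rss(b0, b1) for b1 in b1_grid] for b0 in b0_grid])

# Locate the grid point with the smallest RSS and compare it with the
# least squares estimates, which sit at the centre of the grid.
i_min, j_min = np.unravel_index(np.argmin(rss_grid), rss_grid.shape)
print("grid minimizer:", b0_grid[i_min], b1_grid[j_min])
print("least squares :", beta0_hat, beta1_hat)
```

Because RSS is a quadratic function of the two coefficients with a single minimum, every grid point away from the centre has a larger RSS than the least squares pair.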

#### Assessing the Accuracy of the Coefficient Estimates

Recall from $(2.1)$ that we assume that the *true* relationship between $X$ and $Y$ takes the form $Y = f(X) + \epsilon$ for some unknown function $f$, where $\epsilon$ is a mean-zero random error term. If $f$ is to be approximated by a linear function, then we can write this relationship as

\begin{equation}
Y = \beta_0 + \beta_1 X + \epsilon.
\tag{3.5}
\end{equation}

Here $\beta_0$ is the intercept term, that is, the expected value of $Y$ when $X = 0$, and $\beta_1$ is the slope, which quantifies the association between $X$ and $Y$. As we saw in the previous section, we use data to produce estimates of $\beta_0$ and $\beta_1$ via the *least squares* approach. The error term $\epsilon$ is a catch-all for what this simple model misses: the true relationship is probably not linear, there may be other variables that affect $Y$ but are not captured by this model, and there may be measurement error. We typically assume that the error term is independent of $X$.
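A small simulation can make the role of the error term in (3.5) concrete. The sketch below assumes a population model $Y = 2 + 3X + \epsilon$ with normally distributed errors; these coefficients, the error standard deviation, the range of $X$, and the helper name `simulate_and_fit` are all arbitrary choices for illustration. Each simulated data set gives a slightly different least squares line, and the average over many data sets can be compared with the assumed true coefficients.

```python
import numpy as np

# Assumed population model for illustration: Y = 2 + 3X + eps,
# with eps ~ N(0, 2^2) drawn independently of X.
TRUE_BETA0, TRUE_BETA1, SIGMA = 2.0, 3.0, 2.0
rng = np.random.default_rng(42)


def simulate_and_fit(n=100):
    """Draw one data set from the assumed model and return (b0_hat, b1_hat)."""
    x = rng.uniform(-2, 2, size=n)
    eps = rng.normal(scale=SIGMA, size=n)  # mean-zero error term
    y = TRUE_BETA0 + TRUE_BETA1 * x + eps
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    return y_bar - b1 * x_bar, b1


# Each simulated data set yields a slightly different least squares line.
fits = np.array([simulate_and_fit() for _ in range(2000)])
print("first three fits:\n", fits[:3])

# Averaged over many data sets, the estimates sit close to the true values.
print("mean of estimates:", fits.mean(axis=0))
print("true coefficients:", (TRUE_BETA0, TRUE_BETA1))
```

Individual fits scatter around the assumed coefficients because of $\epsilon$, while their average shows no systematic over- or under-estimation.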

The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of *bias*. If we use the sample mean $\hat{\mu}$ to estimate $\mu$, this estimate is *unbiased*, in the sense that on average, we expect $\hat{\mu}$ to equal $\mu$. What exactly does this mean? It means that on the basis of one particular set of observations $y_1, \ldots, y_n$, $\hat{\mu}$ might overestimate $\mu$, and on the basis of another set of observations, $\hat{\mu}$ might underestimate $\mu$. But if we could average a huge number of estimates of $\mu$ obtained from a huge number of sets of observations, then this average would exactly equal $\mu$. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter. The property of unbiasedness holds for the least squares coefficient estimates given by (3.4) as well: if we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates won't be exactly equal to $\beta_0$ and $\beta_1$. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on! In fact, we can see from the right-hand panel of Figure 3.3 that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.

We continue the analogy with the estimation of the population mean $\mu$ of a random variable $Y$. A natural question is as follows: how accurate is the sample mean $\hat{\mu} = \bar{y}$ as an estimate of $\mu$? We have established that the average of $\bar{y}$'s over many data sets will be very close to $\mu$, but that a single estimate $\bar{y}$ may be a substantial underestimate or overestimate of $\mu$. How far off will that single estimate of $\mu$ be? In general, we answer this question by computing the *standard error* of $\bar{y}$, written as $\mathrm{SE}(\bar{y})$. We have the well-known formula

\begin{equation}
\mathrm{Var}(\bar{y}) = \mathrm{SE}(\bar{y})^2 = \frac{\sigma^2}{n},
\tag{3.7}
\end{equation}

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$. Roughly speaking, the standard error tells us the average amount that this estimate $\bar{y}$ differs from the actual value of $\mu$. Equation 3.7 also tells us how this deviation shrinks with $n$: the more observations we have, the smaller the standard error of $\bar{y}$. In a similar vein, we can wonder how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values $\beta_0$ and $\beta_1$. To compute the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the following formulas:

\begin{equation}
\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right],
\quad
\mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},
\tag{3.8}
\end{equation}

where $\sigma^2 = \mathrm{Var}(\epsilon)$. For these formulas to be strictly valid, we need to assume that the errors $\epsilon_i$ for each observation have common variance $\sigma^2$ and are uncorrelated. This is clearly not true in Figure 3.1, but the formula still turns out to be a good approximation. Notice in the formula that $\mathrm{SE}(\hat{\beta}_1)$ is smaller when the $x_i$ are more spread out; intuitively we have more *leverage* to estimate a slope when this is the case. We also see that $\mathrm{SE}(\hat{\beta}_0)$ would be the same as $\mathrm{SE}(\bar{y})$ if $\bar{x}$ were zero (in which case $\hat{\beta}_0$ would be equal to $\bar{y}$). In general, $\sigma^2$ is not known, but can be estimated from the data. The estimate of $\sigma$ is known as the *residual standard error*, and is given by the formula

\[
\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n e_i^2} = \sqrt{\frac{\text{RSS}}{n-2}}.
\]

Strictly speaking, when $\sigma^2$ is estimated from the data we should write $\widehat{\mathrm{SE}}(\hat{\beta}_1)$ to indicate that an estimate has been made. But for simplicity we will not use this extra “hat”.

Standard errors can be used to compute *confidence intervals*. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value being estimated. The standard error comes into play in the formula for the upper and lower confidence interval endpoints. A 95% confidence interval has the following property: if we draw a large number of data sets and construct a confidence interval from each one, about 95% of the intervals constructed in this way will contain the true unknown value. For linear regression, a 95% confidence interval for the true value of $\beta_1$ approximately takes the form

\begin{equation}
\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1).
\tag{3.9}
\end{equation}

That is, there is approximately a 95% chance that the interval

\begin{equation}
\left[ \hat{\beta}_1 - 2 \cdot \mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot \mathrm{SE}(\hat{\beta}_1)\right]
\tag{3.10}
\end{equation}

will contain the true value of $\beta_1$. Similarly, a confidence interval for $\beta_0$ approximately takes the form

\begin{equation}
\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0).
\tag{3.11}
\end{equation}

In the case of the advertising data, the 95% confidence interval for $\beta_0$ is $[6.130, 7.935]$ and the 95% confidence interval for $\beta_1$ is $[0.042, 0.053]$. Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units. Furthermore, for each \$1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.

Standard errors can also be used to perform *hypothesis tests* on the coefficients. The most common hypothesis test involves testing the *null hypothesis* of

\begin{equation}
H_0: \text{There is no relationship between $X$ and $Y$}
\tag{3.12}
\end{equation}

versus the *alternative hypothesis*

\begin{equation}
H_a: \text{There is some relationship between $X$ and $Y$}
\tag{3.13}
\end{equation}

Mathematically, this corresponds to testing

\[
H_0 : \beta_1 = 0
\]

versus

\[
H_a : \beta_1 \neq 0,
\]

since if $\beta_1 = 0$ then the model (3.5) reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$. To test the null hypothesis, we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. How far is far enough? This of course depends on the accuracy of $\hat{\beta}_1$; that is, it depends on $\mathrm{SE}(\hat{\beta}_1)$. If $\mathrm{SE}(\hat{\beta}_1)$ is small, then even relatively small values of $\hat{\beta}_1$ may provide strong evidence that $\beta_1 \neq 0$, and hence that there is a relationship between $X$ and $Y$. In contrast, if $\mathrm{SE}(\hat{\beta}_1)$ is large, then $\hat{\beta}_1$ must be large in absolute value in order for us to reject the null hypothesis. In practice, we compute a *t*-statistic, given by

\begin{equation}
t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)},
\tag{3.14}
\end{equation}

which measures the number of standard errors that $\hat{\beta}_1$ lies away from 0.
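The quantities discussed in this section can be computed by hand for a small example: the residual standard error, the standard errors in (3.8), the approximate 95% confidence intervals (3.10) and (3.11), and the *t*-statistic (3.14). The sketch below assumes NumPy and uses synthetic data generated from an arbitrary line. It is not the *Advertising* data, so the numbers will not match those quoted in the text.

```python
import numpy as np

# Synthetic data (assumed for illustration; not the Advertising data).
rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(scale=3.0, size=n)

# Least squares estimates, as in (3.4).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual standard error: sqrt(RSS / (n - 2)), the estimate of sigma.
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
sigma_hat = np.sqrt(rss / (n - 2))

# Standard errors from (3.8), with sigma replaced by its estimate.
se_beta0 = sigma_hat * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_beta1 = sigma_hat / np.sqrt(sxx)

# Approximate 95% confidence intervals: estimate +/- 2 standard errors.
ci_beta0 = (beta0_hat - 2 * se_beta0, beta0_hat + 2 * se_beta0)
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)

# t-statistic (3.14) for the null hypothesis H0: beta1 = 0.
t_stat = beta1_hat / se_beta1

print(f"sigma_hat = {sigma_hat:.3f}")
print(f"beta0_hat = {beta0_hat:.3f}, SE = {se_beta0:.3f}, CI = {ci_beta0}")
print(f"beta1_hat = {beta1_hat:.4f}, SE = {se_beta1:.4f}, CI = {ci_beta1}")
print(f"t-statistic for beta1 = {t_stat:.2f}")
```

The factor of 2 follows the approximation used in the text. A more exact interval would use a quantile of the *t*-distribution with $n - 2$ degrees of freedom, which is close to 2 for a sample of this size.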


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from ISLP import load_data