### Moving Beyond Linearity

Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption is almost always an approximation, and sometimes a poor one. In Chapter 6 we see that we can improve upon least squares using ridge regression, the lasso, principal components regression, and other techniques. In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model, which can only be improved so far! In this chapter we relax the linearity assumption while still attempting to maintain as much interpretability as possible. We do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.

- *Polynomial regression* extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, $X$, $X^2$, and $X^3$, as predictors. This approach is a simple way to obtain a non-linear fit to data.

- *Step functions* cut the range of a variable into $K$ distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

- *Regression splines* are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of $X$ into $K$ distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or *knots*. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

- *Smoothing splines* are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

- *Local regression* is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.

- *Generalized additive models* allow us to extend the methods above to deal with multiple predictors.
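As a small illustration of the step-function idea in the list above, the following sketch (not part of the original text) cuts a predictor into $K$ regions, builds one indicator column per region, and fits by least squares. The data are synthetic and the names `cuts`, `region`, and `D` are hypothetical, chosen only for this example. It shows why the resulting fit is piecewise constant: the least squares coefficient for each region is simply the mean response in that region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the Wage data).
x = rng.uniform(18, 80, size=200)
y = np.where(x < 40, 1.0, 3.0) + rng.normal(0, 0.1, size=200)

K = 4
cuts = np.linspace(18, 80, K + 1)[1:-1]   # K - 1 interior cutpoints
region = np.digitize(x, cuts)             # qualitative variable in {0, ..., K-1}

# One indicator (dummy) column per region. No separate intercept is needed
# because the indicators sum to one in every row.
D = (region[:, None] == np.arange(K)).astype(float)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)

# The fitted value inside region k is coef[k], the mean of y in that region.
region_means = np.array([y[region == k].mean() for k in range(K)])
```

Equally spaced cutpoints are used here only for simplicity; in practice the cutpoints are a modeling choice.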

In Sections 7.1–7.6, we present a number of approaches for modeling the relationship between a response $Y$ and a single predictor $X$ in a flexible way. In Section 7.7, we show that these approaches can be seamlessly integrated in order to model a response $Y$ as a function of several predictors $X_1, \ldots, X_p$.

#### Polynomial Regression

Historically, the standard way to extend linear regression to settings in which the relationship between the predictors and the response is non-linear has been to replace the standard linear model

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$

with a polynomial function

$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \ldots + \beta_d x_i^d + \epsilon_i, \quad (7.1)
$$

where $\epsilon_i$ is the error term. This approach is known as *polynomial regression*, and in fact we saw an example of this method in Section 3.3.2. For large enough degree $d$, a polynomial regression allows us to produce an extremely non-linear curve.

Notice that the coefficients in (7.1) can be easily estimated using least squares linear regression because this is just a standard linear model with predictors $x_i, x_i^2, x_i^3, \ldots, x_i^d$. Generally speaking, it is unusual to use $d$ greater than 3 or 4 because for large values of $d$, the curve can become overly flexible and can take on some very strange shapes. This is especially true near the boundary of the $X$ variable.
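To make the point about least squares concrete, here is a minimal sketch (not from the original text) that builds the design matrix with columns $1, x_i, x_i^2, \ldots, x_i^d$ and hands it to an ordinary least squares solver. The helper names `poly_design` and `fit_poly`, the synthetic data, and the chosen coefficients are all assumptions made for illustration.

```python
import numpy as np

def poly_design(x, d):
    """Design matrix with columns 1, x, x^2, ..., x^d."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_poly(x, y, d):
    """Estimate beta_0, ..., beta_d by ordinary least squares."""
    X = poly_design(x, d)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic cubic data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=300)
true_beta = np.array([1.0, -2.0, 0.5, 3.0])
y = poly_design(x, 3) @ true_beta + rng.normal(0, 0.05, size=300)

beta_hat = fit_poly(x, y, 3)   # estimates should lie close to true_beta
```

Nothing in the fitting step is specific to polynomials: the model is linear in the coefficients, so the same solver used for standard linear regression applies unchanged.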

The left-hand panel in Figure 7.1 is a plot of `wage` against `age` for the `Wage` data set, which contains income and demographic information for males who reside in the central Atlantic region of the United States. We see the results of fitting a degree-4 polynomial using least squares (solid blue curve). Even though this is a linear regression model like any other, the individual coefficients are not of particular interest. Instead, we look at the entire fitted function across a grid of 63 values for `age` from 18 to 80 in order to understand the relationship between `age` and `wage`.

In Figure 7.1, a pair of dashed curves accompanies the fit; these are $2 \times$ standard error curves. Let’s see how these arise. Suppose we have computed the fit at a particular value of age, $x_0$:

$$
\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4. \quad (7.2)
$$

What is the variance of the fit, i.e., $\text{Var}\,\hat{f}(x_0)$? Least squares returns variance estimates for each of the fitted coefficients $\hat{\beta}_j$, as well as the covariances between pairs of coefficient estimates. We can use these to compute the estimated variance of $\hat{f}(x_0)$. The estimated *pointwise* standard error of $\hat{f}(x_0)$ is the square root of this variance. This computation is repeated at each reference point $x_0$, and we plot the fitted curve, as well as twice the standard error on either side of the fitted curve. We plot twice the standard error because, for normally distributed error terms, this quantity corresponds to an approximate 95% confidence interval.
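The computation just described can be sketched as follows. If $\hat{C}$ denotes the estimated covariance matrix of the coefficient estimates and $\ell_0^T = (1, x_0, x_0^2, \ldots, x_0^d)$, then the estimated variance of $\hat{f}(x_0)$ is $\ell_0^T \hat{C} \ell_0$. The code below is an illustrative sketch rather than the procedure used to produce Figure 7.1: the data are synthetic stand-ins for `age` and `wage`, and the helper names `scale` and `poly_fit_with_se` are hypothetical. Rescaling `age` to $[-1, 1]$ is an added assumption that keeps the power basis numerically well conditioned; it does not change the fitted curve.

```python
import numpy as np

def scale(age, lo=18.0, hi=80.0):
    """Map age linearly onto [-1, 1] (numerical convenience only)."""
    return (2.0 * age - (lo + hi)) / (hi - lo)

def poly_fit_with_se(x, y, d, x0):
    """Degree-d least squares fit and pointwise standard errors at x0."""
    X = np.vander(x, N=d + 1, increasing=True)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # estimated error variance
    C = sigma2 * np.linalg.inv(X.T @ X)       # estimated Cov(beta-hat)
    L0 = np.vander(x0, N=d + 1, increasing=True)
    fit = L0 @ beta
    # Quadratic form l0^T C l0, evaluated for every row l0 of L0.
    var = np.einsum("ij,jk,ik->i", L0, C, L0)
    return fit, np.sqrt(var)

# Synthetic stand-in data (an assumption, not the Wage data set).
rng = np.random.default_rng(2)
age = rng.uniform(18, 80, size=500)
wage = 110 - 40 * scale(age) ** 2 + rng.normal(0, 10, size=500)

grid = np.arange(18, 81)                      # 63 ages from 18 to 80
fit, se = poly_fit_with_se(scale(age), wage, 4, scale(grid))
lower, upper = fit - 2 * se, fit + 2 * se     # approximate 95% band
```

In this sketch the band is widest near the ends of the `age` range, which is consistent with the earlier remark that polynomial fits behave least reliably near the boundary of the predictor.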

It seems like the wages in Figure 7.1 are from two distinct populations: there appears to be a *high earners* group earning more than \$250,000 per annum, as well as a *lower earners* group. We can treat `wage` as a binary variable by splitting it into these two groups. Logistic regression can then be used to predict this binary response, using polynomial functions of `age`