# 7 Moving Beyond Linearity

- ___`Polynomial regression`___ extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power.
>__For example,__ a cubic regression uses three variables, $X$, $X^{2}$, and $X^{3}$, as predictors. This approach provides a simple way to obtain a non-linear fit to data.

- ___`Step functions`___ cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.


- ___`Regression splines`___ are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.


- ___`Smoothing splines`___ are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.


- ___`Local regression`___ is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.


- ___`Generalized additive models (GAMs)`___ allow us to extend the methods above to deal with multiple predictors.

In Sections <a href="#7.1-Polynomial-Regression">7.1</a> – <a href="#7.6-Local-Regression">7.6</a>, we present a number of approaches for modeling the relationship between a response $Y$ and a single predictor $X$ in a flexible way. In Section <a href="#7.7-Generalized-Additive-Models">7.7</a>, we show that these approaches can be seamlessly integrated in order to model a response $Y$ as a function of several predictors $X_{1} , \dots , X_{p}$ .

## 7.1 Polynomial Regression

Historically, the standard way to `extend linear regression` to `settings in which the relationship between` __the predictors and the response is non-linear__ has `been to replace` the __standard linear model__

<font size=5><center> $ y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} $ </center></font>

with a ___`Polynomial Function`___

<a id="Formula7.1"></a>
<font size=5><center>  $ y_{i} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}x^2_{i} + \beta_{3}x^3_{i}+ \dots + \beta_{d}x^d_{i} + \epsilon_{i} $  </center></font>
>where, $\epsilon_{i}$ is the error term

This approach is known as ___`"Polynomial Regression"`___
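
The following is a minimal sketch of how a polynomial regression of this form could be fit by least squares with NumPy. The helper names `fit_polynomial` and `predict_polynomial` are hypothetical, and the data are synthetic rather than the `Wage data`.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares estimates of beta_0, ..., beta_d for a degree-d polynomial."""
    # Design matrix with columns x^0, x^1, ..., x^degree (intercept first).
    X = np.vander(x, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_polynomial(beta, x):
    """Evaluate the fitted polynomial at new points x."""
    X = np.vander(x, N=len(beta), increasing=True)
    return X @ beta

# Synthetic example (an assumption for illustration, not the Wage data).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3 + rng.normal(scale=0.1, size=x.size)

beta_hat = fit_polynomial(x, y, degree=3)   # estimates of beta_0..beta_3
y_hat = predict_polynomial(beta_hat, x)     # fitted values
```

Because the model is linear in the coefficients, ordinary least squares applies even though the fitted curve is non-linear in $x$.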



<a id="Figure7.1"></a>
![image.png](Figures/Figure7.1.png)
>__FIGURE 7.1.__ The `Wage data`. 
<br>__Left:__ The `solid blue curve` is a `degree-4 polynomial` of `wage` (in thousands of dollars) as a `function of age`, fit by `least squares`. The `dotted curves` indicate an `estimated 95 % confidence interval`. 
<br>__Right:__ We model the `binary event` $wage\gt250$ using `logistic regression`, again with a `degree-4 polynomial`. The fitted posterior probability of wage exceeding &dollar;250,000 is shown in `blue`, along
with an `estimated 95 % confidence interval`.


## 7.2 Step Functions

Using `polynomial functions` of the `features` as `predictors` in a `linear model` imposes a `global structure` on the `non-linear function` of $X$. We can `instead use` __`step functions`__ in order to avoid imposing such a `global structure`. Here we `break the range` of $X$ into __`bins`__, and fit a different constant in each `bin`. This amounts to `converting a continuous variable` into an ___`ordered categorical variable`___.

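- In greater `detail`, we `create` __cutpoints__ $c_{1} , c_{2} , \dots , c_{K}$ in the `range of` $X$,
and `then construct` $K + 1$ `new variables`

<font size=5><center> $ C_{0}(X) = I(X \lt c_{1}), \quad C_{1}(X) = I(c_{1} \le X \lt c_{2}), \quad \dots, \quad C_{K}(X) = I(c_{K} \le X), $ </center></font>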

<a id="Figure7.2"></a>
![image.png](Figures/Figure7.2.png)
>__FIGURE 7.2.__ The `Wage data`. <br>__Left:__ The `solid curve` displays the `fitted value` from a `least squares regression` of wage (in thousands of dollars) using __`step functions of age`__. The `dotted curves` indicate an `estimated 95 % confidence interval`. 
<br>__Right:__ We model the `binary` event `wage` $\gt250 $ using `logistic regression`, again using `step functions` of age. The `fitted` posterior probability of wage `exceeding` &dollar;250,000 is shown, along with an `estimated 95 % confidence interval`.

__where__ $I(·)$ is an ___`indicator function`___ that `returns a` $1$ if the `condition is true`,
and `returns a` $0$ otherwise. 
<br>For example, $I(c_{K} \le X)$ equals $1$ if $c_{K} \le X$, and `equals` $0$ otherwise. 

These are sometimes called ___`dummy variables`___. 

Notice that for any value of $X, C_{0} (X) + C_{1} (X) + \dots + C_{K} (X) = 1$, since $X$ must be in exactly one of the $K + 1$ intervals. 

We then use `least squares` to `fit` a `linear model` using $C_{1} (X), C_{2} (X), \dots , C_{K} (X)$ as predictors ($C_{0}(X)$ is left out because it is redundant with the intercept):

<a id="Formula7.5"></a>
<font size=5><center> $ y_{i} = β_{0} + β_{1} C_{1} (x_{i} ) + β_{2} C_{2} (x_{i} ) + \dots + β_{K} C_{K} (x_{i}) + \epsilon_{i} . $ </center></font>
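
As a rough sketch of the step-function fit above, the code below builds the dummy variables $C_{1}(X), \dots, C_{K}(X)$ from a list of cutpoints and fits them by least squares. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def step_design_matrix(x, cutpoints):
    """Columns C_1(x), ..., C_K(x); C_0 is omitted because the intercept covers it."""
    cutpoints = np.sort(np.asarray(cutpoints, dtype=float))
    # bins == 0 means x < c_1; bins == k means c_k <= x < c_{k+1}; bins == K means c_K <= x.
    bins = np.searchsorted(cutpoints, x, side="right")
    K = len(cutpoints)
    return np.column_stack([(bins == k).astype(float) for k in range(1, K + 1)])

def fit_step_function(x, y, cutpoints):
    """Least-squares fit of y on an intercept and the K dummy variables."""
    X = np.column_stack([np.ones(len(x)), step_design_matrix(x, cutpoints)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Toy data (hypothetical): two observations in each of the three bins.
x = np.array([1.0, 2.0, 10.0, 11.0, 20.0, 21.0])
y = np.array([1.0, 3.0, 10.0, 12.0, 20.0, 24.0])
beta_hat = fit_step_function(x, y, cutpoints=[5.0, 15.0])
```

With this coding, $\beta_{0}$ is the mean response in the first bin ($X \lt c_{1}$), and $\beta_{0} + \beta_{j}$ is the mean response in the bin $c_{j} \le X \lt c_{j+1}$, so the fitted function is piecewise constant.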

## 7.3 Basis Functions

`Polynomial` and `piecewise-constant regression models` are in fact `special
cases` of a ___`basis function approach`___. The idea is to have at hand a `family of functions` or `transformations` that `can be applied` to a `variable` $X: b_{1} (X), b_{2} (X), \dots , b_{K} (X)$. 

Instead of `fitting a linear model` in $X$, we `fit` the `model`

<a id="Formula7.7"></a>
<font size=5><center> $ y _{i} = β_{0} + β_{1} b_{1} (x_{i} ) + β_{2} b_{2} (x_{i} ) + β_{3} b_{3} (x_{i} ) + \dots + β_{K} b_{K} (x_{i}) + \epsilon_{i}. $ </center></font>


Note that the basis functions $b_{1} (·), b_{2} (·), \dots , b_{K} (·)$ are fixed and known.

For `polynomial regression`, the `basis functions` are $b_{j} (x_{i} ) = x_{i}^{j}$ , and for `piecewise constant
functions` they are $b_{j} (x_{i} ) = I(c_{j} \le x_{i} \lt c_{j+1} )$.
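
To illustrate that both cases share the same fitting machinery, here is a small sketch in which the basis functions are passed in as a list of Python callables. The names `basis_design_matrix` and `fit_basis_model`, and the cutpoints 30 and 50, are hypothetical choices for this example.

```python
import numpy as np

def basis_design_matrix(x, basis_functions):
    """Columns: intercept, b_1(x), ..., b_K(x)."""
    cols = [np.ones_like(x, dtype=float)]
    cols += [np.asarray(b(x), dtype=float) for b in basis_functions]
    return np.column_stack(cols)

def fit_basis_model(x, y, basis_functions):
    """Ordinary least squares on the transformed predictors."""
    X = basis_design_matrix(x, basis_functions)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Polynomial basis: b_j(x) = x^j for j = 1, 2, 3.
poly_basis = [lambda x, j=j: x**j for j in (1, 2, 3)]

# Piecewise-constant basis with assumed cutpoints c_1 = 30 and c_2 = 50.
cuts = [30.0, 50.0]
step_basis = [
    lambda x: (cuts[0] <= x) & (x < cuts[1]),  # I(c_1 <= x < c_2)
    lambda x: cuts[1] <= x,                    # I(c_2 <= x)
]
```

Only the list of basis functions changes between the two models; the least squares step is identical.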

Thus far we have considered the use of `polynomial functions` and `piece-wise constant functions` for our `basis functions`; however, many `alternatives are possible`. For `instance`, we can use `wavelets` or `Fourier series` to construct `basis functions`. In the `next section`, we `investigate` a `very common choice for a basis function`: __regression splines.__

## 7.4 Regression Splines

Let's discuss a flexible class of basis functions that extends the polynomial and piecewise constant regression approaches that we have just seen.

### 7.4.1 Piecewise Polynomials

Instead of fitting a high-degree polynomial over the entire range of $X$, ___`piecewise polynomial regression`___ involves `fitting separate` _low-degree polynomials over different regions_ of $X$. 

__For example,__ a `piecewise cubic polynomial` works by fitting a `cubic regression model` of the form

<a id="Formula7.8"></a>
<font size=5><center> $ y_{i} = β_{0} + β_{1} x_{i} + β_{2} x^2_{i} + β_{3} x^3_{i} + \epsilon_{i}$,</center></font>
>where, 
- the coefficients $\beta_{0},\beta_{1},\beta_{2}$ and $\beta_{3}$ differ in different parts of the range of $X$.
- The points where the coefficients change are called ___`knots`___.


<div class="alert alert-block alert-warning">
    <b>For example,</b> a <b>piecewise cubic</b> with no knots is just a standard cubic polynomial, as in <a href="#Formula7.1">(7.1)</a> with $d = 3$. <br>A piecewise cubic polynomial with a single knot at a point $c$ takes the form
</div>

<font size=5><center> $ y_{i} = \begin{cases} \beta_{01}+\beta_{11}x_{i}+\beta_{21}x^2_{i}+\beta_{31}x^3_{i} +\epsilon_{i} & \text{if } x_{i} \lt c \\ \beta_{02}+\beta_{12}x_{i}+\beta_{22}x^2_{i}+\beta_{32}x^3_{i} +\epsilon_{i} & \text{if } x_{i} \ge c \end{cases} $ </center></font>

__In other words, we fit `two different polynomial functions` to the `data`,__
1. __one on the subset of the observations with $x_{i} \lt c$, and__
2. __one on the subset of the observations with $x_{i} \ge c$.__


- The `first polynomial function` has `coefficients` $\beta_{01} , \beta_{11} , \beta_{21} , \beta_{31}$, and 
- the `second has coefficients` $\beta_{02} , \beta_{12} , \beta_{22} , \beta_{32}$ . 
- Each of these `polynomial functions` can be fit using `least squares` applied to simple functions of the `original predictor`.
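
The two separate fits described above can be sketched as follows, assuming a single knot `c` and using NumPy's polynomial helpers. The function names are hypothetical, and no continuity is imposed at the knot.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_piecewise_cubic(x, y, c):
    """Fit one unconstrained cubic to x < c and another to x >= c."""
    left = x < c
    right = ~left
    # polyfit returns coefficients ordered from the constant term upward.
    beta_left = P.polyfit(x[left], y[left], 3)     # beta_01, beta_11, beta_21, beta_31
    beta_right = P.polyfit(x[right], y[right], 3)  # beta_02, beta_12, beta_22, beta_32
    return beta_left, beta_right

def predict_piecewise_cubic(x, c, beta_left, beta_right):
    """Evaluate whichever cubic applies on each side of the knot."""
    return np.where(x < c, P.polyval(x, beta_left), P.polyval(x, beta_right))

# Synthetic data (an assumption for illustration) with a jump at the knot c = 0.
x = np.linspace(-1, 1, 41)
y = np.where(x < 0, x**3, 5 + x)
beta_left, beta_right = fit_piecewise_cubic(x, y, c=0.0)
```

The fit estimates eight coefficients in total, four per region, and nothing prevents the two cubics from disagreeing at the knot.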

<div class="alert alert-block alert-warning">
    <b>Using more knots leads to a more flexible piecewise polynomial.</b> 
    In general, if we place $K$ different knots throughout the range of $X$, then we
will end up fitting $K + 1$ different cubic polynomials. Note that we do not
    need to use a cubic polynomial. <br><b>For example,</b> we can instead fit piecewise
linear functions. In fact, our piecewise constant functions of <a href="#7.2-Step-Functions">Section 7.2</a> are
piecewise polynomials of degree 0!
</div>


### 7.4.2 Constraints and Splines

The top left panel of <a href="#Figure7.3">Figure 7.3</a> looks wrong because the `fitted curve` is `just too flexible`. To remedy this problem, we can `fit a piecewise polynomial` under the `constraint` that the `fitted curve must be continuous`. 

__In other words, there cannot be a jump when $age=50$.__ 

The `top right plot` in <a href="#Figure7.3">Figure 7.3</a> shows the `resulting fit`. This looks `better than` the `top left plot`, but the $V$-`shaped join looks unnatural`.

<a id="Figure7.3"></a>
![image.png](Figures/Figure7.3.png)
>__FIGURE 7.3.__ Various piecewise polynomials are fit to a subset of the `Wage
data`, with `a knot` at $age=50$. 
<br>__Top Left:__ The cubic polynomials are unconstrained.
<br>__Top Right:__ The cubic polynomials are constrained to be continuous at $age=50$.
<br>__Bottom Left:__ The cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives. 
<br>__Bottom Right:__ A linear spline is shown, which is constrained to be continuous.

In the lower left plot, we have added two additional constraints: now both the first and second derivatives of the piecewise polynomials are continuous at $age=50$. In other words, we are requiring that the piecewise polynomial be not only continuous when $age=50$, but also very smooth. Each constraint that we impose on the piecewise cubic polynomials effectively frees up one degree of freedom, by reducing the complexity of the resulting piecewise polynomial fit. So in the top left plot, we are using eight degrees of freedom, but in the bottom left plot we imposed three constraints (continuity, continuity of the first derivative, and continuity of the second derivative) and so are left with five degrees of freedom. The curve in the bottom left plot is called a cubic spline. In general, a cubic spline with $K$ knots uses a total of $4 + K$ degrees of freedom.
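
As a quick check of this count: a piecewise cubic with $K$ knots has $K + 1$ pieces with four coefficients each, and each knot contributes three constraints, so the number of free parameters is

<font size=5><center> $ 4(K + 1) - 3K = K + 4. $ </center></font>

With the single knot at $age=50$, this gives $8 - 3 = 5$, which matches the eight and five degrees of freedom quoted for the top left and bottom left panels.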

In <a href="#Figure7.3">Figure 7.3</a>, the lower right plot is a linear spline, which is continuous at $age=50$. The general definition of a degree-$d$ spline is that it is a piecewise degree-$d$ polynomial, with continuity in derivatives up to degree $d - 1$ at each knot. Therefore, a linear spline is obtained by fitting a line in each region of the predictor space defined by the knots, requiring continuity at each knot.

### 7.4.3 The Spline Basis Representation

<a id="Formula7.9"></a>
<font size=5><center> $ y_{i} = \beta_{0} + \beta_{1} b_{1} (x_{i} ) + \beta_{2} b_{2} (x_{i} ) + \dots + \beta_{K+3} b_{K+3} (x_{i} ) + \epsilon_{i}, $ </center></font>

<a id="Formula7.10"></a>
<font size=5><center> $ h(x,\xi) = (x - \xi)^3_{+} = \begin{cases} (x-\xi)^3 & \text{if } x \gt \xi \\ 0 & \text{otherwise} \end{cases} $ </center></font>
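
One common way to use these two formulas together is to take $x$, $x^2$, $x^3$ and one truncated power function $h(x, \xi_{k})$ per knot as the $K + 3$ basis functions, and then fit by least squares. The sketch below assumes that choice of basis; the function names and the knot locations are hypothetical.

```python
import numpy as np

def truncated_power(x, knot):
    """h(x, xi) = (x - xi)^3 when x > xi, and 0 otherwise."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

def cubic_spline_basis(x, knots):
    """Columns b_1..b_{K+3}: x, x^2, x^3, h(x, xi_1), ..., h(x, xi_K)."""
    cols = [x, x**2, x**3] + [truncated_power(x, k) for k in knots]
    return np.column_stack(cols)

def fit_cubic_spline(x, y, knots):
    """Least squares on an intercept plus the K + 3 basis columns (K + 4 coefficients)."""
    X = np.column_stack([np.ones(len(x)), cubic_spline_basis(x, knots)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic data (not the Wage data) that is itself a cubic spline with a knot at 0.5.
x = np.linspace(0, 1, 100)
y = 1 + x**3 + 2 * truncated_power(x, 0.5)
beta_hat = fit_cubic_spline(x, y, knots=[0.5])
```

Each truncated power term and its first two derivatives are zero at the knot, so adding it changes only the third derivative there. That is why the fitted curve stays continuous with continuous first and second derivatives.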

<a id="Figure7.4"></a>
![image.png](Figures/Figure7.4.png)
>__FIGURE 7.4.__ A `cubic spline` and a `natural cubic spline`, with `three knots`, `fit` to a `subset of the Wage data`.


### 7.4.4 Choosing the Number and Locations of the Knots

<a id="Figure7.5"></a>
![image.png](Figures/Figure7.5.png)
>__FIGURE 7.5.__ A `natural cubic spline` function with __`four degrees of freedom`__ is `fit` to the `Wage data`. 
<br>__Left:__ A `spline is fit` to `wage` (in thousands of dollars) as a function of `age`. 
<br>__Right:__ `Logistic regression` is used to `model` the `binary event` $wage>250$ as a `function of age`. The fitted posterior probability of wage exceeding &dollar;250,000 is shown.

<a id="Figure7.6"></a>
![image.png](Figures/Figure7.6.png)
>__FIGURE 7.6.__ `Ten-fold cross-validated` __mean squared errors__ for `selecting` the __`degrees of freedom`__ when fitting splines to the `Wage data`. The `response` is wage and the `predictor age`. 
<br>__Left:__ A natural cubic spline. 
<br>__Right:__ A cubic spline.



### 7.4.5 Comparison to Polynomial Regression

<a id="Figure7.7"></a>
![image.png](Figures/Figure7.7.png)
>__FIGURE 7.7.__ On the `Wage data set`, a `natural cubic spline` with __`15 degrees of freedom`__ is `compared` to a __`degree-15 polynomial`__. <br>`Polynomials` can show wild behavior, especially near the tails.



## 7.5 Smoothing Splines
### 7.5.1 An Overview of Smoothing Splines

<a id="Formula7.11"></a>
<font size=5><center> $ \sum_{i=1}^{n}(y_{i}-g(x_{i}))^2 + \lambda \int g'' (t)^2 dt $ </center></font>
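
To make the two terms of this criterion concrete, the sketch below evaluates it numerically for a given candidate function $g$. It does not minimize the criterion. The function name `smoothing_criterion` is hypothetical, and the integral is approximated only over the observed range of $x$ using finite differences and the trapezoid rule, both of which are assumptions made for this example.

```python
import numpy as np

def smoothing_criterion(g, x, y, lam, grid_size=2001):
    """Approximate RSS + lam * integral of g''(t)^2 over [min(x), max(x)]."""
    # Loss term: residual sum of squares at the observed points.
    rss = np.sum((y - g(x)) ** 2)

    # Penalty term: second derivative by repeated finite differences on a uniform grid.
    t = np.linspace(x.min(), x.max(), grid_size)
    h = t[1] - t[0]
    g1 = np.gradient(g(t), h, edge_order=2)
    g2 = np.gradient(g1, h, edge_order=2)

    # Trapezoid rule for the integral of g''(t)^2.
    f = g2 ** 2
    penalty = h * (np.sum(f) - 0.5 * (f[0] + f[-1]))
    return rss + lam * penalty

# Example: g(t) = t^2 interpolates the data, so only the penalty contributes.
x = np.linspace(0, 1, 11)
y = x**2
value = smoothing_criterion(lambda t: t**2, x, y, lam=0.5)
```

A straight line has $g''(t) = 0$ everywhere and so pays no penalty, while a wiggly function that fits the data closely is charged through the second term. The parameter $\lambda$ sets the exchange rate between the two.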

### 7.5.2 Choosing the Smoothing Parameter $\lambda$

## 7.6 Local Regression

## 7.7 Generalized Additive Models

### 7.7.1 GAMs for Regression Problems

### 7.7.2 GAMs for Classification Problems