# Chapter 7 - Moving Beyond Linearity

## 7.0 Introduction to the Model Methods

* We are going to explore methods that extend past linearity: polynomial regression, step functions, regression splines, smoothing splines, local regression, and generalized additive models.

1. **Polynomial Regression** - take the linear model and add extra predictors raised to a certain power

1. **Step Functions** - cut the range of $X$ into $K$ distinct regions in order to produce a qualitative variable. This gives a piecewise constant function.

1. **Regression Splines** - a combination of the previous two. The range of $X$ is cut into $K$ regions, and a polynomial is fit within each region. The polynomials are constrained so that they join smoothly at the region boundaries, called *knots*.

1. **Smoothing Splines** - similar to regression splines, but they result from minimizing the residual sum of squares subject to a smoothness penalty.

1. **Local Regression** - similar to the splines except the regions are now allowed to overlap.

1. **Generalized Additive Models** - allow us to extend the methods above to deal with multiple predictors

## 7.1 Polynomial Regression

* Take the linear model $ y = β_{0} + β_{1}X_{i} + ϵ $


* Transform it into the polynomial version of degree $d$: $ y = β_{0} + β_{1}X_{i} + β_{2}X_{i}^{2} + ... + β_{d}X_{i}^{d} + ϵ $
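A minimal sketch of how the polynomial version can be fit, assuming NumPy is available. The helper name `fit_polynomial` and the noise-free example data are made up for illustration. The point is that polynomial regression is still ordinary least squares, just on the columns $1, X, X^{2}, ..., X^{d}$.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of y on 1, x, x^2, ..., x^degree (hypothetical helper)."""
    # increasing=True orders the columns x^0, x^1, ..., so beta[0] is the intercept.
    X = np.vander(x, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Illustrative noise-free data generated from y = 1 + 2x - 0.5x^3
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**3

beta = fit_polynomial(x, y, degree=3)  # estimates of beta_0, ..., beta_3
```

Because the data here has no noise, the estimated coefficients should match the generating ones up to rounding. With real data the coefficients themselves are rarely of interest; the fitted curve is.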


## 7.2 Step Functions

* Polynomial regression imposes a global structure on the non-linear function of $X$. Instead, we can break the range of $X$ into bins and fit a different constant in each bin. This converts a continuous variable into an ordered categorical variable.

* Assume we make $K$ cuts $c_{1}, ..., c_{K}$ in the range of $X$ and construct $K+1$ new variables:

>> $ C_{0} = I(X <c_{1})$

>> $ C_{1} = I(c_{1}≤ X < c_{2}) $

>> $ \vdots $

>> $ C_{K} = I(c_{K} ≤ X) $

* $I$ is an indicator function that returns 1 if the condition is true and 0 if it is false. Since $X$ must fall in exactly one of the $K+1$ intervals, $ C_{0} + C_{1} + ... + C_{K} = 1 $ always holds.

* Fit a linear model by least squares using $C_{1}, ..., C_{K}$ as predictors. $C_{0}$ is left out because it is redundant with the intercept, so $β_{0}$ is the mean response when $X < c_{1}$:

>> $ y = β_{0} + β_{1}C_{1}(x_{i}) + ... + β_{K}C_{K}(x_{i}) + ϵ $

* The same predictors can also be used in a logistic regression.
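The indicator variables and the resulting fit can be sketched as follows, assuming NumPy. The function name `step_design_matrix` and the tiny data set are made up for illustration. `np.digitize` assigns each $x$ to its bin, which is then turned into the columns $C_{0}, ..., C_{K}$.

```python
import numpy as np

def step_design_matrix(x, cuts):
    """Return the n x (K+1) matrix with columns C_0(x), ..., C_K(x)."""
    cuts = np.asarray(cuts)
    # np.digitize gives j with cuts[j-1] <= x < cuts[j], so j is the bin index 0..K.
    bins = np.digitize(x, cuts)
    C = np.zeros((len(x), len(cuts) + 1))
    C[np.arange(len(x)), bins] = 1.0
    return C

# Illustrative data: three bins separated by cuts at 4 and 7
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 4.0, 5.0, 6.0])
C = step_design_matrix(x, cuts=[4.0, 7.0])

# Drop C_0 and use an intercept instead, as in the formula above.
X = np.column_stack([np.ones(len(x)), C[:, 1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In this sketch `beta[0]` is the mean of `y` in the first bin, and each later coefficient is the difference between that bin's mean and the first bin's mean.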


## 7.3 Basis Functions

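* Polynomial regression and step functions are special cases of the basis function approach. For polynomial regression $b_{j}(x_{i}) = x_{i}^{j}$, and for step functions $b_{j}(x_{i}) = I(c_{j} ≤ x_{i} < c_{j+1})$.

* The idea is to apply a family of fixed, known functions $b_{1}, ..., b_{K}$ to $X$ and fit a linear model of the form

>> $ y = β_{0} + β_{1}b_{1}(x_{i}) + ... + β_{K}b_{K}(x_{i}) + ϵ $

* There are many other choices of basis functions, such as wavelets or Fourier series. The next section covers another common choice: regression splines.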




## 7.4 Regression Splines

* A class of basis functions that combines the polynomial and step function ideas we just covered.

### 7.4.1 Piecewise Polynomials

* Let's pick a polynomial of degree 3. The standard cubic polynomial is
  * $ y = β_{0} + β_{1}X_{i} + β_{2}X_{i}^{2} + β_{3}X_{i}^{3} + ϵ $
  * A piecewise polynomial fits separate low-degree polynomials over different regions of $X$. The points where the coefficients change are called knots.
  * With a single knot at $c$ we get two pieces:
     $$ y = \begin{cases}
       β_{01} + β_{11}X_{i} + β_{21}X_{i}^{2} + β_{31}X_{i}^{3} + ϵ & \text{if } X_{i} < c  \\
       β_{02} + β_{12}X_{i} + β_{22}X_{i}^{2} + β_{32}X_{i}^{3} + ϵ & \text{if } X_{i} ≥ c
     \end{cases} $$
* The more knots we use, the more flexible the model. With $K$ knots we fit $K+1$ different cubic polynomials.

* However, this method can produce a fitted curve that is discontinuous at the knots.
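To see the discontinuity concretely, here is a small sketch assuming NumPy. The simulated data and the knot at `c = 5.0` are made up for illustration. It fits one cubic to the left of the knot and another to the right, then evaluates both at the knot.

```python
import numpy as np

# Illustrative noisy data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 100))
y = np.sin(x) + rng.normal(0.0, 0.3, size=x.size)
c = 5.0  # single knot (assumed location)

# Two unconstrained cubic fits, one on each side of the knot.
left = x < c
coef_left = np.polyfit(x[left], y[left], deg=3)
coef_right = np.polyfit(x[~left], y[~left], deg=3)

# Nothing forces the two pieces to agree at x = c, so there is a jump.
jump = np.polyval(coef_right, c) - np.polyval(coef_left, c)
n_params = coef_left.size + coef_right.size  # 4 coefficients per piece
```

The two pieces use 8 coefficients in total, and `jump` is generally non-zero because each piece only sees its own half of the data.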


### 7.4.2 Constraints and Splines

* **Constraint 1** The fitted curve must be continuous at the knot. This removes the jump, but the curve can still have a sharp kink at the knot, like a $V$.

* **Constraint 2** We fix the kink by also requiring the first and second derivatives of the piecewise polynomials to be continuous at the knot.

* Each constraint we impose frees up one degree of freedom by reducing the complexity of the piecewise fit. Continuity of the function, of its first derivative, and of its second derivative are three constraints, so together Constraint 1 and Constraint 2 free up 3 degrees of freedom.

* Previously we had 8 degrees of freedom: one knot $c$ gives two cubic pieces, each with 4 coefficients. With the three constraints we are left with 5 degrees of freedom.

* This creates a *cubic spline*. In general, a cubic spline with $K$ knots uses $4 + K$ degrees of freedom.

* In general, a *degree-$d$ spline* is a piecewise degree-$d$ polynomial with continuity in derivatives up to degree $d-1$ at each knot. A *linear spline* ($d = 1$) is therefore a piecewise linear fit that is continuous at each knot.
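The degrees-of-freedom count generalizes from one knot to $K$ knots. A short worked version, assuming three continuity constraints at every knot as described above:

```latex
\begin{aligned}
\text{unconstrained piecewise cubic } (K+1 \text{ pieces}): \quad & 4(K+1) \\
\text{constraints (3 per knot)}: \quad & 3K \\
\text{cubic spline}: \quad & 4(K+1) - 3K = K + 4
\end{aligned}
```

With $K = 1$ this gives $8 - 3 = 5$, matching the single-knot example.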

### 7.4.3 The Spline Basis Representation

* Use the basis formula to represent a *cubic spline* with $K$ knots
   * $ y = β_{0} + β_{1}b_{1}(x_{i}) + ... + β_{K+3}b_{K+3}(x_{i}) + ϵ $
* The usual approach is to start with the basis for a cubic polynomial ($x$, $x^{2}$, $x^{3}$) and then add one truncated power basis function per knot.
  * The truncated power basis function for a knot $ξ$ is $h(x, ξ) = (x - ξ)^{3}$ if $x > ξ$ and 0 otherwise.
  * We add one such term $h(x, ξ)$ for each knot.
  * With $K$ knots this gives $K + 3$ predictors plus the intercept, so $K + 4$ coefficients (degrees of freedom) in total.
  * Adding these terms does not break constraints 1 and 2: because of the power of 3, $h$ and its first and second derivatives are continuous at $ξ$. Only the third derivative is discontinuous there.

* A *natural spline* is a regression spline with additional boundary constraints: the function is required to be linear in the boundary regions, where $X$ is smaller than the smallest knot or larger than the largest knot.

* Natural splines generally produce more stable estimates near the boundaries.

* Regression splines are great but can have high variance at the outer range of the predictor, when $X$ takes a very small or very large value. Natural splines reduce this problem.
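The truncated power basis from this section can be sketched in a few lines, assuming NumPy. The function names, the simulated data, and the knot locations are made up for illustration. This builds a plain cubic spline; the extra boundary constraints of a natural spline are not implemented here.

```python
import numpy as np

def truncated_cubic(x, knot):
    """h(x, knot) = (x - knot)^3 if x > knot, else 0."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

def cubic_spline_basis(x, knots):
    """Columns: 1, x, x^2, x^3, h(x, knot_1), ..., h(x, knot_K)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [truncated_cubic(x, k) for k in knots]
    return np.column_stack(cols)

# Illustrative noisy data and three assumed knots
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(x) + rng.normal(0.0, 0.2, size=x.size)
knots = [2.5, 5.0, 7.5]

B = cubic_spline_basis(x, knots)              # shape (n, K + 4)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # ordinary least squares
fitted = B @ beta
```

The design matrix has $K + 4$ columns, one per coefficient, which is where the $K + 4$ degrees of freedom come from.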

### 7.4.4 Choosing the Number and Location of the Knots

* We could place more knots where the function seems to vary quickly and fewer where it is stable, or just place them uniformly.

* In practice the knots are usually spread out uniformly, for example at evenly spaced quantiles of the data.

* How many knots to pick? One option is to try several values and look at the resulting curves, then pick the one that looks best. This is rather subjective, though.

* A more objective way is to use cross validation. Split off a test set, then on the training set run $k$-fold cross validation (typically $k = 5$ or $k = 10$). Try a range of knot counts and pick the one that gives the smallest cross-validated RSS.

* With multiple predictors (variables) we typically don't run this CV for every variable, since it would require a lot of computation. Instead we just pick a fixed number of degrees of freedom for each term.
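A minimal sketch of choosing the number of knots by cross validation, assuming NumPy. The helper names, the simulated data, and the candidate range of 0 to 8 knots are all made up for illustration. Knots are placed at evenly spaced quantiles of each training fold, and the score is the mean squared validation error (RSS divided by the number of held-out points).

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic spline design matrix using the truncated power basis."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

def cv_mse(x, y, n_knots, n_folds=5, seed=0):
    """Cross-validated mean squared error of a cubic spline with n_knots knots."""
    rng = np.random.default_rng(seed)  # same seed, so every n_knots sees the same folds
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        # Interior quantiles of the training x values; no knots means a plain cubic.
        probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
        knots = np.quantile(x[train], probs) if n_knots > 0 else np.array([])
        beta, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        pred = spline_basis(x[held_out], knots) @ beta
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

# Illustrative data: a sine curve that a single cubic cannot follow
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, 200)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.size)

scores = {k: cv_mse(x, y, k) for k in range(0, 9)}
best_k = min(scores, key=scores.get)
```

In a full workflow the test set would be held out before this loop and used only once, to assess the spline with `best_k` knots.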

### 7.4.5 Comparison to Polynomial Regression

* Cubic splines normally outperform polynomial regression. To get a flexible fit, polynomial regression may need a very high degree, such as 17, which tends to behave wildly near the boundaries. Cubic splines keep the degree fixed at 3 and gain flexibility by adding knots instead, which produces more stable estimates.

* Also with cubic splines we can place more knots in regions where the function changes quickly and fewer knots where it is stable. Polynomial regression cannot do this.

## 7.5 Smoothing Splines

* For regression splines we picked a number of knots, produced a sequence of basis functions, and then used least squares to estimate the spline coefficients. Now we go over a different way to produce a spline.

### 7.5.1 An Overview of Smoothing Splines

* We want to find a function $g(x)$ that fits the data well, that is, one that makes the RSS small. However, if we don't constrain $g(x)$ we can always choose a $g$ that interpolates every $y_{i}$, giving an RSS of zero and badly overfitting the data.

* So we want to make $g(x)$ smooth and make RSS small. How do we do this? One way is to find the $g$ that minimizes
  * $∑_{i=1}^{n}(y_{i} - g(x_{i}))^2 + λ∫g^{''}(t)^2dt$
  * λ is a non-negative tuning parameter
  * The formula takes the form of loss + penalty.
    * The loss is the RSS term
    * The penalty is the new λ term
    * The second derivative measures how fast the slope changes. The more wiggly $g$ is near $t$, the larger its absolute value. The smoother $g$ is, the closer it is to zero.
    * The integral is a summation over the range of $t$.
    * Thus the penalty can be thought of as the total change in the slope over the whole range. If $g$ is jumpy the penalty is high, if it is smooth the penalty is low.
    * λ = 0: there is no penalty, so $g$ will be extremely jumpy and will exactly interpolate the training data
    * λ → ∞: $g$ will be a straight line, the linear least squares fit
    * Thus we need to pick a λ that balances fitting the data against smoothness
  * The function that minimizes this criterion is a natural cubic spline with knots at each unique value of $x_{i}$, but a shrunken version of it. So it is close to the formula in 7.4 with more knots, shrunk by an amount controlled by λ
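The effect of λ can be illustrated with a simplified discrete analogue, assuming NumPy and equally spaced $x$ values. This is not the natural cubic spline solution itself: the integral of $g^{''}(t)^2$ is swapped for a sum of squared second differences of the fitted values, which keeps the same loss + penalty structure. The function name and data are made up for illustration.

```python
import numpy as np

def penalized_smoother(y, lam):
    """Minimize sum((y - g)^2) + lam * sum((second differences of g)^2).

    Discrete stand-in for the smoothing spline criterion; assumes the
    observations sit on an equally spaced grid.
    """
    n = len(y)
    # D is the (n-2) x n second-difference operator: (D g)_i = g_i - 2 g_{i+1} + g_{i+2}.
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives the linear system (I + lam * D^T D) g = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def roughness(g):
    """Sum of squared second differences, the discrete penalty term."""
    return float(np.sum(np.diff(g, n=2) ** 2))

# Illustrative noisy data on an even grid
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.2, size=t.size)

g_rough = penalized_smoother(y, lam=0.0)   # no penalty: reproduces y exactly
g_mid = penalized_smoother(y, lam=10.0)    # moderate smoothing
g_line = penalized_smoother(y, lam=1e8)    # huge penalty: essentially a straight line
```

The three fits mirror the bullets above: λ = 0 interpolates the data, a very large λ collapses to a line, and values in between trade fit against smoothness.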

### 7.5.2 Choosing the Smoothing Parameter λ

* It might seem that a smoothing spline has far too many degrees of freedom, since there is a knot at each unique $x_{i}$.

* However, λ controls the *effective degrees of freedom*. We use this instead of the usual degrees of freedom because the many spline coefficients are heavily constrained (shrunk) rather than free. As λ goes from 0 to ∞, the effective degrees of freedom decrease from $n$ to 2.

* Thus we need to pick the λ that gives the best effective degrees of freedom for the data.

* We can use cross validation. In particular, leave-one-out cross validation (LOOCV) can be computed very efficiently for smoothing splines, at essentially the cost of a single fit.
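Both ideas can be written down compactly. The notation $S_{λ}$ is not used elsewhere in these notes: it stands for the $n × n$ smoother matrix that maps the responses to the fitted values for a given λ, and $\hat{g}_{λ}^{(-i)}$ is the fit computed without observation $i$.

```latex
\hat{g}_{\lambda} = S_{\lambda}\, y,
\qquad
df_{\lambda} = \sum_{i=1}^{n} \{S_{\lambda}\}_{ii}

RSS_{cv}(\lambda)
= \sum_{i=1}^{n} \left( y_{i} - \hat{g}_{\lambda}^{(-i)}(x_{i}) \right)^{2}
= \sum_{i=1}^{n} \left[ \frac{y_{i} - \hat{g}_{\lambda}(x_{i})}{1 - \{S_{\lambda}\}_{ii}} \right]^{2}
```

The second equality is why LOOCV is cheap here: the right-hand side only needs the single fit $\hat{g}_{λ}$ on all the data, not $n$ separate refits.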

## 7.6 Local Regression

* Compute the fit at a target point $x_{0}$ using only the nearby training observations.

* We need to pick a span $s$ to run the algorithm. We can choose the span with cross validation. The span is the proportion of training points used to compute the fit at a given point.
  * Each point in the neighborhood also gets a weight: the closest point gets the highest weight and the furthest point in the neighborhood gets weight zero. We also choose whether to fit a constant, linear, or quadratic regression locally.
  * The smaller the $s$, the more local and wiggly the fit
  * The larger the $s$, the more rigid the fit. A very large $s$ gives a global fit that uses all the training observations.

* Local regression is very useful for varying coefficient models: a multiple linear regression that is global in some variables but local in another, such as time.

* It also works well when we want a fit that is local in a pair of variables $X_{1}$ and $X_{2}$, using two-dimensional neighborhoods. However, like K nearest neighbors, it suffers in high dimensions because there are few training points close to $x_{0}$, and it can perform poorly when $p$ is much larger than about 3 or 4.

* Different from K nearest neighbors since we assign less weight to points further away from $x_{0}$ within the span $s$, and then fit a weighted least squares regression at each target point instead of taking a plain average.
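A minimal sketch of local linear regression at a single target point, assuming NumPy. The function name and the data are made up, and the tricube weight function is an assumption: it is a common choice, but the notes above do not fix a particular weighting.

```python
import numpy as np

def local_linear_fit(x, y, x0, span=0.3):
    """Weighted linear fit at x0 using the fraction `span` of nearest points."""
    k = max(3, int(np.ceil(span * len(x))))  # number of neighbors in the span
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    # Tricube weights (assumed): 1 at x0, falling to 0 at the furthest neighbor.
    d_max = dist[nearest].max()
    w = (1.0 - (dist[nearest] / d_max) ** 3) ** 3
    # Weighted least squares: scale rows by sqrt(w), then solve ordinary least squares.
    X = np.column_stack([np.ones(k), x[nearest]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return beta[0] + beta[1] * x0

# Sanity check on exactly linear data: the local fit should reproduce the line.
x = np.linspace(0.0, 10.0, 100)
y_line = 3.0 + 2.0 * x
fit_at_4 = local_linear_fit(x, y_line, x0=4.0, span=0.3)

# Illustrative noisy curve, fitted point by point with a smaller span.
rng = np.random.default_rng(0)
y_noisy = np.sin(x) + rng.normal(0.0, 0.2, size=x.size)
curve = np.array([local_linear_fit(x, y_noisy, x0, span=0.2) for x0 in x])
```

Note that a separate weighted regression is solved for every target point, which is why the whole training set has to be kept around at prediction time.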

## 7.7 Generalized Additive Models

* In the previous sections we learned a bunch of models that extend the linear model through basis functions of a single variable. What if we have several predictors? Then we use Generalized Additive Models (GAMs).

* GAMs can be applied with both quantitative and qualitative responses.

* A GAM extends the linear model by allowing non-linear functions of each variable while maintaining additivity.

### 7.7.1 GAMs for Regression Problems (Quantitative Response)

* A natural way to extend the multiple linear regression model
  * $ y = β_{0} + β_{1}X_{i1} + ... + β_{n}X_{in} + ϵ $
  * is to replace each linear term $β_{j}X_{ij}$ with a smooth non-linear function $f_{j}(X_{ij})$
* Now we can write it as
  * $ y = β_{0} + ∑_{j=1}^{n}f_{j}(X_{ij}) + ϵ $
  * We call this model additive since we compute a separate $f_{j}$ for each predictor and then add them together.
* Since the model is additive, we can choose a different method for each $f_{j}(X_{ij})$.
  * We can use a smoothing spline for one predictor, a regression spline for another, a polynomial or local regression for a third, and dummy variables (one-hot encoding) for a qualitative predictor.

* An advantage of additivity is that we can examine the effect of each predictor on the response individually while holding all the other variables fixed.

* A disadvantage is that, because the model is additive, it can miss important interactions between variables. We can add low-dimensional interaction terms manually, for example $X_{j} × X_{k}$ as an extra predictor or a two-variable function $f_{jk}(X_{j}, X_{k})$.
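When every $f_{j}$ is built from basis functions (polynomials, regression splines, dummy variables), the whole additive model is just one big least squares regression on the stacked basis columns. Here is a sketch assuming NumPy; the helper names, the simulated data, and the choice of a degree-2 polynomial, a degree-3 polynomial, and dummy variables for the three terms are all made up for illustration. Terms such as smoothing splines cannot be fit this way and need an iterative approach like backfitting, which is not shown.

```python
import numpy as np

def poly_terms(x, degree):
    """Columns x, x^2, ..., x^degree (no intercept)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

def dummy_terms(labels):
    """One-hot columns for every level except the first (the baseline)."""
    levels = np.unique(labels)
    return np.column_stack([(labels == lv).astype(float) for lv in levels[1:]])

# Illustrative data: two quantitative predictors and one qualitative predictor
rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(-2.0, 2.0, n)
x2 = rng.uniform(0.0, 3.0, n)
x3 = rng.choice(np.array(["a", "b", "c"]), size=n)

# Additive signal: each predictor contributes its own function, with no interactions.
signal = (1.0 + x1**2 + (2.0 * x2 - 0.3 * x2**3)
          + np.select([x3 == "b", x3 == "c"], [1.5, -1.0], default=0.0))
y = signal + rng.normal(0.0, 0.1, size=n)

# One block of basis columns per predictor, stacked side by side.
X = np.column_stack([np.ones(n), poly_terms(x1, 2), poly_terms(x2, 3), dummy_terms(x3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Because the model is additive, each predictor's contribution can be read off separately.
f1_hat = poly_terms(x1, 2) @ beta[1:3]
f2_hat = poly_terms(x2, 3) @ beta[3:6]
```

Plotting `f1_hat` against `x1` (and likewise for `x2`) shows the estimated effect of one predictor with the others held fixed, which is the interpretability advantage described above.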

### 7.7.2 GAMs for Classification Problems

* The pros and cons are exactly the same as in 7.7.1

* The right-hand side of the formula is the same as before. The left-hand side is the log odds, as in the normal logistic regression formula, where $p(X) = Pr(Y = 1|X)$.

  * $ \log( \frac {p(X)}{1- p(X)}) = β_{0} + ∑_{j=1}^{n}f_{j}(X_{ij}) $

* We can fit each $f_{j}(X_{ij})$ using the same choice of methods as in 7.7.1
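A minimal sketch of a logistic GAM, assuming NumPy and assuming each $f_{j}$ is a quadratic polynomial so the model reduces to logistic regression on basis columns. The helper names and the simulated data are made up, and the plain Newton-Raphson loop is just one simple way to maximize the likelihood, not a recommended production fitter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(B, y, n_iter=25):
    """Newton-Raphson for logistic regression on the design matrix B."""
    beta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        p = sigmoid(B @ beta)
        w = p * (1.0 - p)
        hessian = B.T @ (B * w[:, None])  # B^T W B
        gradient = B.T @ (y - p)          # score vector
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def log_likelihood(B, y, beta):
    p = sigmoid(B @ beta)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Illustrative data: the log odds are additive but non-linear in x1
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(-2.0, 2.0, n)
x2 = rng.uniform(-2.0, 2.0, n)
true_logit = -1.0 + x1**2 - x2
y = (rng.uniform(size=n) < sigmoid(true_logit)).astype(float)

# Additive model: a quadratic basis for each predictor.
B_gam = np.column_stack([np.ones(n), x1, x1**2, x2, x2**2])
beta_gam = fit_logistic(B_gam, y)

# Plain logistic regression for comparison: linear in both predictors.
B_lin = np.column_stack([np.ones(n), x1, x2])
beta_lin = fit_logistic(B_lin, y)

ll_gam = log_likelihood(B_gam, y, beta_gam)
ll_lin = log_likelihood(B_lin, y, beta_lin)
```

On this simulated data the additive fit should reach a higher log-likelihood than the linear one, since the linear model has no way to represent the $x_{1}^{2}$ effect.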