## Nonlinear Methods

One immediate way to extend linear models to capture nonlinearity is to use ___polynomial functions___ in linear regression. The coefficients can be estimated in the same way, by least squares, as this is just the standard linear model with predictors $x_i, x_i^2, x_i^3, \dots$. Although this captures nonlinearity, a polynomial of a given degree takes a specific global shape and is hence not very flexible. High-degree polynomial fits also tend to overfit. Polynomial regression works well in several settings, but there are other nonlinear methods that are more flexible.
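As a minimal sketch of the point that polynomial regression is ordinary least squares on the powers of $x$, the following builds the design matrix explicitly with NumPy. The helper name `fit_polynomial` and the noise-free synthetic data are assumptions made for illustration only.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of y on the predictors 1, x, x^2, ..., x^degree."""
    # Columns are x^0, x^1, ..., x^degree: an ordinary linear-model design matrix.
    X = np.vander(x, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic data (assumed): an exact cubic with no noise.
x = np.linspace(-2.0, 2.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x**3

# Coefficients come back in increasing order of power.
beta = fit_polynomial(x, y, degree=3)
```

Because the data here are an exact cubic, the fit should recover the generating coefficients up to floating-point error. With noisy data the same call applies unchanged.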

### Basis Functions and Splines
The idea behind basis functions is to have a family of functions $b_1, b_2, \dots, b_k$ that can be applied to $X$. We can then fit the model

$$y_i = \beta_0 + \beta_1b_1(x_i) + ... + \beta_kb_k(x_i) + \epsilon_i$$

This is just a standard linear model with predictors $b_1(x_i), b_2(x_i), \dots, b_k(x_i)$, and can be fit using least squares. Polynomial regression can be seen as a special case of basis functions where $b_j(x_i) = x_i^j$. We could also divide the input space $X$ into bins and fit a piecewise constant function. These are known as ___Step Functions___, which can be treated as a special case of basis functions with $b_j(x_i) = I(c_j \le x_i < c_{j+1})$, where $I()$ is an indicator function that returns 1 if the condition is true and 0 otherwise, and $c_1, c_2, \dots, c_k$ are the points that divide $X$ into bins. The bins do not overlap, so for any given value of $X$ at most one of the $b_j$ can be non-zero. Many other choices fit under the framework of basis functions, thereby expanding our repertoire of models.
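A step-function fit can be sketched as a least-squares fit on indicator columns. The helper `step_basis`, the cut points, and the data below are hypothetical. The sketch also assumes half-open bins, so that each point falls in exactly one bin, and it drops the intercept in favour of one indicator per bin. That is an equivalent parameterization in which each coefficient is simply the mean response in its bin.

```python
import numpy as np

def step_basis(x, cuts):
    """One indicator column per bin defined by the sorted interior cut points."""
    # np.digitize gives the bin index of each x:
    # 0 for x < cuts[0], ..., len(cuts) for x >= cuts[-1].
    bins = np.digitize(x, cuts)
    B = np.zeros((len(x), len(cuts) + 1))
    B[np.arange(len(x)), bins] = 1.0
    return B

# Toy data (assumed): two observations in each of three bins.
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
y = np.array([1.0, 3.0, 10.0, 12.0, 20.0, 22.0])
cuts = np.array([2.0, 4.0])

B = step_basis(x, cuts)
# Least squares on indicator columns returns the per-bin means of y.
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
```

Each row of `B` has exactly one non-zero entry, which is the "at most one basis function is non-zero" property in matrix form.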

#### Regression splines
One could also build a ___Piecewise Polynomial___ model where separate polynomials are fit over different regions of $X$. For example, a piecewise cubic polynomial with a single cut point (aka knot) at $c$ takes the form

\begin{align}
y_i = 
\begin{cases}
\beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i, & \text{if $x_i < c$}\\
\beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i, & \text{if $x_i \ge c$}
\end{cases}
\end{align}

One immediate problem with this method is that the fitted function is generally discontinuous at the knots, and hence the overall function isn't smooth. We can, however, impose constraints that make the fit smooth. For example, for the piecewise cubic polynomial, we can require the function and its first and second derivatives to be continuous at the knots. These are known as ___regression splines___. A degree-$d$ spline is a piecewise degree-$d$ polynomial with continuous derivatives up to order $d-1$ at each knot. The continuity constraints are imposed by adding one ___truncated power___ function per knot to the polynomial basis. For example, a cubic spline with a single knot at $c$ can be modeled as

$$y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \beta_3x_i^3 + \beta_4h(x_i, c) + \epsilon_i$$

where $h(x, c)$ is the truncated power function with exponent $r$, defined as

\begin{align}
h(x, c) = (x - c)_+^r = 
\begin{cases}
(x - c)^r, & \text{if $x > c$}\\
0, & \text{otherwise}
\end{cases}
\end{align}

For the cubic spline, $r = 3$. Adding the truncated power term to the cubic polynomial keeps the function and its first and second derivatives continuous at the knot $c$; only the third derivative can jump there.
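The truncated power construction can be sketched directly: build the columns $1, x, x^2, x^3$ plus one $(x - c)_+^3$ column per knot, then fit by least squares. The function names and the synthetic data are assumptions for illustration. The data are generated from a known one-knot cubic spline so the recovered coefficients can be checked.

```python
import numpy as np

def truncated_power(x, c, r=3):
    """h(x, c) = (x - c)_+^r: (x - c)^r to the right of the knot, 0 otherwise."""
    return np.where(x > c, (x - c) ** r, 0.0)

def cubic_spline_basis(x, knots):
    """Columns 1, x, x^2, x^3, followed by one truncated cubic per knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [truncated_power(x, c) for c in knots]
    return np.column_stack(cols)

# Synthetic data (assumed): an exact cubic spline with a single knot at c = 2.
x = np.linspace(0.0, 4.0, 41)
y = 1.0 + x - 0.5 * x**2 + 0.25 * x**3 + 2.0 * truncated_power(x, 2.0)

B = cubic_spline_basis(x, knots=[2.0])
# Same least-squares machinery as for any other linear model.
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
```

The design matrix has $k + 4$ columns for $k$ knots (here 5). The truncated power basis is easy to read but can be poorly conditioned with many knots, which is one reason practical software often uses a B-spline basis instead.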

In the original piecewise cubic polynomial with a single knot, there are 8 parameters to be determined, and hence the model has 8 ___degrees of freedom___. Each constraint we impose reduces the degrees of freedom by 1. Therefore, with 3 constraints (continuity of the function, the 1st derivative, and the 2nd derivative), the cubic spline has 5 degrees of freedom. In general, a cubic spline with $k$ knots has $k + 4$ degrees of freedom. Cubic splines are generally considered to be the lowest-degree splines for which the discontinuity at the knots isn't visible to the human eye.
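The general count follows from the same bookkeeping: $k$ knots give $k + 1$ cubic pieces with 4 coefficients each, and each knot contributes 3 continuity constraints.

$$\underbrace{4(k + 1)}_{\text{coefficients}} - \underbrace{3k}_{\text{constraints}} = k + 4$$

For $k = 1$ this gives $8 - 3 = 5$, matching the single-knot case.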

Another way to control the fit of the model is by placing more knots in regions where the function varies rapidly, and fewer knots where it is stable. One can also impose constraints at the boundaries of the spline, requiring the function to be linear beyond the outermost knots, which reduces the otherwise common problem of high variance at the boundaries. These are called ___natural splines___. Thus, using the basis representation with splines, one can build models far more flexible than polynomial regression, and hence capture more complicated relationships. The best part is that the fitting procedure doesn't have to change.


#### Smoothing Splines

In fitting a smooth curve to the data, we are really trying to find a function $g$ such that we minimize

$$RSS = \sum_{i = 1}^n(y_i - g(x_i))^2$$

However, if $g$ is left unconstrained, we can drive the RSS to zero by choosing a $g$ that passes through every point, which overfits the data. We need to constrain $g$ so that it is smooth. The roughness of a curve can be measured by the second derivative of the function, as it tells us how the slope is changing. Therefore, to impose the smoothness constraint, we minimize

$$\sum_{i = 1}^n(y_i - g(x_i))^2 + \lambda \int \{g^{\prime \prime}(t)\}^2 dt$$

where $\lambda \ge 0$ is the tuning parameter and the penalty term $\int \{g^{\prime \prime}(t)\}^2 dt$ measures the total change in the slope of the function over its entire range. $\lambda = 0$ leads to a function $g$ that interpolates the data, and $\lambda \to \infty$ forces $g^{\prime \prime} = 0$ everywhere, so $g$ becomes a straight line: the linear least squares fit. The function $g(x)$ that minimizes the above criterion can be shown to be a natural cubic spline with knots at $x_1, x_2, \dots, x_n$. It is not, however, the natural cubic spline one would get by fitting a regression spline with those knots by plain least squares; the smoothing spline is a shrunken version of it, with $\lambda$ controlling the amount of shrinkage.
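The two limits of $\lambda$ can be illustrated with a discrete analogue of the criterion. This is not an exact smoothing spline: it assumes equally spaced $x$ values and replaces the integral of $g^{\prime \prime}(t)^2$ with a sum of squared second differences of the fitted values. The function name and the simulated data are hypothetical.

```python
import numpy as np

def second_difference_smoother(y, lam):
    """Minimise sum (y_i - g_i)^2 + lam * sum (g_{i+1} - 2 g_i + g_{i-1})^2."""
    n = len(y)
    # D maps the fitted values g to their second differences,
    # a discrete stand-in for the second derivative.
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives (I + lam * D^T D) g = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Simulated data (assumed): a noisy sine on an equally spaced grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

g_rough = second_difference_smoother(y, lam=0.0)   # no penalty: reproduces y
g_mid = second_difference_smoother(y, lam=10.0)    # moderate smoothing
g_smooth = second_difference_smoother(y, lam=1e9)  # heavy penalty: nearly linear

# Ordinary least-squares line, for comparison with the heavy-penalty fit.
line = np.polyval(np.polyfit(x, y, deg=1), x)
```

With no penalty the fit reproduces the data, and with a very large penalty it approaches the least-squares line, mirroring the behaviour described above for the continuous criterion.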


## Generalized Additive Models

Generalized Additive Models (GAMs) provide a framework for expanding the standard linear model with nonlinear methods for multiple predictors. We could model the multiple regression with $p$ predictors as

$$y_i = \beta_0 + \sum_{j=1}^pf_j(x_{ij}) + \epsilon_i$$

where each $f_j$ is a smooth nonlinear function. This is very convenient, as we can fit a separate $f_j$ for each predictor $X_j$ and just add them together. This makes GAMs very useful for inference problems, where we can study the effect of a single predictor while holding the others fixed. We could also include functions of the form $f_{jk}(X_j, X_k)$ to capture interactions within the model. GAMs provide considerable flexibility, while still remaining a parametric 