# Table of Contents 
### 1. [Polynomial Regression](#polynomial)
### 2. [Step Functions](#stepfunctions) 
### 3. [Regression Splines](#spline)
#### 3.1 [Choosing the Number and Locations of the Knots](#numberknot)
### 4. [Smoothing Splines ](#smooth) 
### 5. [Local Regression](#localregression)


# Polynomial Regression <a class ="anchor" id="polynomial"></a>

Polynomial regression replaces the standard linear model  
$$y_i = \beta_0 + \beta_1x_i + \epsilon_i$$
with a polynomial function
$$y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + ...+\beta_dx_i^d + \epsilon_i $$
where $\epsilon_i$ is the error term.  
  
For large enough degree *d*, a polynomial regression allows us to produce an extremely non-linear curve. Notice that the coefficients of the polynomial function can be easily estimated using least squares linear regression, because this is just a standard linear model with predictors $x_i, x_i^2, ..., x_i^d$.  

Suppose we have computed the fit at a particular value $x_0$:
$$ \hat{f}(x_0)= \hat\beta_0 + \hat\beta_1x_0 + \hat\beta_2x_0^2+ \hat\beta_3x_0^3+\hat\beta_4x_0^4$$
   
What is the variance of the fit, $\text{Var}[\hat{f}(x_0)]$?   
Least squares returns variance estimates for each of the fitted coefficients $\hat\beta_j$, as well as the covariances between pairs of coefficient estimates. Because $\hat{f}(x_0)$ is a linear function of the $\hat\beta_j$, these are sufficient to compute the estimated variance of $\hat{f}(x_0)$.
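
The variance computation can be sketched in NumPy. This is a minimal illustration on simulated data, not a reference implementation: the helper name `poly_fit_with_se` and the data-generating function are assumptions made for the example. The pointwise variance is computed as $\ell_0^T \hat{C} \ell_0$, where $\hat{C}$ is the estimated covariance matrix of the $\hat\beta_j$ and $\ell_0 = (1, x_0, x_0^2, x_0^3, x_0^4)^T$.

```python
import numpy as np

def poly_fit_with_se(x, y, degree, x0):
    """Least squares polynomial fit; returns the fit and its standard error at x0.

    Hypothetical helper written for this example.
    """
    X = np.vander(x, degree + 1, increasing=True)        # columns 1, x, ..., x^d
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - (degree + 1))      # estimate of Var(eps)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the beta-hats
    L0 = np.vander(np.atleast_1d(x0), degree + 1, increasing=True)
    fit = L0 @ beta
    var_fit = np.einsum("ij,jk,ik->i", L0, cov_beta, L0)  # l0^T C l0 for each x0
    return fit, np.sqrt(var_fit)

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 1 + 2 * x - 0.5 * x**3 + rng.normal(0, 0.3, x.size)

fit, se = poly_fit_with_se(x, y, degree=4, x0=np.array([0.0, 1.9]))
print(fit, se)
```

In this simulation the standard error near the edge of the data ($x_0 = 1.9$) is larger than at the center ($x_0 = 0$), which is the usual behavior of polynomial fits near the boundary.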

# Step Functions <a name="stepfunctions"></a>
We break the range of X into bins, and fit a different constant in each bin. This amounts to converting a continuous variable into an *ordered categorical variable*.  
  
We create cutpoints $c_1, c_2,...,c_K$ in the range of $X$, and then construct $K+1$ new variables: 
$$C_0(X) = I(X<c_1),$$
$$C_1(X) = I(c_1 \leq X<c_2),$$
$$C_2(X) = I(c_2 \leq X<c_3),$$
$$\vdots$$
$$C_{K-1}(X) = I(c_{K-1} \leq X<c_K),$$
$$C_K(X) = I(c_K \leq X),$$
where $I(.)$ is an *indicator function* that returns a 1 if the condition is true, and returns a 0 otherwise. These are sometimes called *dummy variables*.   
Notice that for any value of $X$, $C_0(X)+C_1(X)+...+C_K(X) =1$, since $X$ must be in exactly one of the $K+1$ intervals.   
We then use least squares to fit a linear model using $C_1(X), C_2(X),...,C_K(X)$ as predictors ($C_0(X)$ is left out because it is redundant with the intercept):  
$$y_i = \beta_0 + \beta_1C_1(x_i)+\beta_2C_2(x_i)+...+\beta_KC_K(x_i)+\epsilon_i$$
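
As a rough sketch of the construction above, the dummy variables can be built with `np.digitize` and the model fit by least squares. The cutpoints and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

# Simulated data (assumed): a step-like signal plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.2, x.size)

cutpoints = np.array([2.5, 5.0, 7.5])   # c_1, ..., c_K with K = 3
K = len(cutpoints)

# bins[i] = k means c_k <= x_i < c_{k+1} (k = 0 means x_i < c_1)
bins = np.digitize(x, cutpoints)
C = (bins[:, None] == np.arange(K + 1)).astype(float)   # columns C_0, ..., C_K

# Every observation falls in exactly one of the K + 1 intervals
assert np.allclose(C.sum(axis=1), 1.0)

# Intercept plus C_1, ..., C_K (C_0 is dropped, as in the model above)
X = np.column_stack([np.ones_like(x), C[:, 1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean of y within each bin, for comparison with the coefficients
bin_means = np.array([y[bins == k].mean() for k in range(K + 1)])
print(beta[0], beta[0] + beta[1:])
```

Here $\hat\beta_0$ equals the mean of $y$ in the first bin ($X < c_1$), and $\hat\beta_0 + \hat\beta_j$ equals the mean of $y$ in bin $j$, so each $\hat\beta_j$ is the average change in the response relative to the first bin.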



# Basis Functions
We fit the model:
$$y_i = \beta_0 + \beta_1b_1(x_i)+\beta_2b_2(x_i)+...+\beta_Kb_K(x_i)+\epsilon_i$$
   
Note that the basis functions $b_1(.), b_2(.),...,b_K(.)$ are fixed and known. For polynomial regression, the basis functions are $b_j(x_i) =x_i^j$, and for step functions they are $b_j(x_i) = I(c_j \leq x_i < c_{j+1})$.

# Regression Splines <a name="spline"></a>
## Piecewise Polynomials  
Piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of $X$.   
   
For example, a *piecewise cubic polynomial* with a single knot at a point $c$ takes the form  
  
$$
 y_i = 
  \begin{cases} 
   \beta_{01} + \beta_{11}x_i+\beta_{21}x_i^2 +\beta_{31}x_i^3+ \epsilon_i & \text{if } x_i \geq c \\
   \beta_{02} + \beta_{12}x_i+\beta_{22}x_i^2 +\beta_{32}x_i^3+ \epsilon_i       & \text{if } x_i < c
  \end{cases}
$$
   
It can be written as a single equation using indicator functions:  
$$ y_i = \displaystyle\sum_{j=0}^3 \beta_{j1}x_i^j\,I(x_i \geq c)+\displaystyle\sum_{j=0}^3 \beta_{j2}x_i^j\,I(x_i < c) +\epsilon_i$$
   
where $I(.)$ is the indicator function introduced for step functions. 
  

In other words, we fit two different polynomial functions to the data, one on the subset of the observations with $x_i < c$, and one on the subset of the observations with $x_i \geq c$.  
  
Using more knots leads to a more flexible piecewise polynomial. In general, if we place $K$ different knots throughout the range of $X$, then we will end up fitting $K+1$ different cubic polynomials.  
  
The problem is that the fitted function is, in general, discontinuous at the knot. Moreover, since each cubic polynomial has four parameters, the single-knot model uses a total of eight *degrees of freedom*.  
  
To remedy this problem, we can fit a piecewise polynomial under the constraint that the fitted curve must be continuous at the knot. We can add two further constraints: both the first and second derivatives of the piecewise polynomials must also be continuous at the knot (we take the derivatives of the two polynomials above and set them equal at $c$).   
  
This gives 3 constraint equations, each of which removes one degree of freedom, so we are left with $8 - 3 = 5$ degrees of freedom.   
In general, a cubic spline with $K$ knots uses a total of $K+4$ degrees of freedom.
  


## The Spline Basis Representation 
  
A cubic spline with K knots can be modeled as 
$$y_i = \beta_0 + \beta_1b_1(x_i)+\beta_2b_2(x_i)+...+\beta_{K+3}b_{K+3}(x_i)+\epsilon_i$$ 
  
for an appropriate choice of basis functions $b_1,b_2...b_{K+3}$  
  
Suppose that we have $K$ knots $c_1, c_2,...,c_K$. One simple way to represent a cubic spline is to perform least squares regression with an intercept and $3+K$ predictors of the form $X, X^2, X^3, h(X,c_1),h(X,c_2),...,h(X,c_K)$, where $h(x,c) = (x-c)_+^3$ is a *truncated power basis* function, equal to $(x-c)^3$ if $x > c$ and 0 otherwise:   
  
$$y_i = \beta_0 + \beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\beta_4h(x_i,c_1)+...+\beta_{K+3}h(x_i,c_K) +\epsilon_i$$
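
The basis representation can be tried out with a short NumPy sketch. It assumes the standard truncated power form $h(x,c) = (x-c)^3$ for $x > c$ and 0 otherwise; the function name `cubic_spline_basis`, the simulated data, and the choice of three knots at the quartiles are all illustrative assumptions.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3, h(x, c_1), ..., h(x, c_K).

    h(x, c) = max(x - c, 0)^3 is the truncated power function (assumed form).
    """
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, 4, increasing=True)
    trunc = np.clip(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0, None) ** 3
    return np.hstack([poly, trunc])

# Simulated data (assumed for illustration)
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(0, 0.2, x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])   # K = 3 knots at the quartiles
B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ beta
print(B.shape)   # number of columns is K + 4
```

The design matrix has $K + 4 = 7$ columns (including the intercept), matching the degrees of freedom of a cubic spline with $K$ knots.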
   
### Drawbacks:
Splines can have high variance at the outer range of the predictors (when X is very small or very large).       
    
A **natural cubic spline** is a regression spline with additional **boundary constraints**: the function is required to be linear at the boundary (in the region where $X$ is smaller than the smallest knot, or larger than the largest knot). This additional constraint means that natural splines generally produce more stable estimates at the boundaries.

## Choosing the Number and Locations of the Knots  <a name="numberknot"></a>
### Where should we put the knots? 
The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable. In practice, we can place knots at uniform quantiles of the data.  
### How many knots should we use or equivalently how many degrees of freedom should our spline contain? 
=> Use **cross-validation**: we remove a portion of the data (say 10%), fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion. We repeat this process until each observation has been left out once, and then compute the overall cross-validated RSS. This procedure can be repeated for different numbers of knots $K$. The value of $K$ giving the smallest RSS is then chosen.
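
The cross-validation procedure can be sketched as follows. This is a minimal 10-fold version under stated assumptions: the spline uses a truncated power basis with knots at uniform quantiles of the training data, and the helper names `spline_basis` and `cv_rss`, the simulated data, and the candidate range of $K$ are all hypothetical.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic spline design matrix: 1, x, x^2, x^3 and max(x - c_k, 0)^3 per knot."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, 4, increasing=True)
    trunc = np.clip(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0, None) ** 3
    return np.hstack([poly, trunc])

def cv_rss(x, y, n_knots, n_folds=10, seed=0):
    """Cross-validated RSS of a cubic spline with n_knots knots (n_knots >= 1)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    rss = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        # Knots at uniform quantiles of the training data
        probs = np.linspace(0, 1, n_knots + 2)[1:-1]
        knots = np.quantile(x[train], probs)
        beta, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        pred = spline_basis(x[held_out], knots) @ beta
        rss += np.sum((y[held_out] - pred) ** 2)
    return rss

# Simulated data (assumed for illustration)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)

candidates = range(1, 13)
scores = {k: cv_rss(x, y, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

Each observation is held out exactly once, so the returned RSS sums squared prediction errors over the full data set for each candidate $K$.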
  
### Comparison to Polynomial Regression? 
Regression splines often give superior results to polynomial regression. 
1. Polynomials must use a high degree to produce flexible fits, whereas splines introduce flexibility by increasing the number of knots while keeping the degree fixed. This generally produces more stable estimates.   
2.  Splines also allow us to place more knots, and hence flexibility, over regions where the function $f$ seems to be changing rapidly, and fewer knots where $f$ appears more stable.

# Smoothing Splines <a name="smooth"></a>
In fitting a smooth curve to a set of data, what we really want to do is find some function, say $g(x)$, that fits the observed data well:  
we want $RSS = \displaystyle\sum_{i=1}^n(y_i - g(x_i))^2$ to be small.  
Without any constraint on $g$, however, we can always make the RSS zero by choosing a $g$ that interpolates all of the $y_i$, which overfits the data. To make sure $g$ is also smooth, a natural approach is to find the function $g$ that minimizes 
$$ \displaystyle\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda \int g{''}(t)^2dt$$ 
  
where $\lambda$ is a nonnegative tuning parameter or bias-variance trade-off parameter.   
   
### 1. Interpretation of the term $\int g{''}(t)^2dt$:   
The term $\lambda \int g{''}(t)^2dt$ is a penalty term that penalizes the variability in g. The first derivative $g'(t)$  measures the slope of a function at $t$, and the second derivative corresponds to the amount by which the slope is changing. In other words, $\int g''(t)^2dt$ is simply a measure of the total change in the function $g'(t)$ over its entire range.   
If *g* is very smooth, $g'(t)$ will be close to constant and $\int g''(t)^2dt$ will take on a small value.   
If g is jumpy and variable then $g'(t)$ will vary significantly and $\int g''(t)^2dt$ will take on a large value.  
### 2. Interpretation of $\lambda$
When $\lambda = 0$, the penalty term has no effect, so the function *g* will be very jumpy and will exactly interpolate the training observations.  
When $\lambda \to \infty$, *g* will be perfectly smooth: it will be a straight line that passes as closely as possible to the training points (the linear least squares line).  
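
The role of $\lambda$ can be illustrated numerically. The sketch below is not an exact smoothing spline: it swaps in a discrete analogue on an equally spaced grid, replacing $\int g''(t)^2dt$ by the sum of squared second differences of the fitted values. The function name `penalized_smoother` and the simulated data are assumptions made for the example.

```python
import numpy as np

def penalized_smoother(y, lam):
    """Minimize sum_i (y_i - g_i)^2 + lam * sum_i (g_{i+2} - 2 g_{i+1} + g_i)^2.

    Discrete stand-in for the smoothing spline criterion (assumes equally
    spaced x values); the minimizer solves (I + lam * D^T D) g = y.
    """
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n - 2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Simulated data on an equally spaced grid (assumed for illustration)
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

g_rough = penalized_smoother(y, lam=0.0)    # no penalty: reproduces the data
g_mid = penalized_smoother(y, lam=10.0)     # moderate smoothing
g_line = penalized_smoother(y, lam=1e8)     # heavy penalty: nearly a straight line

line = np.polyval(np.polyfit(x, y, 1), x)   # ordinary least squares line
print(np.max(np.abs(g_line - line)))
```

With `lam=0` the fit reproduces the observations exactly, and with a very large `lam` it is numerically very close to the least squares line, mirroring the two limiting cases described above.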

# Local Regression <a name = "localregression"> </a>
This method involves computing the fit at a target point $x_0$ using only the nearby training observations. 
  
### Algorithm: 
1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.   
2. Assign a weight $K_{i0}= K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these k nearest neighbors get weight zero.   
3. Fit a *weighted least squares regression* of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat\beta_0$ and $\hat\beta_1$ that minimize 
$$ \displaystyle\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1x_i)^2 $$  
  
4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat\beta_0 + \hat\beta_1 x_0$  
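
The four steps above can be sketched in NumPy. This is a minimal illustration, assuming a tricube weight function and a hypothetical helper name `local_linear_fit`; the simulated data and the span value are also assumptions.

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| <= 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u <= 1, (1 - u**3) ** 3, 0.0)

def local_linear_fit(x, y, x0, span=0.3):
    """Local linear regression estimate of f(x0) from the k = span * n nearest points."""
    n = len(x)
    k = max(3, int(np.ceil(span * n)))       # step 1: size of the neighborhood
    dist = np.abs(x - x0)
    h = np.sort(dist)[k - 1]                 # distance to the furthest neighbor
    w = tricube(dist / h)                    # step 2: weights, zero outside the window
    X = np.column_stack([np.ones(n), x])
    sw = np.sqrt(w)                          # step 3: weighted least squares via sqrt-weights
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * x0            # step 4: fitted value at x0

# Simulated data (assumed for illustration)
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.2, x.size)

grid = np.linspace(0.5, 9.5, 50)
fit = np.array([local_linear_fit(x, y, x0, span=0.15) for x0 in grid])
print(local_linear_fit(x, y, 5.0, span=0.15), np.sin(5.0))
```

Note that a separate weighted regression is solved for every target point $x_0$, which is why local regression is more expensive than a single global fit.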
    
### Procedure 
To implement the procedure, we need to specify:  
1. Fraction of data in each local neighborhood (smoothing parameter $s$)   
2. Weight function for the least squares fit
3. Degree of locally fitted polynomial (linear, quadratic...)
4. Number of iterative weighted least squares fits.  
   
### Interpretation: 
#### 1. The span $s$ 
The smaller the value of $s$, the more local and wiggly our fit will be.   
A very large value of $s$ will lead to a global fit to the data using all of the training observations.  
We can use cross-validation to choose $s$, or we can specify it directly.   
   
Equivalently, the span can be expressed as a bandwidth $h(x_0)$, the half-width of the neighborhood around $x_0$.  
For computational and theoretical purposes we define the weight function so that only values within a *smoothing window* $[x_0 - h(x_0), x_0+h(x_0)]$ are considered in the estimate of $f(x_0)$. 

#### 2. The weight $K_{i0}$
Purposes: the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these k nearest neighbors get weight zero.   
This is achieved by considering weight functions that are 0 outside of $[-1,1]$.   
For example, the tricube weight function: 
$$
 W(u) = 
  \begin{cases} 
   (1- |u|^3)^3 & \text{if } |u| \leq 1 \\
   0     & \text{if } |u| >1 
  \end{cases}
$$  
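So
$$K(x_i,x_0) = W\bigg(\frac{x_i-x_0}{h(x_0)}\bigg) = \bigg( 1- \bigg( \frac{d(x_i,x_0)}{\max_l d(x_l,x_0)} \bigg)^3 \bigg)^3$$
where $d(x_i,x_0) = |x_i - x_0|$ and the maximum is taken over the $k$ nearest neighbors of $x_0$, so that $h(x_0) = \max_l d(x_l,x_0)$.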
#### 3. Degree of locally fitted polynomial   
Local regression uses the **Taylor expansion** of the function $f$ around each target point, together with a local weighting of the points, to estimate the fitted values.   
$$f(x) = f(x_0) + \displaystyle\sum_{k=1}^K \frac{f^{(k)}(x_0)}{k!}(x-x_0)^k + o(|x-x_0|^K)   \quad \text{as } |x-x_0| \to 0 $$  
    
    
Case 1: if degree = 1 => local linear regression  
We minimize $\displaystyle\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1x_i)^2$   
   
Case 2: if degree = 2 => local quadratic regression   
$$f(x) \approx \beta_0 + \beta_1(x-x_0) + \frac{1}{2}\beta_2(x-x_0)^2 \quad \text{for } x \in [x_0-h(x_0),x_0+h(x_0)] $$
    
So we minimize $\displaystyle\sum_{i=1}^n K_{i0}\big(y_i - \beta_0 - \beta_1(x_i-x_0) - \frac{1}{2}\beta_2(x_i-x_0)^2\big)^2$  
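
A local polynomial fit of arbitrary degree can be sketched along the same lines. The helper name `local_poly_fit` and the tricube weights are assumptions for illustration. The basis is centered at $x_0$, so the estimate of $f(x_0)$ is simply the fitted intercept; the factorial factors such as $\frac{1}{2}$ are absorbed into the coefficients.

```python
import numpy as np

def local_poly_fit(x, y, x0, degree=2, span=0.3):
    """Weighted least squares fit of a degree-`degree` polynomial around x0.

    Hypothetical helper: uses tricube weights on the k = span * n nearest points
    and the centered basis 1, (x - x0), (x - x0)^2, ...
    """
    n = len(x)
    k = max(degree + 2, int(np.ceil(span * n)))          # enough points for the fit
    dist = np.abs(x - x0)
    h = np.sort(dist)[k - 1]                             # bandwidth h(x0)
    w = np.clip(1 - (dist / h) ** 3, 0, None) ** 3       # tricube weights
    X = np.vander(x - x0, degree + 1, increasing=True)   # centered polynomial basis
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                                       # local polynomial evaluated at x0

# Simulated data (assumed for illustration)
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.2, x.size)

for q in (1, 2, 3):
    print(q, local_poly_fit(x, y, 5.0, degree=q, span=0.3))
```

A useful sanity check is that a local fit of degree $q$ reproduces any noiseless polynomial of degree at most $q$ exactly.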
    
#### Practice: In Python, we can visualize this method:    
```python
import numpy as np
import matplotlib.pyplot as plt
import pyqt_fit.nonparam_regression as smooth
from pyqt_fit import npr_methods


def f(x):
    """Target function from which the noisy data are generated."""
    return 3*np.cos(x/2) + x**2/5 + 3

# Simulated data: 200 points on [0, 10] with Gaussian noise
xs = np.random.rand(200) * 10
ys = f(xs) + 2*np.random.randn(*xs.shape)

# Evaluation grid: 512 evenly spaced points on [0, 10]
grid = np.r_[0:10:512j]
plt.plot(grid, f(grid), 'r--', label='Reference')
plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
plt.legend(loc='best')

# Local polynomial fits: linear (q=1), quadratic (q=2), cubic (q=3), and order 12
k1 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=1))
k2 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=2))
k3 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=3))
k12 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=12))
k1.fit(); k2.fit(); k3.fit(); k12.fit()

# Compare the fitted curves against the data and the target function
plt.figure()
plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
plt.plot(grid, k12(grid), 'b', label='polynom order 12', linewidth=2)
plt.plot(grid, k3(grid), 'y', label='cubic', linewidth=2)
plt.plot(grid, k2(grid), 'k', label='quadratic', linewidth=2)
plt.plot(grid, k1(grid), 'g', label='linear', linewidth=2)
plt.plot(grid, f(grid), 'r--', label='Target', linewidth=2)
plt.legend(loc='best')
plt.show()
```

This example depends on the third-party `pyqt_fit` package. If it is not installed, the import fails with `ModuleNotFoundError: No module named 'pyqt_fit'`.

# Generalized Additive Models 
Generalized Additive Models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.  
The model becomes 
$$y_i = \beta_0 + \displaystyle\sum_{j=1}^p f_j(x_{ij}) + \epsilon_i$$ 
$$ = \beta_0 + f_1(x_{i1}) + f_2(x_{i2})+...+ f_p(x_{ip}) + \epsilon_i $$
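
One simple way to fit such a model, when each $f_j$ is represented with a fixed set of basis functions, is a single least squares regression on all the basis columns together. The sketch below assumes two predictors, a cubic polynomial basis for each $f_j$, and simulated data; the helper name `poly_basis` is hypothetical, and other bases (for example splines) could be substituted.

```python
import numpy as np

# Simulated data (assumed): additive signal in two predictors
rng = np.random.default_rng(7)
n = 500
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 1.0 + np.sin(x1) + x2**2 + rng.normal(0, 0.2, n)

def poly_basis(x, degree=3):
    """Columns x, x^2, ..., x^degree (no intercept column)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

B1, B2 = poly_basis(x1), poly_basis(x2)

# One big design matrix: intercept, basis for f_1, basis for f_2
X = np.column_stack([np.ones(n), B1, B2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

f1_hat = B1 @ beta[1:4]     # estimated contribution of x1
f2_hat = B2 @ beta[4:7]     # estimated contribution of x2
fitted = beta[0] + f1_hat + f2_hat
print(np.mean((y - fitted) ** 2))
```

Because the model is additive, `f1_hat` and `f2_hat` can be examined separately to see the estimated effect of each predictor while holding the other fixed.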