#### Introduction to Statistical Learning, Lab 7.1

# Splines
In order to fit regression splines in Python, we use the `dmatrix` function from the `patsy` library. In lecture, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. Inside a `patsy` formula, the `bs()` function generates the entire matrix of B-spline basis functions for a spline with the specified set of knots.
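
To make the idea of "a matrix of basis functions" concrete, here is a minimal NumPy sketch. It uses the truncated power basis rather than the B-spline basis that `bs()` builds, and it runs on synthetic data; the helper names `truncated_power_basis` and `rescale` are hypothetical and exist only for this illustration.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Columns 1, x, ..., x^degree, then (x - k)_+^degree for each knot k."""
    x = np.asarray(x, dtype=float)
    powers = [x ** p for p in range(degree + 1)]
    hinges = [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(powers + hinges)

# Synthetic data (an assumption for illustration, not the Wage data).
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, size=500)
y = 50 + 30 * np.sin(x / 15) + rng.normal(scale=5, size=x.size)

LO, HI = 18.0, 80.0
knots = np.array([25.0, 40.0, 60.0])

def rescale(v):
    """Map ages to [0, 1] so that cubing them keeps the matrix well conditioned."""
    return (np.asarray(v, dtype=float) - LO) / (HI - LO)

# Design matrix: 4 polynomial columns + 3 knot columns = 7 columns.
X = truncated_power_basis(rescale(x), rescale(knots))
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicting at new points reuses the same knots and the same rescaling.
x_new = np.linspace(LO, HI, 5)
y_new = truncated_power_basis(rescale(x_new), rescale(knots)) @ coef
print(X.shape, y_new.shape)
```

Once the basis matrix exists, the spline fit is an ordinary least squares problem in its columns, which is why a generic linear-model routine can fit it.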

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import statsmodels.formula.api as smf
from islpy import datasets
from patsy import dmatrix

%matplotlib inline

In [None]:
df = datasets.Wage()

We will also need a grid of ages spanning the range observed in `Wage`, on which to evaluate the fitted curves.

In [None]:
# np.arange excludes its stop value, so add 1 to include the maximum age
age_grid = np.arange(df.age.min(), df.age.max() + 1).reshape(-1, 1)

Fitting `wage` to `age` using a regression spline is simple. First we need to specify the number of knots; let's use $3$ knots and cubic splines:

In [None]:
transformed_x1 = dmatrix("bs(df.age, knots=(25,40,60), degree=3, include_intercept=False)",
                        {"df.age": df.age}, return_type='dataframe')

Now let's fit `wage` on this cubic spline basis for `age`. We use a GLM; with its default Gaussian family this is an ordinary least squares fit:

In [None]:
fit1 = sm.GLM(df.wage, transformed_x1).fit()
fit1.params

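There are 7 coefficients. Three knots split the range of `age` into four intervals, and a separate cubic on each interval would have 4 degrees of freedom, for $4 \times 4 = 16$ in total. At each knot $k_i$, however, the adjacent pieces $y_j$ and $y_{j+1}$ must agree in value and in their first and second derivatives ($y_j(k_i)=y_{j+1}(k_i)$, $\frac{dy_j(k_i)}{dx}=\frac{dy_{j+1}(k_i)}{dx}$ and $\frac{d^2y_j(k_i)}{dx^2}=\frac{d^2y_{j+1}(k_i)}{dx^2}$). That is 3 constraints per knot, so the 3 knots remove 9 degrees of freedom, leaving $16 - 9 = 7$.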
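
The same count can be written as a small function. This is only a sketch of the arithmetic above; `spline_dof` is a hypothetical helper, not part of `patsy`.

```python
def spline_dof(n_knots, degree=3):
    """Degrees of freedom of a regression spline, counted piece by piece."""
    pieces = n_knots + 1                 # the knots split the range into K + 1 intervals
    unconstrained = pieces * (degree + 1)  # each piece is a polynomial with degree + 1 coefficients
    constraints = n_knots * degree       # continuity of the value and of degree - 1 derivatives per knot
    return unconstrained - constraints

print(spline_dof(3))  # cubic spline with 3 knots: 16 - 9
```

The result simplifies to `n_knots + degree + 1`, so each extra knot adds exactly one degree of freedom.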

Now let's visualize the fit of our cubic spline with 3 knots:

In [None]:
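# Reuse the design memorized at fit time so the knots match the training data;
# the formula refers to `df.age`, so pass a stand-in `df` holding the grid.
grid = {"df": pd.DataFrame({"age": age_grid.ravel()})}
pred1 = fit1.predict(dmatrix(transformed_x1.design_info, grid, return_type='dataframe'))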

In [None]:
plt.scatter(df.age, df.wage, facecolor='g', edgecolor='b', alpha=0.4)
plt.plot(age_grid, pred1, color='k', label='cubic 3-Spline')
# Dashed lines mark the three knots
plt.vlines([25, 40, 60], 0, 350, linestyles='dashed', lw=2, colors='k')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage');

Here we have prespecified knots at ages $25$, $40$, and $60$. This produces a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.) We could also use the `df` argument of `bs()` (not to be confused with our data frame `df`) to produce a spline with knots at uniform quantiles of the data. Here we ask for 6 degrees of freedom:

In [None]:
transformed_x2 = dmatrix("bs(df.age, df=6, include_intercept=False)",
                        {"df.age": df.age}, return_type='dataframe')
fit2 = sm.GLM(df.wage, transformed_x2).fit()
fit2.params

Now let's visualize this spline fit:

In [None]:
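# Reuse the design memorized at fit time: calling bs(age_grid, df=6) afresh would
# pick new knots from the quantiles of the grid rather than those of the data.
grid = {"df": pd.DataFrame({"age": age_grid.ravel()})}
pred2 = fit2.predict(dmatrix(transformed_x2.design_info, grid, return_type='dataframe'))

plt.scatter(df.age, df.wage, facecolor='g', edgecolor='b', alpha=0.4)
plt.plot(age_grid, pred2, color='r', label='6 d.o.f. Spline', lw=2)
# Dashed lines mark this spline's knots: the quartiles of age
plt.vlines(np.percentile(df.age, [25, 50, 75]), 0, 350, linestyles='dashed', lw=2, colors='r')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage');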

In this case `patsy` chooses three knots, which correspond to the 25th, 50th, and 75th percentiles of `age`. The function `bs()` also has a `degree` argument, so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).
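
The quantile rule can be sketched in a few lines of NumPy. The helper `quantile_knots` is hypothetical and only mirrors the rule described here (interior knots at evenly spaced quantiles, with `df - degree` of them when no intercept column is included); it is not `patsy`'s own code, and the ages are a stand-in array rather than the `Wage` data.

```python
import numpy as np

def quantile_knots(x, df, degree=3):
    """Interior knots at evenly spaced quantiles of x (basis without an intercept column)."""
    n_knots = df - degree
    if n_knots < 0:
        raise ValueError("df must be at least the spline degree")
    # Evenly spaced probabilities, dropping 0 and 1 (those are the boundaries).
    probs = np.linspace(0, 1, n_knots + 2)[1:-1]
    return np.quantile(x, probs)

ages = np.arange(18, 81)           # stand-in data: every integer age from 18 to 80
knots = quantile_knots(ages, df=6)
print(knots)                       # quartiles of the stand-in ages
```

Because the knots are computed from whatever data is passed in, two arrays with different distributions give different knots even when `df` is the same.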

In order to instead fit a natural spline, we use the `cr()` function. Here we fit a natural spline with four degrees of freedom:

In [None]:
transformed_x3 = dmatrix("cr(df.age, df=4)", {"df.age": df.age}, return_type='dataframe')
fit3 = sm.GLM(df.wage, transformed_x3).fit()
fit3.params

Let's visualize the natural spline fit:

In [None]:
# Reuse the design memorized at fit time so the knots match the training data
grid = {"df": pd.DataFrame({"age": age_grid.ravel()})}
pred3 = fit3.predict(dmatrix(transformed_x3.design_info, grid, return_type='dataframe'))

plt.scatter(df.age, df.wage, facecolor='g', edgecolor='b', alpha=0.4)
plt.plot(age_grid, pred3, color='y', label='Natural Spline d.o.f.=4')
# Dashed lines at 25, 40 and 60 are the first spline's knots, shown for reference only
plt.vlines([25, 40, 60], 0, 350, linestyles='dashed', lw=2, colors='y')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage');

As with the `bs()` function, we could instead specify the knots directly using the `knots` option.
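
What makes a spline "natural" is that it is constrained to be linear beyond the boundary knots. The sketch below builds a natural cubic spline basis with the textbook truncated-power construction and checks that property numerically. It is an illustration under stated assumptions: `cr()` constructs its basis differently, the helper `natural_cubic_basis` is hypothetical, and the knots are made up.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis (truncated-power form): K knots give K columns."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    K = knots.size
    last = knots[-1]

    def d(k):
        # Scaled difference of truncated cubics anchored at knot k and at the last knot.
        num = np.clip(x - knots[k], 0.0, None) ** 3 - np.clip(x - last, 0.0, None) ** 3
        return num / (last - knots[k])

    cols = [np.ones_like(x), x]
    cols += [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

knots = [20, 35, 50, 65, 80]   # made-up knots; the first and last act as boundary knots

# Second differences on an evenly spaced grid vanish exactly where a function is linear.
beyond = natural_cubic_basis(np.linspace(80, 100, 11), knots)
inside = natural_cubic_basis(np.linspace(35, 65, 11), knots)
print(np.abs(np.diff(beyond, n=2, axis=0)).max())  # numerically zero: linear past the last knot
print(np.abs(np.diff(inside, n=2, axis=0)).max())  # clearly nonzero: cubic between the knots
```

The linearity constraints at both ends are what free up degrees of freedom relative to an ordinary cubic spline with the same knots.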

Let's see how these three models stack up:

In [None]:
plt.scatter(df.age, df.wage, facecolor='g', edgecolor='b', alpha=0.1)
plt.plot(age_grid, pred1, color='k', label='cubic 3-Spline', lw=4)
plt.plot(age_grid, pred2, color='r', label='6 d.o.f. Spline', lw=4)
plt.plot(age_grid, pred3, color='y', label='Natural Spline d.o.f.=4', lw=4)
# Knots of the first spline (25, 40, 60), shown for reference
plt.vlines([25, 40, 60], 0, 350, linestyles='dashed', lw=2, colors='b')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage');