# 7.8 Lab: Non-linear Modeling 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy
import pandas as pd 

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
%matplotlib inline

In [None]:
"""
In this lab, we will use the Wage data. Let us read in the CSV file and look at a sample of the data.
"""
Wage = pd.read_csv('data/Wage.csv', header=0, na_values='NA')
print(Wage.shape)
Wage.head()

## 7.8.1 Polynomial Regression and Step Functions

We will examine how to fit a polynomial regression model on the Wage dataset. As with most techniques, there are several ways to do this. Here I will use scikit-learn, since we already used statsmodels.api in Chapter 3. If you need built-in p-values, significance tests, confidence intervals, and so on, I recommend statsmodels.api.

scikit-learn does not provide built-in error estimates for inference. This limitation, however, pushes us toward a more general method for finding confidence intervals (keyword: bootstrap).
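As a rough illustration of the bootstrap idea, the sketch below refits a polynomial on resampled rows and takes percentiles of the predictions at each grid point. It runs on synthetic data rather than the Wage data, and the helper name `bootstrap_band`, the number of resamples, and the 95% level are my own assumptions, not something taken from the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def bootstrap_band(x, y, grid, degree=4, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap band for a polynomial fit evaluated on `grid` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x[idx].reshape(-1, 1), y[idx])
        preds[b] = model.predict(grid.reshape(-1, 1))
    lower = np.percentile(preds, 100 * alpha / 2, axis=0)
    upper = np.percentile(preds, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


# Synthetic stand-in for (age, wage); in the lab one would pass the Wage columns instead.
rng = np.random.default_rng(1)
x_demo = rng.uniform(18, 80, size=300)
y_demo = 60 + 2.0 * x_demo - 0.02 * x_demo ** 2 + rng.normal(0, 10, size=300)
grid_demo = np.linspace(20, 75, 12)
lower, upper = bootstrap_band(x_demo, y_demo, grid_demo, degree=2)
```

Plotting `lower` and `upper` against `grid_demo` with plt.fill_between would give a band around the fitted curve.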

NumPy also has a nice function for polynomial regression: https://www.ritchieng.com/machine-learning-polynomial-regression/
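The text does not name the NumPy function; I am assuming it is np.polyfit. A minimal sketch on a noiseless synthetic cubic, so that the recovered coefficients are easy to check by eye:

```python
import numpy as np

# Synthetic noiseless cubic (not the Wage data).
x = np.linspace(-2, 2, 50)
y = 1.0 - 3.0 * x + 0.5 * x ** 3

# np.polyfit returns coefficients ordered from the highest power down to the constant.
coefs = np.polyfit(x, y, deg=3)

# np.poly1d wraps the coefficients in a callable polynomial.
poly = np.poly1d(coefs)
y_hat = poly(x)
```

Note the coefficient order is the reverse of what PolynomialFeatures plus LinearRegression gives, where the lowest power comes first.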

In [None]:
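n_deg = 4
# Series.reshape is not available in current pandas, so convert to NumPy arrays first
X = Wage.age.values.reshape(-1, 1)
y = Wage.wage.values.reshape(-1, 1)

# expand age into the columns 1, age, age^2, ..., age^n_deg
polynomial_features = PolynomialFeatures(degree=n_deg)
X_poly = polynomial_features.fit_transform(X)

# LinearRegression fits its own intercept, so the coefficient of the constant column is 0
reg = LinearRegression()
reg.fit(X_poly, y)

# get the coefficients and compare with the numbers at the end of page 288
print(reg.intercept_, reg.coef_)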

We now create a grid of values for age at which we want predictions, and then call the generic predict() function.

In [None]:
# generate a sequence of age values spanning the range (np.arange excludes the stop value, hence + 1)
age_grid = np.arange(Wage.age.min(), Wage.age.max() + 1).reshape(-1, 1)

# build the polynomial features for the grid with the same degree as the fitted model
X_test = PolynomialFeatures(degree=n_deg).fit_transform(age_grid)

# predict the value of the generated ages
y_pred = reg.predict(X_test)

# creating plots
plt.plot(age_grid, y_pred, color='red')
plt.show()

### Decide on the polynomial to use. 

In the book, the authors do this with hypothesis tests: an analysis of variance (ANOVA) using an F-test is explained. To use the ANOVA function, $M_1$ and $M_2$ must be nested models: the predictors in $M_1$ must be a subset of the predictors in $M_2$. statsmodels.api has a nice built-in function for this.
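To make the F-test concrete, here is a sketch of the two-model case computed by hand: with residual sums of squares $RSS_1$, $RSS_2$ and residual degrees of freedom $df_1$, $df_2$, the statistic is $F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}$. The function name `nested_f_test` and the synthetic data are my own assumptions, and the degrees of freedom assume both design matrices have full column rank.

```python
import numpy as np
from scipy import stats


def nested_f_test(X_small, X_big, y):
    """F-test of a smaller least-squares model against a larger one that contains it."""
    def rss_and_df(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        # residual degrees of freedom = observations minus fitted parameters
        return float(resid @ resid), X.shape[0] - X.shape[1]

    rss1, df1 = rss_and_df(X_small)
    rss2, df2 = rss_and_df(X_big)
    f_stat = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
    p_value = stats.f.sf(f_stat, df1 - df2, df2)  # upper-tail probability
    return f_stat, p_value


# Synthetic data with a genuine quadratic effect (not the Wage data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(0, 0.5, size=200)
X_lin = np.vander(x, 2, increasing=True)   # columns: 1, x
X_quad = np.vander(x, 3, increasing=True)  # columns: 1, x, x^2
f_stat, p_value = nested_f_test(X_lin, X_quad, y)
```

A small `p_value` here says the quadratic term explains more than chance would, which is the same logic the ANOVA table applies row by row.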

As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed earlier.
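A sketch of the cross-validation route, assuming 10-fold CV and mean squared error as the criterion (both are my choices, not prescribed by the text). It uses synthetic data with a cubic signal; with the Wage data one would pass the age column and wage instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic signal (not the Wage data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
y = 1.0 - 2.0 * x + 3.0 * x ** 3 + rng.normal(0, 0.3, size=300)
X_demo = x.reshape(-1, 1)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 6):
    # the pipeline re-derives the polynomial features inside every fold
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X_demo, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # scikit-learn reports negated MSE

best_degree = min(cv_mse, key=cv_mse.get)
```

The degree with the smallest cross-validated error is a reasonable candidate, although neighbouring degrees often differ by very little.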

In [None]:
# design matrices for polynomial degrees 1 to 5 (each includes the constant column)
X1 = PolynomialFeatures(1).fit_transform(X)
X2 = PolynomialFeatures(2).fit_transform(X)
X3 = PolynomialFeatures(3).fit_transform(X)
X4 = PolynomialFeatures(4).fit_transform(X)
X5 = PolynomialFeatures(5).fit_transform(X)
# with no covariance matrix supplied, sm.GLS is equivalent to ordinary least squares
fit1 = sm.GLS(y, X1).fit()
fit2 = sm.GLS(y, X2).fit()
fit3 = sm.GLS(y, X3).fit()
fit4 = sm.GLS(y, X4).fit()
fit5 = sm.GLS(y, X5).fit()


In [None]:
import statsmodels.api as sm
print(sm.stats.anova_lm(fit1, fit2, fit3, fit4, fit5, typ=1))

The table above shows that the p-value comparing the linear model fit1 to the quadratic model fit2 is $2.36 \times 10^{-32}$, indicating that the quadratic model is a significant improvement over the linear one. Similarly, the cubic model is a significant improvement over the quadratic model ($p = 1.68 \times 10^{-2}$). Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.

In the book, the authors also discuss logistic regression with polynomial terms. In Python, sm.GLM provides functionality similar to glm() in R.

In [None]:
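# binary response: 1 if wage > 250, else 0; X4 holds the degree-4 polynomial terms of age
logistic_model = sm.GLM((y > 250).astype(int), X4, family=sm.families.Binomial())
logistic_fit = logistic_model.fit()
print(logistic_fit.summary())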

### Step function

In order to fit a step function, we use the pd.cut() function:

In [None]:
age_cut, bins = pd.cut(Wage.age, bins=4, retbins=True, right=True)
age_cut.value_counts(sort=False)

Here cut() automatically picked the cutpoints at 33.5, 49, and 64.5 years of age. We could also have specified our own cutpoints directly by passing a sequence of scalars as bins, e.g. [0, 10, 20, 40, 100] (the counterpart of the breaks option in R). Note that in the following code I manually add a constant column and drop the dummy variable for the lowest bin, (17.938, 33.5].

In [None]:
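# dtype=float keeps the dummies numeric (newer pandas versions return booleans by default)
age_cut_dummies = pd.get_dummies(age_cut, dtype=float)
age_cut_dummies = sm.add_constant(age_cut_dummies)
# drop the first age bin (column 1, right after the constant) so it serves as the baseline
fit_age_cut = sm.GLM(Wage.wage, age_cut_dummies.drop(age_cut_dummies.columns[1], axis=1)).fit()
print(fit_age_cut.summary())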
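Returning to the choice of cutpoints: instead of letting pd.cut() pick four equal-width bins, explicit edges can be passed as bins. The ages and edges below are made up purely for illustration.

```python
import pandas as pd

ages = pd.Series([18, 25, 33, 41, 58, 64, 79])  # made-up ages, not the Wage data

# Explicit bin edges; with right=True each bin is the half-open interval (left, right].
custom_bins = [17, 35, 50, 65, 80]
age_groups = pd.cut(ages, bins=custom_bins, right=True)

# Integer code of the bin each age falls into (0 is the lowest bin).
bin_codes = age_groups.cat.codes.tolist()
counts = age_groups.value_counts(sort=False)
```

Values outside the outermost edges would become NaN, so the edges should cover the full range of the data.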