# Ridge Regression on Pollution Dataset

*This notebook builds ridge regression models for predicting mortality rate from air quality.*

The data is from 
> McDonald and Schwing (1973), "Instabilities of Regression Estimates Relating Air Pollution to Mortality," Technometrics, 15, 463-481

and is available at [NCSU](https://www4.stat.ncsu.edu/~boos/var.select/pollution.html).

## Import Dependencies

In [1]:
import peak_engines
import sklearn.preprocessing  # load the submodule explicitly; `import sklearn` alone does not guarantee it
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np

## Prepare the dataset

In [2]:
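# Columns in the file are separated by runs of whitespace
df = pd.read_csv('pollution.tsv', header=0, sep=r'\s+')
# Every column except the last is a regressor; standardize each to zero mean and unit variance
X = np.array(df.iloc[:, :-1].values, dtype=float)
X = sklearn.preprocessing.StandardScaler().fit_transform(X)
# The last column is the target
y = np.array(df.iloc[:, -1].values, dtype=float)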
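
The ridge penalty depends on the scale of each regressor, which is one common reason to standardize before fitting. As a small illustration of what the `StandardScaler` step does, the sketch below applies it to a made-up array (`X_demo` is hypothetical, not the pollution data) and checks it against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# The same transformation written out by hand
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately zero in every column
print(X_scaled.std(axis=0))   # approximately one in every column
```

After scaling, both columns contribute on a comparable footing, so a single regularization strength treats them alike.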

## Compare performance using leave-one-out cross-validation

In [3]:
cv_splits = KFold(len(y))  # one fold per sample, i.e. leave-one-out
models = [
    ('LS       ', LinearRegression()),
    ('RR-LOOCV1', peak_engines.RidgeRegressionModel(score='loocv', normalize=True)),
    # Uses two regularizers and groups regressors by their magnitude
    ('RR-LOOCV2', peak_engines.RidgeRegressionModel(score='loocv', num_groups=2,
                                                    normalize=True)),
    ('RR-GCV1', peak_engines.RidgeRegressionModel(score='gcv', normalize=True)),
    # Uses two regularizers and groups regressors by their magnitude
    ('RR-GCV2', peak_engines.RidgeRegressionModel(score='gcv', num_groups=2,
                                                  normalize=True)),
]
for name, model in models:
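    # Each fold holds out a single sample, so the mean over folds is the leave-one-out MSE
    cv_results = cross_val_score(model, X, y, cv=cv_splits, scoring='neg_mean_squared_error')
    # scikit-learn reports negated MSE; flip the sign and take the square root to get RMSE
    print(name, np.sqrt(-cv_results.mean()))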

LS        46.225370299269144
RR-LOOCV1 41.13025109202713
RR-LOOCV2 39.63797831378044
RR-GCV1 40.89446991681711
RR-GCV2 39.619374460805155
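
The `score='loocv'` and `score='gcv'` options name two standard criteria for choosing a ridge penalty. As a rough illustration of what those criteria measure, and not a description of how `peak_engines` computes them, the sketch below evaluates both for a single fixed penalty `alpha` on synthetic data. It assumes ridge regression without an intercept and with one regularizer. The function names and the demo data are hypothetical. For this setting the leave-one-out residuals follow from a single fit via the hat matrix, which the brute-force loop confirms.

```python
import numpy as np

def hat_matrix(X, alpha):
    """Matrix H with fitted values y_hat = H @ y for ridge without an intercept."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)

def ridge_loocv_mse(X, y, alpha):
    """Leave-one-out MSE from one fit: residual_i / (1 - H_ii)."""
    H = hat_matrix(X, alpha)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)

def ridge_gcv(X, y, alpha):
    """Generalized cross-validation: replaces each H_ii by the average trace(H) / n."""
    n = X.shape[0]
    H = hat_matrix(X, alpha)
    residuals = y - H @ y
    return np.mean(residuals ** 2) / (1.0 - np.trace(H) / n) ** 2

def brute_force_loocv_mse(X, y, alpha):
    """Refit n times, each time leaving one sample out, for comparison."""
    n, p = X.shape
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i
        X_train, y_train = X[keep], y[keep]
        beta = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(p),
                               X_train.T @ y_train)
        squared_errors.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(squared_errors)

# Synthetic demo data (not the pollution dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=30)

loocv_fast = ridge_loocv_mse(X_demo, y_demo, alpha=1.0)
loocv_slow = brute_force_loocv_mse(X_demo, y_demo, alpha=1.0)
gcv_value = ridge_gcv(X_demo, y_demo, alpha=1.0)
print(loocv_fast, loocv_slow, gcv_value)
```

The closed-form and brute-force leave-one-out values agree up to floating-point error. A model selector would minimize one of these scores over `alpha`.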
