#### Ridge Regression on Polynomial Features

Next, we implement the polynomial features regression using ridge regression and cross validation. The variables must be standardized first, because the ridge penalty depends on the scale of each feature.
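For reference, the objective that ridge regression minimizes can be written as below. The notation is our own choice rather than anything from the notebook: $\beta_0$ is the intercept, $\beta$ the coefficient vector over $p$ features, and $\alpha$ the penalty strength (the `alpha` argument in scikit-learn). Because the penalty sums the squared coefficients, a feature measured on a large scale needs only a small coefficient and is effectively penalized less, which is why the features are standardized first.

```latex
\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2 \;+\; \alpha \sum_{j=1}^{p} \beta_j^2
```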

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# read in data
df = pd.read_csv('../data/Advertising.csv')
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


Ideally we split the data twice so that we have a final holdout set. If we tune the model using cross validation (or a single train test split) and then report performance on the same data, the parameters have been optimized against data the model was supposed to be tested on, which is data leakage. Cross validation on the training portion is a way to evaluate candidate settings before touching the actual test data. For now, the cell below does a single train test split.

In [3]:
# set up X and y
# note: 'Unnamed: 0' (the unnamed first column of the CSV, likely a saved row index)
# stays in X here; add it to the drop list if it should not be used as a feature
X = df.drop('sales', axis=1)
y = df['sales']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# single train test split for now: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
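The cell above makes a single split. As a rough sketch of the "split twice" idea described in the text, the following uses synthetic data as a stand-in for the advertising set; the names `X_demo`, `y_demo`, `X_holdout`, and `X_val`, and the 60/20/20 proportions, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the advertising data: 200 rows, 3 features (assumption)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 300, size=(200, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# first split: set aside 20% of the rows as the final holdout set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# second split: divide the remaining 80% into training and validation sets
# (0.25 of the remainder equals 20% of the original data)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(X_tr.shape, X_val.shape, X_holdout.shape)
```

The holdout rows are only used once, after all tuning decisions have been made on the training and validation rows.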

The polynomial expansion, scaling, and model fitting below would normally be done in a pipeline, but here we go through the steps one at a time for clarity.

In [4]:
# polynomial set up
poly = PolynomialFeatures(degree=3, include_bias=False)

# fit on the training data and transform it
X_train_poly = poly.fit_transform(X_train)

# transform the test data with the already fitted transformer
X_test_poly = poly.transform(X_test)

In [5]:
# standard scaler
scaler = StandardScaler()

# fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train_poly)

# scale the test data using the training mean and standard deviation
X_test_scaled = scaler.transform(X_test_poly)

In [6]:
# ridge regression
from sklearn.linear_model import Ridge

In [7]:
# create model
ridge = Ridge(alpha=10)

# fit
ridge.fit(X_train_scaled, y_train)

# predict
y_pred = ridge.predict(X_test_scaled)

In [8]:
# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_test, y_pred)
rmse

0.8609608361825305

In [9]:
# using cross validation
from sklearn.model_selection import KFold, cross_val_score

# set up our folds
cv = KFold(n_splits=5, random_state=42, shuffle=True)

# cross val scores
scores = cross_val_score(ridge, X_train_scaled, y_train, scoring='neg_root_mean_squared_error', cv = cv)

In [10]:
scores

array([-0.72975631, -1.70206768, -0.54096587, -0.77658989, -0.69474171])

In [41]:
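# get mean of scores (negative because scikit-learn negates RMSE so that higher is better)
np.mean(scores)

np.float64(-1.0343771451925177)

Note that this output does not match the fold scores printed above, whose mean is about -0.889; the execution count (41) suggests the cell was rerun after `scores` had been recomputed.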
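In the cross validation above, the scaler was fit on the whole training set before the folds were made, so each validation fold has already influenced the scaling. One way to avoid this is to wrap the steps in a pipeline so that they are refit inside every fold. The sketch below uses synthetic data in place of the advertising set; `X_demo`, `y_demo`, `pipe`, and `cv_demo` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic stand-in for the training data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(160, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 1, size=160)

# the same three steps as above, chained so they are fit per fold
pipe = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    Ridge(alpha=10),
)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# raw (unscaled) features go in; scaling statistics come only from each fold's training part
pipe_scores = cross_val_score(
    pipe, X_demo, y_demo, scoring='neg_root_mean_squared_error', cv=cv_demo
)

# flip the sign to report an ordinary RMSE
print(-pipe_scores.mean())
```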

In [None]:
from sklearn.linear_model import RidgeCV

# candidate penalty strengths to compare by cross validation
ridge_cv = RidgeCV(alphas=[0.1, 1, 5, 10])
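The cell above only constructs the estimator. A minimal sketch of how it might then be fit and inspected follows, again on synthetic data rather than the advertising set; `X_demo`, `y_demo`, `model`, and `best_alpha` are hypothetical names. With its default settings, `RidgeCV` compares the candidate alphas using an efficient form of leave-one-out cross validation and stores the winner in `alpha_`.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic stand-in for the training data (assumption)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(160, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 1, size=160)

alphas = [0.1, 1, 5, 10]

# expand, scale, then let RidgeCV pick among the candidate alphas
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),
)
model.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
best_alpha = model.named_steps['ridgecv'].alpha_
print(best_alpha)
```

The selected alpha depends on the data, so no particular value should be expected on the real advertising set.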

