# Supervised Learning Algorithms: Ridge Regression

*In this template, only the **data input** and the **input/target variables** need to be specified (see the "Data Input and Variables" section for instructions). No other section needs to be adjusted. The data input example uses a .csv file from the IBM Box web repository.*

## 1. Libraries

*Run to import the required libraries.*

In [7]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input and the input (X) and target (y) variables, then run the code. Do not change the names **['df', 'X', 'y']**, because later sections use them.*

In [11]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('https://ibm.box.com/shared/static/q6iiqb1pd7wo8r3q28jvgsrprzezjqk3.csv')

X = df[['horsepower']]
y = df['price']
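
*For a local file or more than one input column, the same pattern should apply. The sketch below is illustrative only: the inline CSV, its values, and the `curb_weight` column are made up for the example, and the `_demo` names are used so that `df`, `X`, and `y` above stay untouched.*

```python
import io
import pandas as pd

# Made-up rows standing in for a real file (assumption for illustration);
# replace io.StringIO(csv_text) with a file path or URL.
csv_text = """horsepower,curb_weight,price
111,2548,13495
154,2823,16500
102,2337,13950
115,2824,17450
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

# Double brackets keep X two-dimensional (a DataFrame), which scikit-learn expects.
X_demo = df_demo[['horsepower', 'curb_weight']]
# Single brackets return a one-dimensional Series for the target.
y_demo = df_demo['price']

print(X_demo.shape, y_demo.shape)
```

*Selecting a single input column with double brackets, as in `df[['horsepower']]`, keeps the same two-dimensional shape with one column.*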

## 3. The Model

*Run to build the model.*

In [19]:
from sklearn.linear_model import Ridge

# split into training and test sets (default: 75% training, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)
# feature standardization (zero mean, unit variance), fitted on the training data only
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit the ridge regression model with regularization strength alpha = 20
linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

# report the intercept, the coefficient, R-squared on the training and test sets, and the number of non-zero coefficients
print('Ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('Ridge regression linear model coeff: {}\n'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Ridge regression linear model intercept: 12555.906666666666
Ridge regression linear model coeff: [5036.92170125]

R-squared score (training): 0.614
R-squared score (test): 0.625

Number of non-zero features: 1
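
*The intercept printed above can be read as the mean of the training targets: once the inputs are standardized, the penalized fit only shrinks the coefficients, not the intercept. The sketch below checks this against the closed-form ridge solution on synthetic data. The data and all `_demo` / `_closed` names are assumptions made for the illustration; it does not use the dataset above.*

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data (assumption: stands in for horsepower -> price).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=30.0, size=(150, 1))
y_demo = 150.0 * X_demo[:, 0] + rng.normal(scale=3000.0, size=150)

X_demo_scaled = StandardScaler().fit_transform(X_demo)
alpha = 20.0
model = Ridge(alpha=alpha).fit(X_demo_scaled, y_demo)

# Closed form on centered targets: w = (X^T X + alpha * I)^(-1) X^T (y - mean(y))
n_features = X_demo_scaled.shape[1]
y_centered = y_demo - y_demo.mean()
coef_closed = np.linalg.solve(
    X_demo_scaled.T @ X_demo_scaled + alpha * np.eye(n_features),
    X_demo_scaled.T @ y_centered,
)
# With zero-mean inputs the unpenalized intercept is the mean of the targets.
intercept_closed = y_demo.mean()

print('coefficients match:', np.allclose(model.coef_, coef_closed))
print('intercept matches: ', np.isclose(model.intercept_, intercept_closed))
```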


### 3.1. Effect of the regularization parameter alpha on R-squared

*Run to check how alpha affects the model score.*

In [22]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 1, r-squared training: 0.62, r-squared test: 0.67

Alpha = 1.00
num abs(coeff) > 1.0: 1, r-squared training: 0.62, r-squared test: 0.66

Alpha = 10.00
num abs(coeff) > 1.0: 1, r-squared training: 0.62, r-squared test: 0.65

Alpha = 20.00
num abs(coeff) > 1.0: 1, r-squared training: 0.61, r-squared test: 0.63

Alpha = 50.00
num abs(coeff) > 1.0: 1, r-squared training: 0.58, r-squared test: 0.56

Alpha = 100.00
num abs(coeff) > 1.0: 1, r-squared training: 0.52, r-squared test: 0.48

Alpha = 1000.00
num abs(coeff) > 1.0: 1, r-squared training: 0.15, r-squared test: 0.07
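
*The loop above compares alpha values by their test score, which lets the test set influence the choice. One common alternative, sketched here on synthetic data, is to pick alpha by cross-validation on the training set and score the test set once at the end. The data and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `pipe`, and `search` are assumptions made for the illustration, and alpha = 0 is left out of the grid.*

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data (assumption: not the dataset used above).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=30.0, size=(200, 1))
y_demo = 150.0 * X_demo[:, 0] + rng.normal(scale=3000.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline refits the scaler inside each fold, so no fold sees
# statistics computed from its own validation part.
pipe = make_pipeline(StandardScaler(), Ridge())
alphas = [0.1, 1, 10, 20, 50, 100, 1000]
search = GridSearchCV(pipe, param_grid={'ridge__alpha': alphas}, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_['ridge__alpha']
test_r2 = search.score(X_te, y_te)  # R-squared of the refitted best model
print('Best alpha: {}'.format(best_alpha))
print('R-squared score (test): {:.3f}'.format(test_r2))
```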

