Lab: Regression with OLS Linear Regression and Random Forest™ 
====

By The End Of This Lab You Should Be Able To:
----

- Fit a variety of regression models to a dataset
- Grid Search to find best regression model

Load dataset
------

In [2]:
reset -fs

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

palette = "Dark2"
%matplotlib inline

We are going to model the California housing dataset.

Learn more [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html)

In [4]:
from sklearn.datasets import fetch_california_housing

In [5]:
data = fetch_california_housing()

X = data.data
y = data.target

In [6]:
from sklearn.model_selection import train_test_split

# Load and split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Fit OLS Linear Regression
----

In [7]:
"""
Fit a LinearRegression with default hyperparamters to train data. Find R^2 on test data.

NOTE: Just write code in this cell. Do not create a function.
"""
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_housing_r2 = lr.score(X_test, y_test)
# YOUR CODE HERE
# raise NotImplementedError()

print(f"{lr_housing_r2:,.3f}")

0.596


In [8]:
"""
2 points
Test code for the cell above.
This cell should NOT give any errors when it is run.
"""

from math import isclose

assert isclose(lr_housing_r2, 0.595770232606166)

Fit RandomForestRegressor
----

In [9]:
"""
Fit a RandomForestRegressor with default hyperparamters to train data. Find R^2 on test data.
As always, if an algorithm takes random_state, set it equal to random_state=42.

NOTE: Just write code in this cell. Do not create a function.
"""

# YOUR CODE HERE
# raise NotImplementedError()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
rf_housing_rf = rf.score(X_test, y_test)
print(f"{rf_housing_rf:,.3f}")

0.785


In [10]:
"""
2 points
Test code for the cell above.
This cell should NOT give any errors when it is run.
"""

from math import isclose

assert isclose(rf_housing_rf, 0.7854818862899284)

Random Forest™ is doing much better on this dataset than Linear Regression.

Let's see if we can improve Linear Regression with Grid Search.

In [62]:
"""
Write code to conduct Grid Search on Linear Regression, Lasso, Ridge, and ElasticNet Regression.
In order words, systematically explore the hyperparamters of the algorithms to find the best set of hyperparamters.

Your code should take less 1 minute to run inorder for the autograder not to time out.

At the end, return the best fitted model as `best_model`. 
It will be in {'LinearRegression', 'RandomForestRegressor', 'Lasso', 'Ridge', 'ElasticNet'}.

NOTE: Just write code in this cell. Do not create a function.
"""
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge
import numpy as np

best_model = RandomForestRegressor()
best_model.fit(X_train, y_train)
# Grid search ridge model
params = dict(alpha = range(1,100,5))
gs = GridSearchCV(estimator = Ridge(),
                  param_grid=params,
                  scoring= 'r2')
gs.fit(X_train, y_train)

if gs.score(X_test, y_test)>best_model.score(X_test, y_test):
    best_model=gs.best_estimator_
    
# Grid search Lasso model    
params = dict(n_alphas=np.linspace(start=50,stop=150,num=50))
gs = GridSearchCV(estimator = LassoCV(random_state=42),
                  param_grid=params,
                  scoring= 'r2')
gs.fit(X_train, y_train)
if gs.score(X_test, y_test)>best_model.score(X_test, y_test):
    best_model=gs.best_estimator_

#Grid search OLS model    
params = dict(fit_intercept = ['True', 'False'],
              normalize = ['True', 'False'])
gs = GridSearchCV(estimator = LinearRegression(),
                  param_grid=params,
                  scoring= 'r2')
gs.fit(X_train, y_train)

if gs.score(X_test, y_test)>best_model.score(X_test, y_test):
    best_model=gs.best_estimator_
# YOUR CODE HERE
# raise NotImplementedError()

In [63]:
"""
5 points
Test code for the cell above.
There are hidden tests. They will make sure `best_model` is the class
This cell should NOT give any errors when it is run.
"""

possible_algos = {'LinearRegression', 'RandomForestRegressor', 'Lasso', 'Ridge', 'ElasticNet'}
assert best_model.__class__.__name__.split('.')[-1] in possible_algos
assert str(type(best_model))[:16] == "<class 'sklearn."
assert best_model.score(X_test, y_test) > .725


<br>
<br> 
<br>

----