<a href="https://colab.research.google.com/github/sspitz3/ml-practice/blob/main/homl/housing_prices/example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Housing Prices Example
This notebook follows along with chapter two of Hands-On Machine Learning. Its sections walk through the basic steps of a machine learning project, using California house price data as the example.

## Load Data
The data is downloaded from a repository maintained by the book's author. It is a CSV of features used to predict house prices. Each row summarizes the houses in one district of California.

In [1]:
import requests
import tarfile
import pandas as pd


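def load_data(url):
  # Download the archive and fail early on an HTTP error.
  r = requests.get(url)
  r.raise_for_status()
  with open("rawdata.tgz", "wb") as f:
    f.write(r.content)
  # Unpack into ./datasets, which yields datasets/housing/housing.csv.
  with tarfile.open("rawdata.tgz", "r") as f:
    f.extractall("datasets")

  return pd.read_csv("datasets/housing/housing.csv")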


data = load_data("https://github.com/ageron/data/raw/main/housing.tgz")

## Splitting
The first step is to set aside a test set, which is used only at the end to evaluate the final model.

In [2]:
from sklearn.model_selection import train_test_split

X = data.drop("median_house_value", axis=1)
y = data["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
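
The split above draws a different random test set on every run, so the errors reported later will shift slightly between runs. As a minimal sketch, assuming a fixed seed is acceptable, passing `random_state` makes the split repeatable. The toy DataFrame and the seed value 42 are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data (hypothetical values).
toy = pd.DataFrame({
    "median_income": np.linspace(1.0, 10.0, 50),
    "median_house_value": np.linspace(100_000, 500_000, 50),
})
X_toy = toy.drop("median_house_value", axis=1)
y_toy = toy["median_house_value"]

# With a fixed random_state the same rows land in the test set every time.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(X_te.index.equals(split_b[1].index))
```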

## Preprocessing
This preprocessing example demonstrates the use of ColumnTransformer to set up multiple pipelines that handle different columns.

In [66]:
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import KMeans
from sklearn.base import BaseEstimator, TransformerMixin

class ClusterSimilarity(BaseEstimator, TransformerMixin):
  def __init__(self, n_clusters, gamma):
    self.n_clusters = n_clusters
    self.gamma = gamma

  def fit(self, X, y=None):
    self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init='auto')
    self.kmeans_.fit(X)
    return self

  def transform(self, X):
    return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

In [67]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
import numpy as np


log_pipeline = make_pipeline(SimpleImputer(strategy='median'), FunctionTransformer(np.log), StandardScaler())
ratio_pipeline = make_pipeline(SimpleImputer(strategy="median"), FunctionTransformer(lambda x: x[:, [0]] / x[:, [1]]), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

preprocessing = ColumnTransformer([
    ("bedrooms", ratio_pipeline, ["total_bedrooms", "households"]),
    ("rooms_per_house", ratio_pipeline, ["total_rooms", "households"]),
    ("people_per_house", ratio_pipeline, ["population", "households"]),
    ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population", "households", "median_income"]),
    ("cat", cat_pipeline, ["ocean_proximity"]),
    ('geo', ClusterSimilarity(n_clusters=10, gamma=1), ['latitude', 'longitude'])
])
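
To see how `ColumnTransformer` routes columns to separate pipelines and concatenates their outputs, a scaled-down sketch may help. The four-row DataFrame and the names `toy_preprocessing` and `column_ratio` are hypothetical. A named function replaces the lambda only as a stylistic choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def column_ratio(x):
    # Divide the first column by the second, keeping a 2-D shape.
    return x[:, [0]] / x[:, [1]]


# Hypothetical miniature of the housing data, with one missing value.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "median_income": [8.3, 8.3, np.nan, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

toy_preprocessing = ColumnTransformer([
    ("rooms_per_house",
     make_pipeline(SimpleImputer(strategy="median"),
                   FunctionTransformer(column_ratio),
                   StandardScaler()),
     ["total_rooms", "households"]),
    ("log",
     make_pipeline(SimpleImputer(strategy="median"),
                   FunctionTransformer(np.log),
                   StandardScaler()),
     ["median_income"]),
    ("cat",
     make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder()),
     ["ocean_proximity"]),
])

features = toy_preprocessing.fit_transform(toy)
# 1 ratio column + 1 log column + 2 one-hot columns = 4 output columns.
print(features.shape)
```

Each named entry receives only the columns listed for it, and the results are stacked side by side in the order the entries appear.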

## Fit Models
In this step, we append a model to the preprocessing pipeline, starting with a linear regression. We also show how to measure the error of the fit, including with cross-validation.

We try two models below, a linear regression and a random forest, and print the RMSE on both the training set and the cross-validation folds to demonstrate overfitting.

In [68]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

lin_reg = make_pipeline(preprocessing, LinearRegression())

lin_reg.fit(X_train, y_train)

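# RMSE on the training set; np.sqrt avoids the deprecated squared=False argument.
train_error = np.sqrt(mean_squared_error(y_train, lin_reg.predict(X_train)))
# Mean RMSE over 3 cross-validation folds (the scorer returns negated RMSE values).
validation_error = -cross_val_score(lin_reg, X_train, y_train, scoring='neg_root_mean_squared_error', cv=3).mean()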

print("RMSE Errors")
print(f"Training: {train_error}")
print(f"Cross Validation: {validation_error}")

RMSE Errors
Training: 69641.48716250007
Cross Validation: 70544.35106527628


In [16]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(random_state=42))

forest_reg.fit(X_train, y_train)

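# RMSE on the training set; np.sqrt avoids the deprecated squared=False argument.
train_error = np.sqrt(mean_squared_error(y_train, forest_reg.predict(X_train)))
# Mean RMSE over 3 cross-validation folds (the scorer returns negated RMSE values).
validation_error = -cross_val_score(forest_reg, X_train, y_train, scoring='neg_root_mean_squared_error', cv=3).mean()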

print("RMSE Errors")
print(f"Training: {train_error}")
print(f"Cross Validation: {validation_error}")

RMSE Errors
Training: 23837.772777448532
Cross Validation: 64423.9537138114
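
The random forest's training RMSE (about 23,838) is far below its cross-validation RMSE (about 64,424), which is the overfitting signal. For linear regression the two numbers are nearly equal. The sketch below wraps both measurements in one helper so the comparison can be repeated for any model. The helper name `rmse_report` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


def rmse_report(model, X, y, cv=3):
    """Return (training RMSE, mean cross-validated RMSE) for a model."""
    model.fit(X, y)
    train_rmse = float(np.sqrt(mean_squared_error(y, model.predict(X))))
    scores = cross_val_score(model, X, y,
                             scoring="neg_root_mean_squared_error", cv=cv)
    return train_rmse, float(-scores.mean())


# Synthetic regression data, used only to keep the sketch self-contained.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0,
                                 random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=42)
train_rmse, cv_rmse = rmse_report(forest, X_demo, y_demo)
print(f"Training: {train_rmse:.1f}")
print(f"Cross Validation: {cv_rmse:.1f}")
```

A training error well below the cross-validation error suggests the model has memorized part of the training set.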


## Grid Search
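For hyperparameter tuning, we demonstrate grid search: first over the ClusterSimilarity parameters of the preprocessing step with a linear regression, then over the random forest's own hyperparameters.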

In [73]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'columntransformer__geo__gamma': [0.1, 0.3, 0.5, 0.8, 1.0, 1.2, 1.5],
    'columntransformer__geo__n_clusters': [4, 10, 15, 20, 25],
}

lin_reg = make_pipeline(preprocessing, LinearRegression())

gs = GridSearchCV(lin_reg, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error')

gs.fit(X_train, y_train)
# Training RMSE of the refit best model; np.sqrt avoids the deprecated squared=False argument.
np.sqrt(mean_squared_error(y_train, gs.predict(X_train)))

65849.94835482865

In [23]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'randomforestregressor__max_features': [4,6,8],
    'randomforestregressor__n_estimators': [10, 50, 100, 150]
}

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(random_state=42))

gs = GridSearchCV(forest_reg, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error')

gs.fit(X_train, y_train)

In [32]:
# Training RMSE of the refit best model; np.sqrt avoids the deprecated squared=False argument.
np.sqrt(mean_squared_error(y_train, gs.predict(X_train)))

23402.770927358608
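
The values printed above are training errors of the refit best model, so they do not show how well the chosen hyperparameters generalize. After fitting, a `GridSearchCV` object also exposes the winning combination in `best_params_` and its mean cross-validation score in `best_score_`. The sketch below shows both on a small synthetic problem. The Ridge pipeline, the grid values, and the `demo_` names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the sketch self-contained.
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0,
                                 random_state=0)

# make_pipeline names each step after its lowercased class name,
# which is why the grid key is "ridge__alpha".
demo_pipeline = make_pipeline(StandardScaler(), Ridge())
demo_grid = {"ridge__alpha": [0.01, 1.0, 100.0]}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
# best_score_ is a negated RMSE, so flip the sign to report it.
print(f"Best CV RMSE: {-demo_search.best_score_:.2f}")
```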

In [34]:
X_train

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 2682 | -121.74 | 37.95 | 5.0 | 4980.0 | 774.0 | 2399.0 | 763.0 | 5.7104 | INLAND |
| 18448 | -122.43 | 37.74 | 52.0 | 1514.0 | 314.0 | 724.0 | 301.0 | 5.3292 | NEAR BAY |
| 3883 | -120.66 | 40.41 | 52.0 | 2081.0 | 478.0 | 1051.0 | 419.0 | 2.2992 | INLAND |
| 6802 | -117.33 | 34.12 | 38.0 | 1703.0 | 385.0 | 1356.0 | 363.0 | 2.0391 | INLAND |
| 2455 | -122.28 | 37.78 | 29.0 | 5154.0 | NaN | 3741.0 | 1273.0 | 2.5762 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7825 | -121.96 | 37.85 | 10.0 | 3209.0 | 379.0 | 1199.0 | 392.0 | 12.2478 | INLAND |
| 17767 | -116.87 | 33.76 | 5.0 | 4116.0 | 761.0 | 1714.0 | 717.0 | 2.5612 | INLAND |
| 11189 | -118.20 | 33.82 | 43.0 | 1758.0 | 347.0 | 954.0 | 312.0 | 5.2606 | NEAR OCEAN |
| 12816 | -122.46 | 37.63 | 22.0 | 6728.0 | 1382.0 | 3783.0 | 1310.0 | 5.0479 | NEAR OCEAN |


In [53]:
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import KMeans
from sklearn.base import BaseEstimator, TransformerMixin

class ClusterSimilarity(BaseEstimator, TransformerMixin):
  def __init__(self, n_clusters, gamma):
    self.n_clusters = n_clusters
    self.gamma = gamma

  def fit(self, X, y=None):
    self.kmeans_ = KMeans(self.n_clusters)
    self.kmeans_.fit(X)
    return self

  def transform(self, X):
    return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

In [58]:
ClusterSimilarity(4, 0.1).fit_transform(X_train[['longitude', 'latitude']])

array([[0.99746216, 0.04137497, 0.44055084, 0.64164252],
       [0.96681384, 0.02744297, 0.34248406, 0.61012292],
       [0.44822254, 0.00607607, 0.12508046, 0.73305726],
       ...,
       [0.05198116, 0.99023673, 0.50819083, 0.00406097],
       [0.96007036, 0.02906835, 0.34992051, 0.58003313],
       [0.03970048, 0.99930628, 0.45008592, 0.00292679]])
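
Each row of the array above holds one similarity value per cluster center. `rbf_kernel` computes exp(-gamma * squared Euclidean distance), so a value is close to 1 when a district lies near that center and decays toward 0 with distance. The sketch below checks this by hand on synthetic points. The two blobs and the chosen parameters are illustrative assumptions, not housing data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for (longitude, latitude).
points = np.vstack([
    rng.normal(loc=[-122.0, 37.5], scale=0.1, size=(50, 2)),
    rng.normal(loc=[-118.0, 34.0], scale=0.1, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
similarity = rbf_kernel(points, kmeans.cluster_centers_, gamma=0.1)

# Same quantity computed by hand: exp(-gamma * squared Euclidean distance).
diff = points[:, None, :] - kmeans.cluster_centers_[None, :, :]
manual = np.exp(-0.1 * (diff ** 2).sum(axis=2))

print(similarity.shape)
print(np.allclose(similarity, manual))
```

A larger `gamma` makes the similarity fall off faster with distance, which is why it is worth tuning together with `n_clusters`.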