In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = housing.data
y = housing.target

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Remember to scale the inputs! SVMs are sensitive to feature scales.
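As a minimal sketch of what standardization does to each feature column, the snippet below subtracts the column mean and divides by the column standard deviation. The `X_toy` array is a made-up example, not taken from the housing data.

```python
import numpy as np

# Hypothetical toy features on very different scales (not from the housing data)
X_toy = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])

mean = X_toy.mean(axis=0)   # per-column mean
std = X_toy.std(axis=0)     # per-column population std (ddof=0), as StandardScaler uses
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # each column is now centered near 0
print(X_scaled.std(axis=0))   # each column now has unit standard deviation
```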

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
from sklearn.pipeline import make_pipeline

lin_reg = make_pipeline(StandardScaler(), LinearSVR(dual=True, random_state=42))
lin_reg.fit(X_train, y_train)



The model did not converge within the default number of iterations, so we will increase its `max_iter` hyperparameter.

In [5]:
lin_reg = make_pipeline(
    StandardScaler(), LinearSVR(dual=True, max_iter=5000, random_state=42)
)
lin_reg.fit(X_train, y_train)
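To confirm that a larger `max_iter` was enough, one option is to record the warnings emitted during fitting and look for scikit-learn's `ConvergenceWarning`. The sketch below does this on a small synthetic dataset. The `converged` helper and the `make_regression` data are illustrative assumptions, not part of this notebook.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Small synthetic regression problem (assumption: stands in for the real data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)


def converged(max_iter):
    """Fit a scaled LinearSVR and report whether no ConvergenceWarning was raised."""
    model = make_pipeline(
        StandardScaler(),
        LinearSVR(dual=True, max_iter=max_iter, random_state=42),
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        model.fit(X_demo, y_demo)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)


for n in (1, 5000):
    print(n, converged(n))
```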

Let's see how it performs on the training set itself.

In [6]:
from sklearn.metrics import mean_squared_error

y_predict = lin_reg.predict(X_train)
mean_squared_error(y_train, y_predict, squared=False)  # signature is (y_true, y_pred)

0.979565447829459

In this dataset, the targets are expressed in hundreds of thousands of dollars, so the RMSE gives a rough idea of the typical prediction error in those units. Here, even though we are evaluating on the same data the model was trained on, the performance is not great: we can expect errors of about $98,000!
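The conversion from RMSE to dollars is plain arithmetic. As a small sketch, the hypothetical `rmse` helper below spells out the formula, and the reported training RMSE is multiplied by 100,000 to get a dollar figure.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


train_rmse = 0.979565447829459          # training RMSE reported above
rmse_in_dollars = train_rmse * 100_000  # targets are in units of $100,000
print(f"${rmse_in_dollars:,.0f}")       # prints $97,957
```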

Now we switch to an RBF kernel to see how it performs. We will use a randomized search with cross-validation to find good values for the `C` and `gamma` hyperparameters.
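Before running the search, it helps to check what the two SciPy distributions actually sample. `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, so `uniform(1, 10)` draws from 1 to 11, while `loguniform(a, b)` draws between `a` and `b` with values spread evenly on a log scale. The sample size and seed below are arbitrary choices for illustration.

```python
from scipy.stats import loguniform, uniform

# Draw samples from the same distributions used in the search space
c_samples = uniform(1, 10).rvs(size=1000, random_state=42)
gamma_samples = loguniform(0.001, 0.1).rvs(size=1000, random_state=42)

print(c_samples.min(), c_samples.max())          # stays within [1, 11]
print(gamma_samples.min(), gamma_samples.max())  # stays within [0.001, 0.1]
```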

In [7]:
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform

svm_reg = make_pipeline(StandardScaler(), SVR())

param_distribution = {
    "svr__gamma": loguniform(0.001, 0.1),
    "svr__C": uniform(1, 10),
}
random_search_cv = RandomizedSearchCV(
    svm_reg, param_distribution, n_iter=100, cv=5, random_state=42
)
random_search_cv.fit(X_train, y_train)

In [8]:
random_search_cv.best_params_

{'svr__C': 4.63629602379294, 'svr__gamma': 0.08781408196485979}

In [9]:
from sklearn.model_selection import cross_val_score

-cross_val_score(
    random_search_cv.best_estimator_,
    X_train,
    y_train,
    scoring="neg_root_mean_squared_error",
)

array([0.58835648, 0.57468589, 0.58085278, 0.57109886, 0.59853029])
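To summarize the five fold scores as a single number, one can average them and look at their spread. A small sketch using the values printed above:

```python
import numpy as np

# Cross-validation RMSEs copied from the output above
cv_rmses = np.array([0.58835648, 0.57468589, 0.58085278, 0.57109886, 0.59853029])

mean_rmse = cv_rmses.mean()
std_rmse = cv_rmses.std()
print(f"mean RMSE: {mean_rmse:.4f}")  # prints mean RMSE: 0.5827
print(f"std RMSE: {std_rmse:.4f}")
```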

This looks much better than the linear model. Let's select this model and evaluate it on the test set.

In [10]:
y_predict = random_search_cv.best_estimator_.predict(X_test)
mean_squared_error(y_test, y_predict, squared=False)

0.5894352084613013

So SVMs worked well on the Wine dataset, but not so well on the California housing dataset. In Chapter 2, we found that Random Forests work better on this dataset.