# Exercise - SVM for regression

1. Standardize the data and implement a linear, polynomial, and RBF SVM. What is the performance (MSE) of each model now? Is the linear model still best?
1. Split your training data (again using $\texttt{train_test_split}$) to obtain a validation set. Optimize the performance of your model on the validation data, focusing particularly on regularization ($C$). Can you achieve a test MSE below 10 (this is not trivial!)? The original paper reports an MSE of 7.2 (although it is not directly comparable). Remember to use standardization!

**Note**: Large values of $C$ may be VERY slow to fit (for some of the models)! Avoid extreme values, as the fit may take very long or your code may crash.

**See slides for more details!**

In [1]:
#from sklearn.datasets import load_boston # NOTE how we use the Boston data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error # NOTE how we use a new metric!
from sklearn import svm
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

data = './data/HousingData.csv'
raw_df = pd.read_csv(data).dropna()

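# NOTE: the target is read from column index 2 and the features from column index 3 onward.
# Check this against the column order of your CSV before trusting the numbers below.
y = raw_df.values[:, 2]
X = raw_df.values[:, 3:]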

#X, y = load_boston(return_X_y=True) # `load_boston` has been removed from scikit-learn since version 1.2.

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(315, 11) (79, 11) (315,) (79,)


# Exercise 1

Standardize the data and implement a linear, polynomial, and RBF SVM. What is the performance (MSE) of each model now? Is the linear model still best?

In [2]:
# We will call the standardized X "Z".
# It is important that you fit your scaler only on your training data, not on your test data!
# This is akin to fitting your model: you do not want to "peek" at the test data.

scaler = StandardScaler()

Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)
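
To make the effect of the scaler concrete, here is a minimal sketch on synthetic data (not the housing data; all names with a `demo` suffix or prefix are made up for illustration). It checks that the training features end up with mean 0 and standard deviation 1, while the test features are transformed with the training statistics and therefore only land near those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration): 3 features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 200.0], size=(200, 3))
X_demo_train, X_demo_test = X_demo[:160], X_demo[160:]

demo_scaler = StandardScaler()
Z_demo_train = demo_scaler.fit_transform(X_demo_train)  # learns mean/std from the training rows only
Z_demo_test = demo_scaler.transform(X_demo_test)        # reuses the training mean/std

print(Z_demo_train.mean(axis=0).round(3))  # approximately 0 for every feature
print(Z_demo_train.std(axis=0).round(3))   # approximately 1 for every feature
print(Z_demo_test.mean(axis=0).round(3))   # close to 0, but generally not exactly 0
```

The test set is never used to compute the mean or standard deviation, which is exactly what keeps it "unseen".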

In [3]:
# Linear SVM
svm_linear = svm.SVR(kernel='linear')
svm_linear.fit(Z_train, y_train) # remember fit to Z_train, NOT X_train

# ... and predicting
y_test_hat_linear = svm_linear.predict(Z_test) # And Z_test here instead of X_test
mse_linear = mean_squared_error(y_test, y_test_hat_linear) # signature is (y_true, y_pred)
print(f'Linear SVM achieved {round(mse_linear, 3)} MSE.')

Linear SVM achieved 16.74 MSE.


In [4]:
# Polynomial SVM (you decide degree)
svm_poly = svm.SVR(kernel='poly', degree=2)
svm_poly.fit(Z_train, y_train)

# ... and predicting
y_test_hat_poly = svm_poly.predict(Z_test)
mse_poly = mean_squared_error(y_test, y_test_hat_poly) # signature is (y_true, y_pred)
print(f'Polynomial SVM achieved {round(mse_poly, 3)} MSE.')

Polynomial SVM achieved 30.633 MSE.


In [5]:
# RBF SVM
svm_rbf = svm.SVR(kernel='rbf')
svm_rbf.fit(Z_train, y_train)

# ... and predicting
y_test_hat_rbf = svm_rbf.predict(Z_test)
mse_rbf = mean_squared_error(y_test, y_test_hat_rbf) # signature is (y_true, y_pred)
print(f'RBF SVM achieved {round(mse_rbf, 3)} MSE.')

RBF SVM achieved 12.779 MSE.


# Exercise 2

Split your training data (again using $\texttt{train_test_split}$) to obtain a validation set. Optimize the performance of your model on the validation data, focusing particularly on regularization ($C$). Can you achieve a test MSE below 10 (this is not trivial!)? The original paper reports an MSE of 7.2 (although it is not directly comparable). Remember to use standardization!

In [6]:
# Start by splitting the train data to also obtain validation data
Z_train, Z_val, y_train, y_val = train_test_split(Z_train, y_train, test_size=0.2, random_state=42)
print(Z_train.shape, Z_val.shape, Z_test.shape, y_train.shape, y_val.shape, y_test.shape)

(252, 11) (63, 11) (79, 11) (252,) (63,) (79,)
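
Note that the split above reuses `Z_train`, which was scaled with statistics from the full training set, so the validation rows had a small influence on the scaler. A stricter variant is to split first and fit the scaler on the training portion only. The following is a sketch of that order of operations on synthetic data; the `toy` names are hypothetical and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# 1) Split off the test set, 2) split the remainder into train and validation.
X_rest, X_te, y_rest, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.2, random_state=42)

# 3) Fit the scaler on the training portion only, then apply it to all three sets.
toy_scaler = StandardScaler().fit(X_tr)
Z_tr = toy_scaler.transform(X_tr)
Z_va = toy_scaler.transform(X_va)
Z_te = toy_scaler.transform(X_te)

print(Z_tr.shape, Z_va.shape, Z_te.shape)
```

With small datasets the difference is usually minor, but this order guarantees that neither validation nor test rows influence the preprocessing.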


In [7]:
# Now try different values of kernels, C, epsilon, as well as any other settings you want to tune

kernels = ['linear', 'poly', 'rbf'] # input values separated by ",".
Cs = [0.001, 0.5, 1, 10, 100] # input values separated by ",".
epsilons = [0.01, 1] # input values separated by ",".

results = []

for kernel in kernels:
    for C in Cs:
        for epsilon in epsilons:
            svm_current = svm.SVR(kernel=kernel, C=C, epsilon=epsilon)
            svm_current.fit(Z_train, y_train)
            y_val_hat = svm_current.predict(Z_val)
            mse = mean_squared_error(y_val, y_val_hat) # signature is (y_true, y_pred)

            results.append([mse, kernel, C, epsilon])

results = pd.DataFrame(results)
results.columns = ['MSE', 'Kernel', 'C', 'epsilon']
print(results)

          MSE  Kernel        C  epsilon
0   46.227528  linear    0.001     0.01
1   43.848799  linear    0.001     1.00
2   14.727000  linear    0.500     0.01
3   13.971696  linear    0.500     1.00
4   14.545967  linear    1.000     0.01
5   13.044947  linear    1.000     1.00
6   13.190397  linear   10.000     0.01
7   13.196421  linear   10.000     1.00
8   13.260039  linear  100.000     0.01
9   13.198375  linear  100.000     1.00
10  56.179933    poly    0.001     0.01
11  52.872447    poly    0.001     1.00
12  21.900023    poly    0.500     0.01
13  22.188300    poly    0.500     1.00
14  22.277434    poly    1.000     0.01
15  21.558435    poly    1.000     1.00
16  24.992528    poly   10.000     0.01
17  19.395817    poly   10.000     1.00
18  28.791545    poly  100.000     0.01
19  22.473270    poly  100.000     1.00
20  56.936973     rbf    0.001     0.01
21  53.603987     rbf    0.001     1.00
22  18.384409     rbf    0.500     0.01
23  18.928444     rbf    0.500     1.00


In [8]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].min()]

         MSE Kernel      C  epsilon
29  7.184748    rbf  100.0      1.0
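
An equivalent way to pull out the best configuration is `idxmin`, which returns the index label of the smallest value. The sketch below uses a small made-up results table (the values are illustrative only, not results from the housing data) with the same column names as above.

```python
import pandas as pd

# Made-up illustrative values, NOT results from the housing data.
toy_results = pd.DataFrame(
    [[14.2, 'linear', 1, 0.01],
     [9.8, 'rbf', 10, 1.0],
     [21.5, 'poly', 10, 0.01]],
    columns=['MSE', 'Kernel', 'C', 'epsilon'],
)

# Row with the smallest validation MSE.
best_row = toy_results.loc[toy_results['MSE'].idxmin()]
print(best_row['Kernel'], best_row['C'], best_row['epsilon'])

# Sorting also shows the runner-up configurations.
print(toy_results.sort_values('MSE').head(3))
```

Looking at the sorted table rather than only the single best row helps you judge whether the winner is clearly better or only marginally ahead of its neighbours.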


In [9]:
# Initialize your final model
svm_optimized = svm.SVR(kernel='rbf', C=100, epsilon=1)

# Use both training and validation data to fit it (np.concatenate "stacks" the arrays, like rbind in R)
svm_optimized.fit(np.concatenate([Z_train, Z_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = svm_optimized.predict(Z_test)

# Obtain and check the MSE on test data
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized SVM achieved {round(mse_optimized, 3)} MSE.')

Optimized SVM achieved 3.811 MSE.
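
A single validation split can be noisy on a dataset this small. One possible alternative, sketched here on synthetic data, is to wrap the scaler and the SVR in a pipeline and let `GridSearchCV` do cross-validated tuning, so the scaler is refit inside every fold. The data, the grid values, and the `syn` names are assumptions for illustration and not part of the exercise solution.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (an assumption for illustration, not the housing data).
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# make_pipeline names each step after its lowercased class, hence the 'svr__' prefix.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [0.5, 1, 10, 100],
    'svr__epsilon': [0.01, 1],
}

# scikit-learn maximizes scores, so MSE is exposed as its negative.
search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(X_syn_train, y_syn_train)

print(search.best_params_)

# By default the best configuration is refit on all training data before predicting.
test_mse = mean_squared_error(y_syn_test, search.predict(X_syn_test))
print(f'Test MSE of the refit best model: {test_mse:.3f}')
```

As in the manual approach, the test set is touched only once, after the hyperparameters have been chosen.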
