
# Exercise - SVM for Regression

1. Standardize the data and implement a linear, polynomial, and RBF SVM.

   What is the performance (MSE) of each model now?

   Is the linear model still best?


2. Try to split your training data (again using $\texttt{train_test_split}$) to obtain a validation set.

   Try to optimize the performance of your model on the validation data, focusing particularly on regularization ($C$).
   
   Can you achieve test MSE below 10 (this is not trivial!)?
   
   In the original paper, they achieve an MSE of 7.2 (although it is not directly comparable).
   
   Remember to use standardization!

**Note**: Large values of $C$ may be VERY slow to fit (for some of the models)!

Try not to go too extreme, as your code may crash.

**See slides for more details!**
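
The note on large values of $C$ can be made concrete with the `max_iter` parameter of `SVR`, which caps the number of solver iterations. The following is a minimal sketch, assuming synthetic data from `make_regression` as a stand-in for the exercise data; the names `X_demo`, `y_demo`, `Z_demo`, and `hit_cap`, the cap of 10,000 iterations, and the chosen values of $C$ are illustrative assumptions, not recommendations.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 samples with 13 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
Z_demo = StandardScaler().fit_transform(X_demo)

hit_cap = {}
for C in [1, 100, 10_000]:
    # max_iter = -1 (the default) means "no limit"; a positive value stops the solver early.
    model = SVR(kernel='linear', C=C, max_iter=10_000)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        model.fit(Z_demo, y_demo)
    # A ConvergenceWarning signals that the solver stopped at the cap before converging.
    hit_cap[C] = any(issubclass(w.category, ConvergenceWarning) for w in caught)

print(hit_cap)
```

A model that hits the cap is not fully optimized, so its MSE should be read with care; the cap only guards against fits that run for a very long time.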

In [None]:
from sklearn.datasets import load_boston # NOTE how we use the Boston data (load_boston was removed in scikit-learn 1.2, so this import needs an older version)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error # NOTE how we use a new metric!
from sklearn import svm
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

X, y = load_boston(return_X_y = True)

# Use 'train_test_split' to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 9)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(404, 13) (102, 13) (404,) (102,)



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

## Exercise 1.

Standardize the data and implement a linear, polynomial, and RBF SVM.

What is the performance (MSE) of each model now?

Is the linear model still best?

In [None]:
# We will call the standardized X "Z".

# It is important that you fit your scaler only on your training data, not on your test data!

# This is akin to fitting your model: you do not want to "peek" at the test data.

scaler = StandardScaler()

Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)
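
The fit-on-train-only rule from the cell above can also be enforced by wrapping the scaler and the model in a scikit-learn pipeline. This is a minimal sketch, assuming synthetic data from `make_regression` instead of the exercise data; all names ending in `_demo`, `_tr`, and `_te` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 samples with 13 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# fit() standardizes with statistics from X_tr only; predict() reuses those statistics.
pipe = make_pipeline(StandardScaler(), SVR(kernel='linear'))
pipe.fit(X_tr, y_tr)
y_te_hat = pipe.predict(X_te)

# The stored means equal the training-set column means, so the test data was never used.
fitted_scaler = pipe.named_steps['standardscaler']
means_from_train_only = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print(means_from_train_only)
```

With a pipeline, the raw `X` arrays are passed around and the scaling step cannot be forgotten or fitted on the wrong split.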

In [None]:
# Linear SVM
svm_linear = svm.SVR(kernel = 'linear')
svm_linear.fit(Z_train, y_train) # Remember to fit on Z_train, NOT X_train

# ... and predicting
y_test_hat_linear = svm_linear.predict(Z_test) # And use Z_test here instead of X_test
mse_linear = mean_squared_error(y_test, y_test_hat_linear) # Arguments are (y_true, y_pred)
print(f'Linear SVM achieved {round(mse_linear, 3)} MSE.')

Linear SVM achieved 28.574 MSE.


In [None]:
# Polynomial SVM (you decide the degree)
svm_poly = svm.SVR(kernel = 'poly', degree = 3)
svm_poly.fit(Z_train, y_train)

# ... and predicting
y_test_hat_poly = svm_poly.predict(Z_test)
mse_poly = mean_squared_error(y_test, y_test_hat_poly) # Arguments are (y_true, y_pred)
print(f'Polynomial SVM achieved {round(mse_poly, 3)} MSE.')

Polynomial SVM achieved 31.457 MSE.


In [None]:
# RBF SVM
svm_rbf = svm.SVR(kernel = 'rbf')
svm_rbf.fit(Z_train, y_train)

# ... and predicting
y_test_hat_rbf = svm_rbf.predict(Z_test)
mse_rbf = mean_squared_error(y_test, y_test_hat_rbf) # Arguments are (y_true, y_pred)
print(f'RBF SVM achieved {round(mse_rbf, 3)} MSE.')

RBF SVM achieved 34.238 MSE.
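
The three cells above repeat the same fit, predict, and score steps, so the comparison can also be written as a loop. This is a minimal sketch, assuming synthetic data from `make_regression` in place of the exercise data; the resulting numbers are therefore not comparable to the MSE values printed above, and all `_demo`, `_tr`, and `_te` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 samples with 13 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Fit the scaler on the training split only, then apply it to both splits.
scaler_demo = StandardScaler()
Z_tr = scaler_demo.fit_transform(X_tr)
Z_te = scaler_demo.transform(X_te)

mse_by_kernel = {}
for kernel in ['linear', 'poly', 'rbf']:
    model = SVR(kernel=kernel)  # default C and epsilon; degree defaults to 3 for 'poly'
    model.fit(Z_tr, y_tr)
    mse_by_kernel[kernel] = mean_squared_error(y_te, model.predict(Z_te))

# Print the kernels from lowest to highest test MSE.
for kernel, mse in sorted(mse_by_kernel.items(), key=lambda kv: kv[1]):
    print(f'{kernel:>6}: {mse:.3f} MSE')
```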


## Exercise 2.

Try to split your training data (again using $\texttt{train_test_split}$) to obtain a validation set.

Try to optimize the performance of your model on the validation data, focusing particularly on regularization ($C$).

Can you achieve test MSE below 10 (this is not trivial!)? 

In the original paper, they achieve an MSE of 7.2 (although it is not directly comparable).

Remember to use standardization!

In [None]:
# Start by splitting the train data to also obtain validation data
Z_train, Z_val, y_train, y_val = train_test_split(Z_train, y_train, test_size = 0.2, random_state = 9)
print(Z_train.shape, Z_val.shape, Z_test.shape, y_train.shape, y_val.shape, y_test.shape)

(323, 13) (81, 13) (102, 13) (323,) (81,) (102,)


In [None]:
# Now try different kernels and different values of C and epsilon, as well as any other settings you want to tune

kernels = ['linear', 'poly', 'rbf', 'sigmoid'] # Input values separated by ",".
Cs = list(range(1, 1002, 100)) # 1, 101, ..., 1001
epsilons = list(np.arange(0, 1.1, 0.1)) # 0.0, 0.1, ..., 1.0

results = []

for kernel in kernels:
    for C in Cs:
        for epsilon in epsilons:
            svm_current = svm.SVR(kernel = kernel, C = C, epsilon = epsilon)
            svm_current.fit(Z_train, y_train)
            y_val_hat = svm_current.predict(Z_val)
            mse = mean_squared_error(y_val, y_val_hat) # Arguments are (y_true, y_pred)

            results.append([mse, kernel, C, epsilon])

results = pd.DataFrame(results, columns = ['MSE', 'Kernel', 'C', 'epsilon'])
print(results)

              MSE   Kernel     C  epsilon
0    2.434936e+01   linear     1      0.0
1    2.440257e+01   linear     1      0.1
2    2.426419e+01   linear     1      0.2
3    2.424726e+01   linear     1      0.3
4    2.424111e+01   linear     1      0.4
..            ...      ...   ...      ...
479  9.275608e+06  sigmoid  1001      0.6
480  9.275440e+06  sigmoid  1001      0.7
481  9.275272e+06  sigmoid  1001      0.8
482  9.275105e+06  sigmoid  1001      0.9
483  9.274937e+06  sigmoid  1001      1.0

[484 rows x 4 columns]


In [None]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].min()]

           MSE Kernel    C  epsilon
257  12.552348    rbf  101      0.4
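
The boolean mask used to extract the best parameters can also be written with `idxmin`, and `nsmallest` shows the runners-up as well. This is a minimal sketch on a small hypothetical table: `results_demo` has the same columns as `results`, but its values are made up for illustration and are not results from the grid search.

```python
import pandas as pd

# Hypothetical table (assumption): same columns as `results`, invented values.
results_demo = pd.DataFrame(
    [
        [24.3, 'linear', 1, 0.0],
        [12.6, 'rbf', 101, 0.4],
        [13.1, 'rbf', 201, 0.4],
        [30.0, 'poly', 1, 0.1],
    ],
    columns=['MSE', 'Kernel', 'C', 'epsilon'],
)

# Row with the lowest validation MSE, returned as a Series.
best_row = results_demo.loc[results_demo['MSE'].idxmin()]

# The three best settings, sorted by MSE in ascending order.
top3 = results_demo.nsmallest(3, 'MSE')

print(best_row)
print(top3)
```

Looking at several of the best rows helps to judge whether the winning setting is clearly better or only marginally ahead of its neighbours.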


In [None]:
# Initialize your final model
svm_optimized = svm.SVR(kernel = 'rbf', C = 101, epsilon = 0.4)

# Use both training and validation data to fit it (np.concatenate "stacks" the arrays like rbind in R)
svm_optimized.fit(np.concatenate([Z_train, Z_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = svm_optimized.predict(Z_test)

# Obtain and check MSE on test data
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized SVM achieved {round(mse_optimized, 3)} MSE.')

Optimized SVM achieved 9.91 MSE.
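
The manual loop over kernels, $C$, and epsilon tunes on a single validation split. A cross-validated grid search is one possible alternative. This is a minimal sketch, assuming synthetic data from `make_regression` and a much smaller, arbitrarily chosen grid than the one used above; the step names `scale` and `svr` and all `_demo`, `_tr`, and `_te` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 samples with 13 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [1, 10, 100],
    'svr__epsilon': [0.1, 0.5],
}

# scikit-learn maximizes scores, hence the negated MSE.
search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(X_tr, y_tr)

# By default the best setting is refitted on all of X_tr, so predict() works directly.
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(search.best_params_, round(test_mse, 3))
```

Because the scaler is part of the pipeline, no validation fold contributes to the standardization statistics, which differs slightly from scaling once and splitting afterwards.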
