# Example of SVM regression hyperparameter tuning

This example presents two approaches to systematic hyperparameter search: **Grid Search** and **Randomized Search**. Grid search exhaustively evaluates every combination of the given parameter values, whereas randomized search samples a fixed number of candidates from a parameter space according to specified distributions.
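
To make the difference concrete, the following minimal sketch enumerates the candidates each strategy would produce. The toy search space is hypothetical and deliberately smaller than the one tuned later; `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that generate candidates for the two search classes.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical toy search space (not the one tuned later in this notebook).
grid_space = {"kernel": ["linear", "rbf"], "C": [1, 2, 3]}

# Grid search enumerates every combination: 2 kernels x 3 values of C.
grid_candidates = list(ParameterGrid(grid_space))
print("grid candidates:", len(grid_candidates))

# Randomized search draws a fixed number of candidates (n_iter).
# Lists are sampled uniformly; distributions are sampled via rvs().
random_space = {"kernel": ["linear", "rbf"], "C": loguniform(1e-1, 1e1)}
random_candidates = list(
    ParameterSampler(random_space, n_iter=4, random_state=0)
)
for candidate in random_candidates:
    print(candidate)
```

The grid grows multiplicatively with every added parameter, while the number of randomized candidates stays at whatever `n_iter` is set to.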

Sources:

- [3.2. Tuning the hyper-parameters of an estimator](https://scikit-learn.org/stable/modules/grid_search.html)
    - [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
    - [sklearn.model_selection.RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
- [Introduction to hyperparameter tuning with scikit-learn and Python](https://pyimagesearch.com/2021/05/17/introduction-to-hyperparameter-tuning-with-scikit-learn-and-python/)
    - [Abalone Dataset](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset?resource=download)
- [Hyperparameter tuning using Grid Search and Random Search: A Conceptual Guide](https://medium.com/@jackstalfort/hyperparameter-tuning-using-grid-search-and-random-search-f8750a464b35)

Import the necessary packages:

In [1]:
# general packages
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
import pandas as pd

# additional packages for grid search
from sklearn.model_selection import RepeatedKFold  # also used by the randomized search
from sklearn.model_selection import GridSearchCV

# additional packages for randomized search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# import the class MeasExecTimeOfProgram from the local file MeasExecTimeOfProgram_class.py
from MeasExecTimeOfProgram_class import MeasExecTimeOfProgram

Set path and columns of the abalone dataset for import:

In [2]:
# specify the path of the dataset
CSV_PATH = "./datasets/abalone_dataset.csv"

# specify the column names of our dataframe
COLS = ["Sex", "Length", "Diameter", "Height", "Whole weight",
        "Shucked weight", "Viscera weight", "Shell weight", "Age"]

Load dataset and split it into subsets for training and testing in the ratio 85% to 15%:

In [3]:
# load the dataset, separate the features and labels, and perform a
# training and testing split using 85% of the data for training and
# 15% for evaluation
dataset = pd.read_csv(CSV_PATH, names=COLS, header=0)

# features: all columns except the first (categorical "Sex") and the last (target "Age")
dataX = dataset[dataset.columns[1:-1]]
# target: only the last column ("Age")
dataY = dataset[dataset.columns[-1]]
# split into train and test data subsets (ratio: 85% to 15%)
(trainX, testX, trainY, testY) = train_test_split(dataX, dataY, random_state=3, test_size=0.15, shuffle=True)

Standardize the feature values by computing the **mean**, subtracting the mean from the data points, and then dividing by the **standard deviation**:

In [4]:
scaler = StandardScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)

#testX
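
As a cross-check of the standardization described above, this small sketch reproduces `StandardScaler` by hand on hypothetical toy data. It assumes the scaler's default settings, which use the population standard deviation (`ddof=0`), and it applies the training statistics unchanged to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training samples and 1 test sample, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

# Statistics come from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std

# StandardScaler should give the same values.
toy_scaler = StandardScaler()
scaled_train = toy_scaler.fit_transform(X_train)
scaled_test = toy_scaler.transform(X_test)

print(np.allclose(scaled_train, manual_train))
print(np.allclose(scaled_test, manual_test))
```

Fitting the scaler on the training subset and only transforming the test subset keeps information about the test data out of the preprocessing step.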

## Finding a baseline

The aim of this step is to establish a baseline on the [Abalone Dataset](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset?resource=download) by training a **Support Vector Regression (SVR)** model with its default hyperparameters.

Train the model with **no tuning of hyperparameters** and measure the training time, so that later improvements can be compared against it:

In [5]:
model = SVR()

# initiate measuring execution time
execTime = MeasExecTimeOfProgram()
execTime.start()

model.fit(trainX, trainY)

# print time delta
print('Execution time: {:.2f} s'.format(execTime.stop()/1000))

Execution time: 0.67 s


Evaluate the model on the test subset using the R^2 score (coefficient of determination; the best possible value is 1.0):

In [6]:
print("R2: {:.2f}".format(model.score(testX, testY)))

R2: 0.51
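
For reference, the R^2 score returned by `score` can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical targets and predictions, not the abalone results.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions (not taken from the abalone data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_library = r2_score(y_true, y_hat)
print("manual: {:.3f}, r2_score: {:.3f}".format(r2_manual, r2_library))
```

A score of 0 corresponds to always predicting the mean of the targets, and negative values are possible for models that do worse than that.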


In [7]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(testX)

# store the result under a new name so the imported function is not shadowed
mse = mean_squared_error(testY, y_pred)

# the MSE is in squared units of the target, not a percentage
print("Mean squared error: {:.2f}".format(mse))

Mean squared error: 4.63
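
The mean squared error is the average of the squared residuals, so it is expressed in squared units of the target; its square root (RMSE) is in the target's own units. A minimal sketch on hypothetical values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions (not taken from the abalone data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

mse_manual = np.mean((y_true - y_hat) ** 2)  # average squared residual
rmse_manual = np.sqrt(mse_manual)            # back in the target's units

mse_library = mean_squared_error(y_true, y_hat)
print("MSE: {:.3f}, RMSE: {:.3f}".format(mse_manual, rmse_manual))
```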


## Grid Search

Initialize the SVR model and define the **space of the hyperparameters** to perform the **grid-search** over:

In [44]:
model = SVR()
kernel = ["linear", "rbf", "sigmoid", "poly"]
tolerance = [1e-3, 1e-4, 1e-5, 1e-6]
C = [1, 1.5, 2, 2.5, 3]
grid = dict(kernel=kernel, tol=tolerance, C=C)
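
The size of this search space can be checked before starting the search. The sketch below rebuilds the same dictionary under a different name (`search_space`, chosen here only to avoid overwriting `grid`) and counts its combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined above, under a separate name.
search_space = dict(
    kernel=["linear", "rbf", "sigmoid", "poly"],
    tol=[1e-3, 1e-4, 1e-5, 1e-6],
    C=[1, 1.5, 2, 2.5, 3],
)

# 4 kernels x 4 tolerances x 5 values of C
n_candidates = len(ParameterGrid(search_space))
print("candidates:", n_candidates)
```

Every one of these candidates is fitted once per cross-validation split, which is what drives the run time of the grid search.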

Initialize a **repeated k-fold cross-validation** splitter (10 folds, 3 repeats) and **perform a grid search** to tune the hyperparameters:

In [45]:
cvFold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
gridSearch = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1,
                          cv=cvFold, scoring="neg_mean_squared_error")

# initiate measuring execution time
execTime = MeasExecTimeOfProgram()
execTime.start()

searchResults = gridSearch.fit(trainX, trainY)

# print time delta
print('Execution time: {:.2f} s'.format(execTime.stop()/1000))

Execution time: 415.62 s
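
Much of this run time follows from the number of model fits. A splitter configured like `cvFold` yields `n_splits * n_repeats` train/validation splits, as the following sketch on hypothetical toy data shows.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Hypothetical toy data: 20 samples, 2 features.
toy_X = np.arange(40).reshape(20, 2)

toy_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_splits_total = toy_cv.get_n_splits(toy_X)
splits = list(toy_cv.split(toy_X))

print("splits per candidate:", n_splits_total)
# Each split holds out 20 / 10 = 2 samples for validation.
print("validation sizes:", {len(val_idx) for _, val_idx in splits})
```

With 30 splits per candidate and the 4 x 4 x 5 = 80 candidates of the grid, the search performs 2,400 cross-validation fits, plus one final refit of the best candidate on the full training set.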


Extract the best model and evaluate it:

In [46]:
bestModel = searchResults.best_estimator_

print("R2: {:.2f}".format(bestModel.score(testX, testY)))

R2: 0.52


In [47]:
from sklearn.metrics import mean_squared_error

y_pred = bestModel.predict(testX)

# store the result under a new name so the imported function is not shadowed
mse = mean_squared_error(testY, y_pred)

# the MSE is in squared units of the target, not a percentage
print("Mean squared error: {:.2f}".format(mse))

Mean squared error: 4.49


In [48]:
bestModel.get_params()

{'C': 3,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 1e-05,
 'verbose': False}
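
Beyond the best estimator, a fitted search object also exposes the cross-validated score of every candidate. The sketch below is an illustration on synthetic data with a deliberately small, hypothetical grid so that it runs in seconds; it is not a rerun of the abalone search. Because the scoring is `neg_mean_squared_error`, the stored scores are negated MSE values, so higher (closer to zero) is better.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: used only to keep the example fast).
demo_X, demo_y = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

demo_search = GridSearchCV(
    estimator=SVR(),
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 3]},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
)
demo_search.fit(demo_X, demo_y)

print("best parameters:", demo_search.best_params_)
# Negate the score to recover the cross-validated MSE.
print("cross-validated MSE:", -demo_search.best_score_)

# One row per candidate, ordered from best to worst.
results = pd.DataFrame(demo_search.cv_results_)
ranking = results[
    ["param_kernel", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(ranking)
```

Comparing `mean_test_score` across rows shows how much the candidates actually differ, which helps judge whether a finer or wider grid is worthwhile.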

## Randomized Search

Initialize the SVR model and define the **space of the hyperparameters** to perform the **randomized-search** over:

In [51]:
model = SVR()
kernel = ["linear", "rbf", "sigmoid", "poly"]
tolerance = loguniform(1e-6, 1e-3)
C = [1, 1.5, 2, 2.5, 3]
grid = dict(kernel=kernel, tol=tolerance, C=C)
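
Unlike the fixed list used for the grid search, `loguniform(1e-6, 1e-3)` is a continuous distribution whose samples are spread evenly across orders of magnitude. The following sketch draws samples and counts how many fall into each decade; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

tol_dist = loguniform(1e-6, 1e-3)

# Arbitrary sample size and seed, chosen for a stable illustration.
samples = tol_dist.rvs(size=10_000, random_state=0)

# Count samples per decade: [1e-6, 1e-5), [1e-5, 1e-4), [1e-4, 1e-3].
decade_edges = [1e-6, 1e-5, 1e-4, 1e-3]
counts, _ = np.histogram(samples, bins=decade_edges)
fractions = counts / samples.size

print("min: {:.2e}, max: {:.2e}".format(samples.min(), samples.max()))
print("fraction per decade:", np.round(fractions, 3))
```

Each decade receives roughly one third of the samples, whereas a plain uniform distribution over the same interval would place about 90% of them in the top decade.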

Initialize a **repeated k-fold cross-validation** splitter (10 folds, 3 repeats) and **perform a randomized search** to tune the hyperparameters:

In [52]:
cvFold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

randomSearch = RandomizedSearchCV(estimator=model, n_jobs=-1,
                                  cv=cvFold, param_distributions=grid,
                                  scoring="neg_mean_squared_error")

# initiate measuring execution time
execTime = MeasExecTimeOfProgram()
execTime.start()

searchResults = randomSearch.fit(trainX, trainY)

# print time delta
print('Execution time: {:.2f} s'.format(execTime.stop()/1000))

Execution time: 60.81 s
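
The shorter run time is expected: the search above does not set `n_iter`, so scikit-learn's default of 10 sampled candidates applies, giving 10 x 30 = 300 cross-validation fits instead of the 2,400 of the full grid. The sketch below writes both `n_iter` and `random_state` out explicitly; it uses synthetic data and a reduced, hypothetical search space so that it runs quickly, and the seed is an assumption added for reproducibility.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: used only to keep the example fast).
demo_X, demo_y = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

demo_space = {
    "kernel": ["linear", "rbf"],
    "tol": loguniform(1e-6, 1e-3),
    "C": [1, 1.5, 2, 2.5, 3],
}

demo_random = RandomizedSearchCV(
    estimator=SVR(),
    param_distributions=demo_space,
    n_iter=10,       # the scikit-learn default, written out explicitly
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
    random_state=0,  # assumption: fixed seed so the same candidates are drawn
)
demo_random.fit(demo_X, demo_y)

n_evaluated = len(demo_random.cv_results_["params"])
print("candidates evaluated:", n_evaluated)
print("best parameters:", demo_random.best_params_)
```

Raising `n_iter` trades run time for a denser sampling of the space, and without a fixed `random_state` the sampled candidates, and therefore the best parameters found, can change from run to run.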


Extract the best model and evaluate it:

In [53]:
bestModel = searchResults.best_estimator_

print("R2: {:.2f}".format(bestModel.score(testX, testY)))

R2: 0.52


In [54]:
from sklearn.metrics import mean_squared_error

y_pred = bestModel.predict(testX)

# store the result under a new name so the imported function is not shadowed
mse = mean_squared_error(testY, y_pred)

# the MSE is in squared units of the target, not a percentage
print("Mean squared error: {:.2f}".format(mse))

Mean squared error: 4.53


In [55]:
bestModel.get_params()

{'C': 2,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.0006847359084131038,
 'verbose': False}