#### Question 1
**What is the fundamental idea behind Support Vector Machines?**

SVMs are a class of learning algorithms that perform both classification and regression using a geometric, algebraic approach. For classification, the common idea is to find the hyperplane that separates the two classes with the largest possible margin, i.e. the widest "street" between them. For regression the goal is reversed: we look for the hyperplane that fits as many instances as possible inside the margin, while limiting the instances that fall outside it.
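As a rough illustration of the margin idea, the sketch below fits a linear SVC on a hypothetical toy dataset (the points and the names `X_toy`, `y_toy` are assumptions made up for this example) and derives the margin width from the learned normal vector as $2/\|w\|$. A very large `C` is used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # intercept
margin_width = 2.0 / np.linalg.norm(w)

print("normal vector:", w, "intercept:", b)
print("margin width:", margin_width)
```

For these points the closest pair of class boundaries is the segment between (2, 1) and (1, 2) on one side and the point (4, 4) on the other, so the reported width should be close to $5/\sqrt{2}$.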

#### Question 2
**What is a support vector?**

Support vectors are the points of the training set that lie closest to the classification boundary. In fact, from a mathematical perspective, the hyperplane we find by solving the constrained optimization problem depends only on these nearest points. For example, in the case of an SVC, the support vectors are the points on the edge of the margin or inside it. Changing them changes the whole hyperplane, while moving points that lie outside the margin (as long as they stay outside it) doesn't affect the hyperplane at all. So, in a broad sense, the support vectors are the points that *support* the classification boundary.
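One way to check this claim empirically is to refit the classifier on its support vectors alone and compare the two hyperplanes. The sketch below does so on a hypothetical toy dataset (the points and variable names are assumptions for illustration only).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: the first and last points lie far from the boundary
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [6.0, 6.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf_full = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)
sv_idx = clf_full.support_  # indices of the support vectors in X_toy

# Refit using only the support vectors: the other points are discarded
clf_sv = SVC(kernel="linear", C=1e6).fit(X_toy[sv_idx], y_toy[sv_idx])

print("support vectors:\n", X_toy[sv_idx])
print("hyperplane (all points):     ", clf_full.coef_[0], clf_full.intercept_[0])
print("hyperplane (support vectors):", clf_sv.coef_[0], clf_sv.intercept_[0])
```

Both fits should report essentially the same coefficients and intercept, up to the solver's tolerance.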

#### Question 3
**Why is it important to scale the inputs when using an SVM?**

An SVM tries to find the hyperplane that maximizes the distance from the nearest points (maximizing the margin). Since this distance is computed across all features, features with large values dominate it: if the features are not on comparable scales, the SVM will tend to neglect the features with smaller values.
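A minimal sketch of the usual remedy, assuming a hypothetical dataset whose second feature is thousands of times larger than the first: a `StandardScaler` placed in front of the SVM brings every feature to zero mean and unit variance before the margin is computed.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: feature 0 is in units, feature 1 is in thousands
X_toy = np.array([[1.0, 5000.0], [2.0, 1000.0], [1.5, 3000.0],
                  [4.0, 4000.0], [5.0, 2000.0], [4.5, 6000.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print("std per feature before scaling:", X_toy.std(axis=0))

# Scaling and classification chained in a single estimator
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=100))
model.fit(X_toy, y_toy)

X_scaled = model.named_steps["standardscaler"].transform(X_toy)
print("std per feature after scaling: ", X_scaled.std(axis=0))
print("weights on scaled features:    ", model.named_steps["svc"].coef_[0])
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time.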

#### Question 4
**Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?**

The distance of the classified point from the hyperplane can be interpreted as a measure of confidence. In fact, the farther the point is from the boundary, the more confident we are in its classification.
Up to the normalization factor $1/\|\beta\|$, this signed distance is given by the dot product between the point in the feature space ($x$) and the normal vector to the hyperplane ($\beta$, with the intercept absorbed into it):
$$\text{measure of confidence }=\beta^T x$$
On the other hand, an SVM does not natively output a probability for the point $x$ belonging to a specific class. By construction, it can only classify a point based on its position with respect to the decision boundary $H$:
$$\beta^T x \begin{cases}
        > 0 & \text{if } y = 1\\
        < 0 & \text{if } y = -1\\
        = 0 & \text{if } x \in H 
    \end{cases}$$
A model that outputs such a probability directly is logistic regression. In scikit-learn, setting `probability=True` on an `SVC` adds a calibration step (Platt scaling, fitted with internal cross-validation) that maps these scores to probabilities.
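In scikit-learn this confidence score is exposed through `decision_function`. The sketch below, on hypothetical toy data, checks that for a linear kernel it matches the dot product with the normal vector plus the intercept (scikit-learn stores the intercept separately in `intercept_` rather than absorbing it into the weights).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# One point near the boundary, one far from it (both on the positive side)
x_new = np.array([[3.5, 3.5], [6.0, 6.0]])

scores = clf.decision_function(x_new)             # confidence scores
manual = x_new @ clf.coef_[0] + clf.intercept_[0]  # w . x + b computed by hand

print("decision_function:", scores)
print("manual w.x + b:   ", manual)
print("predicted classes:", clf.predict(x_new))
```

The point farther from the boundary gets the larger score, but neither score is a probability.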

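#### Question 5
**Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?**

This question only applies to linear SVMs, since kernelized SVMs can only be trained in the dual form. The computational complexity of the primal form grows roughly linearly with the number of training instances $m$, while that of the dual form grows between $m^2$ and $m^3$. With millions of instances and only hundreds of features, the dual form would be far too expensive, so the primal form is recommended.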
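In scikit-learn, `LinearSVC` lets you pick the formulation through its `dual` parameter. The sketch below is only an illustration on a small synthetic dataset standing in for a much larger one (the sizes and names are assumptions); note that `dual=False` requires the default squared hinge loss.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for a dataset with many more instances than features
X_big, y_big = make_classification(n_samples=2000, n_features=20, random_state=42)

# dual=False solves the primal problem; it is only supported with
# loss="squared_hinge" (the default), not with loss="hinge"
primal_clf = LinearSVC(dual=False, C=1.0)
primal_clf.fit(X_big, y_big)

train_accuracy = primal_clf.score(X_big, y_big)
print("training accuracy:", train_accuracy)
```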

#### Question 6
**Say you have trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease $\gamma$ or C?**

A Gaussian RBF kernel function has the following form: 
$$\phi_\gamma(x,l_i) = \exp{(-\gamma ||x-l_i||^2)} \quad \forall i = 1,...,m$$
C is a hyperparameter of the SVM optimization problem. With the way Géron defines the problem, the higher C is, the more heavily margin violations are penalized: the margin gets narrower and the model fits the training set more closely. The lower C is, the more forgiving the model is: the margin gets wider, more violations are tolerated, and the model is more regularized.
Similarly, a higher $\gamma$ shrinks each instance's range of influence, so the decision boundary becomes more irregular. The higher C and $\gamma$ are, the curvier the classification boundary. So, when the model underfits the training data, I would increase $\gamma$ or C (or both) to obtain a more flexible model.
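To see the effect of the two hyperparameters on an underfitting model, the sketch below trains an RBF-kernel SVC on the synthetic moons dataset with three hypothetical `(gamma, C)` settings, from heavily regularized to very flexible, and reports the training accuracy of each. The dataset and the chosen values are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Hypothetical settings, from most regularized to most flexible
settings = [(0.01, 0.1), (1, 1), (10, 100)]
train_scores = []

for gamma, C in settings:
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
    model.fit(X_moons, y_moons)
    score = model.score(X_moons, y_moons)  # accuracy on the training set
    train_scores.append(score)
    print(f"gamma={gamma}, C={C}: training accuracy = {score:.3f}")
```

A higher training accuracy only shows that underfitting is reduced; whether the more flexible model generalizes better has to be checked on held-out data.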

#### Question 7
**Train a LinearSVC on a linearly separable dataset. Then train an SVC and an SGDClassifier on the same dataset. See if you can get them to produce the same model.**

In [None]:
from sklearn.datasets import load_iris

data = load_iris()
X = data.data # Features
y = data.target # Labels

In [None]:
import numpy as np
y = (data["target"] == 2).astype(np.float64) # Iris Virginica
# Create a label vector where 1 is Iris Virginica and 0 is any other species.
# This reduces the task to a binary classification problem.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

support_vector_classifier = Pipeline([
    ("scaler", StandardScaler()), # Standardize features
    ("svm_clf", LinearSVC(C=1, loss="hinge", random_state=42))
])

# C is the regularization parameter: it controls the trade-off between
# a wide margin and few margin violations on the training set.
# Note: this SVC uses C=2 while the LinearSVC above uses C=1; the two can
# only be expected to give the same model when C matches.
svm_clf = Pipeline([
    ("scaler", StandardScaler()), # Standardize features
    ("svm_clf", SVC(C=2, kernel="linear", random_state=42))
])

# SGDClassifier defaults to the hinge loss, i.e. a linear SVM trained with SGD;
# its regularization strength is set through alpha rather than C.
sgd_clf = Pipeline([
    ("scaler", StandardScaler()), # Standardize features
    ("sgd_clf", SGDClassifier(max_iter=1000, tol=1e-3, random_state=42))
])

In [None]:
support_vector_classifier.fit(X, y)
svm_clf.fit(X, y)
sgd_clf.fit(X, y)
# Train the classifiers

In [None]:
support_vector_classifier.predict([[5.0, 3.0, 1.0, 0.2]])
# Predict the class of a new sample

In [None]:
svm_clf.predict([[5.0, 3.0, 1.0, 0.2]])

In [None]:
sgd_clf.predict([[5.0, 3.0, 1.0, 0.2]])

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(support_vector_classifier, X, y, cv=3, scoring="accuracy")

In [None]:
cross_val_score(svm_clf, X, y, cv=3, scoring="accuracy")

In [None]:
cross_val_score(sgd_clf, X, y, cv=3, scoring="accuracy")

Remember that:
$$\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}=\frac{TP +TN}{TP+FN+TN+FP}$$
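As a worked instance of this formula, the snippet below plugs in the confusion matrix recorded further down in this notebook for the SVC before grid search. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix in scikit-learn's layout: rows = true class, columns = predicted class
cm = np.array([[95, 5],
               [0, 50]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy = {accuracy:.4f}")  # → accuracy = 0.9667
```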

In [None]:
from sklearn.metrics import confusion_matrix

svc_cm = confusion_matrix(y, support_vector_classifier.predict(X))
print(svc_cm)

In [None]:
# Ensure svm_clf is fitted before making predictions
if not hasattr(svm_clf.named_steps['svm_clf'], 'support_'):
    svm_clf.fit(X, y)

svm_cm = confusion_matrix(y, svm_clf.predict(X))
print(svm_cm)

In [None]:
sgd_cm = confusion_matrix(y, sgd_clf.predict(X))
print(sgd_cm)

In [None]:
from sklearn.model_selection import GridSearchCV

# Fine-tuning the SVM with grid search.
param_grid = [
    {'svm_clf__C': [1, 10, 100, 1000], 'svm_clf__kernel': ['linear']},
    {'svm_clf__C': [1, 10, 100, 1000], 'svm_clf__kernel': ['rbf'], 'svm_clf__gamma': [0.001, 0.01, 0.1]}
]

# Create a grid search object.
# It searches for the best combination of hyperparameters in the specified parameter grid.
# cv: number of cross-validation folds to use.
# scoring: metric to optimize during the search.
# refit: whether to refit the model with the best parameters found once the search is complete.
# return_train_score: whether to include training scores in the results.
# Once the best combination of hyperparameters is found, the grid search object behaves like the model refit with those parameters.
grid_search = GridSearchCV(svm_clf, param_grid, cv=5, scoring='accuracy', refit=True, return_train_score=True)
grid_search.fit(X, y)

In [None]:
# C = 10, Linear Kernel is the best model
print("Best parameters:", grid_search.best_params_)

In [None]:
svm_cm = confusion_matrix(y, grid_search.predict(X))
print(svm_cm) 

# confusion matrix before grid search:
#[[95  5]
# [ 0 50]]
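To push the three classifiers toward the same model, one option is to give them the same loss and equivalent regularization. The sketch below is an assumption-laden illustration rather than the procedure used in the cells above: it switches to a subset of Iris that is linearly separable (Setosa vs. Versicolor, petal features only), uses the same hypothetical `C` for `LinearSVC` and `SVC`, and sets `alpha = 1 / (m * C)` for `SGDClassifier`, where `m` is the number of training instances.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = load_iris()
mask = iris.target < 2               # keep Setosa (0) and Versicolor (1) only
X_sep = iris.data[mask][:, 2:]       # petal length and petal width
y_sep = iris.target[mask]

X_sep_scaled = StandardScaler().fit_transform(X_sep)
C = 5                                # hypothetical value, shared by all models
m = len(X_sep_scaled)

models = {
    "LinearSVC": LinearSVC(loss="hinge", C=C, random_state=42),
    "SVC": SVC(kernel="linear", C=C),
    # alpha plays the role of 1 / (m * C) in SGDClassifier's objective
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C),
                                   max_iter=1000, tol=1e-3, random_state=42),
}

for name, model in models.items():
    model.fit(X_sep_scaled, y_sep)
    print(name, model.coef_[0], model.intercept_[0])
```

The printed coefficients and intercepts can then be compared directly; stochastic gradient descent typically lands near, but not exactly on, the solution found by the other two solvers.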

#### Question 8
**Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to perform one-versus-the-rest to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?**

In [None]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_digits 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
data = load_digits()
X = data.data # Features
y = data.target # Labels

# 1797 images of handwritten digits, 8x8 pixels each
# (scikit-learn's small digits dataset, used here instead of the full MNIST)
# 10 classes (0-9)

In [None]:
import matplotlib.pyplot as plt

# Visualize the first digit in the dataset
plt.imshow(data.images[0], cmap=plt.cm.gray_r, interpolation="nearest")
plt.title(f"Digit: {data.target[0]}")
plt.axis("off")
plt.show()

In [None]:
linear_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("svc_cl", LinearSVC(C=1, loss="hinge", random_state=42))
])

linear_svc.fit(X,y)
# Train the classifier

In [None]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(linear_svc, X, y, cv=3, scoring="accuracy")
# Cross-validated accuracy with C=1

In [None]:
# Performing a grid search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'svc_cl__C': [0.001, 0.01, 0.1, 1, 10, 100]}
]

grid_search = GridSearchCV(linear_svc, param_grid, cv=3, scoring="accuracy", refit=True, return_train_score=True)
grid_search.fit(X, y)


In [None]:
cv_post_gs = cross_val_score(grid_search, X, y, cv=3, scoring="accuracy")

In [None]:
print(cv_score)
print(cv_post_gs)

In [None]:
linear_svc

In [None]:
grid_search

In [None]:
# final accuracy reached
from numpy import mean
print(mean(cv_post_gs))
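The scores above come from cross-validation over the whole dataset. A complementary check, sketched here under the assumption of a hypothetical 80/20 stratified split, is to hold out a test set that the model never sees during training. The variable names and the larger `max_iter` are choices made for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_digits, y_digits = load_digits(return_X_y=True)

# Hold out 20% of the images, keeping the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, stratify=y_digits, random_state=42
)

# LinearSVC handles the 10 classes with a one-vs-rest scheme
holdout_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc_cl", LinearSVC(C=1, loss="hinge", max_iter=10000, random_state=42)),
])
holdout_clf.fit(X_tr, y_tr)

test_accuracy = holdout_clf.score(X_te, y_te)
print("held-out test accuracy:", test_accuracy)
```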

#### Question 9
**Train an SVM regressor on the California housing dataset**

General workflow:
1. Scale the data (standard scaling): this reduces training time and usually leads to a better model.
2. Train a model (here: LinearSVR, then SVR with an RBF kernel).
3. Fine-tune the model using GridSearch or RandomizedSearch.
4. Compare the models using a performance measure (here: MSE and RMSE).

In [5]:
from sklearn.datasets import fetch_california_housing
# loading the data
housing = fetch_california_housing()
X = housing["data"]
y = housing["target"]

In [6]:
from sklearn.model_selection import train_test_split
# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Linear support vector regression (no kernel)
linear_svr = Pipeline([
    ("scaler", StandardScaler()),
    ("svc_cl", LinearSVR(random_state=42))
])

In [10]:
linear_svr.fit(X_train, y_train)
# Train the regressor



In [11]:
# Model evaluation on the training set
from sklearn.metrics import mean_squared_error
y_pred = linear_svr.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
mse

0.9641780189948642

In [14]:
import numpy as np
np.sqrt(mse) # RMSE

np.float64(0.9819256687727764)

In [18]:
#scaling the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Training the model on the scaled data:
# nonlinear regression with SVR (Support Vector Regressor) and an RBF kernel.
# The hyperparameters C and gamma are tuned with a randomized search.
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, verbose=2, cv=3, random_state=42)
rnd_search_cv.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END .....C=4.745401188473625, gamma=0.07969454818643935; total time=   8.3s
[CV] END .....C=4.745401188473625, gamma=0.07969454818643935; total time=  10.4s
[CV] END .....C=4.745401188473625, gamma=0.07969454818643935; total time=   9.1s
[CV] END .....C=8.31993941811405, gamma=0.015751320499779727; total time=   8.8s
[CV] END .....C=8.31993941811405, gamma=0.015751320499779727; total time=   8.5s
[CV] END .....C=8.31993941811405, gamma=0.015751320499779727; total time=   8.7s
[CV] END ....C=2.560186404424365, gamma=0.002051110418843397; total time=   8.3s
[CV] END ....C=2.560186404424365, gamma=0.002051110418843397; total time=   8.4s
[CV] END ....C=2.560186404424365, gamma=0.002051110418843397; total time=   8.3s
[CV] END ....C=1.5808361216819946, gamma=0.05399484409787434; total time=   8.1s
[CV] END ....C=1.5808361216819946, gamma=0.05399484409787434; total time=   8.2s
[CV] END ....C=1.5808361216819946, gamma=0.05399

In [20]:
y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)  # keep the targets y intact
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(rmse) # RMSE on the test set

0.5929120979852832


Best model:

*SVR* (RBF kernel): $\gamma$ = 0.08, C = 4.75, test RMSE: 0.59