# Machine Learning 3 - Support Vector Machines

An SVM classifier builds a set of hyperplanes that separate the classes while maximizing the margin, i.e. the distance between each decision boundary and the closest data points.

![SVM](http://scikit-learn.org/stable/_images/sphx_glr_plot_separating_hyperplane_0011.png "Decision border in an SVM")

This separation is generally not possible in the original data space. The SVM therefore first maps the data, implicitly through a kernel function, into a high- or infinite-dimensional space in which a linear separation can be found. Common kernels are linear (no remapping), polynomial, and, most commonly, "RBF" (radial basis function).
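
To illustrate why the kernel matters, here is a minimal sketch on a synthetic dataset (scikit-learn's `make_circles`, not CIFAR-10; the sample size, ring ratio, and noise level are arbitrary assumptions). The two classes form concentric rings, so no straight line separates them in the original space, and the RBF kernel should do much better than the linear one.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (assumption): two concentric rings in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit one SVC per kernel with scikit-learn's default settings.
kernel_scores = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    kernel_scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel:>6}: test accuracy = {kernel_scores[kernel]:.3f}")
```

The exact scores depend on the random seed and noise level, so they are printed rather than quoted here.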

In [1]:
from lab_tools import CIFAR10, evaluate_classifier, get_hog_image

dataset = CIFAR10('./CIFAR10/')

Pre-loading training data
Pre-loading test data


**Build a simple SVM** using [the SVC (Support Vector Classification) from sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).
**Train** it on the CIFAR dataset.

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

# -- Your code here -- #
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Hold out the last 10% of the training set as a validation set
split_val = 0.1
n_val = int(split_val * len(dataset.train["hog"]))
train_X = dataset.train["hog"][:-n_val]
train_Y = dataset.train["labels"][:-n_val]
val_X = dataset.train["hog"][-n_val:]
val_Y = dataset.train["labels"][-n_val:]

# Fit an SVC with default parameters (RBF kernel) and score it on the validation set
model = SVC()
model.fit(train_X, train_Y)
pred_model = model.predict(val_X)
score_model = accuracy_score(val_Y, pred_model)
print(f"Predictive model: {score_model}")


Predictive model: 0.816
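
The split above holds out the last 10% of the training rows, which implicitly assumes the rows are not ordered by class. As an alternative sketch, `train_test_split` can shuffle the data and stratify the hold-out set by label. The synthetic data below is only a stand-in (an assumption) for `dataset.train["hog"]` and `dataset.train["labels"]`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 300 samples, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Shuffle, then hold out 10% while preserving the class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, shuffle=True, random_state=42
)

print("train class counts:", np.bincount(y_train))
print("val class counts:  ", np.bincount(y_val))
```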


**Explore the classifier**. How many support vectors are there? What are support vectors?

In [6]:
all_support_vectors = model.support_vectors_  # One row per support vector; one column per HOG feature (256 here), not raw 32x32 pixels
vectors_per_class = model.n_support_  # Number of support vectors for each class

# -- Your code here -- #
print(all_support_vectors.shape)
print(vectors_per_class)

(7930, 256)
[2483 3193 2254]
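
Support vectors are the training points that lie on or inside the margin (or on the wrong side of it); they are the only points that determine the decision boundary. The sketch below, on synthetic blobs (an assumption, not the CIFAR data), shows how the fitted attributes relate to each other: `support_` holds the indices of the support vectors in the training set, `support_vectors_` holds the corresponding rows, and `n_support_` counts them per class.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Stand-in data (assumption): three overlapping 2D clusters.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)

print("support_vectors_ shape:", clf.support_vectors_.shape)  # (n_support_vectors, n_features)
print("n_support_ per class:  ", clf.n_support_)               # ordered like clf.classes_
print("total support vectors: ", clf.n_support_.sum())

# The support vectors are rows of the training set, selected by clf.support_.
rows_match = np.allclose(X[clf.support_], clf.support_vectors_)
print("support_vectors_ are training rows:", rows_match)
```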


**Try to find the best "C" (error penalty) and "gamma" parameters** using cross-validation. What influence does "C" have on the number of support vectors?
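
As a rough illustration of the second question: a small "C" tolerates more margin violations, which typically leaves more training points inside the margin and therefore more support vectors, while a larger "C" penalizes violations more heavily. The sketch below counts support vectors for a few values of "C" on synthetic data (an assumption; the trend on the HOG features may differ in magnitude).

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in data (assumption): 600 samples, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Refit the same RBF SVC with increasing C and count the support vectors.
support_counts = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    support_counts[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} -> {support_counts[C]} support vectors out of {len(X)} samples")
```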

In [7]:
# -- Your code here -- #
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

X, y = dataset.train["hog"], dataset.train["labels"]
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

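param_grid = {
    'C': [1, 1.5, 2, 5, 10, 100],
    # NOTE: `float` is the Python type, not a numeric value. SVC rejects it,
    # which is why one third of the fits fail in the output below.
    # Use numeric values (e.g. 0.01) to actually search over gamma.
    'gamma': ["scale", "auto", float],
}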

model = SVC()
grid_search = GridSearchCV(model, param_grid, cv=skf, scoring='accuracy', verbose=1)

grid_search.fit(X, y)

print("Best result model: ", grid_search.best_score_)
print("Best parameters model: ", grid_search.best_params_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


30 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", l

Best result model:  0.8268666666666666
Best parameters model:  {'C': 5, 'gamma': 'scale'}
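
The failed fits reported above come from the `float` entry in the "gamma" grid: it is the Python type rather than a number, so parameter validation rejects it. A minimal sketch of a grid with numeric "gamma" values follows. The specific values and the synthetic stand-in data are assumptions chosen for illustration, not tuned for CIFAR-10. Setting `error_score="raise"` makes any invalid parameter fail immediately instead of being scored as NaN.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data (assumption) so the sketch runs quickly.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

param_grid = {
    "C": [1, 5, 10],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],  # numeric values instead of `float`
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    SVC(), param_grid, cv=skf, scoring="accuracy", error_score="raise"
)
grid_search.fit(X, y)

print("Best result model: ", grid_search.best_score_)
print("Best parameters model: ", grid_search.best_params_)
```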


In [8]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.80) 

X, y = dataset.train["hog"], dataset.train["labels"]
X_train_pca = pca.fit_transform(X)
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

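param_grid = {
    'C': [1, 1.5, 2, 5, 10],
    # NOTE: `float` is the Python type, not a numeric value. SVC rejects it,
    # which is why one third of the fits fail in the output below.
    # Use numeric values (e.g. 0.01) to actually search over gamma.
    'gamma': ["scale", "auto", float],
}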

model = SVC()
grid_search = GridSearchCV(model, param_grid, cv=skf, scoring='accuracy', verbose=1)

grid_search.fit(X_train_pca, y)

print("Best result model: ", grid_search.best_score_)
print("Best parameters model: ", grid_search.best_params_)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


25 fits failed out of a total of 75.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/Users/xavierdekeme/miniconda3/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", l

Best result model:  0.8284
Best parameters model:  {'C': 5, 'gamma': 'scale'}
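
In the cell above, PCA is fitted on the whole training set before cross-validation, so each validation fold has already influenced the projection. One possible way to avoid this is to wrap PCA and the SVC in a `Pipeline`, so that PCA is refitted on the training part of every fold. The sketch below uses synthetic stand-in data and an arbitrary small grid (both assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)

# PCA keeps enough components to explain 80% of the variance, as in the notebook.
pipe = Pipeline([
    ("pca", PCA(n_components=0.80)),
    ("svc", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": [1, 5, 10], "svc__gamma": ["scale", 0.01]}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(pipe, param_grid, cv=skf, scoring="accuracy")
grid_search.fit(X, y)

n_kept = grid_search.best_estimator_.named_steps["pca"].n_components_
print("Best result model: ", grid_search.best_score_)
print("Best parameters model: ", grid_search.best_params_)
print("PCA components kept: ", n_kept)
```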


In [10]:
# Compare test-set results using the best hyper-parameters found above
clf_model1 = SVC(C=5, gamma="scale")
clf_model1.fit(dataset.train['hog'], dataset.train['labels'])
pred_model1 = clf_model1.predict(dataset.test["hog"])
score_model1 = accuracy_score(dataset.test["labels"], pred_model1)
print(f"Test accuracy, best-parameter SVM (raw data): {score_model1}")  # evaluated on the test set
cm_model1 = confusion_matrix(dataset.test["labels"], pred_model1)
print(cm_model1)

# Fit PCA (80% explained variance) on the training set only, then project the test set
pca = PCA(n_components=0.80)
X_train_pca = pca.fit_transform(dataset.train['hog'])
X_test_pca = pca.transform(dataset.test['hog'])
clf_model2 = SVC(C=5, gamma="scale")
clf_model2.fit(X_train_pca, dataset.train['labels'])
pred_model2 = clf_model2.predict(X_test_pca)
score_model2 = accuracy_score(dataset.test["labels"], pred_model2)
print(f"Test accuracy, best-parameter SVM (after PCA): {score_model2}")  # evaluated on the test set
cm_model2 = confusion_matrix(dataset.test["labels"], pred_model2)
print(cm_model2)

Test accuracy, best-parameter SVM (raw data): 0.83
[[865 103  32]
 [120 779 101]
 [ 40 114 846]]
Test accuracy, best-parameter SVM (after PCA): 0.8336666666666667
[[863 104  33]
 [111 791  98]
 [ 38 115 847]]


# Comparing algorithms

Using the best hyper-parameters that you found for each of the algorithms (kNN, Decision Trees, Random Forests, MLP, SVM):

* Re-train the models on the full training set.
* Compare their results on the test set.
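
The two steps above could be sketched as a single loop over the five model families. Everything here is illustrative: the data is synthetic, and the hyper-parameters are placeholders (assumptions) except for the SVM's `C=5, gamma="scale"`, which comes from the grid search in this notebook. In practice, each entry would use the best values found in the corresponding lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption) in place of the HOG features and labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Placeholder hyper-parameters (assumptions), one model per algorithm family.
models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    "SVM": SVC(C=5, gamma="scale"),
}

# Re-train each model on the full training set, then score it on the test set.
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst test accuracy.
for name, score in sorted(test_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14} {score:.3f}")
```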

In [None]:

# -- Your code here -- #
# Already done in each lab session (TP)