# Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)

In [3]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Fl

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42
)

In [5]:
X_train.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
100,12.08,2.08,1.7,17.5,97.0,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710.0
122,12.42,4.43,2.73,26.5,102.0,2.2,2.13,0.43,1.71,2.08,0.92,3.12,365.0
154,12.58,1.29,2.1,20.0,103.0,1.48,0.58,0.53,1.4,7.6,0.58,1.55,640.0
51,13.83,1.65,2.6,17.2,94.0,2.45,2.99,0.22,2.29,5.6,1.24,3.37,1265.0


In [6]:
y_train.head()

2      0
100    1
122    1
154    2
51     0
Name: target, dtype: int64

We will start with a simple linear SVM classifier. `LinearSVC` automatically uses the One-vs-the-Rest (OvR, also called One-vs-All) strategy for multiclass problems, so handling the three wine classes requires nothing special. For now, assume we forget to scale the features beforehand.

In [7]:
from sklearn.svm import LinearSVC

lin_clf = LinearSVC(dual=True, random_state=42)
lin_clf.fit(X_train, y_train)



Not a good message! The fit emits a `ConvergenceWarning`: our model failed to converge. Let's see if increasing the number of training iterations helps.

In [10]:
lin_clf = LinearSVC(max_iter=1_000_000, dual=True, random_state=42)
lin_clf.fit(X_train, y_train)



Even with a million iterations (the default is only 1,000), our model still fails to converge. There must be a different problem. Let's see if scaling the features helps the model.
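One way to see why feature scaling is a plausible culprit is to compare the spread of the raw features. The following is a minimal sketch, not part of the original workflow: it reloads the dataset so it can run on its own, and the variable names `spread` and `ratio` are illustrative.

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)

# Standard deviation of each raw (unscaled) feature, largest first
spread = wine.data.std().sort_values(ascending=False)
print(spread.round(2))

# Ratio between the most and the least spread-out feature
ratio = spread.iloc[0] / spread.iloc[-1]
print(f"largest std / smallest std: {ratio:.0f}")
```

Features whose values differ by several orders of magnitude tend to make linear SVM solvers converge very slowly, which is consistent with the behavior seen above.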

But first, let's use `cross_val_score` to evaluate this model, at least to have something to compare with.

In [12]:
from sklearn.model_selection import cross_val_score

cross_val_score(lin_clf, X_train, y_train, cv=5).mean()



0.90997150997151

Well, about 91% cross-validated accuracy on the training set is not a great start. Now let's scale the features.

In [13]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lin_clf = make_pipeline(StandardScaler(), LinearSVC(dual=True, random_state=42))
lin_clf.fit(X_train, y_train)

Now it converges with ease. Let's see how it performs.
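Rather than checking for the warning by eye, convergence can also be checked programmatically by recording warnings during `fit`. The sketch below assumes the same split as above and uses a hypothetical helper, `fit_warns_about_convergence`, which is not part of Scikit-Learn.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

wine = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42
)


def fit_warns_about_convergence(model, X, y):
    """Hypothetical helper: fit the model and report whether
    a ConvergenceWarning was emitted while fitting."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning
        model.fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


raw_clf = LinearSVC(dual=True, random_state=42)
scaled_clf = make_pipeline(
    StandardScaler(), LinearSVC(dual=True, random_state=42)
)

raw_warns = fit_warns_about_convergence(raw_clf, X_train, y_train)
scaled_warns = fit_warns_about_convergence(scaled_clf, X_train, y_train)
print(f"raw features warn: {raw_warns}, scaled features warn: {scaled_warns}")
```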

In [14]:
cross_val_score(lin_clf, X_train, y_train, cv=5).mean()

0.9774928774928775

Nice, now we obtain 97.7% accuracy. That's much better.
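Recall that `LinearSVC` handles the three classes with the OvR strategy. The following sketch, which is an illustration rather than part of the original workflow, inspects the fitted model: it should hold one weight vector per class, and the predicted class should be the one with the highest decision score. It fits on the full dataset for brevity.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_wine(return_X_y=True)

clf = make_pipeline(StandardScaler(), LinearSVC(dual=True, random_state=42))
clf.fit(X, y)

# make_pipeline names each step after its lowercased class name
linear_svc = clf.named_steps["linearsvc"]
print(linear_svc.coef_.shape)  # one row of weights per class

# One decision score per class for each sample
scores = clf.decision_function(X[:5])
print(scores.shape)

# The predicted class is the one with the highest score
pred_from_scores = linear_svc.classes_[np.argmax(scores, axis=1)]
same = np.array_equal(pred_from_scores, clf.predict(X[:5]))
print(same)
```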

Let's see if a kernelized SVM does any better. We will use the default `kernel="rbf"` with the `SVC` class this time.

In [15]:
from sklearn.svm import SVC

svm_clf = make_pipeline(StandardScaler(), SVC(random_state=42))
cross_val_score(svm_clf, X_train, y_train).mean()

0.9698005698005698

That's not better than the previous model, but we can do a bit of hyperparameter tuning.

In [17]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform

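param_distribution = {
    # log-uniform between 0.001 and 0.1
    "svc__gamma": loguniform(0.001, 0.1),
    # uniform(loc, scale) samples from [loc, loc + scale], i.e. [1, 11]
    "svc__C": uniform(1, 10),
}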

random_search_cv = RandomizedSearchCV(
    svm_clf, param_distributions=param_distribution, n_iter=100, cv=5, random_state=42
)
random_search_cv.fit(X_train, y_train)
random_search_cv.best_estimator_
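A note on the two search distributions: SciPy's `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, not from the first argument to the second, while `loguniform(a, b)` spreads samples evenly across orders of magnitude between `a` and `b`. The sketch below checks the bounds through the quantile function `ppf`; the variable names are illustrative.

```python
from scipy.stats import loguniform, uniform

c_dist = uniform(1, 10)              # loc=1, scale=10
gamma_dist = loguniform(0.001, 0.1)  # lower bound, upper bound

# Quantiles 0 and 1 give the bounds of each distribution
c_low, c_high = c_dist.ppf([0, 1])
gamma_low, gamma_median, gamma_high = gamma_dist.ppf([0, 0.5, 1])

print(f"C ranges over [{c_low}, {c_high}]")
print(f"gamma ranges over [{gamma_low}, {gamma_high}], median {gamma_median}")
```

Because the median of the log-uniform distribution sits at the geometric mean of its bounds, about half of the sampled `gamma` values fall below 0.01.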

In [19]:
random_search_cv.best_params_

{'svc__C': 9.925589984899778, 'svc__gamma': 0.011986281799901188}

In [20]:
random_search_cv.best_score_

0.9925925925925926

This looks very promising! Let's choose it and see its performance on the test set.

In [21]:
random_search_cv.score(X_test, y_test)

0.9777777777777777

- This tuned kernelized SVM performs better than the `LinearSVC` model in cross-validation (99.3% vs. 97.7%).
- However, the score on the test set (97.8%) is lower than the cross-validation score (99.3%). This is quite common: we did so much hyperparameter tuning that we ended up slightly overfitting the validation sets.
- You may now feel tempted to keep adjusting the model until it scores even better on the test set, but doing that would just overfit the test set. The accuracy is good enough for us to stop here.
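To put the test score in perspective, the following minimal sketch recomputes the size of the test set and converts the reported accuracy into a number of misclassified wines. It assumes the same `train_test_split` call as above (with its default 25% test fraction) and hard-codes the accuracy printed by `random_search_cv.score`; the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42
)

n_test = len(X_test)
test_accuracy = 0.9777777777777777  # value reported on the test set above

# Number of misclassified test samples implied by that accuracy
n_errors = round((1 - test_accuracy) * n_test)

# How much a single misclassification moves the accuracy
points_per_error = 100 / n_test

print(f"test samples: {n_test}, misclassified: {n_errors}")
print(f"each error costs about {points_per_error:.1f} percentage points")
```

With so few test samples, each misclassified wine moves the accuracy by roughly two percentage points, so small gaps between validation and test scores should be read with some caution.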