I am training an SVM classifier on the Wine dataset, which contains the chemical analysis of 178 wine samples from three cultivators, to predict the cultivator from the 13 features. Using sklearn.datasets.load_wine(), I load the data, split it into training and test sets, and standardize the features with StandardScaler inside a pipeline. SVMs are binary classifiers at heart, so a multiclass strategy is needed: LinearSVC applies one-versus-rest (one-versus-all) by default and SVC applies one-versus-one internally, so an explicit OneVsRestClassifier wrapper is not required here. I compare the models with cross-validated accuracy on the training set, tune the best candidate, and then report its accuracy on the test set.
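For reference, the one-versus-all strategy can also be spelled out explicitly. This is a minimal sketch, not a cell from the notebook below: it wraps SVC in OneVsRestClassifier, and the names `X_tr`, `X_te`, and `ovr_clf` are only illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Explicit one-vs-rest: one binary SVC is trained per class.
ovr_clf = make_pipeline(
    StandardScaler(), OneVsRestClassifier(SVC(random_state=42))
)
ovr_clf.fit(X_tr, y_tr)

# The wrapper is the last pipeline step; it stores one fitted SVC per class.
n_binary_models = len(ovr_clf[-1].estimators_)
print("binary classifiers trained:", n_binary_models)
print("test accuracy:", ovr_clf.score(X_te, y_te))
```

With three classes, the wrapper fits three binary classifiers and predicts the class whose classifier is the most confident.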

In [2]:
from sklearn.datasets import load_wine
wine = load_wine(as_frame=True)

Let's look at the dataset description.

In [3]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline
    - class:
        - class_0
        - class_1
        - class_2

:Summary Statistics:

                                Min   Max   Mean     SD
Alcohol:                      11.0  14.8    13.0   0.8
Malic Acid:                   0.74  5.80    2.34  1.12
Ash:                          1.36  3.23    2.36  0.27
Alcalinity of Ash:            10.6  30.0    19.5   3.3
Magnesium:                    70.0 162.0    99.7  14.3
Total Phenols:                0.98  3.88    2.29  0.63
Flavanoids:                   0.34  5.08    2.03  1.00

Let's split the data into training and test sets.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42)
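A quick check of what the split produced can be useful. This sketch assumes the same call as above (no test_size given, so scikit-learn holds out 25% of the rows); `split_shapes` is just an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42
)

# With no test_size given, 25% of the 178 rows are held out (rounded up).
split_shapes = (X_train.shape, X_test.shape)
print(split_shapes)

# Class proportions in each part. Passing stratify=wine.target to
# train_test_split would keep these proportions matched between the two.
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

The split is not stratified here, so the class proportions in the two parts can differ somewhat.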

In [5]:
X_train.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
100,12.08,2.08,1.7,17.5,97.0,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710.0
122,12.42,4.43,2.73,26.5,102.0,2.2,2.13,0.43,1.71,2.08,0.92,3.12,365.0
154,12.58,1.29,2.1,20.0,103.0,1.48,0.58,0.53,1.4,7.6,0.58,1.55,640.0
51,13.83,1.65,2.6,17.2,94.0,2.45,2.99,0.22,2.29,5.6,1.24,3.37,1265.0


In [6]:
y_train.head()

Unnamed: 0,target
2,0
100,1
122,1
154,2
51,0


Let's start with a simple linear SVM classifier.

In [8]:
from sklearn.svm import LinearSVC

In [9]:
lin_clf = LinearSVC(dual=True, random_state=42)
lin_clf.fit(X_train,y_train)



It did not converge. Maybe we need to increase the number of iterations.

In [10]:
lin_clf = LinearSVC(max_iter=1_000_000, dual=True, random_state=42)
lin_clf.fit(X_train,y_train)



It still did not converge, even with many more iterations. Let's check the cross-validated accuracy anyway.
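The warning itself is easy to miss in the cell output, so it can help to detect it in code. This is a minimal sketch using Python's warnings module and scikit-learn's ConvergenceWarning; the helper `fit_and_check` is a hypothetical name, not part of scikit-learn.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_wine(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X, y, random_state=42)


def fit_and_check(model, X, y):
    """Fit the model and report whether a ConvergenceWarning was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        model.fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


# Unscaled features with the default iteration budget.
warned = fit_and_check(LinearSVC(dual=True, random_state=42), X_tr, y_tr)
print("ConvergenceWarning raised:", warned)
```

The same helper can be reused on any later model to confirm whether a change actually fixed the convergence problem.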

In [11]:
from sklearn.model_selection import cross_val_score

cross_val_score(lin_clf, X_train, y_train).mean()



0.90997150997151

*Cross-validation also failed to converge. Maybe we forgot to scale the features!*
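To see why scaling is a plausible culprit, compare the spread of the features. This is a small sketch on the full dataset; `feature_std` and `scale_ratio` are illustrative names.

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)

# Standard deviation of every feature, largest first.
feature_std = wine.data.std().sort_values(ascending=False)
print(feature_std.round(2))

# Ratio between the most and the least spread-out feature.
scale_ratio = feature_std.iloc[0] / feature_std.iloc[-1]
print(f"largest / smallest standard deviation: {scale_ratio:.0f}")
```

Proline varies by hundreds of units while several other features vary by less than one, and a spread that uneven tends to make a linear SVM's optimization slow to converge.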

In [13]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


In [14]:
lin_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(max_iter=1_000_000, dual=True, random_state=42))
lin_clf.fit(X_train,y_train)

Now let's measure its performance.

In [15]:
from sklearn.model_selection import cross_val_score
cross_val_score(lin_clf, X_train, y_train).mean()

0.9774928774928775
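The mean hides how much the folds differ, so it can be worth looking at the per-fold scores too. This sketch assumes the same pipeline as above, rebuilt under the illustrative name `pipe`. Because the scaler is inside the pipeline, each fold refits it on that fold's training part only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_wine(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X, y, random_state=42)

pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(max_iter=1_000_000, dual=True, random_state=42),
)

# cv=5 is the default; it is written out here only to make it visible.
fold_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
print(fold_scores)
print(f"mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```

With only about 27 samples per validation fold, a single misclassified wine moves a fold's accuracy by several points.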

That's a good score. Now let's see if a kernelized SVM does better.

In [17]:
from sklearn.svm import SVC

In [18]:
svm_clf = make_pipeline(StandardScaler(), SVC(random_state=42))
cross_val_score(svm_clf, X_train, y_train).mean()

0.9698005698005698

That's not better. Maybe some hyperparameter tuning will help.

In [19]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform

param_distrib = {
    "svc__gamma": loguniform(0.001, 0.1),  # log-uniform over [0.001, 0.1]
    "svc__C": uniform(1, 10)  # uniform(loc, scale): covers [1, 11]
}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distrib, n_iter=100, cv=5,
                                   random_state=42)
rnd_search_cv.fit(X_train, y_train)
rnd_search_cv.best_estimator_
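The two search distributions behave differently, which a few samples make clear. This is a standalone sketch using only scipy; the sample size of 1000 is arbitrary.

```python
from scipy.stats import loguniform, uniform

# scipy's uniform takes (loc, scale), so uniform(1, 10) spans [1, 11].
c_dist = uniform(1, 10)
# loguniform takes (low, high) and spreads samples evenly across decades.
gamma_dist = loguniform(0.001, 0.1)

c_samples = c_dist.rvs(size=1000, random_state=42)
gamma_samples = gamma_dist.rvs(size=1000, random_state=42)

print(f"C range:     {c_samples.min():.3f} to {c_samples.max():.3f}")
print(f"gamma range: {gamma_samples.min():.5f} to {gamma_samples.max():.5f}")

# About half of the gamma samples fall below 0.01, the midpoint on a log scale.
frac_small_gamma = (gamma_samples < 0.01).mean()
print(f"share of gamma samples below 0.01: {frac_small_gamma:.2f}")
```

A log-uniform prior suits gamma because its useful values span orders of magnitude, whereas C is searched here over a single, fairly narrow range.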

In [20]:
rnd_search_cv.best_score_

0.9925925925925926
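It is also worth reading off which hyperparameters won. This sketch repeats the search with fewer draws (n_iter=20, an assumption made only to keep it quick), so its best parameters and score may differ from the run above; `search` is an illustrative name.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(X, y, random_state=42)

search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), SVC(random_state=42)),
    {"svc__gamma": loguniform(0.001, 0.1), "svc__C": uniform(1, 10)},
    n_iter=20,  # fewer draws than the notebook's 100, to keep the sketch quick
    cv=5,
    random_state=42,
)
search.fit(X_tr, y_tr)

# best_params_ holds the sampled values of the best-scoring candidate.
best_params = search.best_params_
print(best_params)
print(f"best mean CV accuracy: {search.best_score_:.4f}")
```

If the best gamma or C sits right at the edge of its range, widening that range is a reasonable next step.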

That's much better. Let's select this model and evaluate it on the test set.

In [22]:
rnd_search_cv.score(X_test, y_test)

1.0
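A perfect score on a test set this small deserves a second look, and a confusion matrix shows how many samples of each class it rests on. This is a minimal sketch that uses an SVC with default hyperparameters as a stand-in for the tuned model, so its numbers are not the ones reported above.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Default SVC hyperparameters, standing in for the tuned model.
clf = make_pipeline(StandardScaler(), SVC(random_state=42)).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
print("test samples:", cm.sum())
```

The test set holds only 45 wines, so each one is worth more than two percentage points of accuracy, and a score of 1.0 should be read with that in mind.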