Rapidly obtain acceptable results using SVM (based on scikit-learn)
- Conduct simple scaling on the data
  - sklearn.preprocessing.MinMaxScaler/StandardScaler
- Consider the RBF kernel
  - the default kernel of sklearn.svm.SVC
- Use cross-validation to find the best parameters C and gamma
  - sklearn.model_selection.GridSearchCV
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Load the data and scale every feature to the range [-1, 1]
X_train, y_train = load_digits(return_X_y=True)
sc = MinMaxScaler(feature_range=(-1, 1))
Xt_train = sc.fit_transform(X_train)

# Exponentially spaced grid: C from 2^-5 to 2^15, gamma from 2^3 down to 2^-15
params = {"C": np.logspace(-5, 15, num=11, base=2),
          "gamma": np.logspace(3, -15, num=10, base=2)}

# The grid search tunes C and gamma inside each outer cross-validation fold
clf = GridSearchCV(SVC(), params, n_jobs=-1)
scores = cross_val_score(clf, Xt_train, y_train)
print(np.mean(scores), "+/-", np.std(scores))
Dataset 1: Astroparticle (from the reference)
- jupyter notebook
- evaluate using test set accuracy
- default in libsvm and old default in scikit-learn: 66.93% (66.93% in the reference)
- new default in scikit-learn: 96.25%
- scale with MinMaxScaler: 96.15% (96.15% in the reference)
- scale with MinMaxScaler & tune the parameters: 96.93% (96.87% in the reference)
- scale with StandardScaler: 96.80%
- scale with StandardScaler & tune the parameters: 96.68%
Dataset 2: Bioinformatics (from the reference)
- jupyter notebook
- evaluate using cross validation accuracy
- default in libsvm and old default in scikit-learn: 56.53% (56.52% in the reference)
- new default in scikit-learn: 81.87%
- scale with MinMaxScaler: 78.27% (78.52% in the reference)
- scale with MinMaxScaler & tune the parameters: 84.71% (85.17% in the reference)
- scale with StandardScaler: 56.53%
- scale with StandardScaler & tune the parameters: 84.15%
Dataset 3: Vehicle (from the reference)
- jupyter notebook
- evaluate using test set accuracy
- default in libsvm and old default in scikit-learn: 2.44% (2.44% in the reference)
- new default in scikit-learn: 36.59%
- scale with MinMaxScaler: 12.20% (12.20% in the reference)
- scale with MinMaxScaler & tune the parameters: 80.49% (87.80% in the reference)
- scale with StandardScaler: 65.85%
- scale with StandardScaler & tune the parameters: 78.05%
Dataset 4: Breast Cancer (from sklearn.datasets.load_breast_cancer)
- jupyter notebook
- evaluate using cross validation accuracy
- default in libsvm and old default in scikit-learn: 62.74%
- new default in scikit-learn: 91.24%
- scale with MinMaxScaler: 96.13%
- scale with MinMaxScaler & tune the parameters: 97.54%
- scale with StandardScaler: 97.54%
- scale with StandardScaler & tune the parameters: 96.66%
Dataset 5: Digits (from sklearn.datasets.load_digits)
- jupyter notebook
- evaluate using cross validation accuracy
- default in libsvm and old default in scikit-learn: 44.88%
- new default in scikit-learn: 96.38%
- scale with MinMaxScaler: 95.72%
- scale with MinMaxScaler & tune the parameters: 97.33%
- scale with StandardScaler: 94.88%
- scale with StandardScaler & tune the parameters: 94.77%
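The gap between the "old default" and "new default" rows can be reproduced with a short sketch. It assumes that the old default means gamma="auto" (1 / n_features, also libsvm's default) and the new default means gamma="scale" (1 / (n_features * X.var())), which is how scikit-learn's SVC default changed in version 0.22. The 5-fold setting is an assumption, so the printed numbers need not match the notebook exactly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# gamma="auto":  1 / n_features (assumed to be the "old default" above)
# gamma="scale": 1 / (n_features * X.var()) (assumed to be the "new default")
results = {}
for gamma in ("auto", "scale"):
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5)
    results[gamma] = scores.mean()
    print(f"gamma={gamma!r}: mean CV accuracy {results[gamma]:.4f}")
```

On unscaled data, gamma="scale" adapts to the variance of the features, which is one plausible reason it behaves much better than gamma="auto" before any explicit scaling.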
Dataset 6: Wine (from sklearn.datasets.load_wine)
- jupyter notebook
- evaluate using cross validation accuracy
- default in libsvm and old default in scikit-learn: 42.77%
- new default in scikit-learn: 66.39%
- scale with MinMaxScaler: 96.68%
- scale with MinMaxScaler & tune the parameters: 96.68%
- scale with StandardScaler: 98.33%
- scale with StandardScaler & tune the parameters: 97.76%
- jupyter notebook
- evaluate using test set accuracy
- wrong way: use different scalers for the training and testing sets (MinMaxScaler): 69.23% (69.23% in the reference)
- wrong way: use different scalers for the training and testing sets (StandardScaler): 78.21%
- right way: use the same scaler for the training and testing sets (MinMaxScaler): 87.50% (89.42% in the reference)
- right way: use the same scaler for the training and testing sets (StandardScaler): 89.42%
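The difference between the two ways can be sketched as follows. The Wine data and the 70/30 split are stand-ins chosen for illustration only; they are not the data behind the numbers above, so the accuracies will differ.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Illustrative data and split (assumption, not the notebook's dataset)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Right way: fit the scaler on the training set only, then reuse it
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
Xt_train = scaler.transform(X_train)
Xt_test_right = scaler.transform(X_test)

# Wrong way: fit a second, independent scaler on the testing set
Xt_test_wrong = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_test)

clf = SVC().fit(Xt_train, y_train)
acc_right = clf.score(Xt_test_right, y_test)
acc_wrong = clf.score(Xt_test_wrong, y_test)
print(f"same scaler: {acc_right:.4f}, different scalers: {acc_wrong:.4f}")
```

With the second scaler, the same raw value maps to different scaled values in the two sets, so the model is evaluated on inputs that are shifted relative to what it was trained on.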
Number of instances << number of features
- suggestion: use linear kernel
- jupyter notebook
- RBF kernel cross validation accuracy 92.85% (97.22% in the reference)
- linear kernel cross validation accuracy 92.85% (98.61% in the reference)
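A minimal way to try this comparison is sketched below. The synthetic data from make_classification (60 instances, 2000 features) is an assumption standing in for the notebook's dataset, so the scores are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: far fewer instances than features
X, y = make_classification(n_samples=60, n_features=2000,
                           n_informative=20, random_state=0)

kernel_scores = {}
for kernel in ("rbf", "linear"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel} kernel: mean CV accuracy {kernel_scores[kernel]:.4f}")
```

The linear kernel has no gamma to tune, so when it performs as well as RBF the parameter search shrinks to C alone.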
Both the number of instances and the number of features are large
- suggestion: use linear kernel
- jupyter notebook
- RBF kernel, cross validation accuracy 97.17% (96.81% in the reference), wall time 15min 17s
- linear kernel, cross validation accuracy 96.63% (97.01% in the reference), wall time <1s
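The wall-time gap can be explored with a rough timing sketch. It assumes the linear case is run with sklearn.svm.LinearSVC (the LIBLINEAR-based estimator) rather than SVC(kernel="linear"), and it uses small synthetic data, so the timings will be far shorter than those listed above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in, kept small so the sketch runs quickly
X, y = make_classification(n_samples=3000, n_features=300,
                           n_informative=30, random_state=0)

models = (("rbf", SVC(kernel="rbf")),
          ("linear", LinearSVC(dual=False)))  # primal solver, n_samples > n_features

timings = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(X, y)
    timings[name] = time.perf_counter() - start
    print(f"{name}: fit time {timings[name]:.3f}s, "
          f"training accuracy {model.score(X, y):.4f}")
```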
Number of instances >> number of features
- suggestion: when using a linear kernel, set dual=False (default dual=True)
- jupyter notebook
- dual=False, cross validation accuracy 68.51% (75.67% in the reference), wall time 35s
- dual=True, cross validation accuracy 68.51% (75.67% in the reference), wall time 10min 31s
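The effect of the dual flag can be checked with a sketch like the one below. It assumes the estimator is sklearn.svm.LinearSVC, which exposes the dual parameter, and uses synthetic data with many more instances than features; the dual=True run may emit a ConvergenceWarning, which is harmless here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in: many more instances than features
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

fit_info = {}
for dual in (False, True):
    clf = LinearSVC(dual=dual, max_iter=1000)
    start = time.perf_counter()
    clf.fit(X, y)
    elapsed = time.perf_counter() - start
    fit_info[dual] = (elapsed, clf.score(X, y))
    print(f"dual={dual}: fit time {elapsed:.3f}s, "
          f"training accuracy {fit_info[dual][1]:.4f}")
```

Both settings optimize the same objective, which is consistent with the identical accuracies reported above; only the solver, and therefore the running time, changes.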
- In classification, large values in the data may cause the following problems: (1) features in larger numeric ranges may dominate those in smaller ranges; (2) optimization methods for training may take longer. The typical remedy is to scale the data feature-wise. However, for document data, a simple instance-wise normalization is often enough: each instance becomes a unit vector.
- Solvers in LIBLINEAR are not very sensitive to C. Once C is larger than a certain value, the obtained models have similar performance.
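Instance-wise normalization can be sketched with sklearn.preprocessing.Normalizer, which rescales each row (instance) to unit length. The tiny matrix below is made-up data for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three made-up instances with different magnitudes
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 0.0, 0.0],
              [2.0, 2.0, 1.0]])

# Normalize each row to unit L2 norm (instance-wise, not feature-wise)
X_unit = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)
print(row_norms)
```

Unlike MinMaxScaler or StandardScaler, Normalizer learns nothing from the training set, so there is no fitted state to carry over to the testing set.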
- A Practical Guide to Support Vector Classification, Chih-Wei Hsu et al.
- LIBLINEAR: A Library for Large Linear Classification, Rong-En Fan et al.
- LIBSVM: A Library for Support Vector Machines, Chih-Chung Chang et al.