In [1]:
# Optional: enable the Intel(R) Extension for Scikit-learn to speed up training
# https://github.com/intel/scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
import pandas as pd

## Machine learning algorithm
### Support vector classifier

I chose a support vector classifier (SVC) because the support vector machine was one of the algorithms we covered in IN3120 Search Technology.

### Hyperparameters

The hyperparameters of an SVC can be found at the following link:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC

The main hyperparameters are `C`, `gamma`, `kernel` and `degree` (for the polynomial kernel).

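`C` is the regularization parameter: it sets how heavily training points on the wrong side of the margin are penalized. A small `C` gives the algorithm more leeway, i.e. more elements may end up on the wrong side of the decision boundary, while a large `C` gives it less.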
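To make the effect of `C` concrete, here is a minimal sketch on synthetic, overlapping data (generated with `make_blobs`, not the Balance Scale data used later). It fits a linear SVC for three assumed values of `C` and reports the number of support vectors and the training accuracy. A smaller `C` usually leaves more support vectors, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data with overlapping clusters (illustration only)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

results = {}
for C in (0.01, 1, 100):
    model = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    results[C] = (int(model.n_support_.sum()), model.score(X_demo, y_demo))

for C, (n_sv, acc) in results.items():
    print(f"C={C}: {n_sv} support vectors, training accuracy {acc:.3f}")
```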

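`kernel` selects the kernel function, e.g. `"poly"`, `"rbf"` or `"linear"`, and `gamma` is the kernel coefficient for the `"rbf"`, `"poly"` and `"sigmoid"` kernels. The linear kernel ignores `gamma`.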

`degree` is the degree of the polynomial for the polynomial kernel. A linear kernel is basically the same as a polynomial kernel of degree 1.
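The relation between the two kernels can be checked numerically. This sketch uses the kernel helpers in `sklearn.metrics.pairwise` on a small random matrix `A` (an assumption for illustration). scikit-learn's polynomial kernel is `(gamma * <x, y> + coef0) ** degree`, so with `degree=1`, `gamma=1` and `coef0=0` it reduces to the plain dot product of the linear kernel. `SVC` defaults to `coef0=0.0`, so there a degree-1 polynomial kernel is the linear kernel scaled by `gamma`.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

# Small random matrix standing in for a feature matrix (illustration only)
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))

# Degree-1 polynomial kernel with gamma=1 and coef0=0
K_poly = polynomial_kernel(A, degree=1, gamma=1.0, coef0=0.0)
# Linear kernel: plain dot products between all pairs of rows
K_lin = linear_kernel(A)

kernels_match = np.allclose(K_poly, K_lin)
print(kernels_match)
```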

## Dataset
### Balance Scale
I chose the Balance Scale dataset because it doesn't have too many features or instances, so training wouldn't take too much time for this quick cross-validation experiment. The data set can be found at https://archive.ics.uci.edu/ml/datasets/Balance+Scale

In [3]:
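# Note: the file has no header row, so read_csv uses its first line as column
# names and that instance is dropped from the data (header=None would keep it).
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data")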
df

Unnamed: 0,B,1,1.1,1.2,1.3
0,R,1,1,1,2
1,R,1,1,1,3
2,R,1,1,1,4
3,R,1,1,1,5
4,R,1,1,2,1
...,...,...,...,...,...
619,L,5,5,5,1
620,L,5,5,5,2
621,L,5,5,5,3
622,L,5,5,5,4
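The column names `B, 1, 1.1, 1.2, 1.3` in the output above suggest that `read_csv` consumed the first data line as a header. The sketch below reproduces this offline on three lines in the same format (the first line is inferred from that header) and shows how `header=None` keeps every line as data. The column names in `columns` are illustrative assumptions, not names defined by the file.

```python
import io
import pandas as pd

# Three lines in the format of balance-scale.data: class label, then four integers
sample = "B,1,1,1,1\nR,1,1,1,2\nR,1,1,1,3\n"

# Default behaviour: the first line becomes the header, so one instance is lost
default_df = pd.read_csv(io.StringIO(sample))

# header=None keeps every line as data; the names below are hypothetical labels
columns = ["class", "left_weight", "left_distance", "right_weight", "right_distance"]
named_df = pd.read_csv(io.StringIO(sample), header=None, names=columns)

print(len(default_df), list(default_df.columns))
print(len(named_df), list(named_df.columns))
```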


In [4]:
X = df.values[:,1:].astype(np.uint8)
y = df.values[:,0].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

### Grid search with cross validation
The scikit-learn user guide (https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection)
notes that cross validation can be used directly with grid search to find the optimal hyperparameters, so that's why I chose to try that for this assignment.

The hyperparameters I want to vary are `kernel`, `C` and `gamma`, and the candidate values are listed in `tuned_parameters` below. Since this is an exhaustive search, all `3 * 4 * 3 = 36` possible combinations will be tested.
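The number of combinations can be checked with `ParameterGrid`, which enumerates a grid the same way `GridSearchCV` does. The second grid, `split_grid`, is a hypothetical alternative: since the linear kernel does not use `gamma`, leaving `gamma` out of the linear sub-grid avoids fitting redundant candidates.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in this notebook
tuned_parameters = [{'kernel': ['linear', 'poly', 'rbf'],
                     'C': [1, 10, 100, 1000],
                     'gamma': [.1, .01, 1e-3]}]
n_candidates = len(ParameterGrid(tuned_parameters))

# Hypothetical alternative: no gamma for the linear kernel, which ignores it
split_grid = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
              {'kernel': ['poly', 'rbf'],
               'C': [1, 10, 100, 1000],
               'gamma': [.1, .01, 1e-3]}]
n_split = len(ParameterGrid(split_grid))

print(n_candidates, n_split)
```

With the default 5-fold cross validation, every candidate is fitted five times, so a smaller grid saves a proportional amount of training.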

In [5]:
# For grid search and cross validation I used this tutorial:
# https://scikit-learn.org/stable/modules/grid_search.html

tuned_parameters = [{'kernel': ['linear', 'poly', 'rbf'],
                     'C': [1, 10, 100, 1000],
                     'gamma': [.1, .01, 1e-3]}, ]

In [6]:
# The code here is from
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

scores = ['recall', 'f1']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(
        SVC(), tuned_parameters, scoring='%s_macro' % score
    )
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(f"{clf.best_params_}, score: {clf.best_score_:.4f}")
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
    

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'C': 100, 'gamma': 0.1, 'kernel': 'poly'}, score: 0.9792

Grid scores on development set:

0.790 (+/-0.230) for {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}
0.859 (+/-0.189) for {'C': 1, 'gamma': 0.1, 'kernel': 'poly'}
0.649 (+/-0.019) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.790 (+/-0.230) for {'C': 1, 'gamma': 0.01, 'kernel': 'linear'}
0.620 (+/-0.058) for {'C': 1, 'gamma': 0.01, 'kernel': 'poly'}
0.642 (+/-0.021) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.790 (+/-0.230) for {'C': 1, 'gamma': 0.001, 'kernel': 'linear'}
0.333 (+/-0.000) for {'C': 1, 'gamma': 0.001, 'kernel': 'poly'}
0.639 (+/-0.020) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.863 (+/-0.213) for {'C': 10, 'gamma': 0.1, 'kernel': 'linear'}
0.954 (+/-0.075) for {'C': 10, 'gamma': 0.1, 'kernel': 'poly'}
0.786 (+/-0.142) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.863 (+/-0.213) for {'C': 10, 'gamma': 0.01, 'kernel': 'linear
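A long printed list like the one above can be easier to read as a sorted table. This is a minimal sketch, assuming the Iris data bundled with scikit-learn as a stand-in and a reduced grid (`demo_grid`) so that it runs quickly and offline. It is not the search performed in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data and a reduced grid (assumptions for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
demo_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10], 'gamma': [.1, .01]}

search = GridSearchCV(SVC(), demo_grid, scoring='f1_macro')
search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it converts to a DataFrame
table = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')
print(table[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head())
```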

In [7]:
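# clf is the GridSearchCV from the last loop iteration, so score() reports
# macro-averaged F1 (its scoring metric), not accuracy
print(clf.score(X_test, y_test))
best_params = clf.best_params_
print("Best parameters:", best_params)
# Refit a plain SVC with the best parameters; SVC.score() reports accuracy
clf = SVC(**best_params, random_state=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# 5-fold cross validation (accuracy) on the whole data set with a fresh SVC
clf = SVC(random_state=1, **best_params)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean(), scores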

0.9913129050867013
{'C': 100, 'gamma': 0.1, 'kernel': 'poly'}
0.996


(0.985574193548387,
 array([0.976     , 0.968     , 1.        , 1.        , 0.98387097]))
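The first two numbers above are not directly comparable: `GridSearchCV.score` uses the metric passed as `scoring`, while `SVC.score` returns accuracy. The sketch below checks this on the bundled Iris data with a tiny assumed grid. It is a stand-in, not a rerun of the search in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1)

search = GridSearchCV(SVC(), {'C': [1, 10]}, scoring='f1_macro').fit(X_tr, y_tr)
y_pred = search.predict(X_te)

# GridSearchCV.score follows the scoring argument: macro-averaged F1
search_score = search.score(X_te, y_te)
macro_f1 = f1_score(y_te, y_pred, average='macro')

# The refitted SVC's own score method returns accuracy
svc_score = search.best_estimator_.score(X_te, y_te)
accuracy = accuracy_score(y_te, y_pred)

print(search_score, macro_f1, svc_score, accuracy)
```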

### Best parameters
For both recall and F1, the best parameters given by the code above are `{'C': 100, 'gamma': 0.1, 'kernel': 'poly'}`. The performance is pretty good: an accuracy of 0.996 when trained on the training set and tested on the test set, and a mean accuracy of 0.986 when tested with 5-fold cross validation on the whole data set.

In reality I expect the performance to be a little lower, because the classifier might be overfit to this data set, which is pretty small, as it only has 625 instances.
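One common way to get a less optimistic estimate is nested cross validation, where the hyperparameter search is repeated inside every outer fold so that the outer test data never influences the chosen parameters. This is a minimal sketch of that idea, assuming the bundled Iris data and a reduced grid (`demo_grid`). It is not part of the original experiment and its scores say nothing about the Balance Scale results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Stand-in data and a reduced grid (assumptions for illustration)
X_demo, y_demo = load_iris(return_X_y=True)
demo_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 100]}

# Inner loop: the grid search picks hyperparameters on each outer training fold
inner_search = GridSearchCV(SVC(), demo_grid, cv=3)

# Outer loop: each outer test fold is only used to score the tuned model
nested_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5)
print(nested_scores.mean(), nested_scores)
```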
