## Support vector classifier

In this notebook we train a [Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine) classifier, also known as a Support Vector Classifier (SVC). An SVM treats each feature vector as a point in space and looks for boundaries that separate samples from different categories by the largest possible margin.
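
To make the idea of a maximum-margin boundary concrete, here is a minimal sketch on made-up 2D data. It is not this notebook's dataset: the two clusters, the variable names (`toy_clf`, `margin_width`) and the choice of `C=1.0` are all assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical toy data: two well-separated clusters of 2D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 2)),
    rng.normal(2.0, 0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

toy_clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# The learned boundary is the line w . x + b = 0.
w, b = toy_clf.coef_[0], toy_clf.intercept_[0]

# Distance between the two margin lines w . x + b = -1 and w . x + b = +1.
margin_width = 2.0 / np.linalg.norm(w)
toy_accuracy = toy_clf.score(X, y)

print(f"margin width: {margin_width:.3f}")
print(f"training accuracy: {toy_accuracy:.3f}")
```

A smaller norm of `w` means a wider margin, which is what the SVM objective trades off against misclassified training points.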

## Loading training data

In [None]:
import pandas as pd
import numpy as np

import os.path

training_data = pd.read_parquet(os.path.join("data", "training.parquet"))

In [None]:
training_data.sample(10)

## Feature engineering

In [None]:
import cloudpickle as cp
# Load the saved feature pipeline; the context manager closes the file afterwards.
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

In [None]:
train_vecs = feature_pipeline.fit_transform(training_data["Text"])

## Model Training

In [None]:
from sklearn import svm

In [None]:
clf = svm.LinearSVC()

In [None]:
clf.fit(train_vecs, training_data["Category"])

## Evaluating Model Performance

In [None]:
clf.score(train_vecs, training_data["Category"])

In [None]:
testing_data = pd.read_parquet(os.path.join("data", "testing.parquet"))
testing_vecs = feature_pipeline.transform(testing_data["Text"])
clf.score(testing_vecs, testing_data["Category"])
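
The two `score` calls above each return a single accuracy figure. As a rough sketch of what that number is and how a train/test gap shows up, the example below uses synthetic data from `make_classification` as a stand-in; the dataset, the four classes and the name `demo_clf` are assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the notebook's text data).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_clf = LinearSVC(max_iter=10000).fit(X_train, y_train)

# score() is the fraction of samples whose predicted label equals the true label.
manual_test_acc = np.mean(demo_clf.predict(X_test) == y_test)
train_acc = demo_clf.score(X_train, y_train)
test_acc = demo_clf.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f} (computed by hand: {manual_test_acc:.3f})")
print(f"gap:            {train_acc - test_acc:.3f}")
```

A training accuracy that sits well above the testing accuracy is the usual sign of overfitting.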

These raw scores (each is the model's mean accuracy, that is, the fraction of samples classified correctly across all 20 classes) suggest that the model is overfitting the training set. Let's plot a confusion matrix to take a closer look at where the model is making misclassifications:

In [None]:
from mlworkflows import plot

df, chart = plot.confusion_matrix(testing_data["Category"], clf.predict(testing_vecs))

In [None]:
chart
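
The `plot.confusion_matrix` helper used above is specific to this project. To show what a confusion matrix contains, here is a small sketch using scikit-learn's `confusion_matrix` instead; the labels and predictions are invented for illustration and do not come from the notebook's data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up purely for illustration.
y_true = ["sci", "sci", "rec", "rec", "rec", "talk"]
y_pred = ["sci", "rec", "rec", "rec", "talk", "talk"]
labels = ["rec", "sci", "talk"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Diagonal entries count correct predictions; everything else is a misclassification.
n_correct = cm.trace()
print(f"{n_correct} of {cm.sum()} predictions are correct")
```

Off-diagonal cells show which pairs of categories the model confuses, which a single accuracy score cannot reveal.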

We can also look at a more detailed classification report, which breaks performance down by class:

In [None]:
from sklearn.metrics import classification_report
print(classification_report(testing_data["Category"], clf.predict(testing_vecs)))


✅ SVC models have many parameters you can tune. Take a look at the documentation and try setting some parameter values. Can you make the model perform better? Can you make the model perform significantly worse? 
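
As one possible starting point for this exercise, the sketch below varies the regularization parameter `C` of `LinearSVC` and compares training and testing accuracy. It runs on synthetic stand-in data so that it is self-contained; the dataset, the candidate `C` values and the `results` dictionary are assumptions, and the same loop could be adapted to `train_vecs` and `testing_vecs`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; swap in the notebook's vectors to experiment).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Smaller C means stronger regularization; larger C fits the training data more closely.
results = {}
for C in (0.001, 0.01, 0.1, 1.0, 10.0):
    model = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
    results[C] = (model.score(X_train, y_train), model.score(X_test, y_test))

for C, (train_acc, test_acc) in results.items():
    print(f"C={C:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

Other `LinearSVC` parameters worth trying include `loss`, `penalty` and `class_weight`.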

In [None]:
from mlworkflows import util

util.serialize_to(clf, "model.sav")