# L5c: Support Vector Machine (SVM) Classification
In this lecture, we introduce our last (for now) classification approach, namely [support vector machines (SVMs)](https://en.wikipedia.org/wiki/Support_vector_machine). Support vector machines are a _supervised_ learning approach to learn the _best_ possible separating hyperplane. The key ideas of this lecture are:

* A __support vector machine__ is a _supervised_ machine learning algorithm that finds an optimal (linear) hyperplane in an $N$-dimensional space to classify (binary) data points distinctly, maximizing the _margin_ between different classes. The _margin_ in a support vector machine is defined as the distance from the separating hyperplane to the closest data points of either class.
* A __hard margin support vector machine__ is a binary linear classifier that finds the optimal hyperplane to separate two classes of data points with the maximum possible _margin_, allowing no misclassifications and requiring the data to be linearly separable.
* A __soft margin support vector machine__ is a variant of the SVM algorithm that allows for some misclassification of training data points, enabling it to handle non-linearly separable datasets and reduce overfitting by finding a balance between maximizing the decision boundary margin and minimizing classification errors.

Lecture notes for today can be found: [here!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-5/L5c/docs/Notes.pdf)

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem.

In [32]:
include("Include.jl");

### Data
This lecture will look at a [banknote authentication dataset](https://archive.ics.uci.edu/dataset/267/banknote+authentication) for classification tasks. We'll load the banknote dataset and split it into `training` and `test` data subsets (randomly).
* __Training data__: Training datasets are collections of labeled data used to teach machine learning models, allowing these tools to learn patterns and relationships within the data.
* __Test data__: Test datasets, on the other hand, are separate sets of labeled data used to evaluate the performance of trained models on unseen examples, providing an unbiased assessment of the _model's generalization capabilities_.

#### Banknote Authentication Dataset
The dataset we explore is the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication). This dataset has `1372` instances of 4 continuous features and an integer $\{-1,1\}$ class variable.
* __Description__: Data were extracted from images taken from genuine and forged banknote-like specimens. An industrial camera, usually used for print inspection, was used for digitization. The final images have 400 x 400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. Wavelet Transform tools were used to extract features from images.
* __Features__: The data has four continuous features from each image: `variance` of the wavelet transformed image, `skewness` of the wavelet transformed image, `kurtosis` of the wavelet transformed image (stored in the `curtosis` column of the data file), and the `entropy` of the wavelet transformed image. The class is $\{-1,1\}$, where a class value of `-1` indicates genuine and `1` indicates forged.

In [6]:
df_banknote = CSV.read(joinpath(_PATH_TO_DATA, "data-banknote-authentication.csv"), DataFrame)

| Row | variance | skewness | curtosis | entropy | class |
|---:|---:|---:|---:|---:|---:|
| | Float64 | Float64 | Float64 | Float64 | Int64 |
| 1 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | -1 |
| 2 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | -1 |
| 3 | 3.866 | -2.6383 | 1.9242 | 0.10645 | -1 |
| 4 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | -1 |
| 5 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | -1 |
| 6 | 4.3684 | 9.6718 | -3.9606 | -3.1625 | -1 |
| 7 | 3.5912 | 3.0129 | 0.72888 | 0.56421 | -1 |
| 8 | 2.0922 | -6.81 | 8.4636 | -0.60216 | -1 |
| 9 | 3.2032 | 5.7588 | -0.75345 | -0.61251 | -1 |
| 10 | 1.5356 | 9.1772 | -2.2718 | -0.73535 | -1 |


In [7]:
D_banknote = Matrix(df_banknote); # get the data as a Matrix (alias for Array{Float64,2})
number_of_training_examples_banknote = 1000; # how many training points for the banknote dataset?

In [8]:
banknote_training, banknote_test = let

    number_of_features = size(D_banknote,2); # number of cols of the banknote data (features + label)
    number_of_examples = size(D_banknote,1); # number of rows of the banknote data
    full_index_set = range(1,stop=number_of_examples,step=1) |> collect |> Set;
    
    # build index sets for training and testing
    training_index_set = Set{Int64}();
    should_stop_loop = false;
    while (should_stop_loop == false)
        i = rand(1:number_of_examples);
        push!(training_index_set,i);

        if (length(training_index_set) == number_of_training_examples_banknote)
            should_stop_loop = true;
        end
    end
    test_index_set = setdiff(full_index_set,training_index_set);

    # build the test and train datasets -
    banknote_training = D_banknote[training_index_set |> collect,:];
    banknote_test = D_banknote[test_index_set |> collect,:];

    # return
    banknote_training,banknote_test
end;
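
The sampling loop above draws indices until it has collected enough unique ones. As a point of comparison, here is a minimal sketch of the same kind of random split using a single random permutation. The `split_rows` helper and the synthetic `D_demo` matrix are hypothetical names introduced only for illustration; they are not part of `Include.jl`.

```julia
using Random

# Hypothetical helper: randomly split the rows of D into training and test sets.
function split_rows(D::AbstractMatrix, number_of_training_examples::Int; rng = Random.default_rng())
    number_of_examples = size(D, 1)
    permuted = randperm(rng, number_of_examples)                  # random ordering of 1:n
    training_rows = permuted[1:number_of_training_examples]       # first block -> training
    test_rows = permuted[(number_of_training_examples + 1):end]   # remainder -> test
    return D[training_rows, :], D[test_rows, :]
end

D_demo = rand(MersenneTwister(1), 20, 5)   # synthetic stand-in for D_banknote
train_demo, test_demo = split_rows(D_demo, 15; rng = MersenneTwister(42))
println(size(train_demo), " ", size(test_demo))   # prints (15, 5) (5, 5)
```

Every row lands in exactly one of the two subsets, so the training and test sets never overlap.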

## Theory: Support Vector Machine (SVM)
Suppose we have a dataset $\mathcal{D} = \{(\hat{\mathbf{x}}_{i}, y_{i}) \mid i = 1,2,\dots,n\}$, where $\hat{\mathbf{x}}_i \in \mathbb{R}^p$ is an _augmented_ feature vector ($m$ features with an additional `1` appended to the end of the vector to model the bias) and $y_i \in \{-1, 1\}$ is the corresponding class label.
* __Objective__: the goal of an SVM is to find the hyperplane $\mathcal{H}(\hat{\mathbf{x}}) = \{\hat{\mathbf{x}} \mid \left<\hat{\mathbf{x}},\theta\right> = 0\}$ that separates the data points into two classes (those points above the hyperplane, and those points below the hyperplane), where $\theta \in \mathbb{R}^{p}$ ($p=m+1$) is the normal vector to the hyperplane, or alternatively, the parameters of the model that we need to estimate.
* __Why another method__? Support vector machines (SVMs) and other approaches, e.g., [the perceptron](https://en.wikipedia.org/wiki/Perceptron), differ primarily in their optimization objectives and training methods: while a [perceptron](https://en.wikipedia.org/wiki/Perceptron) can find _a hyperplane_ that separates classes, SVMs seek to find the _best hyperplane_ in the sense that the _margin_ between classes is maximized.
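
To make the "above versus below the hyperplane" idea concrete, here is a minimal sketch of the decision rule $\text{sign}\left(\left<\hat{\mathbf{x}},\theta\right>\right)$. The `classify` function and the numerical values of `θ_demo` are hypothetical, and assigning points that lie exactly on the hyperplane to class `1` is an assumption of this sketch.

```julia
using LinearAlgebra

# Classify an augmented feature vector by the side of the hyperplane it falls on.
# Assumption: points exactly on the hyperplane (inner product = 0) get label 1.
classify(x_hat::AbstractVector, θ::AbstractVector) = dot(x_hat, θ) >= 0 ? 1 : -1

θ_demo = [1.0, -1.0, 0.5]     # hypothetical parameters: two weights + bias
x_above = [2.0, 0.0, 1.0]     # inner product with θ_demo is  2.5
x_below = [0.0, 3.0, 1.0]     # inner product with θ_demo is -2.5
println(classify(x_above, θ_demo), " ", classify(x_below, θ_demo))   # prints 1 -1
```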

There are (at least) two strategies that we could use to estimate the unknown parameters $\theta \in \mathbb{R}^{p}$, depending on whether the dataset $\mathcal{D}$ is linearly separable.

### Case 1: Linearly separable 
Let's take a look at a [schematic of the ideas behind a hard margin support vector machine](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-5/L5c/docs/figs/Fig-SVM-Schematic.pdf). If the data is linearly separable, a hyperplane $\mathcal{H}(\hat{\mathbf{x}})$ exists that perfectly separates the data. We can estimate the _best_ hyperplane by maximizing the _margin_, which is equivalent to minimizing the norm of the parameter vector $\lVert\theta\rVert_{2}$, because the margin is inversely proportional to this norm. The _maximum hard margin problem_ is given by:
$$
\begin{align*}
    \min_{\theta}\quad & \frac{1}{2}\lVert{\theta}\rVert_{2}^{2}\\
    \text{subject to}\quad & y_{i}\left<\hat{\mathbf{x}}_{i},\theta\right> \geq 1\quad\forall i
\end{align*}
$$
where $\theta\in\mathbb{R}^{p}$ denotes the unknown parameters that we are trying to estimate, $\hat{\mathbf{x}}_{i}\in\mathbb{R}^{p}$ are the augmented (training) feature vectors, $y_{i}\in\{-1,1\}$ are the class labels, and $p=m+1$ is the number of parameters, where $m$ is the number of features. The index $i$ runs over the training examples, i.e., there is one constraint per training example.
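
To see what the constraints mean, here is a minimal sketch that checks whether a candidate $\theta$ is feasible for the hard margin problem on a tiny, made-up dataset and evaluates the objective. The data and the candidate `θ` are hypothetical. One assumption to flag: the geometric distance reported here is measured in the original feature space, so it uses only the first $m$ entries of $\theta$ (the weights) and not the bias entry.

```julia
using LinearAlgebra

# Tiny synthetic dataset: one augmented example per row (two features + trailing 1)
X_hat = [ 2.0  2.0  1.0;
          3.0  3.0  1.0;
          0.0  0.0  1.0;
         -1.0  0.0  1.0]
y = [1, 1, -1, -1]
θ = [0.5, 0.5, -1.0]                             # hypothetical candidate parameters

functional_margins = y .* (X_hat * θ)            # y_i * <x̂_i, θ> for each example
is_feasible = all(functional_margins .>= 1)      # one constraint per training example
objective = 0.5 * dot(θ, θ)                      # (1/2)‖θ‖²
geometric_margin = 1 / norm(θ[1:end-1])          # distance to the closest points (weights only)

println("margins = ", functional_margins)        # prints margins = [1.0, 2.0, 1.0, 1.5]
println("feasible = ", is_feasible, ", objective = ", objective)
println("geometric margin = ", geometric_margin)
```

The examples whose constraint holds with equality (here the first and third) are the ones that sit on the margin.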

### Case 2: Not linearly separable 
Let's take a look at a [schematic of the ideas behind a soft margin support vector machine](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-5/L5c/docs/figs/Fig-SVM-Schematic-Softmargin.pdf).
If the data is _not linearly separable_, then we know that a perfect $\mathcal{H}(\hat{\mathbf{x}})$ will not exist, i.e., 
no hyperplane will separate the data without making at least one mistake. In this case, we can estimate the _best_ hyperplane possible by solving the maximum soft margin problem given by:
$$
\begin{align*}
    \min_{\theta,\xi}\quad & \frac{1}{2}\lVert{\theta}\rVert_{2}^{2} + C\sum_{i=1}^{n}\xi_{i}\\
    \text{subject to}\quad & y_{i}\left<\hat{\mathbf{x}}_{i},\theta\right> \geq 1 - \xi_{i}\quad\forall i\\
    & \xi_{i} \geq 0\quad\forall i
\end{align*}
$$
where $\xi_{i}$ is a _slack variable_ that quantifies how much training example $i$ violates the margin (the cost of a classification mistake), and $C>{0}$ is a user-adjustable parameter that controls the trade-off between maximizing the margin and minimizing the slack variables.
* __Values of $C$__: If $C\gg{1}$ the classifier will behave like the maximum (hard) margin classifier, i.e., mistakes will be expensive, and the search will avoid making choices with mistakes. However, if $C\ll{1}$, the classifier will allow more slack (mistakes), i.e., mistakes are cheap, so what's it matter!
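
Here is a minimal sketch of how the slack variables and $C$ interact. For a fixed $\theta$, the smallest slack that satisfies both constraints is $\xi_{i} = \max\{0, 1 - y_{i}\left<\hat{\mathbf{x}}_{i},\theta\right>\}$. The `slack` and `soft_margin_objective` functions, the data, and the candidate `θ` are hypothetical names and values used only for illustration.

```julia
using LinearAlgebra

# Smallest feasible slack for each example at a fixed θ
slack(X_hat, y, θ) = max.(0.0, 1.0 .- y .* (X_hat * θ))

# Soft margin objective: (1/2)‖θ‖² + C Σ ξ_i
soft_margin_objective(X_hat, y, θ, C) = 0.5 * dot(θ, θ) + C * sum(slack(X_hat, y, θ))

X_hat = [2.0 2.0 1.0;
         0.0 0.0 1.0;
         1.5 1.5 1.0]       # the third example (label -1) sits on the wrong side
y = [1, -1, -1]
θ = [0.5, 0.5, -1.0]        # hypothetical candidate parameters

ξ = slack(X_hat, y, θ)
println("slack = ", ξ)      # prints slack = [0.0, 0.0, 1.5]
for C in (0.1, 10.0)
    println("C = ", C, " objective = ", soft_margin_objective(X_hat, y, θ, C))
end
```

The same mistake contributes a small amount to the objective when $C$ is small and dominates it when $C$ is large.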

__Do we solve the soft margin problem?__

Typically, we don't solve the problem above directly; instead, we reformulate it as an _unconstrained_ problem [using the hinge-loss function](https://en.wikipedia.org/wiki/Hinge_loss):
$$
\begin{equation*}
    \min_{\theta}\frac{1}{2}\lVert{\theta}\rVert_{2}^{2} + C\sum_{i=1}^{n}\max\{0, 1 - y_{i}\left<\hat{\mathbf{x}}_{i},\theta\right>\}
\end{equation*}
$$
where the sum is computed over $n$ training examples.
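
One simple way to attack the unconstrained problem is subgradient descent. The block below is a rough sketch under stated assumptions, not what `LIBSVM.jl` does internally: the `train_hinge` function, the fixed step size `η`, the iteration count, and the toy data are all hypothetical choices.

```julia
using LinearAlgebra

# Hinge-loss form of the soft margin objective
hinge_objective(X_hat, y, θ, C) = 0.5 * dot(θ, θ) + C * sum(max.(0.0, 1.0 .- y .* (X_hat * θ)))

# Hypothetical trainer: fixed-step subgradient descent on the hinge-loss objective
function train_hinge(X_hat, y; C = 1.0, η = 0.01, iterations = 2000)
    θ = zeros(size(X_hat, 2))
    for _ in 1:iterations
        active = (y .* (X_hat * θ)) .< 1.0              # examples with nonzero hinge loss
        g = θ .- C .* (X_hat[active, :]' * y[active])   # subgradient of the objective at θ
        θ .-= η .* g                                    # descent step
    end
    return θ
end

# Tiny synthetic, linearly separable dataset (two features + trailing 1)
X_hat = [ 2.0  2.0  1.0;
          3.0  3.0  1.0;
          0.0  0.0  1.0;
         -1.0  0.0  1.0]
y = [1, 1, -1, -1]

θ_fit = train_hinge(X_hat, y)
println("objective at zero:  ", hinge_objective(X_hat, y, zeros(3), 1.0))
println("objective at θ_fit: ", hinge_objective(X_hat, y, θ_fit, 1.0))
println("predicted labels:   ", sign.(X_hat * θ_fit))
```

Only the examples inside or on the wrong side of the margin contribute to the subgradient, which is why the solution ends up depending on a subset of the training data.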

## Banknote Classification Problem using an SVM
In this example, we [use the SVM implementation exported by the `LIBSVM.jl` package](https://github.com/JuliaML/LIBSVM.jl) to classify the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication). In particular, we use the `training` dataset to estimate the unknown model parameters $\theta$ [using the `svmtrain(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl). 
* The [`svmtrain(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl) takes the augmented training examples matrix $\hat{\mathbf{X}}^{\top}$, i.e., a matrix with the examples on the columns and the features on the rows, and a label vector $\mathbf{y}\in\left\{-1,1\right\}^{n}$.
* The [`svmtrain(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl) returns a [model instance](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl) that holds the trained parameters (e.g., the support vectors and their coefficients) and other data associated with the problem.
* __Hmmm__: One of the (super) interesting optional arguments of [the `svmtrain(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl) is the `kernel` argument. Check out the documentation to see what kernels are supported! Wow! We get [kernelized SVM capability](https://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_kernels) right out of the box. Buy versus build, 99% buy!

In [58]:
model = let

    # Setup the data that we are using
    D = banknote_training; # what dataset are we looking at?
    number_of_examples = size(D,1); # how many rows?
    X = [D[:,1:end-1] ones(number_of_examples)] |> transpose |> Matrix; # augmented features (arranged as p x n, where p = m + 1)
    y = D[:,end]; # labels

    # Train the model -
    model = svmtrain(X, y, kernel=LIBSVM.Kernel.Linear); # linear-kernel SVM from the LIBSVM.jl package

    # return
    model
end;

__Inference__: Now that we have parameters estimated from the `training` data, we can use those parameters on the `test` dataset to see how well the model can differentiate between an actual banknote and a forgery on data it has never seen. We run the classification operation on the (unseen) test data [using the `svmpredict(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl). 
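* The [`svmpredict(...)` method](https://github.com/JuliaML/LIBSVM.jl/blob/master/src/LIBSVM.jl) returns the predicted labels and the decision values. We store the predicted labels in the `ŷ` array, the actual (correct) labels in the `y` vector, and the decision values in `d`. Because the labels are read from the `Float64` data matrix, both `ŷ` and `y` are `Array{Float64,1}` vectors holding the values `-1.0` and `1.0`.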

In [46]:
ŷ,y,d = let

    # Setup the data that we are using
    D = banknote_test; # what dataset are we looking at?
    number_of_examples = size(D,1); # how many rows?
    X = [D[:,1:end-1] ones(number_of_examples)] |> transpose |> Matrix; # augmented features (arranged as p x n, where p = m + 1)
    y = D[:,end]; # labels
    
    # Test the model on the held-out (unseen) test data.
    ŷ, decision_values = svmpredict(model, X);

    # return -
    ŷ,y,decision_values
end;

### Confusion Matrix
The confusion matrix is a $2\times{2}$ matrix that contains four entries: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). [Click me for a confusion matrix schematic!](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-BinaryConfusionMatrix.pdf). Let's compute these four values [using the `confusion(...)` method](src/Compute.jl) and store them in the `CM::Array{Int64,2}` variable:

In [50]:
CM = confusion(y, ŷ) # call with the actual labels and the SVM predictions

2×2 Matrix{Int64}:
 153    0
   0  219

In [52]:
number_of_test_points = length(y);
correct_prediction_svm = CM[1,1] + CM[2,2]; # correct predictions are on the diagonal
(correct_prediction_svm/number_of_test_points) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 1.0 Fraction incorrect 0.0
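
For reference, here is a minimal sketch of how the four confusion matrix entries and a few summary metrics can be computed from label vectors. The `confusion_sketch` function is a hypothetical stand-in, not the `confusion(...)` method in `src/Compute.jl`. It assumes that `1` is the positive class and that the matrix is laid out as `[TP FN; FP TN]`, which may differ from the layout used above.

```julia
# Hypothetical stand-in for confusion(...): assumes +1 is the positive class
# and returns the matrix [TP FN; FP TN]
function confusion_sketch(y::AbstractVector, yhat::AbstractVector)
    TP = count((y .== 1) .& (yhat .== 1))      # actual +1, predicted +1
    FN = count((y .== 1) .& (yhat .== -1))     # actual +1, predicted -1
    FP = count((y .== -1) .& (yhat .== 1))     # actual -1, predicted +1
    TN = count((y .== -1) .& (yhat .== -1))    # actual -1, predicted -1
    return [TP FN; FP TN]
end

y_demo    = [1, 1,  1, -1, -1, -1, -1]         # made-up actual labels
yhat_demo = [1, 1, -1, -1, -1,  1, -1]         # made-up predicted labels

CM_demo = confusion_sketch(y_demo, yhat_demo)
accuracy_demo  = (CM_demo[1,1] + CM_demo[2,2]) / sum(CM_demo)
precision_demo = CM_demo[1,1] / (CM_demo[1,1] + CM_demo[2,1])
recall_demo    = CM_demo[1,1] / (CM_demo[1,1] + CM_demo[1,2])
println(CM_demo)                               # prints [2 1; 1 3]
println("accuracy = ", accuracy_demo, ", precision = ", precision_demo, ", recall = ", recall_demo)
```

With this layout, the diagonal holds the correct predictions, which matches the `CM[1,1] + CM[2,2]` calculation used above.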


# Today?
That's a wrap! What are some of the interesting things we discussed today? Let's review by looking at [the schematic of a hard margin support vector machine!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-5/L5c/docs/figs/Fig-SVM-Schematic.pdf)