# Support vector machines (SVM)

## Theory

A support vector machine (SVM) is a classification algorithm that chooses the
decision boundary with the largest margin, i.e. the largest distance between the
boundary and the datapoints closest to it (the support vectors). Non-linear
decision boundaries can also be used by implicitly mapping the features to a
higher-dimensional feature space through a kernel function. This is called the
kernel trick. Other parameters of an SVM classifier are:

- C: determines the tradeoff between a smooth decision boundary (low C) and the correct
    classification of the training datapoints (high C)

- gamma: defines how far the influence of a single training example reaches when a non-linear kernel such as rbf is used.
    A low value gives a far reach and a high value gives a short reach.
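
The effect of C and gamma described above can be checked with a small experiment. The following is a minimal sketch, not part of the original notebook: it assumes scikit-learn is installed and uses a synthetic two-class dataset (`make_moons`) rather than the dataset from the example below. The grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (an assumption for illustration;
# this is not the breast cancer dataset used in the example).
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit an rbf-kernel SVM for every combination of C and gamma and record
# train accuracy, test accuracy and the number of support vectors.
results = {}
for C in (0.01, 1, 100):
    for gamma in (0.01, 1, 100):
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        results[(C, gamma)] = (
            clf.score(X_train, y_train),
            clf.score(X_test, y_test),
            int(clf.n_support_.sum()),
        )

for (C, gamma), (train_acc, test_acc, n_sv) in results.items():
    print(f"C={C:<6} gamma={gamma:<6} train={train_acc:.3f} "
          f"test={test_acc:.3f} support vectors={n_sv}")
```

A large gap between train and test accuracy for high C and high gamma would suggest that the boundary follows individual training points too closely, while low values of both tend to give a boundary that is too smooth to fit the data.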

## Example

This example uses SVM to classify breast cancer cells into benign and malignant 
classes using the Breast Cancer Wisconsin dataset (https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).

In [1]:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

In [6]:
df = pd.read_csv("../data/data.csv")
df.drop(labels=["id", "Unnamed: 32"], axis=1, inplace=True)

# Encode the diagnosis labels: M (malignant) -> 1, B (benign) -> 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

In [9]:
X = df.drop(labels=["diagnosis"], axis=1).to_numpy()
y = df["diagnosis"].to_numpy()

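# Hold out 20% of the samples for testing (the split is random, so the
# accuracies vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)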

In [15]:
# Fit an SVM model using the rbf kernel (the default kernel of svm.SVC)
clf = svm.SVC()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)

0.9210526315789473


In [16]:
# Fit an SVM model using a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)

0.9649122807017544
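
The two accuracies above come from a single random train/test split, so the difference between the kernels may not be stable. A rough sketch of a more robust comparison with 5-fold cross-validation is shown below. It rests on assumptions that are not in the original notebook: it loads scikit-learn's bundled copy of the Breast Cancer Wisconsin data (`load_breast_cancer`) instead of the CSV file, and it standardizes the features with `StandardScaler` before fitting, which the cells above do not do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the dataset (assumption: used here instead of data.csv).
# Note that its labels are coded 0 = malignant, 1 = benign, the reverse of
# the encoding in the notebook; accuracy is unaffected by this.
X, y = load_breast_cancer(return_X_y=True)

cv_scores = {}
for kernel in ("rbf", "linear"):
    # Scaling is fitted inside each fold, so no test-fold information leaks
    # into the training step.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_scores[kernel] = cross_val_score(model, X, y, cv=5)

for kernel, scores in cv_scores.items():
    print(f"{kernel}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Because the features are scaled here, the numbers are not directly comparable to the outputs above; the sketch only illustrates how the kernel comparison could be repeated over several splits.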
