# A Direct Comparison of SVM Kernel Performance In a Tri-class Classification Problem

As part of one of the modules on my university Computer Science course, I was tasked with conducting an AI/ML project using a dataset. I decided to run an experiment to determine the best SVM kernel to use in tri-class classification. You can find the project source code, along with my paper that goes into more depth on the project, on my GitHub: https://github.com/Spectrum2511/SVM-Kernel-performance-comparison




I found the Palmer Archipelago (Antarctica) penguin dataset, which looked fun to play with. It contains three species of penguin with a variety of variables. You can download it from:
- Kaggle: https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data
- GitHub: https://github.com/allisonhorst/palmerpenguins

In [None]:
import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics  # used later for the accuracy score, confusion matrix and report
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(num=None, figsize=(16, 12), dpi=80, facecolor='w', edgecolor='k')

data = "../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv"

_df = pd.read_csv(data)
_df.columns = ["Species", "Island", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)", "Sex"]
_df
df = _df[["Species", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)"]].copy()
penguins = df.dropna()
penguins

With the data in a DataFrame, we can now look for structure that makes classification easier. I took the dataset into an R environment and found clustering when culmen length was plotted against culmen depth on a scatterplot.
This graph can be recreated using the following R code (note that `df` is the variable holding the data in a `data.frame`):

``plot <- ggplot(data = df, mapping = aes(x = `Culmen Length (mm)`, y = `Culmen Depth (mm)`, color = Species)) + geom_point()``

In [None]:
plt.scatter(penguins["Culmen Length (mm)"], penguins["Culmen Depth (mm)"])

Now that we have some features to use, we need to label each observation with a numerical value, something a computer can work with:

In [None]:
l = []
for i in range(len(penguins)):
  if penguins.iloc[i, 0] == "Adelie":
    l.append(-1)
  if penguins.iloc[i, 0] == "Chinstrap":
    l.append(0)
  if penguins.iloc[i, 0] == "Gentoo":
    l.append(1)
print(l) 
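
The same labelling can also be written without an explicit loop. The following is a minimal sketch using `Series.map`; the small `species` series and the `label_map` name are made up for illustration and simply stand in for `penguins["Species"]`.

```python
import pandas as pd

# Hypothetical stand-in for penguins["Species"]; the index has a gap,
# like the one dropna() leaves behind when it removes a row.
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], index=[0, 1, 2, 4])

# Same encoding as the loop above: Adelie -> -1, Chinstrap -> 0, Gentoo -> 1.
label_map = {"Adelie": -1, "Chinstrap": 0, "Gentoo": 1}
target = species.map(label_map)

print(target.tolist())
print(target.index.tolist())
```

Because `map` keeps the index of the series it is applied to, each label stays attached to the row it was computed from, with no realignment needed.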



This gets added to the dataframe under the column name "target":

In [None]:
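target = pd.DataFrame(l, columns = ["target"])
# Reset the index first: dropna() left gaps in it, and concat(axis=1) aligns rows by index
_raw = pd.concat([penguins.reset_index(drop=True), target], axis=1)
raw = _raw.drop(["Species"], axis=1)
raw = raw.dropna()
raw.head()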

`raw` is now our raw set of data, which includes the target labels for each species. Let's give it a good ol' shuffle:

In [None]:
raw_df = raw
# Shuffle dataset
rng = np.random.default_rng(0)
Xy_df = raw_df.iloc[rng.permutation(len(raw_df))].reset_index(drop=True)

display(Xy_df)
display(raw_df)
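
To see what the shuffle does, here is a small sketch on a made-up five-row frame (`toy` and its columns are hypothetical). Seeding the generator makes the permutation reproducible, and indexing with `iloc` moves whole rows, so each feature value stays paired with its label.

```python
import numpy as np
import pandas as pd

# Made-up data: feature is always target + 10, so pairing is easy to check.
toy = pd.DataFrame({"feature": [10, 11, 12, 13, 14], "target": [0, 1, 2, 3, 4]})

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(0)
order = rng.permutation(len(toy))
shuffled = toy.iloc[order].reset_index(drop=True)

# A second generator with the same seed reproduces the same order.
order_again = np.random.default_rng(0).permutation(len(toy))

print(shuffled)
```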

Now let's separate the bits that we **REALLY** need: culmen length, culmen depth and the target labels. Here, we create two separate NumPy arrays: one is a subsection of `Xy_df` containing only the culmen length and depth values, and the other is just the target labels.

In [None]:
# prepare NumPy ndarrays
X = np.array(Xy_df[["Culmen Length (mm)", "Culmen Depth (mm)"]]) # we are using data from the first two columns
y = np.array(Xy_df['target'])

Now we need to split each of these arrays into train/test sets. You will notice we are using a 30:70 train/test split. This is because we are testing these kernels under a worst-case scenario, a stress test of sorts.

In [None]:
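# Train points and new points must add up to the number of rows in X

# We are using a 30:70 train/test split to vary the results and to act as a "worst case scenario" to properly test the kernels.
n_train_points = len(X) * 3 // 10        # 102 when there are 340 rows
n_new_points = len(X) - n_train_points   # 238 when there are 340 rows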

# Split the data into training/new data
X_train = X[:n_train_points]
X_new = X[n_train_points:n_train_points+n_new_points]

# Split the targets into training/new data
y_train = y[:n_train_points]
y_true = y[n_train_points:n_train_points+n_new_points]
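
Slicing a shuffled array does not guarantee that the three classes appear in the same proportions in both parts. One alternative, which is not what this notebook uses, is scikit-learn's `train_test_split` with the `stratify` argument. Here is a sketch on synthetic data (the `X_demo` and `y_demo` arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 points, 2 features, three unevenly sized classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.repeat([-1, 0, 1], [50, 20, 30])

# 30% for training, 70% held out, with class proportions preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.3, stratify=y_demo, random_state=0
)

classes, counts = np.unique(y_tr, return_counts=True)
print(X_tr.shape, X_te.shape)
print(dict(zip(classes.tolist(), counts.tolist())))
```

Whether stratifying is appropriate depends on the experiment; for a deliberate stress test, an unstratified split may be exactly what is wanted.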

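Now that the test/train sets are sorted, we can train our models. We will be using four different instances of SVM, each with a different kernel:

- Sigmoid
- Linear
- RBF
- Polynomial

Note that for the linear case we are using `LinearSVC` rather than `SVC` with a linear kernel. `LinearSVC` is built on liblinear and, by default, uses a squared hinge loss and a one-vs-rest scheme, so it can give slightly different results from `SVC(kernel="linear")`. We train the models and then create predictions.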

In [None]:
%%time
from sklearn import svm
C = 1.0  # SVM regularization parameter
models = (svm.SVC(kernel="sigmoid", C=C),
          svm.LinearSVC(C=C, max_iter=10000),
          svm.SVC(kernel='rbf', gamma=0.7, C=C),
          svm.SVC(kernel='poly', degree=3, gamma='auto', C=C))

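# Fit every classifier; a list (rather than a generator) keeps the fitted models available for reuse
models = [clf.fit(X_train, y_train) for clf in models]

# Make predictions using the testing set, one set of predictions per classifier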
y_pred = []
for clf in models:
  y_pred.append(clf.predict(X_new))
graph_labels = ["SVC w/ sigmoid Kernel", "Linear SVC", "SVC w/ rbf kernel", "SVC w/ polynomial kernel"]
print(y_pred)
print(X_new)
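
For reference, the four kernels differ only in how they measure similarity between two points. The sketch below writes out the formulas given in scikit-learn's SVM documentation in plain NumPy. The function names, the default parameter values and the two sample points are illustrative assumptions, not values taken from the trained models above.

```python
import numpy as np

def linear_kernel(a, b):
    # <a, b>
    return a @ b

def poly_kernel(a, b, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <a, b> + coef0) ** degree
    return (gamma * (a @ b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.7):
    # exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    # tanh(gamma * <a, b> + coef0)
    return np.tanh(gamma * (a @ b) + coef0)

# Two made-up points: (culmen length, culmen depth) in mm.
a = np.array([39.1, 18.7])
b = np.array([46.5, 17.9])

print("linear :", linear_kernel(a, b))
print("poly   :", poly_kernel(a, b))
print("rbf    :", rbf_kernel(a, b))
print("sigmoid:", sigmoid_kernel(a, b, gamma=0.01))
```

With unscaled millimetre values the dot product is in the thousands, so `tanh` saturates close to 1 unless `gamma` is very small. That could be one reason a sigmoid kernel struggles on raw measurements, although this is a guess rather than something tested in this notebook.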

We want some graphical representation to show how each of the models behaves. This function creates a scatterplot showing the training data points and the predicted data points. One graph is produced per kernel. In each of these graphs, the predicted points are triangles and the training points are dots; it's the color of the individual points we are looking at, not their positioning.

In [None]:
def plot_graphs(labels, lab):
  # Axis limits: fixed lower bounds, upper bounds just beyond the largest test values
  xrange = [30, np.max(X_new[:,0])+2]
  yrange = [10, np.max(X_new[:,1])+2]
  print(xrange)
  print(yrange)

  # plotting the predicted results
  X_pred_Adelie = X_new[labels==-1, :]
  X_pred_Chinstrap = X_new[labels==0, :]
  X_pred_Gentoo = X_new[labels==1, :]

  plt.figure(num=1, figsize=(16, 12))

  plt.scatter(X_pred_Adelie[:, 0], X_pred_Adelie[:, 1],  color='red', label='predicted species = Adelie ' + str(lab), marker='^')
  plt.scatter(X_pred_Chinstrap[:, 0], X_pred_Chinstrap[:, 1],  color='green', label='predicted species = Chinstrap ' + str(lab), marker='^')
  plt.scatter(X_pred_Gentoo[:, 0], X_pred_Gentoo[:, 1],  color='blue', label='predicted species = Gentoo ' + str(lab), marker='^')

  #plotting the training data
  X_true_Adelie = X_train[y_train==-1, :]
  X_true_Chinstrap = X_train[y_train==0, :]
  X_true_Gentoo = X_train[y_train==1, :]

  plt.scatter(X_true_Adelie[:, 0], X_true_Adelie[:, 1],  color='red', label='species = Adelie')
  plt.scatter(X_true_Chinstrap[:, 0], X_true_Chinstrap[:, 1],  color='green', label='species = Chinstrap')
  plt.scatter(X_true_Gentoo[:, 0], X_true_Gentoo[:, 1],  color='blue', label='species = Gentoo')

  plt.xlim(xrange)
  plt.ylim(yrange)

  plt.xlabel("Culmen Length (mm)")
  plt.ylabel("Culmen Depth (mm)")

  plt.legend()
  plt.show()

We also want confusion matrices for more numerical detail:

In [None]:
import sklearn.metrics

def display_conf_mat(labels, l):
  print(l)
  # The accuracy score: 1 means a perfect prediction
  print('Accuracy: {:.4f}'.format(sklearn.metrics.accuracy_score(y_true, labels)))
  # Confusion matrix, normalized so that all cells sum to 1
  cm = sklearn.metrics.confusion_matrix(y_true, labels, labels=[-1, 0, 1], normalize='all')
  # Visualize the confusion matrix (the keyword form of display_labels works across scikit-learn versions)
  sklearn.metrics.ConfusionMatrixDisplay(cm, display_labels=["Adelie", "Chinstrap", "Gentoo"]).plot(cmap=plt.cm.Blues)
  plt.grid(False)
  # The classification report, which contains accuracy, precision, recall, F1 score
  print(sklearn.metrics.classification_report(y_true, labels))
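
To make the `normalize='all'` option concrete, here is a small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). With this option every cell is a fraction of all samples, so the diagonal of the matrix adds up to the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels using the same encoding: -1 Adelie, 0 Chinstrap, 1 Gentoo.
y_true_demo = np.array([-1, -1, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([-1, 0, 0, 0, 1, 1, -1, 1])

# Rows are true classes, columns are predicted classes.
cm_counts = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 0, 1])
cm_share = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 0, 1], normalize="all")
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm_counts)
print(cm_share)
print("accuracy:", acc, "trace of normalized matrix:", np.trace(cm_share))
```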

Now we can view the performance of each kernel:

In [None]:
for array, l in zip(y_pred, graph_labels):
  plot_graphs(array, l)

In [None]:
for array, lab in zip(y_pred, graph_labels):
  display_conf_mat(array, lab)

Sigmoid seems to fall flat when learning the Chinstrap and Gentoo classes and seems to predict all the points as Adelie penguins. The other three all seem to perform well.
Looking at the metrics, RBF scores the highest accuracy of the four.
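
If you adapt the comparison to another problem, it can help to collect the accuracies in one place and sort them. The sketch below does that on synthetic blobs, not the penguin data, so its numbers say nothing about the results above. The `candidates` and `scores` names are made up for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Synthetic three-class data (already shuffled by make_blobs), NOT the penguin dataset.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_tr, y_tr = X_demo[:90], y_demo[:90]   # 30% for training
X_te, y_te = X_demo[90:], y_demo[90:]   # 70% held out

# Same four model configurations as in the notebook.
candidates = {
    "sigmoid": svm.SVC(kernel="sigmoid", C=1.0),
    "linear": svm.LinearSVC(C=1.0, max_iter=10000),
    "rbf": svm.SVC(kernel="rbf", gamma=0.7, C=1.0),
    "poly": svm.SVC(kernel="poly", degree=3, gamma="auto", C=1.0),
}

# Fit each model and record its accuracy on the held-out points.
scores = {
    name: accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    for name, clf in candidates.items()
}

# Print from best to worst.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {acc:.4f}")
```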

Please feel free to use this code and adapt it for other problems, maybe even increasing the number of classes...