<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416-01/516-01: Fundamentals of AI/ML, Fall 2025</h1>
    <h1>Kernel methods</h1>
</div>

# Contents 

- [Embedding in a higher dimension](#Emedding-in-a-higher-dimension)
- [Support vector classifiers](#Support-vector-classifiers)
    - [The dual QP for overlapping classes](#The-dual-QP-for-overlapping-classes)
- [The kernel trick](#The-kernel-trick)
    - [The kernel for XOR](#The-kernel-for-XOR)
    - [¿ $\phi(x)$ or $K(x,y)$ ?](#¿-$\phi(x)$-or-$K(x,y)$-?)
    - [Properties of kernels](#Properties-of-kernels)
- [A necessary and sufficient condition for kernels](#A-necessary-and-sufficient-condition-for-kernels)
- [The $\nu$-SVC](#The-$\nu$-SVC)
- [The kernel classifier](#The-kernel-classifier)
- [Examples of kernels](#Examples-of-kernels)
- [Creating new kernels from existing kernels](#new_kernels)
- [Building a kernel SVC in Scikit-Learn](#Building-a-kernel-SVC-in-Scikit-Learn)
- [Kernels for symbolic inputs](#Kernels-for-symbolic-inputs)


# Embedding in a higher dimension <a id="Emedding-in-a-higher-dimension"/>

$\newcommand{\R}{\mathbb{R}}$

**Kernel methods** provide a way of handling sets that are not linearly separable.

If we can't linearly separate the original inputs $x^{(1)}, \ldots, x^{(N)}$, perhaps we can linearly separate them after applying a nonlinear transformation to embed them in a higher-dimensional space.

Let $\phi(x)$ be the transformation, or **feature map**, that we apply to our inputs.  The function $\phi$ may be nonlinear.

The range of $\phi$ is the **feature space**, while the space we map from is called **input space** in this context.

XOR is an example of a set of points that are not linearly separable.  Let's map the inputs from $\R^{2}$ to $\R^{3}$ using the transformation
$$
  \phi(x_{1}, x_{2}) = (x_{1}^{2}, x_{2}^{2}, \sqrt{2}x_{1}x_{2}).
$$
Then
\begin{align*}
  F = (+1, +1) &\mapsto (+1, +1, +\sqrt{2}) \\
  T = (+1, -1) &\mapsto (+1, +1, -\sqrt{2}) \\
  T = (-1, +1) &\mapsto (+1, +1, -\sqrt{2}) \\
  F = (-1, -1) &\mapsto (+1, +1, +\sqrt{2}).
\end{align*}
Note that our transformation is **not** one-to-one.

In [None]:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# mpl.rcParams['legend.fontsize'] = 12

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(projection='3d')

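# Images of the two F inputs, (+1, +1) and (-1, -1), under the feature map.
# Both land on the same point (1, 1, +sqrt(2)).
x = [1, 1]
y = [1, 1]
z = [2 ** 0.5, 2 ** 0.5]
ax.scatter(x, y, z, c='r', label='F')

# Images of the two T inputs, (+1, -1) and (-1, +1), under the feature map.
# Both land on the same point (1, 1, -sqrt(2)).
x = [1, 1]
y = [1, 1]
z = [-2 ** 0.5, -2 ** 0.5]
ax.scatter(x, y, z, c='b', label='T')

ax.legend()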
ax.set_xlabel('$x_{1}$')
ax.set_ylabel('$x_{2}$')
ax.set_zlabel('$x_{3}$')

plt.show()

Aha!  Now our two classes are on opposite sides of the $x_{1}x_{2}$-plane (the plane $x_{3} = 0$), so they are linearly separable in feature space!
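
The embedding can also be checked numerically.  The sketch below applies $\phi$ to the four XOR inputs and confirms that the sign of the third feature coordinate alone separates the classes.  The function name `phi`, the label encoding (T as $+1$, F as $-1$), and the particular hyperplane $w = (0, 0, -1)$, $b = 0$ are choices made for this illustration.

```python
import numpy as np

def phi(x):
    """Feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# XOR inputs and labels (illustrative encoding: T = +1, F = -1).
inputs = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])
labels = np.array([-1, +1, +1, -1])

features = np.array([phi(x) for x in inputs])

# One separating hyperplane in feature space is x3 = 0:
# T inputs have x3 < 0 and F inputs have x3 > 0.
w, b = np.array([0.0, 0.0, -1.0]), 0.0
predictions = np.sign(features @ w + b)

print(features)
print(predictions == labels)
```

The first and last rows of `features` coincide, which is the failure of $\phi$ to be one-to-one noted above.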

# Support vector classifiers

In order to derive kernel classifiers we must make a detour to the wonderful world of convex duality!

Consider building an SVC with training cases $x^{(1)}, \ldots, x^{(N)}$.

Recall the QP for the optimal SVC hyperplane:
$\DeclareMathOperator*{\minimize}{minimize}$
$\DeclareMathOperator*{\subjectto}{subject to}$
$\newcommand{\half}{\frac{1}{2}}$
$\newcommand{\norm}[1]{\|\; #1 \;\|}$
$\newcommand{\twonormsq}[1]{\norm{#1}_{2}^{2}}$
\begin{align*}
    \minimize_{w,b} &\quad \half \twonormsq{w} \\
    \subjectto &\quad y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $1 \leq i \leq N$}.
\end{align*}
We will call this QP the **primal** problem.

The **dual** QP is
$\DeclareMathOperator*{\maximize}{maximize}$
\begin{align*}
  \maximize_{\mu = (\mu_{1}, \ldots, \mu_{N})} &\quad \sum_{i=1}^{N} \mu_{i} 
       - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j}(x^{(i)})^{T}x^{(j)} \\
  \subjectto &\quad \mu_{i} \geq 0,\ i = 1, \ldots, N \\
             &\quad \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
\end{align*}

The optimal value of the primal is the same as that of the dual.  Moreover, knowing the solution to one of the two enables us to build a solution to the other.
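
As a rough numerical illustration of this correspondence, the sketch below fits scikit-learn's `SVC` with a linear kernel to a small separable data set and rebuilds the primal solution from the dual one through $w = \sum_{i} \mu_{i} y_{i} x^{(i)}$.  The toy data are made up for this example, and a very large `C` is used as a stand-in for the hard-margin problem, so the agreement is only up to the solver's tolerance.

```python
import numpy as np
from sklearn.svm import SVC

# A small linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [0.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin QP.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores mu_i * y_i, for the support vectors only.
mu_y = clf.dual_coef_.ravel()
mu = np.abs(mu_y)
sv = clf.support_vectors_

# Primal solution rebuilt from the dual solution.
w = mu_y @ sv
b = clf.intercept_[0]

# In the dual objective the double sum equals ||w||^2.
primal_value = 0.5 * (w @ w)
dual_value = mu.sum() - 0.5 * (w @ w)

print("w =", w, " b =", b)
print("primal value:", primal_value, " dual value:", dual_value)
print("margins:", y * (X @ w + b))
```

Only the training cases with $\mu_{i} > 0$ contribute to $w$; these are the support vectors.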

## The dual QP for overlapping classes

Recall the relaxed maximum margin problem for overlapping classes (i.e., classes that are not linearly separable):
\begin{align*}
  \minimize_{w,b,\xi} &\quad \half \twonormsq{w} + C \sum_{i=1}^{N} \xi_{i} \\
  \subjectto &\quad y_{i} (w^{T}x^{(i)} + b) \geq 1 - \xi_{i}, \quad i = 1, \ldots, N \\
             &\quad \xi_{i} \geq 0, \quad i = 1, \ldots, N.
\end{align*}

This QP has the dual
\begin{align*}
  \maximize_{\mu = (\mu_{1}, \ldots, \mu_{N})} &\quad \sum_{i=1}^{N} \mu_{i} 
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j}(x^{(i)})^{T} x^{(j)} \\
  \subjectto &\quad 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
            &\quad \sum_{i=1}^{N} \mu_{i} y_{i} = 0.
\end{align*}

The difference between this dual and the previous dual for linearly separable classes is the upper bound on the $\mu_{i}$.

The solution of this problem is used to build an SVC just as before.

# The kernel trick

The optimal separating hyperplane in the feature space is found by solving
$$
\begin{align*}
    \minimize_{w,b,\xi} &\qquad \half \twonormsq{w} + C \sum_{i} \xi_{i} \\
    \subjectto &\qquad y_{i} (w^{T}\mathbf{\phi}(x^{(i)}) + b) \geq 1 - \xi_{i} \quad \mbox{for all $i$} \\
               &\qquad \xi_{i} \geq 0 \quad \mbox{for all $i$}.
\end{align*}
$$
or the dual,
$$
\begin{align*}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} &\qquad \sum_{i=1}^{N} \mu_{i} 
       - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i} \mu_{j} y_{i} y_{j} \phi(x^{(i)})^{T} \phi(x^{(j)}) \\
    \subjectto &\qquad 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
               &\qquad \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
\end{align*}
$$
The dual turns out to be more interesting.

The training data enter the dual through the inner product of the $\phi(x^{(i)})$.   Let
$$
  K(x,y) = \phi(x)^{T} \phi(y).
$$
$K$ is called a **kernel**.

Then the dual can be written as
$$
\begin{align*}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} &\qquad \sum_{i=1}^{N} \mu_{i} 
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i} \mu_{j} y_{i} y_{j} K(x^{(i)}, x^{(j)}) \\
    \subjectto &\qquad 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
               &\qquad \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
\end{align*}
$$

## The kernel for XOR

Recall the transformation we used for XOR:
$$
  \mathbf{\phi}(x_{1}, x_{2}) = (x_{1}^{2}, x_{2}^{2}, \sqrt{2}x_{1}x_{2}).
$$
Observe that 
\begin{align*}
  \mathbf{\phi}(x_{1}, x_{2})^{T}\mathbf{\phi}(y_{1}, y_{2}) 
  &= \begin{pmatrix} x_{1}^{2} & x_{2}^{2} & \sqrt{2}x_{1}x_{2} \end{pmatrix}
     \begin{pmatrix} y_{1}^{2} \\ y_{2}^{2} \\ \sqrt{2}y_{1}y_{2} \end{pmatrix} \\
  &= x_{1}^{2} y_{1}^{2} + x_{2}^{2} y_{2}^{2} + 2 x_{1}x_{2} y_{1}y_{2} \\
  &= (x_{1} y_{1} + x_{2} y_{2})^{2} \\
  &= (x^{T}y)^{2}.
\end{align*}

In this case the kernel is 
$$
  K(x, y) = (x^{T}y)^{2}.
$$
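
The identity is easy to spot-check numerically.  The sketch below compares the explicit inner product in feature space with the kernel evaluated in input space at a pair of random points; the function names and the random test points are illustrative choices only.

```python
import numpy as np

def phi(x):
    """Feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def quadratic_kernel(x, y):
    """K(x, y) = (x^T y)^2, computed without ever forming phi."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(y)          # inner product in the 3-dimensional feature space
rhs = quadratic_kernel(x, y)   # kernel evaluated in the 2-dimensional input space

print(lhs, rhs)
```

The two numbers should agree up to rounding error, even though the right-hand side never builds the three-dimensional feature vectors.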

## &#191; $\phi(x)$ or $K(x,y)$ ?

If we work with the dual,
$$
  \begin{array}{ll}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} & \sum_{i=1}^{N} \mu_{i} 
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j} K(x^{(i)}, x^{(j)}) \\
    \subjectto & 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
               & \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
  \end{array}
$$
then we don't need to know $\phi$, only $K$.

This leads to the following question: rather than choose a nonlinear transformation $\phi$ to apply to our inputs, can we just choose the kernel $K$ associated with $\phi$?

So, given a function $K(x,y)$, when is there a function $\phi$ such that 
$$
  K(x,y) = \phi(x)^{T} \phi(y)?
$$

## Properties of kernels

Clearly, $K$ needs to be **symmetric**:
$$
  K(x,y) = \phi(x)^{T}\phi(y) = \phi(y)^{T}\phi(x) = K(y,x).
$$

There is a less obvious condition $K$ must satisfy.  Given $x_{1}, \ldots, x_{n}$, consider the $n \times n$ matrix whose $(i,j)$ entry is $K(x_{i},x_{j})$:
$$
  \begin{pmatrix} 
    K(x_{1},x_{1}) & K(x_{1},x_{2}) & \cdots & K(x_{1},x_{n}) \\
    K(x_{2},x_{1}) & K(x_{2},x_{2}) & \cdots & K(x_{2},x_{n}) \\
    \vdots & \vdots & \vdots & \vdots \\
    K(x_{n},x_{1}) & K(x_{n},x_{2}) & \cdots & K(x_{n},x_{n})
  \end{pmatrix}
  = \begin{pmatrix} 
    \phi(x_{1})^{T}\phi(x_{1}) & \phi(x_{1})^{T}\phi(x_{2}) & \cdots & \phi(x_{1})^{T}\phi(x_{n}) \\
    \phi(x_{2})^{T}\phi(x_{1}) & \phi(x_{2})^{T}\phi(x_{2}) & \cdots & \phi(x_{2})^{T}\phi(x_{n}) \\
    \vdots & \vdots & \vdots & \vdots \\
    \phi(x_{n})^{T}\phi(x_{1}) & \phi(x_{n})^{T}\phi(x_{2}) & \cdots & \phi(x_{n})^{T}\phi(x_{n}) \\
      \end{pmatrix}
   = 
    \begin{pmatrix} 
      \phi(x_{1})^{T} \\
      \phi(x_{2})^{T} \\
      \vdots \\ 
      \phi(x_{n})^{T}
    \end{pmatrix}
    \begin{pmatrix} 
      \phi(x_{1}) & \phi(x_{2}) & \cdots & \phi(x_{n}) 
    \end{pmatrix} 
   \equiv \Phi^{T} \Phi.
$$
For any vector $u$ we have
$$
  u^{T} \Phi^{T} \Phi u = (\Phi u)^{T} (\Phi u) = \twonormsq{\Phi u} \geq 0.
$$
This means that if $K$ is a kernel, then the matrix $(K(x_{i},x_{j}))$ must be positive semidefinite for all choices of $x_{1}, \ldots, x_{n}$.
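
The condition can be probed numerically by forming this matrix for a sample of points and looking at its smallest eigenvalue.  In the sketch below the helper names are hypothetical, and the function $-\|x-y\|_{2}^{2}$ is a counterexample chosen for this illustration: it is symmetric, but its matrix has zero diagonal and so cannot be positive semidefinite unless it vanishes.

```python
import numpy as np

def gram_matrix(kernel, points):
    """n-by-n matrix whose (i, j) entry is kernel(points[i], points[j])."""
    return np.array([[kernel(xi, xj) for xj in points] for xi in points])

def min_eigenvalue(G):
    # eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order.
    return np.linalg.eigvalsh(G)[0]

def quadratic(x, y):
    return (x @ y) ** 2              # a kernel

def neg_sq_dist(x, y):
    return -np.sum((x - y) ** 2)     # symmetric, but not a kernel

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))

G_good = gram_matrix(quadratic, points)
G_bad = gram_matrix(neg_sq_dist, points)

print("quadratic kernel, smallest eigenvalue:", min_eigenvalue(G_good))
print("negative squared distance, smallest eigenvalue:", min_eigenvalue(G_bad))
```

A nonnegative smallest eigenvalue (up to rounding error) on one sample does not prove that a function is a kernel, but a clearly negative one proves that it is not.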

# A necessary and sufficient condition for kernels

It turns out that symmetry and positive semidefiniteness are also sufficient.

**Theorem.**
Suppose $K(x,y)$ is symmetric: $K(x,y) = K(y,x)$.  Then there exists $\phi$ such that $K(x,y) = \phi(x)^{T}\phi(y)$ if and only if the matrix $(K(x^{(i)},x^{(j)}))$ is positive semidefinite for any collection of $x^{(i)}$.

A more easily checked condition is

**Theorem [Mercer, 1909]**
Suppose $K(x,y)$ is symmetric: $K(x,y) = K(y,x)$.  Then there exists $\phi$ such that $K(x,y) = \phi(x)^{T}\phi(y)$ if and only if 
$$
    \int\!\!\!\int K(x,y) g(x) g(y)\ dx\ dy \geq 0
$$
for all continuous functions $g$.

So, to summarize, you can either choose a nonlinear transformation $\mathbf{\phi}$ and solve
$$
\begin{align*}
    \minimize_{w,b,\xi} &\qquad \half \twonormsq{w} + C \sum_{i} \xi_{i} \\
    \subjectto &\qquad y_{i} (w^{T}\mathbf{\phi}(x^{(i)}) + b) \geq 1 - \xi_{i}, \quad i = 1, \ldots, N \\
               &\qquad \vphantom{\half} \xi_{i} \geq 0, \quad i = 1, \ldots, N,
\end{align*}
$$

or you can choose a kernel $K$ and solve
$$
\begin{align*}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} &\qquad \sum_{i=1}^{N} \mu_{i} 
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j} K(x^{(i)}, x^{(j)}) \\
    \subjectto &\qquad 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
               &\qquad \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
\end{align*}
$$

# The $\nu$-SVC

A variant of the SVC approach is the $\nu$-SVC.

The kernel formulation of the maximum-margin classifier is
$$
\begin{align*}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} &\qquad \sum_{i=1}^{N} \mu_{i} 
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j} K(x^{(i)}, x^{(j)}) \\
    \subjectto &\qquad 0 \leq \mu_{i} \leq C,\ i = 1, \ldots, N \\
               &\qquad \sum_{i=1}^{N} \mu_{i}y_{i} = 0.
\end{align*}
$$

The $\nu$-SVC approach replaces the penalty weight $C$ with a parameter $\nu$ through an alternative formulation of the problem:
$$
  \begin{align*}
    \maximize_{\mu_{1}, \ldots, \mu_{N}} &\qquad
      - \half \sum_{i=1}^{N}\sum_{j=1}^{N} \mu_{i}\mu_{j}y_{i}y_{j} K(x^{(i)}, x^{(j)}) \\
    \subjectto &\qquad 0 \leq \mu_{i} \leq 1/N,\ i = 1, \ldots, N \\
               &\qquad \sum_{i=1}^{N} \mu_{i}y_{i} = 0 \\
               &\qquad \sum_{i=1}^{N} \mu_{i} \geq \nu.
  \end{align*}
$$

We choose the parameter $\nu$ with $0 < \nu \leq 1$.  

This approach is attractive because we no longer have to guess the "right" value for $C$, while $\nu$ can be directly interpreted:
1. $\nu$ is a lower bound on the fraction of the training cases that will be support vectors;
2. $\nu$ is an upper bound on the fraction of margin errors.
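
The first interpretation can be observed with scikit-learn's `NuSVC`.  The sketch below is only an illustration: the choice of data (two of the three iris species), the kernel settings, and the values of $\nu$ are all assumptions of this example, and libsvm, which `NuSVC` wraps, scales the dual variables differently from the formulation above, although the bound on the fraction of support vectors should carry over.

```python
from sklearn.datasets import load_iris
from sklearn.svm import NuSVC

# Two-class problem: I. versicolor versus I. virginica (50 cases each).
X, y = load_iris(return_X_y=True)
mask = y > 0
X2, y2 = X[mask], y[mask]

sv_fraction = {}
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X2, y2)
    # support_ holds the indices of the training cases that are support vectors.
    sv_fraction[nu] = len(clf.support_) / len(X2)
    print(f"nu = {nu:.1f}: fraction of support vectors = {sv_fraction[nu]:.2f}")
```

In each case the printed fraction should be at least the chosen $\nu$.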

# The kernel classifier

After choosing the kernel $K$ and solving the dual, it appears we still need $\phi$ in order to compute the weight vector:
$$
  w = \sum_{i=1}^{N} \mu_{i} y_{i} \phi(x^{(i)}).
$$

However, $w$ is only used in the classifier, and we have
$$
  y(x) 
  = w^{T}\mathbf{\phi}(x) + b
  = \sum_{i=1}^{N} \mu_{i} y_{i} \phi(x^{(i)})^{T} \phi(x) + b
  = \sum_{i=1}^{N} \mu_{i} y_{i} K(x^{(i)}, x) + b.
$$

Thus, in order to build the classifier it suffices to know the kernel.
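
This formula can be compared against a fitted scikit-learn `SVC`.  The sketch below rebuilds the decision values from the dual coefficients, the support vectors, and the kernel alone; the data set, `gamma`, and `C` are arbitrary choices for this illustration.  It relies on `dual_coef_` holding the products $\mu_{i} y_{i}$ for the support vectors and on `intercept_` holding $b$.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two-class problem: I. versicolor versus I. virginica.
X, y = load_iris(return_X_y=True)
mask = y > 0
X2, y2 = X[mask], y[mask]

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X2, y2)

# K[i, j] = K(support vector i, input j); cases with mu_i = 0 drop out of the sum.
K = rbf_kernel(clf.support_vectors_, X2, gamma=gamma)

# y(x) = sum_i mu_i y_i K(x^(i), x) + b
manual = clf.dual_coef_.ravel() @ K + clf.intercept_[0]

# A positive decision value corresponds to the second entry of classes_.
manual_labels = np.where(manual > 0, clf.classes_[1], clf.classes_[0])

print(np.allclose(manual, clf.decision_function(X2)))
print(np.array_equal(manual_labels, clf.predict(X2)))
```

Nothing in the computation refers to $\phi$, which for the Gaussian kernel could not be stored explicitly anyway.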

# Examples of kernels

Here are several commonly used kernels.

### Gaussian or radial basis function (RBF) kernel

$$
  K(x,y) = \exp(-\gamma \twonormsq{x-y})
$$

This is a very popular and frequently effective kernel.  The effectiveness likely derives from the fact that the feature space is infinite-dimensional (high-dimensional, in practice), giving us room to maneuver.  See the course notes for the details.

### Quadratic kernel

$$
  K(x,y) = (x^{T}y)^{2}
$$

### Polynomial kernel

$$
  K(x,y) = (x^{T}y + c)^{m},\ c > 0
$$

### Sigmoidal kernel

$$
  K(x,y) = \tanh (a x^{T}y + r),\ a > 0, r < 0
$$

The sigmoidal kernel is not positive definite, but does yield a kernel function for all $r$ sufficiently negative.
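
The kernels listed above translate directly into code.  The sketch below writes each one out for a single pair of vectors and compares it with the corresponding function in `sklearn.metrics.pairwise`, where `gamma` plays the role of $\gamma$ or $a$ and `coef0` the role of $c$ or $r$.  The parameter values and the function names on the left are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

def gaussian(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def quadratic(x, y):
    return (x @ y) ** 2

def polynomial(x, y, c, m):
    return (x @ y + c) ** m

def sigmoidal(x, y, a, r):
    return np.tanh(a * (x @ y) + r)

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
X, Y = x[None, :], y[None, :]   # the pairwise functions expect 2-D arrays

pairs = {
    "gaussian": (gaussian(x, y, 0.5), rbf_kernel(X, Y, gamma=0.5)[0, 0]),
    "quadratic": (quadratic(x, y),
                  polynomial_kernel(X, Y, degree=2, gamma=1.0, coef0=0.0)[0, 0]),
    "polynomial": (polynomial(x, y, 1.0, 3),
                   polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0)[0, 0]),
    "sigmoidal": (sigmoidal(x, y, 0.1, -1.0),
                  sigmoid_kernel(X, Y, gamma=0.1, coef0=-1.0)[0, 0]),
}
for name, (ours, theirs) in pairs.items():
    print(f"{name}: {ours:.6f} vs {theirs:.6f}")
```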

# Creating new kernels from existing kernels <a id="new_kernels"/>

Given a kernel, we can create other kernels by a variety of transformations.  If $K_{1}(x,y)$ and $K_{2}(x,y)$ are kernels, then so are the following:
\begin{align*}
  \newcommand{\kone}{K_{1}}
  \newcommand{\ktwo}{K_{2}}
  K(x,y) &= c \kone(x,y),\ \mbox{if $c > 0$,} \\
  K(x,y) &= f(x) \kone(x,y) f(y),\ \mbox{if $f$ is real-valued,} \\
  K(x,y) &= q(\kone(x,y)),\ \mbox{if $q$ is a polynomial with nonnegative coefficients,} \\
  K(x,y) &= \exp(\kone(x,y)), \\
  K(x,y) &= \kone(x,y) + \ktwo(x,y), \\
  K(x,y) &= \kone(x,y) \ktwo(x,y).
\end{align*}

If $\mathbf{\phi}(x) \in \R^{M}$ and $K_{3}$ is a kernel on $\R^{M}$, then
$$
  K(x,y) = K_{3}(\mathbf{\phi}(x),\mathbf{\phi}(y))
$$
is also a kernel.
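
These closure rules can be sanity-checked on a sample of points by confirming that each constructed matrix is still positive semidefinite.  In the sketch below the base kernels (a Gaussian and the quadratic kernel), the function $f$, the polynomial $q$, and the tolerance are all choices made for this illustration; a check on one sample is evidence, not a proof.

```python
import numpy as np

def is_psd(G, tol=1e-8):
    """True if the symmetric matrix G has no eigenvalue below a small negative tolerance."""
    eig = np.linalg.eigvalsh(G)
    return bool(eig[0] >= -tol * max(1.0, eig[-1]))

rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 2))

# Matrices of kernel values for two base kernels on the sample points.
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K1 = np.exp(-0.5 * sq_dists)   # Gaussian kernel with gamma = 0.5
K2 = (pts @ pts.T) ** 2        # quadratic kernel

f = np.sin(pts[:, 0])          # an arbitrary real-valued f, evaluated at each point

# Each rule acts entrywise on the kernel values.
candidates = {
    "c K1": 3.0 * K1,
    "f(x) K1 f(y)": np.outer(f, f) * K1,
    "q(K1)": 2.0 * K1**2 + K1 + 1.0,   # q(t) = 2 t^2 + t + 1
    "exp(K1)": np.exp(K1),
    "K1 + K2": K1 + K2,
    "K1 K2": K1 * K2,
}
results = {name: is_psd(G) for name, G in candidates.items()}
for name, ok in results.items():
    print(f"{name}: positive semidefinite = {ok}")
```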

# Building a kernel SVC in Scikit-Learn

We will use scikit-learn's `SVC` (support vector classifier) class.  `SVC` is based on [libSVM](http://www.csie.ntu.edu.tw/~cjlin/libsvm), which is also used in the R package [e1071](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf).

The scikit-learn documentation is [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

We will use Fisher's iris data set.

In [None]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Fisher's iris data: 150 flowers, 4 measurements each, 3 species.
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

# A kernel SVC with the RBF kernel (the default) and penalty weight C = 1000.
kernel = SVC(kernel="rbf", C=1000)
kernel.fit(X_train, y_train)

# A linear SVC, for comparison.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

# The classifier evaluated in the cells that follow.
clf = kernel

If we look at the options for the `SVC` constructor, we can see the hyperparameter $C$, the weight given to the penalty term for margin errors.
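
One way to look at those options without leaving the notebook is `get_params()`, which every scikit-learn estimator provides.  The short sketch below prints the settings most relevant to kernel SVCs; which names to print is a choice made here, and the defaults shown depend on the installed scikit-learn version (in recent releases `C` defaults to 1.0 and the kernel to RBF).

```python
from sklearn.svm import SVC

# Constructor options of a default SVC, as a dictionary.
params = SVC().get_params()

# C is the penalty weight; gamma, degree, and coef0 parameterize the kernels.
for name in ("C", "kernel", "gamma", "degree", "coef0"):
    print(f"{name} = {params[name]!r}")
```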

## Evaluating the classifier

First, the performance on the test set.

In [None]:
from sklearn import metrics

print("Results for the test set.")
y_pred = clf.predict(X_test)

# Accuracy, precision, recall, and f-score.
print(metrics.classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# The confusion matrix for the test data.
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

class_names=("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, display_labels=class_names)

The preceding shows the misclassification of the test data.  But how did the SVC do on the training data?

Let's find out&hellip;

In [None]:
print("Results for the training set.")
y_pred = clf.predict(X_train)

# Accuracy, precision, recall, and f-score.
print(metrics.classification_report(y_train, y_pred))

In [None]:
# The confusion matrix for the training data.
y_pred = clf.predict(X_train)

cm = confusion_matrix(y_train, y_pred)

class_names=("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(clf, X_train, y_train, display_labels=class_names)

# Kernels for symbolic inputs

Kernels can also be applied to inputs that are symbolic, e.g., sets, graphs, strings, and text.

For instance, let $S$ be a finite set, and consider the collection of all subsets of $S$.  

$\newcommand{\abs}[1]{|\; #1 \;|}$
If $X$ and $Y$ are subsets of $S$, and $\abs{X \cap Y}$ is the cardinality of their intersection, then the **intersection kernel** is 
$$
  K(X,Y) = 2^{\abs{X \cap Y}}.
$$
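
To see why this is a kernel, one possible feature map (an assumption of this sketch, not something stated above) has one coordinate for every subset $A$ of $S$, equal to 1 when $A \subseteq X$ and 0 otherwise.  The inner product then counts the subsets of $X \cap Y$, and there are $2^{|X \cap Y|}$ of them.  The code below checks this on a small example; the set $S$ and the helper names are illustrative.

```python
from itertools import chain, combinations

def all_subsets(S):
    """Every subset of S, as a list of frozensets."""
    items = sorted(S)
    tuples = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(t) for t in tuples]

def phi(X, S):
    """One coordinate per subset A of S: 1 if A is contained in X, else 0."""
    return [1 if A <= X else 0 for A in all_subsets(S)]

def intersection_kernel(X, Y):
    return 2 ** len(X & Y)

S = {"a", "b", "c", "d"}
X = frozenset({"a", "b", "c"})
Y = frozenset({"b", "c", "d"})

inner = sum(p * q for p, q in zip(phi(X, S), phi(Y, S)))
print(inner, intersection_kernel(X, Y))
```

The feature vectors have $2^{|S|}$ coordinates, so for large $S$ only the kernel form is practical.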

## A string kernel

Given two strings $X$ and $Y$ which are members of a set of strings $S$, let
$$
  K(X,Y) = \mbox{the number of substrings that $X$ and $Y$ have in common}.
$$

If we list all possible substrings of elements of $S$, and define $\phi(W)$ to be
$$
  \phi_{i}(W) = \left\{
    \begin{array}{cl}
      1 & \mbox{if string $i$ is a substring of $W$} \\
      0 & \mbox{otherwise},
    \end{array}
  \right.
$$
then 
$$
  K(X,Y) = \phi(X)^{T}\phi(Y).
$$

In this case it is not practical to construct $\phi$, since the number of possible substrings could be huge, but it is easy to work with the kernel.
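
A minimal sketch of this kernel follows, under the assumption that "substring" means a nonempty contiguous substring and that each distinct substring is counted once.  The three-word set $S$ is made up for the example; it is small enough that $\phi$ can be built explicitly and compared with the kernel.

```python
def substrings(w):
    """Set of distinct nonempty contiguous substrings of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def string_kernel(x, y):
    """Number of distinct substrings that x and y have in common."""
    return len(substrings(x) & substrings(y))

# A tiny set of strings, so that the feature map can be written out in full.
S = ["cat", "cart", "art"]
vocabulary = sorted(set().union(*(substrings(w) for w in S)))

def phi(w):
    """Indicator vector over the vocabulary: 1 if the entry is a substring of w."""
    subs = substrings(w)
    return [1 if s in subs else 0 for s in vocabulary]

x, y = "cat", "cart"
inner = sum(p * q for p, q in zip(phi(x), phi(y)))
print(inner, string_kernel(x, y))
```

Here the common substrings are `c`, `a`, `t`, and `ca`, so both numbers are 4.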

## The spectrum kernel

The **spectrum kernel** was developed to build an SVC for protein classification.

We will call the set of all length-$k$ contiguous subsequences of a string its $k$-spectrum.

Given $k$ and a string $x$, define the feature map
$$
  \mathbf{\phi}(x) = (\phi_{a}(x))_{\mbox{all sequences $a$ of length $k$ from our alphabet}},
$$
where 
$$
  \phi_{a}(x) = \mbox{the number of times $a$ occurs in $x$}.
$$
The $k$-spectrum kernel is then
$$
  K(x,y) = \mathbf{\phi}(x)^{T}\mathbf{\phi}(y).
$$
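
A minimal sketch of the $k$-spectrum kernel follows, assuming that occurrences are counted as contiguous, possibly overlapping, length-$k$ windows.  Only the windows that actually occur are stored, so the full feature vector over all length-$k$ sequences is never formed.  The function names and the example strings are illustrative.

```python
from collections import Counter

def k_spectrum(x, k):
    """Counts of every contiguous length-k window of the string x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, y, k):
    """phi(x)^T phi(y), summed over the windows that occur in x."""
    sx, sy = k_spectrum(x, k), k_spectrum(y, k)
    # A Counter returns 0 for windows that never occur in y.
    return sum(count * sy[a] for a, count in sx.items())

x, y = "ABAB", "BABA"
print(k_spectrum(x, 2))
print(k_spectrum(y, 2))
print(spectrum_kernel(x, y, 2))
```

For these strings the 2-spectra are `AB`: 2, `BA`: 1 and `BA`: 2, `AB`: 1, so the kernel value is $2 \cdot 1 + 1 \cdot 2 = 4$.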

#### This notebook was brought to you by Savage Panda Attacks

In [None]:
import IPython.display
IPython.display.YouTubeVideo('I-ovzUNno7g')