<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416-01/516-01, Fundamentals of AI/ML, Fall 2025</h1>
    <h1>Support vector classifiers</h1>
</div>

# Contents 

- [Linear classifiers](#Linear-classifiers)
- [Characterization of separating hyperplanes](#Characterization-of-separating-hyperplanes)
- [Modifying the separability condition](#Modifying-the-separability-condition)
- [Which separator is right for you?](#Which-separator-is-right-for-you?)
- [Finding the optimal separating hyperplane](#Finding-the-optimal-separating-hyperplane)
- [Overlapping classes](#Overlapping-classes)
    - [Relaxing the separation constraint](#Relaxing-the-separation-constraint)
    - [The maximum margin SVC](#The-maximum-margin-SVC)
- [Parameters vs hyperparameters](#Parameters-vs-hyperparameters)
- [Building an SVC in Scikit-Learn](#Building-an-SVC-in-Scikit-Learn)
- [Evaluating a classifier](#Evaluating-a-classifier)
    - [Accuracy](#Accuracy)
    - [True positive and false positive rates](#True-positive-and-false-positive-rates)
    - [The confusion matrix](#The-confusion-matrix)
    - [Precision and recall](#Precision-and-recall)
    - [The $F$-score](#The-$F$-score)
    - [Sensitivity and specificity](#Sensitivity-and-specificity)
- [Evaluating the iris SVC](#Evaluating-the-iris-SVC)
   

# Linear classifiers

The goal of classification is to divide the input space into **decision regions** with a single class in each region.

The boundaries of the decision regions are called the **decision boundaries** or **decision surfaces**.

In **linear classification**, the decision surfaces are hyperplanes.  In $n$ dimensions, a hyperplane has the form
$
\newcommand{\R}{\mathbb{R}}
\newcommand{\Rn}{\R^{n}}
$
$$
    \{x \in \Rn \;|\; w_{1}x_{1} + \cdots + w_{n}x_{n} + b = w^{T} x + b = 0\}
$$
for some $w \in \Rn$ and $b \in \R$.

Thus, to build a linear separator for binary classification with classes $C_{-1}$ and $C_{+1}$, we seek a scalar $b$ and vector $w$ such that 
\begin{align*}
  w^{T}x^{(i)} + b &> 0
\end{align*}
for all $x^{(i)} \in C_{+1}$, and
\begin{align*}
  w^{T}x^{(i)} + b &< 0
\end{align*}
for all $x^{(i)} \in C_{-1}$.

For binary classification, applying the classifier is, in principle, a matter of computing $w^{T}x + b$ and looking at the sign of the result.
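
As a minimal sketch of this rule, the snippet below classifies points by the sign of $w^{T}x + b$. The helper name `predict_linear` and the values of `w_demo` and `b_demo` are illustrative assumptions, not the output of a trained model.

```python
import numpy as np

def predict_linear(X, w, b):
    """Return +1 or -1 for each row of X according to the sign of w^T x + b."""
    scores = X @ w + b
    # Points exactly on the hyperplane (score 0) are assigned to +1 by convention here.
    return np.where(scores >= 0, 1, -1)

# Hypothetical separator: the line x1 + x2 = 3 in the plane.
w_demo = np.array([1.0, 1.0])
b_demo = -3.0
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 0.5]])
labels_demo = predict_linear(X_demo, w_demo, b_demo)
print(labels_demo)
```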

# Characterization of separating hyperplanes

**Lemma.**
Given two finite sets $C_{-1}, C_{+1} \subset \Rn$, there exist $w \in \Rn$ and $b \in \R$ such that
\begin{align*}
  w^{T}x^{(i)} + b &> 0 \quad \mbox{if $x^{(i)} \in C_{+1}$}, \\
  w^{T}x^{(i)} + b &< 0 \quad \mbox{if $x^{(i)} \in C_{-1}$},
\end{align*}
if and only if there exist $v \in \Rn$ and $c \in \R$ such that
\begin{align*}
  v^{T}x^{(i)} + c &\geq +1 \quad \mbox{if $x^{(i)} \in C_{+1}$}, \\
  v^{T}x^{(i)} + c &\leq -1 \quad \mbox{if $x^{(i)} \in C_{-1}$},
\end{align*}
with $|\;v^{T}x^{(i)} + c\;| = 1$ for at least one $i$.
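
One direction of the lemma is constructive: dividing $w$ and $b$ by the smallest value of $|w^{T}x^{(i)} + b|$ over the data gives a valid $v$ and $c$. Here is a small sketch of that rescaling on made-up points; the helper `rescale_separator` is a hypothetical name used only for illustration.

```python
import numpy as np

def rescale_separator(X, w, b):
    """Scale (w, b) so that the smallest |w^T x + b| over the rows of X equals 1."""
    m = np.min(np.abs(X @ w + b))  # m > 0 whenever (w, b) strictly separates the data
    return w / m, b / m

# Toy data: two points in each class, strictly separated by the line x1 = 1.
X_pos = np.array([[2.0, 0.0], [3.0, 1.0]])
X_neg = np.array([[0.0, 0.0], [-1.0, 2.0]])
w0 = np.array([5.0, 0.0])
b0 = -5.0

v, c = rescale_separator(np.vstack([X_pos, X_neg]), w0, b0)
print("v =", v, " c =", c)
print("positive class:", X_pos @ v + c)  # every entry is >= +1
print("negative class:", X_neg @ v + c)  # every entry is <= -1
```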

# Modifying the separability condition 

In light of the lemma, we can replace the original separability condition
\begin{align*}
  w^{T}x^{(i)} + b &> 0 \quad \mbox{if $x^{(i)} \in C_{+1}$}, \\
  w^{T}x^{(i)} + b &< 0 \quad \mbox{if $x^{(i)} \in C_{-1}$},
\end{align*}
with the condition
\begin{align*}
  w^{T}x^{(i)} + b &\geq +1 \quad \mbox{if $x^{(i)} \in C_{+1}$}, \\
  w^{T}x^{(i)} + b &\leq -1 \quad \mbox{if $x^{(i)} \in C_{-1}$}.
\end{align*}

In general, if we have two classes $C_{+1}$ and $C_{-1}$, let
$$
  y_{i} = \left\{
    \begin{array}{cl}
      +1 & \mbox{if $x^{(i)} \in C_{+1}$,} \\
      -1 & \mbox{if $x^{(i)} \in C_{-1}$.} 
    \end{array}
  \right.
$$
The property 
\begin{align*}
  w^{T}x^{(i)} + b &\geq +1 \quad \mbox{if $x^{(i)} \in C_{+1}$}, \\
  w^{T}x^{(i)} + b &\leq -1 \quad \mbox{if $x^{(i)} \in C_{-1}$},
\end{align*}
is then equivalent to
$$
  y_{i} (w^{T}x^{(i)} + b) \geq +1 \quad \mbox{for all $i$}.
$$
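
With labels in $\{-1, +1\}$, the compact condition is a one-line vectorized check. The sketch below uses toy data and a hypothetical helper `satisfies_margin_condition`; it also shows that a hyperplane can separate the classes without being scaled to meet the condition.

```python
import numpy as np

def satisfies_margin_condition(X, y, w, b):
    """Check y_i (w^T x_i + b) >= 1 for every row x_i of X, with labels y_i in {-1, +1}."""
    return bool(np.all(y * (X @ w + b) >= 1))

X_toy = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
y_toy = np.array([1, 1, -1, -1])

# Both choices describe the same line x1 = 1; only the first is scaled appropriately.
ok_scaled = satisfies_margin_condition(X_toy, y_toy, np.array([1.0, 0.0]), -1.0)
ok_unscaled = satisfies_margin_condition(X_toy, y_toy, np.array([0.5, 0.0]), -0.5)
print(ok_scaled, ok_unscaled)
```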

# Which separator is right for you?

Let's grab Fisher's iris data once again.

In [None]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

all_features = {'sepal_length':0, 'sepal_width':1, 'petal_length':2, 'petal_width':3}

excluded = 'versicolor'  # Any species to exclude.
features = ['sepal_width', 'petal_length']  # Features to be used.

columns = [all_features[f] for f in features]
print('Columns:', columns)
    
X = iris.data[:, columns]
y = iris.target

classes = {'setosa':0, 'versicolor':1, 'virginica':2, 'none':-1}  # 'none' matches no label, so nothing is excluded.

# Grab only rows for the non-excluded species.
X = X[np.where(y != classes[excluded])]
y = y[np.where(y != classes[excluded])]    

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

In [None]:
import matplotlib.pyplot as plt

# Parameters
num_classes = 3
plot_colors = "yrb"

# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15)


plt.plot([1, 5], [0, 5], color='c', linestyle='-', linewidth=2)
plt.plot([1, 5], [2, 2], color='m', linestyle='-', linewidth=2)
plt.plot([1, 5], [2.25, 4.25],  color='g', linestyle='-', linewidth=2)

plt.tight_layout()
plt.show()

Intuitively, we want a separating hyperplane that is as much "in the middle" as possible to avoid misclassification.

Given any hyperplane, one of the $x^{(k)}$ is closest to the hyperplane.

Let's choose the separating hyperplane that maximizes this minimum distance:
$$
\max_{H \in \mbox{\scriptsize hyperplanes}} \min_{k}\ \{ \mbox{distance from $x^{(k)}$ to H} \}.
$$

We will measure distance in the Euclidean norm.

# Finding the optimal separating hyperplane

The Euclidean distance from a point $x^{(i)}$ to the hyperplane
$$
  \{z \in \Rn \;|\; w^{T}z + b = 0\}
$$
is
$$
  \newcommand{\abs}[1]{|\; #1 \;|}
  \newcommand{\norm}[1]{\|\; #1 \;\|}
  \newcommand{\twonorm}[1]{\norm{#1}_{2}}
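  \frac{\abs{w^{T}x^{(i)} + b}}{\twonorm{w}} = \frac{y_{i}(w^{T}x^{(i)} + b)}{\norm{w}},
$$
where the second expression applies when $x^{(i)}$ lies on the correct side of the hyperplane, so that $y_{i}(w^{T}x^{(i)} + b) > 0$.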

The minimum distance from the $x^{(i)}$ to the hyperplane is thus
$$
  \min_{i} \frac{y_{i}(w^{T}x^{(i)} + b)}{\norm{w}}.
$$
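
To make the max-min idea concrete, here is a short sketch that evaluates this minimum distance for two candidate separators of a toy data set. The data and the helper `min_distance` are assumptions made for illustration.

```python
import numpy as np

def min_distance(X, y, w, b):
    """Smallest value of y_i (w^T x_i + b) / ||w||_2 over the rows of X.

    A negative result means at least one point is on the wrong side.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X_toy = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
y_toy = np.array([1, 1, -1, -1])

# Two hypothetical separating lines: x1 = 1 and x1 = 1.5.
d_middle = min_distance(X_toy, y_toy, np.array([1.0, 0.0]), -1.0)
d_offset = min_distance(X_toy, y_toy, np.array([2.0, 0.0]), -3.0)
print(d_middle, d_offset)
```

The line through the middle has the larger minimum distance, so it is the better of these two candidates.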

So, the problem we want to solve is
$
  \DeclareMathOperator*{\minimize}{\mbox{minimize}}
  \DeclareMathOperator*{\maximize}{\mbox{maximize}}
  \DeclareMathOperator*{\subjectto}{\mbox{subject to}}
$
\begin{align*}
  \maximize_{w,b} &\quad \min_{i} \left(y_{i} \frac{w^{T}x^{(i)} + b}{\norm{w}}\right) \\
  \subjectto &\quad y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $i$}.
\end{align*}

Moreover, from our lemma we can arrange that $y_{i} (w^{T}x^{(i)} + b) = 1$ for at least one $i$, so
$$
  \min_{i} \left(y_{i} \frac{w^{T}x^{(i)} + b}{\norm{w}}\right) = \frac{1}{\norm{w}}.
$$

Thus we arrive at
$$ 
  \begin{array}{ll}
    \maximize_{w,b} & \displaystyle \frac{1}{\norm{w}} \\
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $i$},
  \end{array}
$$
or, equivalently,
$$
  \newcommand{\half}{\frac{1}{2}}
  \newcommand{\normsq}[1]{\norm{#1}^{2}}
  \begin{array}{ll}
    \minimize_{w,b} & \half \normsq{w} \\
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $i$}.
  \end{array}
$$

This is a quadratic program (QP) &ndash; minimization of a quadratic function subject to linear inequality constraints.

The optimal separating hyperplane is given by the solution of this QP.

Suppose we have solved
$$
  \begin{array}{ll}
    \minimize_{w,b} & \half \twonorm{w}^{2} \\
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $i$}.
  \end{array}
$$

A **support vector** is any point in either class for which 
$$
  y_{i} (w^{T}x^{(i)} + b) = 1.
$$

Support vectors are the points in the training set closest to the separating hyperplane.

The solution of the QP defines a **support vector classifier** (we will generalize this shortly).
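
The following sketch identifies the support vectors for a toy problem whose solution can be found by inspection. The closest points of opposite classes are $2$ apart, so no separating hyperplane can have minimum distance greater than $1$; the line $x_{1} = 1$ attains that value. All data and variable names here are illustrative assumptions.

```python
import numpy as np

X_toy = np.array([[2.0, 0.0], [2.0, 1.0], [3.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
y_toy = np.array([1, 1, 1, -1, -1])

# Solution of the QP for this toy data, found by inspection: the line x1 = 1.
w_opt = np.array([1.0, 0.0])
b_opt = -1.0

constraint_values = y_toy * (X_toy @ w_opt + b_opt)
is_support = np.isclose(constraint_values, 1.0)  # constraints that hold with equality
margin = 1.0 / np.linalg.norm(w_opt)             # distance from support vectors to the hyperplane

print("constraint values:", constraint_values)
print("support vectors:\n", X_toy[is_support])
print("margin:", margin)
```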

# Overlapping classes

If the classes overlap, then it is impossible to satisfy all of the constraints.

A constrained optimization problem that has constraints that cannot be satisfied is called **infeasible**.

In [None]:
excluded = 'setosa'  # Any species to exclude.
    
X = iris.data[:, columns]
y = iris.target

# Grab only rows for the non-excluded species.
X = X[np.where(y != classes[excluded])]
y = y[np.where(y != classes[excluded])]    

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

In [None]:
# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15)


# plt.plot([1, 5], [5, 5], color='c', linestyle='-', linewidth=2)
# plt.plot([1, 5], [4, 6], color='m', linestyle='-', linewidth=2)
# plt.plot([1, 5], [2.25, 4.25],  color='g', linestyle='-', linewidth=2)

plt.tight_layout()
plt.show()

## Relaxing the separation constraint
One approach to sets that are not cleanly linearly separable is to allow some misclassification, but to penalize for it.

Introduce new variables $\xi_{i} \geq 0$ to the problem, and relax the separability constraints from
$$
  \begin{array}{ll}
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 \quad \mbox{for all $i$}.
  \end{array}
$$
to 
$$
  \begin{array}{ll}
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \quad \mbox{for all $i$}.
  \end{array}
$$
We can ensure that there exist $w$ and $b$ satisfying the latter constraints if the values of  the $\xi_{i}$ are sufficiently large.

The $\xi_{i}$ are a version of **slack variables**---they allow us to relax a condition by allowing slack in its satisfaction.

When we have solved the QP, there are three possibilities for each training case $i$:
1. if $\xi_{i} = 0$ the case is properly classified and safely away from the separating hyperplane;
2. if $0 < \xi_{i} \leq 1$ the case is properly classified, but is within distance
   $$
      \frac{1 - \xi_{i}}{\norm{w}}
   $$
   of the separating hyperplane;
3. if $\xi_{i} > 1$ the case is incorrectly classified, since it lies on the wrong side of the separating hyperplane.

A **margin error** is a point whose corresponding $\xi_{i}$ is greater than $0$.  These points lie on the wrong side of the margin around the separating hyperplane and are potentially (but not necessarily) misclassified.
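
For a fixed $w$ and $b$, the smallest slack that satisfies both constraints is $\xi_{i} = \max(0,\, 1 - y_{i}(w^{T}x^{(i)} + b))$. The sketch below computes these values for made-up one-dimensional data and sorts each case into the three possibilities listed above; the helper `slack_values` and the separator are assumptions for illustration.

```python
import numpy as np

def slack_values(X, y, w, b):
    """Smallest feasible slacks: xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Toy 1-D data with a fixed, hypothetical separator x = 0 (w = 1, b = 0).
X_1d = np.array([[2.0], [0.5], [-0.5], [-3.0]])
y_1d = np.array([1, 1, 1, -1])
xi = slack_values(X_1d, y_1d, np.array([1.0]), 0.0)

for x_i, xi_i in zip(X_1d[:, 0], xi):
    if xi_i == 0:
        status = "correct, outside the margin"
    elif xi_i <= 1:
        status = "margin error, correctly classified"
    else:
        status = "margin error, misclassified"
    print(f"x = {x_i:+.1f}  xi = {xi_i:.1f}  {status}")
```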

## The maximum margin SVC

However, we don't want the $\xi_{i}$ to be too large, since that allows a greater proportion of misclassification, so we'll penalize large values of $\xi_{i}$.

This leads to a relaxed version of the problem called the **maximum margin problem**.
  
Choose a weight $C > 0$, and solve
$$
  \DeclareMathOperator*{\minimize}{\mbox{minimize}}
  \DeclareMathOperator*{\maximize}{\mbox{maximize}}
  \DeclareMathOperator*{\subjectto}{\mbox{subject to}}
  \begin{array}{ll}
    \minimize_{w,b,\xi} & \half \twonorm{w}^{2} + C \sum_{i} \xi_{i} \\
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \quad \mbox{for all $i$} \\
               & \xi_{i} \geq 0 \quad \mbox{for all $i$}.
  \end{array}
$$

Suppose we have
* $n$ feature variables in $x$, and
* $N$ feature vectors in our training set.

Then the maximum margin QP has
* $n+N+1$ optimization variables $w, \mathbf{\xi}, b$, and
* $2N$ constraints.
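
As a rough illustration, the sketch below evaluates the objective for a fixed, hand-picked $w$ and $b$ (not an optimal solution) and counts the variables and constraints for a tiny made-up data set. The helper `soft_margin_objective` is a hypothetical name.

```python
import numpy as np

def soft_margin_objective(X, y, w, b, C):
    """Value of 0.5 * ||w||^2 + C * sum(xi), using the smallest feasible slacks xi."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * np.sum(xi)

X_1d = np.array([[2.0], [0.5], [-0.5], [-3.0]])
y_1d = np.array([1, 1, 1, -1])
w_try = np.array([1.0])
b_try = 0.0

N_demo, n_demo = X_1d.shape
print("variables:  ", n_demo + N_demo + 1)  # w, xi, and b
print("constraints:", 2 * N_demo)           # margin constraints plus xi >= 0

obj_small_C = soft_margin_objective(X_1d, y_1d, w_try, b_try, C=0.1)
obj_large_C = soft_margin_objective(X_1d, y_1d, w_try, b_try, C=10.0)
print(obj_small_C, obj_large_C)
```

The same $w$ and $b$ cost far more when $C$ is large, because the total slack is weighted more heavily.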

# Parameters vs hyperparameters

To construct the SVC, we must choose a weight $C > 0$:
$$
  \begin{array}{ll}
    \minimize_{w,b,\xi} & \half \twonorm{w}^{2} + C \sum_{i} \xi_{i} \\
    \subjectto & y_{i} (w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \quad \mbox{for all $i$} \\
               & \xi_{i} \geq 0 \quad \mbox{for all $i$}.
  \end{array}
$$
In this context, the $w, b$ are **parameters** in the SVC, while $C$ is a **hyperparameter** used to
construct the SVC.
* The choice of hyperparameters affects the choice of parameters for our ML method.
* In turn, the choice of parameters affects the performance of our ML method.
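
To see the first point in action, the sketch below fits a linear SVC for several values of $C$ on two overlapping iris species and reports the size of the fitted $w$. The particular values of $C$ and the variable names are assumptions chosen for illustration; `coef_` is available here because the kernel is linear.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
# Two overlapping species (versicolor = 1, virginica = 2) and two features,
# sepal width and petal length.
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask][:, [1, 2]]
y_demo = iris_demo.target[mask]

w_norms = {}
for C_value in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C_value).fit(X_demo, y_demo)
    w_norms[C_value] = float(np.linalg.norm(model.coef_))
    print(f"C = {C_value:6.2f}  ||w|| = {w_norms[C_value]:.3f}  "
          f"support vectors = {model.support_.size}")
```

A small $C$ tolerates more slack and yields a smaller $\|w\|$ (a wider margin); a large $C$ penalizes slack heavily and narrows the margin.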

# Building an SVC in Scikit-Learn

We will use the scikit-learn SVC (support vector classifier) module.  SVC is based on [libSVM](http://www.csie.ntu.edu.tw/~cjlin/libsvm), which is also used in the R package [e1071](https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf).

The documentation is [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

We will use all three iris species.

In [None]:
X = iris.data[:, columns]
y = iris.target

# Grab only rows for the non-excluded species.
# X = X[np.where(y != classes[excluded])]
# y = y[np.where(y != classes[excluded])] 

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

In [None]:
from sklearn.svm import SVC

# The choice kernel='linear' is important.
svc = SVC(kernel='linear', C=0.1, random_state=0)
svc.fit(X_train, y_train)

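# Size of the maximum margin QP over all N training points.
# Note: with three classes, SVC fits one binary classifier per pair of classes,
# so each pairwise problem involves only the points from its two classes.
N, n  = X_train.shape
print("Size of QP:")
print(f"  Number of variables:   {n + N + 1}")
print(f"  Number of constraints: {2*N}")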

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay

# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    svc,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[columns[0]],  # sepal width
    ylabel=iris.feature_names[columns[1]]   # petal length
)

# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15
    )

plt.suptitle("Decision boundaries of the SVC showing the training data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
# Adjust the layout only after the figure exists; calling it first opens an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
_ = plt.axis("tight")

Let's take a look at the support vectors:

In [None]:
# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    svc,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[columns[0]],  # sepal width
    ylabel=iris.feature_names[columns[1]]   # petal length
)

# Plot the support vectors.
for i, color in zip(range(num_classes), plot_colors):
    # Indices of training points in class i that are also support vectors.
    idx = np.intersect1d(np.where(y_train == i)[0], svc.support_)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15
    )

plt.suptitle("Decision boundaries of the SVC showing the support vectors")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
# Adjust the layout only after the figure exists; calling it first opens an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
_ = plt.axis("tight")

# Evaluating a classifier

How should we evaluate classifiers?  How should we compare the effectiveness of classifiers?

There is a bewildering variety of performance metrics for binary classifiers.

## Accuracy

One obvious metric is **accuracy**:
$$
  \mbox{accuracy} = \frac{\mbox{number of correctly classified test cases}}{\mbox{number of test cases}}.
$$
The misclassification rate (error rate) is 1 - accuracy.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = svc.predict(X_test)
print(f"Misclassified test samples: {(y_test != y_pred).sum()}")
print(f"Accuracy on test set: {accuracy_score(y_test, y_pred):.2f}")

In [None]:
y_pred = svc.predict(X_train)
print(f"Misclassified training samples: {(y_train != y_pred).sum()}")
print(f"Accuracy on training set: {accuracy_score(y_train, y_pred):.2f}")

However, high accuracy might not be what we want:
* is it better for a spam filter to let some spam through than to block important messages?
* is it better for a cancer test to err on the side of caution and tell you you have cancer when you don't?

## True positive and false positive rates

Consider a two-class problem with classes **positive** and **negative**, denoted by $\oplus$ and $\ominus$, respectively.

Correctly classified positives and negatives are called **true positives** and **true negatives**.

Incorrectly classified positives are called **false negatives** (misses), while incorrectly classified negatives are called **false positives** (false alarms).

In true/false positive/negative, 
* positive/negative refers to the classifier's prediction, while 
* true/false refers to whether this prediction is correct.

The **true positive rate** (TPR) and **false positive rate** (FPR) are defined as follows.  Let
\begin{align*}
  P  &= \mbox{true number of $\oplus$'s}, \\
  N  &= \mbox{true number of $\ominus$'s}, \\
  TP &= \mbox{number of true positives}, \\
  TN &= \mbox{number of true negatives}, \\
  FP &= \mbox{number of false positives}, \\
  FN &= \mbox{number of false negatives}.
\end{align*}
Then
\begin{align*}
  TPR &= \frac{TP}{P} = \frac{TP}{FN + TP}, \\
  FPR &= \frac{FP}{N} = \frac{FP}{FP + TN}.
\end{align*}
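
A short sketch of these two formulas, using illustrative counts (75 actual positives and 25 actual negatives); the helper name `rates` is an assumption.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four counts of a binary classifier."""
    tpr = tp / (tp + fn)  # fraction of actual positives that were found
    fpr = fp / (fp + tn)  # fraction of actual negatives flagged as positive
    return tpr, fpr

tpr_demo, fpr_demo = rates(tp=60, fn=15, fp=10, tn=15)
print(f"TPR = {tpr_demo:.2f}, FPR = {fpr_demo:.2f}")
```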

## The confusion matrix

A **two-class contingency table** is the confusion matrix in the case of binary classification:
<table>
<tr><th></th><th>classified as $\oplus$</th> <th>classified as $\ominus$</th><th></th></tr>
<tr><td>actually $\oplus$</td><td style="color:blue;"><b>60</b></td> <td style="color:red;"><b>15</b></td><td>75</td></tr>
<tr><td>actually $\ominus$</td><td style="color:red;"><b>10</b></td><td style="color:blue;"><b>15</b></td><td>25</td></tr>
<tr><td></td><td>70</td><td>30</td><td>100</td></tr>
</table>
The NW-SE diagonal shows the correctly classified cases, while the SW-NE diagonal shows the incorrectly classified cases.  The numbers on the borders are the row and column sums.

More generally, the confusion matrix will give some details of who is being misclassified as what.  This allows us to look for systematic misclassifications.
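
For more than two classes the matrix is built the same way: entry $(i, j)$ counts the cases of true class $i$ that were classified as class $j$. Here is a minimal sketch with made-up labels; the helper `confusion_counts` is a hypothetical name, and the row/column convention matches the tables in these notes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Matrix whose (i, j) entry counts cases of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for a three-class problem with classes 0, 1, 2.
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 1, 2, 2, 1, 2])
cm_demo = confusion_counts(y_true_demo, y_pred_demo, 3)
print(cm_demo)
print("row sums (actual):      ", cm_demo.sum(axis=1))
print("column sums (predicted):", cm_demo.sum(axis=0))
```

In this made-up example the only errors are confusions between classes 1 and 2, which is the kind of systematic pattern the matrix makes visible.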

## Precision and recall

Again consider a two-class problem with classes **positive** and **negative**, denoted by
$\oplus$ and $\ominus$, respectively.

**Precision**: of all the $\oplus$'s you found, what fraction of them were really $\oplus$'s? 

**Recall**: of all the $\oplus$'s that are really there, what fraction of them did you find? 

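In terms of our earlier notation, the precision $P$ and recall $R$ are given by
$$
  P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{FN + TP}.
$$
Here $P$ denotes precision, not the number of actual positives that appeared in the definition of the true positive rate.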

$P$ and $R$ lie in the interval $[0,1]$.  The closer the values are to 1, the better the classifier.

Note that recall is the same as the true positive rate.

For the positives in the confusion matrix
<table>
<tr><th></th><th>predicted $\oplus$</th> <th>predicted $\ominus$</th>
<tr><td>actually $\oplus$</td><td style="color:blue;"><b>60</b></td> <td style="color:red;"><b>15</b></td>
<tr><td>actually $\ominus$</td><td style="color:red;"><b>10</b></td><td style="color:blue;"><b>15</b></td>
</table>
we have
\begin{align*}
  TP &= 60 \\
  FP &= 10 \\
  FN &= 15 \\
  P &= TP/(TP + FP) = 60/70 = 0.857 \\
  R &= TP/(FN + TP) = 60/75 = 0.8.
\end{align*}

For the positives, this classifier has both good precision and good recall.

### Alternative definition

The use of the terms *positive* and *negative* can sometimes be confusing.

More generally, suppose $C$ is one of our classes.  Let
\begin{align*}
  F &= \mbox{number of cases classified as $C$'s}, \\
  T &= \mbox{true number of $C$'s in our test set}, \\
  B &= \mbox{number of cases that are both in $C$ and classified as such}.
\end{align*}

Then the precision $P$ and recall $R$ for class $C$ are given by
$$
  P = B/F, \quad R = B/T.
$$

For the negatives in the confusion matrix
<table>
<tr><th></th><th>classified as $\oplus$</th> <th>classified as $\ominus$</th>
<tr><td>actually $\oplus$</td><td style="color:blue;"><b>60</b></td> <td style="color:red;"><b>15</b></td>
<tr><td>actually $\ominus$</td><td style="color:red;"><b>10</b></td><td style="color:blue;"><b>15</b></td>
</table>
we have
\begin{align*}
  F &= 30 \\
  T &= 25 \\
  B &= 15 \\
  P &= B/F = 15/30 = 0.5 \\
  R &= B/T = 15/25 = 0.6.
\end{align*}
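
The quantities $B$, $F$, and $T$ can be read off a confusion matrix for every class at once: $B$ is the diagonal, $F$ the column sums, and $T$ the row sums. A small sketch using the matrix above (the variable names are assumptions for illustration):

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (order: positive, negative).
cm_binary = np.array([[60, 15],
                      [10, 15]])

B_counts = np.diag(cm_binary)     # in the class and classified as such
F_counts = cm_binary.sum(axis=0)  # column sums: number classified as each class
T_counts = cm_binary.sum(axis=1)  # row sums: true number in each class

precision_per_class = B_counts / F_counts
recall_per_class = B_counts / T_counts
print("precision:", precision_per_class)
print("recall:   ", recall_per_class)
```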

## The $F$-score

The $F$-score combines precision and recall via harmonic averaging.

The **harmonic mean** $h(a,b)$ of $a$ and $b$ is the reciprocal of the average of their reciprocals:
$$
  \newcommand{\half}{\frac{1}{2}}
  h(a,b) = \frac{1}{\half\left(\frac{1}{a} + \frac{1}{b}\right)},
$$
with the convention $1/0 = \infty$.

The harmonic mean has the following properties:
1. if $a \leq b$, then $a \leq h(a,b) \leq b$;
2. $h(a,0) = h(0,b) = 0$;
3. $h(a,a) = a$ (even if $a = 0$).

Property 2 suggests that the harmonic mean weighs terms close to $0$ more heavily.

One way to combine precision and recall into a single number is to take their harmonic mean:
$$
  F = \frac{1}{\half\left(\frac{1}{P} + \frac{1}{R}\right)}.
$$
This is called the **F-score** or **F1-score**.

The harmonic mean favors a balance of roughly equal precision and recall.

If $P = R$, then $F = P = R$.

If $P = R = 1$, then $F = 1$.

On the other hand, if $P = 0.05$ and $R = 0.9$, then $F = 0.0947$.
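
These values are easy to check numerically. The sketch below uses the algebraically equivalent form $F = 2PR/(P + R)$ and handles zeros explicitly; the helper name `f_score` is an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, with the convention h(a, 0) = 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f_balanced = f_score(0.8, 0.8)            # equal precision and recall
f_lopsided = f_score(0.05, 0.9)           # the lopsided example from the text
f_positives = f_score(60 / 70, 60 / 75)   # the positives in the earlier confusion matrix
print(f"{f_balanced:.4f} {f_lopsided:.4f} {f_positives:.4f}")
```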

In [None]:
fig = plt.figure()

# Make data.
P = np.linspace(1.0e-16, 1)
R = np.linspace(1.0e-16, 1)
P, R = np.meshgrid(P, R)
F = 1/(0.5*(1/P + 1/R))

# Plot the surface.
mesh = plt.pcolormesh(P, R, F, cmap=plt.cm.OrRd)

# Add a color bar which maps values to colors.
fig.colorbar(mesh)

plt.xlabel('P')
plt.ylabel('R')
plt.title('Heatmap of F-score(P,R)')

## Sensitivity and specificity

Another evaluation metric is sensitivity/specificity.

**Sensitivity** is the same thing as recall: do I almost always find everything I am looking for?  It is the **true positive rate**, the proportion of positive cases that are correctly classified.

**Specificity** refers to whether we do a good job not finding things we are not looking for.  It is the **true negative rate**, the proportion of negative cases that are correctly classified.

For example, a good medical test is sensitive and specific.
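
A brief sketch of both quantities from the four counts, again with illustrative numbers; note that specificity equals $1 - FPR$. The helper name `sensitivity_specificity` is an assumption.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four counts of a binary classifier."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity

sens_demo, spec_demo = sensitivity_specificity(tp=60, fn=15, fp=10, tn=15)
print(f"sensitivity = {sens_demo:.2f}, specificity = {spec_demo:.2f}")
```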

# Evaluating the iris SVC

Here we will use the confusion matrix, precision, recall, and $F$-score.

In [None]:
from sklearn import metrics

print("Results for the test set.")
y_pred = svc.predict(X_test)

# Accuracy, precision, recall, and f-score.
print(metrics.classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# The confusion matrix for the test data.
y_pred = svc.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

class_names=("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test, display_labels=class_names)

The preceding shows the misclassification of the test data.  But how did the SVC do on the training data?

Let's find out&hellip;

In [None]:
from sklearn import metrics

print("Results for the training set.")
y_pred = svc.predict(X_train)

# Accuracy, precision, recall, and f-score.
print(metrics.classification_report(y_train, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# The confusion matrix for the training data.
y_pred = svc.predict(X_train)

cm = confusion_matrix(y_train, y_pred)

class_names=("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(svc, X_train, y_train, display_labels=class_names)