<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
</div>

<h1 style="text-align:center;">CSCI 416-01/CSCI 516-01: Fundamentals of AI/ML, Fall 2025</h1>
<h1 style="text-align:center;">Linear classifiers</h1>

# Contents

* [Linear classifiers](#Linear-classifiers)
* [Bayes theorem](#Bayes-theorem)
* [Discriminant analysis](#Discriminant-analysis)
  * [Linear discriminant analysis](#Linear-discriminant-analysis)
    * [Training an LDA model](#Training-an-LDA-model)
  * [Quadratic discriminant analysis](#Quadratic-discriminant-analysis)
* [Logistic regression](#Logistic-regression)
  * [Binary logistic regression](#Binary-logistic-regression)
  * [Maximum likelihood estimation](#Maximum-likelihood-estimation)
* [Fisher's iris data set](#Fisher's-iris-data-set)

# Linear classifiers

$$
  \newcommand{\cprob}[2]{P(#1 \,|\, #2)}
  \newcommand{\R}{\mathbb{R}}
  \newcommand{\Rn}{\R^{n}}
$$

In **linear classification** the decision boundaries are hyperplanes.

In $n$ dimensions, a hyperplane has the form
$$
  \{x \in \Rn | w^{T}x = w_{1}x_{1} + \cdots + w_{n}x_{n} = b\}
$$
for some $w \in \Rn$ and $b \in \R$.

In $\R^{2}$ a hyperplane is a line, and in $\R^{3}$ it is a plane.
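
As a minimal sketch of how a hyperplane acts as a decision boundary, the following assigns a point to one of two sides according to the sign of $w^{T}x - b$. The function name `side_of_hyperplane` and the example values of `w` and `b` are illustrative assumptions, not part of the course material.

```python
def side_of_hyperplane(w, b, x):
    """Return +1 if w^T x > b, -1 if w^T x < b, and 0 if x lies on the hyperplane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    if s > 0:
        return 1
    if s < 0:
        return -1
    return 0

# Hypothetical example in R^2: the line x1 + x2 = 1.
w = [1.0, 1.0]
b = 1.0
print(side_of_hyperplane(w, b, [2.0, 2.0]))  # point on the positive side
print(side_of_hyperplane(w, b, [0.0, 0.0]))  # point on the negative side
print(side_of_hyperplane(w, b, [0.5, 0.5]))  # point on the line itself
```

A two-class linear classifier amounts to a choice of $w$ and $b$ together with a rule of this kind.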

Examples of linear classifiers:
* linear discriminant analysis (LDA),
* logistic regression,
* support vector machines (SVM).


# Bayes theorem

**Bayes Theorem.**  Let $E$ and $A$ be two events in a probability space.  Then
$$
  \cprob{E}{A} = \frac{\cprob{A}{E}\; P(E)}{P(A)}
$$

A spam filter is a two-class classifier: mail is either spam (bad) or ham (good).

Let $E$ denote that an email is spam, while $A$ is, say, the presence of the word "lottery" in an email.

For the purposes of Bayes theorem,
* $\cprob{A}{E}$ is called the **likelihood**. It is the probability that the feature $A$ appears given that the email is spam.
* $P(E)$ is the probability that an email is spam.  In this context it is called the **prior**, as it is what we know prior to knowing that feature $A$ is present in an email.
* The quantity $P(A)$ is the probability that $A$ appears.  It normalizes the right-hand side so that the quotient is between $0$ and $1$.

We are interested in $\cprob{E}{A}$, the **posterior probability**.  It is the probability of $E$ (spam) after taking the relevant information that $A$ is present into account.  We adjust our prior probability $P(E)$ in light of the presence of $A$.

## Example

Suppose there is a disease that strikes 1 person in 100,000.  We have a test for the disease:
* If you have the disease, you test positive 99% of the time.
* If you do not have the disease, you incorrectly test positive 1% of the time.

Question: if you test positive, what is the probability you have the disease?

Let
* P(sick | positive) be the probability you are sick if you test positive;
* P(positive) be the probability you test positive;
* P(positive | sick) and P(positive | not sick) be the probabilities you test positive if you are sick or are not sick;
* P(sick) and P(not sick) be the probabilities you are or are not sick.

The positive results consist of those who are sick and test positive, and those who are not sick and test
positive:
$$
  \mbox{P(positive)} = \mbox{P(positive | sick) $\times$ P(sick) + P(positive | not sick) $\times$ P(not sick)}.
$$
Bayes theorem says
$$
  \mbox{P(sick | positive)}
  = \frac{\mbox{P(positive | sick) $\times$ P(sick)}}{\mbox{P(positive)}}
  = \frac{0.99 \times 0.00001}{0.99 \times 0.00001 + 0.01 \times 0.99999}
  \approx 0.001,
$$
or about one chance in 1,000.  Even if you test positive, it is not very likely that you have the disease.
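
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
# Numbers taken from the disease-testing example.
p_sick = 1 / 100_000
p_not_sick = 1 - p_sick
p_pos_given_sick = 0.99
p_pos_given_not_sick = 0.01

# Law of total probability for P(positive).
p_pos = p_pos_given_sick * p_sick + p_pos_given_not_sick * p_not_sick

# Bayes theorem for P(sick | positive).
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(f"P(sick | positive) = {p_sick_given_pos:.6f}")
```

The posterior is small because the prior P(sick) is tiny: false positives from the large healthy population far outnumber true positives.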

## A generalization of Bayes theorem

$$
  \newcommand{\cprob}[2]{P(#1 \,|\, #2)}
$$
  
Let $E_{1}, \ldots, E_{n}$ be mutually exclusive events that partition the probability space $S$.  Then
$$
\cprob{E_{i}}{A} = \frac{\cprob{A}{E_{i}} P(E_{i})}{\sum_{k=1}^{n} \cprob{A}{E_{k}} P(E_{k})}.
$$
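
A minimal sketch of the generalized formula: given the likelihoods $\cprob{A}{E_{k}}$ and priors $P(E_{k})$ as lists, the posteriors are the normalized products. The function name `posteriors` and the three-event numbers are hypothetical.

```python
def posteriors(likelihoods, priors):
    """Return P(E_i | A) for each i, given P(A | E_i) and P(E_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)  # P(A), by the law of total probability
    return [j / total for j in joint]

# Hypothetical partition into three events.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.8]
post = posteriors(likelihoods, priors)
print(post)
print("Most probable event:", post.index(max(post)) + 1)
```

Note that the event with the largest prior need not have the largest posterior once the likelihoods are taken into account.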


# Discriminant analysis

In the classification setting, let $E_{i}$ be the event that $x$ belongs to class $i$.

The generalized Bayes theorem tells us that
$$
  \cprob{E_{i}}{X = x} = \frac{\cprob{X = x}{E_{i}} P(E_{i})}{\sum_{k=1}^{n} \cprob{X = x}{E_{k}} P(E_{k})}.
$$
The term $\cprob{E_{i}}{X = x}$ is the conditional probability that the class is $i$ given the input $x$.

The $\cprob{X = x}{E_{k}}$ terms are the class conditional probability densities of $X$ in class $k$.

The Bayes optimal classifier assigns $x$ to the class $i$ for which $\cprob{E_{i}}{X = x}$ is largest.  Since the denominator in the preceding equation is the same for every class, this is the same as choosing the class with the largest value of
$$
  \cprob{X = x}{E_{i}} P(E_{i}).
$$

In linear and quadratic discriminant analysis we assume a particular statistical structure for our data.  In both cases we assume that the class densities are multivariate Gaussians with means $\mu_{k}$ and covariance matrices $\Sigma_{k}$, where $p$ denotes the dimension of $x$:
$$
\newcommand{\abs}[1]{| #1 |}
\newcommand{\half}{\frac{1}{2}}
\cprob{X = x}{E_{k}} = \frac{1}{(2\pi)^{p/2} \abs{\Sigma_{k}}^{1/2}}
e^{-\half (x - \mu_{k})^{T} \Sigma_{k}^{-1} (x - \mu_{k})}.
$$

## Linear discriminant analysis

In linear discriminant analysis (LDA) we assume all of our classes have the same covariance matrix: $\Sigma_{k} = \Sigma$ for all $k$.

Under this assumption it turns out that choosing the class with the largest value of $\cprob{X = x}{E_{i}} P(E_{i})$ is the same as choosing the class with the largest value of 
$$
  d_{i}(x) = x^{T} \Sigma^{-1} \mu_{i} - \half \mu_{i}^{T} \Sigma^{-1} \mu_{i} + \log P(E_{i}).
$$
This decision function is linear in $x$.
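
The step from the Gaussian densities to $d_{i}(x)$ can be sketched as follows, assuming the shared covariance matrix $\Sigma$ is symmetric and invertible.  Taking the logarithm of $\cprob{X = x}{E_{i}} P(E_{i})$ and expanding the quadratic form gives
\begin{align*}
  \log\left(\cprob{X = x}{E_{i}} P(E_{i})\right)
  &= -\frac{p}{2} \log (2\pi) - \half \log \abs{\Sigma} - \half (x - \mu_{i})^{T} \Sigma^{-1} (x - \mu_{i}) + \log P(E_{i}) \\
  &= \underbrace{-\frac{p}{2} \log (2\pi) - \half \log \abs{\Sigma} - \half x^{T} \Sigma^{-1} x}_{\mbox{the same for every class}}
     + x^{T} \Sigma^{-1} \mu_{i} - \half \mu_{i}^{T} \Sigma^{-1} \mu_{i} + \log P(E_{i}).
\end{align*}
The logarithm is increasing, so it does not change which class attains the maximum, and dropping the terms shared by all classes leaves exactly $d_{i}(x)$.  The quadratic term $x^{T} \Sigma^{-1} x$ cancels only because every class uses the same $\Sigma$.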

The decision boundary between classes $i$ and $j$ is the set of $x$ satisfying $d_{i}(x) = d_{j}(x)$, that is,
$$
  (\Sigma^{-1} (\mu_{i} - \mu_{j}))^{T} x =
  \half \mu_{i}^{T} \Sigma^{-1} \mu_{i} - \log P(E_{i}) -
  \half \mu_{j}^{T} \Sigma^{-1} \mu_{j} + \log P(E_{j}),
$$
which is a hyperplane.

### Training an LDA model

There is no optimization involved in building an LDA model.  Instead, we use our sample (training data) to estimate the quantities involved.

Let
* $K$ be the number of classes with associated labels $1, 2, \ldots, K$,
* $N_{k}$ the number of instances of class $k$,
* $N$ the total number of instances, and 
* $C_{k}$ be the set of instances $x$ with label $k$.

Then
\begin{align*}
  P(E_{i}) &= N_{i}/N \\
  \mu_{i} &= \sum_{x \in C_{i}} x/N_{i} \\
  \Sigma &= \sum_{k=1}^{K} \sum_{x \in C_{k}} (x - \mu_{k})(x - \mu_{k})^{T}/(N-K).
\end{align*}
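
The estimates above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not scikit-learn's implementation: the function names `fit_lda` and `lda_predict` and the synthetic two-class data are hypothetical, and the covariance is the pooled within-class estimate with divisor $N - K$.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means, and the pooled covariance from labeled data."""
    classes = np.unique(y)
    N, K = X.shape[0], len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # one row per class
    Sigma = np.zeros((X.shape[1], X.shape[1]))
    for k, mu in zip(classes, means):
        D = X[y == k] - mu
        Sigma += D.T @ D          # sum of (x - mu_k)(x - mu_k)^T over class k
    Sigma /= (N - K)
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """Assign each row of X to the class with the largest linear discriminant."""
    A = np.linalg.solve(Sigma, means.T)                  # columns are Sigma^{-1} mu_i
    const = -0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    scores = X @ A + const                               # scores[n, i] = d_i(x_n)
    return classes[np.argmax(scores, axis=1)]

# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

params = fit_lda(X, y)
accuracy = np.mean(lda_predict(X, *params) == y)
print(f"Training accuracy: {accuracy:.2f}")
```

Solving the linear system with `np.linalg.solve` avoids forming $\Sigma^{-1}$ explicitly, which is generally the more stable choice.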

## Quadratic discriminant analysis

In quadratic discriminant analysis (QDA) we allow a different covariance matrix $\Sigma_{k}$ for each class.  This leads to the decision functions
$$
  d_{i}(x) = - \half \log\;\abs{\Sigma_{i}}  - \half (x - \mu_{i})^{T} \Sigma_{i}^{-1} (x - \mu_{i}) + \log P(E_{i}).
$$
The decision boundaries are now quadratic surfaces.
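
As a sketch, a QDA score can be computed as the logarithm of the Gaussian class density times the prior, dropping the constant shared by all classes: $-\frac{1}{2} \log |\Sigma_{i}| - \frac{1}{2} (x - \mu_{i})^{T} \Sigma_{i}^{-1} (x - \mu_{i}) + \log P(E_{i})$.  The function name `qda_score` and the parameter values below are hypothetical.

```python
import numpy as np

def qda_score(x, mu, Sigma, prior):
    """Log of the Gaussian class density times the prior, up to a shared constant."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)       # log |Sigma|, computed stably
    maha = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * logdet - 0.5 * maha + np.log(prior)

# Hypothetical parameters for two classes in R^2 with different covariances.
mu = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigma = [np.eye(2), np.array([[4.0, 0.0], [0.0, 0.25]])]
prior = [0.5, 0.5]

x = np.array([0.5, 0.2])
scores = [qda_score(x, mu[k], Sigma[k], prior[k]) for k in range(2)]
print("Predicted class:", int(np.argmax(scores)))
```

Because each class keeps its own $\Sigma_{i}$, the quadratic term in $x$ no longer cancels between classes, which is why the boundaries are curved.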

# Logistic regression

Logistic regression computes probabilities of class membership.

## Binary logistic regression

Two classes: $C_{0}$ and $C_{1}$. 

Let $p(x)$ be the probability that $x$ belongs to class $C_{0}$.

Since $p$ is a probability, we must have
$$
  0 \leq p(x) \leq 1.
$$

In order to model $p(x)$ we need a function that ranges between $0$ and $1$.

One such function is the **logistic function** or **sigmoidal function**:
$$
  \sigma(z) = \frac{1}{1 + e^{-z}},
$$
which maps $\R$ to $(0,1)$.

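In logistic regression we choose $w_{0}, w_{1}, \ldots, w_{n}$ and model the probability as
$$
  p(x) = \frac{1}{1 + e^{w_{0} + w_{1}x_{1} + \cdots + w_{n}x_{n}}} = \sigma(-(w_{0} + w_{1}x_{1} + \cdots + w_{n}x_{n})).
$$
Equivalently, $1 - p(x) = \sigma(w_{0} + w_{1}x_{1} + \cdots + w_{n}x_{n})$ is the modeled probability that $x$ belongs to class $C_{1}$.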
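
A minimal sketch of the logistic function and the model $p(x)$, using only the standard library. The names `sigmoid` and `p_class0` are illustrative; the branch on the sign of $z$ is one common way to avoid overflow in `math.exp` and is an implementation choice made here, not something prescribed by the notes.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def p_class0(x, w0, w):
    """Model probability p(x) = 1 / (1 + e^{w0 + w^T x}) = sigmoid(-(w0 + w^T x))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(-z)

# Hypothetical weights and input.
print(sigmoid(0.0))                            # 0.5
print(p_class0([1.0, 2.0], 0.0, [0.0, 0.0]))   # 0.5, since the exponent is zero
print(p_class0([1.0, 2.0], 0.0, [1.0, 1.0]))   # below 0.5, since the exponent is positive
```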

## Maximum likelihood estimation

One approach to the training problem for logistic regression is **maximum likelihood estimation** (MLE).

Suppose that when we drop a piece of buttered toast, we have
\begin{align*}
  p(\mbox{toast lands buttered side down}) &= \theta, \\
  p(\mbox{toast lands buttered side up})   &= 1-\theta.
\end{align*}

This is a probability model that depends on the parameter $\theta$.  Now suppose we drop 10 pieces of toast
and observe the sequence
<blockquote>
  up, down, down, down, up, up, down, up, down, down.
</blockquote>
Since 6 pieces land buttered side down, and 4 pieces land buttered side up, we surmise
$$
  \theta = \frac{6}{6+4} = 0.6.
$$

If our experiment were described by $p(\mbox{down}) = \theta$, $p(\mbox{up}) = 1-\theta$, then the
**likelihood** we would see that particular sequence of results is
$$
  L(\theta) = (1-\theta) \theta \theta \theta (1-\theta) (1-\theta) \theta (1-\theta) \theta \theta
  = \theta^{6} (1-\theta)^{4}.
$$
This is the probability of observing this particular sequence of outcomes provided that $\theta$ was the probability that toast lands buttered side down.

For which $\theta$ is the outcome we saw most likely?  That is, what is the $\theta$ that gives us the maximum likelihood?

Since $\theta$ is a probability, we know that $0 \leq \theta \leq 1$, so we are asking about the solution of
$$
  \begin{array}{ll}
    \mbox{maximize} & L(\theta) \\
    \mbox{subject to} & 0 \leq \theta \leq 1.
  \end{array}
$$
In this case we have
$$
  L'(\theta) = 6 \theta^{5} (1-\theta)^{4} - 4 \theta^{6} (1-\theta)^{3}
  = \theta^{5} (1-\theta)^{3} (6 (1-\theta) - 4 \theta).
$$
Thus, $L'(\theta) = 0$ when $\theta = 0, 1$, and
$$
  6 (1-\theta) - 4 \theta = 6 - 10 \theta = 0,
$$
or $\theta = 0.6$.

Inspecting these candidates, we find that $\theta = 0.6$ yields the maximum likelihood, which is consistent with
our previous reasoning.
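
The calculus above can be cross-checked numerically. The following sketch evaluates $L(\theta) = \theta^{6}(1-\theta)^{4}$ on a grid and picks the maximizer; the grid resolution of 1001 points is an arbitrary choice made here.

```python
def likelihood(theta, downs=6, ups=4):
    """L(theta) = theta^downs * (1 - theta)^ups."""
    return theta**downs * (1 - theta)**ups

# Evaluate L on 1001 equally spaced points in [0, 1] and pick the maximizer.
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)
```

A grid search is only practical for a single parameter; for logistic regression the likelihood is maximized with an iterative optimizer instead.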

The general idea in MLE is that we have observations that are presumed to be drawn from a probability distribution parameterized by unknown parameters $\theta$.  We then choose the $\theta$ that maximizes the likelihood of our observations, assuming they were drawn from the distribution with parameter values $\theta$.

Returning to binary logistic regression, the following function, which is defined in terms of the training data $x_{i}$ and the associated class labels, is the **likelihood function**:
$$
  L(w_{0}, w_{1}, \ldots, w_{n}) = \prod_{x_{i} \in C_{0}} p(x_{i}) \prod_{x_{i} \in C_{1}} (1-p(x_{i})).
$$
The goal in MLE is to choose model parameters to maximize this function.

To avoid problems like numerical underflow to zero, it is customary to take the logarithm of the likelihood function (the so-called **log-likelihood**).  For binary logistic regression, writing $w = (w_{1}, \ldots, w_{n})$, the log-likelihood is
$$
  \ell(w_{0}, w) = \log L(w_{0}, w)
  = \sum_{x_{i} \in C_{0}} \log p(x_{i}) + \sum_{x_{i} \in C_{1}} \log (1-p(x_{i})).
$$
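
A sketch of the log-likelihood in NumPy, following the convention used here that $p(x) = 1/(1 + e^{z})$ with $z = w_{0} + w^{T}x$ is the probability of class $C_{0}$. The function name `log_likelihood` and the four-point data set are hypothetical. `np.logaddexp(0, z)` computes $\log(1 + e^{z})$ without overflow.

```python
import numpy as np

def log_likelihood(w0, w, X, y):
    """Log-likelihood with p(x) = 1 / (1 + e^{z}), z = w0 + w^T x, labels y in {0, 1}."""
    z = w0 + X @ w
    log_p = -np.logaddexp(0.0, z)       # log p(x)       = -log(1 + e^{z})
    log_1mp = -np.logaddexp(0.0, -z)    # log (1 - p(x)) = -log(1 + e^{-z})
    return np.sum(np.where(y == 0, log_p, log_1mp))

# Tiny hypothetical data set: class 0 at negative x, class 1 at positive x.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

ll_flat = log_likelihood(0.0, np.array([0.0]), X, y)  # every probability is 1/2
ll_fit = log_likelihood(0.0, np.array([1.0]), X, y)   # weights that separate the classes
print(ll_flat, ll_fit)
```

Training amounts to searching for the $w_{0}$ and $w$ that make this value as large as possible.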

# Fisher's iris data set

We will use Fisher's iris data set again to illustrate these linear classifiers.  This is only fitting as Fisher introduced the iris data set in the same paper where he presented linear discriminant analysis.

The data set consists of 50 samples from each of three species of iris, 
* [*I. setosa*](https://en.wikipedia.org/wiki/Iris_setosa),
* [*I. versicolor*](https://en.wikipedia.org/wiki/Iris_versicolor), and 
* [*I. virginica*](https://en.wikipedia.org/wiki/Iris_virginica).
  
for a total of 150 instances.

Four **features** are measured for each sample: 
1. the length of the sepal,
2. the width of the sepal,
3. the length of the petal, and
4. the width of the petal.

The measurements are in centimeters.

The **class labels** are
1. Iris setosa,
2. Iris versicolor,
3. Iris virginica.

In [None]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

print("Class labels:", np.unique(y))
print("Class names:", iris.target_names)

In [None]:
print(iris.feature_names)
print(X[0:10,:])  # Just the first 10 rows.

We will use only the petal length and width so we will be able to plot the decision regions:

In [None]:
X = X[:,2:]

## Training and test sets

Split the data into 70% training and 30% test data:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

## Construct the classifier

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()
lr = LogisticRegression()

lda.fit(X_train, y_train)
qda.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Set clf to be the classifier we wish to study.
clf = qda

## Plot the decision regions

In [None]:
# A hacked up version of https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html.

import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Parameters
num_classes = 3
plot_colors = "ryb"

# Plot the decision regions.  (Calling plt.tight_layout() before any figure
# exists would create an empty extra figure, so it is omitted here.)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[2],  # petal length
    ylabel=iris.feature_names[3],  # petal width
)

# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the classifier showing the training data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

Observe that QDA produces curved decision boundaries: the boundary between two classes is a quadratic curve (a conic section in two dimensions) rather than a straight line.

## Evaluate the model

In [None]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_train)
print(f"Misclassified training samples: {(y_train != y_pred).sum()}")
print(f"Accuracy: {accuracy_score(y_train, y_pred):.2f}")

In [None]:
y_pred = clf.predict(X_test)
print(f"Misclassified test samples: {(y_test != y_pred).sum()}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
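
Rather than changing `clf` by hand, the three classifiers can be compared in a single loop. The sketch below is self-contained and repeats the loading and splitting steps under new variable names (`X_petal`, `Xtr`, `Xte`, `ytr`, `yte`, all chosen here) so that it does not overwrite the variables used elsewhere in this notebook.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris_data = datasets.load_iris()
X_petal = iris_data.data[:, 2:]  # petal length and width only
Xtr, Xte, ytr, yte = train_test_split(
    X_petal, iris_data.target, test_size=0.3, random_state=0
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Logistic regression": LogisticRegression(),
}

# Fit each model on the training split and record its accuracy on the test split.
test_accuracy = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    test_accuracy[name] = accuracy_score(yte, model.predict(Xte))
    print(f"{name}: test accuracy {test_accuracy[name]:.2f}")
```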

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

class_names = ("I. setosa", "I. versicolor", "I. virginica")

# The confusion matrix for the training data, printed as an array.
y_pred = clf.predict(X_train)
cm = confusion_matrix(y_train, y_pred)
print(cm)

# The confusion matrix for the test data, displayed as a plot.
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, display_labels=class_names)

In [None]:
# Plot the decision regions.  (Calling plt.tight_layout() before any figure
# exists would create an empty extra figure, so it is omitted here.)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[2],  # petal length
    ylabel=iris.feature_names[3],  # petal width
)

# Plot the test points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_test == i)
    plt.scatter(
        X_test[idx, 0],
        X_test[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the classifier showing the test data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")