# L9a: Linear models for classification tasks
In this lecture, let's look at another use case for linear models: classification tasks. Linear regression can be adapted for classification tasks by transforming the continuous output of the linear regression model into a class designation.

> __Learning Objectives__
>
> By the end of this lecture, students will be able to:
> - **Adapt linear models for binary classification** by understanding how continuous regression outputs can be transformed into discrete class labels through activation functions like the sign function, enabling the application of linear techniques to classification problems.
> - **Implement and analyze the Perceptron algorithm** for learning linear decision boundaries through incremental weight updates, understanding convergence guarantees for linearly separable data and recognizing when datasets require alternative approaches.
> - **Evaluate classifier performance using confusion matrices** to compute key metrics (accuracy, precision, recall, specificity) and interpret trade-offs between different types of classification errors in real-world applications.

Let's get started!
___

## Examples
Today, we will use the following example to illustrate key concepts:

> [▶ Let's build a Perceptron classifier](CHEME-5800-L9a-Example-PerceptronClassifier-Fall-2025.ipynb). In this example, students will implement the Perceptron algorithm to classify data points into two categories based on their features. This will help us understand how to apply linear models for classification tasks.

___

<div>
    <center>
        <img src="figs/Fig-LinearlySeparable-Schematic.svg" width="480"/>
    </center>
</div>

## Binary Classification Problem
Linear regression can be adapted for classification tasks by passing the continuous output of the linear model through an __activation function__ $\sigma:\mathbb{R}\rightarrow{\mathbb{R}}$. The activation maps the output in one of two ways: either directly to a class designation or to the probability of a label.

Let's examine two examples of binary classification strategies:

* [The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) is a simple yet powerful algorithm used in machine learning for binary classification tasks. It operates by _incrementally_ learning a linear decision boundary (linear regression model) that separates two classes based on input features. The Perceptron directly maps the continuous output to a class such as $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$. In the case of the Perceptron, we use $\sigma(\star) = \text{sign}(\star)$.
* [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#) is a statistical method used in machine learning for binary classification tasks that uses the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) as the transformation function. Applying the logistic function transforms the output of a linear regression model into a probability, enabling effective decision-making in various applications. We'll consider logistic regression in the next module.
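
To make the two strategies concrete, here is a minimal sketch that applies both activation functions to the same scores. The `scores` list is a made-up stand-in for linear-model outputs, the function names are illustrative, and the convention $\text{sign}(0) = +1$ is an assumption chosen to match the definition used for the Perceptron.

```python
import math

def sign_activation(z: float) -> int:
    """Perceptron-style activation: map a real score directly to a label in {-1, +1}."""
    return 1 if z >= 0 else -1

def logistic_activation(z: float) -> float:
    """Logistic function: map a real score to a value in (0, 1), read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear-model outputs (not taken from any dataset in this lecture)
scores = [-2.0, -0.5, 0.0, 0.5, 2.0]

labels = [sign_activation(z) for z in scores]             # hard class labels
probabilities = [logistic_activation(z) for z in scores]  # soft scores in (0, 1)

for z, label, prob in zip(scores, labels, probabilities):
    print(f"z = {z:+.1f}  sign -> {label:+d}  logistic -> {prob:.3f}")
```

The sign function discards the magnitude of the score, while the logistic function keeps it as a graded value between 0 and 1.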

___

## Perceptron
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes the (scalar) output of a linear regression model $z_{i}\in\mathbb{R}$ and transforms it, using an activation function, to one of a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case.

Suppose there exists a dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert (i.e., a human) to be in a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes in $\mathcal{D}$ by repeatedly processing the data. During each pass through the dataset, the regression parameter vector $\mathbf{\theta}$ is updated until it makes no more than a specified number of mistakes. 

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \texttt{sign}\left(\hat{\mathbf{x}}_{i}^{\top}\;\mathbf{\theta}\right)
\end{equation*}
$$
where $\mathbf{\theta}=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, with $w_{j}\in\mathbb{R}$ corresponding to the importance of feature $j$ and $b\in\mathbb{R}$ being a bias parameter. The features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p = m+1$-dimensional (row) vectors (features augmented with the bias term), and $\texttt{sign}(z)$ is defined as:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$
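
The prediction rule above can be sketched in a few lines of plain Python. The helper names (`augment`, `predict`) and the parameter values in `theta` are hypothetical, chosen only to illustrate the augmented feature vector and the $\texttt{sign}$ convention at $z = 0$.

```python
def sign(z: float) -> int:
    """sign(z) as defined above: +1 when z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def augment(x):
    """Append a constant 1 to the m features so the bias b is handled as a weight."""
    return list(x) + [1.0]

def predict(theta, x):
    """Estimated label: sign of the inner product of the augmented features and theta."""
    x_hat = augment(x)
    score = sum(xj * tj for xj, tj in zip(x_hat, theta))
    return sign(score)

# Hypothetical parameters theta = (w1, w2, b) for m = 2 features
theta = [1.0, -1.0, 0.5]

y_hat_a = predict(theta, [2.0, 1.0])  # score = 2.0 - 1.0 + 0.5 = 1.5
y_hat_b = predict(theta, [0.0, 3.0])  # score = 0.0 - 3.0 + 0.5 = -2.5
print(y_hat_a, y_hat_b)
```

Folding the bias into the parameter vector means a single inner product covers both the weights and the offset.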

### Classical: Online Perceptron Training
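__Convergence__: If the dataset $\mathcal{D}$ is linearly separable, the Perceptron is guaranteed to _incrementally_ learn a separating hyperplane after a finite number of updates, and therefore in a finite number of passes through the dataset $\mathcal{D}$. However, if the dataset $\mathcal{D}$ is __not__ linearly separable, no parameter vector classifies every example correctly, so the Perceptron may not converge and the training loop needs an explicit stopping rule.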

Let's examine the pseudocode for the Perceptron learning algorithm. 

__Initialize__: Given a linearly separable dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$, the maximum number of iterations $T$, and the maximum number of mistakes $M$ (e.g., $M=1$), initialize the parameter vector $\mathbf{\theta} = \left(\mathbf{w}, b\right)$ to small random values and set the loop counter $t\gets{0}$.

> **Rule of thumb for $T$**: Set $T = 10n$ to $100n$, where $n$ is the number of training examples. The algorithm often converges faster for linearly separable data.

While $\texttt{true}$ __do__:
1. Initialize the number of mistakes $\texttt{mistakes} = 0$.
2. For each training example $(\mathbf{x}, y) \in \mathcal{D}$: form the augmented feature vector $\hat{\mathbf{x}} = \left(\mathbf{x}, 1\right)$ and check whether $y\;\left(\mathbf{\theta}^{\top}\;\hat{\mathbf{x}}\right)\leq{0}$.
    - If this condition is $\texttt{true}$: the training example $(\mathbf{x}, y)$ is __misclassified__ (the sign of the prediction doesn't match the true label $y$, or the example lies exactly on the decision boundary). Update the parameter vector $\mathbf{\theta} \gets \mathbf{\theta} + y\;\hat{\mathbf{x}}$ and increment the error counter $\texttt{mistakes} \gets \texttt{mistakes} + 1$.
3. After processing all training examples, if $\texttt{mistakes} \leq {M}$ or $t \geq T$, break the loop. Otherwise, increment the loop counter $t \gets t + 1$ and repeat from step 1.
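
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the example notebook: the function name `train_perceptron`, the defaults `T=100` and `M=0`, the initialization range, and the toy dataset are all assumptions made for this sketch.

```python
import random

def margin(theta, x, y):
    """y * (theta^T x_hat), where x_hat is x augmented with a trailing 1 for the bias."""
    x_hat = list(x) + [1.0]
    return y * sum(tj * xj for tj, xj in zip(theta, x_hat))

def train_perceptron(data, T=100, M=0, seed=0):
    """data: list of (x, y) pairs, x a list of m floats and y in {-1, +1}."""
    rng = random.Random(seed)
    m = len(data[0][0])
    # Initialize theta = (w, b) to small random values (the range is an assumption)
    theta = [rng.uniform(-0.01, 0.01) for _ in range(m + 1)]
    t = 0
    while True:
        mistakes = 0                              # step 1
        for x, y in data:                         # step 2
            if margin(theta, x, y) <= 0:          # misclassified (or on the boundary)
                x_hat = list(x) + [1.0]
                theta = [tj + y * xj for tj, xj in zip(theta, x_hat)]
                mistakes += 1
        if mistakes <= M or t >= T:               # step 3
            break
        t += 1
    return theta, mistakes, t

# Hypothetical, linearly separable toy data: the line x1 + x2 = 2 separates the classes
toy_data = [
    ([2.0, 2.0], 1), ([3.0, 3.0], 1), ([2.0, 3.0], 1),
    ([0.0, 0.0], -1), ([-1.0, 0.0], -1), ([0.0, -1.0], -1),
]

theta, mistakes, passes = train_perceptron(toy_data)
margins = [margin(theta, x, y) for x, y in toy_data]
print("theta:", theta, "mistakes in final pass:", mistakes, "loop counter:", passes)
```

Because the toy data are separable with a comfortable margin, the loop exits through the `mistakes <= M` condition well before the cap `T` is reached; on non-separable data the cap is what ends the loop.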

Traditionally, we want to learn the perceptron parameters $\mathbf{\theta}\in\mathbb{R}^{m+1}$ such that the number of mistakes is minimized, i.e., $M = 0$ in the best case. However, zero mistakes may not be reached within $T$ passes when the classes are only weakly (barely) linearly separable, and it is impossible for data that are not linearly separable.

### Modern: Nonlinear Optimization
In the next module, we'll revisit the perceptron training algorithm through a modern lens using optimization techniques that minimize a nonlinear loss function measuring the distance between the predicted and true labels. 

However, the classical online perceptron training algorithm is a good starting point for understanding the basic concepts of linear (binary) classification.
___

## Confusion matrix for binary classification
A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) is a table used to evaluate the performance of a binary classification model. It compares the predicted class labels to the true class labels, providing a detailed breakdown of correct and incorrect predictions.

> __Confusion Matrix Structure__
>
> The confusion matrix for a binary classifier is a $2\times{2}$ matrix that looks like this:
>
>|                     | **Predicted Positive** | **Predicted Negative** |
>|---------------------|------------------------|------------------------|
>| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
>| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |
>
> * **True Positive (TP)**: The model correctly predicts positive when the actual class is positive. Example: correctly diagnosing a patient who actually has a disease.
> * **False Positive (FP)**: The model incorrectly predicts positive when the actual class is negative. This is a "false alarm", like telling a healthy patient they have a disease when they don't.
> * **False Negative (FN)**: The model incorrectly predicts negative when the actual class is positive. This means missing a true case, like failing to diagnose a patient who actually has the disease.
> * **True Negative (TN)**: The model correctly predicts negative when the actual class is negative. Example: correctly identifying that a healthy patient is indeed healthy.
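
As a small illustration of the table above, the following sketch tallies the four outcomes from a list of true labels and a list of predicted labels. The function name `confusion_counts` and the label lists are made up for this example; labels follow the $\{-1,+1\}$ convention used for the Perceptron, with $+1$ taken as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary classifier; `positive` marks the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # correctly predicted positive
        elif actual != positive and predicted == positive:
            fp += 1   # false alarm
        elif actual == positive and predicted != positive:
            fn += 1   # missed positive
        else:
            tn += 1   # correctly predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]

counts = confusion_counts(y_true, y_pred)

# Same layout as the table: rows = actual (positive, negative), columns = predicted
matrix = [[counts["TP"], counts["FN"]],
          [counts["FP"], counts["TN"]]]
print(matrix)  # [[3, 1], [1, 3]]
```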

### Why is the confusion matrix important?
The confusion matrix provides a comprehensive view of classifier performance and enables the calculation of key metrics:

* **Accuracy** $\frac{TP + TN}{TP + TN + FP + FN}$: What fraction of all predictions were correct? This metric tells us the overall correctness of the model across both positive and negative classes.
* **Precision** $\frac{TP}{TP + FP}$: Of all positive predictions made by the model, how many were actually correct? High precision means fewer false alarms.
* **Recall (Sensitivity)** $\frac{TP}{TP + FN}$: Of all actual positive cases, how many did the model successfully identify? High recall means fewer missed positive cases.
* **Specificity** $\frac{TN}{TN + FP}$: Of all actual negative cases, how many did the model correctly identify as negative? High specificity means fewer false positives.
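
The four formulas above translate directly into code. In this sketch the function name `classification_metrics` and the counts passed to it are hypothetical, and returning `nan` for an empty denominator is one possible design choice rather than a standard.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and specificity from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against an empty denominator, e.g., no positive predictions at all
        return numerator / denominator if denominator > 0 else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical counts for 100 test examples
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.800) is lower than recall (about 0.889), meaning the classifier raises false alarms more often than it misses positive cases.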

> **Key insight**: Different applications require different trade-offs. In medical diagnosis, high recall (avoiding false negatives) might be more important than high precision, as missing a disease can be more costly than a false alarm.

By analyzing each quadrant, you can understand the types of errors your model makes and make informed decisions about model improvements, threshold adjustments, or cost-sensitive learning approaches.

Let's take a look at an example to illustrate the Perceptron algorithm and the confusion matrix in action.

> __Example__
> 
> [▶ Let's build a Perceptron classifier](CHEME-5800-L9a-Example-PerceptronClassifier-Fall-2025.ipynb). In this example, students will implement the Perceptron algorithm to classify data points into two categories based on their features. This will help us understand how to apply linear models for classification tasks.

___

## Lab
In lab `L9b`, you will implement the Perceptron algorithm to classify data points into two categories based on their features. This hands-on exercise will help you understand how to apply linear models for classification tasks and evaluate their performance using a confusion matrix.

## Summary

In this notebook, we've explored how linear models can be extended from regression to binary classification tasks:

> __Key takeaways:__
>
> 1. **Linear models for classification through activation functions**: By applying transformation functions $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$ to continuous linear regression outputs, we can perform binary classification. The Perceptron uses $\sigma(\star) = \text{sign}(\star)$ to directly map predictions to class labels, while other methods like logistic regression compute the probability of a label given the features.
> 2. **Perceptron learning algorithm and convergence**: The classical online Perceptron incrementally learns decision boundaries through the update rule $\mathbf{\theta} \gets \mathbf{\theta} + y\;\hat{\mathbf{x}}$, applied whenever an example is misclassified. For linearly separable data, convergence to a separating hyperplane is guaranteed in a finite number of iterations, but non-separable data requires modern optimization approaches that minimize nonlinear loss functions.
> 3. **Performance evaluation through confusion matrices**: The $2\times{2}$ confusion matrix reveals all four prediction outcomes (TP, FP, TN, FN), enabling calculation of accuracy $\frac{TP+TN}{\text{total}}$, precision $\frac{TP}{TP+FP}$, recall $\frac{TP}{TP+FN}$, and specificity $\frac{TN}{TN+FP}$. Understanding these metrics allows informed decisions about model performance trade-offs specific to application requirements.

These foundational concepts in binary classification set the stage for exploring more sophisticated probabilistic approaches and optimization techniques in subsequent modules.

___