# Linear models for prediction and classification tasks
This module introduces supervised learning, focusing on two key tasks: prediction and classification using linear regression models. 

By the end of this module, you will be able to define and demonstrate mastery of the following key concepts:
* __Linear regression__ is a statistical method for modeling the relationship between a target variable (output) and one or more features (input) by fitting a linear model to observed data. It provides a straightforward method for predicting outcomes and understanding the relationships between variables. The model is linear in the parameters, not necessarily in the features.
* __Continuous variable prediction tasks__: In machine learning, linear regression models are commonly employed for continuous variable prediction tasks. These models enable the estimation of numerical outcomes based on the (non)linear relationships identified between input features and the target variable.
* __Binary classification tasks__: While linear regression is primarily designed for continuous outcomes, it can be adapted to binary classification by applying an output function (e.g., the sign function or logistic transformation). We will introduce the Perceptron, learn how to evaluate its performance using a confusion matrix, and then motivate logistic regression as an alternative approach.

Linear models, while seemingly simple, are used all over machine learning and artificial intelligence (even in super advanced deep learning applications!). So, let's get started!
___

<div>
    <center>
        <img src="figs/Fig-LinearRegressionModel-Schematic.svg" width="580"/>
    </center>
</div>

## Linear models for continuous prediction tasks
Suppose there exists a dataset $\mathcal{D} = \left\{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\right\}$ with $n$ training (labeled) examples, where $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional vector of features (independent input variables) and $y_{i}\in\mathbb{R}$ denotes a scalar response variable (dependent variable). Then, a $\texttt{linear model}$ for the dataset $\mathcal{D}$ is given (in index-form) by:
$$
\begin{equation*}
y_{i} = \hat{\mathbf{x}}_{i}^{\top}\,\mathbf{\theta} + \epsilon_{i}\qquad{i=1,2,\dots,n}
\end{equation*}
$$
where the augmented features are $\hat{\mathbf{x}}_{i}^{\top}=\left(x_{i1},x_{i2},\dots,x_{im},1\right)$ (we've added an extra `1` to each feature vector to account for the intercept (bias) term), 
the unknown parameters are represented by the $\mathbf{\theta}\in\mathbb{R}^{p}$ vector (where $p=m+1$), and $\epsilon_{i}\in\mathbb{R}$ is the unobserved random error for response $i$, i.e., the component of the target that is _not_ explained by the linear model. 

We can rewrite the linear regression model in matrix-vector form as:
$$
\begin{equation*}
\mathbf{y} = \hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}
\end{equation*}
$$
where $\hat{\mathbf{X}}$ is an $n\times{p}$ matrix with the augmented features $\hat{\mathbf{x}}_{i}^{\top}$ on the rows, the target (output) vector $\mathbf{y}$ is an $n\times{1}$ column vector with entries $y_{i}$, and the error vector $\mathbf{\epsilon}$ is an $n\times{1}$ column vector with entries $\epsilon_{i}$. The challenge of linear regression is to estimate the unknown parameters $\mathbf{\theta}$ from the dataset $\mathcal{D}$ by minimizing an appropriate loss function, typically the sum of squared errors.

> **Key Insight**: A linear model must only be linear in the parameters, not necessarily the features. For example, we could have polynomial features in the data matrix $\hat{\mathbf{X}}$ such as $1,x,x^{2},x^{3},\dots$. This would still be a linear regression problem because the model remains linear in the parameters $\mathbf{\theta}$.
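
To make the key insight concrete, here is a minimal sketch in Python with NumPy (the language choice and the helper name `augmented_polynomial_matrix` are assumptions for illustration, not part of the module). It builds an augmented data matrix whose columns are powers of a single feature $x$, with the trailing `1` for the bias, and evaluates the model $\hat{\mathbf{X}}\,\mathbf{\theta}$ for hypothetical parameter values.

```python
import numpy as np

def augmented_polynomial_matrix(x, degree):
    """Rows are (x, x^2, ..., x^degree, 1): polynomial features plus a trailing 1 for the bias."""
    x = np.asarray(x, dtype=float)
    powers = [x**k for k in range(1, degree + 1)]
    return np.column_stack(powers + [np.ones_like(x)])

x = np.array([0.0, 1.0, 2.0, 3.0])
X_hat = augmented_polynomial_matrix(x, degree=2)   # shape (n, p) = (4, 3)
theta = np.array([2.0, -1.0, 0.5])                 # hypothetical parameters (illustration only)
y_model = X_hat @ theta                            # linear in theta, quadratic in x
```

The output `y_model` is a quadratic function of $x$, yet it is computed by a single matrix-vector product, which is why the problem remains a linear regression in $\mathbf{\theta}$.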

___

## Overdetermined data matrix without regularization
Suppose you have a data matrix $\hat{\mathbf{X}}\in\mathbb{R}^{n\times{p}}$ that is $\texttt{overdetermined}$, i.e., $n \gg p$, and that the errors $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\;\mathbf{I})$ follow [a Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with a mean of zero and variance $\sigma^{2}$. We estimate the model parameters by minimizing the sum of squared errors between the model's estimated outputs and the observed outputs:
$$
\begin{align*}
\hat{\mathbf{\theta}} = \arg\min_{\mathbf{\theta}} \frac{1}{2}\;\lVert~\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}~\rVert^{2}_{2}
\end{align*}
$$
where $\lVert\star\rVert^{2}_{2}$ is the square of the [L2 vector norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm), and $\hat{\mathbf{\theta}}\in\mathbb{R}^{p}$ is the estimated parameter vector. When $\hat{\mathbf{X}}$ has full column rank (i.e., $\text{rank}(\hat{\mathbf{X}}) = p$ and $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}$ is invertible), this problem has the analytical solution:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
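
As a sketch of how the analytical solution might be computed in practice (Python with NumPy is an assumption here, and the synthetic data and `theta_true` are made up for illustration), the example below solves the normal equations $\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\,\mathbf{\theta} = \hat{\mathbf{X}}^{\top}\mathbf{y}$ and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 2
X = rng.normal(size=(n, m))
X_hat = np.column_stack([X, np.ones(n)])            # augmented features, bias column last
theta_true = np.array([1.5, -2.0, 0.5])             # hypothetical "true" parameters
y = X_hat @ theta_true + 0.1 * rng.normal(size=n)   # noisy observations

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse explicitly.
theta_normal = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Equivalent least-squares solver, usually preferable when X_hat is ill-conditioned.
theta_lstsq, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
```

Both routes give the same estimate when $\hat{\mathbf{X}}$ has full column rank; solving the linear system avoids computing $\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}$ directly.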

We can also express the estimated parameters in terms of the true parameters and the error model $\mathbf{\epsilon}$:
$$
\begin{align*}
\hat{\mathbf{\theta}} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} \\
\hat{\mathbf{\theta}}&= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}) \\
\hat{\mathbf{\theta}} &= \underbrace{\left[\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right]}_{= \mathbf{I}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\\
\hat{\mathbf{\theta}} &= \mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\quad\blacksquare\\
\end{align*}
$$

### Key Insights:

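* **Unbiased estimator**: Because $\mathbb{E}[\mathbf{\epsilon}] = \mathbf{0}$ and $\hat{\mathbf{X}}$ is treated as fixed, taking the expectation of the last line of the derivation gives $\mathbb{E}[\hat{\mathbf{\theta}}] = \mathbf{\theta}$, so $\hat{\mathbf{\theta}}$ is an unbiased estimator of the true parameters.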
* **Random variable**: Since $\mathbf{\epsilon}$ is a random vector, $\hat{\mathbf{\theta}}$ is also a random vector with its own distribution. This uncertainty in parameter estimates is fundamental to statistical inference.
* **Connection to Bayesian inference**: This analysis shows that even in classical (frequentist) regression, parameter estimates have distributions. This provides a bridge toward [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression), where parameters are explicitly treated as random variables with prior distributions.
* **Practical implications**: The randomness in $\hat{\mathbf{\theta}}$ means we should consider confidence intervals and hypothesis tests when making inferences about the true parameters.
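
The first two insights can be checked numerically. The sketch below (Python with NumPy, an assumed choice; the design matrix, `theta_true`, and `sigma` are hypothetical) redraws the error vector many times for a fixed $\hat{\mathbf{X}}$ and recomputes $\hat{\mathbf{\theta}}$ each time. It also compares the spread of the estimates with $\sigma^{2}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\right)^{-1}$, the standard covariance result for OLS under this error model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 0.5
X_hat = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])   # fixed design matrix
theta_true = np.array([1.0, -1.0, 2.0])                          # hypothetical true parameters
A = np.linalg.solve(X_hat.T @ X_hat, X_hat.T)                    # (X^T X)^{-1} X^T, shape (p, n)

trials = 5000
eps = sigma * rng.normal(size=(trials, n))        # one error vector per trial
estimates = (X_hat @ theta_true + eps) @ A.T      # each row is one draw of theta_hat
mean_estimate = estimates.mean(axis=0)            # should be close to theta_true
empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X_hat.T @ X_hat)
```

Averaged over many noise realizations, the estimates center on `theta_true`, while any single estimate deviates from it by an amount governed by the covariance.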

___

## Regularized linear regression
In the overdetermined case, we can add a regularization term to the objective function to prevent overfitting and improve generalization. The regularized linear regression problem is given by:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\lambda} = \arg\min_{\mathbf{\theta}}\left( \frac{1}{2}\;\lVert~\mathbf{y} - \hat{\mathbf{X}}\;\mathbf{\theta}~\rVert^{2}_{2} + \frac{\lambda}{2}\;\lVert~\mathbf{\theta}~\rVert^{2}_{2}\right)
\end{align*}
$$
where $\lambda> 0$ is the regularization parameter controlling regularization strength. This is Ridge (or Tikhonov) regularization. The first term sums squared errors, the second penalizes large parameters. Its solution is given by:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\lambda} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
This solution can also be expressed in terms of the error model $\mathbf{\epsilon}$:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\lambda} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\;\mathbf{\theta} + \mathbf{\epsilon}) \\
\hat{\mathbf{\theta}}_{\lambda} &= \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon} \\
\hat{\mathbf{\theta}}_{\lambda} &= \underbrace{\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}}_{\text{Shrinkage}\;\mathbf{P}}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon} \\
\hat{\mathbf{\theta}}_{\lambda} &= \mathbf{P}\;\mathbf{\theta} + \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \lambda\;\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{\epsilon}\quad\blacksquare\\
\end{align*}
$$

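__Interesting__: The penalty term $\frac{\lambda}{2}\lVert\mathbf{\theta}\rVert^{2}_{2}$ discourages large parameter values; it enters the solution through $\lambda\;\mathbf{I}$, which shrinks the estimated parameters towards zero. Because the shrinkage matrix $\mathbf{P}\neq\mathbf{I}$ for $\lambda>0$, the Ridge estimate is biased, but the reduced sensitivity to noise helps prevent overfitting by discouraging complex models that fit the training data too closely.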

Now, the question is: how do we select the regularization parameter $\lambda$? The answer is that we can use a validation set to tune the hyperparameter $\lambda$ by evaluating the model's performance on completely unseen data.
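
One way the validation-set idea could look in code is sketched below (Python with NumPy is an assumption; `ridge_fit`, the synthetic data, the 80/40 split, and the candidate grid `lambdas` are all hypothetical). Following the formula above, the penalty is applied to every entry of $\mathbf{\theta}$, including the bias.

```python
import numpy as np

def ridge_fit(X_hat, y, lam):
    """Closed-form Ridge estimate: solve (X^T X + lam * I) theta = X^T y."""
    p = X_hat.shape[1]
    return np.linalg.solve(X_hat.T @ X_hat + lam * np.eye(p), X_hat.T @ y)

rng = np.random.default_rng(2)
n, m = 120, 5
X_hat = np.column_stack([rng.normal(size=(n, m)), np.ones(n)])
theta_true = rng.normal(size=m + 1)                  # hypothetical true parameters
y = X_hat @ theta_true + 0.5 * rng.normal(size=n)

# Hold out the last 40 examples as a validation set.
X_train, y_train = X_hat[:80], y[:80]
X_val, y_val = X_hat[80:], y[80:]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]              # candidate regularization strengths
val_errors = []
for lam in lambdas:
    theta_lam = ridge_fit(X_train, y_train, lam)
    val_errors.append(np.mean((y_val - X_val @ theta_lam) ** 2))
best_lambda = lambdas[int(np.argmin(val_errors))]

# Shrinkage: the norm of the estimate does not increase as lambda grows.
norms = [np.linalg.norm(ridge_fit(X_train, y_train, lam)) for lam in lambdas]
```

The value `best_lambda` is the candidate with the lowest mean squared error on the held-out examples; in practice the grid is often logarithmic, as here, and cross-validation is a common alternative to a single split.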
___

## Understanding the error model
The error model $\mathbf{\epsilon}$ captures the randomness in our observations that cannot be explained by the linear relationship. In linear regression, we typically assume:
$$\begin{align*}
\mathbf{\epsilon} &\sim \mathcal{N}(\mathbf{0},\sigma^{2}\;\mathbf{I})
\end{align*}$$
This means each error term $\epsilon_i$ is independent, normally distributed with mean zero and constant variance $\sigma^2$. 

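> **Why does this assumption matter?** The closed-form OLS and Ridge solutions derived earlier do not themselves require normality; they follow from minimizing the squared-error objective. What normality adds is a statistical interpretation: under Gaussian errors the OLS estimate is also the maximum likelihood estimate, and we can construct confidence intervals, perform hypothesis tests, and quantify uncertainty. Separately, by the Gauss-Markov theorem, OLS is the Best Linear Unbiased Estimator (BLUE) whenever the errors have zero mean, constant variance, and are uncorrelated, even if they are not normally distributed.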

**Reality check**: While convenient, the normality assumption may not always hold in practice. Fortunately, many results remain approximately valid even when this assumption is violated.

### Estimating the error variance
Since $\sigma^2$ is unknown, we estimate it from the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{\theta}}$:
$$\begin{align*}
\hat{\sigma}^{2} &= \frac{1}{n-p}\;\lVert~\mathbf{r}~\rVert^{2}_{2} = \frac{1}{n-p}\sum_{i=1}^{n}r_i^2
\end{align*}$$
where $n$ is the number of observations, $p$ is the number of parameters, and $r_i = y_i - \hat{\mathbf{x}}_i^{\top}\hat{\mathbf{\theta}}$ is the $i$-th residual.

> **Key insight**: We divide by $(n-p)$ instead of $n$ to account for the degrees of freedom "used up" by estimating $p$ parameters. This correction makes $\hat{\sigma}^2$ an unbiased estimator of the true variance.

### Quantifying parameter uncertainty
With our estimate of $\sigma^2$, we can quantify the uncertainty in our parameter estimates. The standard error of parameter $\hat{\theta}_j$ is:
$$
  \mathrm{SE}(\hat{\theta}_j) = \sqrt{\;\hat{\sigma}^2\;\bigl[(\hat{\mathbf{X}}^\top\hat{\mathbf{X}})^{-1}\bigr]_{jj}\,}
$$
These standard errors are essential for:
* **Confidence intervals**: $\hat{\theta}_j \pm 1.96 \cdot \mathrm{SE}(\hat{\theta}_j)$ gives an approximate 95% confidence interval
* **Hypothesis testing**: Testing whether $\theta_j = 0$ (is feature $j$ significant?)
* **Prediction intervals**: Quantifying uncertainty in new predictions
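
The variance estimate and standard errors above can be computed in a few lines. This is a minimal sketch in Python with NumPy (an assumed choice), using synthetic data in which the second feature has, by construction, no effect on the target; all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma_true = 500, 0.3
X_hat = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
theta_true = np.array([2.0, 0.0, -1.0])        # hypothetical; the second feature has no effect
y = X_hat @ theta_true + sigma_true * rng.normal(size=n)

p = X_hat.shape[1]
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
theta_hat = XtX_inv @ X_hat.T @ y
r = y - X_hat @ theta_hat                      # residuals
sigma2_hat = (r @ r) / (n - p)                 # variance estimate with n - p degrees of freedom
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # standard error of each parameter

ci_lower = theta_hat - 1.96 * se               # approximate 95% confidence interval
ci_upper = theta_hat + 1.96 * se
z_scores = theta_hat / se                      # |z| > 1.96 suggests theta_j != 0 at roughly the 5% level
```

A parameter whose interval `[ci_lower[j], ci_upper[j]]` contains zero is not distinguishable from zero at this confidence level. For small $n$, quantiles of Student's t-distribution with $n-p$ degrees of freedom are typically used in place of 1.96.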

___

## Linear models for classification tasks
Linear regression can be adapted for classification tasks in two ways: by transforming the continuous output of the linear regression model directly into a class designation, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$, or by transforming it into a probability using an output function $\sigma:\mathbb{R}\rightarrow[0,1]$ and then applying a threshold to categorize predictions into discrete classes. Let's take a look at two examples of these strategies:

* [The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) is a simple yet powerful algorithm used in machine learning for binary classification tasks. It operates by _incrementally_ learning a linear decision boundary (linear regression model) that separates two classes based on input features by directly mapping the continuous output to a class such as $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$, where the output function is $\sigma(\star) = \text{sign}(\star)$.
* [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#) is a statistical method used in machine learning for binary classification tasks using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) as the transformation function. Applying the logistic function transforms the output of a linear regression model into a probability, which can then be thresholded to make a class decision. We'll consider logistic regression in the next module.

Let's start by reviewing the perceptron algorithm, a simple linear classifier that learns a linear decision boundary by iteratively adjusting its weights based on misclassified examples.

## Perceptron
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes the (scalar) output of a linear regression model $y_{i}\in\mathbb{R}$ and then transforms it using the $\sigma(\star) = \text{sign}(\star)$ function to a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case. 
* Suppose there exists a data set
$\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert (i.e., a human) as belonging to a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
* [The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes of possible objects (binary classification) in $\mathcal{D}$ by repeatedly processing the data. During each pass, a regression parameter vector $\mathbf{\theta}$ is updated until it makes no more than a specified number of mistakes. 

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \texttt{sign}\left(\hat{\mathbf{x}}_{i}^{\top}\;\theta\right)
\end{equation*}
$$
where $\theta=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, $w_{j}\in\mathbb{R}$ corresponds to the importance of feature $j$ and $b\in\mathbb{R}$ is a bias parameter, the features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p = m+1$-dimensional (row) vectors (features augmented with a `1` for the bias term), and $\texttt{sign}(z)$ is the function:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$

### Classical: Online Perceptron Training
__Hypothesis__: If the dataset $\mathcal{D}$ is linearly separable, the Perceptron is guaranteed to _incrementally_ learn a separating hyperplane in a finite number of passes through the data set $\mathcal{D}$. However, if the dataset $\mathcal{D}$ is __not__ linearly separable, the Perceptron may not converge. 

Let's look at some pseudocode for the Perceptron learning algorithm. 

__Initialize__: Given a linearly separable dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$, the maximum number of iterations $T$, and the maximum number of mistakes $M$ (e.g., $M=1$), initialize the parameter vector $\theta = \left(\mathbf{w}, b\right)$ to small random values, and initialize the loop counter $t\gets{0}$.

> **Rule of thumb for $T$**: Set $T = 10n$ to $100n$, where $n$ is the number of training examples. The algorithm often converges faster for linearly separable data.

While $\texttt{true}$ __do__:
1. Initialize the number of mistakes $\texttt{mistakes} = 0$.
2. For each training example $(\mathbf{x}, y) \in \mathcal{D}$: check whether $y\cdot\left(\theta^{\top}\hat{\mathbf{x}}\right)\leq{0}$, where $\hat{\mathbf{x}}$ is the augmented feature vector. If this condition is $\texttt{true}$, the training example $(\mathbf{x}, y)$ is misclassified: update the parameter vector $\theta \gets \theta + y\cdot\hat{\mathbf{x}}$ and the error counter $\texttt{mistakes} \gets \texttt{mistakes} + 1$.
3. After processing all training examples, if $\texttt{mistakes} \leq {M}$, or $t \geq T$, break the loop. Otherwise, increment the loop counter $t \gets t + 1$ and repeat from step 1.
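
The pseudocode above can be sketched in Python with NumPy (an assumed language choice). The function name `train_perceptron`, the default values of `T` and `M`, the scale of the random initialization, and the synthetic dataset are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sign(z):
    """sign function from the text: +1 if z >= 0, otherwise -1."""
    return np.where(z >= 0, 1, -1)

def train_perceptron(X, y, T=1000, M=0, seed=0):
    """Online Perceptron. X is (n, m), y has entries in {-1, +1}. Returns (theta, mistakes)."""
    rng = np.random.default_rng(seed)
    X_hat = np.column_stack([X, np.ones(len(X))])    # augment with a trailing 1 for the bias
    theta = 0.01 * rng.normal(size=X_hat.shape[1])   # small random initial values
    t = 0
    while True:
        mistakes = 0
        for x_i, y_i in zip(X_hat, y):
            if y_i * (theta @ x_i) <= 0:             # misclassified (or exactly on the boundary)
                theta = theta + y_i * x_i
                mistakes += 1
        if mistakes <= M or t >= T:
            break
        t += 1
    return theta, mistakes

# Hypothetical linearly separable data: the label is the sign of x1 + x2, with a margin.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X.sum(axis=1)) > 0.2]                   # drop points near the boundary
y = sign(X.sum(axis=1))

theta, mistakes = train_perceptron(X, y)
y_pred = sign(np.column_stack([X, np.ones(len(X))]) @ theta)
```

With `M = 0`, the loop stops only after a full pass with no updates, so on this separable dataset the returned `theta` classifies every training example correctly. On data that is not linearly separable, the cap `T` is what ends the loop.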


Traditionally, we want to learn the perceptron parameters $\theta\in\mathbb{R}^{m+1}$ such that the number of mistakes is minimized, i.e., $M = 0$ in the best case. However, zero mistakes may not always be achievable within $T$ passes for weakly linearly separable datasets, and it is impossible for non-linearly separable data.


### Modern: Cross-entropy loss
Suppose we view our two–class labels $y\in\{-1,1\}$ as _states_ in a Boltzmann distribution conditioned on the input $\hat{\mathbf{x}}\in\mathbb{R}^{m+1}$ (the original feature vector with a `1` as the last element to account for a bias). Then for any state $y$ with energy $E(y,\hat{\mathbf{x}})$ at (unit) temperature, the conditional probability of observing the label $y\in\left\{-1,+1\right\}$ given the feature vector $\hat{\mathbf{x}}$ can be represented as
$$
\begin{align*}
P(y\mid \hat{\mathbf{x}})
=\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}.
\end{align*}
$$
For the energy function, we can use a linear model of the form:
$$
\begin{align*}
E(y,\hat{\mathbf{x}})\;=\;-\,y\;\bigl(\hat{\mathbf{x}}^{\top}\theta \bigr).
\end{align*}
$$
where $\theta\in\mathbb{R}^{p}$ is a vector of parameters (weights plus bias) that we want to learn. When $y=+1$, the energy $E(1,\hat{\mathbf{x}})=-\hat{\mathbf{x}}^{\top}\theta$ is *lower* (more probable) if $\hat{\mathbf{x}}^{\top}\theta$ is large. On the other hand, when $y=-1$, the energy $E(-1,\hat{\mathbf{x}})=+\hat{\mathbf{x}}^{\top}\theta$, so $y=-1$ is favored when $\hat{\mathbf{x}}^{\top}\theta$ is very negative.

Let's substitute the energy function into the conditional probability expression and do some algebra:
$$
\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}})
& =\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}\\
&=\frac{\exp\bigl(y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}
      {\exp\bigl(\hat{\mathbf{x}}^{\top}\theta\bigr) + \exp\bigl(-\hat{\mathbf{x}}^{\top}\theta\bigr)}\quad\Longrightarrow\;{\text{substituting } z = \hat{\mathbf{x}}^{\top}\theta}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(z\bigr) + \exp\bigl(-z\bigr)}\quad\Longrightarrow\;{\text{factor out}\; \exp(yz)\;\text{from denominator}}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(yz\bigr)\left(\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)\right)}\quad\Longrightarrow\;\text{cancel}\;\exp(yz)\\
& = \frac{1}
      {\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)}\quad\blacksquare\\
\end{align*}
$$

This expression is the probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$. Let's look at the case when $y=+1$ and $y=-1$:
* When $y=+1$, we have:
$$
\begin{align*}
P_{\theta}(y = +1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(0\bigr) + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\\
& = \frac{1}
      {1 + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$

* When $y=-1$, we have:
$$\begin{align*}
P_{\theta}(y = -1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr) + \exp\bigl(0\bigr)}\\
& = \frac{1}
      {1+\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$

Putting this all together, we can write the conditional probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$ as:
$$\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}}) & = \frac{1}{1+\exp\bigl(-2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Logistic function!}\\
& = \sigma\bigl(2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)\\
\end{align*}$$
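
As a numerical sanity check on the algebra, the sketch below (Python with NumPy, an assumed choice; the function names are hypothetical) evaluates the Boltzmann form and the closed-form logistic expression on a grid of scores $z = \hat{\mathbf{x}}^{\top}\theta$ and confirms that the two class probabilities sum to one.

```python
import numpy as np

def boltzmann_prob(y, z):
    """P(y | x) from the Boltzmann form with energy E(y, x) = -y * z, where z = x_hat^T theta."""
    return np.exp(y * z) / (np.exp(z) + np.exp(-z))

def logistic_prob(y, z):
    """Closed form derived above: 1 / (1 + exp(-2 * y * z))."""
    return 1.0 / (1.0 + np.exp(-2.0 * y * z))

z = np.linspace(-3.0, 3.0, 13)          # grid of linear scores
p_plus = logistic_prob(+1, z)           # P(y = +1 | x)
p_minus = logistic_prob(-1, z)          # P(y = -1 | x)
total = p_plus + p_minus                # equals 1 for every score
```

At $z = 0$ both labels are equally likely, and the probability of $y=+1$ increases monotonically with the score.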

Of course, we want to learn the parameters $\theta$ so that we maximize the log likelihood (or minimize the negative log-likelihood) of the observed labels given the feature vectors. The likelihood function is given by:
$$
\begin{align*}
\mathcal{L}(\theta) & = \prod_{i=1}^{n} P_{\theta}(y_{i}\mid \hat{\mathbf{x}}_{i})\\
& = \prod_{i=1}^{n} \frac{1}{1+\exp\bigl(-2y_{i}\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Product is $\textbf{hard}$ to optimize! Take the $\log$}\\
\log\mathcal{L}(\theta) & = -\sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr)\\
\end{align*}
$$  

__Wow!__ Wait a minute, is this the same as logistic regression? Yes, it is! Up to the factor of 2, which can be absorbed into $\theta$, the negative log-likelihood above is the binary cross-entropy (logistic) loss, so placing a Boltzmann distribution on the Perceptron's linear score gives logistic regression.
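
To show how the log-likelihood might be maximized, here is a minimal gradient-descent sketch in Python with NumPy (an assumed choice). The function names, the step size, the iteration count, and the noisy synthetic labels are hypothetical; the gradient follows from differentiating $\log\bigl(1+\exp(-2y_i\,\hat{\mathbf{x}}_i^{\top}\theta)\bigr)$ with respect to $\theta$.

```python
import numpy as np

def negative_log_likelihood(theta, X_hat, y):
    """-log L(theta) = sum_i log(1 + exp(-2 * y_i * x_hat_i^T theta))."""
    margins = 2.0 * y * (X_hat @ theta)
    return np.sum(np.logaddexp(0.0, -margins))        # stable log(1 + exp(-margin))

def nll_gradient(theta, X_hat, y):
    """Gradient of the negative log-likelihood with respect to theta."""
    margins = 2.0 * y * (X_hat @ theta)
    weights = np.exp(-np.logaddexp(0.0, margins))     # 1 / (1 + exp(margin)), computed stably
    return -2.0 * X_hat.T @ (y * weights)

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) >= 0, 1, -1)   # noisy labels, not perfectly separable
X_hat = np.column_stack([X, np.ones(n)])

theta = np.zeros(3)
initial_nll = negative_log_likelihood(theta, X_hat, y)         # n * log(2) at theta = 0
learning_rate = 0.1                                            # hypothetical step size
for _ in range(500):
    theta = theta - learning_rate * nll_gradient(theta, X_hat, y) / n
final_nll = negative_log_likelihood(theta, X_hat, y)
training_accuracy = np.mean(np.where(X_hat @ theta >= 0, 1, -1) == y)
```

Unlike the Perceptron update, which changes $\theta$ only on misclassified examples, every example contributes to this gradient, weighted by how poorly the current model explains its label.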
___

## Confusion Matrix for Binary Classification
A **confusion matrix** is a table used to evaluate the performance of a binary classification model. It compares the predicted class labels to the true class labels, providing a detailed breakdown of correct and incorrect predictions.

The confusion matrix for a binary classifier is typically structured as follows:

|                     | **Predicted Positive** | **Predicted Negative** |
|---------------------|------------------------|------------------------|
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |

### Understanding Each Quadrant
- **True Positive (TP):** The model correctly predicts positive when the actual class is positive. Example: correctly diagnosing a patient who actually has a disease.

- **False Positive (FP):** The model incorrectly predicts positive when the actual class is negative. This is a "false alarm"—like telling a healthy patient they have a disease when they don't.

- **False Negative (FN):** The model incorrectly predicts negative when the actual class is positive. This means missing a true case—like failing to diagnose a patient who actually has the disease.

- **True Negative (TN):** The model correctly predicts negative when the actual class is negative. Example: correctly identifying that a healthy patient is indeed healthy.

### Why is the Confusion Matrix Important?
The confusion matrix provides a comprehensive view of classifier performance and enables calculation of key metrics:

* **Accuracy**: $\frac{TP + TN}{TP + TN + FP + FN}$ — overall correctness
* **Precision**: $\frac{TP}{TP + FP}$ — of all positive predictions, how many were correct?
* **Recall (Sensitivity)**: $\frac{TP}{TP + FN}$ — of all actual positives, how many were correctly identified?
* **Specificity**: $\frac{TN}{TN + FP}$ — of all actual negatives, how many were correctly identified?

> **Key Insight**: Different applications require different trade-offs. In medical diagnosis, high recall (avoiding false negatives) might be more important than high precision, as missing a disease can be more costly than a false alarm.

By analyzing each quadrant, you can understand the types of errors your model makes and make informed decisions about model improvements, threshold adjustments, or cost-sensitive learning approaches.
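
The four quadrants and the metrics above can be computed directly from label lists. The sketch below is plain Python (an assumed choice); the function names and the ten example labels are hypothetical, with $+1$ taken as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, FN, TN) for binary labels, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Metrics from the text; assumes every denominator is nonzero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]     # hypothetical actual labels
y_pred = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]     # hypothetical predicted labels
tp, fp, fn, tn = confusion_counts(y_true, y_pred) # (3, 1, 1, 5)
metrics = classification_metrics(tp, fp, fn, tn)
```

For these labels the accuracy is 0.8, precision and recall are both 0.75, and specificity is 5/6. A classifier that never predicts the positive class would make the precision denominator zero, so a production implementation would need to handle that case explicitly.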
___