# L3c: Logistic Regression and Regularization
In this lecture, we will explore logistic regression, a technique for binary classification tasks. We will also discuss the concept of regularization, which helps prevent overfitting in our models.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
>
> * __Understand the Boltzmann distribution framework for classification:__ Represent binary classification as a probabilistic problem using energy-based models and explain how the logistic function emerges from the Boltzmann distribution.
> * __Derive and implement logistic regression with cross-entropy loss:__ Derive the cross-entropy loss function from maximum likelihood estimation and apply gradient descent to optimize logistic regression models.
> * __Apply logistic regression to binary classification tasks:__ Train logistic regression models on real datasets, interpret model parameters, and evaluate classifier performance using appropriate metrics.

Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ Logistic classification of a banknote dataset](CHEME-5820-L3c-Example-LogisticRegression-GD-Spring-2026.ipynb). In this example, we'll use logistic regression to classify authentic and inauthentic banknotes based on features extracted from images of the banknotes. We'll train a logistic regression model using gradient descent and evaluate its performance using the confusion matrix.

___

## Logistic Regression: Cross-entropy loss
Suppose we view our two-class labels $y\in\{-1,1\}$ as _states_ in a Boltzmann distribution conditioned on the input $\hat{\mathbf{x}}\in\mathbb{R}^{m+1}$ (the original feature vector $\mathbf{x}\in\mathbb{R}^{m}$ with a `1` appended as the last element to account for a bias). Then, for any state $y$ with energy $E(y,\hat{\mathbf{x}})$ at unit temperature, the conditional probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ is
$$
\begin{align*}
P(y\mid \hat{\mathbf{x}})
=\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}.
\end{align*}
$$
For the energy function, we can use a linear model of the form:
$$
\begin{align*}
E(y,\hat{\mathbf{x}})\;=\;-\,y\;\bigl(\hat{\mathbf{x}}^{\top}\theta \bigr).
\end{align*}
$$
where $\theta\in\mathbb{R}^{p}$, with $p = m+1$, is a vector of __unknown__ parameters (weights plus bias) that we want to learn. When $y=+1$, the energy $E(1,\hat{\mathbf{x}})=-\hat{\mathbf{x}}^{\top}\theta$ is *lower*, and the state more probable, when $\hat{\mathbf{x}}^{\top}\theta$ is large and positive. Conversely, when $y=-1$, the energy is $E(-1,\hat{\mathbf{x}})=+\hat{\mathbf{x}}^{\top}\theta$, so $y=-1$ is favored when $\hat{\mathbf{x}}^{\top}\theta$ is large and negative.
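
As a minimal sketch of this energy-based view, the Python snippet below evaluates the two energies and normalizes them by the partition function. The function names `energy` and `boltzmann_probability`, and the values of `theta` and `x_hat`, are illustrative assumptions and do not come from the example notebook.

```python
import numpy as np

def energy(y, x_hat, theta):
    """Linear energy E(y, x_hat) = -y * (x_hat^T theta) for a label y in {-1, +1}."""
    return -y * float(np.dot(x_hat, theta))

def boltzmann_probability(y, x_hat, theta):
    """P(y | x_hat) at unit temperature: exp(-E(y)) / Z, with Z summed over both labels."""
    weights = {label: np.exp(-energy(label, x_hat, theta)) for label in (-1, 1)}
    Z = sum(weights.values())  # partition function Z(x_hat)
    return weights[y] / Z

# Hypothetical values: two features plus a trailing 1 for the bias term.
theta = np.array([0.5, -0.25, 0.1])
x_hat = np.array([2.0, 1.0, 1.0])

p_pos = boltzmann_probability(+1, x_hat, theta)
p_neg = boltzmann_probability(-1, x_hat, theta)
print(p_pos, p_neg, p_pos + p_neg)
```

Here $\hat{\mathbf{x}}^{\top}\theta = 0.85 > 0$, so the state $y=+1$ has the lower energy and the larger probability, and the two probabilities sum to one.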

Let's substitute the energy function into the conditional probability expression and do some algebra:
$$
\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}})
& =\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}\\
&=\frac{\exp\bigl(y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}
      {\exp\bigl(\hat{\mathbf{x}}^{\top}\theta\bigr) + \exp\bigl(-\hat{\mathbf{x}}^{\top}\theta\bigr)}\quad\Longrightarrow\;{\text{substituting } z = \hat{\mathbf{x}}^{\top}\theta}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(z\bigr) + \exp\bigl(-z\bigr)}\quad\Longrightarrow\;{\text{factor out}\; \exp(yz)\;\text{from denominator}}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(yz\bigr)\left(\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)\right)}\quad\Longrightarrow\;\text{cancel}\;\exp(yz)\\
& = \frac{1}
      {\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)}\quad\blacksquare\\
\end{align*}
$$

This expression is the probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$. Let's look at the case when $y=+1$ and $y=-1$:

> __Cases:__
>
> When $y=+1$, we have:
> $$
\begin{align*}
P_{\theta}(y = +1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(0\bigr) + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\\
& = \frac{1}
      {1 + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$
> 
> When $y=-1$, we have:
> $$\begin{align*}
P_{\theta}(y = -1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr) + \exp\bigl(0\bigr)}\\
& = \frac{1}
      {1+\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$
> Putting this all together, we can write the conditional probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$ as:
> $$\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}}) & = \frac{1}{1+\exp\bigl(-2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Logistic function!}\\
& = \sigma\bigl(2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)\\
\end{align*}$$

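The logistic function $\sigma(u) = 1/\bigl(1+\exp(-u)\bigr)$ is a sigmoid function that maps any real-valued input to the interval $(0, 1)$, which makes it a natural choice for modeling probabilities: the predicted probability is valid regardless of the input magnitude. In logistic regression, $\sigma$ compresses the scaled linear predictor $2y(\hat{\mathbf{x}}^{\top}\theta)$ into a probability that varies smoothly and differentiably with $\theta$. The decision boundary itself is the hyperplane $\hat{\mathbf{x}}^{\top}\theta = 0$, where both labels have probability $1/2$. The factor of 2 comes from the energy gap between the two states, $E(-1,\hat{\mathbf{x}}) - E(+1,\hat{\mathbf{x}}) = 2\,\hat{\mathbf{x}}^{\top}\theta$; it can be absorbed into $\theta$, which recovers the more common form $\sigma\bigl(y\,\hat{\mathbf{x}}^{\top}\theta\bigr)$.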
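
The equivalence between the Boltzmann form and the logistic form can be checked numerically. The sketch below is one way to do that, assuming NumPy. The helper names `sigmoid` and `label_probability` and the grid of `z` values are made up for illustration.

```python
import numpy as np

def sigmoid(u):
    """Logistic function 1 / (1 + exp(-u)), written to avoid overflow for large |u|."""
    u = np.asarray(u, dtype=float)
    e = np.exp(-np.abs(u))  # always in (0, 1], so it never overflows
    return np.where(u >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def label_probability(y, z):
    """P(y | x_hat) = sigmoid(2 * y * z), where z = x_hat^T theta and y is -1 or +1."""
    return sigmoid(2.0 * y * z)

z = np.linspace(-5.0, 5.0, 11)  # hypothetical values of the linear predictor
p_pos = label_probability(+1, z)
p_neg = label_probability(-1, z)

# Direct Boltzmann form for comparison: exp(z) / (exp(z) + exp(-z)).
p_pos_boltzmann = np.exp(z) / (np.exp(z) + np.exp(-z))

print(np.allclose(p_pos, p_pos_boltzmann))  # checks that the two forms agree
print(np.allclose(p_pos + p_neg, 1.0))      # checks that the probabilities sum to one
```

Branching on the sign of `u` is a design choice: it keeps the exponent non-positive, so very large positive or negative predictors saturate cleanly to 1 or 0 instead of overflowing.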

### Parameter Estimation
We want to learn the parameters $\theta$ that maximize the likelihood (equivalently, minimize the negative log-likelihood) of the observed labels given the feature vectors. Assuming the $n$ training pairs $(\hat{\mathbf{x}}_{i}, y_{i})$ are independent, the likelihood function is given by:
$$
\begin{align*}
\mathcal{L}(\theta) & = \prod_{i=1}^{n} P_{\theta}(y_{i}\mid \hat{\mathbf{x}}_{i})\\
& = \prod_{i=1}^{n} \frac{1}{1+\exp\bigl(-2y_{i}\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Product is $\textbf{hard}$ to optimize! Take the $\log$}\\
\log\mathcal{L}(\theta) & = -\sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr)\\
\end{align*}
$$  

We can use gradient descent to minimize the negative log-likelihood (also known as the cross-entropy loss function):
$$
\boxed{
\begin{align*}
J(\theta) & = -\log\mathcal{L}(\theta)\\
& = \sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr)\quad\blacksquare\\
\end{align*}}
$$      
This will give us the optimal parameters $\theta$ for our logistic regression model:
$$
\hat{\theta} = \arg\min_{\theta} J(\theta)
$$
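
Gradient descent needs the gradient of $J(\theta)$. As a worked step that the derivation above leaves implicit, differentiating each term with the chain rule and using $\exp(-u)/\bigl(1+\exp(-u)\bigr) = \sigma(-u) = 1-\sigma(u)$ gives:

$$
\begin{align*}
\nabla_{\theta}J(\theta) & = \sum_{i=1}^{n}\frac{\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)}{1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)}\,\bigl(-2y_i\,\hat{\mathbf{x}}_{i}\bigr)\\
& = -2\sum_{i=1}^{n} y_i\,\sigma\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\,\hat{\mathbf{x}}_{i}\\
& = -2\sum_{i=1}^{n} y_i\,\bigl(1 - P_{\theta}(y_{i}\mid \hat{\mathbf{x}}_{i})\bigr)\,\hat{\mathbf{x}}_{i}
\end{align*}
$$

Each sample is weighted by the probability that the model currently assigns to the wrong label, so confidently correct samples contribute almost nothing. A basic gradient descent iteration is then $\theta_{k+1} = \theta_{k} - \alpha\,\nabla_{\theta}J(\theta_{k})$, where the step size $\alpha > 0$ and the iteration index $k$ are symbols introduced here for illustration.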

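Because $J(\theta)$ is convex, gradient descent with a suitably small step size approaches a global minimizer whenever one exists (for perfectly separable data the loss can keep decreasing as $\lVert\theta\rVert$ grows). The example notebook demonstrates this approach on the banknote dataset.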
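
The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the notebook's implementation. The function names, the step size `alpha`, the stopping rule, and the synthetic two-cluster data (used in place of the banknote dataset) are all hypothetical. The gradient is $-2\sum_i y_i\,\sigma(-2y_i\,\hat{\mathbf{x}}_i^{\top}\theta)\,\hat{\mathbf{x}}_i$, obtained by differentiating $J(\theta)$ term by term.

```python
import numpy as np

def loss(theta, X_hat, y):
    """J(theta) = sum_i log(1 + exp(-2 * y_i * x_hat_i^T theta))."""
    margins = 2.0 * y * (X_hat @ theta)
    return float(np.sum(np.logaddexp(0.0, -margins)))  # stable log(1 + exp(-m))

def gradient(theta, X_hat, y):
    """grad J = -2 * sum_i y_i * sigma(-2 * y_i * x_hat_i^T theta) * x_hat_i."""
    margins = 2.0 * y * (X_hat @ theta)
    wrong_label_prob = np.exp(-np.logaddexp(0.0, margins))  # sigma(-m) = 1 / (1 + exp(m))
    return -2.0 * (X_hat.T @ (y * wrong_label_prob))

def fit(X_hat, y, alpha=1e-3, max_iter=5000, tol=1e-6):
    """Batch gradient descent from theta = 0; stops early if the gradient norm is small."""
    theta = np.zeros(X_hat.shape[1])
    for _ in range(max_iter):
        g = gradient(theta, X_hat, y)
        if np.linalg.norm(g) < tol:
            break
        theta -= alpha * g
    return theta

def predict(theta, X_hat):
    """Predict +1 when P(y = +1 | x_hat) >= 0.5, i.e. when x_hat^T theta >= 0."""
    return np.where(X_hat @ theta >= 0.0, 1, -1)

def confusion_matrix(y_true, y_pred):
    """2x2 counts; rows = actual (+1, -1), columns = predicted (+1, -1)."""
    labels = (1, -1)
    return np.array([[np.sum((y_true == a) & (y_pred == p)) for p in labels]
                     for a in labels])

# Synthetic stand-in data (an assumption, not the banknote dataset):
# two overlapping Gaussian clusters labeled -1 and +1.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
X_hat = np.hstack([X, np.ones((n, 1))])  # append 1 as the last element for the bias

theta_hat = fit(X_hat, y)
y_pred = predict(theta_hat, X_hat)
accuracy = float(np.mean(y_pred == y))
print("initial loss:", loss(np.zeros(3), X_hat, y))
print("final loss:  ", loss(theta_hat, X_hat, y))
print("accuracy:    ", accuracy)
print(confusion_matrix(y, y_pred))
```

Because $J(\theta)$ is a sum over samples, its gradient grows with $n$, so a fixed step size that works for 200 points may be too large for a bigger dataset. Dividing the loss by $n$ or tuning `alpha` are common remedies.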

___

## Regularization: Preventing Overfitting
In practice, logistic regression models can overfit to training data, particularly when the feature space is high-dimensional or when the training set is small. __Regularization__ is a technique that adds a penalty term to the loss function to discourage complex models and improve generalization to unseen data.

> __What is regularization?__
>
> __Regularization__ is a method that constrains the magnitude of model parameters to reduce overfitting. By penalizing large parameter values, regularization encourages simpler, more generalizable decision boundaries. Two common regularization approaches are L2 (Ridge) and L1 (Lasso) regularization.

The regularized cross-entropy loss function can be written as:
$$
\boxed{
\begin{align*}
J_{\text{reg}}(\theta) & = J(\theta) + \lambda\,R(\theta)\\
& = \sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr) + \lambda\,R(\theta)\quad\blacksquare\\
\end{align*}}
$$

where $\lambda > 0$ is the __regularization parameter__ (also called the regularization strength) that controls the trade-off between minimizing training error and keeping parameters small, and $R(\theta)$ is the regularization function. Common choices for $R(\theta)$ are:

* __L2 regularization (Ridge):__ $R(\theta) = \frac{1}{2}\lVert\theta\rVert_{2}^{2} = \frac{1}{2}\sum_{j=1}^{p}\theta_{j}^{2}$. This penalizes the squared magnitude of all parameters equally and encourages smaller parameter values across the board.
* __L1 regularization (Lasso):__ $R(\theta) = \lVert\theta\rVert_{1} = \sum_{j=1}^{p}|\theta_{j}|$. This penalizes the absolute magnitude of parameters and can drive some parameters exactly to zero, effectively performing automatic feature selection.

The regularization parameter $\lambda$ controls the strength of regularization: small values of $\lambda$ give more weight to the training loss, while large values emphasize parameter magnitude control. Selecting an appropriate $\lambda$ is crucial and is typically done via cross-validation.
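
To make the effect of $\lambda$ concrete, the sketch below adds the penalty and its gradient to a plain gradient descent loop and compares the fitted parameter norm for a few values of $\lambda$. Everything here is an illustrative assumption: the function names, the step size, the chosen $\lambda$ values, and the synthetic data. Following the sums over $j=1,\dots,p$ above, the sketch penalizes every entry of $\theta$, including the bias; many implementations leave the bias unpenalized.

```python
import numpy as np

def penalty(theta, kind):
    """R(theta): 0.5 * ||theta||_2^2 for "l2", ||theta||_1 for "l1"."""
    if kind == "l2":
        return 0.5 * float(theta @ theta)
    if kind == "l1":
        return float(np.sum(np.abs(theta)))
    raise ValueError(f"unknown penalty: {kind}")

def penalty_gradient(theta, kind):
    """Gradient of R for "l2"; a subgradient (the sign) for "l1", which has a kink at 0."""
    if kind == "l2":
        return theta
    if kind == "l1":
        return np.sign(theta)
    raise ValueError(f"unknown penalty: {kind}")

def regularized_loss(theta, X_hat, y, lam, kind="l2"):
    """J_reg(theta) = J(theta) + lam * R(theta)."""
    margins = 2.0 * y * (X_hat @ theta)
    return float(np.sum(np.logaddexp(0.0, -margins))) + lam * penalty(theta, kind)

def fit_regularized(X_hat, y, lam, kind="l2", alpha=5e-4, max_iter=5000):
    """Fixed-step (sub)gradient descent on J_reg, starting from theta = 0."""
    theta = np.zeros(X_hat.shape[1])
    for _ in range(max_iter):
        margins = 2.0 * y * (X_hat @ theta)
        data_grad = -2.0 * (X_hat.T @ (y * np.exp(-np.logaddexp(0.0, margins))))
        theta -= alpha * (data_grad + lam * penalty_gradient(theta, kind))
    return theta

# Synthetic stand-in data (an assumption): two overlapping Gaussian clusters.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
X_hat = np.hstack([X, np.ones((n, 1))])  # trailing 1 for the bias

lambdas = [0.0, 10.0, 100.0]
norms = [float(np.linalg.norm(fit_regularized(X_hat, y, lam))) for lam in lambdas]
for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}   ||theta||_2 = {nrm:.3f}")
```

With the L2 penalty, the fitted norm should shrink as $\lambda$ grows. For the L1 penalty, the sign subgradient used here is the simplest option, but a fixed-step subgradient method rarely lands on exact zeros; proximal (soft-thresholding) updates are the usual way to obtain sparse solutions.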

___

## Summary
Logistic regression uses the Boltzmann distribution and cross-entropy loss to model binary classification problems and learn decision boundaries through maximum likelihood estimation.

> __Key Takeaways:__
>
> * **The logistic function emerges from the Boltzmann distribution:** Starting with an energy-based model, algebraic manipulation yields the logistic function $\sigma(2y(\hat{\mathbf{x}}^{\top}\theta))$ as the conditional probability of a label given features. This provides a principled probabilistic interpretation for logistic regression.
> * **Cross-entropy loss is the negative log-likelihood:** The cross-entropy loss $J(\theta) = \sum_{i=1}^n \log(1+\exp(-2y_i(\hat{\mathbf{x}}^{\top}_{i}\theta)))$ arises naturally from maximum likelihood estimation and quantifies prediction error in probabilistic terms. Minimizing this loss finds parameters that maximize data likelihood.
> * **Logistic regression requires optimization for parameter learning:** The cross-entropy loss is convex, but its minimizer has no closed-form solution, so gradient descent or another iterative optimization algorithm is needed to find the optimal parameters $\hat{\theta}$. The resulting classifier makes predictions based on whether the logistic function output exceeds a decision threshold.

Logistic regression provides a practical and interpretable approach to binary classification by combining probabilistic modeling with gradient-based optimization.
___