In [16]:
%load_ext autoreload
%autoreload 2

from utils import regularizers as rg



## Regularizers

**Why Regularize?**
- **Reduce Variance**: Regularization helps the model generalize well to new, unseen data.
- **Restrict Influence**: By shrinking certain weights/parameters toward zero (or even to zero), regularization limits how strongly individual predictors can influence the model.
- **Handle Multicollinearity**: Techniques like L2 regularization help address situations where some predictors are linearly dependent on others.

### L1 Regularization (Lasso)

In many regression problems, we want to keep our model simple by reducing the number of parameters used. One way to do this is to add a penalty term to our loss function. In L1 regularization (often called Lasso in regression), this penalty is the sum of the absolute values of the parameter/weights. In other words, if you have parameter/weights $\theta_1, \theta_2, \dots, \theta_p$, the penalty is
$$
\lambda \sum_{j=1}^{p} |\theta_j|
$$
The constant $\lambda$ controls how strong the penalty is. A larger $\lambda$ pushes more parameter/weights to exactly zero, promoting a sparse solution (using fewer parameters and therefore features).

Let's connect the regularization term with the full optimization problem:
$$
\min_{\theta} \; \text{Loss}(\theta) + \lambda \sum_{j=1}^{p} |\theta_j|
$$
We can see a couple of things from the above formulation:
1. The objective separates into two parts: a differentiable loss term, $\text{Loss}(\theta)$ (written $L(\theta)$ below), and the non-differentiable regularizer, $\lambda \|\theta\|_1$.
2. When using a proximal gradient method (like ISTA), each update is split into 2 steps:
	1. Gradient Step: Compute an intermediate update $v$ using the gradient of the loss only: $v = \theta^{(k)} - \eta \nabla L(\theta^{(k)})$, where $\eta$ is the learning rate.
	2. Proximal Step: Apply the proximal operator of the L1 norm to $v$: $\theta^{(k+1)} = \text{prox}_{\eta \lambda \|\cdot\|_1}(v)$. For the L1 norm, the proximal operator is the **soft thresholding operator** (derived in the sections below), applied to each coordinate: $\theta_i^{(k+1)} = \operatorname{sign}(v_i) \max\left(|v_i| - \eta \lambda, \; 0\right)$.
3. Soft thresholding is the solution of the per-coordinate problem shown below. Its derivation depends solely on the L1 term and the quadratic penalty used in the proximal framework. It is **completely independent** of the specific loss function $L(\theta)$. Whether $L(\theta)$ is mean squared error, cross-entropy, or any other differentiable loss, the proximal operator for the L1 norm remains the same.
	$$
	\text{prox}_{\eta \lambda |\cdot|}(v_i) = \arg\min_{x} \left\{ \eta \lambda |x| + \frac{1}{2}(x - v_i)^2 \right\}
	$$
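
The gradient step and proximal step above can be sketched in NumPy for a least-squares loss. This is a minimal sketch, not a reference implementation: the function name `ista`, the synthetic data, the step size $\eta = 1/L$ (with $L$ the largest eigenvalue of $X^T X / n$), and the fixed iteration count are all assumptions made for illustration.

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    """Proximal gradient (ISTA) for (1/2n) * ||X theta - y||^2 + lam * ||theta||_1."""
    n, p = X.shape
    # Assumed step size: 1/L, where L is the Lipschitz constant of the loss gradient.
    eta = n / (np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n   # gradient of the smooth loss only
        v = theta - eta * grad             # gradient step
        # proximal step: soft thresholding with threshold eta * lam
        theta = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
    return theta

# Hypothetical data: only the first 3 of 10 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_theta = np.zeros(10)
true_theta[:3] = [3.0, -2.0, 1.5]
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta_hat = ista(X, y, lam=0.1)
print(np.round(theta_hat, 3))
```

Note that the regularizer never enters the gradient computation; it only appears in the thresholding line, which is what makes the same loop reusable with a different differentiable loss.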


#### Introducing the Proximal Operator

The optimization problem $\min_{\theta} \; \text{Loss}(\theta) + \lambda \sum_{j=1}^{p} |\theta_j|$  is defined over the entire parameter (or weight) space, which is a multidimensional space. The challenge here is that the L1 norm, $\|\theta\|_1$, is non-differentiable at points where any $\theta_j = 0$. To handle non-differentiable functions in optimization, we use a tool called the **proximal operator**. 

$$
\text{prox}_{f}(v) = \arg\min_{x} \left\{ f(x) + \frac{1}{2}\|x - v\|^2 \right\}
$$
Because both the L1 norm and the squared distance are sums over coordinates, the proximal operator breaks the problem into independent one-dimensional subproblems, one per coordinate. In our case, when $f(x) = \lambda |x|$, the subproblem for each coordinate becomes:
$$
\text{prox}_{\lambda |\cdot|}(v) = \arg\min_{x} \left\{ \lambda |x| + \frac{1}{2}(x - v)^2 \right\}.
$$
This minimization has a neat closed-form solution because the objective, a sum of a piecewise-linear absolute value term $\lambda |x|$ and a quadratic term $\frac{1}{2}(x - v)^2$, is simple enough to solve exactly. By examining the problem separately over different regions (specifically, where $x$ is positive, negative, or zero), we can derive an explicit formula without the need for iterative numerical methods.


#### The Role of the Quadratic Term in the Proximal Operator
I was initially confused about the quadratic term and its relationship with the loss function because it looks very similar to the mean squared error loss. However, the quadratic term $\frac{1}{2}(x - v)^2$ measures the distance from the updated point $v$, not the loss itself. The quadratic term is not a substitute for your loss function (such as cross-entropy). Instead, it is a fundamental part of the proximal operator's definition, which is used to handle non-differentiable regularizers like the L1 penalty. 

Let's dive deeper into what $x$ and $v$ mean intuitively. In a given training iteration, you have the weights $\theta^{(k)}$. Taking a gradient step on the loss $L(\theta)$ with learning rate $\eta$ gives:

$$
v = \theta^{(k)} - \eta \nabla L(\theta^{(k)})
$$

This $v$ represents the point suggested by the loss function **before** applying the L1 regularization. It represents a temporary value (or "pre-regularized" weight). The proximal operator then solves for $x$ by minimizing
$$ \lambda |x| + \frac{1}{2}(x - v)^2$$
The solution, $x$, is the updated value that incorporates the L1 regularization. So, in this setting, $x$ is initially introduced as a dummy variable representing the candidate update. However, once computed, this $x$ is used to update the corresponding weight in your model. In other words, the new weight becomes $x$. Thus $x$ is not a separate concept from the weight; it is the result of the update process. After applying the proximal operator, $x$ is assigned as the new weight.

Intuitively, the quadratic term acts as a penalty for deviating too far from $v$. This "trust region" ensures that the update for $x$ (which now incorporates the regularizer) remains close to the point suggested by the gradient descent on the loss. This helps maintain stability in the overall optimization process. 


Why not simply run plain gradient descent on the full objective? That would require the derivative of the entire objective, including the L1 penalty term, and $|\theta_j|$ is not differentiable at $\theta_j = 0$. This is exactly the point where sparse solutions live, which is why the penalty is handled through the proximal operator instead.



#### Step-by-Step Derivation of the Soft-Thresholding Operator
We wish to solve the minimization problem for the proximal operator of the L1 penalty:

$$
\min_{x} \; \lambda |x| + \frac{1}{2}(x - v)^2
$$

Because the absolute value $|x|$ is not differentiable at $x=0$, we consider three cases:


$$
\text{prox}_{\lambda|\cdot|}(v) = \arg\min_{x} \left\{ \lambda |x| + \frac{1}{2}(x-v)^2 \right\} =
\begin{cases}
\textbf{Case 1: } x > 0: & \begin{aligned}
&|x| = x,\\[1mm]
&J(x) = \lambda x + \frac{1}{2}(x-v)^2,\\[1mm]
&\frac{dJ}{dx} = \lambda + (x-v),\\[1mm]
&\text{Setting } \frac{dJ}{dx}=0 \Rightarrow x = v - \lambda \quad (\text{valid if } v > \lambda)
\end{aligned} \\[2em] \\
\textbf{Case 2: } x < 0: & \begin{aligned}
&|x| = -x,\\[1mm]
&J(x) = -\lambda x + \frac{1}{2}(x-v)^2,\\[1mm]
&\frac{dJ}{dx} = -\lambda + (x-v),\\[1mm]
&\text{Setting } \frac{dJ}{dx}=0 \Rightarrow x = v + \lambda \quad (\text{valid if } v < -\lambda)
\end{aligned} \\[2em] \\
\textbf{Case 3: } x = 0: & \begin{aligned}
&\text{If } |v| \le \lambda,\\[1mm]
&\text{Subgradient condition: } 0 \in -v + \lambda[-1,1] \quad \Rightarrow x = 0.
\end{aligned}
\end{cases}
$$


The solution in all cases can be compactly expressed as:
$$
x = \operatorname{sign}(v) \max(|v| - \lambda, 0)
$$
This is the **soft-thresholding operator**:
- If $v > \lambda$: $x = v - \lambda$.
- If $v < -\lambda$: $x = v + \lambda$.
- If $|v| \le \lambda$: $x = 0$.
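
As a sanity check on the derivation above, the closed-form operator can be compared against a brute-force minimization of $\lambda |x| + \frac{1}{2}(x - v)^2$ over a dense grid. This is only a sketch: the helper names `soft_threshold` and `prox_l1_bruteforce`, the grid range, and the test values of $v$ are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form solution: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l1_bruteforce(v, lam, grid):
    """Minimize lam*|x| + 0.5*(x - v)^2 by evaluating it on a grid of candidates."""
    J = lam * np.abs(grid) + 0.5 * (grid - v) ** 2
    return grid[np.argmin(J)]

lam = 1.0
grid = np.linspace(-5.0, 5.0, 100001)  # candidate x values, spacing 1e-4

# One value from each case: v < -lam, |v| <= lam, v > lam.
for v in [-3.0, -0.5, 0.0, 0.7, 2.5]:
    closed = soft_threshold(v, lam)
    brute = prox_l1_bruteforce(v, lam, grid)
    print(f"v={v:+.1f}  closed-form={closed:+.4f}  grid-search={brute:+.4f}")
```

The two columns should agree up to the grid resolution, including the inputs with $|v| \le \lambda$, where both land at zero.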

**Intuition Behind the L1 Proximal Operator and Soft-Thresholding:**
- **Data Fidelity:** Start with an initial point $v$ (e.g., from a gradient step on the loss), and the quadratic term $\frac{1}{2}\|x-v\|_2^2$ ensures that the updated value $x$ remains close to $v$, preserving the model’s fit to the data.
- **Sparsity Promotion:** The L1 penalty $\lambda|x|$ encourages sparsity by penalizing non-zero values. This penalty pushes small parameters/weights toward zero, effectively eliminating features with little contribution.
- **Trade-off Balance:** The quadratic term keeps the update $x$ close to the data-informed weight update $v$ (ensuring data fidelity), while the L1 penalty shrinks $x$ towards zero to enforce sparsity. If $|v| \le \lambda$, the sparsity term dominates and the operator sets $x$ exactly to zero. If $|v| > \lambda$, it subtracts $\lambda$ from $|v|$, resulting in $x = \operatorname{sign}(v)(|v|-\lambda)$, shrinking the value toward zero but keeping its sign. The thresholding is called "soft" because the surviving values are also shrunk, so the output changes continuously as $|v|$ crosses $\lambda$ instead of jumping.

#### Visualization
In the visualization below, we use Ordinary Least Squares (OLS) as the baseline for fitting the parameters of a linear regression model. We compare a model with L1 regularization (Lasso, red bars and dots) against one without (OLS, green bars and dots).

Because OLS is basically the case $\lambda = 0$, you can see how L1 regularization’s solution progressively differs from that baseline. L1 regularization zeros out irrelevant features, so some red bars will disappear (go to zero), while OLS bars remain non-zero even for unimportant features. Similarly, the right scatter plot reveals how the L1 regularization predictions deviate from OLS as $\lambda$ grows—sometimes losing a bit of accuracy in exchange for a simpler (more sparse) model.

As $\lambda$ increases, you’ll see the L1 regularization parameter/weights shrink toward zero, and the predictions change accordingly. This trade-off underscores the primary advantage of L1 regularization: it fosters sparsity by enforcing a higher penalty for non-zero parameter/weights, enabling interpretable models that are less prone to overfitting. Additionally, by zeroing out irrelevant features, L1 regularization often simplifies the model, making it easier to understand which inputs carry the most predictive power. However, with a sufficiently large $\lambda$, even important parameter/weights may shrink too much, reducing the model's predictive accuracy.


In [17]:
rg.plot_regularization(filename="l1_regularization", reg_type="L1")

### L2 Regularization (Ridge)

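1. **Definition**: L2 regularization adds a penalty term to the cost function equal to the sum of the squares of the model’s parameters $\theta$, multiplied by $\frac{\lambda}{2}$ (the factor $\frac{1}{2}$ simply cancels the 2 produced by differentiation).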
   $$
   \min_{\theta} \; \text{Loss}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2
   $$
   
   
   The proximal operator (introduced in the L1 section above) for L2 regularization, with $f(x) = \frac{\eta \lambda}{2} \|x\|_2^2$ and learning rate $\eta$:
   $$
   \begin{align*}
   \text{prox}_{f}(v) &= \arg\min_{x} \left\{ \frac{1}{2} \| x - v \|_2^2 + \frac{\eta \lambda}{2} \| x \|_2^2 \right\} \\
   \\
   x - v + \eta \lambda x &= 0 & (\text{Differentiating with respect to } x)\\
   (1 + \eta \lambda) x &= v & (\text{Rearrange})\\
   x &= \frac{1}{1 + \eta \lambda} v & (\text{Rearrange})
   \end{align*}
   $$

   As a reminder, $v$ represents the point suggested by the loss function **before** applying the L2 regularization. It represents a temporary value (or "pre-regularized" weight). The solution, $x$, is the updated value that incorporates the L2 regularization. 
   We can see that this simply scales $v$ down by the constant factor $\frac{1}{1 + \eta \lambda}$ without inducing sparsity, in contrast to the soft-thresholding operator in L1 regularization.



2. **Key Properties**:
- Weight Decay: L2 regularization penalizes weights based on their squared magnitude. This means that larger weights incur a significantly higher penalty compared to smaller weights. The idea is to discourage the model from assigning too much importance to any one feature. In practice, during gradient-based optimization, this results in a multiplicative decay of the weights—often referred to as “weight decay”—where each update scales the weight by a factor less than one. This continuous shrinkage prevents any parameter/weight from becoming excessively large, thereby controlling the complexity of the model and reducing overfitting.
- Collinearity: In many real-world datasets, predictor variables or features ($x$) can be highly correlated (a situation known as multicollinearity). This can cause instability in the estimation of parameter/weights, leading to large variances. L2 regularization helps mitigate this by shrinking the parameter/weights, which in turn stabilizes the solution. By imposing a penalty on the squared size of the parameter/weights, the method encourages them to be of similar scale, thereby reducing the impact of multicollinearity. This results in a more stable and generalizable model.
- No Feature Elimination: Unlike L1 regularization—which can drive some parameter/weights exactly to zero, effectively performing feature selection—L2 regularization shrinks all parameter/weights but does not set any of them exactly to zero. This means that even features with minimal predictive power remain in the model. The benefit here is that the model retains all available features, which can be useful when every feature might contain some signal. However, it also means that L2 is less effective for scenarios where a sparse model (one with many parameter/weights exactly zero) is desired.
- Analytical Solution: For linear models, the incorporation of an L2 penalty (resulting in what is known as ridge regression) leads to a modification of the normal equations. Specifically, the closed-form solution becomes:
   $$
   \theta = (\mathbf{X}^T \mathbf{X} + \lambda I)^{-1} \mathbf{X}^T \mathbf{y}
   $$

   This solution is not only elegant but also computationally efficient for moderate-sized datasets. For any $\lambda > 0$, adding $\lambda I$ (where $I$ is the identity matrix) makes the matrix being inverted positive definite, and therefore invertible and better conditioned, which addresses issues with multicollinearity and makes the solution more stable.
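
As a rough illustration of the closed-form solution above, the sketch below solves the regularized normal equations on two nearly collinear features. The helper name `ridge_closed_form` and the synthetic data are assumptions, and the loss is taken to be $\frac{1}{2}\|\mathbf{X}\theta - \mathbf{y}\|_2^2$ so that the $\frac{\lambda}{2}\|\theta\|_2^2$ penalty produces exactly the $\lambda I$ term.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y without forming an explicit inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=100)

for lam in [0.0, 1.0]:
    A = X.T @ X + lam * np.eye(2)
    theta = ridge_closed_form(X, y, lam)
    print(f"lam={lam}: cond(A)={np.linalg.cond(A):.2e}, theta={theta}")
```

With `lam=0.0` the system is badly conditioned and the two coefficients are poorly determined individually; with a positive `lam` the condition number drops sharply and the weight is shared between the two correlated features.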


#### Visualization

In the visualization below, we use Ordinary Least Squares (OLS) as the baseline for fitting the parameters of a linear regression model. We compare a model with L2 regularization (Ridge, red bars and dots) against one without (OLS, green bars and dots).

We can see below that L2 regularization shrinks the larger weights by a larger amount, whereas the smaller weights change less. This follows from the formula: the penalty is quadratic in the weight values, so its gradient, $\lambda \theta_j$, grows in proportion to the weight.

In [18]:
rg.plot_regularization(filename="l2_regularization", reg_type="L2", reg_strength=100)


### L1 vs. L2 Regularization
| Aspect                | L1 (Lasso)                                                            | L2 (Ridge)                                                      |
| --------------------- | --------------------------------------------------------------------- | --------------------------------------------------------------- |
| **Penalty Term**      | $\sum_j \lvert\theta_j\rvert$                                         | $\sum_j \theta_j^2$                                             |
| **Feature Selection** | Tends to produce sparse models (zeros out some parameter/weights)          | Does not produce exact zeros; keeps all features                |
| **Computational**     | No analytical solution (often solved by coordinate descent)           | Analytical solution (closed-form) in linear regression          |
| **Multicollinearity** | May arbitrarily pick among correlated variables (and set others to 0) | Shrinks parameter/weights of correlated predictors, keeping them all |


### Batch Normalization

Batch Normalization is often introduced as a method to stabilize and speed up training, but it also has a mild regularizing effect. 

For each mini-batch and for each activation $x$, batch normalization performs the following steps:

1. **Compute the Mini-Batch Mean:**
   $$
   \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
   $$
   where $m$ is the number of examples in the mini-batch.

2. **Compute the Mini-Batch Variance:**
   $$
   \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
   $$

3. **Normalize the Activation:**
   $$
   \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$
   Here, $\epsilon$ is a small constant added to avoid division by zero.

4. **Scale and Shift:**
   $$
   y_i = \gamma \hat{x}_i + \beta
   $$
   The parameters $\gamma$ and $\beta$ are learnable and allow the network to adjust the normalized output.
   This operation was counterintuitive when I first learned it. The purpose of the scale and shift is to preserve the representational power of the activations. Because the normalization step is only a subtraction and a division (an affine map once the batch statistics are fixed), an affine scale and shift is enough to undo it: setting $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ recovers the original activations. So, roughly, this step lets the network "denormalize" what was just normalized, but only as much as training finds useful rather than canceling out the entire normalization process. This is possible because $\gamma$ and $\beta$ are learned parameters, not fixed hyperparameters.


By ensuring that each layer receives inputs with consistent statistical properties, batch normalization contributes to more efficient training and often leads to better overall performance.
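
The four steps above can be sketched in a few lines of NumPy. This covers only the training-time computation described in this section; the function name `batch_norm_forward`, the input shape `(m, features)`, and the synthetic activations are assumptions for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (m, features)."""
    mu = x.mean(axis=0)                     # 1. per-feature mini-batch mean
    var = x.var(axis=0)                     # 2. per-feature variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift

# Hypothetical activations: 64 examples, 3 features with different means and spreads.
rng = np.random.default_rng(0)
x = rng.normal(loc=[5.0, -2.0, 0.0], scale=[3.0, 0.5, 1.0], size=(64, 3))

y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print("per-feature mean:", y.mean(axis=0).round(4))
print("per-feature std: ", y.std(axis=0).round(4))

# With gamma = sqrt(var + eps) and beta = mean, scale-and-shift undoes the normalization.
eps = 1e-5
y_identity = batch_norm_forward(
    x, gamma=np.sqrt(x.var(axis=0) + eps), beta=x.mean(axis=0), eps=eps
)
print("recovers the input:", np.allclose(y_identity, x))
```

The last check illustrates why the learnable $\gamma$ and $\beta$ preserve representational power: the identity mapping stays reachable if the network needs it.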

#### Visualization

In this animation below, we start by looking at the raw activations of a neural network layer. Each feature has its own distribution, with different means and variances, which can make training unstable.

Batch normalization begins by computing the mean and variance for each feature across the batch. Then, for every activation, we subtract its feature’s mean and divide by its standard deviation (see formula above). This normalization step centers the activations around zero and scales them to have unit variance, helping to stabilize the learning process.

Next, the normalized values are adjusted by applying two learned parameters: $\gamma$ and $\beta$. $\gamma$ scales the normalized activations, while $\beta$ shifts them. This step allows the network to reintroduce flexibility, ensuring that the activations can still adopt the optimal range needed for effective learning.

Overall, batch normalization was originally motivated as a way to reduce internal covariate shift, and in practice it tends to make the training process more robust and efficient. This animation visually demonstrates how the raw, varied activations are first standardized and then adaptively scaled and shifted for improved performance.

In [19]:
rg.plot_batchnorm(filename="batch_norm")

### Layer Normalization

Layer Normalization is a technique designed to stabilize and accelerate training by normalizing the activations across the features of each individual data point rather than across a mini-batch (Batch Normalization). This approach is particularly advantageous in models where the mini-batch size is small or varies significantly—such as in recurrent neural networks and transformer architectures.

For each input data point with activation vector $x = (x_1, x_2, \dots, x_H)$, layer normalization performs the following steps:

1.	Compute the Layer Mean:

    $$
    \mu_L = \frac{1}{H} \sum_{i=1}^{H} x_i
    $$

    Here, $H$ is the number of features in the layer. This step computes the average activation of the current data point, which is subtracted in step 3 to center the activations.

2.	Compute the Layer Variance:
    $$
    \sigma_L^2 = \frac{1}{H} \sum_{i=1}^{H} \left(x_i - \mu_L\right)^2
    $$

    This variance measures the spread of the activations around the mean, ensuring that we capture the dispersion across all features for that single example.

3.	Normalize the Activations:

    $$
    \hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}
    $$

    A small constant $\epsilon$ is added for numerical stability, preventing division by zero. This normalization guarantees that the activations for each data point have zero mean and unit variance.

4.	Scale and Shift:

    $$
    y_i = \gamma \hat{x}_i + \beta
    $$

    The learnable parameters $\gamma$ and $\beta$ allow the network to adjust the normalized output, preserving the layer’s representational power. This step effectively reintroduces flexibility, enabling the network to recover any necessary scaling or shifting of the normalized activations.

By normalizing across the features of each individual data point, layer normalization ensures that every layer receives inputs with a consistent statistical distribution, regardless of the mini-batch composition. This not only stabilizes the gradients during training but also leads to faster convergence. Moreover, because the normalization is independent of other examples in the batch, it is especially well-suited for tasks involving sequential or variable-length data where traditional batch normalization might struggle.
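
The same computation can be sketched in NumPy by taking the statistics along the feature axis instead of the batch axis. The function name `layer_norm_forward`, the `(batch, H)` input shape, and the random input are assumptions for illustration.

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Layer norm for x of shape (batch, H): statistics are computed per row."""
    mu = x.mean(axis=-1, keepdims=True)     # 1. mean over the H features of each example
    var = x.var(axis=-1, keepdims=True)     # 2. variance over the same features
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift (per feature)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 8))   # hypothetical: 4 examples, H = 8
gamma, beta = np.ones(8), np.zeros(8)

y_full = layer_norm_forward(x, gamma, beta)
y_single = layer_norm_forward(x[:1], gamma, beta)

# Checks that an example's output does not depend on the rest of the batch.
print("independent of batch:", np.allclose(y_full[:1], y_single))
print("per-example mean:", y_full.mean(axis=-1).round(4))
print("per-example std: ", y_full.std(axis=-1).round(4))
```

Compared with a batch norm implementation, the only structural change is the axis of the mean and variance, which is what removes the dependence on batch size.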

Layer normalization, therefore, contributes both to improved training dynamics and to a mild regularizing effect, ultimately enhancing model performance in a wide range of applications.

#### Visualization
In this step of the animation, we’re applying layer normalization to each individual sample. Unlike batch normalization—which normalizes across the entire batch—layer norm computes the mean and variance across the features of a single sample. As a result, each data point’s activations are centered around zero and scaled to have unit variance, regardless of batch size.

After normalizing, we use $\gamma$ and $\beta$ to rescale and shift the activations. So you can see the activation values are shifted upward from zero. This learned shift and scale give the network flexibility to find the most effective range for subsequent layers, especially in scenarios where batch normalization might be less effective, such as with very small batch sizes or in recurrent networks.

In [20]:
rg.plot_layernorm(filename="layer_norm")

#### Visualization comparing Batch Norm and Layer Norm

Below, we have Batch Normalization applied to a small batch of just 4 samples, each having 20 features. Each cluster of colored points represents one sample’s 20 features. You’ll see them move from raw activations $x$ to normalized activations $\hat{x}$, and then to the final scaled-shifted values $y_{BN}$. One immediate observation is that because there are only 4 samples, the batch statistics (mean and variance computed across those 4 samples) can be quite noisy. If one sample has an outlier feature, it can significantly shift the overall batch mean and variance. This instability can cause the normalized values to be less predictable or stable from batch to batch, defeating the purpose of BN’s smoothing effect. In a real training setting, you might see the learning process struggle or produce inconsistent gradients. So, while BN typically shines with a moderately large batch, it’s not well-suited here in a ‘small batch, many features’ scenario.

In [21]:
rg.plot_batchnorm_on_small_batch(gamma=1.5, beta=0.5)

Below, we’re looking at Layer Normalization applied to a dataset of 32 samples, each with only 3 features. In the animation, you see each point moving from its original ‘raw’ activation, to a normalized value $\hat{x}$, and finally to the scaled and shifted value $y_{LN}$. But notice how even after normalization, the spread of each feature doesn’t dramatically converge or align; it remains somewhat scattered around different values. That’s because LayerNorm is normalizing across the three features per sample, ignoring the fact that there’s a large batch of 32 samples. With only 3 features, there’s very little internal diversity per sample to push the mean and variance to stable values. The result is that normalizing each sample across just three features doesn’t yield as much benefit here. In other words, LN is not leveraging the fact that we have many samples to get robust statistics, so in this ‘large batch, few features’ setting, LayerNorm isn’t particularly helpful.

In [22]:
rg.plot_layernorm_on_large_batch(gamma=1.5, beta=0.5)

### Dropout

Dropout is a powerful regularization technique that mitigates overfitting by randomly "dropping out" a fraction of the neurons during each training iteration. By temporarily deactivating a random subset of neurons, dropout forces the network to learn more robust, distributed representations that do not rely on any single neuron. This process can be seen as training an ensemble of many smaller networks that share weights, which, when combined, lead to a more generalizable model.

For each training example, dropout works as follows:

1. **Random Mask Generation:**

   For each neuron with activation $x_i$, generate a binary mask $m_i$ sampled from a Bernoulli distribution:
   
   $$
   m_i \sim \text{Bernoulli}(1-p)
   $$
   
   Here, $p$ is the dropout probability (the likelihood that a neuron is "dropped" or set to zero), and $1-p$ is the probability of retaining the neuron. This mask randomly decides which neurons will be active during that forward pass.

2. **Apply Dropout to the Activations:**

   The original activations are then modified by element-wise multiplication with the mask:
   
   $$
   \tilde{x}_i = x_i \cdot m_i
   $$
   
   This operation effectively disables a fraction of the neurons, ensuring that no single neuron can dominate the learning process.

3. **Scaling for Consistent Output:**

   To maintain the same expected value of the activations during both training and inference, the retained activations are scaled by a factor of $\frac{1}{1-p}$:
   
   $$
   y_i = \frac{\tilde{x}_i}{1-p} = \frac{x_i \cdot m_i}{1-p}
   $$
   
   This scaling step compensates for the reduced number of active neurons, ensuring that the overall magnitude of the activations remains consistent when dropout is not applied during inference.
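
The three steps above can be sketched as a single NumPy function. The function name `dropout_forward`, its `training` flag, and the all-ones input are assumptions for illustration; the scaling follows the $\frac{1}{1-p}$ rule described above.

```python
import numpy as np

def dropout_forward(x, p, rng, training=True):
    """Dropout with rescaling: drop each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                        # at inference nothing is dropped or rescaled
    mask = rng.random(x.shape) >= p     # 1. Bernoulli(1 - p) keep-mask
    return x * mask / (1.0 - p)         # 2. apply mask, 3. rescale

rng = np.random.default_rng(0)
x = np.ones(100_000)                    # hypothetical activations, all equal to 1
out = dropout_forward(x, p=0.3, rng=rng)

print("fraction dropped:  ", np.mean(out == 0))
print("mean after dropout:", out.mean())
print("inference unchanged:", np.array_equal(dropout_forward(x, 0.3, rng, training=False), x))
```

Because the survivors are scaled up during training, the average activation stays close to its original value, so the inference path can simply return the input untouched.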

**Intuition Behind Dropout**

- **Preventing Co-adaptation:** By randomly omitting neurons, dropout discourages the network from relying on specific paths or co-adaptations between neurons. Each neuron must learn to function well in a variety of contexts, enhancing the model's robustness.
- **Implicit Ensemble Learning:** Training with dropout can be viewed as training an ensemble of many smaller networks that share the same weights. At test time, when all neurons are active (no extra scaling is needed, because the activations were already rescaled during training), the network approximately averages the predictions of these sub-networks, which typically improves generalization.
- **Regularizing Effect:** The randomness introduced by dropout serves as a form of noise injection during training, which can help the network escape local minima and improve its ability to generalize to unseen data.

Much like layer normalization, dropout addresses issues that arise during training—but from a different perspective. While layer normalization stabilizes the training dynamics by normalizing activations within each data point, dropout regularizes the model by ensuring that it does not become overly dependent on any one feature. This complementary approach often leads to improved performance, especially in large and complex networks where overfitting is a significant concern.

In summary, dropout is a straightforward yet effective technique that enhances model generalization through stochastic training, effectively building an ensemble of sub-networks that, when combined, produce a robust, well-regularized model.


#### Visualization

In this visualization, each point on the scatter plot represents a neuron's activation value, with the neuron index on the x-axis and the activation magnitude on the y-axis. Initially, we see the original activation distribution—some neurons are higher, some are lower—reflecting the natural variance in outputs.

As the animation progresses, a portion of neurons abruptly drops to zero, illustrating the core idea of **dropout**: randomly “turning off” neurons according to a specified probability.

In the second phase, the active neurons are rescaled by a factor of $\frac{1}{1 - p}$ to maintain a comparable overall output magnitude despite having fewer active neurons. By the end of the animation, the rescaled **dropout** output lands in roughly the same range as the original. 

This demonstrates how **dropout** balances neuron deactivation with scaling, helping prevent over-reliance on individual neurons and thereby promoting better generalization in the model.

In [23]:
rg.plot_dropout("dropout")

### Label Smoothing

Label smoothing is a regularization technique that helps prevent models from becoming overly confident in their predictions. Instead of using hard, one-hot encoded labels, label smoothing replaces them with a softened version of the target distribution. This approach can lead to improved generalization, better calibration of predicted probabilities, and enhanced robustness, especially in classification tasks.

For a classification problem with $K$ classes, the conventional one-hot encoded target label for a given example is defined as:

$$
y_k =
\begin{cases}
1 & \text{if } k \text{ is the correct class}, \\
0 & \text{otherwise}.
\end{cases}
$$

With label smoothing, the target distribution is adjusted by assigning the correct class a probability of $1 - \epsilon + \frac{\epsilon}{K}$ and all other classes a uniform probability of $\frac{\epsilon}{K}$, where $\epsilon$ is a small constant (e.g., 0.1). The modified target label $\tilde{y}_k$ becomes:

$$
\tilde{y}_k =
\begin{cases}
1 - \epsilon + \frac{\epsilon}{K} & \text{if } k \text{ is the correct class}, \\
\frac{\epsilon}{K} & \text{otherwise}.
\end{cases}
$$

This smoothing reduces the gap between the target probability and the probabilities of other classes, thereby discouraging the model from becoming too confident about a single class.
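
The smoothed targets defined above are equivalent to $(1-\epsilon)\,y_k + \frac{\epsilon}{K}$, which is straightforward to compute. The sketch below builds them and compares the cross-entropy against hard and smoothed targets for two very confident predictions. The helper names `smooth_labels` and `cross_entropy`, the labels, and the logits are assumptions for illustration.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Smoothed targets: (1 - eps) * one_hot + eps / K."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def cross_entropy(targets, logits):
    """Mean cross-entropy between target distributions and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

labels = np.array([2, 0])                       # hypothetical labels, K = 4 classes
targets = smooth_labels(labels, num_classes=4, eps=0.1)
print(targets)

# Hypothetical, very confident logits for the correct classes.
logits = np.array([[0.0, 0.0, 10.0, 0.0],
                   [8.0, 0.0, 0.0, 0.0]])
print("hard-label CE:    ", cross_entropy(np.eye(4)[labels], logits))
print("smoothed-label CE:", cross_entropy(targets, logits))
```

With hard targets, near-certain predictions are rewarded with a loss close to zero. With smoothed targets the same predictions incur a noticeably larger loss, which is the mechanism that discourages overconfidence.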

**Intuition**

- **Mitigating Overconfidence:** By softening the target labels, label smoothing prevents the model from assigning an absolute probability of 1 to the correct class. This reduction in confidence helps avoid overfitting, as the model learns to distribute its predictions more evenly across classes.
- **Improved Calibration:** Models trained with label smoothing tend to produce probability estimates that better reflect the true uncertainty in predictions. This improved calibration is especially beneficial in applications where the confidence of predictions is crucial.
- **Robustness to Noisy Labels:** In scenarios where the training data may contain mislabeled examples, label smoothing can reduce the adverse impact of such noise by not forcing the model to fully trust any single label.
- **Complementary Regularization:** Similar to dropout, label smoothing acts as a regularizer. It encourages the network to form a more balanced view of the class distribution, thereby producing smoother decision boundaries and reducing the risk of overfitting.

In summary, label smoothing serves as an effective regularization method that softens the target distribution, leading to models that are less prone to overconfidence, better calibrated, and more robust when faced with noisy or ambiguous data.


#### Visualization

In this visualization, we’re looking at how label smoothing changes the target distribution. Originally, the label was ‘one‐hot,’ meaning all probability mass was on the correct class (index 2). As the smoothing parameter $\epsilon$ increases from 0 toward 1, the target probability for the correct class is lowered, and the removed probability mass is spread uniformly across the classes; at $\epsilon = 1$ the target becomes the uniform distribution with probability $\frac{1}{K}$ per class. This helps prevent overconfidence, encouraging the model to be more robust to noise and reducing the risk of overfitting.

In [24]:
rg.plot_label_smoothing(filename="label_smoothing")

#### Visualization comparing with and without label smoothing
In the visualization below, we see both models’ final decision boundaries on the left and their test accuracy curves on the right. In this run, the label-smoothing model shows better test accuracy during training than the model without smoothing, and its decision boundary converges faster when separating the blue and red dots. Label smoothing typically leads to more stable training and less overconfidence, even though here both models end up with a similar final accuracy.

In [25]:
rg.plot_label_smoothing_comparison(filename="label_smoothing2", n_frames=200)