# **Softmax function**

The softmax function converts a vector of real numbers into a probability distribution.


**Mathematical Definition:**

For a vector $z$ with $K$ elements, the softmax function is defined as:


   $$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

**Key Properties:**


- Outputs are always between 0 and 1
- The sum of all outputs equals 1
- Preserves the relative ordering of the input elements
- Differentiable, making it suitable for gradient-based optimisation


**Common Applications:**
- Neural Network Output Layer:
  - Used in the final layer of classification networks
  - Converts raw scores into class probabilities


- Attention Mechanisms:

  - Used in transformer models to compute attention weights
  - Helps in determining the importance of different input elements


- Reinforcement Learning:

  - Converting action scores into action probabilities
  - Used to define stochastic policies (e.g., in policy gradient methods)
  
**Key Implementation Note:**
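When implementing softmax, it is common to subtract the maximum value from all inputs first. The result is unchanged, because the common factor $e^{-\max(z)}$ cancels between numerator and denominator, but every exponent is now at most 0, which prevents overflow in the exponential: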

  $$ \text{softmax}(z_i) = \frac{e^{z_i-\max(z)}}{\sum_{j=1}^{K} e^{z_j-\max(z)}} $$
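
The max-subtraction trick can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not how any particular framework implements it; the function names `softmax` and `naive_softmax` are chosen here for the example.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)                              # largest input
    exps = [math.exp(v - m) for v in z]     # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

def naive_softmax(z):
    """Direct translation of the definition; overflows for large inputs."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)    # three values in (0, 1) that sum to 1, largest for the input 3.0

# Shifting every input by the same constant leaves the result unchanged.
shifted = softmax([1001.0, 1002.0, 1003.0])

# The naive version fails on the shifted inputs: math.exp(1001.0) raises OverflowError.
try:
    naive_softmax([1001.0, 1002.0, 1003.0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

The stable version returns the same probabilities for `[1, 2, 3]` and `[1001, 1002, 1003]`, while the naive version cannot evaluate the second input at all.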

# **PyTorch's commonly used loss functions**

- **Classification Loss Functions:**
Used when the target is a class label or probability distribution over classes.
  1.   Cross Entropy Loss (nn.CrossEntropyLoss)
  2.   Binary Cross Entropy Loss (nn.BCELoss)
  3.   NLLLoss (Negative Log-Likelihood Loss)
  4.   KLDivLoss (Kullback-Leibler Divergence Loss)

- **Regression Loss Functions:**
  5.   Mean Squared Error Loss (nn.MSELoss)
  6.   L1 Loss (nn.L1Loss)
  7.   SmoothL1Loss (Huber Loss)
  


## Cross Entropy Loss (nn.CrossEntropyLoss)

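  $$\text{CrossEntropyLoss} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$


- where $y_i$ is the true label for class $i$ (1 for the correct class, 0 otherwise), $\hat{y}_i$ is the predicted probability for class $i$, and $C$ is the number of classes.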
- Use case: `Multi-class classification`.
- Combines `LogSoftmax` and `NLLLoss` (negative log-likelihood loss) in one step.
- Applies log-softmax internally, so pass raw logits and do not add a softmax layer before it.
- Input: logits (raw, unnormalised scores) of shape (N, C), where N is the batch size and C is the number of classes.
- Target: class indices (integers from 0 to C−1).
- Target shape: (batch_size, ...), i.e. (N) for an input of shape (N, C).
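
The relationship between `CrossEntropyLoss`, `LogSoftmax` and `NLLLoss` can be checked numerically. The sketch below assumes an arbitrary batch of 4 samples and 3 classes with random logits; the sizes and target values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

N, C = 4, 3                                 # batch size, number of classes (arbitrary)
logits = torch.randn(N, C)                  # raw, unnormalised scores
target = torch.tensor([0, 2, 1, 2])         # class indices in [0, C-1]

# One-step version: pass logits directly, with no softmax beforehand.
ce = nn.CrossEntropyLoss()(logits, target)

# Two-step version: log-softmax over the class dimension, then NLLLoss.
log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, target)

print(ce.item(), nll.item())                # the two values agree up to rounding
```

Applying a softmax layer before `CrossEntropyLoss` would normalise the scores twice and give a different loss, which is why the raw logits are passed in.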





## Binary Cross Entropy Loss (nn.BCELoss)

$$\text{BCELoss} = - \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$$

- Used for `binary classification`
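- Expects probabilities between 0 and 1 as input, typically the output of a sigmoid activation
- For raw logits, use `BCEWithLogitsLoss`, which applies the sigmoid internally and is more numerically stable
- Input/target shape: (batch_size, ...), with the target the same shape as the input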
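
As a small check of the formula, the sketch below compares `BCELoss` on sigmoid outputs with `BCEWithLogitsLoss` on the raw scores and with a manual evaluation. The three logits and labels are made-up values.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])   # raw scores, shape (batch_size, 1)
target = torch.tensor([[1.0], [0.0], [1.0]])    # float labels, same shape as the input

# BCELoss expects probabilities, so apply the sigmoid first.
probs = torch.sigmoid(logits)
bce = nn.BCELoss()(probs, target)

# BCEWithLogitsLoss takes the raw scores and applies the sigmoid internally.
bce_logits = nn.BCEWithLogitsLoss()(logits, target)

# Manual evaluation of the formula, averaged over the batch.
manual = -(target * torch.log(probs) + (1 - target) * torch.log(1 - probs)).mean()

print(bce.item(), bce_logits.item(), manual.item())   # all three agree up to rounding
```

Note that the targets are floating-point tensors here, unlike the integer class indices used with `CrossEntropyLoss`.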

## NLLLoss (Negative Log-Likelihood Loss)


$$\text{NLLLoss} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(\hat{y}_{n,i})$$

- N is the batch size and C is the number of classes; $y_{n,i}$ is 1 if sample $n$ belongs to class $i$ and 0 otherwise, and $\hat{y}_{n,i}$ is the predicted probability of class $i$ for sample $n$
- Use case: `Multi-class classification`
- Expects log-probabilities as input, typically the output of `LogSoftmax`
- Less often used directly, because `CrossEntropyLoss` applies `LogSoftmax` and `NLLLoss` in one step
- Input shape: (batch_size, num_classes, ...)
- Target shape: (batch_size, ...)
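
To make the formula concrete, the sketch below shows that `NLLLoss` simply picks the log-probability of the true class for each sample, negates it, and averages over the batch. The logits and targets are made-up values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])        # shape (batch_size=2, num_classes=3)
target = torch.tensor([0, 2])                   # true class index per sample

log_probs = F.log_softmax(logits, dim=1)        # NLLLoss expects log-probabilities
loss = nn.NLLLoss()(log_probs, target)

# Manual version: select the log-probability of the true class for each sample,
# negate it, and average over the batch.
picked = log_probs[torch.arange(len(target)), target]
manual = -picked.mean()

print(loss.item(), manual.item())               # the two values agree up to rounding
```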

## KLDivLoss (Kullback-Leibler Divergence Loss)


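$$\text{KLDivLoss} = \sum_{i=1}^{C} y_i \log\left(\frac{y_i}{\hat{y}_i}\right)$$

- where $y$ is the target distribution and $\hat{y}$ is the predicted distribution over $C$ outcomes
- Measures the difference between two probability distributions
- Used in distillation and probabilistic learning
- Input: log-probabilities (e.g., the output of `log_softmax`)
- Target: probabilities (must sum to 1)
- Pass `reduction="batchmean"` to get the sum above averaged over the batch; the default `"mean"` averages over every element instead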
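
The input and target conventions are easy to mix up, so here is a minimal sketch. The "student" and "teacher" logits are hypothetical values in the spirit of a distillation setup, not taken from any real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two hypothetical sets of logits, e.g. a student and a teacher in distillation.
student_logits = torch.tensor([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])
teacher_logits = torch.tensor([[1.5, 1.5, 3.0], [2.0, 0.0, 0.0]])

log_q = F.log_softmax(student_logits, dim=1)    # input: log-probabilities
p = F.softmax(teacher_logits, dim=1)            # target: probabilities

# reduction="batchmean" sums over classes and divides by the batch size.
kl = nn.KLDivLoss(reduction="batchmean")(log_q, p)

# Manual evaluation: sum_i p_i * (log p_i - log q_i), averaged over the batch.
manual = (p * (torch.log(p) - log_q)).sum(dim=1).mean()

# The divergence of a distribution from itself is zero.
kl_same = nn.KLDivLoss(reduction="batchmean")(torch.log(p), p)

print(kl.item(), manual.item(), kl_same.item())
```

The first argument is the prediction in log space and the second is the target in probability space; swapping them gives a different (and generally incorrect) value.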

## Mean Squared Error Loss (nn.MSELoss)


$$\text{MSELoss} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

- Used for regression problems
- Measures average squared difference between predictions and targets
- More sensitive to outliers than L1Loss
- Input/target shape: any matching shapes

## L1Loss (Mean Absolute Error Loss)


$$\text{L1Loss} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

- Used for regression
- More robust to outliers than MSELoss, because errors are penalised linearly rather than quadratically
- Input/target shape: any matching shapes
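
The difference in outlier sensitivity between `MSELoss` and `L1Loss` shows up clearly on a toy example. In the sketch below, three predictions are exact and one is off by 10; the numbers are made up for illustration.

```python
import torch
import torch.nn as nn

target = torch.tensor([1.0, 2.0, 3.0, 4.0])
pred = torch.tensor([1.0, 2.0, 3.0, 14.0])      # last prediction is off by 10

mse = nn.MSELoss()(pred, target)                # mean of squared errors: 100 / 4
mae = nn.L1Loss()(pred, target)                 # mean of absolute errors: 10 / 4

print(mse.item(), mae.item())                   # 25.0 2.5
```

A single bad prediction contributes 100 to the squared-error sum but only 10 to the absolute-error sum, so it dominates the MSE far more than the L1 loss.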

## SmoothL1Loss (Huber Loss)


$$\text{SmoothL1Loss} = \begin{cases}
    0.5 (y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| < 1 \\
    |y_i - \hat{y}_i| - 0.5, & \text{otherwise}
\end{cases}$$

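- Use case: regression tasks that need robustness to outliers; combines L1 and L2 loss.
- Behaves like L2 loss for small errors (absolute error below 1) and like L1 loss for large errors.
- Matches Huber loss (`nn.HuberLoss`) when both use their default threshold of 1.
- Used in object detection (e.g., Faster R-CNN) and other regression tasks requiring robustness.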
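
The piecewise formula can be checked against PyTorch element by element. The sketch below assumes the default threshold of 1 (the `beta` argument of `nn.SmoothL1Loss`) and uses made-up predictions with two small and two large errors.

```python
import torch
import torch.nn as nn

target = torch.zeros(4)
pred = torch.tensor([0.5, -0.5, 3.0, -10.0])    # two small errors, two large ones

# reduction="none" returns the per-element loss instead of the batch mean.
per_element = nn.SmoothL1Loss(reduction="none")(pred, target)

# Manual evaluation of the piecewise formula with threshold 1.
diff = (pred - target).abs()
manual = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)

print(per_element.tolist())                     # [0.125, 0.125, 2.5, 9.5]
```

The error of 10 costs 9.5 here, compared with 100 under a squared-error loss, while the small errors are still penalised quadratically.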







**Tips for Choosing Loss Functions:**

- Classification Tasks:

    - Binary: Use BCEWithLogitsLoss
    - Multi-class: Use CrossEntropyLoss
    - Multi-label: Use BCEWithLogitsLoss


- Regression Tasks:

    - General purpose: MSELoss
    - Outlier-robust: L1Loss or SmoothL1Loss


- Distribution Matching:

    - Comparing probability distributions (e.g., distillation): KLDivLoss