<a href="https://colab.research.google.com/github/rida-manzoor/DL/blob/main/Loss_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is a loss function?**

A loss function evaluates how well your algorithm models your dataset. A high loss means the model is fitting the data poorly, so training aims to make it as low as possible.
A loss function, also known as a cost function or objective function, is a crucial component of model training. It measures the difference between the model's predicted output and the ground-truth labels or targets. By quantifying how well or poorly the model performs on a given dataset, it provides the feedback that guides the optimization process during training.

**Loss Functions in DL**

Some of the main types of loss functions are:

1. **Regression**

    1. MSE
    2. MAE
    3. Huber loss
2. **Classification**

    1. Binary Crossentropy
    2. Categorical Crossentropy
    3. Hinge Loss

3. **AutoEncoder**
    1. KL Divergence

4. **GAN**
    1. Discriminator loss
    2. Minimax GAN loss

5. **Embeddings**
    1. Triplet Loss

6. **Object Detection**
    1. Focal Loss

**Loss Function vs Cost Function**

>  **Loss function:** Used when we refer to the error for a single training example. **Cost function:** Used to refer to the average of the loss over the entire training dataset.
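
To make the distinction concrete, here is a minimal sketch in plain Python. The names `squared_error` and `cost` are made up for this illustration, and squared error is just one possible choice of per-example loss.

```python
# Illustrative helpers (hypothetical names, not from any library).

def squared_error(y_true, y_pred):
    """Loss: the error for a single training example."""
    return (y_true - y_pred) ** 2


def cost(y_true, y_pred, loss_fn):
    """Cost: the average of the per-example losses over the dataset."""
    losses = [loss_fn(t, p) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

per_example = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
total_cost = cost(y_true, y_pred, squared_error)

print(per_example)  # [1.0, 0.0, 4.0]
print(total_cost)   # the mean of the three losses above
```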

## **Mean Squared Error / L2 error**

MSE stands for Mean Squared Error, which is a commonly used loss function in regression tasks. It measures the average squared difference between the predicted values and the actual target values in a dataset. The MSE is calculated by taking the average of the squared differences between each predicted value and its corresponding actual value.

$$ (y_i - \hat{y}_i)^2 $$


*Advantages*
1. Easy to interpret
2. Always differentiable
3. Convex in the prediction, so it has a single minimum

*Disadvantages*
1. Error unit is squared
2. It is not robust to outliers

> The activation function of the last neuron should be linear.

Key points about MSE:

- **Squared Differences**: MSE calculates the squared difference between each predicted value and its corresponding actual value. Squaring the differences penalizes larger errors more heavily than smaller errors.
  
- **Non-Negative Value**: MSE is always non-negative since it involves squaring the differences. A value of 0 indicates perfect predictions, where the predicted values exactly match the actual values.
- **Loss Function**: In machine learning models, MSE is often used as a loss function during training. The goal is to minimize the MSE, which means reducing the average squared difference between predictions and targets.
- **Impact of Outliers**: MSE can be sensitive to outliers in the data, as larger errors (due to outliers) contribute significantly to the overall loss.
- **Interpretation**: The square root of MSE (RMSE, Root Mean Squared Error) is sometimes used for easier interpretation, as it is in the same scale as the target variable.
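
The points above can be checked with a small sketch. It uses plain Python so it runs anywhere; the function names `mse` and `rmse` and the toy data are assumptions made for this example.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]  # only the last prediction is off, by 4

print(mse(y_true, y_pred))   # 4.0  (16 / 4: one squared error dominates)
print(rmse(y_true, y_pred))  # 2.0
print(mse(y_true, y_true))   # 0.0  (perfect predictions)
```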

## **Mean Absolute Error / L1 error**


Mean Absolute Error (MAE) is another commonly used loss function in regression tasks, similar to Mean Squared Error (MSE). However, unlike MSE which calculates the average squared difference between predicted and actual values, MAE calculates the average absolute difference. This means that MAE measures the average magnitude of errors without considering their direction (positive or negative).

$$ |y_i - ŷ_i| $$


*Advantages*
1. Easy to understand
2. Unit is the same as that of `y`
3. Robust to outliers

*Disadvantages*
1. Not differentiable at zero error (sub-gradients have to be used there)

Key points about MAE:

- **Absolute Differences**: MAE calculates the absolute difference between each predicted value and its corresponding actual value. This means that errors are not squared, and both positive and negative errors contribute equally to the overall loss.
- **Robustness to Outliers**: MAE is more robust to outliers compared to MSE, as it does not heavily penalize large errors. Outliers have a linear impact on MAE, unlike MSE where their impact is quadratic.
- **Interpretation**: The average absolute difference calculated by MAE is easier to interpret in the context of the problem, as it represents the average magnitude of errors in the predictions.
- **Loss Function**: Like MSE, MAE can also be used as a loss function during training regression models. The goal is to minimize the MAE, indicating a reduction in the average absolute difference between predicted and actual values.
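
A quick way to see the robustness claim is to add one outlier to a toy dataset and compare how much each metric grows. This is only a sketch: the helper names and the numbers are invented for illustration.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared differences, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
pred_clean = [1.5, 2.5, 2.5, 4.5]     # every error is 0.5
pred_outlier = [1.5, 2.5, 2.5, 14.0]  # last error is 10

mae_clean, mae_outlier = mae(y_true, pred_clean), mae(y_true, pred_outlier)
mse_clean, mse_outlier = mse(y_true, pred_clean), mse(y_true, pred_outlier)

# The outlier's effect is linear for MAE but quadratic for MSE.
print(mae_clean, mae_outlier)  # 0.5 2.875
print(mse_clean, mse_outlier)  # 0.25 25.1875
```

Here one bad prediction raises MAE by a factor of about 6, while MSE grows by a factor of about 100.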

## **Huber Loss**

Huber Loss, also known as Smooth Mean Absolute Error, is a loss function used in regression tasks that combines the benefits of Mean Absolute Error (MAE) and Mean Squared Error (MSE). It is designed to be more robust to outliers than MSE while remaining smooth around zero error, which MAE is not.
 
 $$ L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta, \\ \delta \cdot \left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise.} \end{cases} $$

If the data point is an outlier, Huber loss behaves like MAE; if it is not an outlier, it behaves like MSE. **Delta** (\( \delta \)) is a hyperparameter that sets the error threshold above which a data point is treated as an outlier.

Key points about Huber loss:
1. **Smooth Transition:** Huber loss smoothly transitions between the quadratic loss (MSE) for small errors and the linear loss (MAE) for large errors. This makes it less sensitive to outliers while still penalizing large errors effectively.
2. **Threshold Parameter (\( \delta \)):** The choice of \( \delta \) affects the behavior of the loss function. A smaller \( \delta \) makes Huber loss more robust to outliers but may result in slower convergence, while a larger \( \delta \) can lead to faster convergence but may be less robust to outliers.
3. **Differentiable:** Huber loss is differentiable everywhere, including at the point where it switches between the quadratic and linear components. This property is beneficial for gradient-based optimization algorithms used in training neural networks.
4. **Usage:** Huber loss is commonly used in machine learning models, especially in situations where the dataset contains noisy or outlier-prone data. It strikes a balance between robustness to outliers and convergence speed during training.
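
The piecewise definition translates almost directly into code. Below is a minimal per-example sketch, assuming a default of `delta=1.0` (an arbitrary choice for this example) and an illustrative function name `huber`.

```python
def huber(y_true, y_pred, delta=1.0):
    """Huber loss for a single example (delta=1.0 is an assumed default)."""
    error = abs(y_true - y_pred)
    if error <= delta:
        # Small error: quadratic, like MSE.
        return 0.5 * error ** 2
    # Large error: linear, like MAE.
    return delta * (error - 0.5 * delta)


print(huber(0.0, 0.5))  # 0.125 (quadratic branch)
print(huber(0.0, 3.0))  # 2.5   (linear branch)
print(huber(0.0, 1.0))  # 0.5   (both branches agree at |error| == delta)
```

Both branches give the same value at the threshold, which is why the loss has no jump where it switches from quadratic to linear.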


## **Binary Cross-Entropy / Log Loss**

Binary Cross-Entropy Loss (also known as Log Loss or Binary Log Loss) is a loss function commonly used in binary classification tasks. It measures the difference between the predicted probabilities and the actual binary labels (0 or 1) for each sample in the dataset. The goal is to minimize this difference, which is crucial for training accurate binary classification models.

$$ -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) $$

> The activation function of the last layer should be sigmoid.

*Advantages*
1. Differentiable

*Disadvantages*
1. The loss surface of a neural network trained with it can have multiple local minima
2. Not intuitive

Key points about Binary Cross-Entropy Loss:

- **Probability-Based Loss**: Binary Cross-Entropy Loss is based on the predicted probabilities of the positive class (class 1) for each sample. It penalizes the model more heavily for incorrect predictions that are confidently wrong (high predicted probability for the wrong class).
- **Logarithmic Function**: Because of the logarithm in the formula, the loss grows without bound as the predicted probability of the true class approaches 0. This amplifies the impact of confident incorrect predictions.
- **Differentiability**: The Binary Cross-Entropy Loss function is differentiable with respect to the model's parameters, making it suitable for gradient-based optimization algorithms like stochastic gradient descent (SGD) during model training.
- **Suitability**: Binary Cross-Entropy Loss is well-suited for binary classification tasks where the goal is to classify inputs into two mutually exclusive classes (e.g., positive/negative, yes/no, spam/not spam).
- **Output Activation**: Typically, the output layer of the neural network in binary classification tasks uses a sigmoid activation function, which ensures that the predicted values are in the range [0, 1] representing probabilities.
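
The sketch below shows the formula in plain Python. The function names are illustrative, and the clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula itself.

```python
import math


def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss; eps (an assumed value) keeps log() finite."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip away exact 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# True label is 1 in both cases; only the predicted probability differs.
confident_right = binary_cross_entropy([1], [0.9])
confident_wrong = binary_cross_entropy([1], [0.1])

print(confident_right)  # small loss
print(confident_wrong)  # much larger loss
print(sigmoid(0.0))     # 0.5
```

A confident wrong prediction (0.1 for a positive sample) is penalized far more than a confident right one (0.9).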

## **Categorical Cross-Entropy**

Categorical Cross-Entropy Loss (also known as Multiclass Cross-Entropy Loss) is a loss function commonly used in multiclass classification tasks. It measures the difference between the predicted class probabilities and the actual one-hot encoded class labels for each sample in the dataset. The goal is to minimize this difference, which is essential for training accurate multiclass classification models.

$$ -\sum_{j} y_j \log \hat{y}_j $$

> The activation function of the output layer should be softmax.

Key points about Categorical Cross-Entropy Loss:

- **Probability-Based Loss**: Categorical Cross-Entropy Loss is based on the predicted probabilities of each class for each sample. It penalizes the model more heavily for incorrect predictions that are confidently wrong (high predicted probability for the wrong class).
- **Logarithmic Function**: Because of the logarithm in the formula, the loss grows without bound as the predicted probability of the true class approaches 0. This amplifies the impact of confident incorrect predictions.
- **Differentiability**: The Categorical Cross-Entropy Loss function is differentiable with respect to the model's parameters, making it suitable for gradient-based optimization algorithms like stochastic gradient descent (SGD) during model training.
- **Suitability**: Categorical Cross-Entropy Loss is well-suited for multiclass classification tasks where the goal is to classify inputs into multiple mutually exclusive classes (e.g., different types of objects, sentiment categories).
- **Output Activation**: Typically, the output layer of the neural network in multiclass classification tasks uses a softmax activation function, which ensures that the predicted values are probabilities that sum up to 1 across all classes.
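
Here is a small sketch of softmax followed by categorical cross-entropy for a single sample. The function names, the logits, and the `eps` guard against `log(0)` are all assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample; y_true is one-hot, eps is an assumed guard."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


probs = softmax([2.0, 1.0, 0.1])  # class 0 gets the highest probability

loss_if_class0 = categorical_cross_entropy([1, 0, 0], probs)
loss_if_class2 = categorical_cross_entropy([0, 0, 1], probs)

print(sum(probs))      # approximately 1.0
print(loss_if_class0)  # low: the model favours the true class
print(loss_if_class2)  # higher: the model gave the true class little mass
```

With a one-hot label, only the term for the true class survives, so the loss reduces to the negative log of the probability assigned to that class.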

## **Hinge Loss**

Hinge Loss is a loss function commonly used in binary classification tasks, especially in the context of support vector machines (SVMs) and margin-based classifiers. It is designed to penalize the model based on the margin between the predicted decision boundary and the true class labels. Hinge Loss encourages the model to correctly classify samples with a margin greater than a specified threshold, promoting better separation between classes.


$$ L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i) $$

In this formula:
- \( L(y, \hat{y}) \) represents the Hinge Loss between the true labels \( y \) and the predicted scores \( \hat{y} \).
- \( N \) is the number of samples in the dataset.
- \( y_i \) is the true class label for the \( i \)-th sample, typically -1 or 1 in binary classification tasks.
- \( \hat{y}_i \) is the predicted score (output of the decision function) for the \( i \)-th sample.
- \( \max(0, 1 - y_i \cdot \hat{y}_i) \) calculates the hinge loss for each sample, penalizing the model based on the margin between the predicted score and the true label.

Key points about Hinge Loss:

- **Margin-Based Loss**: Hinge Loss is a margin-based loss function. It penalizes the model whenever \( y_i \cdot \hat{y}_i < 1 \), that is, when a sample is misclassified or is correctly classified but falls inside the margin. The margin is typically set to 1 in binary classification tasks.
- **Sparse Margin**: Hinge Loss encourages a sparse margin between classes, meaning that the model aims to correctly classify samples with a margin greater than 1 while ignoring correctly classified samples that are already sufficiently far from the decision boundary.
- **Support Vector Machines (SVMs)**: Hinge Loss is commonly used in SVMs as the optimization objective for maximizing the margin between classes while minimizing classification errors.
- **Robust to Outliers**: Hinge Loss is less sensitive to outliers compared to other loss functions like Cross-Entropy Loss. Outliers that are correctly classified with a large margin do not contribute significantly to the loss.
- **Output Scaling**: In practice, the output of the model's decision function (e.g., raw scores) may need to be scaled or calibrated to ensure that the margin and loss calculations are meaningful.
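
The formula can be sketched in a few lines of plain Python. The function name `hinge_loss` and the toy scores are made up for this example; labels are assumed to be -1 or 1, as stated above.

```python
def hinge_loss(y_true, scores):
    """Average hinge loss; labels are -1 or 1, scores are raw outputs."""
    terms = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(terms) / len(terms)


y_true = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.0, -3.0]
# Per-sample terms:
#   y=1,  s=2.0  -> 0.0  (correct, outside the margin)
#   y=-1, s=-0.5 -> 0.5  (correct, but inside the margin)
#   y=1,  s=-1.0 -> 2.0  (misclassified)
#   y=-1, s=-3.0 -> 0.0  (correct, outside the margin)

loss = hinge_loss(y_true, scores)
print(loss)  # 0.625
```

Samples that are correct with a margin of at least 1 contribute nothing, which matches the "Sparse Margin" point above.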

## **KL Divergence**

The Kullback-Leibler (KL) Divergence, also known as relative entropy, is a measure of how one probability distribution diverges from a second, expected probability distribution. It is often used in information theory and machine learning to quantify the difference between two probability distributions.


For discrete distributions:
$$ D_{KL}(P || Q) = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \right) $$

For continuous distributions:
$$ D_{KL}(P || Q) = \int_{-\infty}^{\infty} P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx $$

In these formulas:
- \( D_{KL}(P || Q) \) represents the KL Divergence from distribution \( P \) to distribution \( Q \).
- \( P(x) \) and \( Q(x) \) are probability distributions over the variable \( x \).
- \( \log \) represents the natural logarithm.
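
The discrete formula can be sketched as follows. The function name and the two toy distributions are assumptions for this example, and the code assumes \( Q(x) > 0 \) wherever \( P(x) > 0 \); terms with \( P(x) = 0 \) are skipped because they contribute nothing.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) using the natural logarithm.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.5]
Q = [0.9, 0.1]

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

print(kl_pq)                # roughly 0.51
print(kl_qp)                # roughly 0.37: KL divergence is not symmetric
print(kl_divergence(P, P))  # 0.0 when the distributions are identical
```

Because swapping the arguments changes the result, KL divergence is not a true distance metric.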