<a href="https://colab.research.google.com/github/taskswithcode/probability_for_ml_notebooks/blob/main/ProbForML_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a notebook link for the video [What is the loss function used in language models like ChatGPT?](https://youtu.be/LOh5-LTdosU)

To recap, entropy assigns a number to the information content of an event.

- High probability events have low surprise value (so low entropy).
- Low probability events have high surprise value (so high entropy).
- Numerically, **the entropy of an event** is the **negative logarithm of its probability**.

The **entropy of a probability distribution** is the sum of the entropies of the individual events, each **weighted** by the probability of that event.
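The two definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names `surprisal` and `entropy` are made up for this sketch, and base-2 logarithms are assumed so that the values are in bits.

```python
import math

def surprisal(p):
    """Entropy of a single event: the negative log (base 2) of its probability."""
    return -math.log2(p)

def entropy(dist):
    """Entropy of a distribution: each event's entropy weighted by its own probability."""
    # Events with zero probability contribute nothing to the sum, so skip them.
    return sum(p * surprisal(p) for p in dist if p > 0)

print(round(surprisal(0.9), 3))        # high probability event, low surprise: 0.152
print(round(surprisal(0.1), 3))        # low probability event, high surprise: 3.322
print(round(entropy([0.9, 0.1]), 3))   # 0.469
print(round(entropy([0.5, 0.5]), 3))   # 1.0
```

The value 0.469 for the distribution `[0.9, 0.1]` reappears later in this notebook as the lowest cross entropy a prediction can reach.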

The **goal of this notebook** is to illustrate **why cross entropy can be used as a loss function**.

- Imagine a model is learning to predict the distribution for some input. For example, the model predicts 80% cat and 20% dog for a given image.

- Let's use these predictions to compute the entropy of each event, and then weight each event by its true probability, say 90% cat and 10% dog for that image.

- This value, called **cross entropy**, is lowest only if the predicted probabilities are the same as the true probabilities, which is not the case here.

- That is, **cross entropy is lowest only when the predicted distribution is the same as the true distribution for an input**, which is 90% cat and 10% dog in our example.

- So **cross entropy**, **which is a single number**, captures **how far off** the model's **predicted distribution** is from the **true distribution**.

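- Note that when we compute the entropy of a probability distribution, each event is weighted by its own probability. For cross entropy, the entropy of each event comes from the predicted probability, while the weight comes from the true probability.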
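The steps in the list above can be traced by hand before wrapping them in a function. The sketch below is one way to do that; the dictionary layout and the variable names (`p_true`, `q_pred`, `total`) are choices made for this illustration only.

```python
import math

p_true = {"cat": 0.9, "dog": 0.1}   # true distribution
q_pred = {"cat": 0.8, "dog": 0.2}   # predicted distribution

total = 0.0
for event in p_true:
    # Step 1: entropy of the event, computed from the PREDICTED probability
    event_entropy = -math.log2(q_pred[event])
    # Step 2: weight it by the TRUE probability of the event
    contribution = p_true[event] * event_entropy
    total += contribution
    print(event, round(event_entropy, 3), round(contribution, 3))

# Step 3: the sum of the weighted terms is the cross entropy
print(round(total, 3))   # 0.522
```

Swapping `q_pred` for a copy of `p_true` turns the same loop into the entropy of the true distribution.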






### Example of a True and Predicted probability distribution *(from video)*


In [None]:
y_true = [0.9, 0.1] #True distribution

In [None]:
y_pred = [0.8, 0.2] #Predicted distribution

### 1.  Cross entropy function

In [None]:
import numpy as np

def cross_entropy(y_true, y_pred):
    """
    Calculate the cross entropy between a true and a predicted distribution

    Args:
    y_true: A list representing the true probability distribution.
    y_pred: A list representing the predicted probability distribution.

    Returns:
    The cross entropy of the two distributions.
    """
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # The distributions should be valid probability distributions
    assert np.isclose(np.sum(y_true), 1), "True distribution should sum to 1."
    assert np.isclose(np.sum(y_pred), 1), "Predicted distribution should sum to 1."
    assert (y_true >= 0).all(), "All elements in the true distribution should be non-negative."
    assert (y_pred >= 0).all(), "All elements in the predicted distribution should be non-negative."

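    # Keep only events with non-zero true probability: the others add 0 to the sum,
    # and dropping them avoids 0 * log(0) when the prediction for them is also 0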
    mask = y_true > 0
    y_true = y_true[mask]
    y_pred = y_pred[mask]

    return round(-np.sum(y_true * np.log2(y_pred)),3) #Cross entropy computation

### 2. Compute cross entropy

##### First, compute the cross entropy when the predicted distribution is the same as the true distribution. In this case, the cross entropy equals the entropy of the true distribution

In [None]:
print(y_true)
# Predicting the true distribution itself gives the lowest cross entropy possible
# for the true distribution [0.9, 0.1]; it equals the entropy of that distribution.
print(cross_entropy(y_true, y_true))

[0.9, 0.1]
0.469


##### Next, compute the cross entropy for cases where the predicted distribution is not the same as the true distribution

In [None]:
for step in range(1, 9):
    shift = round(step * 0.1, 1)  # probability moved from cat to dog: 0.1, 0.2, ..., 0.8
    y_pred = [round(y_true[0] - shift, 1), round(y_true[1] + shift, 1)]
    print(y_pred)
    # Every value printed here is larger than the entropy of the true distribution,
    # so minimizing cross entropy loss pushes a model toward the true distribution.
    print(cross_entropy(y_true, y_pred))

[0.8, 0.2]
0.522
[0.7, 0.3]
0.637
[0.6, 0.4]
0.795
[0.5, 0.5]
1.0
[0.4, 0.6]
1.263
[0.3, 0.7]
1.615
[0.2, 0.8]
2.122
[0.1, 0.9]
3.005


*Note that all the cross entropy values computed above are larger than the entropy of the true distribution (0.469). This is what makes cross entropy useful as a loss function.*
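To see what "useful as a loss function" means in practice, here is a toy sketch of gradient descent on cross entropy. It is not how the video or any particular library trains a model. It assumes a made-up model with a single parameter `z` (a logit) whose predicted cat probability is `sigmoid(z)`, and it uses the natural logarithm instead of log base 2, which only rescales the loss and does not move its minimum.

```python
import math

def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

p_cat = 0.9          # true probability of "cat"
z = 0.0              # hypothetical model parameter; sigmoid(0) = 0.5 to start
learning_rate = 1.0  # arbitrary choice for this sketch

for step in range(200):
    q_cat = sigmoid(z)
    # Loss: -(p * ln(q) + (1 - p) * ln(1 - q)) with q = sigmoid(z).
    # Its derivative with respect to z simplifies to (q - p).
    grad = q_cat - p_cat
    z -= learning_rate * grad

q_cat = sigmoid(z)
print(round(q_cat, 3), round(1 - q_cat, 3))   # prediction after training
```

The update stops changing `z` only when `q_cat` equals `p_cat`, so lowering the cross entropy moves the prediction from `[0.5, 0.5]` toward the true distribution `[0.9, 0.1]`.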

Since the cross entropy is never smaller than the entropy of the true distribution, the difference between the two values is never negative, and it is zero only when the predicted distribution is the same as the true distribution. This difference is called **KL divergence** and is also used as a loss function in some machine learning tasks. Cross entropy loss is typically used for supervised classification tasks. KL divergence is used in models like variational autoencoders to quantify the difference between a latent variable distribution and a prior distribution, such as a Gaussian.
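The relationship between the three quantities can be checked numerically. The sketch below assumes base-2 logarithms, as in the rest of this notebook; the function name `kl_divergence` is chosen for this illustration and recomputes cross entropy and entropy inline so that the cell runs on its own.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence of q from p, in bits, for distributions given as lists."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # events with zero true probability contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.9, 0.1])   # true distribution
q = np.array([0.8, 0.2])   # predicted distribution

cross_ent = float(-np.sum(p * np.log2(q)))   # cross entropy of p and q
ent = float(-np.sum(p * np.log2(p)))         # entropy of p

print(round(kl_divergence(p, q), 3))         # 0.053
print(round(cross_ent - ent, 3))             # same value: 0.522 - 0.469
print(round(kl_divergence(p, p), 3))         # 0.0 when prediction matches truth
```

Because the entropy of the true distribution does not depend on the prediction, minimizing cross entropy and minimizing KL divergence lead to the same predicted distribution.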