# Cross Entropy

## Prerequisite
- How neural networks work, Backpropagation, and the ArgMax and SoftMax functions.

## Previous Work

- In the 1st example of how neural networks work, the network gave a single output value. To determine how well that neural network fits the data, we commonly use the **Sum of the Squared Residuals (SSR)**.


## Notes
- But some neural networks have multiple output values, like the iris flower classifier NN I built previously.
- There we often use ArgMax to make the results easy to interpret. However, the derivative of ArgMax is either 0 or undefined, so we can't use it for Backpropagation.
- So, in order to train the neural network, we use the SoftMax function instead.
- The SoftMax output values are predicted probabilities between 0 and 1.
- And when the output is restricted to values between 0 and 1, we often use something called **Cross Entropy** to determine how well the neural network fits the data.

- Cross Entropy might sound fancy and complicated, but when it comes to neural networks it is super simple.
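The difference between ArgMax and SoftMax described above can be sketched in a few lines of Python. This is only an illustration: the raw output values below are hypothetical numbers chosen for the demo, not values from the trained iris network, and `softmax` is a helper written for this note.

```python
from math import exp

def softmax(raw_outputs):
    """Turn raw output values into probabilities between 0 and 1 that sum to 1."""
    # Subtracting the largest value first keeps exp() from overflowing;
    # it does not change the result.
    largest = max(raw_outputs)
    exps = [exp(x - largest) for x in raw_outputs]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical raw output values (assumption, for illustration only)
raw = {"Setosa": 1.43, "Versicolor": -0.40, "Virginica": 0.23}

probs = dict(zip(raw, softmax(list(raw.values()))))

# ArgMax just picks the label with the largest value: easy to read,
# but it gives no useful slope for Backpropagation.
predicted_label = max(probs, key=probs.get)

print(probs)
print(predicted_label)
```

SoftMax keeps the ranking that ArgMax would report, but every output stays a smooth function of the raw values, which is what Backpropagation needs.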

Let's see a simple example.

|Petal width|Sepal width|Species (observed/actual Iris Species)|
|---|---|---|
|0.04|0.42|Setosa|
|1|0.52|Virginica|
|0.50|0.37|Versicolor|

**Step 1:** Get the output values from the previously trained neural network for all the inputs given in the table.

So, for the 1st sample, which has Petal width = 0.04 and Sepal width = 0.42, we get these output values: Setosa = 0.57, Versicolor = 0.20, Virginica = 0.14

We already know that this sample is Setosa, so its Cross Entropy is the negative natural log (base `e`) of the SoftMax output value for Setosa.


$$ \text{Cross Entropy}_{\text{observed class}} = -\log_e(\text{SoftMax output value of observed class}) $$

$$ \text{Cross Entropy}_{\text{setosa}} = -\log_e(0.57) $$

In simple words, plug the predicted probability for the observed species into the Cross Entropy function.

The Cross Entropy equation given above is the simplified form of this general equation, where M is the number of classes (3 species here):

$$ \text{Cross Entropy}_{\text{class}} = -\sum_{c=1}^{M} \text{Observed}_c \times \log(\text{Predicted}_c) $$

As the current sample is Setosa, its observed value is 1 and the other two are 0:

$$ = -(1) \times \log(\text{Predicted}_{\text{setosa}}) + (-(0) \times \log(\text{Predicted}_{\text{versicolor}})) + (-(0) \times \log(\text{Predicted}_{\text{virginica}})) $$

Anything multiplied by 0 becomes 0, so we are left with the simplified version of the equation described earlier.
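To check that the general equation really collapses to the simplified one, here is a minimal sketch using the three output values quoted for the first sample. The function name `cross_entropy` and the one-hot list `observed` are names made up for this illustration.

```python
from math import log, isclose

def cross_entropy(observed, predicted):
    """General form: -sum(Observed_c * log(Predicted_c)) over all classes."""
    return -sum(o * log(p) for o, p in zip(observed, predicted))

# Order: Setosa, Versicolor, Virginica
observed = [1, 0, 0]             # the sample is Setosa
predicted = [0.57, 0.20, 0.14]   # output values quoted above for the 1st sample

general = cross_entropy(observed, predicted)
simplified = -log(0.57)          # only the observed class survives

print(general, simplified)
print(isclose(general, simplified))
```

Note that this sketch would raise a `ValueError` if any predicted value were exactly 0, because `log(0)` is undefined.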




In [1]:
from math import log, e

Below is the table with the SoftMax output value of the observed species for each sample.

|Petal width|Sepal width|Species (observed/actual Iris Species)|SoftMax o/p value|
|---|---|---|---|
|0.04|0.42|Setosa|0.57|
|1|0.52|Virginica|0.58|
|0.50|0.37|Versicolor|0.52|

In [4]:
ce_setosa = -log(0.57,e) # here 0.57 is Softmax o/p value
ce_virginica = -log(0.58,e)
ce_versicolor = -log(0.52, e)

print(f"Setosa Cross Entropy value: {ce_setosa}")
print(f"Virginica Cross Entropy value: {ce_virginica}")
print(f"Versicolor Cross Entropy value: {ce_versicolor}")

Setosa Cross Entropy value: 0.5621189181535413
Virginica Cross Entropy value: 0.5447271754416722
Versicolor Cross Entropy value: 0.6539264674066639


In [5]:
ce_setosa + ce_virginica + ce_versicolor

1.7607725610018772

Finally, we get the Cross Entropy value for each sample.

|Petal width|Sepal width|Species (observed/actual Iris Species)|SoftMax o/p value|Cross Entropy value|
|---|---|---|---|---|
|0.04|0.42|Setosa|0.57|0.5621189181535413|
|1|0.52|Virginica|0.58|0.5447271754416722|
|0.50|0.37|Versicolor|0.52|0.6539264674066639|

Now add up all the Cross Entropy values to get the Total Cross Entropy = 1.7607725610018772

- Based on the Total Cross Entropy value, we can use Backpropagation to adjust the Weights and Biases and minimize the total error.

## Why Cross Entropy and not something like Squared Residuals as a loss function?

- Since we have a predicted probability for each observed species, we could also calculate residuals.

- Residual = Observed - Predicted
- That is, the difference between the observed probability and the predicted probability.
- e.g. the observed species in the 1st row is Setosa, so the observed probability is 1 and the predicted probability is 0.57. Thus Residual = 0.43.

residual = 1 - 0.57 = 0.43

residual^2 = (0.43)^2 = 0.1849 ≈ 0.18

And we can now calculate the Sum of the Squared Residuals (SSR).
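As a rough sketch of that calculation, the snippet below computes the squared residual for each of the three samples, using only the SoftMax output for the observed species from the tables above. Restricting the sum to the observed class is a simplifying assumption made here; a full SSR would also include the residuals for the other two species, whose predicted values are not all listed in these notes.

```python
# SoftMax output for the observed species of each sample (from the table above)
observed_class_probs = {
    "Setosa": 0.57,
    "Virginica": 0.58,
    "Versicolor": 0.52,
}

# Observed probability is 1 for the observed species, so residual = 1 - predicted
residuals = {name: 1 - p for name, p in observed_class_probs.items()}
squared_residuals = {name: r ** 2 for name, r in residuals.items()}

# Sum of the squared residuals over the three samples (observed class only)
ssr = sum(squared_residuals.values())

for name in observed_class_probs:
    print(f"{name}: residual = {residuals[name]:.2f}, squared = {squared_residuals[name]:.4f}")
print(f"SSR (observed class only): {ssr:.4f}")
```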

**Worth noting**

- Remember, SoftMax only gives us values between 0 and 1,
  - so if the prediction for Setosa is really good, it will be close to 1,
  - and if the prediction is really bad, it will be close to 0.
  - In the above example, the prediction for Setosa is kind of in the middle: 0.57.
- We can plug values for "p" from 0 to 1 into the Cross Entropy function, CE = -log("p"), and plot the output.
- The Y axis is the loss, which shows how bad the prediction is.
- With Cross Entropy, as the prediction gets worse and worse, meaning closer to 0, the loss kind of explodes and gets really big.
- In contrast, if we plug values for "p" from 0 to 1 into the squared residual, (1 - "p")^2, the change in loss is not as large as it is for Cross Entropy.
- The step size for Backpropagation depends, in part, on the derivatives of these functions.
- And the derivative, or slope of the tangent line, for Cross Entropy for a bad prediction will be relatively large compared to the derivative for that same bad prediction with the squared residual.
- So when the NN makes a bad prediction, Cross Entropy helps us take a relatively large step towards a better prediction, because the slope of the tangent line is relatively large.
- See the image below for the difference.

<img src="https://raw.githubusercontent.com/srkds/deep-learning-foundation/main/assets/CE_slope.jpg" width="50%"/>
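The same comparison can be made numerically. The sketch below assumes the observed probability is 1 and treats both losses as functions of the predicted probability p for the observed class: CE = -log(p) with slope -1/p, and squared residual = (1 - p)^2 with slope -2(1 - p). The function names are made up for this illustration.

```python
from math import log

def cross_entropy_loss(p):
    return -log(p)

def cross_entropy_slope(p):
    # d/dp of -log(p)
    return -1.0 / p

def squared_residual_loss(p):
    return (1.0 - p) ** 2

def squared_residual_slope(p):
    # d/dp of (1 - p)^2
    return -2.0 * (1.0 - p)

# From a very bad prediction (p near 0) to a good one (p near 1)
for p in (0.01, 0.1, 0.57, 0.9):
    print(
        f"p = {p:<4}  "
        f"CE loss = {cross_entropy_loss(p):.3f}, CE slope = {cross_entropy_slope(p):.2f}  |  "
        f"SR loss = {squared_residual_loss(p):.3f}, SR slope = {squared_residual_slope(p):.2f}"
    )
```

For a bad prediction such as p = 0.01, the Cross Entropy slope is about -100 while the squared residual slope never gets steeper than -2, which is the gap the plot illustrates.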

## Takeaway

- Based on the comparison above, choose a loss function that gives a large derivative (slope) when the prediction is bad, so that Backpropagation can take big steps towards a good prediction.
