#### Example and explanation based on:
##### https://stackoverflow.com/questions/41990250/what-is-cross-entropy

#### Cross-entropy
#### Cross-entropy is commonly used to quantify the difference between two probability distributions.
#### Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

In [24]:
import numpy as np

#### For example, suppose for a specific training instance, the label is B (out of the possible labels A, B, and C). 
#### The one-hot distribution for this training instance is therefore:

#### Pr(Class A)  Pr(Class B)  Pr(Class C)
####      [   0.0,          1.0,          0.0]



In [25]:
p = np.array([0,1,0])
print("\np:=",p)


p:= [0 1 0]


#### You can interpret the above "true" distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

#### Now, suppose your machine learning algorithm predicts the following probability distribution:

#### Pr(Class A)  Pr(Class B)  Pr(Class C)
####       [0.228,        0.619,        0.153]

In [26]:
q = np.array([0.228,0.619,0.153])
print("\nq:=",q)


q:= [ 0.228  0.619  0.153]


#### How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. 

#### Use this formula:

$$H(p,q) = -\sum\limits_{i=1}^{n} p(i) \log q(i)$$

#### The sum runs over the three classes A, B, and C. Vector p is the correct label and q is our prediction, as described above.
#### Each q(i) is a probability between 0 and 1, so log q(i) is zero or negative; multiplying the sum by -1 makes the loss non-negative. We also need to be very careful about the argument order: H(p,q) is not the same as H(q,p). The correct vector p contains zeros, and we do not want to take the log of zero, so p must never be the argument of the log. The predicted output q is usually produced by a softmax function, which keeps every entry strictly positive.
#### If you complete the calculation (using the natural logarithm), you will find that the loss is approximately 0.4797. That is how "wrong" or "far away" your prediction is from the true distribution.

In [27]:
loss = - np.sum(p * np.log(q))
print("\nloss:=",loss)


loss:= 0.479650006298
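
#### To see why the argument order matters, here is a minimal sketch that swaps the two arguments. It reuses the p and q from above; the names h_pq and h_qp are only illustrative. NumPy returns -inf for log(0), so H(q,p) comes out infinite while H(p,q) stays finite.

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])        # true one-hot distribution
q = np.array([0.228, 0.619, 0.153])  # predicted distribution

# H(p, q): the log is taken of the prediction, which has no zeros.
h_pq = -np.sum(p * np.log(q))

# H(q, p): the log is taken of the one-hot vector, so log(0) appears.
# np.errstate silences the divide-by-zero warning NumPy would otherwise emit.
with np.errstate(divide="ignore"):
    h_qp = -np.sum(q * np.log(p))

print("\nH(p,q):=", h_pq)
print("\nH(q,p):=", h_qp)
```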


#### We can also use the dot product to avoid writing the summation explicitly, in both the code and the math (the log is applied to each element of q):
$$H(p,q) = - p \cdot \log q$$

In [28]:
# Using dot product
loss = - p.dot(np.log(q))
print("\nloss:=",loss)


loss:= 0.479650006298
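
#### The prediction usually comes from a softmax over raw model scores. The sketch below assumes some made-up scores: the logits values, the softmax helper, and the name q_softmax are hypothetical and not part of the original example. It shows that the softmax output is strictly positive and sums to 1, so its log is always finite.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([1.0, 2.0, 0.5])   # hypothetical raw scores for classes A, B, C
q_softmax = softmax(logits)          # predicted distribution, no exact zeros
p = np.array([0.0, 1.0, 0.0])        # true label is B

loss = -p.dot(np.log(q_softmax))
print("\nq_softmax:=", q_softmax)
print("\nloss:=", loss)
```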


In [29]:
# Suppose the prediction is almost correct.
# No entry of q is exactly zero (a softmax output layer, the usual choice, guarantees this), so the log is always defined.

q = np.array([0.1,.8,0.1])
print("\nq:=",q)


q:= [ 0.1  0.8  0.1]


In [30]:
loss = - np.sum(p * np.log(q))
print("\nloss:=",loss)


loss:= 0.223143551314


In [31]:
# Using dot product
loss = - p.dot(np.log(q))
print("\nloss:=",loss)


loss:= 0.223143551314


In [32]:
# An even better prediction

q = np.array([0.001,0.998,0.001])
print("\nq:=",q)


q:= [ 0.001  0.998  0.001]


In [33]:
loss = - np.sum(p * np.log(q))
print("\nloss:=",loss)


loss:= 0.00200200267067


In [34]:
# Using dot product
loss = - p.dot(np.log(q))
print("\nloss:=",loss)


loss:= 0.00200200267067
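
#### The three cases above can be wrapped in one small helper. This is a sketch, not a library function: the name cross_entropy and the eps clipping (a common guard against log(0) when a prediction contains an exact zero) are assumptions added here.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # Clip q away from 0 so the log never sees an exact zero (assumed guard).
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])  # true label is B
predictions = [
    np.array([0.228, 0.619, 0.153]),
    np.array([0.1, 0.8, 0.1]),
    np.array([0.001, 0.998, 0.001]),
]

# The loss shrinks as the prediction moves closer to the one-hot label.
losses = [cross_entropy(p, q) for q in predictions]
for q, loss in zip(predictions, losses):
    print("q:=", q, "loss:=", loss)
```

#### With the clipping in place, a prediction such as [1.0, 0.0, 0.0] gives a large but finite loss instead of infinity.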


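#### In conclusion, cross-entropy takes two probability distributions (the correct one and the predicted one) and measures the error between them. In the examples above, the loss fell from 0.4797 to 0.2231 to 0.0020 as the prediction moved closer to the true label. Cross-entropy is one of many possible loss functions (another popular one is the SVM hinge loss).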