# What's an intuitive way to think of cross entropy

**Source:**

> - [What's an intuitive way to think of cross entropy](https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy)

# A Gentle Introduction to Information Entropy

**Source:**

>- [A Gentle Introduction to Information Entropy](https://machinelearningmastery.com/what-is-information-entropy/)

## What is information theory

Information theory is a field of study concerned with quantifying information for communication.

A foundational concept from information theory is the quantification of the amount of information in things like events, random variables, and distributions.

> *Why unify information theory and machine learning? Because they are two sides of the same coin. Information theory and machine learning still belong together.*

## Calculate the information for an event

Quantifying information is the foundation of the field of information theory.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Events that are rare (low probability) are more surprising and therefore carry more information than events that are common (high probability).

- **Low probability event:** High information (surprising).
- **High probability event:** Low information (unsurprising).

Rare events are more uncertain or more surprising and require more information to represent them than common events.

We can calculate the amount of information there is in an event using the probability of the event. This is called *"Shannon information", "self-information"*, or simply the *"information"*, and can be calculated for a discrete event $x$ as follows:

- $information(x) = -\log_2(p(x))$

Where $p(x)$ is the probability of the event $x$.

The choice of the base-2 logarithm means that information is measured in bits (binary digits).

The calculation of information is often written as $h()$, for example:

- $h(x) = -\log_2(p(x))$

The negative sign ensures that the result is always positive or zero.

Information will be zero when the probability of an event is $1.0$, i.e. when the event is a certainty and there is no surprise.

Let's make this concrete with some examples.

Consider a flip of a single fair coin. The probability of heads (and of tails) is 0.5. We can calculate the information for flipping a head in Python:

```python
import numpy as np

p = 0.5
h = -np.log2(p)
print(f'p(x)={p}, information: {h}')
```

```
p(x)=0.5, information: 1.0
```
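
To see how information shrinks as an event becomes more likely, the same calculation can be wrapped in a small helper and applied to a few probabilities. This is a minimal sketch: the function name `information` and the chosen probabilities (one face of a fair die, a likely event, and a certain event) are illustrative assumptions, not part of the source article.

```python
import numpy as np

def information(p):
    """Shannon information, in bits, of an event with probability p (0 < p <= 1)."""
    # Adding 0.0 turns the -0.0 that arises at p = 1.0 into a plain 0.0.
    return float(-np.log2(p)) + 0.0

# Hypothetical probabilities: one face of a fair die, a likely event, a certain event.
for p in (1 / 6, 0.9, 1.0):
    print(f'p(x)={p:.3f}, information: {information(p):.3f} bits')
```

The rarer die face carries roughly 2.585 bits, while the certain event carries 0 bits, matching the low-probability/high-information rule above.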


## Calculate the Entropy for a Random Variable

$
H(X)=-\displaystyle\sum_{k \in K} p_k \log(p_k)
$

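In the formula above, $H(X)$ denotes the entropy of the random variable $X$. It is the negative sum, over the $K$ states of $X$, of each state's probability $p_k$ multiplied by the logarithm of that probability, $\log(p_k)$.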

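The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty.

*The minimum entropy of a random variable is $0$. It occurs when the random variable has only one event, that is, when its probability distribution is $[1]$. Substituting this into the formula gives an entropy of $0$.*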

The largest entropy for a random variable occurs when all of its states $k \in K$ are equally likely, i.e. when the variable follows a uniform distribution.
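
The two extremes can be checked numerically. The sketch below assumes a base-2 logarithm (so the result is in bits, consistent with the information section; the formula above leaves the base unspecified) and treats zero-probability states as contributing nothing. The function name `entropy` and the three example distributions are illustrative assumptions.

```python
import numpy as np

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a sequence of probabilities."""
    probs = np.asarray(probs, dtype=float)
    # States with zero probability contribute nothing (0 * log 0 is taken as 0).
    nonzero = probs[probs > 0]
    # Adding 0.0 turns a possible -0.0 into a plain 0.0.
    return float(-np.sum(nonzero * np.log2(nonzero))) + 0.0

certain = [1.0]                             # a single event with probability 1.0
skewed = [0.7, 0.1, 0.1, 0.05, 0.05, 0.0]   # hypothetical skewed six-state distribution
uniform = [1 / 6] * 6                       # fair six-sided die

for name, dist in [('certain', certain), ('skewed', skewed), ('uniform', uniform)]:
    print(f'{name}: {entropy(dist):.3f} bits')
```

The certain distribution gives 0 bits, the uniform die gives $\log_2 6 \approx 2.585$ bits, and the skewed distribution falls between the two.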

# A Gentle Introduction to Cross-Entropy for Machine Learning

**Source:**

> - [A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)

# An introduction to entropy, cross entropy and KL divergence in machine learning

**Source:**

> - [An introduction to entropy, cross entropy and KL divergence in machine learning](https://adventuresinmachinelearning.com/cross-entropy-kl-divergence/)

# Information Gain and Mutual Information for Machine Learning

**Source:**

> - [Information Gain and Mutual Information for Machine Learning](https://machinelearningmastery.com/information-gain-and-mutual-information/)

$
I(X;Y) = H(Y) - H(Y|X)
$
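
As a rough illustration of this identity, the sketch below computes $I(X;Y)$ from a joint probability table by evaluating $H(Y)$ and $H(Y|X) = \sum_x p(x) H(Y|X=x)$. The function names, the base-2 logarithm, and the two 2x2 joint tables are assumptions chosen for illustration and do not come from the source article.

```python
import numpy as np

def entropy(probs):
    """Entropy, in bits, of a discrete distribution (zero-probability states are skipped)."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log2(nonzero))) + 0.0

def mutual_information(joint):
    """I(X;Y) in bits from a joint table p(x, y): rows indexed by x, columns by y."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal distribution of X
    p_y = joint.sum(axis=0)  # marginal distribution of Y
    # H(Y|X) = sum over x of p(x) * H(Y | X = x), where row / px is p(y | x).
    h_y_given_x = sum(
        px * entropy(row / px) for px, row in zip(p_x, joint) if px > 0
    )
    return entropy(p_y) - h_y_given_x

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is fully determined by X

print(f'independent: {mutual_information(independent):.3f} bits')
print(f'dependent:   {mutual_information(dependent):.3f} bits')
```

For the independent table, knowing $X$ does not reduce the uncertainty in $Y$, so the mutual information is 0 bits. For the fully dependent table, $H(Y|X) = 0$ and the mutual information equals $H(Y) = 1$ bit.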

# How to Calculate the KL Divergence for Machine Learning

**Source:**

> - [How to Calculate the KL Divergence for Machine Learning](https://machinelearningmastery.com/divergence-between-probability-distributions/)