In [1]:
## tutorial to better understand information gain and mutual information, commonly used in decision trees
## tutorial url:
## https://machinelearningmastery.com/information-gain-and-mutual-information/

Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.

It is commonly used in the construction of decision trees from a training dataset: the information gain is evaluated for each variable, and the variable that maximizes it is selected. Maximizing the gain minimizes the remaining entropy and best splits the dataset into groups for effective classification.

Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable. In this slightly different usage, the calculation is referred to as mutual information between the two random variables.

Information gain is calculated by comparing the entropy of the dataset before and after a transformation.

Mutual information calculates the statistical dependence between two variables and is the name given to information gain when applied to variable selection.

You might recall that information quantifies how surprising an event is in bits. Lower probability events have more information; higher probability events have less information. Entropy quantifies how much information there is in a random variable, or more specifically its probability distribution. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.

In information theory, we like to describe the “surprise” of an event. Low probability events are more surprising and therefore carry a larger amount of information. Likewise, probability distributions where the events are equally likely are more surprising on average and have larger entropy.

Skewed Probability Distribution (unsurprising): Low entropy.

Balanced Probability Distribution (surprising): High entropy.
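
As a small illustration of the “surprise” idea, the information of a single event with probability p can be computed as -log2(p). The sketch below is not part of the original tutorial; the helper name `information` is made up here for illustration.

```python
from math import log2

# hypothetical helper: information (surprise) of one event, in bits
def information(p):
    return -log2(p)

# rarer events carry more information than common ones
for p in (0.9, 0.5, 0.1):
    print('p(x)=%.1f, information: %.3f bits' % (p, information(p)))
```

An event with probability 0.5 carries exactly 1 bit, and the value grows as the probability shrinks.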

In [2]:
# calculate the entropy for a dataset
from math import log2
# proportion of examples in each class
class0 = 10/100
class1 = 90/100
# calculate entropy
entropy = -(class0 * log2(class0) + class1 * log2(class1))
# print the result
print('entropy: %.3f bits' % entropy)

entropy: 0.469 bits


In [3]:
# proportion of examples in each class
class0 = 50/100
class1 = 50/100
# calculate entropy
entropy = -(class0 * log2(class0) + class1 * log2(class1))
# print the result
print('entropy: %.3f bits' % entropy)

entropy: 1.000 bits


Information gain can be calculated as follows:

IG(S, a) = H(S) – H(S | a)
Where IG(S, a) is the information gain for the dataset S for the random variable a, H(S) is the entropy for the dataset before any change (described above) and H(S | a) is the conditional entropy for the dataset given the variable a.

This calculation describes the gain in the dataset S for the variable a. It is the number of bits saved when transforming the dataset.
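
For reference, the conditional entropy is usually written as a weighted sum over the groups created by each value of the variable. The notation S_v, meaning the subset of S where a takes the value v, is an assumption introduced here and does not appear in the original text:

$$
H(S \mid a) = \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|} \, H(S_v)
$$

Each group's entropy is weighted by the fraction of examples that fall into that group.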

## Worked Example of Calculating Information Gain

We can define a function to calculate the entropy of a group of samples based on the ratio of samples that belong to class 0 and class 1.

In [7]:
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))
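
One caveat: the function above raises an error if either proportion is exactly 0, because log2(0) is undefined. A possible variant, sketched here under the usual convention that a class with zero proportion contributes nothing to the entropy, is shown below. The name `entropy_safe` is hypothetical and chosen so it does not replace the `entropy` function used in the rest of this tutorial.

```python
from math import log2

# hypothetical variant: skips zero proportions so a pure group has entropy 0
def entropy_safe(*proportions):
    return sum(-p * log2(p) for p in proportions if p > 0)

print('Pure group: %.3f bits' % entropy_safe(1.0, 0.0))
print('Balanced group: %.3f bits' % entropy_safe(0.5, 0.5))
```

This version also accepts more than two class proportions.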

Now, consider a dataset with 20 examples, 13 for class 0 and 7 for class 1. We can calculate the entropy for this dataset, which will be less than 1 bit because the classes are not perfectly balanced.

In [8]:
# split of the main dataset
class0 = 13 / 20
class1 = 7 / 20
# calculate entropy before the change
s_entropy = entropy(class0, class1)
print('Dataset Entropy: %.3f bits' % s_entropy)

Dataset Entropy: 0.934 bits


Now consider that one of the variables in the dataset has two unique values, say “value1” and “value2.” We are interested in calculating the information gain of this variable.

Let’s assume that if we split the dataset by value1, we have a group of eight samples, seven for class 0 and one for class 1. We can then calculate the entropy of this group of samples.

In [9]:
# split 1 (split via value1)
s1_class0 = 7 / 8
s1_class1 = 1 / 8
# calculate the entropy of the first group
s1_entropy = entropy(s1_class0, s1_class1)
print('Group1 Entropy: %.3f bits' % s1_entropy)

Group1 Entropy: 0.544 bits


Now, let’s assume that we split the dataset by value2; we have a group of 12 samples with six in each class. We would expect this group to have an entropy of 1.

In [10]:
# split 2  (split via value2)
s2_class0 = 6 / 12
s2_class1 = 6 / 12
# calculate the entropy of the second group
s2_entropy = entropy(s2_class0, s2_class1)
print('Group2 Entropy: %.3f bits' % s2_entropy)

Group2 Entropy: 1.000 bits


In this case, information gain can be calculated as:

Entropy(Dataset) – (Count(Group1) / Count(Dataset) * Entropy(Group1) + Count(Group2) / Count(Dataset) * Entropy(Group2))
Or:

Entropy(13/20, 7/20) – (8/20 * Entropy(7/8, 1/8) + 12/20 * Entropy(6/12, 6/12))

In [11]:
# calculate the information gain
gain = s_entropy - (8/20 * s1_entropy + 12/20 * s2_entropy)
print('Information Gain: %.3f bits' % gain)

Information Gain: 0.117 bits
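
The steps of the worked example can also be collected into one function that works directly from class counts. This is a minimal sketch, not the tutorial's own code; the names `entropy_from_counts` and `information_gain` are assumptions made for illustration.

```python
from math import log2

# hypothetical helper: entropy in bits from a list of class counts
def entropy_from_counts(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# hypothetical helper: parent entropy minus the weighted entropy of the groups
def information_gain(parent_counts, group_counts):
    total = sum(parent_counts)
    weighted = sum(sum(g) / total * entropy_from_counts(g) for g in group_counts)
    return entropy_from_counts(parent_counts) - weighted

# same numbers as the worked example: 13/7 overall, split into 7/1 and 6/6
gain = information_gain([13, 7], [[7, 1], [6, 6]])
print('Information Gain: %.3f bits' % gain)
```

With the counts from the worked example this gives the same 0.117 bits.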


## Examples of Information Gain in Machine Learning

Perhaps the most popular use of information gain in machine learning is in decision trees.

"Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree."

The information gain is calculated for each variable in the dataset. The variable that has the largest information gain is selected to split the dataset. Generally, a larger gain indicates a smaller entropy, or less surprise, in the groups created by the split.

The process is then repeated on each created group, excluding the variable that was already chosen. This stops once a desired depth to the decision tree is reached or no more splits are possible.
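
The selection step described above can be sketched as follows. This only shows how one split variable is chosen, not the full recursive tree construction, and it is not the ID3 reference implementation. The toy data, the column names and the helper names are all made up for illustration.

```python
from collections import Counter
from math import log2

# hypothetical helper: entropy in bits of a list of class labels
def entropy_of(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

# hypothetical helper: information gain from splitting on one column
def gain_for(rows, labels, column):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[column], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())
    return entropy_of(labels) - remainder

# made-up toy data: each row is (outlook, windy), each label is a class
rows = [('sunny', 'no'), ('sunny', 'yes'), ('rain', 'no'), ('rain', 'yes')]
labels = [1, 1, 0, 0]
names = ['outlook', 'windy']

# evaluate the gain of every variable and pick the largest
gains = {name: gain_for(rows, labels, i) for i, name in enumerate(names)}
best = max(gains, key=gains.get)
print(gains)
print('Best variable to split on:', best)
```

In this toy data the first column separates the classes perfectly, so it has the larger gain and is selected.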

Information gain can be used as a split criterion in most modern implementations of decision trees, such as the DecisionTreeClassifier class in the scikit-learn Python machine learning library, which implements the Classification and Regression Tree (CART) algorithm for classification.

In [14]:
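# import the class directly; a bare "import sklearn" does not reliably load the tree submodule
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy')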

## How Are Information Gain and Mutual Information Related?

Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.

For example:

Effect of Transforms to a Dataset (decision trees): Information Gain.

Dependence Between Variables (feature selection): Mutual Information.

Notice the similarity in the way that the mutual information is calculated and the way that information gain is calculated; they are equivalent:

I(X ; Y) = H(X) – H(X | Y)

and

IG(S, a) = H(S) – H(S | a)

As such, mutual information is sometimes used as a synonym for information gain. Technically, they calculate the same quantity if applied to the same data.
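
To make the equivalence concrete, the mutual information between the variable and the class label in the earlier worked example can be computed from their joint counts. This is a sketch that uses the standard joint-distribution form of mutual information, the sum over all pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))), which is not written out in the original text.

```python
from math import log2

# joint counts from the worked example:
# rows are value1 and value2, columns are class 0 and class 1
joint = [[7, 1], [6, 6]]
total = sum(sum(row) for row in joint)
row_totals = [sum(row) for row in joint]
col_totals = [sum(col) for col in zip(*joint)]

# sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over all non-empty cells
mi = 0.0
for i, row in enumerate(joint):
    for j, count in enumerate(row):
        if count > 0:
            p_xy = count / total
            p_x = row_totals[i] / total
            p_y = col_totals[j] / total
            mi += p_xy * log2(p_xy / (p_x * p_y))
print('Mutual Information: %.3f bits' % mi)
```

The result matches the information gain of 0.117 bits calculated earlier for the same data.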