# Information Theory: Mutual Information
Ian Tenney, September 10, 2016

We introduced this concept briefly in section last week, but it's worth expanding upon a bit here.

**Mutual information** is a general way of measuring the relationship between two random variables. More precisely, it measures how much *information* (in bits) observing one variable gives us about the other. In this way it's similar to the idea of correlation, but it's not limited to real-valued variables.

Let's consider a simple corpus of sentences, each giving a pet type and the pet's name. We'll compute a co-occurrence matrix between pets (rows) and names (cols):

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display

corpus = [
"I have a pet dog named Chloe",
"I have a pet dog named Ozzie",
"I have a pet cat named Jinx",
"I have a pet cat named Fritz",
"I have a pet cat named Chloe",
"I have a pet gecko named Remy",
]

pets = ["dog", "cat", "gecko"]
pet_to_row = {w:i for i,w in enumerate(pets)}
names = ["Chloe", "Ozzie", "Jinx", "Fritz", "Remy"]
name_to_col = {w:i for i,w in enumerate(names)}

Cxy = np.zeros((3,5))
pairs = [(s.split()[-3], s.split()[-1]) for s in corpus]
for pet, name in pairs:
    i, j = pet_to_row[pet], name_to_col[name]
    Cxy[i,j] += 1

# Pretty-print function
def pretty_print_matrix(M, rows=pets, cols=names, dtype=float):
    display(pd.DataFrame(M, index=rows, columns=cols, dtype=dtype))
    
# Pretty-print with headers
pretty_print_matrix(Cxy, dtype=int)

| | Chloe | Ozzie | Jinx | Fritz | Remy |
|---|---|---|---|---|---|
| dog | 1 | 1 | 0 | 0 | 0 |
| cat | 1 | 0 | 1 | 1 | 0 |
| gecko | 0 | 0 | 0 | 0 | 1 |


We want to know: how much does one word (e.g. the pet type) tell us about another (e.g. the pet's name)?

Let's look at a single pair of words. Suppose we know the pet is a `dog`. How much more likely is it to be named `Ozzie`? Let's measure the ratio of probabilities:

$$ \frac{P(\text{"Ozzie"}\ |\ \text{"dog"})}{P(\text{"Ozzie"})} = \frac{1/2}{1/6} = 3$$

As usual, we'll take the log to get units of information:

$$ \text{PMI}(\text{"Ozzie"},\text{"dog"}) = \log_2 \frac{P(\text{"Ozzie"}\ |\ \text{"dog"})}{P(\text{"Ozzie"})} = \log_2 (3) $$
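
As a quick check of the arithmetic above, here is a minimal sketch that recomputes the ratio and its log directly from the co-occurrence counts. The array `counts` is retyped by hand from the table above, and the variable names are illustrative rather than part of the notebook's own code.

```python
import numpy as np

# Co-occurrence counts copied from the table above: rows = pets, cols = names
pets = ["dog", "cat", "gecko"]
names = ["Chloe", "Ozzie", "Jinx", "Fritz", "Remy"]
counts = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 1, 0],
                   [0, 0, 0, 0, 1]], dtype=float)

i, j = pets.index("dog"), names.index("Ozzie")

# P("Ozzie" | "dog"): fraction of dog sentences in which the name is Ozzie
p_name_given_pet = counts[i, j] / counts[i].sum()
# P("Ozzie"): fraction of all sentences in which the name is Ozzie
p_name = counts[:, j].sum() / counts.sum()

ratio = p_name_given_pet / p_name
pmi_ozzie_dog = np.log2(ratio)
print("ratio = %.4f, PMI = %.4f bits" % (ratio, pmi_ozzie_dog))
```

The ratio comes out to 3 and the PMI to $\log_2 3 \approx 1.585$ bits, matching the hand calculation.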

This quantity is known as **pointwise mutual information** (PMI). In general form:  

$$ \text{PMI}(x,y) = \log_2 \frac{P(x | y)}{P(x)} = \log_2 \frac{P(x | y)P(y)}{P(x)P(y)} = \log_2 \frac{P(x,y)}{P(x)P(y)}  $$  
PMI has the same value whichever side we condition on: unlike cross-entropy or KL divergence, it is symmetric.
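
The symmetry can also be checked numerically. The sketch below, which assumes the same hand-copied `counts` table and uses illustrative variable names, computes the probability ratio once conditioning on the pet and once conditioning on the name, then compares the two. It compares the ratios rather than their logs so that the zero-count cells don't produce `-inf`.

```python
import numpy as np

# Co-occurrence counts copied from the table above: rows = pets, cols = names
counts = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 1, 0],
                   [0, 0, 0, 0, 1]], dtype=float)
P_joint = counts / counts.sum()
P_pet = P_joint.sum(axis=1)   # marginal over names
P_name = P_joint.sum(axis=0)  # marginal over pets

# Condition on the pet: P(name | pet) / P(name)
P_name_given_pet = P_joint / P_pet[:, np.newaxis]
ratio_given_pet = P_name_given_pet / P_name[np.newaxis, :]

# Condition on the name: P(pet | name) / P(pet)
P_pet_given_name = P_joint / P_name[np.newaxis, :]
ratio_given_name = P_pet_given_name / P_pet[:, np.newaxis]

ratios_match = np.allclose(ratio_given_pet, ratio_given_name)
print(ratios_match)
```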

In [2]:
Pxy = Cxy / np.sum(Cxy)
Px = Pxy.sum(axis=1)  # sum each row
Py = Pxy.sum(axis=0)  # sum each column

# Pointwise mutual information
# Note: np.outer(Px,Py)[i,j] = Px[i] * Py[j]
PMI_xy = np.log2(Pxy / np.outer(Px, Py))
pretty_print_matrix(PMI_xy)

| | Chloe | Ozzie | Jinx | Fritz | Remy |
|---|---|---|---|---|---|
| dog | 0.584963 | 1.584963 | -inf | -inf | -inf |
| cat | 0.0 | -inf | 1.0 | 1.0 | -inf |
| gecko | -inf | -inf | -inf | -inf | 2.584963 |


### Mutual Information

The mutual information (MI) is just the expectation of PMI, over all possible pairs $(x,y)$:

$$ I(X,Y) = E_{x,y}\left[\text{PMI}(x,y)\right] = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$$

Let's compute it over our corpus:

In [3]:
# The -inf values should be canceled by Pxy = 0
# Need np.nansum to ignore nan = (-inf * 0), since these values should really be zero.
I_xy = np.nansum(Pxy * PMI_xy)
print(I_xy)

1.12581458369
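
An alternative to relying on `np.nansum` is to sum only over the cells with nonzero joint probability, which applies the convention $0 \log 0 = 0$ explicitly and never forms a `-inf` or `nan`. This is a sketch of that variant, not the notebook's approach. It assumes the hand-copied `counts` table, and the names `nonzero` and `mi_bits` are illustrative.

```python
import numpy as np

# Co-occurrence counts copied from the table above: rows = pets, cols = names
counts = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 1, 0],
                   [0, 0, 0, 0, 1]], dtype=float)
Pxy = counts / counts.sum()
Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)
PxPy = np.outer(Px, Py)  # what the joint would be if X and Y were independent

# Keep only the observed pairs, so log2 is never evaluated at zero
nonzero = Pxy > 0
mi_bits = np.sum(Pxy[nonzero] * np.log2(Pxy[nonzero] / PxPy[nonzero]))
print("I(X,Y) = %.4f" % mi_bits)
```

This should agree with the `np.nansum` result above to within floating-point error.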


How do we interpret this? It's expressed in bits, so we can say that pet and name share about 1.13 bits of information: on average, learning one reduces our uncertainty about the other by that amount.

More formally, we can expand the sum and write MI in terms of entropy:

$$ I(X,Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)} $$
$$ = \sum_{x,y} P(x,y) \log_2 P(x,y) - \sum_{x,y} P(x,y) \log_2 P(x) - \sum_{x,y} P(x,y) \log_2 P(y) $$
$$ = \sum_{x,y} P(x,y) \log_2 P(x,y) - \sum_{x} P(x) \log_2 P(x) - \sum_{y} P(y) \log_2 P(y) $$
$$ = - H(X,Y) + H(X) + H(Y) $$
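
The step from the second line to the third uses marginalization: $\log_2 P(x)$ does not depend on $y$, so the sum over $y$ collapses the joint into the marginal. Spelled out for the $x$ term (the $y$ term works the same way):

$$ \sum_{x,y} P(x,y) \log_2 P(x) = \sum_{x} \log_2 P(x) \sum_{y} P(x,y) = \sum_{x} P(x) \log_2 P(x) = -H(X) $$

The last line then follows from the definition of entropy, $H(X) = -\sum_{x} P(x) \log_2 P(x)$, and likewise $H(X,Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y)$.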

Recall that $H(X)$ is the information (entropy) of $X$. The mutual information is therefore the difference between how much information we would need to specify $X$ and $Y$ separately and how much we need to specify them jointly as pairs $(X,Y)$.

In [4]:
# Use np.nansum again to ignore nan = (-inf * 0)
Hx = np.nansum(-Px * np.log2(Px))
Hy = np.nansum(-Py * np.log2(Py))
Hxy = np.nansum(-Pxy * np.log2(Pxy))
print("H(X) = %.04f" % Hx)
print("H(Y) = %.04f" % Hy)
print("H(X,Y) = %.04f" % Hxy)
print("I(X,Y) = %.04f" % (Hx + Hy - Hxy))

H(X) = 1.4591
H(Y) = 2.2516
H(X,Y) = 2.5850
I(X,Y) = 1.1258
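
To see the entropy form behave as a measure of dependence, the sketch below wraps it in two small helper functions and applies them to the pet corpus and to a made-up table whose rows are proportional to each other (so the two variables are independent). The helpers `entropy_bits` and `mutual_information_bits` and the table `independent_counts` are hypothetical, introduced only for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array, taking 0 * log2(0) as 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # drop zero-probability outcomes before taking the log
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(counts):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), estimated from a table of co-occurrence counts."""
    Pxy = np.asarray(counts, dtype=float)
    Pxy = Pxy / Pxy.sum()
    Hx = entropy_bits(Pxy.sum(axis=1))
    Hy = entropy_bits(Pxy.sum(axis=0))
    Hxy = entropy_bits(Pxy)
    return Hx + Hy - Hxy

# Counts copied from the pet/name table above
pet_counts = np.array([[1, 1, 0, 0, 0],
                       [1, 0, 1, 1, 0],
                       [0, 0, 0, 0, 1]])
# Made-up independent table: every row is a multiple of [1, 3]
independent_counts = np.outer([1, 2], [1, 3])

mi_pets = mutual_information_bits(pet_counts)
mi_independent = mutual_information_bits(independent_counts)
print("pets: I(X,Y) = %.4f" % mi_pets)
print("independent table has zero MI:", bool(np.isclose(mi_independent, 0.0)))
```

When the joint distribution factors as $P(x)P(y)$, every PMI term is $\log_2 1 = 0$, so the mutual information vanishes. The pet corpus, where the name narrows down the pet type, gives a positive value.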
