## Mutual Information

*NOTE: To understand this derivation, you first need to understand Shannon Entropy. This notebook assumes familiarity with that material.*

- Mutual information measures how strongly 2 variables depend on one another (it is zero exactly when they are independent), and it is defined directly in terms of KL divergence

- We know that entropy is a measure of the uncertainty in some probability distribution $P$ over a discrete random variable $X$
    - $H(X) = -\sum_X P(X) \log_2(P(X))$ is the entropy of the distribution $P$ (note the leading minus sign, which makes the quantity non-negative)

- In the same way, we can define an entropy for a joint distribution of 2 random variables $X$ and $Y$
    - $H(X,Y) = -\sum_X \sum_Y P(X,Y) \log_2(P(X,Y))$

- In KL Divergence, we know that $D_{KL}(P || Q)$ measures the extra uncertainty (the expected additional information needed) when we use distribution $Q$ to approximate distribution $P$

- We can therefore check the extent to which $X$ and $Y$ are independent by measuring how much uncertainty increases when we use the product of the marginals $P(X) \cdot P(Y)$ to approximate the joint distribution $P(X,Y)$
    - If $X$ and $Y$ are independent, then $P(X,Y) = P(X) \cdot P(Y)$ and the divergence is zero; the more dependent they are, the larger the divergence

- Therefore, Mutual Information (MI) is defined as the Kullback-Leibler Divergence between $P(X,Y)$ and $P(X) \cdot P(Y)$

$$\begin{aligned}
    I(X;Y) &= D_{KL}(P(X,Y) || P(X) \cdot P(Y)) \\
    &= \sum_X \sum_Y P(X,Y) \cdot \log\left(\frac{P(X,Y)}{P(X) \cdot P(Y)}\right)
\end{aligned}$$
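
To make the formula concrete, here is a minimal sketch that evaluates it on two tiny joint distributions written out by hand. The helper name `mutual_information` and the toy distributions `independent` and `identical` are assumptions made for illustration; they are not part of this notebook's implementation. The natural log is used, so the result is in nats.

```python
import math

def mutual_information(joint):
    """KL divergence (in nats) between a joint pmf and the product of its marginals.

    `joint` maps (x, y) pairs to probabilities that sum to 1.
    """
    # Marginals are obtained by summing the joint over the other variable.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Cells with zero probability contribute nothing to the sum, so skip them.
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Hypothetical toy distributions over two binary variables.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}  # Y is always a copy of X

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
print(mi_independent, mi_identical)
```

In the independent case every ratio inside the log equals 1, so the sum vanishes. When $Y$ copies $X$, each of the two non-zero cells contributes $0.5 \cdot \log(2)$, giving $\log(2)$ nats (1 bit) in total.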

### Implementation

In [None]:
import numpy as np
from sklearn.metrics import mutual_info_score
from collections import Counter

N=1000
NRANGE=5

X1 = np.random.randint(0,NRANGE,N) # Base
X2 = X1 + np.random.randint(-2,3,N) # X1 + Symmetric Random Noise
X3 = np.random.randint(0,NRANGE,N) # Independent Draw

In [None]:
def make_distribution_map(X1, X2=None):
    """Empirical distribution of X1, or the joint distribution of (X1, X2) if X2 is given."""
    if X2 is None:
        dist_map = Counter(X1)
    else:
        dist_map = Counter(zip(X1, X2))

    # Normalize by the actual sample size rather than relying on the global N
    n = len(X1)
    return {k: v / n for k, v in dist_map.items()}

def yj_mutual_info_score(X1, X2):
    """Mutual information (in nats): KL divergence between the joint and the product of marginals."""
    joint_dist_x1x2 = make_distribution_map(X1=X1, X2=X2)
    dist_x1 = make_distribution_map(X1)
    dist_x2 = make_distribution_map(X2)

    res = 0.0
    for (xkey, ykey), p_xy in joint_dist_x1x2.items():
        # One term of the KL sum: P(x,y) * log(P(x,y) / (P(x) * P(y)))
        res += p_xy * np.log(p_xy / (dist_x1[xkey] * dist_x2[ykey]))

    return res

In [110]:
print(yj_mutual_info_score(X1,X1), yj_mutual_info_score(X1,X2), yj_mutual_info_score(X1,X3))

1.609237515573737 0.4750464928364453 0.0076388736754139435


In [111]:
print(mutual_info_score(X1, X1), mutual_info_score(X1, X2), mutual_info_score(X1, X3))

1.6092375155737368 0.4750464928364452 0.007638873675413535
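
The first value in both outputs is $I(X_1; X_1)$, which reduces to the entropy $H(X_1)$: substituting $P(X,X) = P(X)$ into the KL sum leaves $-\sum_X P(X) \log(P(X))$. Both functions use the natural log, so the scores are in nats, and for a roughly uniform draw over 5 values we expect something close to $\ln(5) \approx 1.609$. The sketch below checks this on an exactly uniform sample and converts to bits; `entropy_nats` and `samples` are hypothetical names used only for this illustration.

```python
import math
from collections import Counter

def entropy_nats(samples):
    """Empirical Shannon entropy of a sequence of discrete values, in nats."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

# Hypothetical, exactly uniform sample over 5 values (200 occurrences each).
samples = [0, 1, 2, 3, 4] * 200

h_nats = entropy_nats(samples)   # should be close to ln(5)
h_bits = h_nats / math.log(2)    # divide by ln(2) to convert nats to bits
print(h_nats, h_bits)
```

Dividing any of the scores above by $\ln(2)$ gives the same quantity in bits, which matches the $\log_2$ convention used in the entropy definitions at the top of this notebook.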
