# Mutual information score

#### Let $X$ be the predicted class (label) of a point and $Y$ its true class. $p_X(x)$ is the probability that a point is predicted as class $X=x$, $p_Y(y)$ is the probability that a point belongs to class $Y=y$, and $p_{X,Y}(x,y)$ is the probability that a point is predicted as $x$ while its true class is $y$. In other words, $p_X$ and $p_Y$ are the marginal distributions of the predicted and true classes, and $p_{X,Y}$ is their joint distribution. Their mutual information is then calculated as follows.<br>
#### $I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)} = \sum_{x \in X} \sum_{y \in Y} \frac{|x \cap y|}{N} \log \frac{N |x \cap y|}{|x||y|}$, <br>
#### where the second equality substitutes the empirical estimates $p_X(x) = \frac{|x|}{N}$, $p_Y(y) = \frac{|y|}{N}$ and $p_{X,Y}(x,y) = \frac{|x \cap y|}{N}$. Here $|x|$ is the number of points predicted as $x$, $|y|$ is the number of points whose true class is $y$, $|x \cap y|$ is the number of points for which both hold, and $N$ is the total number of points. Terms with $|x \cap y| = 0$ contribute zero.<br>
#### Given predicted labels and the corresponding true labels, the mutual information can be calculated directly from the contingency matrix, whose entries are the counts $|x \cap y|$.
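As a rough illustration of the counting form of the formula, the sketch below builds the contingency matrix with plain NumPy and evaluates the double sum in vectorized form. The helper names `contingency_table` and `mi_from_counts` are hypothetical (they are not part of this notebook or of scikit-learn), and the label arrays are the same ones the notebook uses in its later cells.

```python
import numpy as np

def contingency_table(labels_a, labels_b):
    """Count co-occurrences: entry [i, j] is the number of points with
    the i-th distinct label in labels_a and the j-th in labels_b."""
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(table, (a_idx, b_idx), 1)  # unbuffered increment per point
    return table

def mi_from_counts(table):
    """Mutual information (natural log) from a matrix of counts."""
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)  # shape (R, 1)
    col = table.sum(axis=0, keepdims=True)  # shape (1, C)
    expected = row * col                    # broadcasts to shape (R, C)
    nz = table > 0                          # empty cells contribute zero
    return float(np.sum(table[nz] / n * np.log(n * table[nz] / expected[nz])))

true_labels = np.array([0, 0, 1, 1, 1, 2, 2, 1, 1])
pred_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 1])

table = contingency_table(true_labels, pred_labels)
print(table)                  # rows: true classes, columns: predicted classes
print(mi_from_counts(table))  # about 0.7829
```

Only one off-diagonal cell is non-zero here (one point of true class 1 predicted as class 2), and it is the only term that enters the sum with a negative sign.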

In [1]:
import numpy as np
from sklearn.metrics import mutual_info_score

In [2]:
def mutual_information(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Calculate the mutual information score between true labels and predicted labels using sklearn's contingency_matrix.

    Parameters:
    y_true (ndarray): True labels.
    y_pred (ndarray): Predicted labels.

    Returns:
    float: Mutual information score.
    """
    from sklearn.metrics.cluster import contingency_matrix
    from numpy import log
    contingency = contingency_matrix(y_true, y_pred)
    ni = contingency.sum(axis=1)
    nj = contingency.sum(axis=0)
    N = contingency.sum()

    mi = 0.0
    for i in range(contingency.shape[0]):
        for j in range(contingency.shape[1]):
            n_ij = contingency[i, j]
            if n_ij:  # skip empty cells: 0 * log(0) is taken as 0
                mi += (n_ij / N) * log((N * n_ij) / (ni[i] * nj[j]))
    return mi

In [3]:
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 1])
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 1, 1])

In [4]:
assert abs(mutual_information(y_true, y_pred) - mutual_info_score(y_true, y_pred)) < 1e-5,\
            "Implemented MI is not the same as the scikit-learn one!"

In [5]:
# Check the symmetry property (with a small tolerance, since the summation order differs)
assert abs(mutual_information(y_true, y_pred) - mutual_information(y_pred, y_true)) < 1e-12,\
            "Implemented MI is not symmetric!"

# Differential mutual information

#### Now suppose we have predicted class probabilities instead of hard labels. The formula stays the same, but $p_X(x)$ can no longer be calculated as $\frac{|x|}{N}$. Instead it is calculated as $p_X(x) = \sum_i p_i \, p_i(x) = \sum_i \frac{p_i(x)}{N} = \mathbb{E}_i[ p_i(x) ]$, where $p_i = \frac{1}{N}$ is the probability of choosing point $i$ and $p_i(x)$ is the probability that point $i$ is predicted as class $X=x$. To estimate the joint distribution, we take the predicted label of each point to be the class with the highest probability.
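To make the difference between the two marginals concrete, here is a minimal sketch that compares the averaged ("soft") $p_X(x)$ with the counted ("hard") one. The function names `soft_marginal` and `hard_marginal` and the toy probabilities are assumptions made for illustration only; they do not appear elsewhere in this notebook.

```python
import numpy as np

def soft_marginal(probs):
    """p_X(x) as the mean predicted probability of class x,
    assuming every point is equally likely (p_i = 1/N)."""
    return np.asarray(probs, dtype=float).mean(axis=0)

def hard_marginal(probs):
    """p_X(x) as the fraction of points whose argmax is class x."""
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)
    return np.bincount(labels, minlength=probs.shape[1]) / len(labels)

# Hypothetical toy data: 4 points, 2 classes, each row sums to 1.
toy_probs = np.array([[0.8, 0.2],
                      [0.6, 0.4],
                      [0.3, 0.7],
                      [0.1, 0.9]])

print(soft_marginal(toy_probs))  # [0.45 0.55]
print(hard_marginal(toy_probs))  # [0.5 0.5]
```

The two marginals coincide when every row is one-hot, which is why the one-hot example in this notebook reproduces the plain mutual information score.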

In [6]:
def mutual_dif_information(y_true: np.ndarray, predicted_probs: np.ndarray) -> float:
    """
    Calculate the mutual differential information between true labels and predicted label probabilities.

    Parameters:
    y_true (ndarray of shape (N,)): True labels, one per point.
    predicted_probs (ndarray of shape (N, K)): Predicted probabilities of the K classes for each point.

    Returns:
    float: Differential Mutual Information score.
    """
    from numpy import log
    from sklearn.metrics.cluster import contingency_matrix
    
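    # Hard labels are used only to estimate the joint distribution.
    y_pred = np.argmax(predicted_probs, axis=1)
    # Normalize by the number of points instead of a hard-coded constant.
    p_xy = contingency_matrix(y_true, y_pred) / len(y_true)
    p_y = p_xy.sum(axis=1)
    # Soft marginal: mean predicted probability per class.
    # Note: columns of p_xy line up with p_x only if every class is predicted at least once.
    p_x = predicted_probs.mean(axis=0)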
    
    dmi = 0.0
    for y_label in range(p_xy.shape[0]):
        for x_label in range(p_xy.shape[1]):
            p = p_xy[y_label, x_label]
            if p:  # skip empty cells: 0 * log(0) is taken as 0
                dmi += p * log(p / (p_x[x_label] * p_y[y_label]))

    return dmi

In [7]:
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 1])
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 1, 1])
predicted_probs = np.array([[1., 0., 0.],
                            [1., 0., 0.],
                            [0., 1., 0.],
                            [0., 1., 0.],
                            [0., 1., 0.],
                            [0., 0., 1.],
                            [0., 0., 1.],
                            [0., 0., 1.],
                            [0., 1., 0.]], dtype = float)

In [8]:
mutual_dif_information(y_true, predicted_probs)

0.782855600747917

In [9]:
mutual_information(y_true, y_pred)

0.782855600747917

In [10]:
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 1, 1])
predicted_probs = np.array([[0.75, 0.20, 0.05],
                            [0.60, 0.20, 0.20],
                            [0.30, 0.45, 0.15],
                            [0.25, 0.50, 0.25],
                            [0.10, 0.50, 0.40],
                            [0.20, 0.35, 0.45],
                            [0.10, 0.05, 0.85],
                            [0.10, 0.10, 0.80],
                            [0.05, 0.90, 0.05]], dtype = float)

In [11]:
mutual_dif_information(y_true, predicted_probs)

0.8085289571597928

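Interesting fact: even when the most probable class of every point is unchanged (the argmax of these probabilities gives the same hard labels as `y_pred` above), the score differs (0.8085 vs. 0.7829) because $p_X(x)$ is now the mean predicted probability of class $x$ rather than the fraction of points labelled $x$.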
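The size of this gap can be worked out directly. Both scores use the same joint distribution and the same $p_Y(y)$, so subtracting them leaves only $\sum_x \hat{p}_X(x) \log \frac{\hat{p}_X(x)}{p_X(x)}$, where $\hat{p}_X$ is the hard (argmax-count) marginal and $p_X$ is the soft (mean-probability) marginal. This has the form of a KL divergence, and is exactly one when every row of probabilities sums to 1 (the third row of the example sums to 0.90, so here it is only of that form). The sketch below checks this on the example data, assuming every class is predicted at least once; the variable names are illustrative only.

```python
import numpy as np

# Probabilities copied from the example above.
predicted_probs = np.array([[0.75, 0.20, 0.05],
                            [0.60, 0.20, 0.20],
                            [0.30, 0.45, 0.15],
                            [0.25, 0.50, 0.25],
                            [0.10, 0.50, 0.40],
                            [0.20, 0.35, 0.45],
                            [0.10, 0.05, 0.85],
                            [0.10, 0.10, 0.80],
                            [0.05, 0.90, 0.05]])

hard_labels = predicted_probs.argmax(axis=1)
n_classes = predicted_probs.shape[1]

# Hard marginal: fraction of points assigned to each class by argmax.
hard_p_x = np.bincount(hard_labels, minlength=n_classes) / len(hard_labels)
# Soft marginal: mean predicted probability of each class.
soft_p_x = predicted_probs.mean(axis=0)

# Difference between the soft-marginal score and the plain MI score.
# Assumes no class has a zero hard marginal (otherwise log(0) appears).
gap = float(np.sum(hard_p_x * np.log(hard_p_x / soft_p_x)))
print(hard_p_x)  # fractions 2/9, 4/9, 3/9
print(soft_p_x)
print(gap)       # about 0.0257, i.e. 0.8085 - 0.7829
```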