# Pointwise Mutual Information (PMI) and Positive PMI (PPMI)
In statistics, probability theory, and information theory, **Pointwise Mutual Information (PMI)** is a measure of association between two events. It compares the probability of two events occurring together to the probability we would expect if the events were independent.

The formula for PMI is:
$$ \text{PMI}(x, y) = \log_2 \left( \frac{P(x, y)}{P(x) \cdot P(y)} \right), $$

where:
- $P(x)$ and $P(y)$ are the probabilities of seeing the words $x$ and $y$ in the corpus.
- $P(x, y)$ is the probability that $x$ and $y$ occur next to each other (i.e., co-occur as a word pair).


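If two words occur together more often than expected under independence, the ratio inside the logarithm exceeds 1 and the PMI is positive. If they co-occur less often than expected, the PMI is negative; if they are exactly independent, it is zero.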
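
To make the sign behaviour concrete, here is a minimal sketch that evaluates the formula directly from three probabilities. The function name `pointwise_mi` and the example probabilities are made up for illustration; they do not come from the corpus used later in this notebook.

```python
import math

def pointwise_mi(p_xy: float, p_x: float, p_y: float) -> float:
    """PMI in bits from a joint probability and two marginal probabilities."""
    if p_xy == 0:
        # log2(0) is undefined; treat it as negative infinity by convention.
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

# Illustrative values: under independence we would expect P(x, y) = 0.25 * 0.5 = 0.125.
more_than_chance = pointwise_mi(0.25, 0.25, 0.5)    # joint is twice the expected value
exactly_chance = pointwise_mi(0.125, 0.25, 0.5)     # joint equals the expected value
less_than_chance = pointwise_mi(0.0625, 0.25, 0.5)  # joint is half the expected value
print(more_than_chance, exactly_chance, less_than_chance)
```

Doubling the joint probability relative to the independence baseline adds one bit of PMI, and halving it subtracts one bit.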

In Natural Language Processing, PMI is used to identify meaningful word associations. For example, "natural" and "language" often appear together, so their PMI would be high.

To focus on strong, positive associations, we use **Positive PMI (PPMI)**, which sets all negative PMI values to zero:

$$ \text{PPMI}(x, y) = \max(\text{PMI}(x, y),\ 0) $$


This variant is widely used in NLP tasks such as building word embeddings and measuring semantic similarity.
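
As a rough illustration of how PPMI is applied to a whole co-occurrence table, the sketch below turns a small matrix of raw counts into a PPMI matrix using only the standard library. The helper name `ppmi_matrix` and the toy counts are assumptions made for this example; probabilities are estimated from row and column totals, which is one common choice but not the only one.

```python
import math

def ppmi_matrix(counts: list[list[int]]) -> list[list[float]]:
    """Convert a matrix of co-occurrence counts into PPMI scores."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]

    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                # PMI would be -inf here, which PPMI clips to zero.
                out_row.append(0.0)
                continue
            # P(x, y) / (P(x) * P(y)) simplifies to count * total / (row_sum * col_sum).
            pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(pmi, 0.0))
        result.append(out_row)
    return result

# Toy counts: each row word co-occurs only with "its own" column word.
diagonal = ppmi_matrix([[2, 0], [0, 2]])
# Toy counts: every pair is equally likely, so nothing is associated.
uniform = ppmi_matrix([[1, 1], [1, 1]])
print(diagonal, uniform)
```

Each row of the resulting matrix can then serve as a sparse vector representation of the corresponding word.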

### This Notebook
In this notebook, we perform a basic co-occurrence analysis on a small corpus. After tokenizing and removing stopwords, we count word frequencies and adjacent co-occurrences to compute word and pair probabilities. Using these, we calculate PMI and PPMI scores and rank word pairs by their associative strength.

### Preliminaries

This cell installs the required Python package and downloads the NLTK resources used below (the English stopword list and the `punkt_tab` tokenizer data).

In [1]:
!pip install -q nltk

import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True);

### Creating a Corpus

In [2]:
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize

stopwords = set(nltk.corpus.stopwords.words('english'))

def get_corpus_stats(corpus: list[str]) -> tuple[dict[str, float], dict[tuple[str, str], float]]:
    """Return word probabilities and adjacent-pair probabilities for the corpus."""
    word_count = defaultdict(int)
    cooccurrence_count = defaultdict(int)

    for document in corpus:
        # The stopword check is case-sensitive, so capitalized tokens such as "I"
        # are kept; punctuation tokens are kept as well.
        words = [w for w in word_tokenize(document) if w not in stopwords]
        for i, word in enumerate(words):
            word_count[word] += 1
            # Pair each word with its right-hand neighbour (after stopword removal).
            if i < len(words) - 1:
                pair = (word, words[i + 1])
                cooccurrence_count[pair] += 1
    total_tokens = sum(word_count.values())
    total_cooccurrences = sum(cooccurrence_count.values())

    word_prob = {w: c / total_tokens for w, c in word_count.items()}
    pair_prob = {ws: c / total_cooccurrences for ws, c in cooccurrence_count.items()}
    return word_prob, pair_prob

wordprob, coocprob = get_corpus_stats([
  "I love natural language processing.",
  "Deep learning is revolutionizing natural language processing.",
  "Natural language processing is a core part of artificial intelligence.",
  "Machine learning and natural language processing go hand in hand.",
  "I love deep learning and artificial intelligence.",
])
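
The adjacent-pair counting step can also be sketched without NLTK. The version below is a simplification, not the notebook's method: it assumes plain whitespace tokenization, skips stopword removal, and uses a hypothetical helper named `adjacent_pair_counts`.

```python
from collections import Counter

def adjacent_pair_counts(tokens: list[str]) -> Counter:
    """Count each (left, right) pair of neighbouring tokens."""
    # zip pairs every token with its successor; the last token has no successor.
    return Counter(zip(tokens, tokens[1:]))

tokens = "deep learning and deep learning models".split()
pairs = adjacent_pair_counts(tokens)
print(pairs.most_common(1))
```

A sequence of n tokens always yields n - 1 adjacent pairs, which is why the pair probabilities are normalized by the number of pairs rather than by the number of tokens.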


### Implementing PMI and PPMI

In [3]:
import math

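def pmi(x, y):
    # Pairs are counted in reading order, so fall back to the reversed pair
    # if (x, y) itself was never seen.
    p_xy = coocprob.get((x, y), coocprob.get((y, x), 0))
    if p_xy == 0:
        return float('-inf')  # PMI is undefined (or -inf) for zero co-occurrence
    return math.log2(p_xy / (wordprob[x] * wordprob[y]))

def ppmi(x, y):
    # Clip negative scores (including -inf) to zero.
    return max(0, pmi(x, y))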

### Demonstration

In [4]:
import itertools

def rank_by_pmi(words: list[str]):
    """Score every unordered pair of `words` with PPMI, highest score first."""
    scored = [(w1, w2, ppmi(w1, w2)) for w1, w2 in itertools.combinations(words, 2)]
    return sorted(scored, key=lambda item: item[2], reverse=True)

rank_by_pmi(["natural", "language", "processing", "deep", "learning"])[:10]

[('deep', 'learning', 3.8339442305367433),
 ('natural', 'language', 3.4189067312578993),
 ('language', 'processing', 3.4189067312578993),
 ('natural', 'learning', 2.2489817298155867),
 ('natural', 'processing', 0),
 ('natural', 'deep', 0),
 ('language', 'deep', 0),
 ('language', 'learning', 0),
 ('processing', 'deep', 0),
 ('processing', 'learning', 0)]
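
The two less obvious scores can be checked by hand. The sketch below recomputes them from counts traced manually through the five sentences. The counts are an assumption rather than notebook output: they presume that `word_tokenize` emits the final period as its own token and that capitalized tokens ("I", "Deep", "Natural", "Machine") survive the case-sensitive stopword filter. The helper `pmi_from_counts` is hypothetical.

```python
import math

# Hand-traced totals (assumptions, see the paragraph above).
TOTAL_TOKENS = 37  # tokens left after stopword removal, including "." and "I"
TOTAL_PAIRS = 32   # adjacent pairs summed over the five documents

def pmi_from_counts(pair_count: int, x_count: int, y_count: int) -> float:
    """PMI in bits from raw counts, normalized as in get_corpus_stats."""
    p_xy = pair_count / TOTAL_PAIRS
    p_x = x_count / TOTAL_TOKENS
    p_y = y_count / TOTAL_TOKENS
    return math.log2(p_xy / (p_x * p_y))

# Lowercase "deep" appears once, "learning" three times, and they are adjacent once.
deep_learning = pmi_from_counts(pair_count=1, x_count=1, y_count=3)
# Lowercase "natural" and "learning" each appear three times and are adjacent once.
natural_learning = pmi_from_counts(pair_count=1, x_count=3, y_count=3)
print(round(deep_learning, 4), round(natural_learning, 4))
```

Under these assumptions the values should agree with the first and fourth rows of the ranking. The positive score for ("natural", "learning") comes from the sentence "Machine learning and natural language processing go hand in hand.": once the stopword "and" is removed, "learning" and "natural" become neighbours. "deep" scores highest partly because "Deep" in the second sentence is counted as a different token, which leaves lowercase "deep" with a single occurrence.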