#### Term-Term/Word Co-occurrence Matrix

Given a corpus and vocabulary $V$, the word co-occurrence matrix is a $|V| \times |V|$ matrix whose $(i,j)$-th cell contains the frequency with which word $j$ appears in the context of word $i$. Word $j$ is called a `context word` and word $i$ is called a `center word`. The context of a center word is defined as a $\pm k$ word window around it, i.e. $k$ words to the left and $k$ words to the right of that center word. Each row of this matrix can then be interpreted as a $|V|$-dimensional embedding vector for a word from the vocabulary. For smaller context windows, the embedding vectors tend to capture more syntactic information/local context, while for larger context windows they tend to capture more global context.
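
As a minimal sketch of this definition, the snippet below builds the matrix for a toy six-word sentence with $k=1$. The sentence, the helper name `cooccurrence_matrix`, and the choice to truncate the window at the corpus boundaries (rather than pad) are illustrative assumptions, not part of the implementation used in the rest of this section.

```python
import numpy as np

def cooccurrence_matrix(words, vocab, k):
    """Count how often each vocab word appears within +/- k positions of each center word."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for pos, center in enumerate(words):
        # window boundaries, truncated at the start and end of the text
        lo = max(0, pos - k)
        hi = min(len(words), pos + k + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # skip the center word itself
                M[word2idx[center], word2idx[words[ctx_pos]]] += 1
    return M

toy_words = "the cat sat on the mat".split()
toy_vocab = sorted(set(toy_words))  # ['cat', 'mat', 'on', 'sat', 'the']
M = cooccurrence_matrix(toy_words, toy_vocab, k=1)
print(toy_vocab)
print(M)
```

Because the window is symmetric, the resulting count matrix is symmetric as well. The row for `the` records one co-occurrence each with `cat`, `mat`, and `on`, and none with `sat`.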

We will use the Brown corpus again to create a co-occurrence matrix and look at some properties of the resulting word embeddings.

In [1]:
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
nltk.download('stopwords')
import numpy as np
from collections import Counter

stop_words = set(stopwords.words('english'))

def check_punc(w):
    return any(c.isalpha() for c in w)

# remove punctuation and stopwords from a list of words and apply lowercase folding
def preprocess(s):
    words = [w.lower() for w in s if check_punc(w)]
    words = [w for w in words if w not in stop_words]
    return words

# preprocess the corpus (remove punctuation and stopwords, lowercase folding)
corpus_words = preprocess(brown.words())
print(f"Num words in corpus: {len(corpus_words)}") 

[nltk_data] Downloading package stopwords to /home/tanzid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Num words in corpus: 530090


In [2]:
word_counts = Counter(corpus_words)

# now let's create the vocabulary (keep the 8000 most common words)
vocab = word_counts.most_common(8000)
vocab = sorted([v[0] for v in vocab])

In [3]:
# context window half-size
k = 20 
corpus_length = len(corpus_words)

# insert a special padding token so that the context window fits at the beginning and end of the corpus text,
# and an unknown token to stand in for out-of-vocabulary words
pad_token = "<PAD>"
unk_token = "<UNK>"
vocab = [pad_token, unk_token] + vocab
word2idx = {w:i for i,w in enumerate(vocab)}
print(f"Vocab size: {len(vocab)}")

# adding padding to corpus
corpus_words = [pad_token]*k + corpus_words + [pad_token]*k
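# replace out-of-vocabulary words in the corpus with the unknown token
# (a dict lookup is O(1), unlike a membership test on the vocab list)
corpus_words = [w if w in word2idx else unk_token for w in corpus_words]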

Vocab size: 8002


In [4]:
# now let's create and populate the term-term matrix
V = len(vocab)
T = np.zeros(shape=(V,V))

# slide the context window over the corpus and accumulate counts
for i in range(corpus_length):
    window = corpus_words[i:i+1+2*k]
    center_word = window[k]
    context_words = window[:k] + window[k+1:]
    # accumulate context word counts
    for context_word in context_words:
        T[word2idx[center_word], word2idx[context_word]] += 1

In [5]:
# convert counts to log counts
T = np.log10(T+1)

In [6]:
# computes the cosine similarity between two word embedding vectors
def cosine_similarity(w1,w2):
    similarity_score = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return similarity_score 

In [9]:
"pie" in vocab

True

In [11]:
word1 = "pie"
word2 = "train"
word3 = "sugar"

w1 = T[word2idx[word1]]
w2 = T[word2idx[word2]]
w3 = T[word2idx[word3]]
similarity_12 = cosine_similarity(w1, w2)
similarity_13 = cosine_similarity(w1, w3)

print(f"Similarity score between '{word1}' and '{word2}' = {similarity_12}")
print(f"Similarity score between '{word1}' and '{word3}' = {similarity_13}")

Similarity score between 'pie' and 'train' = 0.2430828364039373
Similarity score between 'pie' and 'sugar' = 0.2839855167816917


In [8]:
word1 = "population"
word2 = "crowd"
word3 = "unhappy"

w1 = T[word2idx[word1]]
w2 = T[word2idx[word2]]
w3 = T[word2idx[word3]]
similarity_12 = cosine_similarity(w1, w2)
similarity_13 = cosine_similarity(w1, w3)

print(f"Similarity score between '{word1}' and '{word2}' = {similarity_12}")
print(f"Similarity score between '{word1}' and '{word3}' = {similarity_13}")

Similarity score between 'population' and 'crowd' = 0.3537778241475392
Similarity score between 'population' and 'unhappy' = 0.31288833385673875


In [12]:
word1 = "occupation"
word2 = "job"
word3 = "horse"

w1 = T[word2idx[word1]]
w2 = T[word2idx[word2]]
w3 = T[word2idx[word3]]
similarity_12 = cosine_similarity(w1, w2)
similarity_13 = cosine_similarity(w1, w3)

print(f"Similarity score between '{word1}' and '{word2}' = {similarity_12}")
print(f"Similarity score between '{word1}' and '{word3}' = {similarity_13}")

Similarity score between 'occupation' and 'job' = 0.3860584884524287
Similarity score between 'occupation' and 'horse' = 0.3155096152321899


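In each of the three examples above, the semantically related pair receives a higher similarity score than the unrelated pair, so the embedding vectors seem reasonable, although the margins between the scores are small.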
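
Pairwise spot checks like these only go so far. One way to probe the embeddings more broadly is to rank the whole vocabulary by cosine similarity to a query word. The sketch below assumes a hypothetical helper `nearest_neighbors` and uses a tiny made-up embedding matrix so that it runs on its own; with the notebook's variables one would pass `T`, `vocab`, and `word2idx` instead.

```python
import numpy as np

def nearest_neighbors(query, E, vocab, word2idx, top_n=3):
    """Return the top_n vocab words whose rows in E are most cosine-similar to the query row."""
    q = E[word2idx[query]]
    norms = np.linalg.norm(E, axis=1) * np.linalg.norm(q)
    # guard against all-zero rows, which would otherwise cause a division by zero
    sims = np.divide(E @ q, norms, out=np.zeros(len(vocab)), where=norms > 0)
    sims[word2idx[query]] = -np.inf  # exclude the query word itself
    best = np.argsort(-sims)[:top_n]
    return [(vocab[i], float(sims[i])) for i in best]

# made-up 4-word vocabulary and embeddings, for illustration only
toy_vocab = ["pie", "sugar", "train", "rail"]
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}
toy_E = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
])
neighbors = nearest_neighbors("pie", toy_E, toy_vocab, toy_word2idx, top_n=2)
print(neighbors)
```

In this toy matrix, `sugar` ranks first for the query `pie` because their rows overlap the most, followed by `rail`.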

An improvement over this vanilla co-occurrence matrix implementation is to use `Positive Point-wise Mutual Information` instead of raw co-occurrence counts. In the vanilla co-occurrence matrix, the $(i,j)$-th cell contains the value $f_{i,j} = count(w_i, w_j)$, which is the number of times context word $j$ occurs in the window of center word $i$. Instead of these raw counts, consider the pointwise mutual information:

$PMI(w_i, w_j) = \log_2 \frac{P(w_i,w_j)}{P(w_i)P(w_j)}$

where $P(w_i,w_j) = \frac{f_{i,j}}{N}$, $P(w_i) = \frac{\sum_j f_{i,j}}{N}$, $P(w_j) = \frac{\sum_i f_{i,j}}{N}$, and $N = \sum_{i',j'} f_{i',j'}$ is the sum of all counts in the matrix.

To understand the meaning of the PMI, we first note that $P(w_i,w_j)$ denotes the joint probability of the two words $w_i$ and $w_j$ co-occurring. If these two words occurred independently of each other, their joint probability would be $P(w_i)P(w_j)$. The PMI therefore compares how often the two words actually occur together with how often they would occur together by chance if they were independent. A high PMI indicates that the words co-occur more often than chance, while a PMI of zero indicates that the words occur independently. Negative values indicate that the words co-occur less often than chance; these tend to be unreliable estimates unless the corpus is very large, so we clip negative PMI values to zero. This is called the Positive PMI:

$PPMI(w_i, w_j) = \max\left(0, \log_2 \frac{P(w_i,w_j)}{P(w_i)P(w_j)}\right)$
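
To make the definition concrete, here is a small sketch that computes PPMI for a made-up $2 \times 2$ count matrix. The counts and the helper name `ppmi` are assumptions for illustration. No smoothing is applied here, so cells with a zero count are simply assigned a PPMI of 0.

```python
import numpy as np

def ppmi(f):
    """Compute PPMI from a matrix of co-occurrence counts f (rows: center words, cols: context words)."""
    f = np.asarray(f, dtype=float)
    P_ij = f / f.sum()                      # joint probabilities
    P_i = P_ij.sum(axis=1, keepdims=True)   # center-word marginals
    P_j = P_ij.sum(axis=0, keepdims=True)   # context-word marginals
    expected = P_i * P_j                    # joint probability under independence
    pmi = np.zeros_like(P_ij)
    mask = P_ij > 0                         # avoid log(0) for unseen pairs
    pmi[mask] = np.log2(P_ij[mask] / expected[mask])
    return np.maximum(pmi, 0.0)             # clip negative PMI values to zero

toy_f = np.array([[6, 2],
                  [2, 6]])
toy_ppmi = ppmi(toy_f)
print(toy_ppmi)
```

Here the diagonal cells co-occur 1.5 times as often as independence would predict, giving a PMI of $\log_2 1.5 \approx 0.585$, while the off-diagonal cells have a PMI of $\log_2 0.5 = -1$, which is clipped to 0.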

In [26]:
f = np.zeros(shape=(V,V))

# slide the context window over the corpus and accumulate counts
for i in range(corpus_length):
    window = corpus_words[i:i+1+2*k]
    center_word = window[k]
    context_words = window[:k] + window[k+1:]
    # accumulate context word counts
    for context_word in context_words:
        f[word2idx[center_word], word2idx[context_word]] += 1

In [27]:
# add-1 smoothing to avoid zero probabilities
f += 1

f_sum = f.sum()
P_ij = f / f_sum
P_i = P_ij.sum(axis=1, keepdims=True)
P_j = P_ij.sum(axis=0, keepdims=True)
# PMI = log2( P(w_i,w_j) / (P(w_i) P(w_j)) )
PPMI = np.log2((P_ij / P_i) / P_j)
# clip negative PMI values to zero to obtain PPMI
PPMI = np.maximum(PPMI, 0)

In [36]:
word1 = "pie"
word2 = "sugar"
word3 = "honest"

w1 = PPMI[word2idx[word1]]
w2 = PPMI[word2idx[word2]]
w3 = PPMI[word2idx[word3]]
similarity_12 = cosine_similarity(w1, w2)
similarity_13 = cosine_similarity(w1, w3)

print(f"Similarity score between '{word1}' and '{word2}' = {similarity_12}")
print(f"Similarity score between '{word1}' and '{word3}' = {similarity_13}")

Similarity score between 'pie' and 'sugar' = 0.28617411789651437
Similarity score between 'pie' and 'honest' = 0.15642863473097238
