Word Co-occurrence Matrix
=========================
Given a list of sentences and a window size that defines co-occurrence, we can build a symmetric matrix that records the co-occurrence frequency of each word pair. Let's store it as a pandas data frame, so we can use the words as row and column labels.

Below is a plain Python implementation that does not rely on any NLP packages. 

In [1]:
from collections import defaultdict
import numpy as np
import pandas as pd

## Define a co-occurrence function

The function takes a list of sentences and an integer window size, then returns a pandas data frame of word-pair co-occurrence frequencies. Two tokens co-occur when they are at most `window_size` positions apart within the same sentence.

In [13]:
def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # naive preprocessing: lowercase and split on whitespace
        # (swap in a proper tokenizer for real text)
        text = text.lower().split()
        # iterate over the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1
    
    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int64),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
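
To see how the window size changes which pairs are counted, here is a minimal sketch that repeats only the counting loop from the function above on a single sentence. The helper name `window_pairs` is hypothetical and used for illustration only; it returns raw pair counts rather than a data frame.

```python
from collections import Counter


def window_pairs(tokens, window_size):
    """Count unordered token pairs that are at most `window_size` positions apart.

    Hypothetical helper: it mirrors the inner loop of `co_occurrence`.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        # look ahead at the next `window_size` tokens only
        for neighbour in tokens[i + 1 : i + 1 + window_size]:
            counts[tuple(sorted((token, neighbour)))] += 1
    return counts


tokens = "it was the best of times".lower().split()
narrow = window_pairs(tokens, 1)   # only adjacent tokens are paired
wide = window_pairs(tokens, 20)    # window longer than the sentence: every pair counts

print(len(narrow), len(wide))
```

With a window of 1 only neighbouring tokens are paired, so `("it", "the")` is absent from `narrow`. A window larger than the sentence pairs every token with every other token.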

## Load Data and call function

In [14]:
text = ["It was the best of times",
"it was the worst of times",
"it was the age of wisdom",
"it was the age of foolishness"]

df = co_occurrence(text, 20) 
df

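| | age | best | foolishness | it | of | the | times | was | wisdom | worst |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 2 | 1 | 0 |
| best | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| foolishness | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| it | 2 | 1 | 1 | 0 | 4 | 4 | 2 | 4 | 1 | 1 |
| of | 2 | 1 | 1 | 4 | 0 | 4 | 2 | 4 | 1 | 1 |
| the | 2 | 1 | 1 | 4 | 4 | 0 | 2 | 4 | 1 | 1 |
| times | 0 | 1 | 0 | 2 | 2 | 2 | 0 | 2 | 0 | 1 |
| was | 2 | 1 | 1 | 4 | 4 | 4 | 2 | 0 | 1 | 1 |
| wisdom | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| worst | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |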


Take the second sentence, `text[1]`, as an example. How many tokens does it contain?

In [15]:
len(text[1])

25

Why 25? Applied to a string, `len` counts characters (spaces included), not tokens. To count tokens without the aid of a tokeniser, we need at least to split the string on whitespace. In this case, we also convert everything to lower case.

:::{tip}
tokenized_text = text[1].lower().split()
:::

Work out the number of tokens now.
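
As a quick check, the sketch below compares the character count with the token count for the second sentence. It assumes the same whitespace splitting used in `co_occurrence`; a real tokeniser could give a different count on text with punctuation.

```python
text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

# lowercase, then split on whitespace, as in co_occurrence
tokenized_text = text[1].lower().split()

print(len(text[1]))         # number of characters, spaces included
print(len(tokenized_text))  # number of whitespace-separated tokens
```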