### `Term Document Matrix`

Given $D$ different documents containing text and a vocabulary $V$, the term-document matrix is a $|V| \times D$ matrix whose $(i,j)$th cell contains the frequency with which word $i$ from the vocabulary occurs in the $j$th document. Each row of this matrix can then be interpreted as a $D$-dimensional embedding vector for a word from the vocabulary. We can measure how similar two words are by computing the cosine similarity between their embedding vectors.
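
For reference, the cosine similarity between two embedding vectors can be written as follows, where $u$ and $v$ are symbols introduced here purely for illustration and stand for two rows of the matrix:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{j=1}^{D} u_j v_j}{\sqrt{\sum_{j=1}^{D} u_j^2}\,\sqrt{\sum_{j=1}^{D} v_j^2}}$$

Because all entries of the matrix are non-negative, this score lies between 0 (no shared documents) and 1 (identical direction).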

We make two additional improvements:
1) Because raw frequencies can vary over a large range, we keep track of $\log(\text{count}+1)$ instead of raw counts (add-1 smoothing avoids $\log(0)$).
2) We weight each count by the inverse document frequency (idf), which down-weights words that occur in many documents.

So each cell in the matrix is computed as: $tf_{i,j} \times idf_i$

where $tf_{i,j} = \log_{10}(count(i,j) + 1)$ and $idf_i = \log_{10}(D/df_i)$, with $df_i$ the number of documents in which word $i$ appears.
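
As a small worked example of these two formulas, the sketch below applies them to a made-up count matrix. The values and the names `counts`, `num_docs`, `df`, `tf`, `idf`, and `tfidf` are illustrative assumptions, not part of the Brown corpus pipeline.

```python
import numpy as np

# Toy count matrix: 3 words (rows) x 4 documents (columns); values are made up.
counts = np.array([
    [10, 0, 3, 0],   # word 0 appears in 2 documents
    [ 1, 1, 1, 1],   # word 1 appears in every document
    [ 0, 0, 0, 7],   # word 2 appears in 1 document
], dtype=float)

num_docs = counts.shape[1]
df = (counts > 0).sum(axis=1)        # document frequency of each word
tf = np.log10(counts + 1)            # add-1 smoothed log counts
idf = np.log10(num_docs / df)        # log inverse document frequency
tfidf = tf * idf[:, np.newaxis]      # scale each row by its word's idf

print(tfidf.round(3))
```

Note that a word occurring in every document (word 1 here) gets $idf = \log_{10}(1) = 0$, so its entire row becomes zero under this weighting.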


We will use the Brown corpus from NLTK to create a TD matrix and look at some properties of the resulting word embeddings. 

In [39]:
import nltk
from nltk.corpus import brown
import numpy as np

In [22]:
# a token counts as a word if it contains at least one alphabetic character
def check_punc(w):
    return any(c.isalpha() for c in w)

# drop punctuation-only tokens from a list of words and lowercase the rest
def remove_punc(s):
    return [w.lower() for w in s if check_punc(w)]
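
To make the filtering behaviour concrete, here is a quick check on a hand-written token list. The two functions are repeated so the snippet runs on its own, and `sample_tokens` is a made-up input rather than text from the corpus.

```python
def check_punc(w):
    return any(c.isalpha() for c in w)

def remove_punc(s):
    return [w.lower() for w in s if check_punc(w)]

# Hypothetical tokens covering punctuation, digits, and mixed cases
sample_tokens = ["The", ",", "U.S.", "--", "1961", "don't", "``"]
cleaned = remove_punc(sample_tokens)
print(cleaned)  # ['the', 'u.s.', "don't"]
```

One side effect worth keeping in mind is that purely numeric tokens such as `"1961"` are dropped as well, since they contain no alphabetic character.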


In [31]:
# we will treat all text from a particular category/genre as a single document, then each genre will be a dimension of our word vectors
categories = brown.categories()
category_words = {}
for category in categories:
    print(f"Category: {category}")
    # now we will store the words from all documents across each category in separate lists
    category_words[category] = []
    for fileid in brown.fileids(categories=category):
        category_words[category].extend(remove_punc(brown.words(fileids=fileid)))
    print(f"Num words in category: {len(category_words[category])}")    


Category: adventure
Num words in category: 58372
Category: belles_lettres
Num words in category: 151548
Category: editorial
Num words in category: 54087
Category: fiction
Num words in category: 58296
Category: government
Num words in category: 61143
Category: hobbies
Num words in category: 71552
Category: humor
Num words in category: 18265
Category: learned
Num words in category: 159940
Category: lore
Num words in category: 96695
Category: mystery
Num words in category: 48174
Category: news
Num words in category: 87004
Category: religion
Num words in category: 34308
Category: reviews
Num words in category: 35088
Category: romance
Num words in category: 58612
Category: science_fiction
Num words in category: 12035


In [121]:
# now let's create the vocabulary (sorted set of all words across categories)
vocab = sorted({w for words in category_words.values() for w in words})
word2idx = {w:i for i,w in enumerate(vocab)}
print(f"Vocab size: {len(vocab)}")

categories = sorted(categories)
cats2idx = {c:i for i,c in enumerate(categories)}

Vocab size: 48052


In [95]:
# now let's create and populate the term-document matrix
V = len(vocab)
D = len(categories)
T = np.zeros(shape=(V,D))

# scan through documents/categories and accumulate counts
for j, category in enumerate(categories):
    for word in category_words[category]:
        T[word2idx[word], j] += 1
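
The counting loop above touches the matrix once per token. A possible alternative, sketched here on a tiny made-up corpus, is to count each document with `collections.Counter` first and write each cell once. All `toy_*` names are assumptions for illustration only.

```python
from collections import Counter
import numpy as np

# Made-up stand-in for category_words: category name -> list of tokens
toy_docs = {
    "news": ["the", "court", "ruled", "the", "case"],
    "mystery": ["the", "ghost", "ghost", "case"],
}
toy_categories = sorted(toy_docs)
toy_vocab = sorted({w for words in toy_docs.values() for w in words})
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

toy_T = np.zeros((len(toy_vocab), len(toy_categories)))
for j, category in enumerate(toy_categories):
    # Counter gives one (word, count) pair per distinct word in the document
    for word, count in Counter(toy_docs[category]).items():
        toy_T[toy_word2idx[word], j] = count

print(toy_vocab)
print(toy_T)
```

Both approaches should produce the same counts; the `Counter` version simply does fewer array writes.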


In [102]:
# compute inverse document frequency; with only D = 15 documents we use the raw ratio D/df
# (with a large number of documents we would take its log, as in the formula above)
idf = D / (T > 0).sum(axis=1, keepdims=True)
# convert to log counts
T = np.log10(T+1)
# multiply log counts by idf
T = idf * T

#### Now that we've computed our term-document matrix, let's look at some word similarities

In [54]:
# computes the cosine similarity between two embedding vectors (undefined if either vector is all zeros)
def cosine_similarity(w1,w2):
    similarity_score = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return similarity_score 

In [122]:
word1 = "mysterious"
word2 = "judicial"
word3 = "court"

w1 = T[word2idx[word1]]
w2 = T[word2idx[word2]]
w3 = T[word2idx[word3]]
similarity_12 = cosine_similarity(w1, w2)
similarity_13 = cosine_similarity(w1, w3)

print(f"Similarity score between '{word1}' and '{word2}' = {similarity_12}")
print(f"Similarity score between '{word1}' and '{word3}' = {similarity_13}")

Similarity score between 'mysterious' and 'judicial' = 0.6879424352638849
Similarity score between 'mysterious' and 'court' = 0.6549556896017964


#### Note that these word embeddings don't seem to make much sense: 'mysterious' is rated as fairly similar (about 0.69 and 0.65) to both 'judicial' and 'court', which are unrelated to it in meaning. This is not surprising, because the document categories overlap significantly, so the features of these word embeddings do a poor job of capturing word semantics.
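
A single pair of scores gives only a narrow view of an embedding. One way to probe it more broadly is to rank every word by its similarity to a query word. The helper `most_similar` below is a hypothetical sketch, not part of the original notebook, and is demonstrated on a made-up matrix rather than on `T`.

```python
import numpy as np

def most_similar(matrix, vocab, query, k=3):
    """Return the k words whose rows are most cosine-similar to the query word's row."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0                    # avoid dividing by zero for all-zero rows
    unit = matrix / norms[:, np.newaxis]       # unit-length rows
    scores = unit @ unit[word2idx[query]]      # cosine similarity to the query
    order = np.argsort(-scores)                # highest score first
    ranked = [(vocab[i], float(scores[i])) for i in order if vocab[i] != query]
    return ranked[:k]

# Made-up 4-word, 3-document matrix for illustration
toy_vocab = ["court", "judge", "ghost", "spooky"]
toy_matrix = np.array([
    [3.0, 0.1, 0.0],
    [2.5, 0.2, 0.1],
    [0.0, 0.2, 2.0],
    [0.1, 0.0, 1.5],
])
neighbours = most_similar(toy_matrix, toy_vocab, "court", k=2)
print(neighbours)
```

With the real data, the same idea would be applied as `most_similar(T, vocab, "court")`, assuming `T` and `vocab` are defined as in the cells above.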

#### The columns of the matrix can be interpreted as a vector representation for each document category, i.e. a "document embedding". Let's check the similarity between these document embedding vectors.

In [125]:
doc1 = "mystery"
doc2 = "romance"
doc3 = "religion"

d1 = T[:,cats2idx[doc1]]
d2 = T[:,cats2idx[doc2]]
d3 = T[:,cats2idx[doc3]]
similarity_12 = cosine_similarity(d1, d2)
similarity_13 = cosine_similarity(d1, d3)

print(f"Similarity score between '{doc1}' and '{doc2}' = {similarity_12}")
print(f"Similarity score between '{doc1}' and '{doc3}' = {similarity_13}")

Similarity score between 'mystery' and 'romance' = 0.013699794367111637
Similarity score between 'mystery' and 'religion' = 0.004418250824928202


In [126]:
doc1 = "editorial"
doc2 = "news"
doc3 = "science_fiction"

d1 = T[:,cats2idx[doc1]]
d2 = T[:,cats2idx[doc2]]
d3 = T[:,cats2idx[doc3]]
similarity_12 = cosine_similarity(d1, d2)
similarity_13 = cosine_similarity(d1, d3)

print(f"Similarity score between '{doc1}' and '{doc2}' = {similarity_12}")
print(f"Similarity score between '{doc1}' and '{doc3}' = {similarity_13}")

Similarity score between 'editorial' and 'news' = 0.020000326299592922
Similarity score between 'editorial' and 'science_fiction' = 0.004505660960214556


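#### These document representations seem to do a reasonable job of capturing the semantics of each category: although the absolute scores are small, 'mystery' is about three times more similar to 'romance' than to 'religion', and 'editorial' is more than four times more similar to 'news' than to 'science_fiction'.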
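
Rather than comparing categories one pair at a time, all pairwise document similarities can be computed in one step by normalising the columns and taking a matrix product. The function `pairwise_column_cosine` is a hypothetical helper, shown here on a made-up matrix.

```python
import numpy as np

def pairwise_column_cosine(matrix):
    """Cosine similarity between every pair of columns of `matrix`."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero columns as zeros
    unit = matrix / norms            # broadcast: each column scaled to unit length
    return unit.T @ unit             # entry (a, b) is cos(column a, column b)

# Made-up 3-word, 3-document matrix: columns 0 and 1 are parallel, column 2 is orthogonal
toy = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 2.0, 0.0],
])
sims = pairwise_column_cosine(toy)
print(sims.round(3))
```

Applied to the notebook's matrix, `pairwise_column_cosine(T)` would give a $D \times D$ table whose rows and columns follow the order of `categories`, assuming `T` is the weighted matrix computed earlier.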