# Latent Dirichlet Allocation

Mathematically, LDA assumes the following generative process for each document $\boldsymbol{w}$ in a corpus $D$:

1. Choose $N \sim$ Poisson$(\xi)$. $N$ is the number of words in the document.
2. Choose $\theta \sim \operatorname{Dir}(\alpha)$.
3. For each of the $N$ words $w_{n}$:
    1. Choose a topic $z_{n} \sim$ Multinomial$(\theta)$.
    2. Choose a word $w_{n}$ from $p\left(w_{n} | z_{n}, \beta\right)$, a multinomial probability conditioned on the topic $z_{n}$.
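
The steps above can be sketched directly in NumPy. This is a minimal illustration, not part of the inference code that follows. The function name `generate_document` and the toy values for `alpha`, `beta`, and `xi` are assumptions chosen for the example. Here `beta` is taken to be a $K \times V$ matrix whose row $k$ is topic $k$'s distribution over word IDs.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document from the generative process (hypothetical helper).

    alpha : (K,) Dirichlet parameter for the document's topic mixture
    beta  : (K, V) matrix; row k is topic k's distribution over V word IDs
    xi    : Poisson rate for the document length
    """
    n_words = rng.poisson(xi)                                # step 1: N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                             # step 2: theta ~ Dir(alpha)
    topics = rng.choice(len(alpha), size=n_words, p=theta)   # step 3.1: z_n ~ Multinomial(theta)
    words = np.array(                                        # step 3.2: w_n ~ p(w | z_n, beta)
        [rng.choice(beta.shape[1], p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words

rng = np.random.default_rng(0)
toy_alpha = np.ones(2)                       # assumed: K = 2, symmetric prior
toy_beta = np.array([[0.5, 0.5, 0.0, 0.0],   # assumed: topic 0 only emits word IDs 0 and 1
                     [0.0, 0.0, 0.5, 0.5]])  # assumed: topic 1 only emits word IDs 2 and 3
theta, topics, words = generate_document(toy_alpha, toy_beta, xi=6, rng=rng)
print(theta, topics, words)
```

Because the two toy topics have disjoint vocabularies, every sampled word ID reveals which topic produced it, which makes the sketch easy to sanity-check by eye.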

In [14]:
import numpy as np

In [1]:
## Generate a corpus
rawdocs = ['eat turkey on turkey day holiday',
          'i like to eat cake on holiday',
          'turkey trot race on thanksgiving holiday',
          'snail race the turtle',
          'time travel space race',
          'movie on thanksgiving',
          'movie at air and space museum is cool movie',
          'aspiring movie star']

In [4]:
docs = [rawdoc.split(' ') for rawdoc in rawdocs]
docs

[['eat', 'turkey', 'on', 'turkey', 'day', 'holiday'],
 ['i', 'like', 'to', 'eat', 'cake', 'on', 'holiday'],
 ['turkey', 'trot', 'race', 'on', 'thanksgiving', 'holiday'],
 ['snail', 'race', 'the', 'turtle'],
 ['time', 'travel', 'space', 'race'],
 ['movie', 'on', 'thanksgiving'],
 ['movie', 'at', 'air', 'and', 'space', 'museum', 'is', 'cool', 'movie'],
 ['aspiring', 'movie', 'star']]

In [5]:
# PARAMETERS
K = 2 # number of topics
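ALPHA = 1 # symmetric Dirichlet prior on the document-topic distributions (one value shared by all topics); higher values spread each document across more topics
ETA = 0.001 # symmetric Dirichlet prior on the topic-word distributions; small values favor topics concentrated on few words
ITERATIONS = 3 # number of collapsed Gibbs sampling sweeps; use far more than 3 in practice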
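
To get a feel for what the concentration value of a symmetric Dirichlet prior does, the sketch below draws topic mixtures at three concentrations and reports how dominant the largest topic is on average. The values 0.1, 1.0, and 100.0 and the names `n_topics` and `dominance` are assumptions for illustration only; they are not settings used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 2  # same number of topics as K above

dominance = {}
for concentration in (0.1, 1.0, 100.0):
    # 1000 topic mixtures drawn from a symmetric Dirichlet
    theta = rng.dirichlet([concentration] * n_topics, size=1000)
    # average weight of the largest topic: near 1.0 means one topic dominates,
    # near 1 / n_topics means the document is split evenly
    dominance[concentration] = theta.max(axis=1).mean()

for concentration, value in dominance.items():
    print(f"concentration={concentration}: mean largest topic weight = {value:.3f}")
```

Smaller concentrations should give mixtures dominated by a single topic, and larger ones should give mixtures close to an even split.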

In [8]:
## Assign WordIDs to each unique word
unique_words = set()

for doc in docs:
    for word in doc:
        unique_words.add(word)
        
print(unique_words)

{'aspiring', 'holiday', 'turtle', 'star', 'time', 'movie', 'trot', 'turkey', 'air', 'like', 'eat', 'day', 'travel', 'snail', 'at', 'to', 'cool', 'the', 'cake', 'i', 'is', 'and', 'on', 'thanksgiving', 'space', 'museum', 'race'}


In [10]:
vocab = {}

index = 0

for word in unique_words:
    vocab[word] = index
    index += 1
    
print(vocab)

{'aspiring': 0, 'holiday': 1, 'turtle': 2, 'star': 3, 'time': 4, 'movie': 5, 'trot': 6, 'turkey': 7, 'air': 8, 'like': 9, 'eat': 10, 'day': 11, 'travel': 12, 'snail': 13, 'at': 14, 'to': 15, 'cool': 16, 'the': 17, 'cake': 18, 'i': 19, 'is': 20, 'and': 21, 'on': 22, 'thanksgiving': 23, 'space': 24, 'museum': 25, 'race': 26}


In [13]:
## Replace words in documents with wordIDs
indexed_doc = []

for doc in docs:
    indexed_doc.append([vocab[word] for word in doc])
    
indexed_doc

[[10, 7, 22, 7, 11, 1],
 [19, 9, 15, 10, 18, 22, 1],
 [7, 6, 26, 22, 23, 1],
 [13, 26, 17, 2],
 [4, 12, 24, 26],
 [5, 22, 23],
 [5, 14, 8, 21, 24, 25, 20, 16, 5],
 [0, 5, 3]]
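
A quick way to check an encoding like this is to map the IDs back to words and compare with the input. The sketch below does that on a two-document toy corpus. It also sorts the vocabulary before numbering it, which is a swapped-in variation rather than what the cells above do: iterating a `set` of strings can yield a different order in each interpreter session, so the IDs printed above are not guaranteed to repeat. The names `toy_docs`, `toy_vocab`, and `id_to_word` are hypothetical.

```python
toy_docs = [['eat', 'turkey', 'on', 'turkey', 'day', 'holiday'],
            ['snail', 'race', 'the', 'turtle']]

# sorted() fixes the ID order so the mapping is the same on every run
unique = sorted({word for doc in toy_docs for word in doc})
toy_vocab = {word: idx for idx, word in enumerate(unique)}

# reverse lookup: word ID -> word
id_to_word = {idx: word for word, idx in toy_vocab.items()}

encoded = [[toy_vocab[word] for word in doc] for doc in toy_docs]
decoded = [[id_to_word[idx] for idx in doc] for doc in encoded]

print(encoded)
print(decoded == toy_docs)  # round trip should recover the original tokens
```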

In [15]:
# wt: word-topic count matrix; wt[k, v] counts how often word ID v is assigned to topic k
wt = np.zeros(shape=(K, len(vocab)))

In [16]:
wt

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [17]:
# ta: topic assignment list; ta[d][n] holds the topic assigned to the n-th word of document d
ta = [np.zeros(len(doc)) for doc in docs]

In [18]:
ta

[array([0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0.]),
 array([0., 0., 0., 0.]),
 array([0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0.])]

In [19]:
# dt: document-topic count matrix; dt[d, k] counts the words in document d assigned to topic k
dt = np.zeros(shape=(len(docs),K))
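
The three structures `wt`, `ta`, and `dt` have to stay consistent with one another: every topic assignment contributes one count to the word-topic matrix and one to the document-topic matrix. A common way to start a collapsed Gibbs sampler is to assign each word a random topic and fill the counts accordingly. The sketch below shows that on a toy corpus. It is an assumption about the initialization, not code taken from this notebook, and it uses integer arrays and hypothetical names (`word_topic`, `doc_topic`, `assignments`) so that topics can be used directly as indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 2
toy_indexed = [[0, 1, 2, 1], [3, 2, 4]]   # assumed toy corpus of word IDs
vocab_size = 5

word_topic = np.zeros((n_topics, vocab_size), dtype=int)        # plays the role of wt
doc_topic = np.zeros((len(toy_indexed), n_topics), dtype=int)   # plays the role of dt
assignments = [np.zeros(len(doc), dtype=int) for doc in toy_indexed]  # plays the role of ta

for d, doc in enumerate(toy_indexed):
    for n, word_id in enumerate(doc):
        topic = rng.integers(n_topics)    # uniform random topic in [0, n_topics)
        assignments[d][n] = topic
        word_topic[topic, word_id] += 1   # one more occurrence of this word in this topic
        doc_topic[d, topic] += 1          # one more word of this document in this topic

print(word_topic)
print(doc_topic)
```

Whatever topics are drawn, each row of `doc_topic` sums to the length of its document, and each column of `word_topic` sums to the number of times that word ID occurs in the corpus.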