# Latent Dirichlet Allocation Tutorial

###### using Gibbs Sampling


## Author: Yifan Wang @ July 2018

When I was learning LDA, I found these blog posts particularly easy to follow:

http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d



But still, reading the Wikipedia page and original paper is important:

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf  (it's 30 pages long though >_< )



In [1]:
# Useful Libraries:
import numpy as np

# and yes, we will only use NumPy to build this up

In [2]:

data = [
    'apple banana are delicious food',
    'video game go play in game studio',
    'lunch food is fruit apple banana icecream',
    'warcraft or starcraft or overwatch best game',
    'chocolate or banana or icecream the most delicious food',
    'banana apple smoothie is  best for lunch or dinner',
    'video game is good for geeks',
    'what to eat for dinner banana or chocolate',
    'which game company is better ubisoft or blizzard',
    'play game on ps4 or xbox',
    'banana is less sweet icecream is more sweet',
    'chocolate icecream taste more delicious than banana'
    
]

In [3]:
'''
Data Pre-process

We will just do basic lower-casing, whitespace tokenization and stopword removal
'''
stopwords = ['to','or','is','the','and','in','for','are','on','go','best','than']
data = [doc.lower().split(' ') for doc in data]
data = [[i for i in doc if i!=''] for doc in data]
data = [[i for i in doc if i not in stopwords] for doc in data]

In [4]:
data

[['apple', 'banana', 'delicious', 'food'],
 ['video', 'game', 'play', 'game', 'studio'],
 ['lunch', 'food', 'fruit', 'apple', 'banana', 'icecream'],
 ['warcraft', 'starcraft', 'overwatch', 'game'],
 ['chocolate', 'banana', 'icecream', 'most', 'delicious', 'food'],
 ['banana', 'apple', 'smoothie', 'lunch', 'dinner'],
 ['video', 'game', 'good', 'geeks'],
 ['what', 'eat', 'dinner', 'banana', 'chocolate'],
 ['which', 'game', 'company', 'better', 'ubisoft', 'blizzard'],
 ['play', 'game', 'ps4', 'xbox'],
 ['banana', 'less', 'sweet', 'icecream', 'more', 'sweet'],
 ['chocolate', 'icecream', 'taste', 'more', 'delicious', 'banana']]

In [5]:
'''Parameters of the model to Tune'''

########################
########################
########################

ALPHA = 0.2 # prior on the per-document topic distribution: the higher it is, the more topics each doc tends to mix
BETA = 0.2 # prior on the per-topic word distribution: the higher it is, the more words each topic tends to spread over
ITERATIONS = 2000 # number of Gibbs sampling sweeps. Go large, go !
K = 2  # number of topics, this usually needs some experimentation
########################
########################
########################
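
To get a feel for what a small versus a large `ALPHA` means, here is a minimal sketch that samples topic mixtures from a symmetric Dirichlet prior and measures how dominant the largest topic is on average. The helper `mean_dominant_share`, the sample size and the value `5.0` are assumptions made for this illustration only; they are not part of the model built in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n_topics = 2       # matches K in this tutorial
n_samples = 2000   # hypothetical sample size for the illustration

def mean_dominant_share(concentration):
    """Average weight of the largest topic across sampled topic mixtures."""
    mixtures = rng.dirichlet([concentration] * n_topics, size=n_samples)
    return mixtures.max(axis=1).mean()

sparse_share = mean_dominant_share(0.2)  # small ALPHA: one topic tends to dominate
smooth_share = mean_dominant_share(5.0)  # large ALPHA: topics are mixed more evenly

print("alpha=0.2 -> mean share of dominant topic: %.2f" % sparse_share)
print("alpha=5.0 -> mean share of dominant topic: %.2f" % smooth_share)
```

The same intuition applies to `BETA`, with words taking the place of topics.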

In [6]:
'''
Initialize some intermediate storages and some latent parameters

'''
# Unique words list:
word2id = list(set([j for i in data for j in i]))
N = len(word2id)
word2id = {j:i for i,j in enumerate(word2id)}
print("There are %d unique words \n"%N)

# M documents:
M = len(data)
print("There are %d  documents \n"%M)

print("We choose %d topics \n"%K)


def docmap(x_list):
    return [word2id[w] for w in x_list]
doc2id = [docmap(doc) for doc in data] # map data to lists of word indexes

There are 33 unique words 

There are 12  documents 

We choose 2 topics 
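
One caveat about the vocabulary built above: the iteration order of a `set` of strings can differ between Python processes, so the same word may get a different id on each run. If stable ids matter, a possible variant (a sketch on a toy corpus, with the hypothetical names `word2id_stable` and `id2word_stable`) is to sort the vocabulary first:

```python
docs = [['apple', 'banana'], ['game', 'banana', 'video']]  # toy stand-in for `data`

# sorted() fixes the order, so the mapping no longer depends on set iteration order.
vocab = sorted({word for doc in docs for word in doc})
word2id_stable = {word: idx for idx, word in enumerate(vocab)}
id2word_stable = {idx: word for word, idx in word2id_stable.items()}

print(word2id_stable)
```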



In [7]:
'''
Important Matrices Initialization

Those 2 matrices will also be our output

We will randomly assign a topic to each word occurrence and use that to fill in each matrix
'''

DocTopic_mat = np.zeros((M,K)) 
WordTopic_mat = np.zeros((N,K)) 
topic_count_mat = [[0 for idx in doc] for doc in doc2id] # this list records assignment of each doc element to topics


for _doc_id in range(M):
    _tempDoc = doc2id[_doc_id] 
    for idx in range(len(_tempDoc)):
        _word_id = _tempDoc[idx]
        _random_topic = np.random.choice(range(K))
        # Update each table:
        topic_count_mat[_doc_id][idx] = _random_topic
        DocTopic_mat[_doc_id,_random_topic] += 1
        WordTopic_mat[_word_id,_random_topic] += 1
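
A quick way to convince yourself that the two count tables stay in sync is to check a few invariants after the random initialization. The sketch below repeats the same initialization on a hypothetical two-document corpus (all names here are made up for the illustration) and asserts that every token is counted exactly once in each table:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_doc2id = [[0, 1, 1], [2, 0]]  # hypothetical corpus: 2 docs, 3 vocabulary ids
n_docs, n_words, n_topics = len(toy_doc2id), 3, 2

doc_topic = np.zeros((n_docs, n_topics))
word_topic = np.zeros((n_words, n_topics))
for doc_id, doc in enumerate(toy_doc2id):
    for word_id in doc:
        topic = rng.integers(n_topics)  # uniform random topic for this token
        doc_topic[doc_id, topic] += 1
        word_topic[word_id, topic] += 1

doc_lengths = [len(doc) for doc in toy_doc2id]
assert doc_topic.sum(axis=1).tolist() == doc_lengths   # one count per token in each doc
assert doc_topic.sum() == word_topic.sum() == sum(doc_lengths)
assert np.array_equal(doc_topic.sum(axis=0), word_topic.sum(axis=0))  # same tokens per topic
```

The same invariants should hold for `DocTopic_mat` and `WordTopic_mat` at the end of every Gibbs sweep.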
        


Now that data preparation is done, we can start modeling. We will use **Gibbs Sampling** to repeatedly resample the topic assigned to each word, which gradually improves the assignments.
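
For reference, the quantity sampled for each word in the loop below can be written as the usual collapsed Gibbs conditional. The notation is my own mapping onto the code: $n_{v,k}$ stands for `WordTopic_mat[v, k]`, $n_{m,k}$ for `DocTopic_mat[m, k]`, and the superscript $-i$ means the current word's own assignment has been subtracted first.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\underbrace{\frac{n^{-i}_{w_i,k} + \beta}{\sum_{v=1}^{N} n^{-i}_{v,k} + N\beta}}_{\text{phi\_k\_w}}
\;\cdot\;
\underbrace{\frac{n^{-i}_{m,k} + \alpha}{\sum_{k'=1}^{K} n^{-i}_{m,k'} + K\alpha}}_{\text{theta\_m\_k}}
```

The second denominator does not depend on $k$, so it cancels once the probabilities are normalized; it is kept here to match the code.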

In [8]:
for i in range(ITERATIONS): # Iterations
    if i % 200 == 0:
        print("Iteration-%d started..."%i)
    

    for _doc_id in range(M): # Each doc:
        _temp_doc = doc2id[_doc_id]
        for idx in range(len(_temp_doc)): # each word in doc d
            # get word
            _temp_word_idx = _temp_doc[idx]
            # get topic
            _temp_topic_idx = topic_count_mat[_doc_id][idx]


            
            # Pre-exclude current word:
            WordTopic_mat[_temp_word_idx,_temp_topic_idx] -= 1
            DocTopic_mat[_doc_id,_temp_topic_idx] -= 1
            
            
            
            # Update using Gibbs sampling:
 
            # phi: (count of this word in each topic + BETA) / (total word count of each topic + N*BETA)
            phi_k_w = (WordTopic_mat[_temp_word_idx,:] + BETA) / (np.sum(WordTopic_mat,axis=0) + N*BETA)
            # theta: (count of each topic in this doc + ALPHA) / (number of remaining words in this doc + K*ALPHA)
            theta_m_k = (DocTopic_mat[_doc_id,:] + ALPHA) / (np.sum(DocTopic_mat[_doc_id,:]) + ALPHA*K)
            p = phi_k_w*theta_m_k

            # normalize p to sum to 1, as required by np.random.choice in the next step
            p = p/np.sum(p)
            # get the new topic assignment:
            new_topic = np.random.choice(range(K),p=p)
           

            WordTopic_mat[_temp_word_idx,new_topic] += 1
            

            DocTopic_mat[_doc_id,new_topic] += 1
            
            topic_count_mat[_doc_id][idx] = new_topic

    


Iteration-0 started...
Iteration-200 started...
Iteration-400 started...
Iteration-600 started...
Iteration-800 started...
Iteration-1000 started...
Iteration-1200 started...
Iteration-1400 started...
Iteration-1600 started...
Iteration-1800 started...


Using the same formulas as above, we now have the probability of each word under each topic (phi) and of each topic in each document (theta):

In [9]:
words_res = (WordTopic_mat + BETA) / (np.sum(WordTopic_mat,axis=0) + N*BETA) # aka phi

In [10]:
docs_res = (DocTopic_mat + ALPHA)/ (np.sum(DocTopic_mat,axis=1,keepdims=True) + ALPHA*K ) # aka theta, normalized per document (row) as in theta_m_k
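
A useful sanity check on these two estimates: every topic (column of phi) should be a probability distribution over words, and every document (row of theta) a distribution over topics. The sketch below verifies this on hypothetical final counts; it assumes theta is normalized by each document's own total, i.e. a row sum.

```python
import numpy as np

alpha, beta = 0.2, 0.2  # same values as ALPHA and BETA in this tutorial

# Hypothetical final counts: 3 vocabulary words, 2 documents, 2 topics.
toy_word_topic = np.array([[4.0, 0.0], [1.0, 3.0], [0.0, 2.0]])  # shape (N, K)
toy_doc_topic = np.array([[5.0, 1.0], [0.0, 4.0]])               # shape (M, K)
n_words, n_topics = toy_word_topic.shape

# phi: each column (topic) is a distribution over words.
phi = (toy_word_topic + beta) / (toy_word_topic.sum(axis=0) + n_words * beta)
# theta: each row (document) is a distribution over topics.
theta = (toy_doc_topic + alpha) / (toy_doc_topic.sum(axis=1, keepdims=True) + n_topics * alpha)

print(phi.sum(axis=0))    # one total per topic
print(theta.sum(axis=1))  # one total per document
```

`keepdims=True` keeps the row totals as a column vector of shape `(M, 1)`, so the division broadcasts across topics instead of across documents.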

#### Now let's check the results:

Top 5 words for each topic:

In [11]:
words_res[:10,0]

array([0.0044843 , 0.0044843 , 0.0044843 , 0.0044843 , 0.02690583,
       0.0044843 , 0.16143498, 0.0044843 , 0.04932735, 0.07174888])

In [12]:

id2word = {i[1]:i[0]  for i in word2id.items()}
for i in range(K):
    idxs = [i for i in reversed(words_res[:,i].argsort())][:5] # max -> min
    print("Topic %d top words:"%i)
    print([id2word[idx] for idx in idxs])
    

Topic 0 top words:
['banana', 'icecream', 'food', 'delicious', 'apple']
Topic 1 top words:
['game', 'play', 'video', 'blizzard', 'starcraft']


### Makes sense, right?

#### the first topic (topic 0) is about food  <3
#### the second one (topic 1) is about video games !!

Topic for each doc:

In [13]:
# Original Data:
data

[['apple', 'banana', 'delicious', 'food'],
 ['video', 'game', 'play', 'game', 'studio'],
 ['lunch', 'food', 'fruit', 'apple', 'banana', 'icecream'],
 ['warcraft', 'starcraft', 'overwatch', 'game'],
 ['chocolate', 'banana', 'icecream', 'most', 'delicious', 'food'],
 ['banana', 'apple', 'smoothie', 'lunch', 'dinner'],
 ['video', 'game', 'good', 'geeks'],
 ['what', 'eat', 'dinner', 'banana', 'chocolate'],
 ['which', 'game', 'company', 'better', 'ubisoft', 'blizzard'],
 ['play', 'game', 'ps4', 'xbox'],
 ['banana', 'less', 'sweet', 'icecream', 'more', 'sweet'],
 ['chocolate', 'icecream', 'taste', 'more', 'delicious', 'banana']]

In [14]:
docs_res.argmax(axis=1)

array([0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

#### Again, each doc is correctly assigned to the right topic:

the zeros are food related, the ones are game related

=================================================

## Now let's put everything together

In [23]:
class LDA():
    """
    Latent Dirichlet Allocation using Gibbs Sampling.
    """
    def __init__(
        self,
        ALPHA,
        BETA,
        ITERATIONS,
        N_TOPICS,
        verbose=True
    ):
        self.ALPHA = ALPHA
        self.BETA = BETA
        self.ITERATIONS = ITERATIONS
        self.N_TOPICS = N_TOPICS
        self.verbose = verbose
    
    
    def _preprocess_text(
        self,
        list_x,
        _stopwords = ()
    ):
        """
        Input: a list of strings (documents) without punctuation.
        Does some simple processing: lower-casing, removing extra white space and removing stopwords.
        """
#         stopwords = ['to','or','is','the','and']
        list_x = [doc.lower().split(' ') for doc in list_x]
        list_x = [[i for i in doc if i!=''] for doc in list_x]
        list_x = [[i for i in doc if i not in _stopwords] for doc in list_x]
        self.raw_data = list_x


    
    def _initialize(
        self,
    ):
        """
        Initialize latent variables and matrices
        """

        # Unique words list:
        word2id = list(set([j for i in self.raw_data for j in i]))
        self.N = len(word2id)
        self.word2id = {j:i for i,j in enumerate(word2id)}
        self.id2word = {i[1]:i[0]  for i in self.word2id.items()}
        # M documents:
        self.M = len(self.raw_data)
        self.doc2id = [self._docmap(doc) for doc in self.raw_data] # map data to lists of word indexes


        
        #Mat init:
        self.DocTopic_mat = np.zeros((self.M,self.N_TOPICS)) 
        self.WordTopic_mat = np.zeros((self.N,self.N_TOPICS)) 
        self.topic_count_mat = [[0 for idx in doc] for doc in self.doc2id] # this list records assignment of each doc element to topics
        
        for _doc_id in range(self.M):
            _tempDoc = self.doc2id[_doc_id] 
            for idx in range(len(_tempDoc)):
                _word_id = _tempDoc[idx]
                _random_topic = np.random.choice(range(self.N_TOPICS))
                # Update each table:
                self.topic_count_mat[_doc_id][idx] = _random_topic
                self.DocTopic_mat[_doc_id,_random_topic] += 1
                self.WordTopic_mat[_word_id,_random_topic] += 1


    
    def _docmap(self, x):
        """
        Map a list of words (one document) to their ids,
        a.k.a. index encoding
        """
        return [self.word2id[w] for w in x]
     
    
    def _train(self):
        """
        Actual training using Gibbs Sampling
        """
        
        for i in range(self.ITERATIONS): # Iterations
            if self.verbose:
                if i % 100 == 0:
                    print("Iteration-%d started..."%i)


            for _doc_id in range(self.M): # Each doc:
                _temp_doc = self.doc2id[_doc_id]
                for idx in range(len(_temp_doc)): # each word in doc d
                    # get word
                    _temp_word_idx = _temp_doc[idx]
                    # get topic
                    _temp_topic_idx = self.topic_count_mat[_doc_id][idx]



                    # Pre-exclude current word:
                    self.WordTopic_mat[_temp_word_idx,_temp_topic_idx] -= 1
                    self.DocTopic_mat[_doc_id,_temp_topic_idx] -= 1



                    # Update using Gibbs sampling:

                    # phi: (count of this word in each topic + BETA) / (total word count of each topic + N*BETA)
                    phi_k_w = (self.WordTopic_mat[_temp_word_idx,:] + self.BETA) / (np.sum(self.WordTopic_mat,axis=0) + self.N*self.BETA)
                    # theta: (count of each topic in this doc + ALPHA) / (number of remaining words in this doc + N_TOPICS*ALPHA)
                    theta_m_k = (self.DocTopic_mat[_doc_id,:] + self.ALPHA) / (np.sum(self.DocTopic_mat[_doc_id,:]) + self.ALPHA*self.N_TOPICS)
                    p = phi_k_w*theta_m_k
                    # normalize p to sum to 1, as required by np.random.choice in the next step
                    p = p/np.sum(p)
                    # get the new topic assignment:
                    new_topic = np.random.choice(range(self.N_TOPICS),p=p)


                    self.WordTopic_mat[_temp_word_idx,new_topic] += 1


                    self.DocTopic_mat[_doc_id,new_topic] += 1

                    self.topic_count_mat[_doc_id][idx] = new_topic

        
        self.res_wordtopic = (self.WordTopic_mat + self.BETA) / (np.sum(self.WordTopic_mat,axis=0) + self.N*self.BETA) # aka phi
        self.res_doctopic = (self.DocTopic_mat + self.ALPHA)/ (np.sum(self.DocTopic_mat,axis=1,keepdims=True) + self.ALPHA*self.N_TOPICS ) # aka theta, normalized per document (row)

   
    def fit(
        self,
        data,
        stopwords
    ):
        """
        Wrap-up function to run the pipeline
        """
        self._preprocess_text(data,stopwords)
        self._initialize()
        self._train()
        

    def get_topic_keywords(self,TOPIC,TOP_N):
        """
        Query Top N keywords for certain topic
        """
        idxs = [i for i in reversed(self.res_wordtopic[:,TOPIC].argsort())][:TOP_N] # max -> min
        return [self.id2word[idx] for idx in idxs]
    


In [24]:

data = [
    'apple banana are delicious food',
    'video game go play in game studio',
    'lunch food is fruit apple banana icecream',
    'warcraft or starcraft or overwatch best game',
    'chocolate or banana or icecream the most delicious food',
    'banana apple smoothie is  best for lunch or dinner',
    'video game is good for geeks',
    'what to eat for dinner banana or chocolate',
    'which game company is better ubisoft or blizzard',
    'play game on ps4 or xbox',
    'banana is less sweet icecream is more sweet',
    'chocolate icecream taste more delicious than banana'
    
]

In [25]:
stopwords = ['to','or','is','the','and','in','for','are','on','go','best','than']

In [26]:
ALPHA = 0.2 # prior on the per-document topic distribution: the higher it is, the more topics each doc tends to mix
BETA = 0.2 # prior on the per-topic word distribution: the higher it is, the more words each topic tends to spread over
ITERATIONS = 1500 # number of Gibbs sampling sweeps. Go large, go !
K = 2  # number of topics

In [27]:
model = LDA(
    ALPHA=ALPHA,
    BETA=BETA,
    ITERATIONS = ITERATIONS,
    N_TOPICS = K,
    verbose = True)

In [28]:
model.fit(
    data=data,
    stopwords = stopwords
)

Iteration-0 started...
Iteration-100 started...
Iteration-200 started...
Iteration-300 started...
Iteration-400 started...
Iteration-500 started...
Iteration-600 started...
Iteration-700 started...
Iteration-800 started...
Iteration-900 started...
Iteration-1000 started...
Iteration-1100 started...
Iteration-1200 started...
Iteration-1300 started...
Iteration-1400 started...


In [29]:
model.get_topic_keywords(TOPIC=0,TOP_N=5)

['game', 'play', 'video', 'blizzard', 'starcraft']

In [30]:
model.get_topic_keywords(TOPIC=1,TOP_N=5)

['banana', 'icecream', 'food', 'delicious', 'apple']
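
Note that the topic numbers are arbitrary: in the step-by-step run earlier, topic 0 was the food topic, while here topic 0 is the game topic. Both runs found the same two groups and only the labels swapped, because the initial assignments are random. Since the class draws from NumPy's global random state, calling `np.random.seed(...)` before `model.fit(...)` should make a run repeatable within one Python session. The sketch below only demonstrates the seeding behaviour itself, using a hypothetical helper `draw_topics`:

```python
import numpy as np

def draw_topics(seed, n_draws=5, n_topics=2):
    """Hypothetical helper: seed the global NumPy RNG, then draw random topics."""
    np.random.seed(seed)
    return [int(np.random.choice(n_topics)) for _ in range(n_draws)]

first_run = draw_topics(seed=42)
second_run = draw_topics(seed=42)
print(first_run == second_run)  # True: same seed, same draws
```

Across separate Python processes the word ids can also change, because the vocabulary is built from a `set`, so a fixed seed alone may not be enough to reproduce the exact same result.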

-- Done --