# Topic modeling (LDA)
- It implements latent Dirichlet allocation (LDA), a popular topic modeling approach
- The model uses collapsed Gibbs sampling, a fast inference technique for topic modeling

It operates in two steps.

*A) Preparing data (integer encoding documents)*  

*B) Performing topic modeling on integer encoded documents*

In [None]:
# Preparing the computational environment
!pip install -r requirements.txt

In [None]:
# It's a vanilla implementation of topic modeling that only uses basic tools:
# json - to read from and write to files in JSON format
# numpy - for faster matrix operations
# pandas - to tabulate the results and write them to files
# string - to only keep English letters, removing punctuation and other characters
# random - to generate random numbers for initializing Markov-chain Monte Carlo,
#          and during the algorithm's working to avoid local optima


import json
import numpy as np
import random
import pandas as pd
import string

# A) Preparing data (integer encoding documents)

1. Read textual data
2. Generate integer encoding
3. Storing intermediate data

**Working with integers (representing words or unique tokens) is much faster than working with the word strings themselves**

*At the end, the integers would be reversed back to their respective words*

## 1. Reading textual data


1.1 Clean text by removing punctuation and characters other than English letters \
1.2 Convert to lower case \
1.3 Tokenize 

In [None]:
en_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

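def clean_text(text):
    # convert to lower case first, so that upper-case letters are not discarded below
    cleaned = text.lower()
    # remove unwanted characters, keeping English letters and whitespace (needed to separate words)
    cleaned = "".join([char for char in cleaned if char in string.ascii_lowercase or char.isspace()])
    # tokenize, dropping stopwords and words of two characters or fewer
    tokens = [word for word in cleaned.split() if word not in en_stopwords and len(word) > 2]
    return tokens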

1.4 Read data from the file 

In [None]:
# Read input data: titles of BBC articles available on the following link
# https://github.com/vahadruya/Capstone-Project-Unsupervised-ML-Topic-Modelling/blob/main/Input_Data/input.csv

with open('config.json', 'r') as file:
    config = json.load(file)

with open(config["input-path"], 'r') as file:
    input_data = file.read().split('\n')



# Tokenize sentences into words
tokenized_documents = []
for document in input_data: # Every line of the input file is treated as one document
    document = clean_text(document)
    tokenized_documents.append(document)
len(tokenized_documents)

## 2. Configuration Setting

`config.json`

The following configuration options alter the behavior of the method.

**numTopics:** Number of topics to extract from the dataset. A suitable value depends on the nature of the data and the type of analysis required. With too few topics, the topics cover only broad concepts and specific topics cannot be identified, while too many topics result in noisy (incoherent) topics. The default value here is 3.

**numAlpha**: This hyperparameter helps in deciding the probabilities of topics in a document. A higher value (above 1) pulls the topic probabilities of a document toward similar values, which suits cases where more topics per document are intended. A lower value (below 1) leaves only a few prominent topics with very high probabilities compared to the others, pushing high probabilities higher and low probabilities lower, which suits cases where fewer topics are needed per document. In this method, we are using the numAlpha value 1.

**numBeta**: This hyperparameter helps in deciding the probabilities of words in a topic. A higher value (above 1) gives many words similar probabilities, which suits cases where more words are intended to represent a topic. A lower value leaves only a few prominent words with very high probabilities compared to the others. Generally, considering the size of the vocabulary in a dataset, this value is kept small so that only the most relevant words stand out. In this method, we are using the numBeta value 0.01.
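
The effect described for these two priors can be seen by drawing probability vectors from a symmetric Dirichlet distribution at different concentration values. The sketch below is only an illustration and is not part of the method; the values 0.1, 1.0 and 10.0, the number of draws, and the helper name `mean_largest_share` are assumptions chosen for the demonstration.

```python
import numpy as np

def mean_largest_share(concentration, num_components=3, num_samples=2000, seed=0):
    """Average share of the largest component over many Dirichlet draws."""
    rng = np.random.default_rng(seed)
    # Each row is one probability vector that sums to 1
    samples = rng.dirichlet([concentration] * num_components, size=num_samples)
    return samples.max(axis=1).mean()

# Illustrative concentration values (assumed, not taken from config.json)
shares = {c: mean_largest_share(c) for c in (0.1, 1.0, 10.0)}
for c, share in shares.items():
    print(f"concentration={c}: mean largest share = {share:.2f}")
```

A larger "largest share" means the probability mass sits on fewer components, which is the behavior expected of small concentration values.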

**numGSIterations**: The number of iterations of the inference technique (collapsed Gibbs sampling). Because of the random initialization, many words switch topics in the early iterations; the number of switches keeps dropping in later iterations as the sampler approaches the equilibrium state. A higher numGSIterations value gives the words more time to settle into their respective topics. Alternatively, the difference between two consecutive iterations can be used to avoid unnecessary iterations once the words have settled. This method only uses a fixed number of iterations, numGSIterations, with a value of 1000.
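
The alternative stopping rule mentioned above could look roughly like the sketch below. It is not implemented in this notebook's LDA class; `count_topic_switches`, `has_settled`, and the 1% tolerance are hypothetical names and values used only for illustration.

```python
def count_topic_switches(previous_z, current_z):
    """Count word occurrences whose topic differs between two iterations."""
    return sum(
        old != new
        for prev_doc, curr_doc in zip(previous_z, current_z)
        for old, new in zip(prev_doc, curr_doc)
    )

def has_settled(previous_z, current_z, tolerance=0.01):
    """True when the share of switched assignments is at most `tolerance`."""
    total = sum(len(doc) for doc in current_z)
    if total == 0:
        return True
    return count_topic_switches(previous_z, current_z) / total <= tolerance

# Toy topic assignments for two documents in consecutive iterations
z_before = [[0, 1, 2], [1, 1, 0]]
z_after = [[0, 1, 2], [1, 2, 0]]
switches = count_topic_switches(z_before, z_after)
settled = has_settled(z_before, z_after)
print(switches, settled)
```

Because the sampler is stochastic, some assignments keep changing even near equilibrium, so a small tolerance is a more practical criterion than requiring zero switches.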

**wordsPerTopic**: The number of top words used to represent a topic. The default value is 10.

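**input-path**: The path of the input text file.

**integer-encoded-path**: The path of the integer-encoded file. It is the intermediate file that topic modeling uses.

**integer-word-dict-path**: The path of the stored integer-to-word dictionary, used to map integer ids back to words in the results.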
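
For orientation, a configuration consistent with the keys that the code cells in this notebook read might look like the sketch below, written as a Python dictionary. The hyperparameter values follow the defaults described above and the burn-in and lag values mirror the defaults in the LDA class signature. The input path is a placeholder, and the other two paths are assumed to match the files written in the storage step.

```python
import json

# Hypothetical configuration; the paths in particular are assumptions
example_config = {
    "input-path": "data/input.txt",  # placeholder, not the actual file name
    "integer-encoded-path": "data/integer-encoded-data.txt",
    "integer-word-dict-path": "data/revdictionary.json",
    "numTopics": 3,
    "numAlpha": 1.0,
    "numBeta": 0.01,
    "numGSIterations": 1000,
    "numNBurnin": 50,
    "numSampleLag": 20,
    "wordsPerTopic": 10,
}

# Keys that the code cells look up in `config`
required_keys = {
    "input-path", "integer-encoded-path", "integer-word-dict-path",
    "numTopics", "numAlpha", "numBeta", "numGSIterations",
    "numNBurnin", "numSampleLag", "wordsPerTopic",
}
missing = required_keys - example_config.keys()
print("missing keys:", sorted(missing))
print(json.dumps(example_config, indent=2))
```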

## 3. Generate Integer encoding
Integer encoding preserves both frequency and position-related information. The process assigns each unique token a dedicated integer id, stores the mapping in a dictionary for later retrieval, and rewrites the documents by replacing words with their integer ids.

It makes the operations a lot faster, as numbers are much faster to read, store, and compare than strings.

The integer ids will be replaced with their original words at the end using the stored dictionary files.

3.1 Generate integer encoded documents \
3.2 Generate word-integer index and integer index-word dictionaries
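
As a minimal illustration of the two steps above, the following sketch encodes two made-up documents and decodes them again. The toy documents and the names `word_to_id` and `id_to_word` are assumptions for this example only.

```python
# Toy tokenized documents (made up for illustration)
toy_documents = [["market", "shares", "rise"], ["film", "market", "awards"]]

word_to_id = {}
for doc in toy_documents:
    for word in doc:
        # setdefault assigns the next free id only the first time a word is seen
        word_to_id.setdefault(word, len(word_to_id))
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Encode words as ids, then map the ids back to words
encoded = [[word_to_id[word] for word in doc] for doc in toy_documents]
decoded = [[id_to_word[idx] for idx in doc] for doc in encoded]

print(encoded)  # [[0, 1, 2], [3, 0, 4]]
print(decoded == toy_documents)  # True
```

The repeated word "market" receives the same id in both documents, so word order and word frequency both survive the encoding.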

In [None]:
# Create a dictionary of unique tokens and assign integers
dictionary = {}
revdictionary = {}
index = 0

#tokenized_documents = [[word for word in doc if word not in esw] for doc in tokenized_documents]

for doc in tokenized_documents:
    for word in doc:
        if word not in dictionary.keys():
            dictionary[word] = index
            revdictionary[index] = word
            index += 1

# Replace words in sentences with their corresponding integers
encoded_documents = [[dictionary[word] for word in doc] for doc in tokenized_documents]

## 4. Storing intermediate data
The integer-encoded documents are stored in files. \
The word-to-id and id-to-word dictionaries are also stored.

*It will help to avoid these steps each time topic modeling is performed under different settings*
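
The storage format is one document per line with ids separated by tabs. The sketch below shows that format round-tripping in memory, and also why ids are looked up as strings after the dictionary has been reloaded from JSON. The helper names `serialize_documents` and `parse_documents` are hypothetical.

```python
import json

def serialize_documents(encoded_docs):
    """One document per line, ids separated by tabs."""
    return "\n".join("\t".join(str(i) for i in doc) for doc in encoded_docs)

def parse_documents(text):
    """Inverse of serialize_documents; blank lines are skipped."""
    return [[int(tok) for tok in line.split("\t")]
            for line in text.split("\n") if line]

docs = [[0, 1, 2], [3, 0, 4]]
text = serialize_documents(docs)
restored = parse_documents(text)
print(restored == docs)

# JSON object keys are always strings, so integer keys come back as "0", "1", ...
id_to_word = {0: "market", 1: "shares"}
reloaded = json.loads(json.dumps(id_to_word))
print(sorted(reloaded))
```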

In [None]:
# Write one document per line, with token ids separated by tabs
# (joining avoids a trailing newline, so no characters have to be trimmed afterwards)
toStr = '\n'.join('\t'.join(str(item) for item in endoc) for endoc in encoded_documents)
with open('data/integer-encoded-data.txt', 'w') as file:
    file.write(toStr)

# Write the word-to-id and id-to-word dictionaries to files
with open('data/dictionary.json', 'w') as file:
    json.dump(dictionary, file)
with open('data/revdictionary.json', 'w') as file:
    json.dump(revdictionary, file)

# B) Topic Modeling
- It identifies the hidden thematic structures within the documents and represents them as latent topics.
- Each document is a mixture of all possible topics with varying probabilities
- Each topic is a mixture of all the vocabulary of the dataset with varying probabilities
- This method implements Latent Dirichlet allocation (LDA), a commonly used topic model
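
The two "mixture" statements above correspond to the generative story that LDA assumes. The sketch below generates a tiny synthetic corpus that way. It is not the inference code of this notebook, and the sizes, the seed, the prior values, and the name `generate_corpus` are assumptions for illustration.

```python
import numpy as np

def generate_corpus(num_docs=4, doc_length=6, num_topics=3, vocab_size=8,
                    alpha=1.0, beta=0.1, seed=0):
    """Generate integer-encoded documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Each topic is a probability distribution over the whole vocabulary
    phi = rng.dirichlet([beta] * vocab_size, size=num_topics)
    # Each document is a probability distribution over all topics
    theta = rng.dirichlet([alpha] * num_topics, size=num_docs)
    docs = []
    for d in range(num_docs):
        # Pick a topic for every word position, then a word from that topic
        topics = rng.choice(num_topics, size=doc_length, p=theta[d])
        docs.append([int(rng.choice(vocab_size, p=phi[t])) for t in topics])
    return theta, phi, docs

theta, phi, docs = generate_corpus()
print(theta.shape, phi.shape)
print(docs)
```

Inference runs in the opposite direction: given only the documents, it estimates the document-topic and topic-word distributions.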

*Setting random seeds*

In [None]:
# For reproducible results
random.seed(41)  # For Python random
np.random.seed(41)  # For NumPy random

## 1. Latent Dirichlet Allocation (LDA)

**LDA class**
The main functions are:
1. Random initialization (assigning word occurrences to topics at random)
2. Using Markov chain Monte Carlo (MCMC) sampling, the posterior distribution is estimated from the current state of the model, converging over the iterations
3. Collapsed Gibbs sampling inference: in each iteration \
   3.1 Iterates through all documents and all tokens/words in each document \
   3.2 For each token, samples a topic from the topic probabilities computed under the current state of the model \
   3.3 Assigns the sampled topic to the token and updates the associated counts, and with them the model state
4. Estimate document-topic distribution from the final state of the model 
5. Estimate topic-word distribution (organized in decreasing order of probabilities) from the final state of the model
6. Other utility functions
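
Step 3 can be sketched for a single token with NumPy as shown below. This is a vectorized illustration of the same conditional probability that the class computes in a loop, not a replacement for it; the function name `sample_topic` and the toy counts are assumptions.

```python
import numpy as np

def sample_topic(ndt_row, ntw_col, nt_sum, alpha, beta, vocab_size, rng):
    """Draw a topic for one token whose own counts have already been removed.

    ndt_row: topic counts in the token's document, shape (T,)
    ntw_col: counts of the token's word in each topic, shape (T,)
    nt_sum:  total number of tokens assigned to each topic, shape (T,)
    """
    # Unnormalized conditional: (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
    # The document-length denominator is identical for every topic, so it is dropped.
    weights = (ndt_row + alpha) * (ntw_col + beta) / (nt_sum + vocab_size * beta)
    probs = weights / weights.sum()
    # The topic is sampled from the distribution, not simply set to the most likely one
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
new_topic, probs = sample_topic(
    ndt_row=np.array([2, 0, 1]),
    ntw_col=np.array([3, 0, 0]),
    nt_sum=np.array([10, 8, 9]),
    alpha=1.0, beta=0.01, vocab_size=50, rng=rng,
)
print(new_topic, probs.round(4))
```

In this toy state the word has so far appeared only in the first topic, so that topic receives almost all of the probability, yet the other topics can still be drawn occasionally.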

In [None]:
# The class implements topic modeling (latent Dirichlet allocation) using collapsed Gibbs sampling for inference.
class LDA:
    # topics to extract from the data (Components)
    _numTopics = None
    # vocabulary (unique words) in the dataset
    _arrVocab = None
    #size of vocabulary (count of unique words)
    _numVocabSize = None
    # dataset
    _arrDocs = []
    # dataset size (number of documents)
    _numDocSize = None
    # dirichlet prior (document to topic prior)
    _numAlpha = None
    # dirichlet prior (topic to word prior)
    _numBeta = None
    _ifScalarHyperParameters = True
    # Gibbs sampler iterations
    _numGSIterations = None
    # The iterations for initial burnin (update of parameters)
    _numNBurnin = None
    # The iterations for continuous burnin (update of parameters)
    _numSampleLag = None
    
    
    
    # The following attributes are for internal working
    __numTAlpha = None  
    __numVBeta = None   
    __arrTheta = None
    __arrThetaSum = None
    __arrPhi = None
    __arrPhiSum = None
    __arrNDT = None
    __arrNDSum = []
    __arrNTW = None
    __arrNTSum = []
    __arrZ = []
    
    # for alpha to be a list, its size must be equal to the size of the dataset, and has a value for each doc
    # for beta to be a list, its size must be equal to the number of topics, and has a value for each topic  
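    def __init__(self, numTopics = 2, numAlpha = 1.0, numBeta = 0.01, 
                 numGSIterations = 1000, numNBurnin = 50, numSampleLag = 20, 
                 wordsPerTopic = 10):
        # Note: the settings are read from the global `config` (loaded from config.json);
        # the keyword arguments above are currently not used
        self._numTopics = config["numTopics"]
        self._numAlpha = config["numAlpha"]
        self._numBeta = config["numBeta"]
        self._numGSIterations = config["numGSIterations"]
        self._numNBurnin = config["numNBurnin"]
        self._numSampleLag = config["numSampleLag"]
        self.__wordsPerTopic = config["wordsPerTopic"]
        # Per-instance containers, so that separate LDA objects do not share documents or assignments
        self._arrDocs = []
        self.__arrZ = []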
            
    #load data as integer encoding of words in a sequence (no padding or truncation)
    def getData(self, path):
        file = open(path, 'r')
        rawData = file.read()
        file.close()
        self.__loadData(rawData)
        self.__loadVocab()
        self.__prepareCollections()

    #load docs and docSize from the dataset
    def __loadData(self, rawData):
        rows = rawData.split('\n')
         
        #read dataset as documents of words IDs
        for row in rows:
            swordlist = row.split('\t')
            swordlist = list(filter(None, swordlist))   #remove empty items from list
            if len(swordlist) > 0:
                iwordlist = [int(w) for w in swordlist]    # parse ids with int(), which is safer than eval()
                self._arrDocs.append(iwordlist)

        # determine dataset size
        self._numDocSize = len(self._arrDocs)
        
        
    #Determine unique words (vocabulary) and count of unique words (vocabSize)    
    def __loadVocab(self):
        #determine unique vocabulary, keeping the order of first appearance
        uniqueWords = list(dict.fromkeys(word for doc in self._arrDocs for word in doc))
        self._arrVocab = uniqueWords
        self._numVocabSize = len(self._arrVocab)    

    def __prepareCollections(self):
        # document-level counts and estimates (one row per document)
        self.__arrNDSum = np.zeros(self._numDocSize, dtype=int)
        self.__arrTheta = np.zeros((self._numDocSize, self._numTopics), dtype=int)
        self.__arrThetaSum = np.zeros((self._numDocSize, self._numTopics), dtype=int)
        self.__arrNDT = np.zeros((self._numDocSize, self._numTopics), dtype=int)
        
        # topic-level counts and estimates (one row per topic)
        self.__arrNTSum = np.zeros(self._numTopics, dtype=int)
        self.__arrPhi = np.zeros((self._numTopics, self._numVocabSize), dtype=int)
        self.__arrPhiSum = np.zeros((self._numTopics, self._numVocabSize), dtype=int)
        self.__arrNTW = np.zeros((self._numTopics, self._numVocabSize), dtype=int)

        #Assign values to parameters based on hyper-parameters
        self.__numTAlpha = self._numTopics*self._numAlpha  
        self.__numVBeta = self._numVocabSize*self._numBeta   

        
        for d in range(0, self._numDocSize):
            rowOfZeros = [0] * len(self._arrDocs[d])
            self.__arrZ.append(rowOfZeros)
                
    # Initialize first markov chain randomly
    def randomMarkovChainInitialization(self):
        
        for d in range(self._numDocSize):
            wta = []                        #wta - word topic assignment
            doc = self._arrDocs[d]
            for ind in range(len(doc)): 
                randtopic = random.randint(0, self._numTopics - 1)      # generate a topic number at random
                self.__arrZ[d][ind] = randtopic
                self.__arrNDT[d][randtopic] += 1
                self.__arrNDSum[d] += 1
                wordid = self._arrDocs[d][ind]
                self.__arrNTW[randtopic][wordid] += 1
                self.__arrNTSum[randtopic] += 1
            
    
    #Inference (Collapsed Gibbs Sampling)
    def gibbsSampling(self):
        tAlpha = self._numAlpha * self._numTopics
        vBeta = self._numBeta * self._numVocabSize            
                    
        for it in range(self._numGSIterations):
            for d in range(self._numDocSize):
                dsize = len(self._arrDocs[d])
                for ind in range(dsize):
                    # remove old topic from a word instance
                    oldTopic = self.__arrZ[d][ind]
                    wordid = self._arrDocs[d][ind]
                    self.__arrNDT[d][oldTopic] -= 1
                    self.__arrNDSum[d] -= 1
                    self.__arrNTW[oldTopic][wordid] -= 1
                    self.__arrNTSum[oldTopic] -= 1   

                    # sample a new topic for the word instance from its conditional distribution under the current state of the model
                    prob = [0] * self._numTopics
                    
                    for t in range(self._numTopics):
                        prob[t] = ((self.__arrNDT[d][t] + self._numAlpha) / (self.__arrNDSum[d] + tAlpha)) * \
                            (self.__arrNTW[t][wordid] + self._numBeta) / (self.__arrNTSum[t] + vBeta)
                    
                    #cumulate multinomial
                    cdf = prob
                    for x in range(1, len(cdf)):
                        cdf[x] += cdf[x-1]
                    
                    cutoff = random.random() * cdf[-1]
                    newTopic = 0
                    for i in range(len(cdf)):
                        if cdf[i] > cutoff:
                            newTopic = i
                            break
                    #update as per new topic
                    self.__arrZ[d][ind] = newTopic
                    self.__arrNDT[d][newTopic] += 1
                    self.__arrNDSum[d] += 1
                    self.__arrNTW[newTopic][wordid] += 1
                    self.__arrNTSum[newTopic] += 1
                
    def getTopicsPerDocument(self):
        dtd = {}
        for d in range(self._numDocSize):
            for t in range(self._numTopics):
                val = (self.__arrNDT[d][t]+self._numAlpha)/(self.__arrNDSum[d]+self.__numTAlpha)
                val = round(val, 4)
                key = "topic-" + str(t+1)
                if key not in dtd.keys():
                    dtd[key] = []
                dtd[key].append(val)
        return dtd

    def getWordsPerTopic(self, revdictionary):
        twd = []
        for t in range(self._numTopics):
            wpt = {}
            for v in range(self._numVocabSize):
                val = (self.__arrNTW[t][v]+self._numBeta)/(self.__arrNTSum[t]+self.__numVBeta)
                val = round(val, 4)
                wpt[revdictionary[str(v)]] = val
             #   flag += 1
             #   if flag == self.__wordsPerTopic:
             #       break
            wpt = sorted(wpt.items(), key=lambda x: x[1], reverse=True)[:self.__wordsPerTopic]
            twd.append(wpt)
        output = {}
        for i in range(len(twd)):
            output["topic-" + str(i+1)] = twd[i]
        return output
    
    def printall(self):
        print("topics: ", self._numTopics)
        print("dataset: ", self._arrDocs)
        print("dataset size: ", self._numDocSize)
        print("vocab: ", self._arrVocab)
        print("vocab size: ", self._numVocabSize)
        print("ndt: ", self.__arrNDT)
        print("ndsum: ", self.__arrNDSum)
        print("ntw: ", self.__arrNTW)
        print("ntsum: ", self.__arrNTSum)
        print("z: ", self.__arrZ)

## 2. Run the model

This may take several minutes.

In [None]:
if __name__ == "__main__":
    lda = LDA()
    lda.getData(config["integer-encoded-path"])
    lda.randomMarkovChainInitialization()
    lda.gibbsSampling()

## 3. Results

Topic modeling has two important results
1. **Latent topics** identified in the corpus. Each topic is represented by its most representative words. It's similar to clustering in the sense that the words are grouped into topics, which carry generic labels such as topic 1, topic 2, etc. However, unlike clustering, each word has a probability within the topic. Using these probabilities, only the top few words (10) are used to represent a topic. This result is therefore also called the *topic-word distribution*. 

2. **Topics in documents** are the probabilities of topics within each document. A general conception is that a document is not entirely about a single topic and instead has different percentages of multiple topics. The topics in documents provide the probabilities of each topic in each document.

To observe the output generated by this method closely, please go to the `data/output-data/` folder, which has `topic-word-distribution.tsv` and `document-topic-distribution.tsv`.

*Topic word distribution*

**Topic distribution per document** \
Each document talks about the topics identified to a different extent. For example, document 1 is 12.5% topic 1, 37.5% topic 2, and 50% topic 3. 

Readers interested in a specific topic may only read the documents where that topic has high coverage. 
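
The percentages in the example above are consistent with the smoothed estimate used by `getTopicsPerDocument`. As a small worked sketch, assume a document with five word occurrences, none assigned to topic 1, two to topic 2, and three to topic 3, with numAlpha set to 1. The counts and the helper name `document_topic_shares` are assumptions for illustration.

```python
import numpy as np

def document_topic_shares(topic_counts, alpha):
    """Smoothed share of each topic in one document."""
    counts = np.asarray(topic_counts, dtype=float)
    # (count of topic t + alpha) / (document length + number of topics * alpha)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

shares = document_topic_shares([0, 2, 3], alpha=1.0)
print(shares.tolist())  # [0.125, 0.375, 0.5]
```

Even the topic with no assigned words receives a non-zero share, because the prior adds numAlpha to every count.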


In [None]:

doc_topic_dist = lda.getTopicsPerDocument()
doc_topic_dist["Text"] = input_data

# Showing the topic distribution for the first 5 documents (the default of head())
df_dtd = pd.DataFrame(doc_topic_dist, index = ["Doc "+str(i+1) for i in range(len(doc_topic_dist['topic-1']))])

df_dtd.to_csv("data/output-data/document-topic-distribution.tsv", sep="\t", index=False)

df_dtd.head()

**Word distribution per topic** \
The three latent topics determined from this dataset are labeled as Topic 1, Topic 2, and Topic 3. \
*Topic 1:* May be broadly interpreted as election and politics from its top words \
*Topic 2:* May be broadly interpreted as business deals \
*Topic 3:* May be broadly interpreted as movies and showbiz

These three topics give a general idea of the topics covered in the data.
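
The top words of each topic are selected from the topic-word counts in the way sketched below. This is a vectorized illustration of the idea behind `getWordsPerTopic`, not the notebook's own code; the toy counts, the four-word vocabulary, and the function name `top_words` are made up for the example.

```python
import numpy as np

def top_words(topic_word_counts, id_to_word, beta, k):
    """Return the k most probable (word, probability) pairs for each topic."""
    counts = np.asarray(topic_word_counts, dtype=float)
    vocab_size = counts.shape[1]
    # Smoothed topic-word probabilities; each row sums to 1
    phi = (counts + beta) / (counts.sum(axis=1, keepdims=True) + vocab_size * beta)
    result = []
    for row in phi:
        # Indices of the k largest probabilities, in decreasing order
        best = np.argsort(-row, kind="stable")[:k]
        result.append([(id_to_word[int(i)], round(float(row[i]), 4)) for i in best])
    return result

# Toy counts: rows are topics, columns are word ids
toy_counts = [[5, 1, 0, 0], [0, 0, 4, 2]]
toy_vocab = {0: "election", 1: "vote", 2: "film", 3: "award"}
tops = top_words(toy_counts, toy_vocab, beta=0.01, k=2)
for number, words in enumerate(tops, start=1):
    print(f"topic-{number}:", words)
```

Interpreting a topic, as done above for the three topics of this dataset, remains a manual reading of such word lists.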

In [None]:
with open(config["integer-word-dict-path"], 'r') as file:
    revdictionary = json.load(file)
topic_word_distribution = lda.getWordsPerTopic(revdictionary)

twd = {}
for topic in topic_word_distribution.keys():
    words, probabilities = zip(*topic_word_distribution[topic])
    twd[topic + "-word"] = words
    twd[topic + "-word-probability"] = probabilities

df_twd = pd.DataFrame(twd, index = ["word " +str(i+1) for i in range(len(topic_word_distribution['topic-1']))])
df_twd.to_csv("data/output-data/topic-word-distribution.tsv", sep="\t", index = False)
df_twd