# Topic modeling
- It implements Latent Dirichlet Allocation (LDA), a popular topic modeling approach
- The model uses collapsed Gibbs sampling for inference (the document-topic and topic-word distributions are integrated out, so only the topic assignments are sampled)

It operates in two steps.\

*A) Preparing data (integer encoding documents)*  

*B) Performing topic modeling on integer encoded documents*

In [15]:
!pip install numpy



In [16]:
# A vanilla implementation of topic modeling that uses only basic tools:
# json - to read from and write to files in JSON format
# numpy - for faster matrix operations
# string - to keep only English letters, removing punctuation and other characters
# random - to generate random numbers for initializing the Markov chain Monte Carlo state,
#          and for sampling during inference, which helps avoid local optima


import json
import numpy as np
import random
import string

# A) Preparing data (integer encoding documents)

1. Read textual data
2. Generate integer encoding
3. Store intermediate data

**Working with integers (representing words or unique tokens) is much faster than working with the word strings themselves**

*At the end, the integers are mapped back to their respective words*

## 1. Reading textual data
- Read raw text from a .txt file with one document per line
- Separate it into a list of documents
- Tokenize

1.1 Clean text by removing punctuation and characters other than English letters

In [2]:
def clean_text(text):
    # keep only lowercase English letters (the text is lowercased before this is called)
    kept_chars = [char for char in text if char in string.ascii_lowercase]
    return ''.join(kept_chars)
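
As a quick illustration of the cleaning rule, the sketch below re-declares `clean_text` so that it runs on its own and applies it to a few tokens. The sample tokens are made up for illustration and are not taken from the dataset.

```python
import string

def clean_text(text):
    # keep only lowercase English letters
    return ''.join(char for char in text if char in string.ascii_lowercase)

# hypothetical sample tokens (already lowercased, as in the pipeline)
samples = ["(reuters)", "oil,", "u.s.", "2004"]
cleaned = {token: clean_text(token) for token in samples}
print(cleaned)
# {'(reuters)': 'reuters', 'oil,': 'oil', 'u.s.': 'us', '2004': ''}
```

Tokens made up only of digits or punctuation clean to an empty string, so a length check on the cleaned form filters them out.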

1.2 Read data from the file \
1.3 convert to lower case \
1.4 tokenize

In [3]:
with open('config.json', 'r') as file:
    configurations = json.load(file)
with open(configurations["text-doc-path"], 'r') as file:
    rawdata = file.read()

#split on new line and convert to lower case
documents = rawdata.split('\n')
documents = [doc.lower() for doc in documents]

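# Tokenize sentences into words, keeping tokens whose cleaned form has more than 2 letters
# (the filter uses the cleaned length, but the original token is stored as-is)
tokenized_documents = []
for document in documents:
    tokenized_documents.append([token for token in document.split(' ') if len(clean_text(token)) > 2])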
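
The notebook reads its settings from `config.json`. The sketch below writes a file with the keys that the code cells look up and reads it back. The key names come from the notebook's code; every value is an assumption for illustration (the numeric ones mirror the defaults in the `LDA.__init__` signature, and `data/input.txt` is a hypothetical path).

```python
import json
import os
import tempfile

# Key names are the ones the notebook reads; the values are illustrative assumptions.
example_config = {
    "text-doc-path": "data/input.txt",  # hypothetical path to the raw text file
    "integer-encoded-doc-path": "data/integer-encoded-data.txt",
    "integer-word-dict": "data/revdictionary.json",
    "numTopics": 2,
    "numAlpha": 1.0,
    "numBeta": 0.01,
    "numGSIterations": 1000,
    "numNBurnin": 50,
    "numSampleLag": 20,
    "wordsPerTopic": 10,
}

# write to a temporary folder so that an existing config.json is not overwritten
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as file:
    json.dump(example_config, file, indent=2)

with open(config_path, "r") as file:
    loaded = json.load(file)

# check that every key used by the notebook is present
required = list(example_config.keys())
missing = [key for key in required if key not in loaded]
print("missing keys:", missing)
```

Checking for missing keys up front gives a clearer error than a `KeyError` raised in the middle of a later cell.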

## 2. Generate integer encoding
Integer encoding preserves both frequency and position information. Each unique token is assigned a dedicated integer id, the mapping is kept in a dictionary for later retrieval, and the documents are rewritten by replacing words with their integer ids.

This makes the operations much faster, since integers are faster to read, store, and compare than strings.

The integer ids are replaced with their original words at the end, using the stored dictionary files.

2.1 Generate integer encoded documents \
2.2 Generate word-integer index and integer index-word dictionaries

In [4]:
# Create a dictionary of unique tokens and assign integers
dictionary = {}
revdictionary = {}
index = 0

#tokenized_documents = [[word for word in doc if word not in esw] for doc in tokenized_documents]

for doc in tokenized_documents:
    for word in doc:
        if word not in dictionary:
            dictionary[word] = index
            revdictionary[index] = word
            index += 1

# Replace words in sentences with their corresponding integers
encoded_documents = [[dictionary[word] for word in doc] for doc in tokenized_documents]
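
A toy run makes the encoding concrete. The sketch below applies the same idea to two made-up documents and then decodes them again; the variable names and the sample words are assumptions chosen for illustration.

```python
# two tiny, made-up tokenized documents
docs = [["oil", "prices", "rise"], ["oil", "outlook", "prices"]]

word_to_id = {}
id_to_word = {}
for doc in docs:
    for word in doc:
        if word not in word_to_id:
            new_id = len(word_to_id)      # next unused integer id
            word_to_id[word] = new_id
            id_to_word[new_id] = word

# replace words with ids, then map the ids back to words
encoded = [[word_to_id[word] for word in doc] for doc in docs]
decoded = [[id_to_word[i] for i in doc] for doc in encoded]

print(encoded)   # [[0, 1, 2], [0, 3, 1]]
print(decoded == docs)   # True
```

Because ids are handed out in order of first appearance, they run from 0 to the vocabulary size minus 1, which lets them be used directly as array indices later on.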

## 3. Storing intermediate data
The integer-encoded documents are stored in a file, along with the word-to-id and id-to-word dictionaries.

*This avoids repeating these steps each time topic modeling is performed under different settings*

In [5]:
# one document per line, ids separated by tabs
# (joining avoids trimming characters off the last document)
toStr = '\n'.join('\t'.join(str(item) for item in endoc) for endoc in encoded_documents)
with open('data/integer-encoded-data.txt', 'w') as file:
    file.write(toStr)

# write dictionaries to file (JSON stores the integer keys of revdictionary as strings)
with open('data/dictionary.json', 'w') as file:
    json.dump(dictionary, file)
with open('data/revdictionary.json', 'w') as file:
    json.dump(revdictionary, file)
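
When the stored files are read back, two details matter: the ids come back as text and need converting to `int`, and JSON turns the integer keys of the id-to-word dictionary into strings. A minimal round-trip sketch follows, using a temporary folder and made-up data rather than the notebook's `data/` files.

```python
import json
import os
import tempfile

encoded_documents = [[0, 1, 2], [0, 3, 1]]   # made-up encoded documents
revdictionary = {0: "oil", 1: "prices", 2: "rise", 3: "outlook"}

folder = tempfile.mkdtemp()
docs_path = os.path.join(folder, "integer-encoded-data.txt")
dict_path = os.path.join(folder, "revdictionary.json")

# store: one document per line, ids separated by tabs
with open(docs_path, "w") as file:
    file.write('\n'.join('\t'.join(str(i) for i in doc) for doc in encoded_documents))
with open(dict_path, "w") as file:
    json.dump(revdictionary, file)

# load: skip empty rows and convert each id back to int
with open(docs_path, "r") as file:
    loaded_docs = [[int(i) for i in row.split('\t')]
                   for row in file.read().split('\n') if row]
with open(dict_path, "r") as file:
    loaded_dict = json.load(file)

# keys are strings after the JSON round trip, so look ids up with str(i)
first_doc_words = [loaded_dict[str(i)] for i in loaded_docs[0]]
print(first_doc_words)   # ['oil', 'prices', 'rise']
```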

# B) Topic Modeling (LDA)
- It identifies the hidden thematic structure within the documents and represents it as latent topics.
- Each document is a mixture of all possible topics with varying probabilities
- Each topic is a mixture of all vocabulary of the dataset with varying probabilities
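
The two mixture statements above correspond to the generative story that LDA assumes. A toy sketch of that story with NumPy follows; the sizes, prior values, and seed are arbitrary assumptions, and it is separate from the notebook's inference code.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed
num_topics, vocab_size, doc_length = 2, 6, 8
alpha, beta = 1.0, 0.5                       # symmetric Dirichlet priors (assumed values)

# each topic is a probability distribution over the vocabulary
phi = rng.dirichlet([beta] * vocab_size, size=num_topics)
# one document's mixture over topics
theta = rng.dirichlet([alpha] * num_topics)

document = []
for _ in range(doc_length):
    topic = rng.choice(num_topics, p=theta)          # pick a topic for this position
    word_id = rng.choice(vocab_size, p=phi[topic])   # pick a word id from that topic
    document.append(int(word_id))

print("theta:", theta)
print("document:", document)
```

Inference runs this story in reverse: given only the word ids, it estimates which topic produced each token.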

*Setting random seeds*

In [6]:
# For reproducible results
random.seed(41)  # For Python random
np.random.seed(41)  # For NumPy random

**LDA class**

The main functions are:
1. Markov chain Monte Carlo initialization: gives the model a random initial state, expecting the model to converge given a sufficiently large number of iterations.
2. Collapsed Gibbs sampling inference: in each iteration \
   2.1 Iterates through all documents and all tokens/words in each document \
   2.2 For each token, samples a topic from its conditional distribution given the current state of the model \
   2.3 Assigns the sampled topic to the token and updates the associated counts, and with them the model state
3. Estimate document-topic distribution from the final state of the model 
4. Estimate topic-word distribution (organized in decreasing order of probabilities) from the final state of the model
5. Other utility functions
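
Step 2.2 can be written compactly. Assuming the standard collapsed Gibbs update for LDA with symmetric priors, which matches the expression computed in the `gibbsSampling` method, the probability of assigning topic $t$ to the token at position $i$ of document $d$ is proportional to:

$$
p(z_{d,i} = t \mid \mathbf{z}_{\neg(d,i)}, \mathbf{w}) \;\propto\; \frac{n_{d,t} + \alpha}{n_{d} + T\alpha} \cdot \frac{n_{t,w} + \beta}{n_{t} + V\beta}
$$

Here $w$ is the word id of the token, $n_{d,t}$ is the number of tokens in document $d$ assigned to topic $t$ (`__arrNDT`), $n_{d}$ is the number of tokens in document $d$ (`__arrNDSum`), $n_{t,w}$ is the number of times word $w$ is assigned to topic $t$ (`__arrNTW`), and $n_{t}$ is the total number of tokens assigned to topic $t$ (`__arrNTSum`). All counts exclude the token currently being resampled. $T$ is the number of topics and $V$ the vocabulary size. The symbol names are notation introduced here for illustration. The first denominator does not depend on $t$, so it only rescales the weights.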

In [7]:
# The class implements topic modeling (Latent Dirichlet Allocation) using collapsed Gibbs sampling for inference.
class LDA:
    # topics to extract from the data (Components)
    _numTopics = None
    # vocabulary (unique words) in the dataset
    _arrVocab = None
    #size of vocabulary (count of unique words)
    _numVocabSize = None
    # dataset
    _arrDocs = []
    # dataset size (number of documents)
    _numDocSize = None
    # dirichlet prior (document to topic prior)
    _numAlpha = None
    # dirichlet prior (topic to word prior)
    _numBeta = None
    _ifScalarHyperParameters = True
    # Gibbs sampler iterations
    _numGSIterations = None
    # Number of initial burn-in iterations
    _numNBurnin = None
    # Number of iterations between parameter updates after burn-in (sample lag)
    _numSampleLag = None
    
    
    
    # The following attributes are for internal working
    __numTAlpha = None  
    __numVBeta = None   
    __arrTheta = None
    __arrThetaSum = None
    __arrPhi = None
    __arrPhiSum = None
    __arrNDT = None
    __arrNDSum = []
    __arrNTW = None
    __arrNTSum = []
    __arrZ = []
    
    # for alpha to be a list, its size must be equal to the size of the dataset, has value for each doc
    # for beta to be a list, its size must be equal to the number of topics, has value for each topic  
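    def __init__(self, numTopics = 2, numAlpha = 1.0, numBeta = 0.01,
                 numGSIterations = 1000, numNBurnin = 50, numSampleLag = 20,
                 wordsPerTopic = 10):
        # values from config.json take precedence; the arguments act as fallbacks
        self._numTopics = configurations.get("numTopics", numTopics)
        self._numAlpha = configurations.get("numAlpha", numAlpha)
        self._numBeta = configurations.get("numBeta", numBeta)
        self._numGSIterations = configurations.get("numGSIterations", numGSIterations)
        self._numNBurnin = configurations.get("numNBurnin", numNBurnin)
        self._numSampleLag = configurations.get("numSampleLag", numSampleLag)
        self.__wordsPerTopic = configurations.get("wordsPerTopic", wordsPerTopic)
        # per-instance containers, so that instances do not share the class-level lists
        self._arrDocs = []
        self.__arrZ = []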
            
    #load data as integer encoding of words in a sequence (no padding or truncation)
    def getData(self, path):
        with open(path, 'r') as file:
            rawData = file.read()
        self.__loadData(rawData)
        self.__loadVocab()
        self.__prepareCollections()

    #load docs and docSize from the dataset
    def __loadData(self, rawData):
        rows = rawData.split('\n')
         
        #read dataset as documents of words IDs
        for row in rows:
            swordlist = row.split('\t')
            swordlist = list(filter(None, swordlist))   #remove empty items from list
            if len(swordlist) > 0:
                iwordlist = [int(w) for w in swordlist]   # parse ids as integers (avoids eval on file content)
                self._arrDocs.append(iwordlist)

        # determine dataset size
        self._numDocSize = len(self._arrDocs)
        
        
    #Determine unique words (vocabulary) and count of unique words (vocabSize)    
    def __loadVocab(self):
        # determine unique vocabulary, preserving first-seen order
        uniqueWords = list(dict.fromkeys(word for doc in self._arrDocs for word in doc))
        self._arrVocab = uniqueWords
        self._numVocabSize = len(self._arrVocab)

    def __prepareCollections(self):
        # count arrays are integer; estimate arrays (theta, phi) are float
        self.__arrNDSum = np.zeros(self._numDocSize, dtype=int)
        self.__arrTheta = np.zeros((self._numDocSize, self._numTopics))
        self.__arrThetaSum = np.zeros((self._numDocSize, self._numTopics))
        self.__arrNDT = np.zeros((self._numDocSize, self._numTopics), dtype=int)

        self.__arrNTSum = np.zeros(self._numTopics, dtype=int)
        self.__arrPhi = np.zeros((self._numTopics, self._numVocabSize))
        self.__arrPhiSum = np.zeros((self._numTopics, self._numVocabSize))
        self.__arrNTW = np.zeros((self._numTopics, self._numVocabSize), dtype=int)

        #Assign values to parameters based on hyper-parameters
        self.__numTAlpha = self._numTopics*self._numAlpha  
        self.__numVBeta = self._numVocabSize*self._numBeta   

        
        for d in range(0, self._numDocSize):
            rowOfZeros = [0] * len(self._arrDocs[d])
            self.__arrZ.append(rowOfZeros)
                
    # Initialize the Markov chain with random topic assignments
    def randomMarkovChainInitialization(self):
        for d in range(self._numDocSize):
            doc = self._arrDocs[d]
            for ind in range(len(doc)):
                randtopic = random.randint(0, self._numTopics - 1)      # generate a topic number at random
                self.__arrZ[d][ind] = randtopic
                self.__arrNDT[d][randtopic] += 1
                self.__arrNDSum[d] += 1
                wordid = self._arrDocs[d][ind]
                self.__arrNTW[randtopic][wordid] += 1
                self.__arrNTSum[randtopic] += 1
            
    
    #Inference (Collapsed Gibbs Sampling)
    def gibbsSampling(self):
        tAlpha = self._numAlpha * self._numTopics
        vBeta = self._numBeta * self._numVocabSize            
                    
        for it in range(self._numGSIterations):
            for d in range(self._numDocSize):
                dsize = len(self._arrDocs[d])
                for ind in range(dsize):
                    # remove old topic from a word instance
                    oldTopic = self.__arrZ[d][ind]
                    wordid = self._arrDocs[d][ind]
                    self.__arrNDT[d][oldTopic] -= 1
                    self.__arrNDSum[d] -= 1
                    self.__arrNTW[oldTopic][wordid] -= 1
                    self.__arrNTSum[oldTopic] -= 1   

                    # sample a new topic for the word instance given the current state of the model
                    prob = [0] * self._numTopics
                    
                    for t in range(self._numTopics):
                        prob[t] = ((self.__arrNDT[d][t] + self._numAlpha) / (self.__arrNDSum[d] + tAlpha)) * \
                            (self.__arrNTW[t][wordid] + self._numBeta) / (self.__arrNTSum[t] + vBeta)
                    
                    # cumulative sum of the unnormalized topic probabilities
                    cdf = prob
                    for x in range(1, len(cdf)):
                        cdf[x] += cdf[x-1]
                    
                    cutoff = random.random() * cdf[-1]
                    newTopic = 0
                    for i in range(len(cdf)):
                        if cdf[i] > cutoff:
                            newTopic = i
                            break
                    #update as per new topic
                    self.__arrZ[d][ind] = newTopic
                    self.__arrNDT[d][newTopic] += 1
                    self.__arrNDSum[d] += 1
                    self.__arrNTW[newTopic][wordid] += 1
                    self.__arrNTSum[newTopic] += 1
                
    def getTopicsPerDocument(self):
        results = ''
        results += "***Topics per Document***\n"
        for d in range(self._numDocSize):
            results += "Document " + str(d) + ":\n"
            for t in range(self._numTopics):
                val = (self.__arrNDT[d][t]+self._numAlpha)/(self.__arrNDSum[d]+self.__numTAlpha)
                results += "Topic " + str(t) + ":" + str(val) + '\t'
            results += '\n'
        print(results)
        with open('data/output-data/document-topic-distribution.txt', 'w') as file:
            file.write(results)
                    
   
    def getWordsPerTopic(self, revdictionary):
        results = "***Words per Topic***\n"
        
        for t in range(self._numTopics):
            results += "\nTopic " + str(t) + ":"
            wpt = {}
            for v in range(self._numVocabSize):
                val = (self.__arrNTW[t][v]+self._numBeta)/(self.__arrNTSum[t]+self.__numVBeta)
                # revdictionary is loaded from JSON, so its keys are strings
                wpt[revdictionary[str(v)]] = float(val)
            results += '\n'
            wpt = sorted(wpt.items(), key=lambda x: x[1], reverse=True)[:self.__wordsPerTopic]
            for item in wpt:
                results += str(item)
        print(results)
    
    def printall(self):
        print("topics: ", self._numTopics)
        print("dataset: ", self._arrDocs)
        print("dataset size: ", self._numDocSize)
        print("vocab: ", self._arrVocab)
        print("vocab size: ", self._numVocabSize)
        print("ndt: ", self.__arrNDT)
        print("ndsum: ", self.__arrNDSum)
        print("ntw: ", self.__arrNTW)
        print("ntsum: ", self.__arrNTSum)
        print("z: ", self.__arrZ)
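
The topic draw inside `gibbsSampling` (cumulative sum, random cutoff, first index above the cutoff) can be tried in isolation. The sketch below is a NumPy variant of that step, not the class's own code; the function name `sample_index`, the weights, and the seed are assumptions for illustration.

```python
import numpy as np

def sample_index(weights, rng):
    """Draw an index with probability proportional to its (unnormalized) weight."""
    cdf = np.cumsum(weights)               # running totals of the weights
    cutoff = rng.random() * cdf[-1]        # uniform point in [0, total)
    # first index whose cumulative weight exceeds the cutoff
    index = int(np.searchsorted(cdf, cutoff, side="right"))
    return min(index, len(cdf) - 1)        # guard against floating-point edge cases

rng = np.random.default_rng(41)            # arbitrary seed
weights = [0.2, 0.5, 0.3]
draws = [sample_index(weights, rng) for _ in range(10000)]
freqs = np.bincount(draws, minlength=len(weights)) / len(draws)
print(freqs)   # roughly proportional to the weights
```

The weights do not need to sum to 1, because the cutoff is scaled by their total. This is why the sampler can skip normalizing the topic probabilities.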

## Run the model

In [10]:
if __name__ == "__main__":
    lda = LDA()
    lda.getData(configurations["integer-encoded-doc-path"])
    lda.randomMarkovChainInitialization()
    lda.gibbsSampling()

## Results
- The results are printed on screen; the document-topic distribution is also stored in the `data/output-data/` folder
- Topic distribution per document
- Word distribution per topic

*Document topic distribution*


In [None]:
lda.getTopicsPerDocument()
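
The document-topic values come from smoothing the topic counts of each document with the prior. A small vectorized sketch of that estimate follows, using made-up counts and an assumed symmetric `alpha`; it mirrors the formula in `getTopicsPerDocument` but is not the method itself.

```python
import numpy as np

# made-up document-topic counts: 3 documents, 2 topics
ndt = np.array([[5, 1],
                [0, 4],
                [2, 2]])
alpha = 1.0                          # assumed symmetric prior
num_topics = ndt.shape[1]

# smoothed estimate: (count + alpha) / (document length + T * alpha)
doc_lengths = ndt.sum(axis=1, keepdims=True)
theta = (ndt + alpha) / (doc_lengths + num_topics * alpha)

print(theta)
print(theta.sum(axis=1))             # each row sums to 1
```

The prior keeps every probability above zero, so a topic with no tokens in a document (the second row here) still gets a small share.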

*Topic word distribution*

In [11]:
with open(configurations["integer-word-dict"], 'r') as file:
    revdictionary = json.load(file)
lda.getWordsPerTopic(revdictionary)

***Words per Topic***

Topic 0:
('the', 0.10956585236325013)('and', 0.03986325013276686)('for', 0.029905735528412112)('economy', 0.029905735528412112)('prices', 0.029905735528412112)('has', 0.019948220924057358)('outlook', 0.019948220924057358)('new', 0.019948220924057358)('menace', 0.019948220924057358)('are', 0.01330987785448752)
Topic 1:
('oil', 0.08079734784087046)('(reuters)', 0.06379632777966678)('reuters', 0.06379632777966678)('wall', 0.025544032641958515)('band', 0.025544032641958515)('carlyle', 0.025544032641958515)('iraq', 0.025544032641958515)('from', 0.025544032641958515)('main', 0.025544032641958515)('southern', 0.025544032641958515)


*print all details:*
- Integer encoded dataset
- Final state of the model

In [14]:
# prints everything - for debugging
#lda.printall()