Practical 7B: Topic Modelling 

**Import libraries + download packages**
```Python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import gensim
from gensim import corpora

import string
from pathlib import Path
from pprint import pprint
```


**Reading in source files from directory** 
- given a zip file of 250 files from various categories, unzip it, read each file directly, and append its text to a list to form the full corpus
```Python
#r marks a raw string literal, so backslashes in a Windows path are not treated as escapes
data_folder = Path(r'news')

#read each file from the directory into a list and name it corpus
corpus = []
filenames = []

for filename in data_folder.iterdir():
    #the with block closes the file even if reading fails
    with open(filename, 'r', encoding='latin1') as fp:
        corpus.append(fp.read())
    #keep the filename for later use
    filenames.append(filename.name)

print(len(corpus))
corpus
```

**Pre-Processing**: stopword removal, punctuation removal and lemmatization using WordNetLemmatizer()
```Python
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
#clean up the content; the cleaned documents are kept in the doc_clean variable
def clean(doc):
    punc_free = ''.join([ch for ch in doc.lower() if ch not in exclude]) #remove punctuations
    stop_free = ' '.join([i for i in punc_free.split() if i not in stop]) #remove stopwords
    normalized = ' '.join(lemma.lemmatize(word) for word in stop_free.split()) #lemmatisation
    #stemmed = ' '.join(stemmer.stem(word) for word in normalized.split())
    return normalized

doc_clean = [clean(doc).split() for doc in corpus]
```

**Preparing word representation**: use the gensim library to build a term-frequency (bag-of-words) word representation
```Python
dictionary = corpora.Dictionary(doc_clean) #use gensim corpora to create a data structure holding all unique words
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] #use the dictionary to create a doc-term matrix row for each doc / file using the bag-of-words approach
```
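
To make the bag-of-words idea concrete, the sketch below reproduces the shape of the output in plain Python: each document becomes a list of `(token_id, count)` pairs. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and this is not how gensim implements `Dictionary` or `doc2bow`; in particular, the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# hypothetical, already-cleaned documents (lists of tokens)
toy_docs = [["party", "election", "party"], ["network", "business", "party"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_bow)
```

Each inner list plays the role of one row of `doc_term_matrix`: only tokens that occur in the document are stored, together with how often they occur.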

**Creating LDA model**: use gensim's LDA model, setting the number of topics to 5 for the first model
```Python
topic_num = 5
word_num = 5

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = topic_num, id2word = dictionary, passes=20)

pprint(ldamodel.print_topics(num_topics=topic_num, num_words=word_num))
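# Compute perplexity: log_perplexity returns the per-word likelihood bound, not the perplexity itself
bound = ldamodel.log_perplexity(doc_term_matrix)
print('Per-word likelihood bound: ', bound)
# gensim's own log output reports the perplexity estimate as 2^(-bound)
print('Perplexity estimate: ', 2 ** (-bound))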
```
Some pointers: 
- the resulting topics are often difficult to map to a category and may not be meaningful
- how do you determine a suitable number of topics to use?
  - use the perplexity value, a statistical measure of how well a probability model predicts a sample. Its benefit comes when comparing different LDA models: the model with the lower perplexity value is considered better
- **increasing topic_num** to a large number **MAY NOT HELP** in understanding the categories (unless there is prior knowledge that a large value is plausible), and it sacrifices clarity
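
The comparison described above could be sketched as follows. Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the sketch assumes the bound is converted as 2^(-bound), which is how gensim's own log output reports its perplexity estimate. The bound values and the helper name `bound_to_perplexity` are hypothetical and only illustrate the selection step; in practice each value would come from a model trained with that number of topics, ideally evaluated on held-out documents.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate (assumed 2^(-bound))."""
    return 2 ** (-per_word_bound)

# hypothetical bounds, one per candidate number of topics (illustrative values only)
bounds_by_topic_num = {5: -8.2, 10: -8.0, 20: -8.5}

perplexity_by_topic_num = {
    k: bound_to_perplexity(b) for k, b in bounds_by_topic_num.items()
}

# the candidate with the lowest perplexity is preferred
best_topic_num = min(perplexity_by_topic_num, key=perplexity_by_topic_num.get)
print(best_topic_num)
```

Perplexity is only one signal: a model with slightly better perplexity can still produce topics that are harder for a person to interpret.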


*After Modelling Observations*  
- some words appear in the word lists of several topics (like said, mr) --> these can be treated as stopwords, since a generic stopword list does not cover domain-specific words
```Python
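from nltk.stem import PorterStemmer

#add domain-specific words to the stop word set
domain_stop = ["said", "mr"]
stop.update(domain_stop)

#add stemming to the pre-processing step. Stemming is done after lemmatization
stemmer = PorterStemmer()

def clean(doc):
    punc_free = ''.join(ch for ch in doc.lower() if ch not in exclude) #remove punctuations
    stop_free = ' '.join(w for w in punc_free.split() if w not in stop) #remove stopwords
    normalized = ' '.join(lemma.lemmatize(w) for w in stop_free.split()) #lemmatisation
    return ' '.join(stemmer.stem(w) for w in normalized.split()) #stemming

#re-clean the corpus, then rebuild the dictionary, doc_term_matrix and LDA model
doc_clean = [clean(doc).split() for doc in corpus]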
```
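
Words such as "said" were spotted above by reading the topic output. One possible way to shortlist further candidates is a document-frequency heuristic: flag tokens that occur in most documents. This is a swapped-in technique, not part of the practical; the function name `frequent_terms`, the threshold, and the sample documents are all hypothetical, and the resulting list should still be reviewed by hand.

```python
from collections import Counter

def frequent_terms(docs, min_doc_ratio=0.8):
    """Return tokens that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    threshold = min_doc_ratio * len(docs)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

# hypothetical cleaned documents (lists of tokens)
sample_docs = [
    ["said", "party", "election"],
    ["said", "network", "business"],
    ["said", "mr", "film"],
    ["mr", "said", "match"],
]

candidates = frequent_terms(sample_docs, min_doc_ratio=0.75)
print(candidates)
```

On the real corpus the same function could be called with `doc_clean`, and any accepted candidates added to `domain_stop`.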

**Retrieving topic details** 
- Part 1: find each file name and its corresponding topic ids with probabilities. Because LDA models content as a probabilistic mixture of topics, it assigns topic ids with probabilities to indicate that a piece of content can belong to more than one topic
```Python
print('\nFile name and its corresponding topic id with probability:')
dic_topic_doc = {}
for index, doc in enumerate(doc_clean):
    #for doc in doc_clean:
    bow = dictionary.doc2bow(doc)
    
    #get topic distribution of the ldamodel
    t = ldamodel.get_document_topics(bow)
    
    #sort the probability value in descending order to extract the top contributing topic id
    sorted_t = sorted(t, key=lambda x: x[1], reverse=True)
    
    #print the filename together with its sorted topic distribution
    print(filenames[index],sorted_t)
    
    #get the top scoring item
    top_item = sorted_t.pop(0)
    
    #create dictionary and keep key as topic id and filename and probability in tuple as value
    dic_topic_doc.setdefault(top_item[0],[]).append((filenames[index],top_item[1]))
```
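
The grouping step can be seen in isolation with a small sketch. The topic distributions below are hypothetical, shaped like the `(topic_id, probability)` lists used above, and `group_by_top_topic` is an illustrative helper rather than a gensim function.

```python
# hypothetical per-file topic distributions: lists of (topic_id, probability)
toy_topics = {
    "a.txt": [(0, 0.10), (1, 0.85), (2, 0.05)],
    "b.txt": [(0, 0.70), (2, 0.30)],
    "c.txt": [(1, 0.55), (2, 0.45)],
}

def group_by_top_topic(doc_topics):
    """Map each topic id to the files whose highest-probability topic it is."""
    grouped = {}
    for name, dist in doc_topics.items():
        # pick the (topic_id, probability) pair with the largest probability
        topic_id, prob = max(dist, key=lambda pair: pair[1])
        grouped.setdefault(topic_id, []).append((name, prob))
    # within each topic, list the most strongly assigned files first
    for members in grouped.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return grouped

toy_grouped = group_by_top_topic(toy_topics)
print(toy_grouped)
```

Keeping only the top topic is a simplification: a file such as `c.txt`, split 0.55 / 0.45, is counted under one topic even though it belongs almost equally to another.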

- Part 2: using the information above, extract for each topic id the number of files belonging to that topic and the list of file names with their probabilities
```Python
#print out each identified topic id and its associated documents
print('\nTopic id, number of documents, list of documents with probability and represented topic words:')

for key,value in dic_topic_doc.items():
    sorted_value = sorted(value, key=lambda x: x[1], reverse=True)
    print(key,len(value),sorted_value)
    #print the top words that represent the topic
    print(ldamodel.print_topic(key,word_num))
```

The result can be interpreted from the sample output below:

0 13 [('206.txt', 0.99757373), ('112.txt', 0.99581325), ('221 .txt', 0.99573374) … <br />
0.005*"said" + 0.005*"network" + 0.005*"business" + 0.004*"uk" + 0.004*"could"<br />
1 28 [('111.txt', 0.9982385), ('245.txt', 0.9976861), ('127.txt', 0.9975066),….<br />
0.007*"people" + 0.007*"would" + 0.006*"said" + 0.005*"blair" + 0.005*"party"

This means that topic id 0 has 13 files assigned to it, with 206.txt having the highest probability, followed by 112.txt and so on. Python indexing starts at 0, so topic id 0 is simply the first topic identified.

Similarly, topic id 1 has 28 files assigned to it, with 111.txt having the highest probability, followed by 245.txt and so on.


**Visualize Topics and Keywords**

Now, we are ready to visualize our LDA model.

The following code uses the pyLDAvis tool to visualize the fit of your LDA model across topics and their top words.
> pip install pyLDAvis
```Python
# plotting tools
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# visualize the topics and keywords
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary)
vis
```