Practical 7B: Topic Modelling 

**Import libraries and download NLTK packages**
```Python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import gensim
from gensim import corpora

import string
from pathlib import Path
from pprint import pprint
```


**Reading in source files from directory** 
- Given a zip file of 250 files across various categories, unzip it and read each file directly, appending its text to a list to form the full corpus.
```Python
# The r prefix makes this a raw string literal, so backslashes in Windows paths are not treated as escapes
data_folder = Path(r'news')

# Read each file in the directory into a list named corpus
corpus = []
filenames = []

for filename in data_folder.iterdir():
    # The with block closes the file even if reading fails
    with open(filename, 'r', encoding='latin1') as fp:
        corpus.append(fp.read())
    # Keep the filename for later use
    filenames.append(filename.name)

print(len(corpus))
corpus
```

**Pre-Processing**: stopword removal, punctuation removal and lemmatization using WordNetLemmatizer()
```Python
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
# Clean up the content of one document; the cleaned results are stored in doc_clean below
def clean(doc):
    punc_free = ''.join([ch for ch in doc.lower() if ch not in exclude]) #remove punctuations
    stop_free = ' '.join([i for i in punc_free.split() if i not in stop]) #remove stopwords
    normalized = ' '.join(lemma.lemmatize(word) for word in stop_free.split()) #lemmatisation
    # Optional stemming; a stemmer (e.g. nltk.stem.PorterStemmer()) must be defined first
    #stemmed = ' '.join(stemmer.stem(word) for word in normalized.split())
    return normalized

doc_clean = [clean(doc).split() for doc in corpus]
```
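
To see what the first two cleaning stages do to a single document, here is a minimal sketch that runs without any NLTK downloads. It is an illustration only: `SAMPLE_STOPWORDS`, `clean_sketch` and the sample sentence are hypothetical, the stopword list is a tiny hand-picked stand-in for `stopwords.words('english')`, and lemmatization is skipped.

```Python
import string

# Hypothetical mini stopword list so the sketch runs without NLTK data.
# The practical itself uses stopwords.words('english').
SAMPLE_STOPWORDS = {"the", "are", "on", "a", "and"}
PUNCTUATION = set(string.punctuation)

def clean_sketch(doc):
    """Lowercase, strip punctuation, then drop stopwords (no lemmatization here)."""
    punc_free = "".join(ch for ch in doc.lower() if ch not in PUNCTUATION)
    return [tok for tok in punc_free.split() if tok not in SAMPLE_STOPWORDS]

sample = "The markets are rallying, and investors cheer on Wall Street!"
print(clean_sketch(sample))
# ['markets', 'rallying', 'investors', 'cheer', 'wall', 'street']
```

Because punctuation is deleted rather than replaced by a space, a word such as "don't" becomes "dont" and will no longer match a stopword entry that contains an apostrophe.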

**Preparing word representation**: use the gensim library to build a term-frequency (bag-of-words) representation
```Python
# Use gensim corpora to build a dictionary that maps each unique word to an integer id
dictionary = corpora.Dictionary(doc_clean)
# Convert each document / file into a bag-of-words vector: a list of (word id, count) pairs
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
```
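
The shape of the bag-of-words output can be illustrated in plain Python. This is a rough sketch of the idea, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the two toy documents are made up, and gensim may assign different ids to the same words.

```Python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy tokenised documents (illustrative only)
docs = [["market", "share", "market"], ["team", "win", "market"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)  # {'market': 0, 'share': 1, 'team': 2, 'win': 3}
print(bows)   # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Each document becomes a sparse list: only words that occur in it are listed, together with how often they occur.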

**Creating LDA model**: use gensim's LdaModel, setting the number of topics to 5 for the first model
```Python
topic_num = 5
word_num = 5

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = topic_num, id2word = dictionary, passes=20)

pprint(ldamodel.print_topics(num_topics=topic_num, num_words=word_num))
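# Compute perplexity. log_perplexity returns the per-word likelihood bound,
# not the perplexity itself; gensim's perplexity estimate is 2 ** (-bound)
bound = ldamodel.log_perplexity(doc_term_matrix)
print('Per-word bound: ', bound)
print('Perplexity: ', 2 ** (-bound))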
```
Some pointers:
- The topics produced are often hard to map to a clear category and may not be meaningful.
- How do you determine a suitable number of topics?
  - Use the perplexity value, a statistical measure of how well a probability model predicts a sample. It is most useful when comparing different LDA models: the model with the lower perplexity is considered better. Note that gensim's `log_perplexity` returns a per-word likelihood bound, and perplexity is `2 ** (-bound)`, so a bound closer to zero means a lower perplexity.
- **Increasing topic_num** to a large number MAY NOT HELP in understanding the categories (unless you have prior knowledge that a large value is appropriate); it tends to sacrifice clarity.
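
The comparison described in the pointers can be sketched as follows. This assumes you have already trained one model per candidate `topic_num` and recorded the value returned by `log_perplexity` for each. The helper names and the bound values below are hypothetical placeholders, not results from the news corpus.

```Python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (as returned by log_perplexity) to a perplexity."""
    return 2 ** (-per_word_bound)

def pick_topic_count(bounds_by_topic_num):
    """Return the topic count with the lowest perplexity, plus all perplexities."""
    perplexities = {
        topic_num: bound_to_perplexity(bound)
        for topic_num, bound in bounds_by_topic_num.items()
    }
    best = min(perplexities, key=perplexities.get)
    return best, perplexities

# Hypothetical bounds keyed by topic_num (made-up numbers for illustration)
example_bounds = {3: -8.0, 5: -7.5, 10: -7.75}
best, perplexities = pick_topic_count(example_bounds)
print(best)             # 5
print(perplexities[3])  # 256.0
```

Perplexity measured on the same documents used for training can flatter larger models, so if the data allows, compute the bound on documents held out from training and still check by eye that the chosen topics are interpretable.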