# Simple Wikipedia Dataset Topic Extraction Using LDA
Ryan Arnouk

Latent Dirichlet Allocation: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf 

## Latent Dirichlet Allocation (LDA) in Topic Modelling
### Explanation
In natural language processing (NLP), latent Dirichlet allocation (LDA) is a generative statistical model. It discovers topics in an unlabeled dataset from the frequency and distribution of words across documents.

LDA assumes the words in a document are related and tries to figure out how each document could have been created. We only need to tell the model how many topics to construct; it then estimates topic and word distributions over the corpus. These distributions allow us to identify similar documents within the corpus.

In other words, LDA is useful for finding the mixture of topics within a document.

When using LDA, you choose a fixed number of topics (k) to discover from the dataset. The model trained later in this notebook uses k = 350 to represent the topics in the Simple Wikipedia Dataset.

### Assumptions
- Documents with similar topics will use similar groups of words
- Document definitions/modeling: 
  - Documents are probability distributions over latent topics
  - Topics are probability distributions over words. 

Instead of focusing purely on the frequency of words in a topic, LDA also considers how words are distributed across topics.


### Plate Notation
A simple way to visually represent the dependencies among the model's parameters:

![Plate Notation](https://upload.wikimedia.org/wikipedia/commons/d/d3/Latent_Dirichlet_allocation.svg)

By Bkkbrad - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3610403


M is the total number of documents in the corpus.

N is the number of words in a document.

Outside parameters (Dirichlet priors):
- α is the parameter of the Dirichlet prior on the per-document topic distributions.
    - A high alpha means each document is likely to contain a mixture of most of the topics.
    - A low alpha means each document is likely to contain only 1 or 2 topics.
- β is the parameter of the Dirichlet prior on the per-topic word distributions.
    - A high beta means each topic is likely to contain a mixture of most of the words.
    - A low beta means each topic is likely to contain a mixture of only a few words.

Variables inside the plates:
- θ (theta) is the topic distribution for a document.
- z is the topic assigned to each word in a document; together, these assignments make each document a mixture of topics.
- w is the observed word.

<img src="https://miro.medium.com/max/800/1*_NdnljMqi8L2_lAYwH3JDQ.gif" alt="Dirichlet Distribution" width="500px" height="500px"/>

1000 samples from a Dirichlet distribution using an increasing alpha value.
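
The effect of alpha can also be checked numerically. The sketch below is only an illustration, assuming a symmetric prior over five topics and using the standard construction of a Dirichlet sample from normalized Gamma draws; the function names are made up for this example. It reports how much weight the single largest topic receives on average, which should shrink as alpha grows.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Standard construction: draw k independent Gamma(alpha, 1) values
    and normalize them so they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n_samples=2000, seed=0):
    """Average weight of the single largest topic across many samples."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples

# Low alpha: one topic dominates. High alpha: weight is spread across topics.
for alpha in (0.1, 1.0, 10.0):
    share = mean_largest_share(alpha)
    print(f"alpha={alpha}: mean largest topic share = {share:.2f}")
```

With a low alpha most of the probability mass lands on one topic, which matches the "1 or 2 topics per document" intuition described above.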

### How LDA Works
**LDA Works Backwards**

LDA runs in reverse: it starts with a corpus of documents and infers the topics. To understand how LDA infers a document's topics, it is important to first understand the generative process it assumes:

LDA assumes every document is created in the same way: 
1. Determine the number of words in the document.
2. Choose a topic mixture for the document over a fixed set of k topics, for example:
  a. Topic A: 20%
  b. Topic B: 20%
  c. Topic C: 60%
3. Generate each word in the document by: 
  a. Picking a topic from the document's topic mixture above. 
  b. Picking a word from that topic's word distribution. 
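
The generative steps above can be sketched in a few lines of plain Python. This is a toy illustration, not part of the project code: the topic names, vocabularies, and probabilities are made up, and only the 20/20/60 mixture comes from the example above.

```python
import random

def generate_document(topic_mixture, topic_word_dists, n_words, rng):
    """Generate one document following the generative story LDA assumes."""
    topics = list(topic_mixture)
    weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(n_words):
        # a. pick a topic from the document's topic mixture
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # b. pick a word from that topic's word distribution
        vocab = list(topic_word_dists[topic])
        probs = [topic_word_dists[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words

# Hypothetical toy topics (assumed values, for illustration only).
topic_word_dists = {
    "A": {"church": 0.6, "pope": 0.4},
    "B": {"animal": 0.7, "fish": 0.3},
    "C": {"physics": 0.5, "liquid": 0.3, "solid": 0.2},
}
topic_mixture = {"A": 0.2, "B": 0.2, "C": 0.6}

doc = generate_document(topic_mixture, topic_word_dists, n_words=10,
                        rng=random.Random(0))
print(doc)
```

Because word order never matters in this process, the generated "document" is just a bag of words, which is why the bag-of-words representation is a natural input for LDA.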

Since we have a corpus of documents, we want LDA to learn the topic representation of K topics in each document and the word distribution of each topic. 

LDA **backtracks** from the document level to identify topics that have likely generated the corpus. 

Steps: 
- Randomly assign each word in each document to one of the K topics. 
- For each document d, go through each word w in d: 
  - Assume all the topic assignments except the one for the current word are correct.
  - For each topic t, calculate two proportions: 
    - The proportion of words in document d that are currently assigned to topic t. 
    - The proportion of assignments to topic t, over all documents, that come from word w. 
  - Multiply the two proportions and assign w a new topic based on that probability. 
- Repeat these passes many times. Eventually a state is reached where the assignments make sense. 
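
The reassignment loop described in these steps corresponds to collapsed Gibbs sampling. Below is a minimal sketch of it in plain Python, assuming tiny tokenized documents and made-up default values for `alpha` and `beta`. It is meant to illustrate the steps only; as far as I know, gensim's LDA implementation is based on online variational Bayes rather than Gibbs sampling, so this is not what the model later in this notebook runs.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Randomly assign each word in each document to one of the K topics.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * n_topics for _ in docs]      # words in doc d assigned to topic t
    topic_word = [dict() for _ in range(n_topics)]  # times word w is assigned to topic t
    topic_total = [0] * n_topics                    # all words assigned to topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment; treat all others as correct.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # (share of doc d in topic t) * (share of topic t that is word w),
                # smoothed by the priors. The document-length denominator is the
                # same for every topic, so it is left out of the weights.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                t_new = rng.choices(range(n_topics), weights=weights, k=1)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] = topic_word[t_new].get(w, 0) + 1
                topic_total[t_new] += 1
    return assignments, doc_topic, topic_word

# Hypothetical toy corpus with two obvious themes.
docs = [
    ["church", "pope", "bishop", "church"],
    ["pope", "bishop", "church", "pope"],
    ["fish", "insect", "animal", "fish"],
    ["animal", "insect", "fish", "animal"],
]
assignments, doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```

The sampler is random, so the exact counts depend on the seed; on a corpus this small the two themes usually end up in separate topics, but that is not guaranteed.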

### LDA Pros vs Cons

Pros: 
- Has been shown to produce good results across many domains.
- Effective, easy to understand tool. 

Cons: 
- You must know the number of topics K in advance. This is something I struggled with in this project: because I created my own dataset, I was forced to estimate a good number of topics for my needs. 
- The topic distribution cannot capture correlations among topics, which makes it hard to group related topics together. For example, it is hard for me to group linked lists and arrays under a single computer science topic. 
- In my experience, discovering the optimal number of topics is a very tedious process, and I did not find a good explanation of the best way to determine it.  


### Topic Modelling vs Topic Classification
**Or LDA vs Neural Network** 

The difference is between **identifying a topic and classifying a topic**.

Originally, when working on this project, I was under the impression that I would use text classification with a neural network. However, I soon realized that this would be ineffective: although I had a defined set of topics I wanted returned, I would have no idea what the input text would be. Since the text passed to the model could be about any subject, it made more sense to extract topics from arbitrary text than to classify unbounded text into an unbounded number of topics. If I were developing my model for a single niche subject, a neural network would likely help improve my accuracy. 

Sources: 

https://www.youtube.com/watch?v=DWJYZq_fQ2A

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling

https://datascience.stackexchange.com/questions/962/what-is-difference-between-text-classification-and-topic-models



## Import Dataset
In my case, I am importing a Simple Wikipedia Dataset that I already preprocessed from the XML dump Wikipedia makes publicly available. The parsing code is available in *parse.py*.

In [0]:
# read the whole preprocessed corpus; the with block closes the file afterwards
with open('articles/articles.txt', 'r') as f:
    text = f.read()

## Preprocessing 
The steps needed to preprocess our data are as follows: 

- Tokenization: Split the text into words.
  - Words with 3 or fewer characters are removed. (This helps clear short words that may add noise to the model.)
- Stop word removal: Stop words are commonly used words such as "the", "a", "an", and "in". They are removed to keep noise in the model minimal. 
- Capitalization: Lowercase all the data.
- Stemming: Reduce words to their root form by removing suffixes. 
- Lemmatization: Words in third person are changed to first person, and verbs in past and future tenses are changed to present. 
- Vectorization: Convert words to vectors. Machine learning models only work with numbers, so the text must be translated into numbers. 
  - Types of vectorization include: 
    - Bag of words (the technology used in this project) 
    - TFIDF
    - Word2Vec
    - GloVe

In [38]:
# import libraries 
import gensim
import nltk 
import numpy as np 

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS 
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import * 

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [39]:
# split each doc by a blank line
docs_all = text.split('\n\n')
print(docs_all[:5]) # print 5 of the documents

['\n April is the fourth month of the year and comes between March and May It is one of four months to have day s April always starts on the same day of week as July and additionally January in leap years April always ends on the same day of the week as December April s flower s are the Sweet Pea and Asteraceae Daisy Its birthstone is the diamond The meaning of the diamond is innocence ', ' August is the eighth month of the year in the Gregorian calendar coming between July and September It has day s the same number of days as the previous month July and is named after Roman Emperor Augustus Caesar August doesn t start on the same day of the week as another month in Common year common years but starts on the same day of the week as February in Leap year leap years August always ends on the same day of the week as November ', ' Art is a creative activity by people The artist hopes that it affects the emotions of people who experience it Artists express themselves by their art Some peopl

In [0]:
# Use all docs for training 
# docs_train = docs_all[:60000]
# no held-out split is needed, since the model is only tried on unseen text later
docs_train = docs_all

In [0]:
# preprocessing 
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(text):
  # lemmatize the token as a verb, then stem the result
  return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stop words and short tokens, then lemmatize and stem
def preprocess(text):
  result = []
  # simple_preprocess lowercases and tokenizes the text
  for token in simple_preprocess(text):
      if token not in STOPWORDS and len(token) > 3:
          result.append(lemmatize_stemming(token))
  return result

In [0]:
# preprocess every document
processed_docs = [preprocess(doc) for doc in docs_train]

### Vectorization 
**Bag of Words** 
Now we are able to vectorize the data. Vectorization converts the words to numbers. 

Bag-of-words is a representation of text that describes the occurrence of words in a document. It contains the following: 
1. A vocabulary of known words. 
2. A measure of the presence of known words. 

It is called a *bag* of words because information about order or structure is discarded; the model is only concerned with whether words occur in the document, not where in the document. 


https://machinelearningmastery.com/gentle-introduction-bag-words-model/
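
To make the idea concrete, here is a small pure-Python sketch of a bag-of-words conversion. It imitates the shape of the `(token_id, count)` pairs used in this notebook, but the helper names are made up and the ids it assigns will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc_to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["april", "fourth", "month", "april"], ["august", "eighth", "month"]]
vocab = build_vocabulary(docs)

print(vocab)                        # {'april': 0, 'fourth': 1, 'month': 2, 'august': 3, 'eighth': 4}
print(doc_to_bow(docs[0], vocab))   # [(0, 2), (1, 1), (2, 1)]
print(doc_to_bow(["month", "physics", "month"], vocab))  # [(2, 2)]
```

The last line shows what happens with unseen text: a word that is not in the vocabulary ("physics" here) simply contributes nothing to the vector.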

In [0]:
dictionary = gensim.corpora.Dictionary(processed_docs)

#  checking dictionary created
count = 0
for k, v in dictionary.iteritems(): 
    print(k, v)
    count+=1
    if count > 10: 
        break

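# gensim filter_extremes
# no_below: drop tokens that appear in fewer than no_below documents (an absolute count, so 0.05 removes nothing)
# no_above: drop tokens that appear in more than no_above of all documents (a fraction of the corpus)
dictionary.filter_extremes(no_below=0.05, no_above=0.40)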

0 addit
1 april
2 asteracea
3 birthston
4 come
5 daisi
6 decemb
7 diamond
8 end
9 flower
10 fourth
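
The two thresholds passed to `filter_extremes` above use different units, which is easy to miss. The sketch below restates the rule in plain Python, based on my reading of the gensim documentation; `keep_token` is a made-up helper, the corpus size of 1000 is an assumed value, and the sketch ignores gensim's additional `keep_n` cap on vocabulary size.

```python
def keep_token(doc_freq, num_docs, no_below, no_above):
    """Return True if a token that appears in doc_freq documents survives the filter.

    no_below is an absolute number of documents;
    no_above is a fraction of the total corpus size.
    """
    return no_below <= doc_freq <= no_above * num_docs

num_docs = 1000  # assumed corpus size for illustration
for doc_freq in (1, 4, 5, 400, 401):
    with_fraction = keep_token(doc_freq, num_docs, no_below=0.05, no_above=0.40)
    with_count = keep_token(doc_freq, num_docs, no_below=5, no_above=0.40)
    print(doc_freq, with_fraction, with_count)
```

With `no_below=0.05`, even a token that appears in a single document is kept, so only the `no_above` side of the filter has any effect. Using an integer such as `no_below=5` would also drop very rare tokens, at the cost of changing the vocabulary and therefore the trained topics.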


## Model
Below, we run LDA on our corpus. 

First, we assign `bow_corpus` to the bag-of-words version of the documents. This step converts each document, which is a list of words, into a bag of words. 

### Running the Model
Parameters of `gensim.models.LdaMulticore`:

bow_corpus = the bag-of-words version of the documents

num_topics = the number of topics we want to extract from our corpus

id2word = the dictionary created above

passes = number of passes through the corpus during training 

iterations = maximum number of iterations through the corpus when inferring the topic distribution of a corpus 

workers = number of worker processes to be used for parallelization

More information can be found here: 
https://radimrehurek.com/gensim/models/ldamulticore.html

The trained model is saved to a pickle file to cache it, so it does not need to be retrained on every run. 

**Selecting Parameters:**
In my case, I had limited time to really experiment with tweaking all of the parameters. In the future, instead of choosing values at random and experimenting quickly to see what worked best, I will graph my results to find the parameters that give the most accurate results. Passing an unseen document to the model with no conclusive way to find the optimal number of topics makes tweaking the parameters a pain, and it becomes a game of trial and error. I ended up training this model with 350 topics and `model2.pkl` with 1000 topics, and I am looking into experimenting with both to see which is more effective. 

The experiment testing 1000 topics can be found in `/1000 topics testing`.
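
One common heuristic for comparing models trained with different topic counts is topic coherence. It is not what I used in this project; the sketch below is a simplified pure-Python version of the UMass coherence score, assuming each topic is given as a list of its top words ordered from most to least probable. The function name and the toy documents are made up for illustration. gensim also provides a `CoherenceModel` class that would be the more practical choice on the real corpus.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """UMass coherence for one topic; top_words must be ordered most probable first."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # number of documents that contain every one of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations() keeps list order, so `earlier` always outranks `later`.
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # skip words that never occur in the reference documents
        score += math.log((doc_freq(earlier, later) + 1) / d_earlier)
    return score

# Hypothetical stemmed documents, for illustration only.
docs = [
    ["church", "pope", "bishop"],
    ["church", "bishop", "cathedr"],
    ["anim", "fish", "insect"],
    ["anim", "fish", "egg"],
]
coherent = umass_coherence(["church", "bishop", "pope"], docs)
mixed = umass_coherence(["church", "fish", "egg"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher, so the first topic scores above the second. Averaging this score over all topics for each candidate number of topics, and plotting the averages, is one way to turn the trial and error described above into a graph.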

In [0]:
import pickle

# gensim doc2bow
# convert document which is a list of words into bag of words
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# running LDA on the bag of words
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                      num_topics=350,
                                      id2word=dictionary, 
                                      passes=20,
                                      iterations=300,
                                      workers=3)

# dump LDA model using pickle to use the model in the future without needing to rerun everything
with open('model.pkl', 'wb') as ldafile:
    pickle.dump(lda_model, ldafile)

Load the pickle as `loadlda` and print 10 of the topics.

In [43]:
with open('model.pkl', 'rb') as pickle_in:
    loadlda = pickle.load(pickle_in)

# for each topic, explore the words occurring in the topic and their relative weights
# note: idx is only the position in the printed list; the actual topic id is the first element of each tuple
for idx, topic in enumerate(loadlda.print_topics(num_topics=10, num_words=10)): # print 10 of the 350 topics
   print("Topic: {} \nWords: {}".format(idx, topic))
   print("\n")

Topic: 0 
Words: (127, '0.165*"drug" + 0.145*"travel" + 0.052*"fair" + 0.050*"illeg" + 0.050*"guid" + 0.030*"biographi" + 0.029*"bat" + 0.026*"addict" + 0.021*"recreat" + 0.019*"logo"')


Topic: 1 
Words: (270, '0.181*"anim" + 0.048*"speci" + 0.036*"insect" + 0.032*"live" + 0.031*"hunt" + 0.027*"catch" + 0.024*"like" + 0.022*"predat" + 0.021*"fish" + 0.019*"egg"')


Topic: 2 
Words: (299, '0.189*"church" + 0.081*"cathol" + 0.050*"protest" + 0.048*"singapor" + 0.048*"pope" + 0.040*"cathedr" + 0.037*"roman" + 0.033*"reform" + 0.029*"orthodox" + 0.028*"bishop"')


Topic: 3 
Words: (61, '0.276*"center" + 0.150*"trade" + 0.127*"coach" + 0.062*"assist" + 0.035*"lloyd" + 0.026*"owen" + 0.026*"willi" + 0.022*"head" + 0.016*"panther" + 0.016*"septemb"')


Topic: 4 
Words: (298, '0.342*"pakistan" + 0.088*"punjab" + 0.075*"surgeri" + 0.040*"pakistani" + 0.031*"lahor" + 0.022*"khyber" + 0.021*"pakhtunkhwa" + 0.019*"india" + 0.018*"surgeon" + 0.018*"sikh"')


Topic: 5 
Words: (63, '0.182*"centuri" 

## Testing on an unseen dataset

In [35]:
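def test(text): 
  # convert the unseen text into the same bag-of-words format used for training
  bow_vector = dictionary.doc2bow(preprocess(text))

  # print the matching topics, highest score first
  for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))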


test("Physics")

Score: 0.501427412033081	 Topic: 0.153*"physic" + 0.075*"liquid" + 0.056*"solid" + 0.048*"renam" + 0.033*"ridg"


The model returns the stemmed versions of words. As a quick workaround, I added a spell checker to the output. This is only a heuristic, since a spell checker can replace a stem with an unrelated word, and it can be improved in the future. 

In [34]:
# spell check the stemmed topic words before printing them
! pip install autocorrect
from autocorrect import Speller 
spell = Speller()

def test(text): 
  bow_vector = dictionary.doc2bow(preprocess(text))

  for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, spell(lda_model.print_topic(index, 5))))


test("Physics")

Score: 0.5013906359672546	 Topic: 0.153*"physic" + 0.075*"liquid" + 0.056*"solid" + 0.048*"renal" + 0.033*"ring"
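
In the output above, the spell checker turned "renam" into "renal" and "ridg" into "ring", which changes the meaning of the topic. One possible improvement, sketched below under my own assumptions, is to remember which original word most often produced each stem and display that word instead. `build_stem_lookup` and `toy_stem` are made-up names, and `toy_stem` is only a stand-in for the stemming step; in this notebook the `lemmatize_stemming` function would be passed in its place.

```python
from collections import Counter, defaultdict

def build_stem_lookup(tokens, stem):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

tokens = ["ridge", "ridge", "ridges", "physics", "liquid"]
lookup = build_stem_lookup(tokens, toy_stem)

print(lookup)                        # {'ridg': 'ridge', 'physic': 'physics', 'liquid': 'liquid'}
print(lookup.get("ridg", "ridg"))    # ridge
print(lookup.get("renam", "renam"))  # renam (unknown stems are left unchanged)
```

The lookup would need to be built once over the training tokens, before stemming, and then applied to the words in each printed topic.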
