# Simple Wikipedia Dataset Topic Extraction Using LDA
# 1000 topics test
Ryan Arnouk

Latent Dirichlet Allocation: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

## Latent Dirichlet Allocation (LDA) in Topic Modelling
### Explanation
In natural language processing (NLP), latent Dirichlet allocation is a generative statistical model. It extracts topics from an unlabeled dataset using the frequency and distribution of words across documents.

LDA assumes the words in a document are related, and tries to work out how each document could have been generated. We only need to tell the model how many topics to construct; it then produces topic and word distributions over the corpus, which lets us identify similar documents within it.

In short, LDA is useful for finding the mixture of topics within a document.

When using LDA you choose a fixed number of topics (k) to discover from the dataset. In my case, I chose 200 to represent all the topics from the Simple Wikipedia Dataset; this notebook is the 1000-topic test, so the training cell uses `num_topics=1000`.

### Assumptions
- Documents with similar topics will use similar groups of words
- Document definitions/modeling: 
  - Documents are probability distributions over latent topics
  - Topics are probability distributions over words. 

Instead of focusing purely on the frequency of words in a topic, LDA also focuses on the distribution of words across topics.


### Plate Notation
A simple way to visually represent all the dependencies among the model's parameters:

![Plate Notation](https://upload.wikimedia.org/wikipedia/commons/d/d3/Latent_Dirichlet_allocation.svg)

By Bkkbrad - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3610403


M represents the total number of documents in the corpus

N denotes the number of words in a document

Outside parameters (Dirichlet priors):
- α is the parameter of the Dirichlet prior on the per-document topic distributions
    - A high alpha means each document is likely to contain a mixture of most of the topics.
    - A low alpha means each document is likely to contain only 1 or 2 topics.
- β is the parameter of the Dirichlet prior on the per-topic word distributions
    - A high beta means each topic will contain a mixture of most of the words.
    - A low beta means each topic will contain a mixture of only a few words.

Variables inside the plates:
- θ (theta) is the topic distribution for a document
- z is the topic assigned to each word; together these assignments make each document a mixture of topics
- w is the observed word
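The effect of a high versus low alpha can be checked with a small simulation. This is a minimal sketch using only the standard library, not part of the original notebook: the helper names, the 5 topics, and the alpha values 0.1 and 10.0 are all illustrative assumptions. It samples symmetric Dirichlet topic mixtures and reports how much of a document its single largest topic takes up on average.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics.

    Standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n_docs=500, seed=0):
    """Average share held by the biggest topic across n_docs sampled documents."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)]
    return sum(shares) / n_docs

# Hypothetical values chosen only to contrast a "low" and a "high" alpha.
low_alpha_share = mean_largest_share(alpha=0.1)
high_alpha_share = mean_largest_share(alpha=10.0)

print(f"alpha=0.1  -> largest topic holds about {low_alpha_share:.2f} of a document")
print(f"alpha=10.0 -> largest topic holds about {high_alpha_share:.2f} of a document")
```

With a low alpha one topic tends to dominate each document, while a high alpha pushes the mixture toward an even split. The same reasoning applies to beta and the words inside a topic.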

### How LDA Works
**LDA Works Backwards**

LDA runs in reverse: it starts with a corpus of documents and infers the topics. To understand how LDA infers a document's topics, it is important to first understand the generative process:

LDA assumes every document is created in the same way:
1. Determine the number of words in the document
2. Choose a topic mixture for the document over a fixed set of topics (k)
  a. Topic A: 20%
  b. Topic B: 20%
  c. Topic C: 60%
3. Generate each word in the document by:
  a. First picking a topic from the document's topic mixture above.
  b. Then picking a word from that topic's word distribution.
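The generative steps above can be sketched in a few lines of plain Python. Everything in this example is made up for illustration: the two topics, their word probabilities, and the 70/30 mixture are hypothetical and do not come from the Simple Wikipedia model.

```python
import random

rng = random.Random(42)

# Hypothetical toy model: each topic is a probability distribution over words.
topics = {
    "sports": {"ball": 0.5, "team": 0.3, "goal": 0.2},
    "cooking": {"oven": 0.4, "salt": 0.4, "recipe": 0.2},
}

# Step 2: the topic mixture chosen for this document.
topic_mixture = {"sports": 0.7, "cooking": 0.3}

def generate_document(n_words):
    """Step 3: generate n_words words, one at a time."""
    words = []
    for _ in range(n_words):
        # 3a: pick a topic from the document's topic mixture.
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        # 3b: pick a word from that topic's word distribution.
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# Step 1: decide the document length (fixed at 8 here for simplicity).
doc = generate_document(n_words=8)
print(doc)
```

The output is a bag of words with no grammar, which is exactly the kind of document LDA assumes. Training works in the opposite direction: given only the words, it recovers plausible values for `topics` and `topic_mixture`.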

Since we have a corpus of documents, we want LDA to learn each document's topic representation (its mixture over the K topics) and the word distribution of each topic.

LDA **backtracks** from the document level to identify topics that have likely generated the corpus. 

Steps:
- Randomly assign each word in each document to one of the K topics.
- For each document d, go through each word w in d:
  - Assume all the topic assignments except the current word's are correct
  - For each topic t, calculate two proportions:
    - Proportion of words in document d that are currently assigned to topic t
    - Proportion of assignments to topic t, over all documents, that come from this word w
  - Multiply the two proportions and assign w a new topic based on that probability.
- Repeat this many times; eventually a state is reached where the assignments make sense.
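The steps above describe collapsed Gibbs sampling. Below is a minimal sketch of that procedure on a toy corpus, written only to make the steps concrete. It is not how gensim trains (gensim's `LdaMulticore` uses online variational Bayes), and the function name, the smoothing values `alpha` and `beta`, and the toy documents are all assumptions.

```python
import random

def gibbs_lda(docs, k, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised docs."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]                       # words in doc d assigned to topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                                     # all words assigned to topic t

    # Step 1: randomly assign each word in each document to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Treat every assignment except the current word's as correct.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # (share of doc d in topic t) * (share of topic t that is word w),
                # smoothed by alpha and beta so no weight is ever zero.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(k)
                ]

                # Assign w a new topic with probability proportional to the weights.
                t_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic, topic_word

toy_docs = [
    ["ball", "team", "goal", "team"],
    ["oven", "salt", "recipe", "salt"],
    ["team", "goal", "ball", "ball"],
    ["recipe", "oven", "salt", "oven"],
]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, k=2)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic. Dividing a row by the document length gives an estimate of the document's topic mixture.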

### LDA Pros vs Cons

Pros: 
- Has been shown to produce good results over many domains
- Effective, easy to understand tool. 

Cons: 
- You must know the number of topics K in advance. This is something I struggled with in this project: because I created my own dataset, I was forced to estimate a good number of topics for my needs.
- The topic distribution cannot capture correlations among topics, which makes it hard to group related topics together. For example, linked lists and arrays are hard to group together as simply computer science.
- In my experience, discovering the optimal number of topics is a very tedious process, and I did not find a good explanation of the best way to determine it.


### Topic Modelling vs Topic Classification
**Or LDA vs Neural Network** 

The difference is between **identifying a topic vs classifying a topic**

Originally, when working on this project, I was under the impression I would use text classification with a neural network. However, I soon realized this would be ineffective: although I have a defined set of topics I want returned, I have no idea what the input text will be. Since the text fed into the model could be about any subject, it makes more sense to extract the topic from any text than to classify unbounded text into an unbounded number of topics. If I were developing a model for one niche subject, a neural network would likely help improve accuracy.

Sources: 

https://www.youtube.com/watch?v=DWJYZq_fQ2A

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling

https://datascience.stackexchange.com/questions/962/what-is-difference-between-text-classification-and-topic-models



## Import Dataset
In my case, I am importing a Simple Wikipedia dataset that I already preprocessed from the publicly available Wikipedia XML dump. The parsing code is in *parse.py*.

In [0]:
# read the whole preprocessed dump; the context manager closes the file afterwards
with open('articles/articles.txt', 'r') as f:
    text = f.read()

## Preprocessing 
The steps needed to preprocess the data are as follows:

- Tokenization: Split the text into individual words.
  - Words with 3 or fewer characters are removed. (This helps clear short, random words that add noise to the model.)
- Stop word removal: Stop words are commonly used words such as "the", "a", "an" and "in". They are removed to keep noise in the model minimal.
- Capitalization: Lowercase all the data.
- Stemming: Reduce words to their root form by removing suffixes.
- Lemmatization: Words in third person are changed to first person, and verbs in past and future tenses are changed to present.
- Vectorization: Convert documents to numeric vectors. Machine learning models only operate on numbers, so the words must be translated into numbers.
  - Types of vectorization include:
    - Bag of words (the technology used in this project) 
    - TFIDF
    - Word2Vec
    - GloVe
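To make the bag-of-words idea concrete, here is a minimal standard-library sketch of what a bag-of-words vector looks like: a list of `(token_id, count)` pairs per document. The two toy documents and the helper name `to_bow` are illustrative assumptions; in this notebook the real conversion is done by gensim's `Dictionary.doc2bow`.

```python
from collections import Counter

# Toy documents, already tokenised (hypothetical data).
docs = [
    ["april", "month", "year", "month"],
    ["august", "month", "calendar"],
]

# Give every distinct token an integer id, in order of first appearance.
vocab = {}
for doc in docs:
    for token in doc:
        vocab.setdefault(token, len(vocab))

def to_bow(doc):
    """Return the document as sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(vocab)
print(bow)
```

Word order is discarded and only the counts remain, which is all LDA needs.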

In [5]:
# import libraries
import gensim
import nltk
import numpy as np

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# WordNet data is required by WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# split each doc by a blank line
docs_all = text.split('\n\n')
print(docs_all[:5])

['\n April is the fourth month of the year and comes between March and May It is one of four months to have day s April always starts on the same day of week as July and additionally January in leap years April always ends on the same day of the week as December April s flower s are the Sweet Pea and Asteraceae Daisy Its birthstone is the diamond The meaning of the diamond is innocence ', ' August is the eighth month of the year in the Gregorian calendar coming between July and September It has day s the same number of days as the previous month July and is named after Roman Emperor Augustus Caesar August doesn t start on the same day of the week as another month in Common year common years but starts on the same day of the week as February in Leap year leap years August always ends on the same day of the week as November ', ' Art is a creative activity by people The artist hopes that it affects the emotions of people who experience it Artists express themselves by their art Some peopl

In [0]:
# Use all docs for training.
# docs_train = docs_all[:60000]
# No held-out split is needed: LDA is unsupervised, and the model is later tried on new, unseen text.
docs_train = docs_all

In [0]:
# preprocessing
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(token):
  # lemmatize as a verb, then stem the result
  return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

# Tokenize, drop stop words and tokens of 3 or fewer characters, then lemmatize and stem
def preprocess(text):
  result = []
  for token in simple_preprocess(text):
      if token not in STOPWORDS and len(token) > 3:
          result.append(lemmatize_stemming(token))
  return result

In [0]:
# preprocess every training document
processed_docs = [preprocess(doc) for doc in docs_train]

In [10]:
dictionary = gensim.corpora.Dictionary(processed_docs)

# check the dictionary that was created: print the first 11 (id, token) pairs
# .items() is used because newer gensim releases no longer provide iteritems()
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

# gensim filter_extremes
# drop tokens that appear in fewer than no_below documents
# drop tokens that appear in more than no_above (a fraction) of all documents
dictionary.filter_extremes(no_below=3, no_above=0.40)

0 addit
1 april
2 asteracea
3 birthston
4 come
5 daisi
6 decemb
7 diamond
8 end
9 flower
10 fourth
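The `no_below` / `no_above` rule is easy to misread, so here is a simplified, standard-library sketch of the document-frequency filter it describes. This is not gensim's implementation (for instance, it ignores gensim's cap on vocabulary size), and the function name, toy documents and thresholds are assumptions chosen for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most a no_above fraction of docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears (not how many times).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

toy_docs = [
    ["the", "april", "month"],
    ["the", "august", "month"],
    ["the", "art"],
    ["the", "april"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "the" is dropped for being in every document, "august" and "art" are dropped for being in only one, and "april" and "month" survive.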


In [29]:
import pickle

# gensim doc2bow
# convert each document (a list of words) into a bag of words
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# run LDA on the bag of words
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                      num_topics=1000,
                                      id2word=dictionary,
                                      passes=10, 
                                      iterations=100,
                                      workers=3)

# dump the LDA model with pickle so it can be reused later without rerunning everything
with open('model2.pkl', 'wb') as ldafile:
    pickle.dump(lda_model, ldafile)

  diff = np.log(self.expElogbeta)


In [30]:
with open('model2.pkl', 'rb') as pickle_in:
    loadlda = pickle.load(pickle_in)

# for each topic, explore the words occurring in the topic and their relative weights
for idx, topic in enumerate(loadlda.print_topics(num_topics=1000, num_words=10)): # all 1000 topics, top 10 words each
   print("Topic: {} \nWords: {}".format(idx, topic))
   print("\n")

Topic: 0 
Words: (0, '0.251*"peac" + 0.161*"anti" + 0.119*"fail" + 0.070*"metro" + 0.054*"circus" + 0.040*"elliott" + 0.034*"tension" + 0.029*"armor" + 0.025*"striker" + 0.022*"thomson"')


Topic: 1 
Words: (1, '0.325*"compar" + 0.144*"constel" + 0.035*"thirteenth" + 0.032*"cunningham" + 0.031*"orion" + 0.027*"name" + 0.026*"barrett" + 0.024*"dane" + 0.018*"modern" + 0.018*"stella"')


Topic: 2 
Words: (2, '0.265*"issu" + 0.259*"meet" + 0.105*"conflict" + 0.066*"debat" + 0.042*"karl" + 0.036*"millennium" + 0.035*"sacramento" + 0.021*"place" + 0.015*"begin" + 0.014*"peopl"')


Topic: 3 
Words: (3, '0.153*"penguin" + 0.144*"syria" + 0.084*"indus" + 0.080*"balochistan" + 0.050*"low" + 0.049*"syrian" + 0.044*"appalachian" + 0.044*"wheeler" + 0.032*"sack" + 0.028*"thorp"')


Topic: 4 
Words: (4, '0.151*"neutral" + 0.134*"shock" + 0.126*"jacob" + 0.089*"angola" + 0.064*"glori" + 0.048*"hymn" + 0.046*"usernam" + 0.039*"cameo" + 0.031*"astor" + 0.027*"brotherhood"')


Topic: 5 
Words: (5, '0.2

In [34]:
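def test(query): 
  # convert the query text to a bag of words using the training dictionary
  bow_vector = dictionary.doc2bow(preprocess(query))

  # print the matching topics, highest score first, with each topic's top 5 words
  for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))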


test("Unity 3d")

Score: 0.5005000233650208	 Topic: 0.450*"michigan" + 0.103*"withdraw" + 0.081*"pitcher" + 0.040*"uniti" + 0.037*"state"


In [35]:
# print each topic with its stemmed words run through a spell checker
! pip install autocorrect
from autocorrect import Speller 
spell = Speller()

def test(query): 
  # convert the query text to a bag of words using the training dictionary
  bow_vector = dictionary.doc2bow(preprocess(query))

  # print the matching topics, highest score first, spell-correcting the topic string
  for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, spell(lda_model.print_topic(index, 5))))


test("Soccer")

Score: 0.5005000233650208	 Topic: 0.387*"football" + 0.316*"assoc" + 0.063*"american" + 0.057*"unit" + 0.050*"soccer"
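Both test queries return the same score, 0.5005, which suggests the number comes from the prior rather than from the text. A plausible explanation, assuming the model uses gensim's default symmetric document-topic prior of `1 / num_topics` and that each query reduces to a single in-vocabulary token after preprocessing, is worked through below. The variable names are illustrative only.

```python
num_topics = 1000
alpha = 1.0 / num_topics   # assumed: gensim's default symmetric prior
tokens_in_query = 1        # assumed: the query keeps one token after preprocessing

# If the single token is attributed entirely to one topic, that topic's
# pseudo-count is alpha + 1 and every other topic keeps just alpha.
winning_topic = alpha + tokens_in_query
every_other_topic = alpha
normaliser = num_topics * alpha + tokens_in_query

top_score = winning_topic / normaliser
other_score = every_other_topic / normaliser

print(f"top topic score:  {top_score:.4f}")
print(f"other topic score: {other_score:.4f}")
```

Under these assumptions the top topic scores about 0.5005 and every other topic about 0.0005, which is too small to be listed. So a one-word query can never score higher than roughly 0.5 with this setup; longer queries give the model more evidence and more varied scores.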
