
Topic Modeling
We use Gensim's LDA (Latent Dirichlet Allocation) implementation to model topics in newsgroup_data.

#Introduction#
Suppose you have the following set of sentences:

* I like to eat broccoli and bananas.
* I ate a banana and spinach smoothie for breakfast.
* Chinchillas and kittens are cute.
* My sister adopted a kitten yesterday.
* Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

* **Sentences 1 and 2:** 100% Topic A
* **Sentences 3 and 4:** 100% Topic B
* **Sentence 5:** 60% Topic A, 40% Topic B
* **Topic A:** 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
* **Topic B:** 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?



#LDA Model#

LDA is a generative probabilistic model (a three-level hierarchical Bayesian model) for collections of discrete data such as text corpora, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

* Decide on the number of words N the document will have (say, according to a Poisson distribution).
* Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
* Generate each word w_i in the document by:
  * First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
  * Then using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
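The generative story above can be sketched in a few lines of plain Python. This is only an illustration and not part of Gensim: the two topics and their word probabilities, the Dirichlet parameter `alpha`, the mean document length, and the helper names `sample_poisson`, `sample_dirichlet` and `generate_document` are all assumptions made up for the example.

```python
import math
import random

# Hypothetical topics: each maps words to probabilities (illustrative values only).
TOPICS = {
    "food": {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10, "eating": 0.45},
    "cute animals": {"chinchillas": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.40},
}

def sample_poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topics, alpha=1.0, mean_length=5):
    names = list(topics)
    # Step 1: decide on the number of words (at least one, for a non-empty document).
    n_words = max(1, sample_poisson(rng, mean_length))
    # Step 2: choose the topic mixture for this document.
    theta = sample_dirichlet(rng, [alpha] * len(names))
    words = []
    for _ in range(n_words):
        # Step 3a: pick a topic according to the document's mixture.
        topic = rng.choices(names, weights=theta)[0]
        # Step 3b: pick a word according to that topic's word distribution.
        vocab = list(topics[topic])
        weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return dict(zip(names, theta)), words

rng = random.Random(0)
mixture, document = generate_document(rng, TOPICS)
print(mixture)
print(" ".join(document))
```

The printed document is an unordered bag of words, which is all the model claims to generate.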



##Example##
Let’s make an example. According to the above process, when generating some particular document *D*, you might

* Pick 5 to be the number of words in D.
* Decide that D will be 1/2 about food and 1/2 about cute animals.
* Pick the first word to come from the food topic, which then gives you the word “broccoli”.
* Pick the second word to come from the cute animals topic, which gives you “panda”.
* Pick the third word to come from the cute animals topic, giving you “adorable”.
* Pick the fourth word to come from the food topic, giving you “cherries”.
* Pick the fifth word to come from the food topic, giving you “eating”.
So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a *bag-of-words* model).

##Learning##
So now suppose you have a set of documents. You’ve chosen some fixed number of **K** topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed **Gibbs sampling**) is the following:

* Go through each document, and randomly assign each word in the document to one of the K topics.
* Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
* So to improve on them, for each document d…
  * Go through each word w in d…
    * And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w.
    * Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)
    * In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
* After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
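The procedure above can be sketched as a small collapsed Gibbs sampler. This is a toy illustration under stated assumptions, not what Gensim does internally (Gensim uses online variational Bayes): the function name `gibbs_lda`, the pseudocounts `alpha` and `eta` (the priors glossed over above), the iteration count, and the toy documents are all hypothetical. Documents are given as lists of integer word ids.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic t
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic t
    topic_total = [0] * num_topics                               # total words assigned to topic t

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Start by randomly assigning every word to one of the topics.
    assignments = []
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, t in zip(doc, assignments[d]):
            update(d, w, t, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Treat all other assignments as correct and remove this word's own.
                update(d, w, assignments[d][i], -1)
                # Unnormalised p(topic t | document d) * p(word w | topic t), with pseudocounts.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta) / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                new_t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_t
                update(d, w, new_t, +1)

    # Estimate each document's topic mixture from the final assignment counts.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word

# Toy corpus: ids index into this hypothetical vocabulary.
vocab = ["broccoli", "bananas", "kittens", "hamster"]
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

Each row of `theta` is one document's estimated topic mixture, and normalising a row of `topic_word` gives that topic's word distribution.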

In [0]:
from google.colab import files
# # from IPython.display import Image
# uploaded = files.upload()

#Gensim: models.ldamodel – Latent Dirichlet Allocation

[This module](https://radimrehurek.com/gensim/models/ldamodel.html) supports both estimating an LDA model from a training corpus and inferring the topic distribution of new, unseen documents. The model can also be updated with new documents for online training. The core estimation code is based on the [onlineldavb.py script](https://github.com/blei-lab/onlineldavb/blob/master/onlineldavb.py) by [Hoffman et al. (2010)](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf).

#Step 1:  Load the Dataset

In [0]:
import pandas as pd
url='https://raw.githubusercontent.com/soologua/NLP-Exercises/master/2.2-topic-modeling/abcnews-date-text.csv'
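# Skip malformed rows; `on_bad_lines='skip'` (pandas >= 1.3) replaces `error_bad_lines=False`, which was removed in pandas 2.0
data = pd.read_csv(url, on_bad_lines='skip')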


In [3]:
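# Keep the first 300,000 headlines; copy so that adding a column does not modify a slice of `data`
data_text = data[:300000][['headline_text']].copy()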
data_text['index'] = data_text.index

documents = data_text
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


#Step 2: Preprocessing

In [4]:
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# Seed NumPy's random number generator for reproducibility
np.random.seed(400)


In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
# Write a function to perform the pre processing steps on the entire dataset
stemmer = SnowballStemmer("english")

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # Lemmatize and stem the token, then add it to the results list
            result.append(lemmatize_stemming(token))
    return result

In [7]:
# Preview a document after preprocessing

document_num = 2310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
print(doc_sample.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['sculpture', 'exhibition', 'revealed', 'in', 'tasmania']


Tokenized and lemmatized document: 
[u'sculptur', u'exhibit', u'reveal', u'tasmania']


In [0]:
#preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs= documents['headline_text'].map(preprocess)

#Step 3: Bag of words on the dataset#
Now let's create a dictionary from 'processed_docs' that maps each unique word in the training set to an integer id and keeps track of how often each word appears. To do that, let's pass processed_docs to gensim.corpora.Dictionary() and call the result 'dictionary'.

In [0]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [13]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

(7958, u'verplank')
(13916, u'woodi')
(14482, u'francesco')
(17648, u'kaniva')
(23125, u'yanagisawa')
(12187, u'scold')
(11412, u'szabic')
(6913, u'emptiv')
(23206, u'strikebreak')
(6219, u'citrus')
(23770, u'donger')


##filter_extremes##
`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`
* Filter out tokens that appear in (1) fewer than no_below documents (absolute number) or (2) more than no_above documents (fraction of total corpus size, not absolute number).
* After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
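The filtering rule can be illustrated without Gensim. The helper below, `filter_extremes_sketch`, is a hypothetical stand-in that follows the description above; it is not Gensim's implementation, and the alphabetical tie-breaking and the toy documents are assumptions for the example.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive the filtering rule described above."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # convert the fraction to a document count
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent survivors first; ties are broken alphabetically (an assumption).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept if keep_n is None else kept[:keep_n]

docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["apple", "banana", "date"],
    ["apple", "banana"],
]
# "apple" is in all 4 documents (more than 75%); "cherry" and "date" are in only 1.
print(filter_extremes_sketch(docs, no_below=2, no_above=0.75))
# ['banana']
```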

In [0]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
# Apply dictionary.filter_extremes() with the thresholds described above
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

##Doc2bow##
Convert each document (a list of words) into the bag-of-words (BoW) format: a list of (token_id, token_count) 2-tuples.
Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded).
* Create the bag-of-words representation for each document, i.e. for each document, record which dictionary words appear in it and how many times each one appears.
* Save this to 'bow_corpus'.
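To make the format concrete, here is a minimal pure-Python sketch of the conversion. `doc2bow_sketch` and the small `token2id` mapping are hypothetical; the real ids come from the `dictionary` built above, and this is not Gensim's own code.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, token_count) pairs."""
    # Tokens missing from the mapping (e.g. removed by filtering) are ignored.
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical ids for illustration only.
token2id = {"exhibit": 0, "reveal": 1, "sculptur": 2, "tasmania": 3}
tokens = ["sculptur", "exhibit", "reveal", "tasmania", "exhibit", "unknown"]
print(doc2bow_sketch(tokens, token2id))
# [(0, 2), (1, 1), (2, 1), (3, 1)]
```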

In [0]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [20]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 2310, which we previewed in Step 2
bow_doc_2310 = bow_corpus[document_num]

for i in range(len(bow_doc_2310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_2310[i][0], 
                                                     dictionary[bow_doc_2310[i][0]], 
                                                     bow_doc_2310[i][1]))

Word 524 ("reveal") appears 1 time.
Word 2095 ("tasmania") appears 1 time.
Word 2534 ("exhibit") appears 1 time.
Word 2675 ("sculptur") appears 1 time.


#Step 4: Running LDA using Bag of Words
We are going for **10** topics in the document corpus.

Some of the parameters we will be tweaking are:


* `num_topics` is the number of requested latent topics to be extracted from the training corpus.
* `id2word` is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* `alpha` and `eta` are hyperparameters that affect the sparsity of the document-topic (`theta`) and topic-word (`lambda`) distributions. We will leave both at their default values for now (the default is 1/num_topics).

  * `Alpha` is the prior on the per-document topic distribution.

    * `High alpha`: Every document has a mixture of all topics (documents appear similar to each other).
    * `Low alpha`: Every document has a mixture of very few topics.
  * `Eta` is the prior on the per-topic word distribution.

    * `High eta`: Each topic has a mixture of most words (topics appear similar to each other).
    * `Low eta`: Each topic has a mixture of few words.
* `passes` is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates (5 chunks per pass × 2 passes).
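The effect of `alpha` on sparsity can be checked with a quick simulation that is independent of Gensim. The sketch below draws topic mixtures from a symmetric Dirichlet prior and reports the average share of the single largest topic; the helper names, the two `alpha` values, and the sample size are arbitrary assumptions chosen for illustration.

```python
import random

def sample_dirichlet(rng, alpha, num_topics):
    """Draw a topic mixture from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, num_docs=2000, seed=0):
    """Average, over simulated documents, of the weight of the dominant topic."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(rng, alpha, num_topics)) for _ in range(num_docs)]
    return sum(shares) / num_docs

low = mean_largest_share(alpha=0.1)    # sparse mixtures: one topic tends to dominate
high = mean_largest_share(alpha=10.0)  # dense mixtures: weight is spread across topics
print("low alpha: ", round(low, 2))
print("high alpha:", round(high, 2))
```

With a low `alpha` the dominant topic takes most of each simulated document, while with a high `alpha` every topic receives a share close to 1/num_topics. The same reasoning applies to `eta` and the words within a topic.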

In [0]:
lda_model = gensim.models.LdaModel(bow_corpus, 
                                   num_topics = 10, 
                                   id2word = dictionary,                                    
                                   passes = 50)


In [23]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.024*"test" + 0.017*"south" + 0.014*"inquiri" + 0.014*"law" + 0.013*"crew" + 0.013*"bushfir" + 0.013*"hear" + 0.013*"studi" + 0.012*"research" + 0.011*"suspect"


Topic: 1 
Words: 0.037*"govt" + 0.034*"council" + 0.034*"plan" + 0.033*"urg" + 0.030*"water" + 0.021*"fund" + 0.016*"seek" + 0.015*"group" + 0.015*"push" + 0.014*"help"


Topic: 2 
Words: 0.036*"kill" + 0.026*"iraq" + 0.017*"attack" + 0.016*"howard" + 0.016*"arrest" + 0.016*"talk" + 0.015*"break" + 0.015*"protest" + 0.015*"troop" + 0.014*"nuclear"


Topic: 3 
Words: 0.033*"warn" + 0.023*"miss" + 0.021*"hous" + 0.019*"polic" + 0.016*"search" + 0.015*"rise" + 0.015*"industri" + 0.015*"claim" + 0.014*"centr" + 0.013*"water"


Topic: 4 
Words: 0.024*"chang" + 0.021*"labor" + 0.020*"opposit" + 0.018*"elect" + 0.017*"concern" + 0.017*"green" + 0.016*"price" + 0.016*"call" + 0.015*"govt" + 0.014*"back"


Topic: 5 
Words: 0.064*"polic" + 0.035*"charg" + 0.032*"crash" + 0.030*"court" + 0.028*"face" + 0.023*"investig"

In [10]:
# Upload the pickled 'newsgroups' dataset from the local machine (Colab only)
datasets = files.upload()

In [0]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of at least three characters, remove stop_words, 
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')

# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In [0]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=50, id2word=id_map,
                                           alpha='auto', eval_every=5)  # learn asymmetric alpha from data
