# Generative Models and LDA

![](images/generativemodels1.png)

Generative models for text start with a magic chest. Suppose words come out of this chest magically, and you pick words from it to create your document. As you start pulling out words, you see words like `Harry`, `Potter`, and `is`, and then other words like `movie` and `the`, and so on. Just by looking at the first two words, you already know that this chest gives out words about Harry Potter, so it represents some distribution that favors Harry Potter words. You can then use the words that come out to generate a document, something like "The movie Harry Potter is based on books from J.K. Rowling." So in the generation process, you have a model that gives out words, and you use the words coming from that model to generate the document.
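The chest can be sketched in a few lines of Python. This is only an illustration: the word probabilities in `chest` and the helper `draw_words` are hypothetical, not values or names from the lecture.

```python
import random

# Hypothetical "chest": an illustrative word distribution, not the lecture's numbers.
chest = {"the": 0.3, "harry": 0.2, "potter": 0.2, "movie": 0.15, "is": 0.15}

def draw_words(distribution, n, seed=None):
    """Draw n words independently from a word -> probability mapping."""
    rng = random.Random(seed)
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

# Pull ten words out of the chest to form a (bag-of-words) document.
document = draw_words(chest, n=10, seed=0)
print(" ".join(document))
```

The words come out in no meaningful order, which is the point: this kind of model only captures how often each word appears, not grammar.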

But you could also go the other way. You could start from the document and count how many times the word `the` occurs, or `harry`, or `potter`, and then create a distribution of words, that is, a probability distribution of how likely it is to see the word `harry` or the word `movie` in this document. When you infer this model, you'll notice that the word `the` is the most frequent. Its probability is 0.1, which means one in ten words is `the`. Then you have `is`, then `harry`, `potter`, and so on. Notice that because the documents were about Harry Potter, the model favors the words `harry` and `potter`. It's very unlikely that you would see `harry` and `potter` being this frequent in any other topic model or in any other corpus of documents. So here you had a very simple generative process: you had one topic model, and you pulled out words from that topic model to create your document. That was the generation story.
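Going from a document back to a word distribution amounts to counting and normalizing. The sketch below assumes simple whitespace tokenization and uses a made-up sentence; `estimate_word_distribution` is a hypothetical helper name.

```python
from collections import Counter

def estimate_word_distribution(text):
    """Return word -> relative frequency for a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Illustrative text, not the lecture's corpus.
doc = "the movie harry potter is based on the books"
model = estimate_word_distribution(doc)

# Show the three most probable words under the inferred model.
for word, prob in sorted(model.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(prob, 3))
```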

![](images/generativemodels2.png)

However, the generation story can be much more complex in most cases. Suppose that instead of one topic you have four topic models, four chests. And you have a magic hat that pulls words from these chests, either at random or according to its own policy of choosing one chest over another. The words still come out and you still create a document. The model is now more complex, because generation first decides which chest a word comes out of, and once that choice is made, the word is drawn from that chest's own distribution.

You still create the same kind of document, but when you use documents to infer your models, you need to infer four models. You also need to infer how the words coming from these four chests, these four topics, were combined. So you have to figure out not only the individual topic models, the individual word distributions, but also the mixture: how the four topic models are combined to create one document. This is typically called a **mixture model**. The first model, the one we saw in the previous slide, was a unigram model, where you have one topic distribution and you get all the words from there. Here, the same document is generated by a mixture of four different topics, some represented in a higher proportion than others.

This should remind you of the example we started the topic model discussion with, the "bare necessities" article in Science. There was a topic model for computation, another for genetics, and a third for anatomy that was not represented as well in the document, and so on. The model here is similar.
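A minimal sketch of the hat-and-chests story, assuming only two chests for brevity. The topic names, word probabilities, and the hat's policy in `hat` are all invented for illustration.

```python
import random

# Hypothetical chests (topics) and hat policy; all numbers are illustrative.
chests = {
    "computation": {"model": 0.4, "data": 0.4, "computer": 0.2},
    "genetics": {"gene": 0.5, "dna": 0.3, "genome": 0.2},
}
hat = {"computation": 0.7, "genetics": 0.3}

def pick(distribution, rng):
    """Draw one key from a key -> probability mapping."""
    keys = list(distribution)
    return rng.choices(keys, weights=[distribution[k] for k in keys], k=1)[0]

def generate(n_words, seed=None):
    """For each word, the hat picks a chest, then that chest gives a word."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_words):
        topic = pick(hat, rng)
        out.append((topic, pick(chests[topic], rng)))
    return out

sample = generate(8, seed=1)
print(sample)
```

When inferring the model from real documents, only the words are observed. The chest labels printed here are exactly the hidden information that has to be recovered.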

# Latent Dirichlet Allocation (LDA) - generative probabilistic model
LDA is also a generative model. It creates a document based on some notion of the length of the document, the mixture of topics in that document, and each individual topic's multinomial distribution over words.
* Generative model for a document `d`
    * Choose length of document `d`
    * Choose a mixture of topics for document `d`
    * Use a topic's multinomial distribution to output words to fill that topic's quota
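The three steps above can be sketched as follows. Several choices here are assumptions rather than part of the notes: the length is drawn uniformly as a stand-in, the topic mixture is drawn from a symmetric Dirichlet (built from normalized gamma draws), and the topics themselves are made up. Each word slot picks a topic according to the mixture, so each topic's quota is filled in expectation rather than exactly.

```python
import random

# Hypothetical topics; the words and probabilities are illustrative only.
topics = [
    {"gene": 0.5, "dna": 0.3, "genome": 0.2},
    {"model": 0.4, "data": 0.4, "computer": 0.2},
    {"organ": 0.6, "tissue": 0.4},
]

def sample_dirichlet(alpha, k, rng):
    """Sample k proportions from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, alpha=0.5, min_len=10, max_len=20):
    # Step 1: choose the document length (uniform here, as a stand-in).
    length = rng.randint(min_len, max_len)
    # Step 2: choose a mixture of topics for this document.
    mixture = sample_dirichlet(alpha, len(topics), rng)
    # Step 3: for each slot, pick a topic by the mixture, then a word from that topic.
    words = []
    for _ in range(length):
        t = rng.choices(range(len(topics)), weights=mixture, k=1)[0]
        vocab = list(topics[t])
        words.append(rng.choices(vocab, weights=[topics[t][w] for w in vocab], k=1)[0])
    return mixture, words

mixture, words = generate_document(random.Random(0))
print([round(p, 2) for p in mixture])
print(" ".join(words))
```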

## Topic modeling in practice
* How many topics?
    * Finding or even guessing the number of topics is hard
* Interpreting topics
    * Topics are just word distributions
    * Making sense of words / generating labels is subjective

The probabilistic topic model estimated by LDA consists of two tables (matrices). The first table describes the probability of selecting a particular word when sampling from a particular topic. The second table describes the probability of selecting a particular topic when sampling from a particular document.
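To make the two tables concrete, here is a small sketch with invented numbers. It also shows how the tables combine: the probability of a word in a document is the sum, over topics, of the topic's weight in that document times the word's probability under that topic. The names `topic_word`, `doc_topic`, and `word_prob` are hypothetical.

```python
# Hypothetical tables; every number here is illustrative.
# Table 1: P(word | topic), one row per topic.
topic_word = {
    "genetics": {"gene": 0.6, "data": 0.1, "model": 0.3},
    "computation": {"gene": 0.1, "data": 0.5, "model": 0.4},
}
# Table 2: P(topic | document), one row per document.
doc_topic = {
    "doc1": {"genetics": 0.8, "computation": 0.2},
    "doc2": {"genetics": 0.1, "computation": 0.9},
}

def word_prob(word, doc):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topic[doc].items()
    )

print(round(word_prob("gene", "doc1"), 3))
print(round(word_prob("gene", "doc2"), 3))
```

Each row of both tables sums to one, so the combined word probabilities for a document sum to one as well.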

In practice, the first question to ask when you create a model such as LDA is how many topics you want. There is no good answer. Finding or even guessing that number is very hard, so you make a choice based on a guess of how distinct the topics could be. It helps if you are in a domain where you know the topics reasonably well. For example, suppose all your documents are medical, and you know they come from radiology, pathology, urology, and other such streams. Then you might say, okay, I'm interested in these seven streams of medicine, and those are my topics. There, at least, you have some sense of how many topics there should be.

The other big problem is interpreting the topics. You get topics, but topics are just word distributions. They tell you which words are more probable under a particular topic and which ones are less probable. Making sense of that, or generating a coherent label for the topic, is a subjective decision. There has been some work on generating names for topics, but most likely, whenever you see a name in a topic model, it was assigned manually. People look at words like genetics and genes and say that this is a genetics topic, or they see computation, model, data, and information and say it has something to do with computation, computer science, or informatics. So those names are fairly subjective. The actual topics that you learn from LDA, though, are the solution to an optimization problem.

## Topic Modeling: Summary
* Great tool for exploratory text analysis
    * What are the documents (tweets, reviews, news, articles) about?

## Working with LDA in Python
* Many packages are available, such as gensim and lda
* Preprocessing text
    * Tokenize and normalize (lowercase), e.g. `text11.split(' ')`, `nltk.word_tokenize(text11)`, or `sentences = nltk.sent_tokenize(text12)` for sentence splitting
    * Stop word removal

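```python
from nltk.corpus import stopwords

# Assumes word_tokens is a list of tokens and the NLTK 'stopwords' corpus is downloaded.
stop_words = set(stopwords.words('english'))

# List-comprehension form
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
```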

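* Stemming and lemmatization

```python
import nltk

# Assumes words1 and udhr are lists of tokens defined earlier.
porter = nltk.PorterStemmer()
stems = [porter.stem(t) for t in words1]

# Lemmatization needs the NLTK 'wordnet' data to be downloaded.
WNlemma = nltk.WordNetLemmatizer()
lemmas = [WNlemma.lemmatize(t) for t in udhr[:20]]
```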

* Convert tokenized documents to a document-term matrix
* Build LDA models on the doc-term matrix
* `doc_set`: set of pre-processed text documents
    1. Create a dictionary
    2. Create a corpus, that is, a doc-term matrix
    3. Pass the corpus to the `LdaModel` call from `gensim.models`, where you also specify the number of topics you want to learn (four in this case) and the `id2word` mapping, which is the dictionary created in step 1
* `LdaModel` can also be used to find the topic distribution of documents
* Feature extraction

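```python
import gensim
from gensim import corpora, models

# doc_set is assumed to be a list of tokenized, pre-processed documents.
# Step 1: map each unique token to an integer id.
dictionary = corpora.Dictionary(doc_set)
# Step 2: represent each document as a bag of words, a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in doc_set]
# Step 3: fit LDA with four topics, making 50 passes over the corpus.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
print(ldamodel.print_topics(num_topics=4, num_words=5))
```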
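To see what the dictionary and the bag-of-words corpus hold, here is a pure-Python sketch that mimics their shape. It is not gensim's implementation, and gensim's own id assignment may differ. The helpers `build_dictionary` and `to_bow` and the tiny `doc_set` are hypothetical.

```python
from collections import Counter

# Hypothetical pre-processed documents standing in for doc_set.
doc_set = [
    ["harry", "potter", "movie"],
    ["gene", "dna", "gene"],
]

def build_dictionary(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_dictionary(doc_set)
corpus = [to_bow(doc, token2id) for doc in doc_set]
print(token2id)
print(corpus)
```

Each document becomes a sparse row of the doc-term matrix: only the tokens that occur are listed, each with its count.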