!! Intuition : Documents as a mixture of topics

# Topic Modeling

## 1. What is Topic Modeling?
-> A corpus-level analysis of what's in a text collection.
<br>-> What kind of document is this?

- Topic : the subject (theme) of a discourse.
    - ex) human, genome, genetic, disease, computer science, news, ...
- Topics are represented as a word distribution.
- A document is assumed to be a mixture of topics.
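
The two statements above (a topic is a word distribution, a document is a mixture of topics) can be sketched numerically. This is only an illustration: the topic names, vocabulary, and every probability below are made-up assumptions, not values from a fitted model.

```python
# Hypothetical toy topics; each one is a probability distribution over words.
topics = {
    "genetics": {"genome": 0.4, "gene": 0.3, "disease": 0.2, "data": 0.1},
    "computing": {"computer": 0.4, "data": 0.3, "software": 0.2, "gene": 0.1},
}

# Hypothetical topic mixture for one document: 75% genetics, 25% computing.
doc_mixture = {"genetics": 0.75, "computing": 0.25}


def document_word_distribution(mixture, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    word_probs = {}
    for topic, weight in mixture.items():
        for word, prob in topics[topic].items():
            word_probs[word] = word_probs.get(word, 0.0) + weight * prob
    return word_probs


doc_words = document_word_distribution(doc_mixture, topics)
for word, prob in sorted(doc_words.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.3f}")
```

Words shared by both topics (here "gene" and "data") receive probability from each topic, weighted by how much of the document that topic accounts for.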

-> What's known?
    - The text collection or corpus
    - Number of topics

-> What's not known?
    - The actual topics
    - Topic distribution for each document

-> Essentially, it is a **text clustering** problem

-> Different topic modeling approaches are available
    - Probabilistic Latent Semantic Analysis (PLSA)
    - Latent Dirichlet Allocation (LDA)

### 1) Generative Models for Text

**1) Coming from 1 chest**
<br>['Harry', 'Potter', 'movie', 'the', 'is'] -> Generation -> Document

(Now the Inverse)

Document -> Inference, Estimation -> Distribution of words from 1 topic
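
Both directions of the single-chest picture can be sketched with the standard library alone. The word probabilities below are illustrative assumptions; the "inference" step here is the simple maximum-likelihood estimate (relative word frequencies), which is only adequate for this one-topic case.

```python
import random
from collections import Counter

# Hypothetical single "chest": one topic's word distribution (made-up values).
topic = {"Harry": 0.25, "Potter": 0.25, "movie": 0.2, "the": 0.2, "is": 0.1}


def generate_document(topic, length, rng):
    """Generation: draw `length` words independently from the topic."""
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


def estimate_topic(document):
    """Inference: estimate the word distribution as relative frequencies."""
    counts = Counter(document)
    total = len(document)
    return {word: count / total for word, count in counts.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document(topic, length=5000, rng=rng)
estimated = estimate_topic(document)
for word in topic:
    print(f"{word}: true={topic[word]:.2f} estimated={estimated.get(word, 0.0):.2f}")
```

With enough generated words, the estimated frequencies approach the distribution that produced them.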

**2) Mixture chest (model)**
<br>Document -> Inference, Estimation -> Coming from different topics

### 2) Latent Dirichlet Allocation (LDA, generative model)

1. Generative model for a document d
    - Choose length of document d
    - Choose a mixture of topics for document d
    - For each topic in the mixture, use that topic's multinomial word distribution to generate words until the topic's share (quota) of the document is filled
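
The generative steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference algorithm: the topics and their probabilities are made up, the document length is drawn uniformly (Blei et al. use a Poisson draw), and each word slot picks its topic at random from the mixture, so each topic's quota is met only in expectation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document from hypothetical, already-known topics."""
    topic_names = list(topics)
    # Choose the document's mixture of topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick which topic this word slot belongs to ...
        name = rng.choices(topic_names, weights=theta, k=1)[0]
        # ... then draw the word from that topic's multinomial distribution.
        vocab = list(topics[name])
        weights = [topics[name][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words


# Hypothetical topics (made-up probabilities).
topics = {
    "genetics": {"genome": 0.5, "gene": 0.3, "disease": 0.2},
    "computing": {"computer": 0.5, "software": 0.3, "data": 0.2},
}

rng = random.Random(42)
length = rng.randint(8, 12)  # choose the document length (uniform is an assumption)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=length, rng=rng)
print(dict(zip(topics, (round(t, 2) for t in theta))))
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and each document's mixture.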

## 2. Topic Modeling Issues

1. How many topics?
    - Finding, or even guessing, the right number of topics is very hard.
2. Interpreting topics
    - Topics are just word distributions
    - Making sense of words / generating labels is subjective

<br><br>
To summarize,

1. Great tool for exploratory text analysis
    - What are the documents (tweets, reviews, news articles) about?
    - Many tools available in Python

## 3. Working with LDA in Python

1. Simple steps
    - Many packages available, such as gensim, lda
    - Pre-processing text
        - Tokenize, normalize (lowercase)
        - Stop word removal
        - Stemming
    - Convert tokenized documents to a document-term matrix
    - Build LDA models on the doc-term matrix
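
The pre-processing and document-term matrix steps can be sketched with the standard library. Everything here is an illustrative assumption: the stop list is tiny, `crude_stem` is a hypothetical stand-in for a real stemmer (such as a Porter stemmer), and the dense matrix is only practical for toy corpora. In gensim, `corpora.Dictionary` and `doc2bow` produce a sparse equivalent.

```python
import re
from collections import Counter

# Tiny illustrative stop list; real pipelines use a much fuller one.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to", "are"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, lowercase, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


raw_docs = [
    "The genome is a map of genes.",
    "Computers are processing the genome data.",
]
doc_set = [preprocess(doc) for doc in raw_docs]

# Document-term matrix: one row per document, one column per vocabulary term.
vocabulary = sorted({token for doc in doc_set for token in doc})
doc_term_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in doc_set]

print(doc_set)
print(vocabulary)
print(doc_term_matrix)
```

The resulting `doc_set` (a list of token lists) has the shape that a tokenized corpus is expected to have before building a dictionary and bag-of-words corpus.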




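```python
# doc_set: list of pre-processed documents, each a list of tokens

from gensim import corpora, models

# Map each unique token to an integer id
dictionary = corpora.Dictionary(doc_set)
# Convert each document to a bag-of-words vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in doc_set]
# Fit an LDA model with 4 topics, making 50 passes over the corpus
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=4,
                                    id2word=dictionary, passes=50)
# Show the 5 highest-probability words for each of the 4 topics
print(ldamodel.print_topics(num_topics=4, num_words=5))
```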

The primary reference for Latent Dirichlet Allocation (LDA) is the following. The first five pages (pp. 993-997) describe the model and the plate notation.

David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 3(Jan):993-1022, 2003. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Also see the following Wikipedia pages:

- Latent Dirichlet Allocation: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
- Description of the plate notation: https://en.wikipedia.org/wiki/Plate_notation

WordNet-based similarity measures in NLTK: http://www.nltk.org/howto/wordnet.html