https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

# Generative Process

- Each document can be described by a distribution of topics 
- Each topic can be described by a distribution of words

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Latent_Dirichlet_allocation.svg/250px-Latent_Dirichlet_allocation.svg.png" width="300">


- $\eta$ determines $\beta$: the model parameter $\eta$ governs the topic-word distribution P(word|topic), which is the distribution $\beta$

- $\alpha$ determines $\theta$: the model parameter $\alpha$ governs the document-topic distribution P(topic|document), which is the distribution $\theta$

- Given document $d$, sample a topic $z$ from its $\theta$, look up the corresponding $\beta_z$, and sample a word $w$ from $\beta_z$. Repeat for every word position in the document.
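The sampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vocabulary, the two topics, the concentration values, and the helper names `sample_dirichlet` and `generate_document` are all made up here, and the Dirichlet draw uses normalized Gamma variates from the standard library (`numpy.random.dirichlet` would be the usual shortcut).

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, alpha, beta, n_words, rng):
    """Generate one document, given the topic-word distributions beta."""
    theta = sample_dirichlet(alpha, rng)             # P(topic | document)
    topic_ids = range(len(beta))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]    # sample topic z from theta
        w = rng.choices(vocab, weights=beta[z])[0]      # sample word w from beta_z
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["ball", "game", "team", "vote", "law", "court"]  # toy vocabulary (assumption)
alpha = [0.5, 0.5]             # one concentration value per topic (assumption)
eta = [0.5] * len(vocab)       # one concentration value per word (assumption)

# One word distribution per topic: P(word | topic)
beta = [sample_dirichlet(eta, rng) for _ in alpha]

theta, doc = generate_document(vocab, alpha, beta, n_words=8, rng=rng)
print("theta:", theta)
print("document:", " ".join(doc))
```

Fitting LDA runs this process in reverse: only the words are observed, and $\theta$, $\beta$, and the topic assignments $z$ are inferred.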

# Dirichlet distribution

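- $\alpha$: concentration parameter of the document-topic prior; the resulting $\theta$ has shape (Number of Documents $\times$ Number of Topics)
- $\eta$: concentration parameter of the topic-word prior; the resulting $\beta$ has shape (Number of Topics $\times$ Number of Words)
- $\theta$, $\beta$: drawn from Dirichlet distributions parameterized by $\alpha$ and $\eta$, respectively
- Advantage: with small concentration parameters (below 1) the samples are sparse, so a document tends to concentrate on a few topics and a word is likely to belong to a single topic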

<img src="https://cdn-images-1.medium.com/max/880/1*3oOHy1tUfUT9Z379Alb9nA.png" width="400">
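The effect of the concentration parameter can be checked numerically. The sketch below is an illustration, not part of the original notebook: the helper names, the 10 topics, and the values 0.1 and 10.0 are arbitrary assumptions. It draws many samples from a symmetric Dirichlet and reports the average size of the largest component; a value near 1 means most of the probability mass sits on a single topic.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(alpha_value, n_topics=10, n_samples=2000, seed=0):
    """Average of max(theta) over many symmetric Dirichlet(alpha_value) samples."""
    rng = random.Random(seed)
    concentration = [alpha_value] * n_topics
    largest = (max(sample_dirichlet(concentration, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples

sparse = mean_largest_component(0.1)    # small alpha: mass piles onto one topic
dense = mean_largest_component(10.0)    # large alpha: mass spread almost evenly
print(f"alpha=0.1  -> mean largest component: {sparse:.2f}")
print(f"alpha=10.0 -> mean largest component: {dense:.2f}")
```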

# Implementation with scikit-learn

In [8]:
from nltk.corpus import brown
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

## Import Data

In [17]:
data = []
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)

NO_DOCUMENTS = len(data)

In [10]:
print(data[0][:100])

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produ


## Tokenize/Clean/Vectorize

In [11]:
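# Keep tokens of 3+ letters/hyphens that occur in at least 5 documents
# and in at most 90% of documents; the raw string avoids an invalid '\-' escape.
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')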
data_vectorized = vectorizer.fit_transform(data) 

In [19]:
print(len(vectorizer.vocabulary_), NO_DOCUMENTS, data_vectorized.shape)

10625 500 (500, 10625)


## Train Model

In [20]:
NUM_TOPICS = 10
# `n_topics` was renamed to `n_components` in scikit-learn
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NUM_TOPICS)



(500, 10)


## View results

Topic distribution of the first document over the 10 topics (the values sum to 1)

In [22]:
print(lda_Z[0])

[5.33720693e-02 1.05608769e-04 1.05615909e-04 1.05615847e-04
 1.05596981e-04 7.01004075e-01 1.05607427e-04 1.05614813e-04
 2.44884580e-01 1.05615899e-04]


Weights of the first topic over the vocabulary (`components_` holds unnormalized pseudo-counts; dividing a row by its sum gives P(word|topic))

In [31]:
lda_model.components_[0].shape

(10625,)
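A topic row is easier to read as a list of its highest-weighted words. The helper below is a minimal sketch; `top_words` is a hypothetical name, and the toy `components` and `feature_names` are stand-ins for `lda_model.components_` and the vectorizer vocabulary so that the block runs on its own.

```python
def top_words(topic_weights, feature_names, n=5):
    """Return the n highest-weighted words of one topic with normalized weights."""
    total = float(sum(topic_weights))
    # Indices of the vocabulary sorted by descending weight
    ranked = sorted(range(len(topic_weights)),
                    key=lambda j: topic_weights[j], reverse=True)
    return [(feature_names[j], topic_weights[j] / total) for j in ranked[:n]]

# Toy stand-ins (assumptions): 2 topics over a 3-word vocabulary
components = [[8.0, 1.0, 1.0],
              [0.5, 0.5, 4.0]]
feature_names = ["jury", "music", "game"]

for k, row in enumerate(components):
    print(k, top_words(row, feature_names, n=2))
```

In the notebook, the same function could be called as `top_words(lda_model.components_[k], vectorizer.get_feature_names_out())`; `get_feature_names_out()` is available in scikit-learn 1.0 and later, while older versions use `get_feature_names()`.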