# Topic Modeling using Latent Dirichlet Allocation

### Prerequisites:

- Natural Language Processing Fundamentals in Python

- Things to be familiar with: 
    - tokenization
    - stopwords
    - term frequency
    - Bag-Of-Words representation

### What we'll discuss:

- What is topic modeling?

- How does Latent Dirichlet Allocation (LDA) work?

- How to train and use LDA with sklearn?

- How to train and use LDA with gensim?

## What is topic modeling? 

- **topic**: a collection of related words

- a document can be composed of several topics

### Given a collection of documents, we can ask:

- What words make up each topic?

- What topics make up each document?

<img src="http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg">

Image credit: David Blei

### First, a simple example:

In [1]:
import numpy as np # we'll want this later

vocab = ['baseball','cat','dog','pet','played','tennis']

V = len(vocab) # size of vocabulary

In [2]:
K = 2 # number of topics

In [3]:
# the probability of each term given topic 1
topic_1 = [.33,   0,   0,   0, .33, .33]

In [4]:
# the probability of each term given topic 2
topic_2 = [  0, .25, .25, .25, .25,   0]

In [5]:
# per topic word distributions
phi = [topic_1, topic_2]

In [6]:
print(np.array(phi).shape) # K x V (number of topics x size of vocabulary)

(2, 6)
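
Each row of phi is meant to be a probability distribution over the vocabulary, so its entries should sum to roughly 1. A quick check, assuming the toy values above (the names `phi_arr` and `phi_normalized` are ours); note that the rounded `.33` values make topic 1 sum to 0.99 rather than exactly 1:

```python
import numpy as np

# same toy values as phi above: one row per topic, one column per vocabulary term
phi_arr = np.array([[.33,   0,   0,   0, .33, .33],
                    [  0, .25, .25, .25, .25,   0]])

row_sums = phi_arr.sum(axis=1)   # total probability mass per topic
print(row_sums)

# optional: rescale each row so it sums to exactly 1
phi_normalized = phi_arr / row_sums[:, np.newaxis]
print(phi_normalized.sum(axis=1))
```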


### If we had some documents, what topics make up each document?

In [7]:
corpus = ['the dog and cat played tennis',
          'tennis and baseball are sports',
          'a dog or a cat can be a pet']

# recall
vocab = ['baseball','cat','dog','pet','played','tennis']

phi = [[.33,   0,   0,   0, .33, .33],
       [  0, .25, .25, .25, .25,   0]]

In [8]:
# per document topic distributions
theta = [[.50, .50],
         [.99, .01],
         [.01, .99]]

In [9]:
print(np.array(theta).shape) # M x K (number of documents x number of topics)

(3, 2)
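
Taken together, theta and phi imply a word distribution for every document: the probability of a term in a document is the topic-weighted average of that term's probability under each topic. A minimal sketch with the toy values above (the name `doc_word_probs` is ours):

```python
import numpy as np

phi_arr = np.array([[.33,   0,   0,   0, .33, .33],
                    [  0, .25, .25, .25, .25,   0]])   # K x V
theta_arr = np.array([[.50, .50],
                      [.99, .01],
                      [.01, .99]])                     # M x K

# (M x K) @ (K x V) -> M x V: probability of each term in each document
doc_word_probs = theta_arr @ phi_arr
print(doc_word_probs.shape)
print(np.round(doc_word_probs, 3))
```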


### We can even generate a document

In [10]:
np.random.seed(123) # for demo purposes

N = 6 # number of tokens in document

In [11]:
new_theta = [.6,.4] # draw a topic distribution (theta die)

In [12]:
new_doc = []
for i in range(N):
    z = np.argmax(np.random.multinomial(1, new_theta)) # get a topic
    
    idx = np.argmax(np.random.multinomial(1,phi[z]))   
    x = vocab[idx]                                     # get a term
    
    new_doc.append(x)                                  # add to document

In [13]:
' '.join(new_doc)

'pet baseball pet tennis played played'
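
The same loop can be wrapped in a reusable function. This is only a sketch: the function name `generate_document` is ours, it uses NumPy's `default_rng` and `choice` instead of `multinomial` plus `argmax`, and topic 1 is written with exact thirds because `choice` requires probabilities that sum to 1. It draws from a different random stream, so it will not reproduce the document above.

```python
import numpy as np

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
phi_exact = np.array([[1/3,   0,   0,   0, 1/3, 1/3],   # topic 1, exact thirds
                      [  0, .25, .25, .25, .25,   0]])  # topic 2

def generate_document(theta, phi, vocab, n_tokens, rng):
    """Sample n_tokens terms: pick a topic from theta, then a term from that topic's row of phi."""
    doc = []
    for _ in range(n_tokens):
        z = rng.choice(len(theta), p=theta)      # roll the topic die
        idx = rng.choice(len(vocab), p=phi[z])   # roll that topic's word die
        doc.append(vocab[idx])
    return doc

rng = np.random.default_rng(123)
sampled_doc = generate_document([.6, .4], phi_exact, vocab, n_tokens=6, rng=rng)
print(' '.join(sampled_doc))
```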

### NOTE: But usually, we don't know the theta or phi!  
### We need to learn these from a set of documents (corpus)!

### Uses for $\phi$ (phi), the per topic word distributions:

- inferring labels for topics
- word clouds
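
To label a topic, a common first step is to look at its most probable terms. A small sketch with the toy phi from earlier (the name `top_words` and the choice of three terms are ours; ties are broken by a stable sort, so the order among equally likely terms is arbitrary):

```python
import numpy as np

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
phi_arr = np.array([[.33,   0,   0,   0, .33, .33],
                    [  0, .25, .25, .25, .25,   0]])

n_top = 3
top_words = {}
for k, topic in enumerate(phi_arr):
    # indices of the n_top largest probabilities, most probable first
    top_idx = np.argsort(topic, kind='stable')[::-1][:n_top]
    top_words[k] = [vocab[i] for i in top_idx]
    print(k, top_words[k])
```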

### Uses for $\theta$ (theta), the per document topic weights:

- dimensionality reduction
- clustering
- similarity
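
As one illustration of the similarity use case, documents can be compared through their topic weights instead of their raw words. A minimal sketch using cosine similarity on the toy theta from earlier (cosine similarity is our choice here; other measures would work too):

```python
import numpy as np

theta_arr = np.array([[.50, .50],
                      [.99, .01],
                      [.01, .99]])

# scale each row to unit length, then take pairwise dot products
unit = theta_arr / np.linalg.norm(theta_arr, axis=1, keepdims=True)
similarity = unit @ unit.T   # M x M matrix of cosine similarities
print(np.round(similarity, 2))
```

Documents 1 and 2 load on different topics, so they come out far less similar to each other than either is to the mixed document 0.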

### How do we learn phi ($\phi$) and theta ($\theta$)?

### Latent Dirichlet Allocation (LDA)

 - generative statistical model
 - *Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (Jan 2003)*
 

### Dirichlet Distribution

- Conjugate prior to the Multinomial Distribution

- Multinomial is like a "die"

- Dirichlet is like a "die factory"
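
The die and die factory analogy can be tried directly in NumPy. In this sketch the concentration parameters `alpha_demo` and the seed are arbitrary values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_demo = [0.5, 0.5, 0.5]       # hypothetical parameters for a 3-sided die factory

die = rng.dirichlet(alpha_demo)    # the factory produces one die (a probability vector)
print(die, die.sum())

rolls = rng.multinomial(10, die)   # roll that die 10 times: counts per side
print(rolls)
```

Each call to `dirichlet` returns a different die; smaller parameter values tend to produce more lopsided dice.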

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

```
K     # number of topics

phi   # per topic word distributions

beta  # parameters for word distribution die factory, length = V
```

```
M     # number of documents
N     # number of words/tokens in each document

theta # per document topic distributions

alpha # parameters for topic die factory, length = K
```

```
z     # topic indexes
```

```
Dirichlet   # dirichlet distribution (aka die factory)
```

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" style="width: 30%">

```
phi = []  # word distribution die, 1 per topic

# pseudocode to generate topic word distributions
for k in range(K):
    phi.append(Dirichlet(beta,V).get_die())  # generate word distribution die
```

```
corpus = []

# pseudocode to generate corpus
for m in range(M):
    document_m = []
    
    theta_m = Dirichlet(alpha,K).get_die()   # generate a topic die
    
    for n in range(N):
        z_mn = theta_m.get_topic()     # roll topic die
        w_mn = phi[z_mn].get_word()    # roll word distribution die
        
        document_m.append(w_mn)
    
    corpus.append(document_m)
```
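
The pseudocode above can be turned into a runnable sketch with NumPy. The vocabulary is the toy one from earlier; the values of K, M, N, `alpha_vec`, `beta_vec` and the seed are assumptions chosen only to keep the example small, and every document is given the same length N for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['baseball', 'cat', 'dog', 'pet', 'played', 'tennis']
V = len(vocab)
K, M, N = 2, 3, 6                 # topics, documents, tokens per document (toy values)

beta_vec = np.full(V, 0.5)        # parameters for the word distribution die factory
alpha_vec = np.full(K, 0.5)       # parameters for the topic die factory

# one word distribution die per topic
phi_gen = rng.dirichlet(beta_vec, size=K)   # K x V

generated_corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha_vec)              # generate a topic die for this document
    z_m = rng.choice(K, size=N, p=theta_m)          # roll the topic die N times
    document_m = [vocab[rng.choice(V, p=phi_gen[z])] for z in z_m]  # roll each chosen word die
    generated_corpus.append(document_m)

for document_m in generated_corpus:
    print(' '.join(document_m))
```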

## Review

### Things we know: 

 - M : the number of documents
 - N : the length of each document
 

### Things we choose:

 - K : the number of topics
 - V : the size of our vocabulary

### Things we want to learn: 

 - $\theta$'s (theta's) : the per document topic weights
 - $\phi$'s (phi's) : the per topic word weights

#### Note:

We may want to infer $\alpha$ and $\beta$ as well

## Example using sklearn

In [14]:
import warnings # to deal with deprecation warnings

from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
X = newsgroups.data
len(X)

11314

In [15]:
# example document
X[4].replace('\n',' ')[:200]

'From: jcm@head-cfa.harvard.edu (Jonathan McDowell) Subject: Re: Shuttle Launch Question Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA Distribution: sci Lines: 23  From artic'

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=50, stop_words='english')

In [17]:
# transform our documents (this might take a moment)
X_tfidf = tfidf.fit_transform(X)
X_tfidf.shape

(11314, 4175)

In [18]:
# this is our vocabulary (the column names of our dataset)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out().tolist()

In [19]:
print(feature_names[:10])
print(feature_names[-10:])

['00', '000', '01', '02', '03', '04', '05', '06', '07', '08']
['ysu', 'za', 'zealand', 'zero', 'zeus', 'zip', 'zone', 'zoo', 'zuma', 'zx']


In [20]:
from sklearn.decomposition import LatentDirichletAllocation

warnings.simplefilter(action='ignore', category=DeprecationWarning) # to remove warning

In [21]:
# create model with 20 topics
lda = LatentDirichletAllocation(n_components=20,  # the number of topics
                                n_jobs=-1,        # use all cpus
                                random_state=123) # for reproducibility

In [22]:
# learn phi and theta (lda.components_ and X_lda)
# this will take a while!
X_lda = lda.fit_transform(X_tfidf)

In [44]:
X_lda[100] # lda representation of document_100

array([0.00684573, 0.00684573, 0.36756156, 0.00684573, 0.00684573,
       0.00684573, 0.00684573, 0.00684573, 0.00684573, 0.00684573,
       0.00684573, 0.00684573, 0.00684573, 0.00684573, 0.00684573,
       0.00684573, 0.00684573, 0.23407952, 0.00684573, 0.28198151])

In [45]:
np.argsort(X_lda[100])[::-1][:3] # the top topics of document_100

array([ 2, 19, 17])

In [23]:
# a utility function to print out the most likely terms for each topic
# taken from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic {:#2d}: ".format(topic_idx)
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

In [24]:
print_top_words(lda,feature_names,5)

Topic  0: team game hockey players games
Topic  1: uiuc cso urbana illinois uxa
Topic  2: windows file window dos thanks
Topic  3: henry alaska toronto zoo aurora
Topic  5: stratus cramer optilink sw clayton
Topic  6: cmu andrew ti dseg mellon
Topic  7: israel israeli turkish jews armenian
Topic  8: satellite com mounted corp operator
Topic  9: nhl cold rsa main hair
Topic 10: rit isc rochester scsi genocide
Topic 11: umn minnesota card ati plus
Topic 12: key clipper encryption chip access
Topic 13: pitt gordon geb banks cs
Topic 14: buffalo detroit collins hewlett packard
Topic 15: cwru cleveland freenet reserve ins
Topic 16: double board file designs points
Topic 17: drive scsi sale mac controller
Topic 18: dartmouth nh macs nuclear adding
Topic 19: edu com writes article subject
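
In scikit-learn, `components_` holds unnormalized pseudo-counts rather than probabilities, so each row has to be normalized before it can be read as phi. The sketch below uses a small made-up array as a stand-in for `lda.components_`; with the fitted model you would pass `lda.components_` instead:

```python
import numpy as np

# hypothetical stand-in for lda.components_ (K topics x V terms)
components = np.array([[5.0, 1.0, 4.0],
                       [0.5, 2.0, 2.5]])

# divide each row by its total so every topic becomes a distribution over terms
phi_hat = components / components.sum(axis=1)[:, np.newaxis]
print(phi_hat)
print(phi_hat.sum(axis=1))
```

The ranking of terms within a topic is unchanged by this normalization, which is why `print_top_words` works on the raw values.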


## Example using gensim

In [25]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
corpus_fname = '../../../scikit_learn_data/20news-bydate_py3.data.txt'
with open(corpus_fname,'w') as f:
    for doc in newsgroups.data[:1000]:
        f.write(doc.replace('\n',' ') + '\n')

In [26]:
newsgroups.data[4].replace('\n',' ')[:200]

'From: jcm@head-cfa.harvard.edu (Jonathan McDowell) Subject: Re: Shuttle Launch Question Organization: Smithsonian Astrophysical Observatory, Cambridge, MA,  USA Distribution: sci Lines: 23  From artic'

In [27]:
from gensim.corpora import TextCorpus

In [28]:
%time corpus = TextCorpus(input=corpus_fname)

CPU times: user 1.24 s, sys: 0 ns, total: 1.24 s
Wall time: 1.23 s


In [29]:
corpus.length # M

1000

In [30]:
len(corpus.dictionary) # V

24635

In [31]:
from gensim.models.ldamodel import LdaModel

In [32]:
%%time 

K = 20

lda = LdaModel(corpus=corpus,
               id2word=corpus.dictionary,
               num_topics=K,
               passes=2, chunksize=100)

CPU times: user 12.6 s, sys: 120 ms, total: 12.7 s
Wall time: 6.45 s


### What words make up each topic?

In [33]:
lda.show_topic(15) # phi

[('writes', 0.01506076),
 ('edu', 0.012521725),
 ('article', 0.008918398),
 ('internet', 0.008860075),
 ('com', 0.008342215),
 ('isaiah', 0.007613711),
 ('day', 0.0074239774),
 ('organization', 0.007343069),
 ('uiuc', 0.0072105792),
 ('lines', 0.007118412)]

### What topics make up each document?

In [34]:
text = next(corpus.sample_texts(1))

In [35]:
lda[corpus.dictionary.doc2bow(text)] # theta

[(0, 0.024533147),
 (6, 0.033794474),
 (7, 0.024932316),
 (8, 0.07899931),
 (9, 0.019313833),
 (11, 0.014760868),
 (12, 0.058210887),
 (14, 0.535788),
 (16, 0.16075274),
 (17, 0.022140486),
 (19, 0.017233592)]
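
gensim returns theta as a sparse list of `(topic, probability)` pairs and leaves out topics below a probability threshold, which is why only 11 of the 20 topics appear above. A sketch that expands the printed output into a dense vector of length K (the values are copied from the output above; the name `theta_doc` is ours):

```python
import numpy as np

K = 20

# (topic, probability) pairs copied from the output above
doc_topics = [(0, 0.024533147), (6, 0.033794474), (7, 0.024932316),
              (8, 0.07899931), (9, 0.019313833), (11, 0.014760868),
              (12, 0.058210887), (14, 0.535788), (16, 0.16075274),
              (17, 0.022140486), (19, 0.017233592)]

theta_doc = np.zeros(K)               # omitted topics stay at 0
for topic_id, prob in doc_topics:
    theta_doc[topic_id] = prob

print(int(np.argmax(theta_doc)))      # index of the dominant topic
print(theta_doc.sum())                # slightly below 1 because small topics were dropped
```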

### Topics covered:

- What is topic modeling?

- How does Latent Dirichlet Allocation (LDA) work?

- How to train and use LDA with sklearn?

- How to train and use LDA with gensim?

## Thank you!