# Topic Modeling with gensim
We'll try out [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) on the [20 Newsgroups dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) with some simple preprocessing.

#### Install gensim

In [1]:
# !conda install gensim -y

##### Imports

In [2]:
# gensim
from gensim import corpora, models, similarities, matutils

# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer

# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Let's retain only a subset of the 20 categories in the original 20 Newsgroups Dataset.

In [3]:
# Set categories
categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball', 
              'rec.motorcycles', 'sci.space', 'talk.politics.mideast']

# Download the training subset of the 20 NG dataset, with headers, footers, quotes removed
# Only keep docs from the 6 categories above
ng_train = datasets.fetch_20newsgroups(subset='train', categories=categories, 
                                      remove=('headers', 'footers', 'quotes'))

In [4]:
# Take a look at the first doc
ng_train.data[0]

'Well, the Red Sox have apparenly resigned Herm Winningham to a AAA contract.\nTed "Larry" Simmons signed him to a AAA contract then released him from\nBuffalo, allowing Lou "Curly" Gorman to circumvent the rule about not\nresigning free agents until May 1. Clearly, neither of these guys is bright\nenough to be Moe.\n\n Mike Jones | AIX High-End Development | mjones@donald.aix.kingston.ibm.com'

## Document Preprocessing
We'll need to generate a term-document matrix of word (token) counts for use in LDA.

We'll use `sklearn`'s `CountVectorizer` to generate our term-document matrix of counts. A few of its parameters handle all of the text preprocessing for us:
* `analyzer='word'` (the default): Tokenize by word
* `ngram_range=(1, 2)`: Keep all unigrams and bigrams
* `stop_words='english'`: Remove all English stop words
* `token_pattern="\\b[a-z][a-z]+\\b"`: Keep only tokens made up of 2 or more alphabetic characters (the text is lowercased first)
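
To see what these settings do to a single sentence, here is a rough plain-Python imitation of the pipeline: lowercase, extract tokens with the same regular expression, drop stop words, then build bigrams from what remains. The three-word stop list and the `analyze` helper are stand-ins invented for this sketch; `CountVectorizer` uses its own, much longer English list.

```python
import re

# Same pattern as the one passed to CountVectorizer: 2+ lowercase letters
TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")

# Hypothetical, tiny stop list used only for this illustration
STOP_WORDS = {"the", "him", "in"}

def analyze(text):
    """Return unigrams followed by bigrams, mimicking ngram_range=(1, 2)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

terms = analyze("I don't think the Red Sox resigned him in 1993.")
print(terms)
```

Note how "don't" is reduced to "don" (the single letter "t" fails the two-character minimum) and how the number is dropped entirely. This is one reason a token like "don" can show up among the top words of a topic.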

In [5]:
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(ngram_range=(1, 2),  
                                   stop_words='english', token_pattern="\\b[a-z][a-z]+\\b")

count_vectorizer.fit(ng_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
                tokenizer=None, vocabulary=None)

In [6]:
# Create the term-document matrix
# Transpose it so the terms are the rows
doc_word = count_vectorizer.transform(ng_train.data).transpose()

In [7]:
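import pandas as pd

# Densify only the first 5 rows (terms); the full matrix is 272502 x 3416
# get_feature_names() was removed in scikit-learn 1.2 in favor of get_feature_names_out()
pd.DataFrame(doc_word[:5].toarray(), index=count_vectorizer.get_feature_names_out()[:5])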

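| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3406 | 3407 | 3408 | 3409 | 3410 | 3411 | 3412 | 3413 | 3414 | 3415 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aa aaa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aa albany | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aa atlanta | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aa does | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |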


In [8]:
doc_word.shape

(272502, 3416)

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [9]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(doc_word)  # converting data into another format that gensim can understand
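
To make the conversion concrete, here is a plain-Python sketch of the structure a gensim corpus streams: one list of `(term_id, count)` pairs per document. The tiny matrix and the helper `columns_to_bow` are made up for illustration. `Sparse2Corpus` itself works on the sparse matrix directly, and the sketch only mimics the shape of its output, assuming (as in the cell above) that terms are rows and documents are columns.

```python
# Toy term-document matrix: rows are terms, columns are documents
# (the same orientation as the transposed doc_word above).
term_doc = [
    [2, 0, 1],  # term 0
    [0, 3, 0],  # term 1
    [1, 0, 0],  # term 2
]

def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc_id]) for term_id, row in enumerate(matrix) if row[doc_id]]
        for doc_id in range(n_docs)
    ]

toy_corpus = columns_to_bow(term_doc)
print(toy_corpus)
```

Each inner list corresponds to one document, which is why the matrix had to be transposed before the conversion.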

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [10]:
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())

In [11]:
len(id2word)

272502
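
The inversion works because `vocabulary_` maps each term to its column index in the document-term matrix, and that index becomes the row index after the transpose. A small sketch using the first three terms from the table above (the `toy_` names are illustrative only):

```python
# Term -> id, in the same form as count_vectorizer.vocabulary_
toy_vocabulary = {"aa": 0, "aa aaa": 1, "aa albany": 2}

# Flip it to id -> term, which is the direction gensim needs
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

print(toy_id2word[1])
```

Since every term has a unique id, no entries are lost in the flip, and the two dictionaries have the same length.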

## LDA
At this point we can simply plow ahead in creating an LDA model.  It requires our corpus of word counts, mapping of row ids to words, and the number of topics (3).

In [12]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus,  # the gensim corpus we built above
                      num_topics=3,   # number of topics to learn
                      id2word=id2word,  # maps term ids back to words
                      passes=5)   # number of full passes over the corpus

2019-11-24 17:22:42,421 : INFO : using symmetric alpha at 0.3333333333333333
2019-11-24 17:22:42,431 : INFO : using symmetric eta at 0.3333333333333333
2019-11-24 17:22:42,505 : INFO : using serial LDA version on this node
2019-11-24 17:22:42,661 : INFO : running online (multi-pass) LDA training, 3 topics, 5 passes over the supplied corpus of 3416 documents, updating model once every 2000 documents, evaluating perplexity every 3416 documents, iterating 50x with a convergence threshold of 0.001000
2019-11-24 17:22:42,807 : INFO : PROGRESS: pass 0, at document #2000/3416
2019-11-24 17:22:44,976 : INFO : merging changes from 2000 documents into a model of 3416 documents
2019-11-24 17:22:45,093 : INFO : topic #0 (0.333): 0.002*"people" + 0.001*"just" + 0.001*"know" + 0.001*"time" + 0.001*"like" + 0.001*"space" + 0.001*"don" + 0.001*"edu" + 0.001*"think" + 0.001*"does"
2019-11-24 17:22:45,102 : INFO : topic #1 (0.333): 0.002*"like" + 0.001*"space" + 0.001*"don" + 0.001*"think" + 0.001*"peop

2019-11-24 17:23:09,105 : INFO : topic #0 (0.333): 0.002*"just" + 0.002*"people" + 0.002*"don" + 0.002*"like" + 0.002*"space" + 0.001*"think" + 0.001*"know" + 0.001*"edu" + 0.001*"does" + 0.001*"time"
2019-11-24 17:23:09,113 : INFO : topic #1 (0.333): 0.002*"people" + 0.002*"armenian" + 0.002*"armenians" + 0.002*"said" + 0.002*"turkish" + 0.001*"don" + 0.001*"like" + 0.001*"know" + 0.001*"just" + 0.001*"jews"
2019-11-24 17:23:09,121 : INFO : topic #2 (0.333): 0.001*"image" + 0.001*"just" + 0.001*"don" + 0.001*"like" + 0.001*"space" + 0.001*"year" + 0.001*"jpeg" + 0.001*"people" + 0.001*"time" + 0.001*"good"
2019-11-24 17:23:09,127 : INFO : topic diff=0.235596, rho=0.386103
2019-11-24 17:23:11,696 : INFO : -11.577 per-word bound, 3054.3 perplexity estimate based on a held-out corpus of 1416 documents with 254128 words
2019-11-24 17:23:11,698 : INFO : PROGRESS: pass 4, at document #3416/3416
2019-11-24 17:23:12,368 : INFO : merging changes from 1416 documents into a model of 3416 documen
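
The perplexity line in the log can be checked by hand. As far as I can tell, gensim derives the perplexity estimate as 2 raised to the negative per-word bound, so the sketch below simply redoes that arithmetic with the bound printed above.

```python
per_word_bound = -11.577  # copied from the log line above
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

The result lands within a unit or so of the logged 3054.3; the small gap comes from the bound being rounded to three decimals in the log.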

Let's take a look at what happened.  Here are the 10 highest-weighted words for each of the 3 topics we found:

In [13]:
lda.print_topics()   # returns one (topic_id, words) pair for each topic specified above

# Note: the topics are unlabeled, so it is your job to infer what each one is about!

2019-11-24 17:23:12,534 : INFO : topic #0 (0.333): 0.002*"people" + 0.002*"just" + 0.002*"don" + 0.002*"like" + 0.001*"space" + 0.001*"god" + 0.001*"know" + 0.001*"think" + 0.001*"does" + 0.001*"time"
2019-11-24 17:23:12,541 : INFO : topic #1 (0.333): 0.002*"people" + 0.002*"armenian" + 0.002*"said" + 0.002*"armenians" + 0.001*"don" + 0.001*"like" + 0.001*"turkish" + 0.001*"know" + 0.001*"just" + 0.001*"space"
2019-11-24 17:23:12,550 : INFO : topic #2 (0.333): 0.002*"image" + 0.001*"jpeg" + 0.001*"like" + 0.001*"don" + 0.001*"just" + 0.001*"space" + 0.001*"use" + 0.001*"time" + 0.001*"people" + 0.001*"year"


[(0,
  '0.002*"people" + 0.002*"just" + 0.002*"don" + 0.002*"like" + 0.001*"space" + 0.001*"god" + 0.001*"know" + 0.001*"think" + 0.001*"does" + 0.001*"time"'),
 (1,
  '0.002*"people" + 0.002*"armenian" + 0.002*"said" + 0.002*"armenians" + 0.001*"don" + 0.001*"like" + 0.001*"turkish" + 0.001*"know" + 0.001*"just" + 0.001*"space"'),
 (2,
  '0.002*"image" + 0.001*"jpeg" + 0.001*"like" + 0.001*"don" + 0.001*"just" + 0.001*"space" + 0.001*"use" + 0.001*"time" + 0.001*"people" + 0.001*"year"')]
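
The strings returned by `print_topics()` are convenient to read but awkward to work with. As a rough sketch, a regular expression can split one into `(word, weight)` pairs. The helper name `parse_topic` is made up here, and the input is the start of topic 2 copied from the output above. gensim's `show_topic` method returns such pairs directly, which is usually the better route.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.002*"image" + 0.001*"jpeg"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_2 = '0.002*"image" + 0.001*"jpeg" + 0.001*"like"'
parsed = parse_topic(topic_2)
print(parsed)
```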

#### Topic Space
If we want to map our documents to the topic space we need to actually use the LdaModel transformer that we created above, like so:

In [14]:
# Transform the docs from the word space to the topic space (like "transform" in sklearn)
lda_corpus = lda[corpus]
lda_corpus

<gensim.interfaces.TransformedCorpus at 0x279f0bc8208>

In [15]:
# Store the documents' topic vectors in a list so we can take a peek
lda_docs = [doc for doc in lda_corpus]

In [16]:
lda_docs[0]

[(2, 0.99065596)]

Now we can take a look at the document vectors in the topic space. Each component is the estimated proportion of the document attributed to that topic. A document vector can have at most num_topics=3 nonzero components, and most show fewer because gensim omits topics whose weight falls below `minimum_probability` (0.01 by default).
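
Since gensim returns only the topics that carry non-negligible weight, the vectors have different lengths. If a fixed-length vector is needed, for example as input to another model, the missing entries can be filled with zeros. The helper `to_dense` below is a hypothetical plain-Python sketch; gensim's `matutils.sparse2full` offers similar functionality for NumPy arrays.

```python
def to_dense(sparse_vector, num_topics):
    """Expand a list of (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vector:
        dense[topic_id] = weight
    return dense

# Topic vector of the first document, copied from the output above
dense_doc = to_dense([(2, 0.99065596)], num_topics=3)
print(dense_doc)
```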

In [17]:
# Check out the document vectors in the topic space for the first 5 documents
lda_docs[0:5]  # individual document vectors in the topic space

# The first document is made up almost entirely of topic 2
# The third document is mostly topic 1, with small contributions from topics 0 and 2

[[(2, 0.99065596)],
 [(0, 0.98118085)],
 [(0, 0.011481263), (1, 0.97667), (2, 0.011848766)],
 [(2, 0.98503464)],
 [(0, 0.99290806)]]

In [18]:
ng_train.data[0]

'Well, the Red Sox have apparenly resigned Herm Winningham to a AAA contract.\nTed "Larry" Simmons signed him to a AAA contract then released him from\nBuffalo, allowing Lou "Curly" Gorman to circumvent the rule about not\nresigning free agents until May 1. Clearly, neither of these guys is bright\nenough to be Moe.\n\n Mike Jones | AIX High-End Development | mjones@donald.aix.kingston.ibm.com'

## On your own...
- Pick a few subsets of the 20newsgroups dataset  
- Try performing LDA on this data with gensim
- Play with some of the preprocessing options and parameters for LDA, observe what happens
- See if you can use the resulting topic space to extract topic vectors
- How do your results look?
- Can you think of how you could cluster this data?
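
One simple starting point for the clustering question is to assign each document to its highest-weight topic. The sketch below does that in plain Python on the five topic vectors shown earlier. `dominant_topic` is a hypothetical helper, and other approaches, such as running k-means on dense topic vectors, are equally reasonable.

```python
from collections import defaultdict

# The first five document vectors from the 3-topic model above
docs = [
    [(2, 0.99065596)],
    [(0, 0.98118085)],
    [(0, 0.011481263), (1, 0.97667), (2, 0.011848766)],
    [(2, 0.98503464)],
    [(0, 0.99290806)],
]

def dominant_topic(doc):
    """Return the id of the topic with the largest weight."""
    return max(doc, key=lambda pair: pair[1])[0]

# Group document indices by their dominant topic
clusters = defaultdict(list)
for doc_id, doc in enumerate(docs):
    clusters[dominant_topic(doc)].append(doc_id)

print(dict(clusters))
```

This hard assignment throws away the mixed-membership information, so it is best treated as a quick first look rather than a final clustering.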

In [19]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus,  # the gensim corpus we built above
                      num_topics=6,   # number of topics to learn
                      id2word=id2word,  # maps term ids back to words
                      passes=10)   # number of full passes over the corpus

2019-11-24 17:23:14,321 : INFO : using symmetric alpha at 0.16666666666666666
2019-11-24 17:23:14,323 : INFO : using symmetric eta at 0.16666666666666666
2019-11-24 17:23:14,372 : INFO : using serial LDA version on this node
2019-11-24 17:23:14,607 : INFO : running online (multi-pass) LDA training, 6 topics, 10 passes over the supplied corpus of 3416 documents, updating model once every 2000 documents, evaluating perplexity every 3416 documents, iterating 50x with a convergence threshold of 0.001000
2019-11-24 17:23:14,672 : INFO : PROGRESS: pass 0, at document #2000/3416
2019-11-24 17:23:16,905 : INFO : merging changes from 2000 documents into a model of 3416 documents
2019-11-24 17:23:17,060 : INFO : topic #1 (0.167): 0.001*"people" + 0.001*"just" + 0.001*"space" + 0.001*"like" + 0.001*"know" + 0.001*"time" + 0.001*"don" + 0.001*"edu" + 0.001*"think" + 0.001*"good"
2019-11-24 17:23:17,069 : INFO : topic #4 (0.167): 0.002*"people" + 0.001*"like" + 0.001*"just" + 0.001*"don" + 0.001*"t

2019-11-24 17:23:34,891 : INFO : topic #5 (0.167): 0.004*"people" + 0.002*"don" + 0.002*"know" + 0.002*"armenian" + 0.002*"like" + 0.002*"just" + 0.002*"said" + 0.002*"armenians" + 0.002*"image" + 0.002*"time"
2019-11-24 17:23:34,903 : INFO : topic diff=0.400402, rho=0.460874
2019-11-24 17:23:35,008 : INFO : PROGRESS: pass 3, at document #2000/3416
2019-11-24 17:23:36,226 : INFO : merging changes from 2000 documents into a model of 3416 documents
2019-11-24 17:23:36,372 : INFO : topic #5 (0.167): 0.004*"people" + 0.002*"don" + 0.002*"know" + 0.002*"armenian" + 0.002*"like" + 0.002*"just" + 0.002*"armenians" + 0.002*"said" + 0.001*"turkish" + 0.001*"time"
2019-11-24 17:23:36,380 : INFO : topic #2 (0.167): 0.002*"don" + 0.002*"people" + 0.002*"just" + 0.002*"think" + 0.002*"said" + 0.002*"like" + 0.002*"israel" + 0.002*"know" + 0.002*"time" + 0.001*"say"
2019-11-24 17:23:36,388 : INFO : topic #1 (0.167): 0.001*"like" + 0.001*"just" + 0.001*"year" + 0.001*"edu" + 0.001*"good" + 0.001*"tim

2019-11-24 17:23:50,431 : INFO : topic diff=0.148887, rho=0.360188
2019-11-24 17:23:50,533 : INFO : PROGRESS: pass 6, at document #2000/3416
2019-11-24 17:23:51,644 : INFO : merging changes from 2000 documents into a model of 3416 documents
2019-11-24 17:23:51,791 : INFO : topic #4 (0.167): 0.002*"image" + 0.002*"data" + 0.001*"new" + 0.001*"edu" + 0.001*"graphics" + 0.001*"use" + 0.001*"like" + 0.001*"available" + 0.001*"software" + 0.001*"does"
2019-11-24 17:23:51,798 : INFO : topic #3 (0.167): 0.003*"space" + 0.002*"launch" + 0.001*"like" + 0.001*"just" + 0.001*"don" + 0.001*"dod" + 0.001*"satellite" + 0.001*"think" + 0.001*"time" + 0.001*"know"
2019-11-24 17:23:51,805 : INFO : topic #2 (0.167): 0.002*"people" + 0.002*"don" + 0.002*"just" + 0.002*"israel" + 0.002*"think" + 0.002*"like" + 0.002*"said" + 0.001*"know" + 0.001*"time" + 0.001*"say"
2019-11-24 17:23:51,815 : INFO : topic #0 (0.167): 0.002*"space" + 0.001*"think" + 0.001*"just" + 0.001*"years" + 0.001*"don" + 0.001*"like" 

2019-11-24 17:24:08,722 : INFO : merging changes from 2000 documents into a model of 3416 documents
2019-11-24 17:24:08,879 : INFO : topic #4 (0.167): 0.002*"image" + 0.002*"data" + 0.001*"graphics" + 0.001*"new" + 0.001*"edu" + 0.001*"use" + 0.001*"like" + 0.001*"available" + 0.001*"software" + 0.001*"does"
2019-11-24 17:24:08,890 : INFO : topic #1 (0.167): 0.001*"like" + 0.001*"year" + 0.001*"just" + 0.001*"good" + 0.001*"edu" + 0.001*"time" + 0.001*"think" + 0.001*"better" + 0.001*"years" + 0.001*"team"
2019-11-24 17:24:08,898 : INFO : topic #5 (0.167): 0.003*"people" + 0.002*"don" + 0.002*"armenian" + 0.002*"know" + 0.002*"like" + 0.002*"just" + 0.002*"armenians" + 0.002*"said" + 0.001*"time" + 0.001*"turkish"
2019-11-24 17:24:08,911 : INFO : topic #0 (0.167): 0.002*"space" + 0.001*"think" + 0.001*"just" + 0.001*"don" + 0.001*"years" + 0.001*"people" + 0.001*"like" + 0.001*"good" + 0.001*"time" + 0.001*"know"
2019-11-24 17:24:08,923 : INFO : topic #2 (0.167): 0.002*"people" + 0.002

In [20]:
lda.print_topics()   # returns one (topic_id, words) pair for each topic specified above

# Note: the topics are unlabeled, so it is your job to infer what each one is about!

2019-11-24 17:24:12,621 : INFO : topic #0 (0.167): 0.002*"space" + 0.001*"think" + 0.001*"don" + 0.001*"just" + 0.001*"people" + 0.001*"years" + 0.001*"like" + 0.001*"good" + 0.001*"time" + 0.001*"know"
2019-11-24 17:24:12,632 : INFO : topic #1 (0.167): 0.001*"like" + 0.001*"year" + 0.001*"good" + 0.001*"just" + 0.001*"time" + 0.001*"edu" + 0.001*"think" + 0.001*"better" + 0.001*"years" + 0.001*"new"
2019-11-24 17:24:12,642 : INFO : topic #2 (0.167): 0.002*"people" + 0.002*"don" + 0.002*"just" + 0.002*"israel" + 0.002*"think" + 0.002*"like" + 0.001*"said" + 0.001*"know" + 0.001*"time" + 0.001*"year"
2019-11-24 17:24:12,649 : INFO : topic #3 (0.167): 0.004*"space" + 0.002*"launch" + 0.001*"satellite" + 0.001*"like" + 0.001*"just" + 0.001*"don" + 0.001*"new" + 0.001*"time" + 0.001*"think" + 0.001*"nasa"
2019-11-24 17:24:12,659 : INFO : topic #4 (0.167): 0.002*"image" + 0.002*"data" + 0.001*"graphics" + 0.001*"new" + 0.001*"edu" + 0.001*"software" + 0.001*"use" + 0.001*"like" + 0.001*"ima

[(0,
  '0.002*"space" + 0.001*"think" + 0.001*"don" + 0.001*"just" + 0.001*"people" + 0.001*"years" + 0.001*"like" + 0.001*"good" + 0.001*"time" + 0.001*"know"'),
 (1,
  '0.001*"like" + 0.001*"year" + 0.001*"good" + 0.001*"just" + 0.001*"time" + 0.001*"edu" + 0.001*"think" + 0.001*"better" + 0.001*"years" + 0.001*"new"'),
 (2,
  '0.002*"people" + 0.002*"don" + 0.002*"just" + 0.002*"israel" + 0.002*"think" + 0.002*"like" + 0.001*"said" + 0.001*"know" + 0.001*"time" + 0.001*"year"'),
 (3,
  '0.004*"space" + 0.002*"launch" + 0.001*"satellite" + 0.001*"like" + 0.001*"just" + 0.001*"don" + 0.001*"new" + 0.001*"time" + 0.001*"think" + 0.001*"nasa"'),
 (4,
  '0.002*"image" + 0.002*"data" + 0.001*"graphics" + 0.001*"new" + 0.001*"edu" + 0.001*"software" + 0.001*"use" + 0.001*"like" + 0.001*"images" + 0.001*"available"'),
 (5,
  '0.004*"people" + 0.002*"don" + 0.002*"know" + 0.002*"armenian" + 0.002*"just" + 0.002*"like" + 0.002*"said" + 0.002*"armenians" + 0.001*"god" + 0.001*"jpeg"')]