## COMPLETE GUIDE TO TOPIC MODELING
##### By Ruben Seoane, all credit to nlpforhackers.io
Based on the following article: https://nlpforhackers.io/topic-modeling/#more-8220


### Using Gensim for Topic Modeling
We will start with the **gensim** implementations, as they offer more functionality out of the box. We will then replicate the process with **sklearn**. Let's prepare the dataset:

In [1]:
from nltk.corpus import brown

data = []

for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
    
NO_Documents = len(data)
print(NO_Documents)
print(data[:5])

500


The Gensim release used here doesn't offer an implementation of _NMF_, so we will only test the _LDA_ (Latent Dirichlet Allocation) and _LSI_ (Latent Semantic Indexing, also called Latent Semantic Analysis) models.

In [4]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

NUM_TOPICS = 10
STOPWORDS = set(stopwords.words('english'))  # a set makes the membership test in clean_text fast

def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
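    # Keep non-stopword tokens that start with at least 3 letters/hyphens (raw string avoids an invalid-escape warning)
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]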
    return cleaned_text

# For gensim we will tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))

# Build a Dictionary - associating each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)

# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Check what the document at index 20 looks like: [(word_id, count), ...]
print(corpus[20])

[(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), (44, 2), (45, 2), (46, 2), (47, 2), (49, 1), (50, 1), (53, 1), (56, 1), (59, 1), (60, 1), (66, 1), (75, 1), (80, 1), (98, 1), (101, 1), (106, 1), (117, 1), (129, 1), (130, 2), (132, 2), (135, 2), (140, 1), (141, 2), (143, 4), (144, 2), (145, 2), (166, 1), (195, 1), (198, 3), (219, 1), (220, 4), (221, 3), (223, 1), (229, 4), (230, 4), (231, 2), (235, 1), (236, 1), (242, 2), (246, 2), (255, 1), (263, 1), (269, 1), (270, 5), (271, 2), (275, 5), (276, 1), (278, 4), (280, 2), (281, 1), (307, 2), (310, 1), (311, 3), (313, 1), (314, 5), (318, 4), (322, 1), (336, 1), (338, 3), (339, 1), (340, 1), (341, 1), (345, 1), (346, 1), (351, 1), (354, 1), (355, 1), (366, 3), (368, 13), (370, 1), (372, 1), (374, 3), (377, 3), (381, 3), (386, 1), (392, 6), (396, 1), (401, 1), (412, 2), (426, 2), (428, 2), (431, 2), (434, 2), (439, 2), (444, 1), (450, 1), (452, 1), (462, 1), (465, 1), (467, 1), (470, 1), (478, 1), (483, 1), (
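
To make the `(word_id, count)` pairs above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea behind `corpora.Dictionary` and `doc2bow`. The helper names `build_dictionary` and `doc2bow` below are illustrative stand-ins, not gensim's code, and the ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(tokens, token2id):
    """Count known tokens and return sorted (word_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy corpus (hypothetical data, not taken from the Brown corpus)
docs = [["economy", "growth", "economy"], ["growth", "jobs"]]
token2id = build_dictionary(docs)

# Words that were never seen during dictionary building are dropped
bow = doc2bow(["economy", "economy", "jobs", "unseen"], token2id)
print(bow)  # [(0, 2), (2, 1)]
```

As in the gensim output, word order is lost: only which words occur, and how often, is kept.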

In [6]:
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

Let's display the topics inferred by both models:

In [7]:
print('LDA Model: ')

for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
    
print("=" * 20)

print("LSI Model:")

for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx,10))
    
print("=" * 20)

LDA Model: 
Topic #0: 0.005*"would" + 0.005*"one" + 0.004*"could" + 0.003*"time" + 0.003*"said" + 0.003*"new" + 0.003*"years" + 0.003*"even" + 0.002*"like" + 0.002*"two"
Topic #1: 0.006*"would" + 0.004*"time" + 0.003*"could" + 0.003*"said" + 0.003*"one" + 0.003*"new" + 0.003*"may" + 0.003*"two" + 0.002*"like" + 0.002*"people"
Topic #2: 0.007*"would" + 0.006*"one" + 0.004*"new" + 0.003*"said" + 0.003*"could" + 0.002*"time" + 0.002*"back" + 0.002*"first" + 0.002*"even" + 0.002*"may"
Topic #3: 0.009*"one" + 0.005*"would" + 0.004*"new" + 0.003*"said" + 0.003*"time" + 0.003*"man" + 0.003*"two" + 0.002*"first" + 0.002*"could" + 0.002*"may"
Topic #4: 0.008*"one" + 0.005*"would" + 0.004*"could" + 0.003*"time" + 0.003*"two" + 0.003*"said" + 0.003*"new" + 0.002*"like" + 0.002*"made" + 0.002*"also"
Topic #5: 0.007*"one" + 0.006*"would" + 0.004*"said" + 0.004*"like" + 0.004*"could" + 0.003*"man" + 0.003*"first" + 0.003*"two" + 0.003*"time" + 0.002*"may"
Topic #6: 0.005*"one" + 0.004*"said" + 0.004

In [8]:
# Let's put the models to work and transform an unseen document into its topic distribution
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))

print(lsi_model[bow])

[(0, 0.09161592150250299), (1, -0.008647419152097076), (2, 0.016514037825629678), (3, -0.03994304879926142), (4, 0.0152839748674506), (5, -0.013844245215259777), (6, -0.03065641335641124), (7, -0.01620996546381877), (8, 0.05620355327096425), (9, 0.025955191358924518)]


In [9]:
print(lda_model[bow])

[(0, 0.020005101), (1, 0.020006312), (2, 0.020005705), (3, 0.819948), (4, 0.020006096), (5, 0.020005943), (6, 0.020006195), (7, 0.020005379), (8, 0.020005576), (9, 0.020005718)]
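
The two outputs above are not on the same scale, which is easy to miss. The short check below reuses the numbers printed by the two cells: the LDA vector behaves like a probability distribution over topics, while the LSI vector is a set of signed coordinates. The variable names are chosen here for illustration only.

```python
# Values copied from the LSI and LDA outputs printed above
lsi_vec = [(0, 0.09161592150250299), (1, -0.008647419152097076),
           (2, 0.016514037825629678), (3, -0.03994304879926142),
           (4, 0.0152839748674506), (5, -0.013844245215259777),
           (6, -0.03065641335641124), (7, -0.01620996546381877),
           (8, 0.05620355327096425), (9, 0.025955191358924518)]
lda_vec = [(0, 0.020005101), (1, 0.020006312), (2, 0.020005705),
           (3, 0.819948), (4, 0.020006096), (5, 0.020005943),
           (6, 0.020006195), (7, 0.020005379), (8, 0.020005576),
           (9, 0.020005718)]

# LDA weights are non-negative and add up to (approximately) 1
lda_total = sum(weight for _, weight in lda_vec)

# LSI weights can be negative, so they are not probabilities
lsi_has_negative = any(weight < 0 for _, weight in lsi_vec)

# The dominant LDA topic is simply the pair with the largest weight
dominant_topic, dominant_weight = max(lda_vec, key=lambda pair: pair[1])
print(dominant_topic, round(lda_total, 3), lsi_has_negative)
```

Because LDA is trained from a random initialization, the index of the dominant topic will usually change between runs.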


Gensim offers a simple way to perform similarity queries using topic models:

In [13]:
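# Import the class directly so the `similarities` variable below does not shadow the gensim module
from gensim.similarities import MatrixSimilarity

# Index every document in the corpus by its LDA topic distribution
lda_index = MatrixSimilarity(lda_model[corpus], num_features=NUM_TOPICS)

# Query the index with the topic distribution of the unseen document
similarities = lda_index[lda_model[bow]]
# Sort by similarity, highest first
similarities = sorted(enumerate(similarities), key=lambda item: -item[1])

# Top 10 most similar documents as (document_id, similarity) pairs
print(similarities[:10])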

[(228, 0.99817294), (237, 0.99790853), (263, 0.9978553), (6, 0.9976989), (236, 0.99762785), (28, 0.9976256), (196, 0.99762464), (156, 0.9976114), (412, 0.997604), (212, 0.99759096)]
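
The scores returned by `MatrixSimilarity` are cosine similarities between topic vectors. As a rough sketch of what the query does, the pure-Python version below ranks a few hypothetical topic vectors against a query vector. All numbers are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic distributions over 4 topics
query = [0.02, 0.02, 0.90, 0.06]
documents = [
    [0.01, 0.01, 0.95, 0.03],  # dominated by the same topic as the query
    [0.70, 0.10, 0.10, 0.10],  # dominated by a different topic
    [0.25, 0.25, 0.25, 0.25],  # no dominant topic
]

# Rank documents by similarity to the query, highest first
scores = [cosine_similarity(query, doc) for doc in documents]
ranked = sorted(enumerate(scores), key=lambda item: -item[1])
print(ranked)
```

When most documents are dominated by a single topic, many scores end up very close to 1, as in the output above.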


In [14]:
# Let's find out the most similar document
document_id, similarity = similarities[0]
print(data[document_id][:1000])

I feel obliged to describe this cubbyhole . It had a single porcelain stall and but one cabinet for the chairing of the bards . It was here that the terror-stricken Dennis Moon played an unrehearsed role during the children's party . A much larger room , adjacent to the lavatory , served as a passageway to and from the skimpy toilet . That unused room was large enough for -- well , say an elephant could get into it and , as a matter of fact , an elephant did . Something occurred on the morning of the children's party which may illustrate the kind of trouble our restricted toilet facilities caused us . It so happened that sports writer Arthur Robinson got out of the hospital that morning after promising his doctor that he be back in an hour or two to continue his convalescence . Arthur Robinson traveled with the baseball clubs as staff correspondent for the American . He was ghost writer for Babe Ruth , whose main talent for literary composition was the signing of his autograph . Robbie

### Using Scikit-learn for Topic Modeling
Let's go through the same process with **_sklearn_**. In addition to LDA and LSI, this library offers an NMF (Non-Negative Matrix Factorization) implementation.

In [17]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 10

vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                            stop_words="english", lowercase=True,
                            token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Build a Latent Dirichlet Allocation Model
# (n_components replaces the n_topics argument of older scikit-learn releases)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape) # (NO_DOCUMENTS, NUM_TOPICS)

# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape) # (NO_DOCUMENTS, NUM_TOPICS)

# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape) # (NO_DOCUMENTS, NUM_TOPICS)

# Let's check how the first document in the corpus looks in different topic spaces
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])



(500, 10)
(500, 10)
(500, 10)
[1.05605894e-04 1.05609265e-04 1.05605700e-04 5.28445455e-01
 1.05620250e-04 2.91624357e-01 1.05616459e-04 1.79190917e-01
 1.05596826e-04 1.05616419e-04]
[0.         0.         2.11710165 0.07697039 0.         0.54562424
 1.06816587 0.         0.         0.24738382]
[ 23.30684397   1.59508401  21.79982266   0.05007477   0.93375042
  11.63274128   3.9182653   -1.47678147   1.06527667 -14.70875259]


In order to transform a new document:

In [18]:
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.00290271 0.         0.         0.         0.         0.00441021
 0.         0.         0.         0.00468816]
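
Note that every entry of the NMF vector is zero or positive. NMF approximates the document-term matrix `X` as a product `W @ H` of two non-negative matrices: `W` holds document-topic weights and `H` holds topic-word weights. The NumPy sketch below uses the classic multiplicative update rules on a made-up matrix. It is an illustration of the idea under those assumptions, not the solver scikit-learn runs, and the function name `nmf_multiplicative` is hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, num_topics, iterations=200, eps=1e-9, seed=0):
    """Factorize X (docs x terms) into non-negative W (docs x topics) and H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, num_topics))
    H = rng.random((num_topics, n_terms))
    for _ in range(iterations):
        # Multiplying by non-negative ratios keeps W and H non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 4-document x 5-term count matrix with two obvious word groups
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

W, H = nmf_multiplicative(X, num_topics=2)
reconstruction_error = np.linalg.norm(X - W @ H)
print(W.round(2))
print(reconstruction_error)
```

Each row of `W` plays the same role as a row of `nmf_Z` above: the weight of every topic in one document.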


To implement the same similarity query as in the **_gensim_** section:

In [19]:
from sklearn.metrics.pairwise import euclidean_distances

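def most_similar(x, Z, top_n=5):
    """Return the top_n (document_id, distance) pairs closest to x in topic space."""
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    # A smaller distance means a more similar document, so sort in ascending order
    return sorted(pairs, key=lambda item: item[1])[:top_n]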

similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print(data[document_id][:1000])

Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Franciscans -- regulars , most of them , and trenchermen like himself . They did not complain at the inhuman hour of starting ( seve

### Plotting words and documents in 2D with SVD


In [21]:
# Initiating Bokeh
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

In [22]:
# Let's plot the documents in 2D

In [24]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)

df = pd.DataFrame(columns=['x','y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x='x', y='y', text='document', y_offset=8,
                 text_font_size='8pt', text_color="#555555",
                 source=source, text_align='center')

plot = figure(width=600, height=600)  # older Bokeh releases named these plot_width / plot_height
plot.circle('x','y', size=12, source=source, line_color='black', fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

To display words in 2D we need to transpose the vectorized data with: **words_2d = svd.fit_transform(data_vectorized.T)**

In [28]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)

df = pd.DataFrame(columns=['x', 'y', 'word'])
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names_out()

source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x='x', y='y', text='word', y_offset=8,
                  text_font_size='8pt', text_color='#555555',
                  source=source, text_align='center')

plot = figure(width=600, height=600)  # older Bokeh releases named these plot_width / plot_height
plot.circle('x','y', size=12, source=source, line_color='black', fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
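
To see why transposing the matrix gives word coordinates, here is a small NumPy sketch of the projection on a made-up matrix. It assumes that `TruncatedSVD` behaves like a plain SVD without centering, so the results should match it only up to the sign of each axis. The matrix and variable names are illustrative.

```python
import numpy as np

# Hypothetical 4-document x 5-term count matrix (rows are documents)
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted from largest to smallest
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of dimensions to keep
# Document coordinates: each document projected on the top-k right singular vectors
documents_2d = U[:, :k] * s[:k]
# Word coordinates: the same decomposition read from the transposed matrix
words_2d = Vt[:k].T * s[:k]

print(documents_2d.shape, words_2d.shape)  # (4, 2) (5, 2)
```

Documents and words are therefore two views of one decomposition, which is why both plots can be produced from the same vectorized data.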

### More on Latent Dirichlet Allocation
LDA is the most popular method for performing topic modeling in real-world applications. It tends to give good results, can be trained online (the model can be updated with new data instead of being retrained from scratch), and can run on multiple cores.

In [29]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 10
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                            stop_words='english', lowercase=True,
                            token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Building a Latent Dirichlet Allocation Model
# (n_components replaces the n_topics argument of older scikit-learn releases)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)

text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())



[0.02500403 0.77496911 0.0250082  0.0250003  0.02500716 0.02500039
 0.02500037 0.02500005 0.02500774 0.02500264] 0.9999999999999999


LDA is an _iterative algorithm_ with two main steps:
- Initialization: each word is assigned a random topic.
- Iteration: the algorithm goes through each word and reassigns it to a topic, taking into account:
    - The probability that the word belongs to the topic
    - The probability that the document was generated by the topic
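
The two steps above can be sketched as a toy collapsed Gibbs sampler in pure Python. This is only an illustration of the description: scikit-learn and gensim fit LDA with variational inference rather than with this sampler, and the function name `gibbs_lda`, the priors `alpha` and `beta`, and the toy documents are all assumptions made for the example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                               # total words per topic
    assignments = []

    # Initialization: assign a random topic to every word occurrence
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Iteration: resample the topic of each word occurrence in turn
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = P(word | topic) * P(topic | document), up to a constant
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical corpus: word ids 0-2 and 3-5 never co-occur in the same document
docs = [[0, 1, 0, 2], [1, 0, 2, 0], [3, 4, 5, 3], [4, 3, 5, 5]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of a document ended up in each topic. Normalizing a row gives a topic distribution comparable to the vector `x` printed above.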
    
We will use the **pyLDAvis** tool to visualize the results:

In [30]:
# Note: recent pyLDAvis releases rename this module to pyLDAvis.lda_model
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel


#### Interpreting the visualization
1. Larger topics appear more frequently in the corpus.
2. The closer two topics are on the map, the more similar they are.
3. When a topic is selected, its most representative words are shown, ranked by a mix of how frequent each word is in the topic and how discriminant it is for that topic. The slider adjusts the weight given to each of these two properties.
4. Hovering over a word will adjust the topic sizes to show how representative the word is for each of them.
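
The trade-off controlled by the slider in point 3 can be sketched with the relevance score that pyLDAvis is based on, assuming the slider value corresponds to the weight `lam` below. The probabilities are made up for illustration and the function names are hypothetical.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam = 1 ranks by frequency within the topic; lam = 0 ranks by how specific the word is to the topic."""
    frequency_term = math.log(p_word_given_topic)
    specificity_term = math.log(p_word_given_topic / p_word)
    return lam * frequency_term + (1 - lam) * specificity_term

# Hypothetical probabilities: "would" is common everywhere, "economy" is rare outside this topic
p_w_given_t = {"would": 0.05, "economy": 0.02}
p_w = {"would": 0.05, "economy": 0.002}

def rank_words(lam):
    """Order the topic's words from most to least relevant for a given slider value."""
    return sorted(p_w_given_t, key=lambda w: -relevance(p_w_given_t[w], p_w[w], lam))

print(rank_words(1.0))  # ['would', 'economy']
print(rank_words(0.0))  # ['economy', 'would']
```

Moving the slider towards 0 pushes generic words such as "would", which dominate the gensim topics printed earlier, down the list.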

LDA can be used for automatic tagging: we could go through each topic and attach a label to it. How well this works depends on how clearly defined the topics are. Experimenting with the number of topics (**_num_topics_**) can improve results. Also take into account that the larger the corpus (this one has only 500 documents), the better defined the topics tend to be.
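
A minimal sketch of such a tagger is shown below. It reuses the topic distribution `x` printed earlier for the sentence about the economy. The labels in `TOPIC_LABELS`, the `threshold` value and the function name `tag_document` are hypothetical: in practice the labels would be written by hand after inspecting each topic's top words.

```python
# Hypothetical, hand-written labels for some of the topics (not derived from the model)
TOPIC_LABELS = {1: "economy", 3: "sports"}

def tag_document(topic_distribution, labels, threshold=0.3):
    """Return the labels of all labeled topics whose weight reaches the threshold."""
    return [
        labels[topic_id]
        for topic_id, weight in enumerate(topic_distribution)
        if weight >= threshold and topic_id in labels
    ]

# Topic distribution copied from the scikit-learn LDA output above
x = [0.02500403, 0.77496911, 0.0250082, 0.0250003, 0.02500716,
     0.02500039, 0.02500037, 0.02500005, 0.02500774, 0.02500264]

tags = tag_document(x, TOPIC_LABELS)
print(tags)  # ['economy']
```

A document with no topic above the threshold gets no tag, which is often preferable to forcing a label onto a poorly defined topic.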