# LDA is an iterative algorithm

Here are the two main steps:

* In the initialization stage, each word is assigned to a random topic.

* Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration:

    * the probability of the word belonging to a topic

    * the probability of the document being generated by a topic

Because the result is a set of probabilities (words per topic, topics per document), LDA results are easy to visualize.
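The two steps above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration of the idea, not what Gensim or scikit-learn run internally (both rely on variational inference). The function name `toy_lda_gibbs`, the priors `alpha` and `beta`, and the tiny corpus of integer word ids are all assumptions made for this example.

```python
import random

def toy_lda_gibbs(docs, num_topics, vocab_size, iterations=50,
                  alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # words assigned to topic k overall

    # Step 1: assign every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    # Step 2: repeatedly revisit each word and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take the current assignment out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # weight(k) is proportional to P(word | topic k) * P(topic k | document)
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical corpus: words 0-2 and words 3-5 never share a document.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
doc_topic, topic_word = toy_lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, which is the raw material for the per-document topic distribution.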

In [2]:
from nltk.corpus import brown
 
data = []
 
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
 
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:1])

500


In [24]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [56]:
brown.words('ca02')

['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...]

In [54]:
brown.fileids()

['ca01',
 'ca02',
 'ca03',
 'ca04',
 'ca05',
 'ca06',
 'ca07',
 'ca08',
 'ca09',
 'ca10',
 'ca11',
 'ca12',
 'ca13',
 'ca14',
 'ca15',
 'ca16',
 'ca17',
 'ca18',
 'ca19',
 'ca20',
 'ca21',
 'ca22',
 'ca23',
 'ca24',
 'ca25',
 'ca26',
 'ca27',
 'ca28',
 'ca29',
 'ca30',
 'ca31',
 'ca32',
 'ca33',
 'ca34',
 'ca35',
 'ca36',
 'ca37',
 'ca38',
 'ca39',
 'ca40',
 'ca41',
 'ca42',
 'ca43',
 'ca44',
 'cb01',
 'cb02',
 'cb03',
 'cb04',
 'cb05',
 'cb06',
 'cb07',
 'cb08',
 'cb09',
 'cb10',
 'cb11',
 'cb12',
 'cb13',
 'cb14',
 'cb15',
 'cb16',
 'cb17',
 'cb18',
 'cb19',
 'cb20',
 'cb21',
 'cb22',
 'cb23',
 'cb24',
 'cb25',
 'cb26',
 'cb27',
 'cc01',
 'cc02',
 'cc03',
 'cc04',
 'cc05',
 'cc06',
 'cc07',
 'cc08',
 'cc09',
 'cc10',
 'cc11',
 'cc12',
 'cc13',
 'cc14',
 'cc15',
 'cc16',
 'cc17',
 'cd01',
 'cd02',
 'cd03',
 'cd04',
 'cd05',
 'cd06',
 'cd07',
 'cd08',
 'cd09',
 'cd10',
 'cd11',
 'cd12',
 'cd13',
 'cd14',
 'cd15',
 'cd16',
 'cd17',
 'ce01',
 'ce02',
 'ce03',
 'ce04',
 'ce05',
 'ce06',
 

# Topic Modeling Algorithms

There are several algorithms for doing topic modeling. The most popular ones include

* LDA – Latent Dirichlet Allocation – The one we’ll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models

* LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra

* NMF – Non-Negative Matrix Factorization – Based on Linear Algebra
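To make the linear-algebra view concrete, here is a minimal sketch of the SVD step behind LSA/LSI on a tiny document-term matrix. The matrix values are invented for illustration, and plain NumPy is used here rather than any topic-modeling library.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are word counts.
doc_term = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

# SVD: doc_term = U @ diag(S) @ Vt, with singular values sorted in descending order.
U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)

k = 2  # number of latent "topics" to keep
doc_coords = U[:, :k] * S[:k]   # documents expressed in the k-dimensional topic space
topic_terms = Vt[:k]            # each row weights the vocabulary for one topic

print(doc_coords.round(3))
print(topic_terms.round(3))
```

Keeping only the first `k` singular directions is what turns the decomposition into a topic model: documents that use similar words land close together in `doc_coords`.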

# Using Gensim for Topic Modeling

The Gensim version installed below (3.6.0) doesn’t ship an NMF (Non-Negative Matrix Factorization) implementation, so we’re only going to play with the LDA and LSI (Latent Semantic Indexing, AKA Latent Semantic Analysis) models.

In [4]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/52/d8/1a966940585bdd828d6ca8bca37d1be81e3e7e2fa1f51098117f15c32a1b/gensim-3.6.0-cp36-cp36m-win_amd64.whl (23.6MB)

In [5]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
 
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')
 
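def clean_text(text):
    # Lowercase and tokenize, then keep non-stopword tokens that start with
    # at least three letters or hyphens (re.match only anchors at the token start)
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text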
 
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
tokenized_data[0]



['fulton',
 'county',
 'grand',
 'jury',
 'said',
 'friday',
 'investigation',
 'atlanta',
 'recent',
 'primary',
 'election',
 'produced',
 'evidence',
 'irregularities',
 'took',
 'place',
 'jury',
 'said',
 'term-end',
 'presentments',
 'city',
 'executive',
 'committee',
 'over-all',
 'charge',
 'election',
 'deserves',
 'praise',
 'thanks',
 'city',
 'atlanta',
 'manner',
 'election',
 'conducted',
 'september-october',
 'term',
 'jury',
 'charged',
 'fulton',
 'superior',
 'court',
 'judge',
 'durwood',
 'pye',
 'investigate',
 'reports',
 'possible',
 'irregularities',
 'hard-fought',
 'primary',
 'mayor-nominate',
 'ivan',
 'allen',
 'relative',
 'handful',
 'reports',
 'received',
 'jury',
 'said',
 'considering',
 'widespread',
 'interest',
 'election',
 'number',
 'voters',
 'size',
 'city',
 'jury',
 'said',
 'find',
 'many',
 'georgia',
 'registration',
 'election',
 'laws',
 'outmoded',
 'inadequate',
 'often',
 'ambiguous',
 'recommended',
 'fulton',
 'legislators',
 'ac

In [6]:
# Build a Dictionary - a mapping from each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)

In [16]:
max(dictionary.keys())

44939

In [13]:
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
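The bag-of-words conversion can be pictured with a short pure-Python sketch. It is not Gensim's implementation: the helper names `build_token_ids` and `to_bow` are made up for this example, and the order in which ids are assigned here (first appearance) is an assumption that may differ from what `corpora.Dictionary` does. What it does share with `doc2bow` is the output shape, sorted `(word_id, count)` pairs, and the fact that words missing from the dictionary are skipped.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["jury", "said", "election", "jury"], ["election", "laws", "said"]]
token2id = build_token_ids(docs)
print(to_bow(docs[0], token2id))                  # [(0, 2), (1, 1), (2, 1)]
print(to_bow(["economy", "election"], token2id))  # [(2, 1)]
```

The second call shows why an unseen document can end up with a very short vector: only the words already known to the dictionary contribute.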

In [17]:
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 2), (4, 3), (5, 1), (6, 3), (7, 2), (8, 2), (9, 1), (10, 2), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 3), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 2), (39, 1), (40, 1), (41, 1), (42, 2), (43, 2), (44, 1), (45, 1), (46, 1), (47, 1), (48, 7), (49, 1), (50, 2), (51, 1), (52, 2), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 2), (60, 1), (61, 1), (62, 3), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 2), (73, 2), (74, 2), (75, 1), (76, 3), (77, 6), (78, 1), (79, 1), (80, 1), (81, 1), (82, 3), (83, 2), (84, 3), (85, 1), (86, 3), (87, 1), (88, 5), (89, 4), (90, 2), (91, 1), (92, 1), (93, 1), (94, 1), (95, 3), (96, 1), (97, 2), (98, 2), (99, 1), (100, 1), (101, 1), (102, 2), (103, 9), (104, 1), (105, 2), (106, 1), (107, 2), (108, 1), (109, 1), (110, 3),

In [15]:
# Have a look at the document at index 20: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
 

[(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), (44, 2), (45, 2), (46, 2), (47, 2), (49, 1), (50, 1), (53, 1), (56, 1), (59, 1), (60, 1), (66, 1), (75, 1), (80, 1), (98, 1), (101, 1), (106, 1), (117, 1), (129, 1), (130, 2), (132, 2), (135, 2), (140, 1), (141, 2), (143, 4), (144, 2), (145, 2), (166, 1), (195, 1), (198, 3), (219, 1), (220, 4), (221, 3), (223, 1), (229, 4), (230, 4), (231, 2), (235, 1), (236, 1), (242, 2), (246, 2), (255, 1), (263, 1), (269, 1), (270, 5), (271, 2), (275, 5), (276, 1), (278, 4), (280, 2), (281, 1), (307, 2), (310, 1), (311, 3), (313, 1), (314, 5), (318, 4), (322, 1), (336, 1), (338, 3), (339, 1), (340, 1), (341, 1), (345, 1), (346, 1), (351, 1), (354, 1), (355, 1), (366, 3), (368, 13), (370, 1), (372, 1), (374, 3), (377, 3), (381, 3), (386, 1), (392, 6), (396, 1), (401, 1), (412, 2), (426, 2), (428, 2), (431, 2), (434, 2), (439, 2), (444, 1), (450, 1), (452, 1), (462, 1), (465, 1), (467, 1), (470, 1), (478, 1), (483, 1), (

In [18]:
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

In [19]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))

LDA Model:
Topic #0: 0.006*"one" + 0.006*"would" + 0.004*"said" + 0.003*"could" + 0.003*"time" + 0.003*"new" + 0.003*"first" + 0.002*"state" + 0.002*"may" + 0.002*"like"
Topic #1: 0.006*"would" + 0.005*"one" + 0.003*"said" + 0.003*"new" + 0.002*"could" + 0.002*"two" + 0.002*"may" + 0.002*"man" + 0.002*"time" + 0.002*"world"
Topic #2: 0.008*"one" + 0.005*"would" + 0.004*"could" + 0.003*"said" + 0.003*"two" + 0.003*"time" + 0.003*"new" + 0.003*"man" + 0.003*"must" + 0.002*"like"
Topic #3: 0.005*"said" + 0.004*"would" + 0.004*"new" + 0.004*"one" + 0.003*"two" + 0.003*"man" + 0.003*"could" + 0.002*"time" + 0.002*"may" + 0.002*"like"
Topic #4: 0.007*"one" + 0.005*"would" + 0.004*"said" + 0.003*"time" + 0.003*"may" + 0.003*"new" + 0.003*"even" + 0.003*"first" + 0.002*"man" + 0.002*"also"
Topic #5: 0.006*"would" + 0.005*"one" + 0.004*"time" + 0.003*"two" + 0.003*"could" + 0.003*"new" + 0.003*"may" + 0.003*"said" + 0.002*"also" + 0.002*"made"
Topic #6: 0.008*"one" + 0.005*"said" + 0.004*"could

In [20]:
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 

In [21]:
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 

LSI Model:
Topic #0: 0.308*"one" + 0.280*"would" + 0.202*"said" + 0.175*"could" + 0.146*"time" + 0.144*"new" + 0.126*"man" + 0.125*"like" + 0.125*"two" + 0.120*"first"
Topic #1: 0.294*"said" + -0.219*"may" + -0.179*"state" + 0.176*"could" + 0.153*"would" + -0.143*"states" + -0.141*"new" + 0.140*"like" + 0.139*"back" + 0.105*"man"
Topic #2: -0.339*"said" + -0.338*"state" + 0.229*"one" + -0.191*"states" + -0.161*"year" + -0.152*"mrs." + -0.135*"would" + -0.132*"united" + -0.131*"federal" + -0.130*"government"
Topic #3: 0.262*"new" + 0.256*"mrs." + -0.155*"feed" + -0.151*"per" + 0.149*"world" + -0.144*"used" + 0.140*"church" + 0.117*"god" + 0.107*"life" + 0.102*"people"
Topic #4: 0.510*"mrs." + -0.235*"would" + -0.192*"states" + -0.153*"united" + -0.131*"could" + -0.122*"man" + -0.121*"state" + -0.108*"government" + 0.103*"year" + 0.100*"school"
Topic #5: -0.379*"feed" + 0.368*"would" + -0.279*"per" + -0.241*"state" + -0.130*"daily" + -0.126*"man" + -0.125*"god" + -0.122*"drug" + -0.115*"

Now put the models to work and transform unseen documents to their topic distribution:

In [22]:
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))

In [23]:
print(lsi_model[bow])

[(0, 0.09161302308686932), (1, 0.008867641109445129), (2, 0.016321120030067717), (3, 0.04059079773105638), (4, 0.01563141790090322), (5, -0.011982448259454674), (6, 0.029358994347650536), (7, -0.01693051773271266), (8, 0.05387962329432002), (9, 0.02718187338112224)]


In [25]:
print(lda_model[bow])

[(0, 0.020005789), (1, 0.02000573), (2, 0.020005442), (3, 0.020007197), (4, 0.020006083), (5, 0.020006085), (6, 0.020007065), (7, 0.8199452), (8, 0.02000523), (9, 0.020006116)]


The LDA result can be interpreted as a distribution over topics. Take the entry (7, 0.8199452) as an example: it suggests that topic 7 has the strongest representation in this text.
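Reading the dominant topic off such a result takes one line. The sketch below copies the `(topic_id, probability)` pairs from the output above, so it runs without the trained model; the variable names are chosen for this example.

```python
# Topic distribution copied from the lda_model[bow] output above.
lda_result = [(0, 0.020005789), (1, 0.02000573), (2, 0.020005442),
              (3, 0.020007197), (4, 0.020006083), (5, 0.020006085),
              (6, 0.020007065), (7, 0.8199452), (8, 0.02000523),
              (9, 0.020006116)]

# The dominant topic is the pair with the largest probability.
dominant_topic, weight = max(lda_result, key=lambda pair: pair[1])
total = sum(prob for _, prob in lda_result)

print(dominant_topic, weight)  # 7 0.8199452
print(round(total, 4))         # 1.0
```

The probabilities sum to 1 (up to rounding), which is what makes the result a distribution over topics rather than a set of unrelated scores.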

#### Gensim offers a simple way of performing similarity queries using topic models.

In [26]:
from gensim import similarities
 
lda_index = similarities.MatrixSimilarity(lda_model[corpus])


In [27]:
type(lda_index)

gensim.similarities.docsim.MatrixSimilarity

In [28]:
# Let's perform some queries
# (note: this rebinds the name `similarities`, shadowing the gensim module imported above)
similarities = lda_index[lda_model[bow]]

In [30]:
type(similarities)

numpy.ndarray

In [40]:
import numpy as np
sorted_arr = np.sort(similarities)[::-1]

In [37]:
similarities[::-10]

array([0.03216965, 0.03342898, 0.02433511, 0.02944509, 0.02433511,
       0.02470046, 0.09760837, 0.03365789, 0.03988977, 0.03209806,
       0.02586904, 0.0647141 , 0.02608944, 0.02433467, 0.02917915,
       0.02433405, 0.02433375, 0.02915832, 0.0314522 , 0.02433384,
       0.02433363, 0.02433405, 0.03290509, 0.03387373, 0.03743141,
       0.02639742, 0.03577298, 0.02807488, 0.14601803, 0.02766575,
       0.03419813, 0.03195687, 0.0284569 , 0.07921891, 0.03696393,
       0.02462874, 0.05638801, 0.02472343, 0.9973318 , 0.02433363,
       0.02646691, 0.0266231 , 0.03302009, 0.03944929, 0.03434067,
       0.03759175, 0.03495016, 0.02466111, 0.9926223 , 0.03269291],
      dtype=float32)

In [35]:
sorted_arr[:10]

array([0.99762017, 0.99754405, 0.9973318 , 0.9973318 , 0.9973318 ,
       0.9973318 , 0.9926223 , 0.97524023, 0.9657318 , 0.9635604 ],
      dtype=float32)

In [44]:
sorted_similarities = sorted(enumerate(similarities),key = lambda x : -x[1])

In [45]:
sorted_similarities 

[(102, 0.99762017),
 (252, 0.99754405),
 (119, 0.9973318),
 (311, 0.9973318),
 (352, 0.9973318),
 (358, 0.9973318),
 (19, 0.9926223),
 (36, 0.97524023),
 (117, 0.9657318),
 (245, 0.9635604),
 (278, 0.9391573),
 (254, 0.9326615),
 (228, 0.9207625),
 (193, 0.84629834),
 (21, 0.81571317),
 (10, 0.7779896),
 (150, 0.7019821),
 (435, 0.68829703),
 (8, 0.54207504),
 (122, 0.49643984),
 (232, 0.40678567),
 (250, 0.4041347),
 (363, 0.40289044),
 (121, 0.39978585),
 (237, 0.39536878),
 (83, 0.36172813),
 (465, 0.3524224),
 (214, 0.34782773),
 (474, 0.3372036),
 (493, 0.30049428),
 (400, 0.20707911),
 (477, 0.20292608),
 (387, 0.20023753),
 (157, 0.19632539),
 (45, 0.1777074),
 (55, 0.16659969),
 (451, 0.14759742),
 (54, 0.14642057),
 (219, 0.14601803),
 (18, 0.11447427),
 (290, 0.10886304),
 (439, 0.097608365),
 (256, 0.088277206),
 (169, 0.07921891),
 (234, 0.067697264),
 (389, 0.0647141),
 (248, 0.061860748),
 (139, 0.05638801),
 (155, 0.054072462),
 (220, 0.05322205),
 (215, 0.05004117),
 (2

In [47]:
# Top 10 most similar documents:
print(sorted_similarities[:10])

[(102, 0.99762017), (252, 0.99754405), (119, 0.9973318), (311, 0.9973318), (352, 0.9973318), (358, 0.9973318), (19, 0.9926223), (36, 0.97524023), (117, 0.9657318), (245, 0.9635604)]


In [49]:
# Let's see what the most similar document is
document_id, similarity = sorted_similarities[0]
print(data[document_id][:1000])

Individuals possessing unusual gifts and great personal power were transmuted at death into awesome spirits ; ; they were almost immediately worshipped for these newer , even more terrible abilities . Their direct descendants inherited not only their worldly fortunes , but also the mandate of their newfound power as spirits in the other half of the universe . Royal lineages could be based on extraordinary worldly achievements translated into eternal otherworldly power . Thus , the emperor could draw on sources not available to those with less puissant ancestors . But this eminence was not without its weighty responsibilities . Since he possessed more power in an interdependent universe of living beings and dead spirits , the emperor had to use it for the benefit of the living . The royal ritual generated power into the other world : it also provided the living with a way to control the spirits , and bring their powers directly to bear on the everyday affairs of the world . Proper ritua

In [51]:
similarity

0.99762017
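The score above is a cosine similarity between the topic vector of the query and the topic vector of each indexed document, which is the measure `MatrixSimilarity` uses. A minimal pure-Python sketch of that computation follows; the function name and the toy vectors are assumptions, and the sparse `(id, weight)` format mirrors what the Gensim models return.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic vectors over two topics.
query = [(0, 0.1), (1, 0.9)]
docs = [[(0, 0.9), (1, 0.1)], [(0, 0.2), (1, 0.8)], [(1, 1.0)]]

# Rank documents from most to least similar, as done with sorted_similarities above.
ranked = sorted(enumerate(cosine_similarity(query, d) for d in docs),
                key=lambda x: -x[1])
print(ranked)
```

Because cosine similarity only looks at the direction of the vectors, documents dominated by the same topic as the query score close to 1 regardless of their length.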

# Using Scikit-Learn for Topic Modeling

In [52]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

In [53]:
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

In [58]:
data_vectorized.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]], dtype=int64)

In [59]:
# Build a Latent Dirichlet Allocation Model
# (the n_topics argument was renamed to n_components in scikit-learn 0.19)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)



(500, 10)


In [60]:
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

(500, 10)


In [61]:
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

(500, 10)


In [65]:
# Let's see what the first document in the corpus looks like in different topic spaces
print(lda_Z[0])
print('='*100)
print(nmf_Z[0])
print('='*100)
print(lsi_Z[0])

[1.05614792e-04 1.05611165e-04 1.05607235e-04 1.05596724e-04
 1.05609084e-04 1.05596686e-04 4.32074703e-01 5.48583453e-01
 1.86025897e-02 1.05619160e-04]
[0.         0.         2.11450964 0.07698104 0.         0.54218266
 1.06713408 0.         0.         0.24383522]
[ 23.30684073   1.59484762  21.84475453  -0.1770806    0.83952194
  11.70216028   3.91073545  -2.49811138   1.00518376 -14.68453042]


In [67]:
lda_model.components_

array([[ 4.16937195,  3.76977123, 28.32276564, ...,  3.02060139,
         0.95282284,  0.16362471],
       [ 0.11199851,  1.85131404,  0.13787975, ...,  0.1193579 ,
         0.11952565,  0.11303288],
       [ 0.11525117,  0.11375061,  0.111563  , ...,  0.11665562,
         0.16726142,  0.11436419],
       ...,
       [ 4.35904604,  0.12691881,  0.11654586, ...,  4.52117284,
         4.43368908,  0.13898176],
       [ 0.13753706,  0.12124778,  0.24526669, ...,  0.4407562 ,
         0.11700029,  5.03211314],
       [ 0.15215982,  0.1717078 ,  0.3471131 , ...,  2.44650703,
         0.14768766,  4.19188983]])

In order to inspect the inferred topics we need to implement a print function ourselves:

In [68]:
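def print_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once (scikit-learn >= 1.0 names this get_feature_names_out)
    get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    feature_names = get_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so the slice picks the top_n largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])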
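The slice `[:-top_n - 1:-1]` used in `print_topics` is compact but easy to misread. A small NumPy example with made-up weights shows what it selects:

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])  # hypothetical topic-word weights
top_n = 3

order = weights.argsort()     # indices sorted from smallest to largest weight
top = order[:-top_n - 1:-1]   # walk backwards from the end, taking top_n indices

print(order.tolist())  # [0, 2, 4, 1, 3]
print(top.tolist())    # [3, 1, 4]
```

The result lists the indices of the `top_n` largest weights in descending order, which is why the printed words start with the most representative one.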

In [69]:
print("LDA Model:")
print("=" * 100)
print_topics(lda_model, vectorizer)


LDA Model:
Topic 0:
[('used', 260.73135235520107), ('surface', 140.463941266208), ('temperature', 126.64123593872615), ('water', 117.32771333483869), ('time', 116.49945657434408), ('use', 114.64203200913902), ('number', 112.04386167174746), ('data', 101.80392995979544), ('information', 97.20055735214618), ('small', 90.55597405427515)]
Topic 1:
[('clay', 58.434277982909116), ('knife', 40.16053581754376), ('mold', 31.934230684163406), ('cutting', 31.807477027084968), ('pieces', 30.223275492935716), ('place', 28.568821587635213), ('thickness', 27.490727631668438), ('fig', 25.70689757046462), ('cut', 24.99254345645288), ('artery', 23.635523269420226)]
Topic 2:
[('feed', 90.886103626434), ('cattle', 43.109723948432276), ('catholic', 27.28852528625859), ('drug', 25.547511974989764), ('daily', 23.352547631351136), ('meeting', 22.459871644347867), ('chemical', 19.777256079779548), ('head', 19.39407110456644), ('england', 18.881126725468363), ('catholics', 18.59477170553208)]
Topic 3:
[('new', 

In [70]:
print("NMF Model:")
print("=" * 100)
print_topics(nmf_model, vectorizer)

NMF Model:
Topic 0:
[('like', 5.6020906320296815), ('man', 4.233978125191391), ('time', 3.8482689633870146), ('just', 3.4045161139752684), ('did', 3.4002102632740527), ('little', 2.9276203679946335), ('said', 2.824307138911025), ('know', 2.579499469301804), ('way', 2.45890846517182), ('don', 2.4230785475343346)]
Topic 1:
[('new', 5.340065049862802), ('world', 3.933525175308675), ('life', 2.854564162424948), ('man', 2.8324332828645065), ('people', 2.551182295094836), ('god', 2.164931418560605), ('american', 2.1048340927218305), ('church', 2.1024438115801622), ('great', 2.018858770338277), ('social', 2.0009229114493823)]
Topic 2:
[('said', 12.919610069929513), ('city', 1.3865455919758538), ('house', 1.3795782192934252), ('new', 1.2722600887811855), ('home', 1.2475584335570469), ('told', 1.2438100814017814), ('asked', 1.1515042595545026), ('president', 1.0112112525698356), ('police', 0.9596901297507506), ('mike', 0.8413034157751607)]
Topic 3:
[('mrs', 11.852264058358948), ('miss', 1.70699

In [71]:
print("LSI Model:")
print("=" * 100)
print_topics(lsi_model, vectorizer)

LSI Model:
Topic 0:
[('said', 0.2737390118750234), ('new', 0.19420876242907179), ('time', 0.18634773600145163), ('man', 0.16422112236082392), ('like', 0.16151254638566004), ('did', 0.1274120925458355), ('years', 0.11440148855880737), ('just', 0.10896555065814136), ('way', 0.10626155487324442), ('state', 0.1054667026249522)]
Topic 1:
[('state', 0.22243418873120047), ('states', 0.17248362761292516), ('new', 0.1586953471625929), ('united', 0.122687184846442), ('government', 0.10979706446492513), ('year', 0.10783924634946097), ('program', 0.09500714032008974), ('development', 0.08702961666440939), ('federal', 0.08451315462913425), ('use', 0.07985467724027541)]
Topic 2:
[('said', 0.36752991958484155), ('state', 0.3586683848061567), ('mrs', 0.1797159954764812), ('states', 0.17591815767075086), ('year', 0.16294877500535576), ('federal', 0.1355966533464947), ('united', 0.11282676094317032), ('government', 0.11042448609837262), ('tax', 0.11018466758044265), ('fiscal', 0.0973191948584693)]
Topic

Transforming an unseen document goes like this:

In [83]:
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.00289958 0.         0.         0.         0.         0.00438217
 0.         0.         0.         0.00462177]


In [73]:
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.02500377 0.02500453 0.02500005 0.02500002 0.02500001 0.02500001
 0.02500003 0.77497667 0.02500878 0.02500614]


In [81]:
x = lsi_model.transform(vectorizer.transform([text]))[0]
print(x)

[ 0.07701771  0.01116967 -0.018353    0.0017112   0.02272241  0.03308654
 -0.0639079   0.0031346  -0.00747202  0.02178376]


Here’s how to implement the similarity functionality we’ve seen in the gensim section, this time ranking by Euclidean distance (smaller means more similar):

In [77]:
from sklearn.metrics.pairwise import euclidean_distances
 
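def most_similar(x, Z, top_n=5):
    # Distance from the query vector x to every row (document) of Z
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    print(pairs)  # debug: prints the enumerate object itself, not its contents
    # Smaller distance means more similar, so sort ascending
    closest = sorted(pairs, key=lambda item: item[1])[:top_n]
    return closest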

In [84]:
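# Recompute x in NMF space so it lives in the same space as nmf_Z
x = nmf_model.transform(vectorizer.transform([text]))[0]
similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print(document_id)
print(data[document_id][:1000])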

<enumerate object at 0x000000000C60CEE8>
115
Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Franciscans -- regulars , most of them , and trenchermen like himself . They did not co

In [82]:
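# Recompute x in LSI space so it lives in the same space as lsi_Z
x = lsi_model.transform(vectorizer.transform([text]))[0]
similarities = most_similar(x, lsi_Z)
document_id, similarity = similarities[0]
print(document_id)
print(data[document_id][:1000])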

<enumerate object at 0x000000000FA50CF0>
115
Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Franciscans -- regulars , most of them , and trenchermen like himself . They did not co

# Plotting words and documents in 2D with SVD

We can use SVD with 2 components (topics) to display words and documents in 2D. The process is really similar. Let’s start with displaying documents since it’s a bit more straightforward.

In [85]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

In [86]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))

In [87]:
df

           x          y  document
0  23.306846   1.594048         0
1  20.625829   6.273644         1
2  23.361914   7.730232         2
3  17.627841  15.491396         3
4  26.006862   0.837562         4
5  28.497284   6.088844         5
6  18.627743   9.816717         6
7  17.841993  11.398074         7
8  21.107400  -0.997875         8
9  24.257646   4.543724         9


In [88]:
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
# Bokeh 3.x renamed plot_width/plot_height to width/height
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
 

You can go through the documents to check whether documents that sit closer together on the plot are indeed more similar. To display words in 2D, we just need to transpose the vectorized data: words_2d = svd.fit_transform(data_vectorized.T).

In [90]:
data_vectorized.T.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]], dtype=int64)

In [92]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names()
df

          x         y        word
0  0.355058 -0.202161        -day
1  0.242418 -0.087040     -degree
2  0.697432 -0.559980    -degrees
3  0.509785 -0.629486  -degrees-c
4  0.699847 -0.153789       -foot
5  1.213586 -0.337543       -inch
6  0.178087 -0.251501      -month
7  0.349349  0.065152       -yard
8  0.881137 -0.794290       -year
9  1.912024  0.030347   -year-old


In [97]:
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
# Bokeh 3.x renamed plot_width/plot_height to width/height
plot = figure(plot_width=1200, plot_height=1000)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

# More about Latent Dirichlet Allocation

LDA is the most popular method for doing topic modeling in real-world applications. That is because it:

* provides accurate results

* can be trained online (no need to retrain from scratch every time we get new data)

* can be run on multiple cores
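The last two points can be sketched with scikit-learn's `partial_fit` method and `n_jobs` parameter. The two mini-batches below are invented for illustration, and the vocabulary is fixed up front (an assumption of this sketch) so that every batch produces a matrix with the same columns.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus arriving in two batches.
batch_1 = ["the market and the economy grew", "stocks and bonds fell in the market"]
batch_2 = ["the team won the football match", "the coach praised the football team"]

# Fix the vocabulary up front so every batch has the same feature columns.
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(batch_1 + batch_2)

# n_jobs controls how many jobs the E-step uses; raise it (or pass -1) to use more cores.
lda = LatentDirichletAllocation(n_components=2, n_jobs=1, random_state=0)

# partial_fit updates the existing model with each new batch instead of retraining.
for batch in (batch_1, batch_2):
    lda.partial_fit(vectorizer.transform(batch))

doc_topics = lda.transform(vectorizer.transform(batch_1 + batch_2))
print(doc_topics.shape)
print(np.allclose(doc_topics.sum(axis=1), 1.0))
```

With a real stream of documents the loop would simply keep calling `partial_fit` as new batches arrive.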


In [98]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Build a Latent Dirichlet Allocation Model
# (the n_topics argument was renamed to n_components in scikit-learn 0.19)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())
 



[0.02500571 0.02500635 0.02500554 0.02500266 0.0250001  0.02500816
 0.02500002 0.02501343 0.02500488 0.77495314] 1.0


Notice how the factors corresponding to each component (topic) add up to 1. That’s not a coincidence. Indeed, LDA considers documents as being generated by a mixture of the topics. 

#### The purpose of LDA is to compute how much of the document was generated by which topic.

In this example, more than 77% of the document has been generated by the last (10th) topic.
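In the usual LDA notation (these symbols are introduced here and are not used elsewhere in this notebook), write $\theta_d$ for the topic mixture of document $d$ and $\phi_k$ for the word distribution of topic $k$. The probability of seeing word $w$ in document $d$ is then a weighted sum over the $K$ topics:

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \theta_{d,k} \ge 0,
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1
$$

The vector printed above plays the role of $\theta_d$ with $K = 10$, which is why its entries are non-negative and add up to 1.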


In [100]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Collecting funcy (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/47/a4/204fa23012e913839c2da4514b92f17da82bf5fc8c2c3d902fa3fa3c6eec/funcy-1.11-py2.py3-none-any.whl
Building wheels for collected packages: pyLDAvis
  Running setup.py bdist_wheel for pyLDAvis: started
  Running setup.py bdist_wheel for pyLDAvis: finished with status 'done'
  Stored in directory: C:\Users\Plaban_Nayak\AppData\Local\pip\Cache\wheels\98\71\24\513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.11 pyLDAvis-2.1.2



In [103]:
# Newer pyLDAvis releases renamed this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel


Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:

* Larger topics are more frequent in the corpus.

* Topics closer together are more similar, topics further apart are less similar.

* When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent and how discriminating the word is. You can adjust the weight of each property using the slider.

* Hovering over a word adjusts the topic sizes according to how representative the word is for each topic.


As we mentioned before, LDA can be used for automatic tagging. We can go over each topic (pyLDAvis helps a lot) and attach a label to it.
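A minimal sketch of such tagging, assuming the labels have already been chosen by hand: the label names, the `tag_document` helper, and the `min_weight` threshold below are all hypothetical, and the topic distribution would normally come from `lda_model.transform(...)`.

```python
# Hypothetical labels chosen by a human after inspecting each topic's top words.
topic_labels = {0: "science", 1: "crafts", 2: "agriculture"}

def tag_document(topic_distribution, labels, min_weight=0.5):
    """Return the label of the dominant topic, or None if no topic is strong enough."""
    best_topic = max(range(len(topic_distribution)),
                     key=lambda k: topic_distribution[k])
    if topic_distribution[best_topic] < min_weight:
        return None
    # Fall back to a generic name for topics that have not been labeled yet.
    return labels.get(best_topic, "topic_%d" % best_topic)

print(tag_document([0.05, 0.15, 0.80], topic_labels))  # agriculture
print(tag_document([0.40, 0.30, 0.30], topic_labels))  # None
```

The threshold keeps documents with no clearly dominant topic untagged instead of forcing a weak label onto them.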