# What is Topic Modeling?
Topic modeling, in the context of Natural Language Processing, is usually described as a method of uncovering hidden structure in a collection of texts. That is true, but it is not a very useful definition on its own. In more practical terms, a topic model groups words that tend to occur together into topics and describes each document as a mixture of those topics.

# Why is Topic Modeling useful?
There are several scenarios when topic modeling can prove useful. Here are some of them:

<b>Text classification</b> – Topic modeling can improve classification by grouping similar words together into topics rather than using each word as a feature.<br>
<b>Recommender Systems</b> – Using a similarity measure, we can build recommender systems. A system that recommends articles to readers can suggest articles whose topic structure is similar to that of the articles the user has already read.<br>
<b>Uncovering Themes in Texts</b> – Useful for detecting trends in online publications, for example.

# Topic Modeling Algorithms
There are several algorithms for doing topic modeling. The most popular ones include:

<b>LDA</b> – Latent Dirichlet Allocation – The one we’ll be focusing on in this tutorial. It is based on Probabilistic Graphical Models<br>
<b>LSA or LSI</b> – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra<br>
<b>NMF </b>– Non-Negative Matrix Factorization – Based on Linear Algebra
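
To make the linear-algebra view a little more concrete, here is a minimal sketch of the idea behind LSA on a toy document-term matrix. The counts and the choice of two topics are invented for illustration, and the sketch uses NumPy directly rather than any topic modeling library.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns).
# The counts are made up for illustration only.
doc_term = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Thin SVD: doc_term = U @ diag(S) @ Vt, singular values sorted in descending order
U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)

# Keep only the k largest singular values
k = 2
doc_topics = U[:, :k] * S[:k]   # shape (4, 2): documents in the reduced space
topic_terms = Vt[:k, :]         # shape (2, 5): term weights for each component

# Rank-k approximation of the original matrix
approx = doc_topics @ topic_terms
print(doc_topics.shape, topic_terms.shape)
print(np.round(approx, 2))
```

Each row of `doc_topics` is a document expressed in two dimensions instead of five, which is the kind of representation the LSI models later in this tutorial produce.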

# Using Gensim for Topic Modeling

In [4]:
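# The Brown corpus is one of NLTK's data packages; run nltk.download('brown') once if it is missing
from nltk.corpus import brown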

In [8]:
data = []

for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
# One entry per Brown corpus file
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])

500


In [12]:
#!pip install -U gensim

In [13]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

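NUM_TOPICS = 10
# A set makes the stopword membership test fast
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # Lowercase and tokenize, then drop stopwords and any token that does not
    # start with at least three letters or hyphens
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text
                    if t not in STOPWORDS
                    and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text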

# for gensim, we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
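
To see what the filter in `clean_text` keeps and drops, here is a small standalone sketch. The stopword set and the sample tokens are hypothetical stand-ins chosen for illustration, so the snippet runs without NLTK.

```python
import re

# Same pattern as in clean_text: one letter/hyphen followed by at least two more
TOKEN_PATTERN = re.compile(r'[a-zA-Z\-][a-zA-Z\-]{2,}')

# Hypothetical, tiny stand-in for NLTK's English stopword list
SAMPLE_STOPWORDS = {"the", "and", "was"}

def keep_token(token):
    # re.match only anchors at the start of the token, not at the end
    return token not in SAMPLE_STOPWORDS and TOKEN_PATTERN.match(token) is not None

tokens = ["the", "jury", "term-end", "of", "1961", "ok", "abc123", "---"]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

Because `re.match` does not require the whole token to match, tokens such as `abc123` and `---` pass the filter. If that is not what you want, `re.fullmatch` with the same pattern is a stricter alternative.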
    


In [14]:
tokenized_data[:20]

[['fulton',
  'county',
  'grand',
  'jury',
  'said',
  'friday',
  'investigation',
  'atlanta',
  'recent',
  'primary',
  'election',
  'produced',
  'evidence',
  'irregularities',
  'took',
  'place',
  'jury',
  'said',
  'term-end',
  'presentments',
  'city',
  'executive',
  'committee',
  'over-all',
  'charge',
  'election',
  'deserves',
  'praise',
  'thanks',
  'city',
  'atlanta',
  'manner',
  'election',
  'conducted',
  'september-october',
  'term',
  'jury',
  'charged',
  'fulton',
  'superior',
  'court',
  'judge',
  'durwood',
  'pye',
  'investigate',
  'reports',
  'possible',
  'irregularities',
  'hard-fought',
  'primary',
  'mayor-nominate',
  'ivan',
  'allen',
  'relative',
  'handful',
  'reports',
  'received',
  'jury',
  'said',
  'considering',
  'widespread',
  'interest',
  'election',
  'number',
  'voters',
  'size',
  'city',
  'jury',
  'said',
  'find',
  'many',
  'georgia',
  'registration',
  'election',
  'laws',
  'outmoded',
  'inadequ

In [15]:
# Build a Dictionary: a mapping from each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)

In [21]:
from itertools import islice

# The dictionary holds thousands of entries, so print only the first ten
for item in islice(dictionary.items(), 10):
    print(item)

(0, 'accepted')
(1, 'according')
(2, 'achieve')
(3, 'act')
(4, 'action')
(5, 'actions')
(6, 'added')
(7, 'additional')
(8, 'adjournment')
(9, 'adjustments')

In [22]:
# Transform the collection of text to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
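
If the bag-of-words step feels opaque, the following pure-Python sketch shows roughly what it does. It is not gensim's implementation: the toy documents are invented, and the ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

# Toy tokenized documents (invented for illustration)
docs = [
    ["jury", "said", "election", "jury"],
    ["election", "laws", "said"],
]

# Word -> id mapping, ids assigned in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    # Count occurrences per word id, skipping words missing from the mapping
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Each document ends up as a sorted list of `(word_id, count)` pairs, the same shape as the entries of `corpus`.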

In [25]:
# Have a look at document 20 in bag-of-words form: [(word_id, count), ...]
corpus[20]

[(12, 3),
 (14, 1),
 (21, 1),
 (25, 5),
 (30, 2),
 (31, 5),
 (33, 1),
 (42, 1),
 (43, 2),
 (44, 2),
 (45, 2),
 (46, 2),
 (47, 2),
 (49, 1),
 (50, 1),
 (53, 1),
 (56, 1),
 (59, 1),
 (60, 1),
 (66, 1),
 (75, 1),
 (80, 1),
 (98, 1),
 (101, 1),
 (106, 1),
 (117, 1),
 (129, 1),
 (130, 2),
 (132, 2),
 (135, 2),
 (140, 1),
 (141, 2),
 (143, 4),
 (144, 2),
 (145, 2),
 (166, 1),
 (195, 1),
 (198, 3),
 (219, 1),
 (220, 4),
 (221, 3),
 (223, 1),
 (229, 4),
 (230, 4),
 (231, 2),
 (235, 1),
 (236, 1),
 (242, 2),
 (246, 2),
 (255, 1),
 (263, 1),
 (269, 1),
 (270, 5),
 (271, 2),
 (275, 5),
 (276, 1),
 (278, 4),
 (280, 2),
 (281, 1),
 (307, 2),
 (310, 1),
 (311, 3),
 (313, 1),
 (314, 5),
 (318, 4),
 (322, 1),
 (336, 1),
 (338, 3),
 (339, 1),
 (340, 1),
 (341, 1),
 (345, 1),
 (346, 1),
 (351, 1),
 (354, 1),
 (355, 1),
 (366, 3),
 (368, 13),
 (370, 1),
 (372, 1),
 (374, 3),
 (377, 3),
 (381, 3),
 (386, 1),
 (392, 6),
 (396, 1),
 (401, 1),
 (412, 2),
 (426, 2),
 (428, 2),
 (431, 2),
 (434, 2),
 (439, 2),

In [26]:
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)



In [29]:
print("LDA MODEL: ")
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topics #%s:"%idx, lda_model.print_topic(idx, 10))

print("="*20)

print("LSI MODEL: ")
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topics #%s:"%idx, lsi_model.print_topic(idx, 10))

print("="*20)

LDA MODEL: 
Topics #0: 0.007*"one" + 0.004*"would" + 0.003*"could" + 0.003*"time" + 0.003*"first" + 0.003*"may" + 0.003*"new" + 0.002*"man" + 0.002*"two" + 0.002*"made"
Topics #1: 0.007*"would" + 0.006*"one" + 0.005*"said" + 0.004*"new" + 0.003*"time" + 0.003*"could" + 0.002*"like" + 0.002*"even" + 0.002*"two" + 0.002*"made"
Topics #2: 0.006*"one" + 0.005*"would" + 0.005*"could" + 0.004*"new" + 0.003*"time" + 0.003*"first" + 0.003*"said" + 0.002*"two" + 0.002*"state" + 0.002*"may"
Topics #3: 0.005*"one" + 0.005*"would" + 0.004*"new" + 0.004*"said" + 0.003*"two" + 0.003*"time" + 0.003*"even" + 0.003*"could" + 0.002*"man" + 0.002*"like"
Topics #4: 0.006*"would" + 0.005*"one" + 0.004*"said" + 0.003*"could" + 0.003*"new" + 0.003*"may" + 0.003*"time" + 0.003*"man" + 0.002*"first" + 0.002*"two"
Topics #5: 0.006*"one" + 0.004*"would" + 0.004*"said" + 0.003*"time" + 0.003*"may" + 0.003*"two" + 0.003*"could" + 0.003*"first" + 0.003*"made" + 0.002*"new"
Topics #6: 0.005*"one" + 0.005*"would" + 0

### Let’s now put the models to work and transform unseen documents to their topic distribution:

In [30]:
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))

print(lsi_model[bow])

print(lda_model[bow])

[(0, 0.09161422216323217), (1, 0.008795453970450224), (2, -0.016126281735796222), (3, 0.0419577356860358), (4, 0.015798897931842155), (5, -0.013310106124666876), (6, -0.029027627213179377), (7, 0.013920019655489536), (8, 0.055082325342692404), (9, -0.024999248648203926)]
[(0, 0.020017121), (1, 0.02001808), (2, 0.02001696), (3, 0.020017436), (4, 0.02001796), (5, 0.020017637), (6, 0.0200171), (7, 0.020017354), (8, 0.819844), (9, 0.020016363)]
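
The two outputs above are read differently. The LDA result is a probability distribution over topics, while the LSI result is a set of coordinates that can be negative. As a small sketch, using the LDA values printed above, we can check that the weights sum to one and pick the dominant topic:

```python
# LDA output copied from the cell above: (topic_id, probability) pairs
lda_output = [(0, 0.020017121), (1, 0.02001808), (2, 0.02001696), (3, 0.020017436),
              (4, 0.02001796), (5, 0.020017637), (6, 0.0200171), (7, 0.020017354),
              (8, 0.819844), (9, 0.020016363)]

# The probabilities should add up to (approximately) 1
total = sum(weight for _, weight in lda_output)

# The dominant topic is the pair with the largest probability
topic_id, weight = max(lda_output, key=lambda pair: pair[1])
print(round(total, 3))
print(topic_id, weight)
```

Keep in mind that the topic numbering is arbitrary and can change between training runs.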


# Gensim offers a simple way of performing similarity queries using topic models.

In [35]:
from gensim import similarities

lda_index = similarities.MatrixSimilarity(lda_model[corpus])
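
`MatrixSimilarity` ranks documents by cosine similarity between vectors. The sketch below shows that measure in plain Python on hypothetical topic distributions; the vectors are invented and the helper `cosine_similarity` is written here for illustration, it is not a gensim function.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic distributions for three documents over three topics
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
]
query = [0.80, 0.10, 0.10]

# Rank documents from most to least similar to the query
ranked = sorted(
    ((i, cosine_similarity(query, vec)) for i, vec in enumerate(doc_topics)),
    key=lambda item: -item[1],
)
print(ranked)
```

Documents 0 and 2 share the query's dominant topic, so they score close to 1, while document 1 scores much lower.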


In [37]:
# Let's perform some queries
lda_similarities = lda_index[lda_model[bow]]
# Sort by similarity, highest first
# (a separate name avoids shadowing the imported `similarities` module)
lda_similarities = sorted(enumerate(lda_similarities), key=lambda item: -item[1])

# The ten most similar documents
print(lda_similarities[:10])


[(1, 0.9983691), (254, 0.997828), (70, 0.9976204), (328, 0.99752945), (138, 0.99732924), (140, 0.99732924), (198, 0.99732924), (335, 0.99732924), (428, 0.99732924), (477, 0.99732924)]


In [48]:
# Let's see the most similar document; slice the string before printing it
document_id, similarity = lda_similarities[0]
print(data[document_id][:1000])

Austin , Texas -- Committee approval of Gov. Price Daniel's `` abandoned property '' act seemed certain Thursday despite the adamant protests of Texas bankers . Daniel personally led the fight for the measure , which he had watered down considerably since its rejection by two previous Legislatures , in a public hearing before the House Committee on Revenue and Taxation . Under committee rules , it went automatically to a subcommittee for one week . But questions with which committee members taunted bankers appearing as witnesses left little doubt that they will recommend passage of it . Daniel termed `` extremely conservative '' his estimate that it would produce 17 million dollars to help erase an anticipated deficit of 63 million dollars at the end of the current fiscal year next Aug. 31 . He told the committee the measure would merely provide means of enforcing the escheat law which has been on the books `` since Texas was a republic '' . It permits the state to take over bank accou

In [47]:
data[1]

"Austin , Texas -- Committee approval of Gov. Price Daniel's `` abandoned property '' act seemed certain Thursday despite the adamant protests of Texas bankers . Daniel personally led the fight for the measure , which he had watered down considerably since its rejection by two previous Legislatures , in a public hearing before the House Committee on Revenue and Taxation . Under committee rules , it went automatically to a subcommittee for one week . But questions with which committee members taunted bankers appearing as witnesses left little doubt that they will recommend passage of it . Daniel termed `` extremely conservative '' his estimate that it would produce 17 million dollars to help erase an anticipated deficit of 63 million dollars at the end of the current fiscal year next Aug. 31 . He told the committee the measure would merely provide means of enforcing the escheat law which has been on the books `` since Texas was a republic '' . It permits the state to take over bank acco

# Using Scikit-Learn for Topic Modeling

In [50]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 10
# A raw string keeps the regex backslashes from being read as string escapes
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
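
The `min_df` and `max_df` arguments prune the vocabulary by document frequency: an integer is read as an absolute number of documents and a float as a proportion of the corpus. Here is a rough pure-Python sketch of that filtering on invented documents. It only illustrates the rule and is not scikit-learn's code; the thresholds are scaled down to fit the toy data.

```python
# Toy corpus: each document reduced to its set of distinct terms (invented)
docs = [
    {"tax", "year", "the"},
    {"tax", "school", "the"},
    {"school", "year", "the"},
    {"fiscal", "the"},
]

min_df = 2      # absolute count: keep terms found in at least 2 documents
max_df = 0.9    # proportion: drop terms found in more than 90% of documents

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for doc in docs:
    for term in doc:
        doc_freq[term] = doc_freq.get(term, 0) + 1

n_docs = len(docs)
vocabulary = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
print(vocabulary)
```

Here `the` is dropped for being too common and `fiscal` for being too rare, which is the effect `min_df=5, max_df=0.9` has on the Brown corpus vocabulary at a larger scale.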


In [57]:
# Build a Latent Dirichlet Allocation model
# (`n_components` replaced the older `n_topics` argument in scikit-learn)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NUM_TOPICS)



(500, 10)


In [58]:
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

(500, 10)


In [59]:
# Build a Latent Semantic Indexing model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NUM_TOPICS)

(500, 10)


In [60]:
# Let's see what the first document in the corpus looks like in the different topic spaces
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])

[6.74533774e-02 1.05598894e-04 1.05611674e-04 1.05609068e-04
 1.05614042e-04 1.05596725e-04 1.05622497e-04 1.05611091e-04
 9.31701741e-01 1.05617567e-04]
[0.         0.         2.11447062 0.07697085 0.         0.54302051
 1.0679036  0.         0.         0.24610751]
[ 23.30684189   1.59496631  21.81192637   0.02843286   0.82193236
  11.55822573   4.13259287  -2.16724643   1.28621153 -14.48700975]
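
The three rows above live on very different scales, which matters when comparing documents later. As a quick check, using the values printed above, the LDA row behaves like a probability distribution, the NMF row is non-negative but not normalized, and the LSI row contains negative coordinates:

```python
# Rows copied from the output above
lda_row = [6.74533774e-02, 1.05598894e-04, 1.05611674e-04, 1.05609068e-04,
           1.05614042e-04, 1.05596725e-04, 1.05622497e-04, 1.05611091e-04,
           9.31701741e-01, 1.05617567e-04]
nmf_row = [0.0, 0.0, 2.11447062, 0.07697085, 0.0, 0.54302051,
           1.0679036, 0.0, 0.0, 0.24610751]
lsi_row = [23.30684189, 1.59496631, 21.81192637, 0.02843286, 0.82193236,
           11.55822573, 4.13259287, -2.16724643, 1.28621153, -14.48700975]

lda_sum = sum(lda_row)                     # close to 1: a distribution over topics
nmf_min = min(nmf_row)                     # never below 0
lsi_min = min(lsi_row)                     # can be negative
lda_top = lda_row.index(max(lda_row))      # dominant LDA topic for this document
print(round(lda_sum, 6), nmf_min, lsi_min, lda_top)
```

Only the LDA weights can be read directly as "how much of this document belongs to each topic".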


In [61]:
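def print_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once instead of once per word.
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % idx)
        # argsort() is ascending, so walk backwards to get the top_n largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])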

print("LDA MODEL:")
print_topics(lda_model, vectorizer)
print("="*20)
    
print("NMF MODEL:")
print_topics(nmf_model, vectorizer)
print("="*20)

print("LSI MODEL:")
print_topics(lsi_model, vectorizer)
print("="*20)

LDA MODEL:
Topic 0:
[('said', 1330.481457037019), ('like', 683.8213631997498), ('just', 517.9249160840924), ('man', 469.05301437291683), ('did', 434.8781552321674), ('got', 419.7989888662137), ('time', 414.1774204120324), ('don', 413.8891758583642), ('didn', 384.2324449782051), ('little', 373.2081041682479)]
Topic 1:
[('vacation', 11.070741683211683), ('visit', 9.814330378361838), ('park', 6.772043485853762), ('midwest', 5.1937568069753075), ('pictures', 4.358139360465948), ('festival', 4.321331034000411), ('northeast', 4.245211389222703), ('michigan', 4.106477160373366), ('tour', 4.101270306232989), ('famous', 3.9836359767920544)]
Topic 2:
[('used', 203.25099657400648), ('feed', 98.75733646359501), ('temperature', 90.73425404967541), ('surface', 81.47830024967152), ('data', 80.79679133383357), ('cells', 73.51153387486163), ('index', 72.91810413595488), ('radiation', 70.95833825508628), ('volume', 69.24914549069912), ('number', 67.71409624969544)]
Topic 3:
[('operator', 30.163599968592

[('united', 0.2726559164533169), ('states', 0.23418886064412672), ('mrs', 0.22014729701654986), ('shall', 0.19211379719392527), ('school', 0.17044958687215928), ('government', 0.16646161925658895), ('section', 0.12420290478238397), ('act', 0.11418656025564448), ('agreement', 0.10990994978342863), ('india', 0.09444805815655859)]
Topic 8:
[('form', 0.3221938747316226), ('dictionary', 0.3000492601094134), ('information', 0.2962435008476408), ('text', 0.22674283107063034), ('cell', 0.19080248467974342), ('forms', 0.18755105142414272), ('year', 0.17574930233131048), ('tax', 0.1538991822514836), ('list', 0.13453643974683593), ('fiscal', 0.13360242016538665)]
Topic 9:
[('year', 0.2587433862449727), ('fiscal', 0.2586852727105204), ('tax', 0.18633999202240803), ('school', 0.1702099522913236), ('time', 0.1383759193509987), ('like', 0.11344531755506945), ('states', 0.11144374251576465), ('years', 0.10761836951991578), ('child', 0.07699251948826631), ('towns', 0.07549920481774683)]
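
The slice `[:-top_n-1:-1]` used in `print_topics` can look cryptic. Here is a small sketch of what it does, using a plain Python list in place of a NumPy array (list slicing follows the same rules). The weights are hypothetical.

```python
weights = [0.1, 0.7, 0.3, 0.9, 0.5]   # hypothetical weights of 5 terms in one topic
top_n = 3

# Pure-Python counterpart of argsort(): indices ordered by ascending weight
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Start at the last index, step backwards, and stop after top_n items
top_indices = ascending[:-top_n - 1:-1]
print(ascending)
print(top_indices)
```

The result lists the positions of the three largest weights, largest first, which is exactly the order in which the topic words are printed.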


# Transforming an unseen document goes like this:

In [70]:
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)


[0.00289985 0.         0.         0.         0.         0.0043889
 0.         0.         0.         0.00466501]


In [73]:
from sklearn.metrics.pairwise import euclidean_distances

def most_similar(x, Z, top_n=5):
    # Rank the rows of Z by Euclidean distance to x, closest first
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    closest = sorted(pairs, key=lambda item: item[1])[:top_n]
    return closest

similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print(data[document_id][:1000])

Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Franciscans -- regulars , most of them , and trenchermen like himself . They did not complain at the inhuman hour of starting ( seve
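
The closest document here has little to do with the economy, and the distance measure may be part of the reason. Euclidean distance depends on vector length, and the query vector above is very short, so documents with small NMF weights can end up nearest regardless of their topic mix. The sketch below illustrates this effect with invented vectors. It is an illustration only, not a diagnosis of this particular result.

```python
import math

# Hypothetical NMF-style document vectors (non-negative, not normalized)
doc_vectors = [
    [0.0, 2.1, 0.1],
    [0.003, 0.0, 0.004],
    [1.5, 0.0, 0.2],
]
# A short query vector, similar in scale to the NMF output for a one-line text
query = [0.003, 0.0, 0.005]

# Rank documents by Euclidean distance, closest first
ranked = sorted(
    ((i, math.dist(query, vec)) for i, vec in enumerate(doc_vectors)),
    key=lambda item: item[1],
)
print(ranked)
```

Document 1 wins because its vector is as short as the query's. A length-independent measure such as cosine similarity is one alternative worth trying.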

# Plotting words and documents in 2D with SVD


In [74]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()
 

In [75]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
 