LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. Such a matrix describes a corpus of N documents D1, D2, D3 … Dn and a vocabulary of M words W1, W2 … Wm. The value of cell (i, j) gives the frequency count of word Wj in document Di.
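As a minimal sketch of the representation described above, the following builds a document-term matrix for a made-up three-document corpus using only the standard library. The toy documents and the names `toy_docs`, `vocab` and `dt_matrix` are illustrative assumptions and are not part of the newsgroups data used later.

```python
# Toy corpus: three already-tokenized documents (hypothetical data).
toy_docs = [
    ["car", "engine", "car"],
    ["engine", "bumper"],
    ["game", "team", "game", "game"],
]

# Vocabulary: the sorted unique words W1..Wm.
vocab = sorted({word for doc in toy_docs for word in doc})

# dt_matrix[i][j] = number of times vocab[j] occurs in toy_docs[i].
dt_matrix = [[doc.count(word) for word in vocab] for doc in toy_docs]

print(vocab)
for row in dt_matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, so the matrix has N rows and M columns.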

LDA converts this document-term matrix into two lower-dimensional matrices, M1 and M2.
M1 is a document-topics matrix and M2 is a topic-terms matrix, with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.
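To make the dimensions concrete, here is a small sketch with hand-picked numbers (not learned from any data). It assumes that each row of M1 is a distribution over topics and each row of M2 a distribution over words, so their product gives an expected word distribution per document rather than raw counts. The helper `matmul` and all values are hypothetical.

```python
N, K, M = 3, 2, 4  # documents, topics, vocabulary size (toy values)

# M1: document-topics matrix, shape (N, K); each row sums to 1.
M1 = [[0.9, 0.1],
      [0.5, 0.5],
      [0.2, 0.8]]

# M2: topic-terms matrix, shape (K, M); each row sums to 1.
M2 = [[0.7, 0.3, 0.0, 0.0],
      [0.0, 0.0, 0.4, 0.6]]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Shape (N, M): one word distribution per document.
doc_word = matmul(M1, M2)
for row in doc_word:
    print([round(p, 2) for p in row])
```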

Alpha and Beta Hyperparameters – alpha represents document-topic density and beta represents topic-word density. The higher the value of alpha, the more topics each document is composed of; the lower the value of alpha, the fewer topics each document contains. Likewise, with a higher beta, topics are composed of a large number of the words in the corpus, and with a lower beta they are composed of only a few words.
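The effect of alpha can be illustrated by sampling topic mixtures from a symmetric Dirichlet distribution, which is the prior LDA places on them. This is a rough sketch using only the standard library; the function names, the choice of 10 topics and the two alpha values are assumptions made for illustration.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one point from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=10, trials=2000, seed=0):
    """Average weight of the single largest topic over many sampled mixtures."""
    rng = random.Random(seed)
    samples = (sample_symmetric_dirichlet(concentration, size, rng)
               for _ in range(trials))
    return sum(max(s) for s in samples) / trials

low_alpha_share = mean_largest_share(0.1)    # sparse mixtures expected
high_alpha_share = mean_largest_share(10.0)  # flatter mixtures expected
print("alpha=0.1  mean largest topic share:", round(low_alpha_share, 2))
print("alpha=10.0 mean largest topic share:", round(high_alpha_share, 2))
```

With a small alpha one topic tends to dominate each sampled mixture, while a large alpha spreads the weight more evenly over all topics. The same reasoning applies to beta and the word distribution of each topic.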

The collection of all text documents is known as the corpus. To run a mathematical model on a text corpus, it is good practice to first convert it into a matrix representation. The LDA model then looks for repeating term patterns across the entire document-term (DT) matrix.

In [48]:
import pandas as pd
# Load the 20 Newsgroups data set from a public JSON file
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
df.head(10)

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
10,From: irwin@cmptrc.lonestar.org (Irwin Arnstei...,8,rec.motorcycles
100,From: tchen@magnus.acs.ohio-state.edu (Tsung-K...,6,misc.forsale
1000,From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\n...,2,comp.os.ms-windows.misc
10000,From: a207706@moe.dseg.ti.com (Robert Loper)\n...,7,rec.autos
10001,From: kimman@magnus.acs.ohio-state.edu (Kim Ri...,6,misc.forsale
10002,From: kwilson@casbah.acns.nwu.edu (Kirtley Wil...,2,comp.os.ms-windows.misc
10003,Subject: Re: Don't more innocents die without ...,0,alt.atheism
10004,From: livesey@solntze.wpd.sgi.com (Jon Livesey...,0,alt.atheism


In [49]:
data_text = df[['content']].copy()  # copy first so the new column is not set on a slice of df
data_text['index'] = data_text.index
documents = data_text
documents.shape

(11314, 2)

In [50]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer, PorterStemmer
import numpy as np
import nltk

np.random.seed(2018)
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Tokenization: Split the text into sentences and the sentences into words. 
Lowercase the words and remove punctuation.
Words that have 3 or fewer characters are removed.
All stopwords are removed.
Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
Words are stemmed — words are reduced to their root form.

In [0]:
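porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its Porter stem.
    return porter.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains.
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result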

PorterStemmer uses suffix stripping to produce stems. For example, it reduces the word "cats" to the stem "cat" by removing the 's', the suffix that makes the word plural.
The Porter algorithm does not rely on linguistic analysis; instead it applies a fixed set of rules, organized into five phases that run one after another, to generate stems.
This is why PorterStemmer often generates stems that are not actual English words. It does not keep a lookup table of valid stems but applies algorithmic rules, which also decide whether it is wise to strip a suffix.
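The idea of rule-based suffix stripping can be sketched in a few lines. The rule list below is modelled on the first plural-handling step of Porter's algorithm (SSES → SS, IES → I, SS → SS, S → nothing); the real stemmer applies further phases and extra conditions, so `toy_stem` is a simplified illustration and not the NLTK implementation.

```python
# Ordered (suffix, replacement) rules; the first matching suffix wins.
TOY_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["cats", "caresses", "ponies", "caress"]:
    print(word, "->", toy_stem(word))
```

Because the rules only look at the ending, "ponies" becomes "poni", which is a usable stem but not an English word.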

SnowballStemmer supports several languages in addition to English, so it can be used to create non-English stemmers.

The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm whose rules are stored externally, in a single table of about 120 rules indexed by the last letter of a suffix. On each iteration, it tries to find an applicable rule by the last character of the word. Each rule specifies either a deletion or a replacement of an ending. If there is no such rule, it terminates. It also terminates if a word starts with a vowel and only two letters are left, or if a word starts with a consonant and only three characters are left. Otherwise, the rule is applied and the process repeats. LancasterStemmer is simple, but because of the iterations it stems heavily and over-stemming may occur. Over-stemming produces stems that are not linguistic units or that have no meaning.
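The iterative loop described above can be sketched as follows. The four-rule table is made up for illustration (the real Paice/Husk table has about 120 rules and more detailed conditions), and the length check is applied before each rule, which is a simplification.

```python
VOWELS = set("aeiou")

# Hypothetical mini rule table, indexed by the last letter of the ending.
# Each rule is (ending, replacement); all of them are deletions here.
TOY_RULES = {
    "s": [("s", "")],
    "g": [("ing", "")],
    "n": [("ion", "")],
    "t": [("at", "")],
}

def too_short(word):
    """Stop at 2 letters for vowel-initial words, 3 for consonant-initial ones."""
    limit = 2 if word[:1] in VOWELS else 3
    return len(word) <= limit

def toy_iterative_stem(word):
    while not too_short(word):
        for ending, replacement in TOY_RULES.get(word[-1], []):
            if word.endswith(ending):
                word = word[: len(word) - len(ending)] + replacement
                break  # rule applied, start another iteration
        else:
            return word  # no applicable rule: terminate
    return word

for word in ["cats", "owns", "relations"]:
    print(word, "->", toy_iterative_stem(word))
```

The word "relations" is stripped three times in a row (s, ion, at) and ends up as "rel", which shows how repeated rule application can over-stem.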

In [70]:
doc_sample = documents[documents['index'] == 0].values[0][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['From:', 'lerxst@wam.umd.edu', "(where's", 'my', 'thing)\nSubject:', 'WHAT', 'car', 'is', 'this!?\nNntp-Posting-Host:', 'rac3.wam.umd.edu\nOrganization:', 'University', 'of', 'Maryland,', 'College', 'Park\nLines:', '15\n\n', 'I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw\nthe', 'other', 'day.', 'It', 'was', 'a', '2-door', 'sports', 'car,', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/\nearly', '70s.', 'It', 'was', 'called', 'a', 'Bricklin.', 'The', 'doors', 'were', 'really', 'small.', 'In', 'addition,\nthe', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body.', 'This', 'is', '\nall', 'I', 'know.', 'If', 'anyone', 'can', 'tellme', 'a', 'model', 'name,', 'engine', 'specs,', 'years\nof', 'production,', 'where', 'this', 'car', 'is', 'made,', 'history,', 'or', 'whatever', 'info', 'you\nhave', 'on', 'this', 'funky', 'looking', 'car,', 'please', 'e-mail.\n\nThanks,\n-', 

In [53]:
processed_docs = documents['content'].map(preprocess)
processed_docs[:10]

0        [lerxst, thing, subject, nntp, post, host, org...
1        [guykuo, carson, washington, subject, clock, p...
10       [irwin, cmptrc, lonestar, irwin, arnstein, sub...
100      [tchen, magnu, ohio, state, tsung, chen, subje...
1000     [dabl, lindbergh, subject, diamond, mous, curs...
10000    [dseg, robert, loper, subject, nntp, post, hos...
10001    [kimman, magnu, ohio, state, richard, subject,...
10002    [kwilson, casbah, acn, kirtley, wilson, subjec...
10003    [subject, innoc, death, penalti, bobb, vice, r...
10004    [livesey, solntz, livesey, subject, genocid, c...
Name: content, dtype: object

corpora.dictionary – Construct word<->id mappings 
This module implements the concept of a Dictionary – a mapping between words and their integer ids.
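Conceptually, such a mapping is just a dictionary from tokens to consecutive integers. The sketch below is a plain-Python stand-in, not gensim's code; it assumes that new tokens are numbered in sorted order within each document, and the function name `build_token2id` is hypothetical.

```python
def build_token2id(tokenized_docs):
    """Assign each previously unseen token the next free integer id."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [["door", "bumper", "door"], ["engine", "bumper"]]
token2id = build_token2id(toy_docs)
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id)
print(id2token)
```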

In [54]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten


Filter out tokens that appear in
fewer than 15 documents (absolute number) or
more than 50% of the documents (no_above=0.5 is a fraction of the total corpus size, not an absolute number).
After these two steps, keep only the 100000 most frequent tokens.
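The filtering rule can be sketched in plain Python. This is an approximation of the behaviour described above, not gensim's implementation: it assumes "frequency" means the number of documents containing a token, and the name `toy_filter_extremes` is hypothetical.

```python
from collections import Counter

def toy_filter_extremes(tokenized_docs, no_below, no_above, keep_n):
    """Return the tokens that survive the document-frequency filter."""
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    kept = [token for token, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

toy_docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
print(toy_filter_extremes(toy_docs, no_below=2, no_above=0.5, keep_n=10))
print(toy_filter_extremes(toy_docs, no_below=1, no_above=1.0, keep_n=2))
```

In the first call "a" is dropped because it occurs in every document, and "c" and "d" are dropped because they occur in only one.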

In [0]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

doc2bow(document, allow_update=False, return_missing=False)
Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.


Parameters:
document (list of str) – Input document.

allow_update (bool, optional) – Update self, by adding new tokens from document and updating internal corpus statistics.

return_missing (bool, optional) – Return missing tokens (tokens present in document but not in self) with frequencies?
Returns:

list of (int, int) – BoW representation of document.

list of (int, int), dict of (str, int) – If return_missing is True, return BoW representation of document + dictionary with missing tokens and their frequencies.
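What the conversion does can be shown with a small stand-in written with the standard library. The function `toy_doc2bow` and the three-word mapping are hypothetical; tokens missing from the mapping are simply ignored, mirroring the default where missing tokens are not returned.

```python
from collections import Counter

def toy_doc2bow(document, token2id):
    """Count known tokens and return sorted (token_id, token_count) tuples."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())

token2id = {"bumper": 0, "door": 1, "engine": 2}
bow = toy_doc2bow(["door", "door", "engine", "wheel"], token2id)
print(bow)
```

Here "wheel" is not in the mapping, so only "door" (twice) and "engine" (once) end up in the bag-of-words vector.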

In [71]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 2),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 2),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1)]

In [73]:
bow_doc_0 = bow_corpus[0]
for token_id, token_count in bow_doc_0:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 0 ("addit") appears 1 time.
Word 1 ("bodi") appears 1 time.
Word 2 ("bring") appears 1 time.
Word 3 ("bumper") appears 1 time.
Word 4 ("call") appears 1 time.
Word 5 ("colleg") appears 1 time.
Word 6 ("door") appears 2 time.
Word 7 ("earli") appears 1 time.
Word 8 ("engin") appears 1 time.
Word 9 ("enlighten") appears 1 time.
Word 10 ("histori") appears 1 time.
Word 11 ("host") appears 1 time.
Word 12 ("info") appears 1 time.
Word 13 ("know") appears 1 time.
Word 14 ("late") appears 1 time.
Word 15 ("look") appears 2 time.
Word 16 ("mail") appears 1 time.
Word 17 ("maryland") appears 1 time.
Word 18 ("model") appears 1 time.
Word 19 ("neighborhood") appears 1 time.
Word 20 ("nntp") appears 1 time.
Word 21 ("park") appears 1 time.
Word 22 ("product") appears 1 time.
Word 23 ("rest") appears 1 time.
Word 24 ("separ") appears 1 time.
Word 25 ("small") appears 1 time.
Word 26 ("spec") appears 1 time.
Word 27 ("sport") appears 1 time.
Word 28 ("thank") appears 1 time.
Word 29 ("thing")

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling.

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
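A hand-rolled version of this weighting might look as follows. It assumes the weight is count × log2(N / df), followed by scaling each document vector to unit length; that is my understanding of a common default and is not stated in the text above. The function `toy_tfidf` and the toy vectors are hypothetical.

```python
import math

def toy_tfidf(bow_docs):
    """Weight (token_id, count) vectors by tf-idf and L2-normalize each one."""
    num_docs = len(bow_docs)
    doc_freq = {}
    for bow in bow_docs:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for bow in bow_docs:
        vec = [(token_id, count * math.log2(num_docs / doc_freq[token_id]))
               for token_id, count in bow]
        vec = [(token_id, w) for token_id, w in vec if w > 0]  # drop zero weights
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted_docs.append([(token_id, w / norm) for token_id, w in vec])
    return weighted_docs

toy_bow = [[(0, 1), (1, 2), (2, 1)], [(0, 1), (2, 1)], [(0, 3), (3, 1)]]
for doc in toy_tfidf(toy_bow):
    print([(token_id, round(w, 3)) for token_id, w in doc])
```

Token 0 occurs in every document, so its weight is zero and it disappears, while token 1 (frequent in one document, absent elsewhere) receives the largest weight in the first document.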

The pprint module provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter. If the formatted structures include objects which are not fundamental Python types, the representation may not be loadable.

In [58]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.16531596959995393),
 (1, 0.16785300112088367),
 (2, 0.15019839074046323),
 (3, 0.2858093866531402),
 (4, 0.11438968484597732),
 (5, 0.15167771657985907),
 (6, 0.3893228656323328),
 (7, 0.16890540828616113),
 (8, 0.12279310710364673),
 (9, 0.26345362807652206),
 (10, 0.164457883902288),
 (11, 0.041906971663713447),
 (12, 0.13943545149619074),
 (13, 0.0532662930087635),
 (14, 0.17840425276381963),
 (15, 0.16145581407375503),
 (16, 0.1018221519189483),
 (17, 0.2333916943145223),
 (18, 0.16220115600873633),
 (19, 0.2804570052453337),
 (20, 0.042646900587070304),
 (21, 0.18493988543555143),
 (22, 0.1486736252074099),
 (23, 0.15971058881319783),
 (24, 0.18156419819616518),
 (25, 0.14476899410325858),
 (26, 0.20971806193095166),
 (27, 0.19290262455075843),
 (28, 0.08432119282836148),
 (29, 0.08377121895934883),
 (30, 0.04441770413224336),
 (31, 0.13721451448079763),
 (32, 0.08432119282836148)]


Train the LDA model using gensim.models.LdaMulticore and store it in the variable ‘lda_model’

models.ldamulticore – parallelized Latent Dirichlet Allocation 
Online Latent Dirichlet Allocation (LDA) in Python, using all CPU cores to parallelize and speed up model training.
The parallelization uses multiprocessing.

gensim.models.ldamodel.LdaModel class is an equivalent, but more straightforward and single-core implementation.

The training algorithm:
is streamed: training documents may come in sequentially, no random access required,
runs in constant memory w.r.t. the number of documents: size of the training corpus does not affect memory footprint, can process corpora larger than RAM
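A streamed corpus in this sense is simply an object that yields one bag-of-words vector at a time and can be iterated again for each training pass. The class below is a hypothetical sketch with naive whitespace tokenization; `StreamedCorpus` and the toy mapping are not part of gensim.

```python
class StreamedCorpus:
    """Yield one bag-of-words vector per line without storing them all."""

    def __init__(self, lines, token2id):
        self.lines = lines        # any re-iterable source of raw text lines
        self.token2id = token2id  # mapping from token to integer id

    def __iter__(self):
        for line in self.lines:
            counts = {}
            for token in line.lower().split():
                if token in self.token2id:
                    token_id = self.token2id[token]
                    counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())

corpus = StreamedCorpus(["Car engine car", "bumper"], {"car": 0, "engine": 1})
for bow in corpus:
    print(bow)
```

Only one document vector exists in memory at a time, which is why the memory footprint does not grow with the number of documents.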



Parameters:

corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents). If not given, the model is left untrained (presumably because you want to call update() manually).

num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.

workers (int, optional) – Number of worker processes to be used for parallelization. If None, all available cores (as estimated by workers=cpu_count()-1) will be used.

Note however that for hyper-threaded CPUs, this estimation returns a too high number – set workers directly to the number of your real cores (not hyperthreads) minus one, for optimal performance.

In [0]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=3, workers=3)

In [61]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.008*"univers" + 0.007*"articl" + 0.007*"year" + 0.006*"know" + 0.006*"host" + 0.006*"nntp" + 0.005*"team" + 0.005*"like" + 0.004*"think" + 0.004*"david"
Topic: 1 
Words: 0.011*"articl" + 0.009*"think" + 0.007*"like" + 0.006*"game" + 0.006*"nntp" + 0.005*"host" + 0.005*"peopl" + 0.005*"univers" + 0.005*"team" + 0.005*"know"
Topic: 2 
Words: 0.009*"articl" + 0.008*"time" + 0.008*"like" + 0.007*"think" + 0.006*"know" + 0.005*"peopl" + 0.005*"good" + 0.005*"state" + 0.005*"univers" + 0.005*"say"
Topic: 3 
Words: 0.009*"file" + 0.007*"univers" + 0.006*"know" + 0.006*"nntp" + 0.006*"host" + 0.006*"like" + 0.005*"articl" + 0.005*"state" + 0.005*"card" + 0.005*"work"
Topic: 4 
Words: 0.012*"peopl" + 0.007*"say" + 0.006*"think" + 0.006*"armenian" + 0.006*"know" + 0.006*"time" + 0.005*"go" + 0.005*"like" + 0.005*"come" + 0.004*"articl"
Topic: 5 
Words: 0.008*"christian" + 0.008*"know" + 0.007*"say" + 0.007*"think" + 0.007*"believ" + 0.006*"jesu" + 0.006*"articl" + 0.005*"like"

In [62]:
#Running LDA using TFIDF
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.004*"pitt" + 0.004*"gordon" + 0.004*"bank" + 0.003*"israel" + 0.003*"buffalo" + 0.002*"file" + 0.002*"articl" + 0.002*"univers" + 0.002*"window" + 0.002*"surrend"
Topic: 1 Word: 0.002*"like" + 0.002*"know" + 0.002*"articl" + 0.002*"univers" + 0.002*"peopl" + 0.002*"say" + 0.002*"year" + 0.002*"time" + 0.002*"sandvik" + 0.002*"steve"
Topic: 2 Word: 0.003*"virginia" + 0.002*"cramer" + 0.002*"univers" + 0.002*"problem" + 0.002*"window" + 0.002*"know" + 0.002*"mail" + 0.002*"nasa" + 0.002*"space" + 0.002*"optilink"
Topic: 3 Word: 0.002*"chip" + 0.002*"window" + 0.002*"know" + 0.002*"encrypt" + 0.002*"thank" + 0.002*"driver" + 0.002*"clipper" + 0.002*"problem" + 0.002*"drive" + 0.002*"netcom"
Topic: 4 Word: 0.003*"armenian" + 0.003*"govern" + 0.003*"peopl" + 0.002*"encrypt" + 0.002*"isra" + 0.002*"right" + 0.002*"clipper" + 0.002*"turkish" + 0.002*"chip" + 0.002*"stratu"
Topic: 5 Word: 0.002*"cwru" + 0.002*"drive" + 0.002*"univers" + 0.002*"peopl" + 0.002*"printer" + 0.002*

In [74]:
processed_docs[0]

['lerxst',
 'thing',
 'subject',
 'nntp',
 'post',
 'host',
 'organ',
 'univers',
 'maryland',
 'colleg',
 'park',
 'line',
 'wonder',
 'enlighten',
 'door',
 'sport',
 'look',
 'late',
 'earli',
 'call',
 'bricklin',
 'door',
 'small',
 'addit',
 'bumper',
 'separ',
 'rest',
 'bodi',
 'know',
 'tellm',
 'model',
 'engin',
 'spec',
 'year',
 'product',
 'histori',
 'info',
 'funki',
 'look',
 'mail',
 'thank',
 'bring',
 'neighborhood',
 'lerxst']

In [76]:
for index, score in sorted(lda_model[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.43487852811813354	 
Topic: 0.007*"space" + 0.006*"articl" + 0.005*"scsi" + 0.005*"like" + 0.005*"host" + 0.005*"program" + 0.005*"data" + 0.005*"chip" + 0.005*"need" + 0.005*"nntp"

Score: 0.27545031905174255	 
Topic: 0.008*"peopl" + 0.006*"articl" + 0.005*"armenian" + 0.005*"think" + 0.005*"time" + 0.004*"like" + 0.004*"want" + 0.004*"univers" + 0.004*"go" + 0.004*"know"

Score: 0.2702215015888214	 
Topic: 0.008*"univers" + 0.007*"articl" + 0.007*"year" + 0.006*"know" + 0.006*"host" + 0.006*"nntp" + 0.005*"team" + 0.005*"like" + 0.004*"think" + 0.004*"david"


In [75]:
#Performance evaluation by classifying sample document using LDA TF-IDF model.
for index, score in sorted(lda_model_tfidf[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.6385207176208496	 
Topic: 0.003*"team" + 0.003*"game" + 0.002*"year" + 0.002*"think" + 0.002*"player" + 0.002*"peopl" + 0.002*"like" + 0.002*"articl" + 0.002*"play" + 0.002*"good"

Score: 0.3392506539821625	 
Topic: 0.003*"virginia" + 0.002*"cramer" + 0.002*"univers" + 0.002*"problem" + 0.002*"window" + 0.002*"know" + 0.002*"mail" + 0.002*"nasa" + 0.002*"space" + 0.002*"optilink"


In [67]:
#Testing model on unseen document
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.7749364972114563	 Topic: 0.012*"peopl" + 0.007*"say" + 0.006*"think" + 0.006*"armenian" + 0.006*"know"
Score: 0.025016067549586296	 Topic: 0.009*"file" + 0.007*"univers" + 0.006*"know" + 0.006*"nntp" + 0.006*"host"
Score: 0.025007765740156174	 Topic: 0.008*"christian" + 0.008*"know" + 0.007*"say" + 0.007*"think" + 0.007*"believ"
Score: 0.02500748075544834	 Topic: 0.007*"space" + 0.006*"articl" + 0.005*"scsi" + 0.005*"like" + 0.005*"host"
Score: 0.02500622719526291	 Topic: 0.008*"univers" + 0.007*"articl" + 0.007*"year" + 0.006*"know" + 0.006*"host"
Score: 0.02500622719526291	 Topic: 0.009*"articl" + 0.008*"time" + 0.008*"like" + 0.007*"think" + 0.006*"know"
Score: 0.025005655363202095	 Topic: 0.008*"peopl" + 0.006*"articl" + 0.005*"armenian" + 0.005*"think" + 0.005*"time"
Score: 0.025005513802170753	 Topic: 0.006*"mail" + 0.006*"host" + 0.006*"univers" + 0.006*"articl" + 0.006*"avail"
Score: 0.025004345923662186	 Topic: 0.011*"articl" + 0.009*"think" + 0.007*"like" + 0.006*"ga