# NLTK Corpus Analysis with Gensim's LDA Model 

## Preparation
First, install the required libraries with pip:
* nltk
* gensim
* pyLDAvis

In [127]:
!pip install nltk
!pip install gensim
!pip install pyLDAvis



After installing the dependencies, download the following NLTK datasets.

In [128]:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("semcor")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Shireen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Shireen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\Shireen\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Shireen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Datasets
Load the SemCor corpus from the NLTK package.

In [129]:
from nltk.corpus import semcor as corpus

Let us check out the content of the corpus.

In [130]:
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:100000]):
    print(item, end=" ")
    if n % 25 == 24:
        print(" ")

print("")
print("")
print("Total number of documents:", len(corpus.fileids()))

The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced `` no evidence '' that any irregularities took place  
. The jury further said in term end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves  
the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . The September October term jury  
had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won  
by Mayor-nominate Ivan Allen Jr. . `` Only a relative handful of such reports was received '' , the jury said , `` considering the  
widespread interest in the election , the number of voters and the size of this city '' . The jury said it did find that  
many of Georgia 's registration and election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act

, the State Board of Education should be directed to `` give priority '' to teacher pay raises . After a long , hot controversy  
, Miller County has a new school superintendent , elected , as a policeman put it , in the `` coolest election I ever saw  
in this county '' . The new school superintendent is Harry Davis , a veteran agriculture teacher , who defeated Felix Bush , a school  
principal and chairman of the Miller County Democratic Executive Committee . Davis received 1119 votes in Saturday 's election , and Bush got 402 .  
Ordinary Carey Williams , armed with a pistol , stood by at the polls to insure order . `` This was the coolest , calmest  
election I ever saw '' , Colquitt Policeman Tom Williams said . `` Being at the polls was just like being at church . I  
did n't smell a drop of liquor , and we did n't have a bit of trouble '' . The campaign leading to the election  
was not so quiet , however . It was marked by controversy , anonymous midnight phone calls and veile

You can train the model on the first K documents or on all of them. Here we use all documents.

In [131]:
# All documents
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:10])
print ("")
print("num of docs:", len(docs))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...], ['Committee', 'approval', 'of', 'Gov.', 'Price', ...], ['The', 'Orioles', 'tonight', 'retained', 'the', ...], ['A', 'Texas', 'halfback', 'who', 'does', "n't", ...], ['Rookie', 'Ron', 'Nischwitz', 'continued', 'his', ...], ['Nick', 'Skorich', ',', 'the', 'line', 'coach', 'for', ...], ['If', 'the', 'Cardinals', 'heed', 'Manager', 'Gene', ...], ['Sizzling', 'temperatures', 'and', 'hot', 'summer', ...], ['The', 'nuclear', 'war', 'is', 'already', 'being', ...], ['It', 'is', 'not', 'news', 'that', 'Nathan', ...]]

num of docs: 352


## Data preprocessing
First, let us define the stopwords. We combine the English stopwords from the NLTK package with extra tokens that act as noise in the LDA analysis (punctuation, numbers, and frequent but uninformative words).  
(Optional) Numbers and punctuation could instead be filtered out with a regular expression.

In [132]:
# English stopwords defined by the NLTK package.
from nltk.corpus import stopwords
en_stop = stopwords.words("english")

# Extra noise tokens (punctuation, numbers, common words), prepended to the NLTK list.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<","$","''","*","!"]\
+["0","1","2","3","4","5","6","7","8","9","10","87","50","100","31","29","13","637","71","30","74","1119","18","1913","1923","1937","1961","1962","87","402"]\
+["f","a's" ,"mr.", "able" , "about" , "above" , "according" , "accordingly" , "across" , "actually" , "after" , "afterwards" , "again" , "against" , "ain't" , "all" , "allow" , "allows" , "almost" , "alone" , "along" , "already" , "also" , "although" , "always" , "am" , "among" , "amongst" , "an" , "and" , "another" , "any" , "anybody" , "anyhow" , "anyone" , "anything" , "anyway" , "anyways" , "anywhere" , "apart" , "appear" , "appreciate" , "appropriate" , "are" , "aren't" , "around" , "as" , "aside" , "ask" , "asking" , "associated" , "at" , "available" , "away" , "awfully" , "be" , "became" , "because" , "become" , "becomes" , "becoming" , "been" , "before" , "beforehand" , "behind" , "being" , "believe" , "below" , "beside" , "besides" , "best" , "better" , "between" , "beyond" , "both" , "brief" , "but" , "by" , "c'mon" , "c's" , "came" , "can" , "can't" , "cannot" , "cant" , "cause" , "causes" , "certain" , "certainly" , "changes" , "clearly" , "co" , "com" , "come" , "comes" , "concerning" , "consequently" , "consider" , "considering" , "contain" , "containing" , "contains" , "corresponding" , "could" , "couldn't" , "course" , "currently" , "definitely" , "described" , "despite" , "did" , "didn't" , "different" , "do" , "does" , "doesn't" , "doing" , "don't" , "done" , "down" , "downwards" , "during" , "each" , "edu" , "eg" , "eight" , "either" , "else" , "elsewhere" , "enough" , "entirely" , "especially" , "et" , "etc" , "even" , "ever" , "every" , "everybody" , "everyone" , "everything" , "everywhere" , "ex" , "exactly" , "example" , "except" , "far" , "few" , "fifth" , "first" , "five" , "followed" , "following" , "follows" , "for" , "former" , "formerly" , "forth" , "four" , "from" , "further" , "furthermore" , "get" , "gets" , "getting" , "given" , "gives" , "go" , "goes" , "going" , "gone" , "got" , "gotten" , "greetings" , "had" , "hadn't" , "happens" , "hardly" , "has" , "hasn't" , "have" , "haven't" , "having" , "he" , "he's" , "hello" , "help" , 
"hence" , "her" , "here" , "here's" , "hereafter" , "hereby" , "herein" , "hereupon" , "hers" , "herself" , "hi" , "him" , "himself" , "his" , "hither" , "hopefully" , "how" , "howbeit" , "however" , "i'd" , "i'll" , "i'm" , "i've" , "ie" , "if" , "ignored" , "immediate" , "in" , "inasmuch" , "inc" , "indeed" , "indicate" , "indicated" , "indicates" , "inner" , "insofar" , "instead" , "into" , "inward" , "is" , "isn't" , "it" , "it'd" , "it'll" , "it's" , "its" , "itself" , "just" , "keep" , "keeps" , "kept" , "know" , "known" , "knows" , "last" , "lately" , "later" , "latter" , "latterly" , "least" , "less" , "lest" , "let" , "let's" , "like" , "liked" , "likely" , "little" , "look" , "looking" , "looks" , "ltd" , "mainly" , "many" , "may" , "maybe" , "me" , "mean" , "meanwhile" , "merely" , "might" , "more" , "moreover" , "most" , "mostly" , "much" , "must" , "my" , "myself" , "name" , "namely" , "nd" , "near" , "nearly" , "necessary" , "need" , "needs" , "neither" , "never" , "nevertheless" , "new" , "next" , "nine" , "no" , "nobody" , "non" , "none" , "noone" , "nor" , "normally" , "not" , "nothing" , "novel" , "now" , "nowhere" , "obviously" , "of" , "off" , "often" , "oh" , "ok" , "okay" , "old" , "on" , "once" , "one" , "ones" , "only" , "onto" , "or" , "other" , "others" , "otherwise" , "ought" , "our" , "ours" , "ourselves" , "out" , "outside" , "over" , "overall" , "own" , "particular" , "particularly" , "per" , "perhaps" , "placed" , "please" , "plus" , "possible" , "presumably" , "probably" , "provides" , "que" , "quite" , "qv" , "rather" , "rd" , "re" , "really" , "reasonably" , "regarding" , "regardless" , "regards" , "relatively" , "respectively" , "right" , "said" , "same" , "saw" , "say" , "saying" , "says" , "second" , "secondly" , "see" , "seeing" , "seem" , "seemed" , "seeming" , "seems" , "seen" , "self" , "selves" , "sensible" , "sent" , "serious" , "seriously" , "seven" , "several" , "shall" , "she" , "should" , "shouldn't" , "since" , "six" 
, "so" , "some" , "somebody" , "somehow" , "someone" , "something" , "sometime" , "sometimes" , "somewhat" , "somewhere" , "soon" , "sorry" , "specified" , "specify" , "specifying" , "still" , "sub" , "such" , "sup" , "sure" , "t's" , "take" , "taken" , "tell" , "tends" , "th" , "than" , "thank" , "thanks" , "thanx" , "that" , "that's" , "thats" , "the" , "their" , "theirs" , "them" , "themselves" , "then" , "thence" , "there" , "there's" , "thereafter" , "thereby" , "therefore" , "therein" , "theres" , "thereupon" , "these" , "they" , "they'd" , "they'll" , "they're" , "they've" , "think" , "third" , "this" , "thorough" , "thoroughly" , "those" , "though" , "three" , "through" , "throughout" , "thru" , "thus" , "to" , "together" , "too" , "took" , "toward" , "towards" , "tried" , "tries" , "truly" , "try" , "trying" , "twice" , "two" , "un" , "under" , "unfortunately" , "unless" , "unlikely" , "until" , "unto" , "up" , "upon" , "us" , "use" , "used" , "useful" , "uses" , "using" , "usually" , "value" , "various" , "very" , "via" , "viz" , "vs" , "want" , "wants" , "was" , "wasn't" , "way" , "we" , "we'd" , "we'll" , "we're" , "we've" , "welcome" , "well" , "went" , "were" , "weren't" , "what" , "what's" , "whatever" , "when" , "whence" , "whenever" , "where" , "where's" , "whereafter" , "whereas" , "whereby" , "wherein" , "whereupon" , "wherever" , "whether" , "which" , "while" , "whither" , "who" , "who's" , "whoever" , "whole" , "whom" , "whose" , "why" , "will" , "willing" , "wish" , "with" , "within" , "without" , "won't" , "wonder" , "would" , "wouldn't" , "yes" , "yet" , "you" , "you'd" , "you'll" , "you're" , "you've" , "your" , "yours" , "yourself" , "yourselves" , "zero"]\
+["sunday","friday","september","october","million","said","tuesday","wednesday","friday","years","monday","november","jan.","aug.","sept.","go","back","one","year"]\
+[" of ","gov.","w.","add","size","jr.","end","mrs.","m.","see","rep.","b.","d.","13th","yet","n't","want","put","p.","make","say","let","tabb","gop","tom","four"]\
+en_stop
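
The optional regular-expression filter mentioned above could replace the hand-written lists of punctuation and numbers. A minimal sketch, assuming a token counts as noise when it contains no ASCII letter at all; `NOISE_RE` and `is_noise` are hypothetical names and not part of NLTK or gensim.

```python
import re

# A token is treated as noise when it has no ASCII letter in it,
# which covers pure punctuation ("``", "--") and pure numbers ("1119").
NOISE_RE = re.compile(r"^[^a-zA-Z]+$")


def is_noise(token):
    """Return True for tokens made only of digits and punctuation."""
    return bool(NOISE_RE.match(token))


tokens = ["``", "jury", "1119", "--", "13th", "n't", "election"]
kept = [t for t in tokens if not is_noise(t)]
print(kept)  # ['jury', '13th', "n't", 'election']
```

Tokens such as `13th` and `n't` contain letters and pass this filter, so they still need an entry in the stopword list.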

Next, let us define several preprocessing functions.

In [133]:
from nltk.corpus import wordnet as wn # import for lemmatize

def preprocess_word(word, stopwordset):
    
    #1.convert words to lowercase (e.g., Python =>python)
    word=word.lower()
    
    #2.remove ",", ".", and "'s"
    if word in [",",".","'s"]:
        return None
    
    #3.remove stopwords  (e.g., the => (None)) 
    if word in stopwordset:
        return None
    
    #4.lemmatize  (e.g., cooked=>cook)
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # lemmatized words could be in the stopwords set
    elif lemma in stopwordset: 
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]

Let us check out the preprocessing result.

In [134]:
# before
print(docs[0][:10]) 

# after
print(preprocess_documents(docs)[0][:10])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
['fulton', 'county', 'grand', 'jury', 'investigation', 'atlanta', 'recent', 'primary', 'election', 'produce']


Next, we need to convert our documents into the bag-of-words format that the gensim LDA model expects.

In [135]:
import gensim
from gensim import corpora

In [136]:
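# preprocess once and reuse the result
processed_docs = preprocess_documents(docs)
# build the dictionary (mapping between tokens and integer IDs)
dictionary = corpora.Dictionary(processed_docs)
# construct the bag-of-words corpus
corpus_ = [dictionary.doc2bow(doc) for doc in processed_docs]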

Let us check out the contents of the built dictionary and corpus.

In [137]:
# token2id maps each token to its integer ID in the dictionary

print(dictionary.token2id)



In [138]:
# corpus_ holds, for each document, a list of (token ID, frequency) pairs

# note that the pairs are sorted by dictionary ID, not by order of appearance in the document
print(corpus_[0][:10]) 

[(0, 1), (1, 1), (2, 1), (3, 2), (4, 4), (5, 2), (6, 2), (7, 1), (8, 2), (9, 1)]
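
To make the bag-of-words format concrete, here is a pure-Python sketch of the idea behind `Dictionary` and `doc2bow`. It is not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and assigning IDs in first-seen order is an assumption made for illustration.

```python
from collections import Counter


def build_token2id(documents):
    """Assign an integer ID to every distinct token, in first-seen order."""
    token2id = {}
    for document in documents:
        for token in document:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(document, token2id):
    """Count known tokens and return (ID, frequency) pairs sorted by ID."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())


toy_docs = [["jury", "election", "jury"], ["election", "county"]]
token2id = build_token2id(toy_docs)
print(to_bow(toy_docs[0], token2id))            # [(0, 2), (1, 1)]
print(to_bow(["county", "unknown"], token2id))  # [(2, 1)]
```

The second call shows that a token missing from the mapping is simply left out of the result.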


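Let us compare an original sentence with its bag-of-words representation, which is the input format for the LDA model. Tokens that are not in the dictionary (stopwords, punctuation) are dropped by `doc2bow`, so the result is shorter than the sentence.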

In [139]:
# before
print([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]])

# after
print(dictionary.doc2bow([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]]))

['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
[(37, 1), (108, 1), (151, 1), (165, 1), (196, 1), (211, 1), (249, 1), (262, 1), (340, 1), (360, 1), (392, 1)]


## Training with k = 0.1
Here k is the value assigned to `alpha`, `eta`, and `minimum_probability`.

In [141]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=10,
                                           id2word=dictionary,
                                           alpha=0.1,                 # optional: prior on the per-document topic distribution
                                           eta=0.1,                   # optional: prior on the per-topic word distribution (often called beta)
                                           minimum_probability=0.1    # optional: topics with a probability below this value are filtered out
                                          )
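
`alpha` and `eta` are the concentration parameters of the Dirichlet priors over the document-topic and topic-word distributions. Values below 1 favour sparse distributions, where a few entries carry most of the probability mass. The following standard-library sketch illustrates that effect by sampling from a symmetric Dirichlet through normalized gamma draws. `sample_symmetric_dirichlet` is a hypothetical helper and plays no part in gensim's training.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from Dirichlet(concentration, ..., concentration)."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
for concentration in (0.1, 0.4, 10.0):
    sample = sample_symmetric_dirichlet(concentration, 10, rng)
    # The largest entry tends to be closer to 1 when the concentration is small.
    print(concentration, round(max(sample), 3))
```

Under this reading, moving from 0.1 to 0.4 asks the model for slightly less concentrated topic mixtures per document.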

Check out the learned parameters.

In [142]:
# the top num_words words of each topic, printed as (topic ID, "probability*word + ...") pairs

topics = ldamodel.print_topics(num_words=15)
for topic in topics:
    print(topic)

(0, '0.004*"time" + 0.003*"state" + 0.003*"man" + 0.002*"people" + 0.002*"long" + 0.002*"men" + 0.002*"american" + 0.002*"work" + 0.002*"good" + 0.002*"life" + 0.002*"interest" + 0.002*"program" + 0.001*"give" + 0.001*"area" + 0.001*"house"')
(1, '0.004*"man" + 0.003*"time" + 0.002*"state" + 0.002*"point" + 0.002*"work" + 0.002*"good" + 0.002*"people" + 0.002*"men" + 0.002*"place" + 0.002*"day" + 0.002*"turn" + 0.002*"problem" + 0.002*"home" + 0.002*"long" + 0.002*"world"')
(2, '0.003*"time" + 0.003*"man" + 0.002*"state" + 0.002*"good" + 0.002*"work" + 0.002*"life" + 0.002*"show" + 0.002*"great" + 0.002*"head" + 0.002*"house" + 0.002*"give" + 0.002*"world" + 0.002*"point" + 0.002*"call" + 0.002*"men"')
(3, '0.004*"state" + 0.003*"time" + 0.002*"man" + 0.002*"good" + 0.002*"long" + 0.002*"work" + 0.002*"form" + 0.002*"show" + 0.002*"point" + 0.002*"world" + 0.002*"men" + 0.002*"american" + 0.002*"high" + 0.001*"give" + 0.001*"number"')
(4, '0.005*"time" + 0.003*"man" + 0.002*"people" + 
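
The topics printed above share most of their top words (`time`, `state`, `man`, and so on), which suggests they are poorly separated. One way to quantify this is the Jaccard overlap between the top-word sets of two topics. The sketch below parses strings in the format returned by `print_topics`; `top_words` and `jaccard` are hypothetical helpers, and the two inputs are shortened copies of topics 0 and 1 from the output.

```python
import re


def top_words(topic_string):
    """Extract the quoted words from a string like '0.004*"time" + ...'."""
    return set(re.findall(r'\*"([^"]+)"', topic_string))


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


topic_0 = '0.004*"time" + 0.003*"state" + 0.003*"man" + 0.002*"people" + 0.002*"long"'
topic_1 = '0.004*"man" + 0.003*"time" + 0.002*"state" + 0.002*"point" + 0.002*"work"'

print(round(jaccard(top_words(topic_0), top_words(topic_1)), 2))  # 0.43
```

A high overlap across many topic pairs may indicate that very frequent words should be removed before training, for example with gensim's `Dictionary.filter_extremes`.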

In [143]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n,item in enumerate(corpus_[:10]):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(item))

document ID 0:[(9, 0.89282316)]
document ID 1:[(3, 0.9638631)]
document ID 2:[(8, 0.85658246)]
document ID 3:[(1, 0.99906105)]
document ID 4:[(1, 0.86402935), (8, 0.13510785)]
document ID 5:[(1, 0.24346182), (9, 0.7534818)]
document ID 6:[(1, 0.9990257)]
document ID 7:[(4, 0.9856418)]
document ID 8:[(0, 0.5728181), (1, 0.1503983), (9, 0.24226104)]
document ID 9:[(1, 0.98321724)]
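
Each line above is a list of (topic ID, probability) pairs. Topics below `minimum_probability` are omitted, so the listed probabilities need not sum to 1. To reduce such a list to a single label, one can pick the pair with the highest probability. A minimal sketch, where `dominant_topic` is a hypothetical helper and the input is copied from document 8 in the output:

```python
def dominant_topic(topic_probs):
    """Return the (topic ID, probability) pair with the highest probability."""
    if not topic_probs:
        return None  # every topic fell below the threshold
    return max(topic_probs, key=lambda pair: pair[1])


doc_8 = [(0, 0.5728181), (1, 0.1503983), (9, 0.24226104)]
print(dominant_topic(doc_8))  # (0, 0.5728181)
```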


Let us check out the `n`-th document in the result.

In [144]:
n=0

# nth document's topic distribution
print(ldamodel.get_document_topics(corpus_[n]))

# nth document's category
#print(categories[n])

# show the original document
print(" ".join(docs[n]))

[(0, 0.10448071), (9, 0.88848794)]


## Visualization
We can further analyze our result through visualization.

In [145]:
# note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus_, dictionary)
lda_display

## Training with k = 0.4
The same model is retrained with `alpha`, `eta`, and `minimum_probability` all set to 0.4.

In [146]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=10,
                                           id2word=dictionary,
                                           alpha=0.4,                 # optional: prior on the per-document topic distribution
                                           eta=0.4,                   # optional: prior on the per-topic word distribution (often called beta)
                                           minimum_probability=0.4    # optional: topics with a probability below this value are filtered out
                                          )

Check out the learned parameters.

In [147]:
# the top num_words words of each topic, printed as (topic ID, "probability*word + ...") pairs

topics = ldamodel.print_topics(num_words=15)
for topic in topics:
    print(topic)

(0, '0.003*"state" + 0.002*"time" + 0.002*"american" + 0.002*"man" + 0.002*"country" + 0.001*"people" + 0.001*"place" + 0.001*"life" + 0.001*"program" + 0.001*"head" + 0.001*"call" + 0.001*"world" + 0.001*"general" + 0.001*"work" + 0.001*"area"')
(1, '0.003*"state" + 0.003*"time" + 0.002*"man" + 0.002*"school" + 0.002*"good" + 0.001*"day" + 0.001*"work" + 0.001*"give" + 0.001*"point" + 0.001*"place" + 0.001*"people" + 0.001*"show" + 0.001*"home" + 0.001*"interest" + 0.001*"long"')
(2, '0.004*"time" + 0.003*"state" + 0.002*"man" + 0.002*"good" + 0.002*"long" + 0.002*"people" + 0.002*"day" + 0.002*"work" + 0.001*"line" + 0.001*"show" + 0.001*"\'ll" + 0.001*"head" + 0.001*"american" + 0.001*"point" + 0.001*"area"')
(3, '0.002*"time" + 0.002*"man" + 0.002*"home" + 0.002*"men" + 0.001*"people" + 0.001*"state" + 0.001*"house" + 0.001*"life" + 0.001*"thought" + 0.001*"show" + 0.001*"long" + 0.001*"work" + 0.001*"open" + 0.001*"good" + 0.001*"call"')
(4, '0.002*"time" + 0.002*"man" + 0.002*"st

In [148]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n,item in enumerate(corpus_[:10]):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(item))

document ID 0:[(9, 0.9950942)]
document ID 1:[(0, 0.5188121)]
document ID 2:[(7, 0.9888009)]
document ID 3:[(7, 0.9952507)]
document ID 4:[(2, 0.6112989)]
document ID 5:[(6, 0.53051144)]
document ID 6:[(7, 0.77149904)]
document ID 7:[(7, 0.73050636)]
document ID 8:[(9, 0.6415391)]
document ID 9:[(7, 0.6796661)]
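
With the threshold at 0.4, each of these ten documents is left with a single topic. A quick tally shows how the documents spread over the topics. The sketch below uses only the topic IDs copied from the output above. Topic IDs are arbitrary labels assigned during training, so they should not be matched against the IDs from the earlier run.

```python
from collections import Counter

# Dominant topic ID of documents 0-9, copied from the output above.
dominant_ids = [9, 0, 7, 7, 2, 6, 7, 7, 9, 7]

tally = Counter(dominant_ids)
print(tally.most_common(2))  # [(7, 5), (9, 2)]
```

Half of the first ten documents land in topic 7 in this run.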


In [149]:
n=0

# nth document's topic distribution
print(ldamodel.get_document_topics(corpus_[n]))

# nth document's category
#print(categories[n])

# show the original document
print(" ".join(docs[n]))

[(9, 0.9950936)]


## Visualization
We can further analyze our result through visualization.

In [150]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus_, dictionary)
lda_display