# Capstone: Text Factorizing with NLP
## Thomas Ludlow

# 04 - LDA Topic Modeling

This notebook uses the DataFrame assembled in Notebook 03, which contains the following columns for all paragraphs of n+ length:
 - `author`
 - `work`
 - `paragraph`
 
Preprocessing tokenizes and lemmatizes each paragraph into a list of lemmas using spaCy. These lists then feed into Gensim to build a dictionary for vectorizing. The resulting Gensim bag-of-words vectors are the basis for topic modeling of each paragraph row.
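As a rough illustration of the flow described above, the following sketch maps already-lemmatized token lists to integer ids and per-paragraph count vectors using only the standard library. The names `toy_pars`, `token2id`, and `to_bow` are hypothetical and the toy data is made up; the notebook itself relies on spaCy for lemmas and on Gensim's `Dictionary` for the id mapping, whose id ordering may differ from the first-seen order used here.

```python
from collections import Counter

# Toy stand-in for lemmatized paragraphs (hypothetical data, not from book_df)
toy_pars = [
    ["lord", "eternity", "king", "god"],
    ["god", "soul", "lord", "lord"],
]

# Assign each distinct token an integer id in first-seen order (an assumption of this sketch)
token2id = {}
for par in toy_pars:
    for tok in par:
        token2id.setdefault(tok, len(token2id))

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_bow = [to_bow(par, token2id) for par in toy_pars]
print(toy_bow)
```

Each paragraph becomes a sparse list of `(token_id, count)` pairs, which is the same shape as the `bow_corpus` entries built later in this notebook.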

**Libraries**

In [26]:
# Python Data Science
import re
import time
import numpy as np
import pandas as pd
from tqdm import tqdm

# Natural Language Processing
import spacy
import pyLDAvis.gensim
from nltk.stem import PorterStemmer

import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamodel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

**Import Book Data**

In [3]:
book_df = pd.read_csv('./data/book_df.csv')

In [4]:
book_df.head()

Unnamed: 0,author,work,paragraph
0,Ani,Book of the Dead: The Papyrus of Ani,"""Homage to thee, Osiris, Lord of eternity, Kin..."
1,Ani,Book of the Dead: The Papyrus of Ani,"""Thou art the Great Chief, the first among thy..."
2,Ani,Book of the Dead: The Papyrus of Ani,"""Thou rollest up into the horizon, thou hast s..."
3,Ani,Book of the Dead: The Papyrus of Ani,"His sister [Isis] hath protected him, and hath..."
4,Ani,Book of the Dead: The Papyrus of Ani,A HYMN OF PRAISE TO RA WHEN HE RISETH IN THE E...


In [23]:
text_data = pd.read_csv('./data/text_data.csv')
text_data.head()

Unnamed: 0,Title,Author,Filename,Start Key,End Key,Category,Bumper Sticker,Original Language,Country,Year,Year Val,Wiki Link,Wiki Text,Paragraphs
0,Book of the Dead: The Papyrus of Ani,Ani,ani_papyrus.txt,THE PAPYRUS OF ANI,***END***,Polytheism,Magic spells will assist the dead in journey t...,Heiroglyphic,Egypt,2400-1250 BC,-1250,https://en.wikipedia.org/wiki/Book_of_the_Dead,The Book of the Dead is an ancient Egyptian fu...,343
1,The Categories,Aristotle,aristotle_categories.txt,*** START OF THIS PROJECT GUTENBERG EBOOK THE ...,End of the Project Gutenberg EBook of The Cate...,Hylomorphism,Being is a compound of matter and form,Greek,Greece,~335 BC,-335,https://en.wikipedia.org/wiki/Categories_(Aris...,The Categories (Greek Κατηγορίαι Katēgoriai; L...,132
2,The Poetics,Aristotle,aristotle_poetics.txt,ARISTOTLE ON THE ART OF POETRY,End of the Project Gutenberg EBook of The Poet...,Dramatic and Literary Theory,"Dramatic works imitate but vary in music, char...",Greek,Greece,335 BC,-335,https://en.wikipedia.org/wiki/Poetics_(Aristotle),Aristotle's Poetics (Greek: Περὶ ποιητικῆς; La...,72
3,The Gospel,"Buddha, Siddhartha Guatama",buddha_gospel.txt,500 BC,***END***,Buddhism,Human suffering and the cycle of death and reb...,English,India,~500 BC,-500,https://en.wikipedia.org/wiki/The_Gospel_of_Bu...,The Gospel of Buddha was an 1894 book by Paul ...,576
4,The Word,"Buddha, Siddhartha Guatama",buddha_word.txt,"BUDDHA, THE WORD",THE END,Buddhism,Four noble truths are understood by the enligh...,English,India,~500 BC,-500,https://en.wikipedia.org/wiki/Noble_Eightfold_...,The Noble Eightfold Path (Pali: ariyo aṭṭhaṅgi...,193


## spaCy NLP Preprocessing

### By Paragraphs

In [5]:
# Using the medium English model, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [6]:
# spaCy processing for tokens, lemma, part-of-speech, dependency
pars_nlp = []

for par in book_df.paragraph:
    pars_nlp.append(nlp(par))

In [7]:
len(pars_nlp)

28805

In [8]:
pars_nlp[:2]

["Homage to thee, Osiris, Lord of eternity, King of the Gods, whose names are manifold, whose forms are holy, thou being of hidden form in the temples, whose Ka is holy. Thou art the governor of Tattu (Busiris), and also the mighty one in Sekhem (Letopolis). Thou art the Lord to whom praises are ascribed in the nome of Ati, thou art the Prince of divine food in Anu. Thou art the Lord who is commemorated in Maati, the Hidden Soul, the Lord of Qerrt (Elephantine), the Ruler supreme in White Wall (Memphis). Thou art the Soul of Ra, his own body, and hast thy place of rest in Henensu (Herakleopolis). Thou art the beneficent one, and art praised in Nart. Thou makest thy soul to be raised up. Thou art the Lord of the Great House in Khemenu (Hermopolis). Thou art the mighty one of victories in Shas-hetep, the Lord of eternity, the Governor of Abydos. The path of his throne is in Ta-tcheser (a part of Abydos). Thy name is established in the mouths of men. Thou art the substance of Two Lands (E

### By Sentence

In [10]:
sent_nlp = []
for par in pars_nlp:
    sent_list = []
    for s in par.sents:
        sent_list.append(s.text)
    sent_nlp.append(sent_list)

In [19]:
sent_nlp[0][0]

'"Homage to thee, Osiris, Lord of eternity, King of the Gods, whose names are manifold, whose forms are holy, thou being of hidden form in the temples, whose Ka is holy.'

In [32]:
# Collect rows in a list and build the DataFrame once
# (DataFrame.append was removed in pandas 2.0 and is slow inside loops)
sent_cols = ['author','work','a_num','w_num','p_num','s_num','sentence']
sent_rows = []

a_num = 0
w_num = 0
p_num = 0

for p, sents_in_par in enumerate(sent_nlp):
    for s, sent in enumerate(sents_in_par):
        sent_rows.append({'author':book_df.loc[p, 'author'],
                          'work':book_df.loc[p, 'work'],
                          'a_num':a_num,
                          'w_num':w_num,
                          'p_num':p_num,
                          's_num':s,
                          'sentence':sent})
    p_num += 1
    # At a work boundary, reset the paragraph counter and advance the work index
    if p_num == text_data.loc[w_num, 'Paragraphs']:
        p_num = 0
        w_num += 1
        if w_num == text_data.shape[0]:
            break
        # Advance the author index only when the author changes
        if text_data.loc[w_num, 'Author'] != text_data.loc[w_num-1, 'Author']:
            a_num += 1
    if not p % 2000: print(p)

sent_df = pd.DataFrame(sent_rows, columns=sent_cols)

0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
22000
24000
26000
28000


In [33]:
sent_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sentence
0,Ani,Book of the Dead: The Papyrus of Ani,0,0,0,0,"""Homage to thee, Osiris, Lord of eternity, Kin..."
1,Ani,Book of the Dead: The Papyrus of Ani,0,0,0,1,"Thou art the governor of Tattu (Busiris), and ..."
2,Ani,Book of the Dead: The Papyrus of Ani,0,0,0,2,Thou art the Lord to whom praises are ascribed...
3,Ani,Book of the Dead: The Papyrus of Ani,0,0,0,3,Thou art the Lord who is commemorated in Maati...
4,Ani,Book of the Dead: The Papyrus of Ani,0,0,0,4,"Thou art the Soul of Ra, his own body, and has..."


In [35]:
sent_df.tail(25)

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sentence
119016,"Vyasa, Veda",Bhagavad Gita,32,44,68,3,and th' other shall be ours!
119017,"Vyasa, Veda",Bhagavad Gita,32,44,68,4,"To-day we slew a foe, and we will slay"
119018,"Vyasa, Veda",Bhagavad Gita,32,44,69,0,"Into some devilish womb, whence- birth by birth-"
119019,"Vyasa, Veda",Bhagavad Gita,32,44,69,1,"The devilish wombs re-spawn them, all beguiled;"
119020,"Vyasa, Veda",Bhagavad Gita,32,44,69,2,"And, till they find and worship Me, sweet Prince!"
119021,"Vyasa, Veda",Bhagavad Gita,32,44,69,3,Tread they
119022,"Vyasa, Veda",Bhagavad Gita,32,44,69,4,that Nether Road.
119023,"Vyasa, Veda",Bhagavad Gita,32,44,70,0,"All those three gates of Narak, wendeth straight"
119024,"Vyasa, Veda",Bhagavad Gita,32,44,70,1,"To find his peace, and comes to Swarga's gate."
119025,"Vyasa, Veda",Bhagavad Gita,32,44,71,0,"The ""Soothfast"" meat."


In [36]:
sent_df.to_csv('./data/sent_df.csv', index=False)

In [11]:
pars_lemma = []

for par_nlp in pars_nlp:
    pars_lemma.append([token.lemma_ for token in par_nlp     # List comprehension
                       if token.lemma_ != '-PRON-'           # Pronouns are excluded
                       and token.pos_ != 'PUNCT'             # Punctuation is excluded
                       and token.is_alpha                    # Non-alphabetic tokens (e.g. numbers) are excluded
                       and not token.is_stop])               # Stop words are excluded

In [12]:
len(pars_lemma)

28805

**Remove Additional Stopwords**
- Roman Numerals
- Articles

**Stemmer (Optional)**

In [14]:
ps = PorterStemmer()

In [15]:
def remove_sw(vec_list, sw_list, to_stem=False):
    update_list = []
    for token in vec_list:
        if token in sw_list: continue
        if to_stem: update_list.append(ps.stem(token))
        else: update_list.append(token)
    return update_list

In [16]:
count = 0
for par in pars_lemma:
    count += len(par)
count

2671496

**Additional Stopwords List**

In [17]:
sw = ['i','ii','iii','iv','v','vi','vii','viii','ix','x','xi','xii','xiii','xiv','xv','xvi','xvii','xviii','xix','xx','xxi','xxii',
      'the','a','but','like','for']

In [18]:
pars_lemma_sw = []
for par in pars_lemma:
    pars_lemma_sw.append(remove_sw(par, sw))
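With roughly 2.7 million tokens to check, membership tests against a Python list scan the whole list each time. A minimal alternative sketch, assuming stemming is not needed, converts the stop list to a set first. The function name `remove_sw_set` and the demo data are hypothetical and are not part of the notebook's pipeline.

```python
def remove_sw_set(tokens, stopwords):
    """Drop tokens found in `stopwords`; a set gives constant-time membership tests."""
    stop_set = set(stopwords)
    return [tok for tok in tokens if tok not in stop_set]

# Hypothetical demo input
demo_tokens = ["chapter", "iv", "the", "soul", "like", "fire"]
demo_sw = ["iv", "the", "like"]
demo_clean = remove_sw_set(demo_tokens, demo_sw)
print(demo_clean)
```

The output keeps token order, so it can be used in place of the list returned by `remove_sw` when `to_stem=False`.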

## Gensim LDA Model

In [509]:
# g_dict.compactify()

In [589]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary(pars_lemma_sw)

In [590]:
len(g_dict)

22992

**Remove Outliers from Dictionary**

In [591]:
g_dict.filter_extremes(no_below=2, no_above=0.9, keep_n=24000)

len(g_dict)

14452
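The call above reduced the dictionary from 22992 to 14452 terms. The sketch below shows the kind of document-frequency filtering involved, under the assumption that `no_below` is a minimum document count, `no_above` is a maximum fraction of documents, and `keep_n` caps the number of surviving terms by frequency. It is a stdlib illustration with hypothetical names (`filter_extremes_sketch`, `toy_docs`), not Gensim's implementation.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=2, no_above=0.9, keep_n=None):
    """Return the set of tokens kept under document-frequency thresholds."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs]
    # Order survivors by descending document frequency (ties broken alphabetically)
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

# Hypothetical toy corpus: "god" is in every document, three tokens appear once
toy_docs = [
    ["god", "man", "law"],
    ["god", "man", "tax"],
    ["god", "soul", "law"],
    ["god", "price"],
]
kept_tokens = filter_extremes_sketch(toy_docs)
print(sorted(kept_tokens))
```

In this toy corpus the ubiquitous token and the single-document tokens are both dropped, which is the intended effect of removing outliers before modeling.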

In [592]:
for i in range(6):
    print(i, g_dict[i], '\n')

0 abydo 

1 anu 

2 appear 

3 art 

4 ascrib 

5 ati 



In [593]:
# Build corpus of normalized text relative to dictionary
bow_corpus = [g_dict.doc2bow(par) for par in pars_lemma_sw]

In [594]:
bow_corpus[50]

[(35, 2),
 (82, 1),
 (215, 1),
 (294, 1),
 (326, 1),
 (333, 1),
 (353, 1),
 (374, 1),
 (412, 1)]

### TF-IDF

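**ERROR NOTE:** The TF-IDF corpus raises an IndexError with the Gensim LDA models, so the bag-of-words count corpus `bow_corpus` is used with the models instead. The cells in this subsection reference a `corpus` variable that is not defined elsewhere in this notebook, and their execution counts predate the dictionary cells above.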

In [516]:
tfidf = TfidfModel(corpus, normalize=False)

In [517]:
corpus = tfidf[corpus]

In [518]:
len(corpus)

9224

In [519]:
corpus[50]

[(18, 0.0035043453853684735),
 (19, 0.20654649731674726),
 (32, 2.2252737992878764),
 (36, 0.025630059457384068),
 (42, 0.2840805814295968),
 (81, 22.720424794604636),
 (145, 0.60389578698253288),
 (155, 0.10885871772967382),
 (174, 1.1050740576070781),
 (401, 21.091033070557796),
 (451, 1.1925511127230761),
 (457, 21.091033070557796),
 (458, 102.3012787279696),
 (459, 3.8637590432257332),
 (460, 0.80421288331133167),
 (461, 0.21845621492428335)]
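Because `normalize=False` skips length normalization, the weights above are not bounded by 1. A minimal sketch of the weighting, assuming the scheme is term count multiplied by log2(N / document frequency), is shown below. That formula is assumed here to mirror Gensim's defaults and should be checked against the `TfidfModel` documentation; `tfidf_sketch` and the toy values are hypothetical.

```python
import math

def tfidf_sketch(bow, doc_freq, n_docs):
    """Weight each (token_id, count) pair by count * log2(n_docs / document frequency)."""
    return [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]

# Hypothetical document frequencies for three token ids in a 4-document corpus
toy_doc_freq = {0: 4, 1: 2, 2: 1}
toy_doc = [(0, 3), (1, 1), (2, 2)]
toy_weights = tfidf_sketch(toy_doc, toy_doc_freq, n_docs=4)
print(toy_weights)
```

A token that appears in every document receives weight zero regardless of its count, while rare tokens with high counts dominate, which is consistent with the wide spread of values in the output above.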

**LDA Model - Single Core**

In [595]:
lda_model.clear()

In [637]:
lda_model = ldamodel.LdaModel(corpus=bow_corpus, 
                              id2word=g_dict,
                              num_topics=8, 
                              random_state=207,
                              update_every=1,
                              chunksize=120,
                              passes=20,
                              alpha='auto',
                              per_word_topics=True)

In [638]:
lda_model.print_topics()

[(0,
  '0.053*"great" + 0.037*"good" + 0.027*"govern" + 0.023*"tax" + 0.019*"right" + 0.017*"pay" + 0.013*"valu" + 0.013*"peopl" + 0.013*"high" + 0.012*"revenu"'),
 (1,
  '0.034*"price" + 0.029*"labour" + 0.027*"land" + 0.026*"year" + 0.025*"money" + 0.022*"capit" + 0.017*"work" + 0.016*"silver" + 0.016*"time" + 0.015*"gold"'),
 (2,
  '0.028*"natur" + 0.018*"state" + 0.017*"differ" + 0.013*"in" + 0.012*"mean" + 0.010*"certain" + 0.009*"mind" + 0.009*"note" + 0.009*"knowledg" + 0.009*"particular"'),
 (3,
  '0.036*"bodi" + 0.033*"idea" + 0.030*"exist" + 0.026*"produc" + 0.022*"object" + 0.019*"emot" + 0.019*"conceiv" + 0.018*"gener" + 0.018*"kind" + 0.015*"thing"'),
 (4,
  '0.052*"man" + 0.025*"thing" + 0.017*"know" + 0.015*"far" + 0.014*"mind" + 0.013*"if" + 0.012*"great" + 0.012*"law" + 0.012*"life" + 0.011*"way"'),
 (5,
  '0.049*"god" + 0.045*"say" + 0.027*"thi" + 0.018*"come" + 0.018*"and" + 0.017*"bring" + 0.017*"armi" + 0.013*"read" + 0.013*"market" + 0.013*"rent"'),
 (6,
  '0.056*

In [639]:
lda_model.log_perplexity(bow_corpus)

-8.1668555830599932

In [640]:
cm = CoherenceModel(model=lda_model, texts=pars_lemma_sw, dictionary=g_dict, coherence='c_v')
cm.get_coherence()

0.4485877711966515

**LDA Multicore**

In [631]:
lda_multi.clear()

In [632]:
lda_multi = ldamulticore.LdaMulticore(corpus=bow_corpus,
                                      id2word=g_dict,
                                      num_topics=8,
                                      random_state=207,
                                      chunksize=100,
                                      passes=50,
                                      per_word_topics=True,
                                      workers=4
)

In [633]:
lda_multi.print_topics()

[(0,
  '0.040*"man" + 0.023*"know" + 0.017*"thing" + 0.013*"mind" + 0.013*"think" + 0.012*"and" + 0.012*"say" + 0.011*"knowledg" + 0.010*"socrat" + 0.009*"word"'),
 (1,
  '0.030*"man" + 0.023*"govern" + 0.020*"good" + 0.016*"right" + 0.012*"power" + 0.012*"peopl" + 0.009*"state" + 0.008*"evil" + 0.008*"law" + 0.007*"life"'),
 (2,
  '0.020*"great" + 0.014*"countri" + 0.011*"nation" + 0.009*"state" + 0.008*"tax" + 0.007*"interest" + 0.007*"good" + 0.007*"money" + 0.007*"in" + 0.007*"public"'),
 (3,
  '0.020*"idea" + 0.014*"object" + 0.013*"exist" + 0.013*"differ" + 0.013*"thing" + 0.010*"natur" + 0.008*"case" + 0.008*"gener" + 0.007*"relat" + 0.007*"bodi"'),
 (4,
  '0.041*"and" + 0.033*"shall" + 0.032*"ye" + 0.028*"thou" + 0.027*"unto" + 0.025*"say" + 0.022*"allah" + 0.021*"lord" + 0.020*"thi" + 0.016*"thee"'),
 (5,
  '0.023*"year" + 0.014*"zarathustra" + 0.012*"thousand" + 0.011*"number" + 0.010*"time" + 0.010*"old" + 0.009*"gold" + 0.008*"hous" + 0.008*"day" + 0.007*"silver"'),
 (6,
  

In [634]:
lda_multi.log_perplexity(bow_corpus)

-7.5657569997796923
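The two `log_perplexity` values are easier to compare as perplexities. The sketch below assumes the returned value is a per-word likelihood bound in base 2, so that perplexity is 2 raised to the negative bound; the helper name `bound_to_perplexity` is hypothetical.

```python
# Per-word bounds copied from the log_perplexity outputs above
bound_single = -8.1668555830599932
bound_multi = -7.5657569997796923

def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound to a perplexity estimate."""
    return 2 ** (-per_word_bound)

perp_single = bound_to_perplexity(bound_single)
perp_multi = bound_to_perplexity(bound_multi)
print("lda_model:", round(perp_single, 1))
print("lda_multi:", round(perp_multi, 1))
```

Lower perplexity indicates a better fit, so under this assumption `lda_multi` fits the corpus more closely than `lda_model`. Both figures are computed on the training corpus rather than on held-out text.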

In [635]:
cm2 = CoherenceModel(model=lda_multi, texts=pars_lemma_sw, dictionary=g_dict, coherence='c_v')
cm2.get_coherence()

0.51080767642288183

## Visualize LDA with pyLDAvis

In [641]:
lda_display = pyLDAvis.gensim.prepare(lda_model, bow_corpus, g_dict, sort_topics=True)
pyLDAvis.display(lda_display)

In [636]:
lda_display2 = pyLDAvis.gensim.prepare(lda_multi, bow_corpus, g_dict, sort_topics=True)
pyLDAvis.display(lda_display2)

## Test Topics on New Text

In [642]:
newpar = 'Only strong unified government can save from "the war of all against all"'

In [643]:
newpar_vec = ['only','strong','unified','government','can','save','from','war','all','against','all']
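The vector above was tokenized by hand. A minimal stdlib sketch that reproduces it from `newpar` is shown below; the stop set `extra_stop` is hypothetical and was chosen only so the result matches the hand-built list. It does not lemmatize, so for consistency with the training data new text would ideally pass through the same spaCy and stopword steps used earlier in this notebook.

```python
import re

newpar = 'Only strong unified government can save from "the war of all against all"'

# Hypothetical minimal stop list, chosen so the result matches the hand-built vector
extra_stop = {"the", "of"}

def simple_tokens(text, stop):
    """Lowercase the text, keep alphabetic runs, and drop the given stop words."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in stop]

newpar_tokens = simple_tokens(newpar, extra_stop)
print(newpar_tokens)
```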

In [644]:
newpar_corp = g_dict.doc2bow(newpar_vec)
newpar_corp

[(433, 1),
 (669, 2),
 (786, 1),
 (1468, 1),
 (1477, 1),
 (1715, 1),
 (3427, 1),
 (5432, 1)]

In [645]:
lda_model.get_document_topics(newpar_corp)

[(0, 0.15774071),
 (1, 0.044295356),
 (2, 0.13752215),
 (3, 0.11871302),
 (4, 0.31861287),
 (5, 0.060938392),
 (6, 0.085572109),
 (7, 0.076605402)]

In [646]:
lda_multi.get_document_topics(newpar_corp)

[(0, 0.16327205),
 (1, 0.62483293),
 (2, 0.14934538),
 (3, 0.012512724),
 (4, 0.012505741),
 (5, 0.012517393),
 (6, 0.012503612),
 (7, 0.012510177)]
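To compare the two models on this sentence, the dominant topic can be read off each distribution. The sketch below copies the two outputs above into plain lists; `dominant_topic` is a hypothetical helper.

```python
# Topic distributions copied from the two get_document_topics outputs above
single_topics = [(0, 0.15774071), (1, 0.044295356), (2, 0.13752215), (3, 0.11871302),
                 (4, 0.31861287), (5, 0.060938392), (6, 0.085572109), (7, 0.076605402)]
multi_topics = [(0, 0.16327205), (1, 0.62483293), (2, 0.14934538), (3, 0.012512724),
                (4, 0.012505741), (5, 0.012517393), (6, 0.012503612), (7, 0.012510177)]

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

print(dominant_topic(single_topics))  # (4, 0.31861287)
print(dominant_topic(multi_topics))   # (1, 0.62483293)
```

The multicore model assigns most of the mass to its topic 1, whose top terms in the `print_topics` output include "govern", "right", "power", and "state". The single-core model spreads the sentence across several topics, with topic 4 highest at about 0.32.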

## Save Corpus LDA Topics to DataFrame and Disk

In [647]:
corpus_topic_df = pd.DataFrame(columns=[n for n in range(lda_model.num_topics)])

for i, doc in enumerate(lda_model.get_document_topics(bow_corpus)):
    for topic, proba in doc:
        corpus_topic_df.loc[i, topic] = proba

In [648]:
corpus_topic_df.fillna(0, inplace=True)

In [649]:
corpus_topic_df.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
dtype: int64
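Filling the DataFrame one cell at a time with `.loc` is slow for tens of thousands of rows. A possible alternative, sketched here with the standard library only, expands each sparse topic list into a fixed-width row so the table can be built in one step. The helper `to_dense_rows` and the toy input are hypothetical.

```python
def to_dense_rows(doc_topic_lists, num_topics):
    """Expand sparse (topic_id, probability) lists into fixed-width rows of floats."""
    rows = []
    for doc in doc_topic_lists:
        row = [0.0] * num_topics  # topics missing from the sparse list stay at zero
        for topic_id, proba in doc:
            row[topic_id] = proba
        rows.append(row)
    return rows

# Hypothetical sparse output for two documents and four topics
toy_doc_topics = [[(0, 0.7), (2, 0.3)], [(1, 0.25), (3, 0.75)]]
dense = to_dense_rows(toy_doc_topics, num_topics=4)
print(dense)
```

The resulting list of rows could be passed to `pd.DataFrame(dense)` directly, which also makes the separate `fillna(0)` step unnecessary.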

In [650]:
corpus_topic_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.186861,0.041349,0.023008,0.055623,0.196905,0.464256,0.0,0.028641
1,0.076598,0.131217,0.043864,0.061473,0.379999,0.292917,0.0,0.0
2,0.163516,0.05471,0.057267,0.0488,0.278653,0.367343,0.0165,0.013212
3,0.060407,0.111848,0.052424,0.04542,0.497444,0.19588,0.01414,0.022437
4,0.080818,0.113798,0.029826,0.032807,0.209167,0.469323,0.032304,0.031957


In [651]:
corpus_topic_df.to_csv('./models/corpus_topic_df.csv', index=False)

## Save LDA Model to disk

In [652]:
g_dict.save('./models/g_dict')

In [653]:
lda_model.save('./models/lda_model')

In [None]:
# lda_model = ldamodel.LdaModel.load('./models/lda_model')

## Automatic Topic Labeling - Gensim Word2Vec

Download the `text8` corpus through the Gensim downloader API to build a Word2Vec model.

In [3]:
api_corpus = api.load('text8')



In [4]:
wv_model = Word2Vec(api_corpus, 
                    size=100, 
                    window=10, 
                    min_count=2, 
                    sg=1,
                    workers=4
)

In [570]:
wv_model.wv.most_similar('animal')

[('animals', 0.8601729869842529),
 ('insect', 0.7537721395492554),
 ('predatory', 0.7357544898986816),
 ('husbandry', 0.7356052994728088),
 ('human', 0.7284411787986755),
 ('carrion', 0.7254469394683838),
 ('livers', 0.7162550687789917),
 ('instinct', 0.7133628129959106),
 ('excrement', 0.7114371657371521),
 ('carnivore', 0.7063280344009399)]
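The scores returned by `most_similar` are cosine similarities between word vectors. A minimal stdlib sketch of that measure is shown below; the three short vectors are hypothetical stand-ins for the 100-dimensional embeddings trained above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so the 0.86 for "animals" above indicates a vector nearly parallel to that of "animal".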