# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 2 - LDA Topic Modeling

### Table of Contents

[**2.1 Gensim LDA**](#2.1-Gensim-LDA)
- [2.1.1 Build Dictionary and Corpora](#2.1.1-Build-Dictionary-and-Corpora)
- [2.1.2 LDA Models](#2.1.2-LDA-Models)
- [2.1.3 Visualize with pyLDAvis](#2.1.3-Visualize-with-pyLDAvis)
- [2.1.4 Optimization for Number of Topics](#2.1.4-Optimization-for-Number-of-Topics)

[**2.2 Topic Labeling with Gensim Word2Vec**](#2.2-Topic-Labeling-with-Gensim-Word2Vec)
- [2.2.1 Word2Vec](#2.2.1-Word2Vec)
- [2.2.2 Identify Vectors for Specificity](#2.2.2-Identify-Vectors-for-Specificity)
- [2.2.3 Get Topic Labels](#2.2.3-Get-Topic-Labels)

[**2.3 LDA Features for Corpora**](#2.3-LDA-Features-for-Corpora)
- [2.3.1 Training Text LDA](#2.3.1-Training-Text-LDA)
- [2.3.2 Testing Text LDA](#2.3.2-Testing-Text-LDA)

**Libraries**

In [54]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from IPython.display import clear_output

# Natural Language Processing
import spacy
from nltk.stem import PorterStemmer

# Gensim
import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec
import pyLDAvis.gensim

# Modeling Prep
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## 2.1 Gensim LDA

**Load Preprocessed Text Data**

In [2]:
nlp_df = pd.read_csv('../ga_dsi_capstone_ec2only/data_eda/nlp_df.csv')
t_nlp_df = pd.read_csv('../ga_dsi_capstone_ec2only/data_eda/t_nlp_df.csv')

In [3]:
nlp_df.head()

| | author | work | a_num | w_num | p_num | s_num | sent_text | sent_lemma | par_text | par_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aristotle | The Categories | 0 | 0 | 0 | 0 | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... |
| 1 | Aristotle | The Categories | 0 | 0 | 0 | 1 | Thus, a real man and a figure in a picture can... | ['thus', 'real', 'man', 'and', 'figure', 'in',... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... |
| 2 | Aristotle | The Categories | 0 | 0 | 0 | 2 | For should any one define in what sense each i... | ['should', 'any', 'one', 'define', 'in', 'what... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... |
| 3 | Aristotle | The Categories | 0 | 0 | 1 | 0 | On the other hand, things are said to be named... | ['on', 'other', 'hand', 'thing', 'be', 'say', ... | On the other hand, things are said to be named... | ['on', 'other', 'hand', 'thing', 'be', 'say', ... |
| 4 | Aristotle | The Categories | 0 | 0 | 1 | 1 | A man and an ox are both 'animal', and these a... | ['man', 'and', 'an', 'ox', 'be', 'both', 'anim... | On the other hand, things are said to be named... | ['on', 'other', 'hand', 'thing', 'be', 'say', ... |


In [4]:
nlp_df.shape

(70922, 10)

In [5]:
t_nlp_df.head()

| | author | work | a_num | w_num | p_num | s_num | sent_text | sent_lemma | par_text | par_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aristotle | Ethics | 0 | 0 | 0 | 0 | Every art, and every science reduced to a teac... | ['every', 'art', 'and', 'every', 'science', 'r... | Every art, and every science reduced to a teac... | ['every', 'art', 'and', 'every', 'science', 'r... |
| 1 | Aristotle | Ethics | 0 | 0 | 1 | 0 | Now there plainly is a difference in the Ends ... | ['now', 'there', 'plainly', 'be', 'difference'... | Now there plainly is a difference in the Ends ... | ['now', 'there', 'plainly', 'be', 'difference'... |
| 2 | Aristotle | Ethics | 0 | 0 | 1 | 1 | Again, since actions and arts and sciences are... | ['again', 'since', 'action', 'and', 'art', 'an... | Now there plainly is a difference in the Ends ... | ['now', 'there', 'plainly', 'be', 'difference'... |
| 3 | Aristotle | Ethics | 0 | 0 | 2 | 0 | And whatever of such actions, arts, or science... | ['and', 'whatev', 'of', 'such', 'action', 'art... | And whatever of such actions, arts, or science... | ['and', 'whatev', 'of', 'such', 'action', 'art... |
| 4 | Aristotle | Ethics | 0 | 0 | 3 | 0 | (And in this comparison it makes no difference... | ['and', 'in', 'this', 'comparison', 'make', 'n... | (And in this comparison it makes no difference... | ['and', 'in', 'this', 'comparison', 'make', 'n... |


In [6]:
t_nlp_df.shape

(8395, 10)

### 2.1.1 Build Dictionary and Corpora

**Gensim Dictionary `g_dict`**

In [23]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary([ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma])
len(g_dict)

25722
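
The `sent_lemma` and `par_lemma` columns come back from `pd.read_csv` as strings that merely look like Python lists, which is why every row passes through `ast.literal_eval` before it reaches `Dictionary`. A minimal sketch of that round trip, using a shortened, made-up cell value rather than actual corpus data:

```python
import ast

# Hypothetical cell value, shaped like the sent_lemma strings in nlp_df.head()
raw_cell = "['thing', 'be', 'say', 'to', 'be', 'name']"

# literal_eval parses Python literals only, so it does not execute arbitrary code
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, '->', type(tokens).__name__)
print(tokens)
```

Because the same columns are parsed again for the corpora and the coherence models, parsing each column once and reusing the resulting lists would likely save time on a corpus of this size.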

**Remove Outliers from Dictionary**

In [24]:
g_dict.filter_extremes(no_below=3, no_above=0.88, keep_n=18000)
len(g_dict)

15792
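
To make the three `filter_extremes` arguments concrete, here is a simplified pure-Python sketch of the filtering rule as I understand it: drop tokens that appear in fewer than `no_below` documents or in more than a `no_above` fraction of documents, then keep at most `keep_n` of the most frequent survivors. The function name and toy documents are hypothetical, and this is not Gensim's actual implementation.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: number of documents containing each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))
    kept = [tok for tok, n in doc_freq.items() if no_below <= n <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically)
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

# Toy corpus: 'a' is in every document, 'c' and 'd' in only one each
toy_docs = [['a', 'b'], ['a', 'c'], ['a', 'b', 'd'], ['a', 'b']]
survivors = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.8, keep_n=10)
print(survivors)
```

In the toy corpus only `'b'` survives: `'a'` is too common and `'c'`, `'d'` are too rare. In the notebook, the drop from 25722 to 15792 terms is below `keep_n=18000`, which suggests the frequency thresholds, not the cap, did the pruning.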

**Bag of Words (BoW) Corpora**

Training Text Corpus

In [25]:
# Build corpus of normalized text relative to dictionary
bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in nlp_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in nlp_df.par_lemma]

In [26]:
len(bow_corpus_s)

70922

In [27]:
len(bow_corpus_s[0])

15

In [28]:
len(bow_corpus_p)

70922

In [29]:
len(bow_corpus_p[0])

42

Testing Text Corpus

In [30]:
# Build corpus of normalized text relative to dictionary
t_bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in t_nlp_df.sent_lemma]
t_bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in t_nlp_df.par_lemma]

In [31]:
len(t_bow_corpus_p)

8395

In [32]:
len(t_bow_corpus_p[0])

31
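
Note that the testing corpora are encoded with the dictionary built from the training text, so any token the dictionary does not contain is silently dropped. A small sketch of the bag-of-words mapping, with a hypothetical `token2id` vocabulary standing in for `g_dict`:

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; 'zebra' is deliberately absent
token2id = {'man': 0, 'ox': 1, 'animal': 2}
bow = doc2bow_sketch(['man', 'ox', 'man', 'zebra'], token2id)
print(bow)
```

This is why `len(t_bow_corpus_p[0])` counts distinct known terms in the first test paragraph, not its total number of words.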

**TF-IDF Vectorization**

In [33]:
tfidf = TfidfModel(bow_corpus_s, normalize=True)

In [34]:
corpus_s = tfidf[bow_corpus_s]

In [35]:
len(corpus_s)

70922

In [36]:
corpus_p = tfidf[bow_corpus_p]

In [37]:
t_corpus_s = tfidf[t_bow_corpus_s]

In [38]:
len(t_corpus_s)

8395

In [39]:
t_corpus_p = tfidf[t_bow_corpus_p]
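
For intuition about what `tfidf[...]` returns, the sketch below reweights one bag-of-words vector by hand. It assumes Gensim's default weighting (raw term frequency times log base 2 of total documents over document frequency) followed by the unit-length scaling that `normalize=True` requests; the document frequencies and the vector itself are invented for illustration.

```python
import math

def tfidf_sketch(bow, doc_freqs, num_docs):
    """Reweight (token_id, count) pairs by TF-IDF and scale to unit L2 norm."""
    weighted = [(tid, tf * math.log2(num_docs / doc_freqs[tid])) for tid, tf in bow]
    # Terms present in every document get an IDF of zero and drop out
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

# Hypothetical statistics: token 2 occurs in all 4 documents
doc_freqs = {0: 1, 1: 2, 2: 4}
vec = tfidf_sketch([(0, 1), (1, 2), (2, 3)], doc_freqs, num_docs=4)
print(vec)
```

Because `tfidf` was fitted on `bow_corpus_s` only, the paragraph and testing corpora are all weighted with sentence-level document frequencies.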

**Save TF-IDF Model to Disk**

In [41]:
tfidf.save('./models/tfidf')

### 2.1.2 LDA Models

**Set Parameter Values**

In [42]:
sent_param = {
    'num_topics':16,
    'random_state':210,
    'chunksize':5000,
    'eval_every':10,
    'passes':5,
    'workers':3
}

par_param = {
    'num_topics':8,
    'random_state':210,
    'chunksize':1000,
    'eval_every':10,
    'passes':5,
    'workers':3
}

**LDA Multicore Model - Sentences**

In [43]:
# Instantiate model based on parameter values
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        eval_every=sent_param['eval_every'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers']
)
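
Since every key in `sent_param` matches a keyword argument used in the call above, the dictionary could also be unpacked with `**` instead of being indexed key by key. The sketch below shows the idiom with a hypothetical stand-in function, `train_stub`, so that it runs without training a model; it is not a Gensim API.

```python
def train_stub(corpus, id2word, per_word_topics=False, **params):
    """Hypothetical stand-in for a model constructor; returns what it received."""
    return {'per_word_topics': per_word_topics, **params}

# Same keys and values as sent_param in the notebook
sent_param_demo = {
    'num_topics': 16,
    'random_state': 210,
    'chunksize': 5000,
    'eval_every': 10,
    'passes': 5,
    'workers': 3,
}

# ** expands the dictionary into keyword arguments
received = train_stub(corpus=[], id2word={}, per_word_topics=True, **sent_param_demo)
print(received)
```

The trade-off is that a misspelled key would surface as an unexpected-keyword error at call time, which is arguably easier to spot than a silent default.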

In [44]:
lda_multi_s.print_topics()

[(0,
  '0.007*"s" + 0.007*"to" + 0.006*"be" + 0.006*"of" + 0.006*"and" + 0.006*"xxix" + 0.005*"in" + 0.005*"not" + 0.005*"that" + 0.005*"socrates"'),
 (1,
  '0.009*"and" + 0.009*"to" + 0.008*"say" + 0.008*"thou" + 0.007*"be" + 0.007*"of" + 0.006*"that" + 0.006*"god" + 0.006*"not" + 0.006*"have"'),
 (2,
  '0.011*"idea" + 0.011*"be" + 0.010*"that" + 0.010*"of" + 0.009*"in" + 0.009*"to" + 0.008*"as" + 0.008*"and" + 0.008*"have" + 0.007*"or"'),
 (3,
  '0.009*"emotion" + 0.008*"of" + 0.007*"be" + 0.007*"in" + 0.007*"to" + 0.006*"and" + 0.005*"this" + 0.005*"as" + 0.005*"that" + 0.005*"which"'),
 (4,
  '0.012*"of" + 0.010*"in" + 0.009*"be" + 0.009*"which" + 0.008*"to" + 0.008*"as" + 0.008*"this" + 0.007*"conception" + 0.007*"that" + 0.007*"not"'),
 (5,
  '0.033*"prop" + 0.008*"love" + 0.008*"and" + 0.008*"to" + 0.007*"be" + 0.007*"of" + 0.007*"as" + 0.006*"in" + 0.006*"pleasure" + 0.006*"by"'),
 (6,
  '0.011*"and" + 0.010*"to" + 0.009*"of" + 0.008*"be" + 0.007*"have" + 0.007*"in" + 0.007*"th

**LDA Metrics - Sentences**

In [45]:
lda_multi_s.log_perplexity(corpus_s)

-9.866216883253053
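
The value returned by `log_perplexity` is a per-word likelihood bound, not the perplexity itself. Assuming the convention Gensim uses in its own log messages, where perplexity is 2 raised to the negative bound, the conversion is a one-liner:

```python
# Per-word bound reported above for the sentence model
per_word_bound = -9.866216883253053

# Assumed convention: perplexity = 2 ** (-bound); lower perplexity is better
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

Keep in mind that the bound here is computed on the same corpus the model was trained on, so it says little about how the model generalizes.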

In [48]:
cm_s = CoherenceModel(model=lda_multi_s, texts=[ast.literal_eval(sent) for sent in nlp_df.sent_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_s.get_coherence()

0.4989356266514545

**LDA Multicore Model - Paragraphs**

In [49]:
# Instantiate model based on parameter values
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        eval_every=par_param['eval_every'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers']
)

In [50]:
lda_multi_p.print_topics()

[(0,
  '0.034*"xl" + 0.023*"dog" + 0.018*"xxxiii" + 0.017*"stuff" + 0.011*"xxxiv" + 0.011*"xxxvi" + 0.011*"specially" + 0.010*"repentance" + 0.010*"lend" + 0.010*"xxxv"'),
 (1,
  '0.013*"tzu" + 0.012*"tu" + 0.012*"yu" + 0.010*"say" + 0.010*"mu" + 0.010*"spy" + 0.010*"enemy" + 0.009*"chang" + 0.009*"li" + 0.008*"kung"'),
 (2,
  '0.010*"of" + 0.009*"be" + 0.008*"to" + 0.008*"in" + 0.008*"as" + 0.007*"that" + 0.007*"and" + 0.007*"which" + 0.006*"or" + 0.006*"have"'),
 (3,
  '0.017*"glorious" + 0.015*"xlv" + 0.012*"ride" + 0.011*"yoke" + 0.010*"appreciate" + 0.009*"transform" + 0.009*"xlvi" + 0.007*"shun" + 0.007*"cripple" + 0.006*"fetter"'),
 (4,
  '0.022*"surface" + 0.019*"adequately" + 0.017*"objection" + 0.016*"simultaneous" + 0.012*"cæsar" + 0.010*"snow" + 0.008*"centre" + 0.008*"inner" + 0.008*"project" + 0.008*"thoroughly"'),
 (5,
  '0.015*"de" + 0.014*"c" + 0.011*"la" + 0.011*"b" + 0.010*"d" + 0.010*"et" + 0.010*"xliv" + 0.009*"xliii" + 0.009*"illustrate" + 0.008*"à"'),
 (6,
  '0.0

**LDA Metrics - Paragraphs**

In [51]:
lda_multi_p.log_perplexity(corpus_p)

-8.597665828808871

In [52]:
cm_p = CoherenceModel(model=lda_multi_p, texts=[ast.literal_eval(par) for par in nlp_df.par_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_p.get_coherence()

0.49757637118692033

### 2.1.3 Visualize with pyLDAvis

In [55]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

In [56]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.4 Optimization for Number of Topics

**Set Optimizing Parameters**

In [68]:
sent_opt_params = {
    'num_topics':[12,14,16,18],
    'random_state':210,
    'chunksize':5000,
    'passes':1,
    'workers':3
}

In [69]:
par_opt_params = {
    'num_topics':[4,6,8,10],
    'random_state':210,
    'chunksize':1000,
    'passes':1,
    'workers':3
}

**Sentence LDA**

In [72]:
lda_multi_s_df = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

for nt in tqdm(range(len(sent_opt_params['num_topics']))):
    # One-row frame holding the metrics for this candidate number of topics
    temp_df = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
    temp_df.loc[nt, 'num_topics'] = sent_opt_params['num_topics'][nt]

    lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                            id2word=g_dict,
                                            num_topics=sent_opt_params['num_topics'][nt],
                                            random_state=sent_opt_params['random_state'],
                                            chunksize=sent_opt_params['chunksize'],
                                            passes=sent_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=sent_opt_params['workers'])
    temp_df.perplexity = lda_multi_s.log_perplexity(corpus_s)
    cm = CoherenceModel(model=lda_multi_s, texts=[ast.literal_eval(sent) for sent in nlp_df.sent_lemma],
                        dictionary=g_dict, coherence='c_v')
    temp_df.coherence = cm.get_coherence()
    # DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row index
    lda_multi_s_df = pd.concat([lda_multi_s_df, temp_df])
                
lda_multi_s_df

| | num_topics | perplexity | coherence |
|---|---|---|---|
| 0 | 12 | -9.698559 | 0.567684 |
| 1 | 14 | -9.884078 | 0.568675 |
| 2 | 16 | -10.075981 | 0.535987 |
| 3 | 18 | -10.289799 | 0.521392 |


**Paragraph LDA**

In [74]:
lda_multi_p_df = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

for nt in tqdm(range(len(par_opt_params['num_topics']))):
    # One-row frame holding the metrics for this candidate number of topics
    temp_df = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
    temp_df.loc[nt, 'num_topics'] = par_opt_params['num_topics'][nt]

    lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                            id2word=g_dict,
                                            num_topics=par_opt_params['num_topics'][nt],
                                            random_state=par_opt_params['random_state'],
                                            chunksize=par_opt_params['chunksize'],
                                            passes=par_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=par_opt_params['workers'])
    temp_df.perplexity = lda_multi_p.log_perplexity(corpus_p)
    cm = CoherenceModel(model=lda_multi_p, texts=[ast.literal_eval(par) for par in nlp_df.par_lemma],
                        dictionary=g_dict, coherence='c_v')
    temp_df.coherence = cm.get_coherence()
    # DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row index
    lda_multi_p_df = pd.concat([lda_multi_p_df, temp_df])
                
lda_multi_p_df

| | num_topics | perplexity | coherence |
|---|---|---|---|
| 0 | 4 | -8.307919 | 0.254286 |
| 1 | 6 | -8.51489 | 0.31203 |
| 2 | 8 | -8.853865 | 0.294095 |
| 3 | 10 | -9.597312 | 0.37986 |
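
One simple way to read the two result tables is to pick the candidate with the highest `c_v` coherence. The sketch below does that with the values reported above; treating coherence alone as the selection criterion is an assumption for illustration, and these are single-pass runs, so the ranking may shift with more passes.

```python
# (num_topics, log perplexity bound, c_v coherence) copied from the tables above
sent_results = [(12, -9.698559, 0.567684), (14, -9.884078, 0.568675),
                (16, -10.075981, 0.535987), (18, -10.289799, 0.521392)]
par_results = [(4, -8.307919, 0.254286), (6, -8.51489, 0.31203),
               (8, -8.853865, 0.294095), (10, -9.597312, 0.37986)]

def best_by_coherence(results):
    """Return the num_topics value whose run has the highest coherence."""
    return max(results, key=lambda row: row[2])[0]

best_sent = best_by_coherence(sent_results)
best_par = best_by_coherence(par_results)
print(best_sent, best_par)
```

For the sentence models the gap between 12 and 14 topics is about 0.001, which is small enough that either choice seems defensible on this evidence.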


**Create Empty Optimization DataFrames**

In [None]:
lda_sent = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
lda_par = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

**Add Latest Results to DataFrames**

In [None]:
lda_sent = pd.concat([lda_sent, lda_multi_s_df])

In [None]:
lda_par = pd.concat([lda_par, lda_multi_p_df])

**Final Sentence LDA Model**

In [None]:
# Run model
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        eval_every=sent_param['eval_every'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers']
)

In [None]:
# Save model to disk
lda_multi_s.save('../ga_dsi_capstone_ec2only/models/lda_multi_s')

**Final Paragraph LDA Model**

In [None]:
# Run model
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        eval_every=par_param['eval_every'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers']
)

In [None]:
# Save model to disk
lda_multi_p.save('../ga_dsi_capstone_ec2only/models/lda_multi_p')

## 2.2 Topic Labeling with Gensim Word2Vec

### 2.2.1 Word2Vec

**Load Gensim Wikipedia Text Corpora**

In [84]:
text8_corpus = api.load('text8')

In [75]:
wiki_corpus = api.load('wiki-english-20171001')

In [82]:
wv_vecsize_s = 32
wv_vecsize_p = 16

**Train Word2Vec Model**

Sentences

In [86]:
wv_model_s = Word2Vec(text8_corpus, 
                    size=wv_vecsize_s, 
                    window=2, 
                    min_count=2, 
                    # sg=0,
                    # workers=3
)

In [None]:
wv_model_sw = Word2Vec(wiki_corpus, 
                    size=wv_vecsize_s, 
                    window=2, 
                    min_count=2, 
                    # sg=0,
                    # workers=3
)

Paragraphs

In [87]:
wv_model_p = Word2Vec(text8_corpus, 
                    size=wv_vecsize_p, 
                    window=2, 
                    min_count=2, 
                    #sg=0,
                    #workers=3
)

In [None]:
wv_model_pw = Word2Vec(wiki_corpus, 
                    size=wv_vecsize_p, 
                    window=2, 
                    min_count=2, 
                    #sg=0,
                    #workers=3
)

**Save Word2Vec Models to Disk**

In [None]:
# Use distinct filenames so the paragraph model does not overwrite the sentence model
wv_model_s.save('../ga_dsi_capstone_ec2only/models/wv_model_s')
wv_model_p.save('../ga_dsi_capstone_ec2only/models/wv_model_p')

### 2.2.2 Identify Vectors for Specificity

**Get Vectors for _Tiger_ >> _Animal_ for Sentences and Paragraphs**

In [None]:
tiger_s = wv_model_s.wv['tiger']
cat_s = wv_model_s.wv['cat']
mammal_s = wv_model_s.wv['mammal']
animal_s = wv_model_s.wv['animal']

In [None]:
tiger_p = wv_model_p.wv['tiger']
cat_p = wv_model_p.wv['cat']
mammal_p = wv_model_p.wv['mammal']
animal_p = wv_model_p.wv['animal']

**Check for Specificity Vectors**

Sentences

In [None]:
# Placeholders for the dimension with the smallest difference; they are
# initialized once here and are not yet updated anywhere in this notebook
i_min_dif = None
min_dif = 999999

for i in range(wv_vecsize_s):
    y_vals = [tiger_s[i],cat_s[i],mammal_s[i],animal_s[i]]

    # Check for elements with unidirectionality
    if tiger_s[i] < cat_s[i] < mammal_s[i] < animal_s[i]:
        print(i, 'ascending')
        print(y_vals)
    elif tiger_s[i] > cat_s[i] > mammal_s[i] > animal_s[i]:
        print(i, 'descending')
        print(y_vals)

print('')
print(i_min_dif)
print(min_dif)
# Only index the vectors once a dimension has actually been selected
if i_min_dif is not None:
    print([tiger_s[i_min_dif],cat_s[i_min_dif],mammal_s[i_min_dif],animal_s[i_min_dif]])

Paragraphs

In [None]:
# Placeholders for the dimension with the smallest difference; they are
# initialized once here and are not yet updated anywhere in this notebook
i_min_dif = None
min_dif = 999999

for i in range(wv_vecsize_p):
    y_vals = [tiger_p[i],cat_p[i],mammal_p[i],animal_p[i]]

    # Check for elements with unidirectionality
    if tiger_p[i] < cat_p[i] < mammal_p[i] < animal_p[i]:
        print(i, 'ascending')
        print(y_vals)
    elif tiger_p[i] > cat_p[i] > mammal_p[i] > animal_p[i]:
        print(i, 'descending')
        print(y_vals)

print('')
print(i_min_dif)
print(min_dif)
# Only index the vectors once a dimension has actually been selected
if i_min_dif is not None:
    print([tiger_p[i_min_dif],cat_p[i_min_dif],mammal_p[i_min_dif],animal_p[i_min_dif]])
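
The two loops above apply the same test: a dimension is a candidate "specificity" axis if its value rises or falls strictly along the chain tiger, cat, mammal, animal. A compact sketch of that test follows, using made-up three-dimensional vectors in place of real Word2Vec output; the helper name `monotonic_dims` is hypothetical.

```python
def monotonic_dims(vectors):
    """Find dimensions that change strictly monotonically across the vectors.

    vectors: equal-length sequences ordered from most to least specific.
    Returns {dimension_index: 'ascending' | 'descending'}.
    """
    found = {}
    for i in range(len(vectors[0])):
        vals = [vec[i] for vec in vectors]
        steps = list(zip(vals, vals[1:]))
        if all(a < b for a, b in steps):
            found[i] = 'ascending'
        elif all(a > b for a, b in steps):
            found[i] = 'descending'
    return found

# Invented toy vectors for tiger, cat, mammal, animal (not model output)
toy_chain = [
    [0.1, 0.9, 0.3],
    [0.2, 0.7, 0.5],
    [0.3, 0.4, 0.4],
    [0.4, 0.1, 0.6],
]
dims = monotonic_dims(toy_chain)
print(dims)
```

With only four words and 16 or 32 dimensions, some dimensions may pass this test by chance, so checking a second chain of words (a hypothetical example would be oak, tree, plant) could help separate signal from coincidence.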

**Adjust Specificity Values**

### 2.2.3 Get Topic Labels

In [None]:
topics_s = lda_multi_s.get_topics()
topics_p = lda_multi_p.get_topics()
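
`get_topics()` returns a matrix with one row per topic and one column per dictionary term, where each entry is that term's probability under the topic. The notebook does not show how labels are derived from it; one plausible first step, sketched here under that assumption, is to pull the highest-probability terms for each topic. The toy matrix and `id2word` mapping are invented for illustration.

```python
def top_terms(topic_term, id2word, n=3):
    """Return the n highest-probability terms for each topic row."""
    labels = []
    for row in topic_term:
        best = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        labels.append([id2word[j] for j in best])
    return labels

# Hypothetical 2-topic, 4-term matrix; each row sums to 1
id2word = {0: 'idea', 1: 'emotion', 2: 'love', 3: 'god'}
topic_term = [
    [0.60, 0.10, 0.10, 0.20],
    [0.05, 0.50, 0.40, 0.05],
]
candidates = top_terms(topic_term, id2word, n=2)
print(candidates)
```

With the real models, `g_dict` would play the role of `id2word`, and the candidate terms could then be looked up in the Word2Vec models from section 2.2.1.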

## 2.3 LDA Features for Corpora

### 2.3.1 Training Text LDA

### 2.3.2 Testing Text LDA

## Continue to Notebook 3: Document Vectors