# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 2 - LDA Topic Modeling

### Table of Contents

[**2.1 Gensim LDA**](#2.1-Gensim-LDA)
- [2.1.1 Build Dictionary and Corpora](#2.1.1-Build-Dictionary-and-Corpora)
- [2.1.2 LDA Models](#2.1.2-LDA-Models)
- [2.1.3 Visualize with pyLDAvis](#2.1.3-Visualize-with-pyLDAvis)
- [2.1.4 Optimization for Number of Topics](#2.1.4-Optimization-for-Number-of-Topics)

[**2.2 Topic Labeling with Gensim Word2Vec**](#2.2-Topic-Labeling-with-Gensim-Word2Vec)
- [2.2.1 Word2Vec](#2.2.1-Word2Vec)
- [2.2.2 Identify Vectors for Specificity](#2.2.2-Identify-Vectors-for-Specificity)
- [2.2.3 Get Topic Labels](#2.2.3-Get-Topic-Labels)

[**2.3 LDA Features for Corpora**](#2.3-LDA-Features-for-Corpora)
- [2.3.1 Training Text LDA](#2.3.1-Training-Text-LDA)
- [2.3.2 Testing Text LDA](#2.3.2-Testing-Text-LDA)

**Libraries**

In [1]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from IPython.display import clear_output

# Natural Language Processing
import spacy
from nltk.stem import PorterStemmer

# Gensim
import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec
import pyLDAvis.gensim

# Modeling Prep
from sklearn.preprocessing import StandardScaler
import joblib

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## 2.1 Gensim LDA

**Load Preprocessed Text Data**

In [2]:
nlp_df = pd.read_csv('../data_eda/nlp_df.csv')
t_nlp_df = pd.read_csv('../data_eda/t_nlp_df.csv')

In [147]:
nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,The Categories,0,0,0,0,Things are said to be named 'equivocally' when...,"['thing', 'be', 'say', 'to', 'be', 'name', 'eq...",Things are said to be named 'equivocally' when...,"['thing', 'be', 'say', 'to', 'be', 'name', 'eq..."
1,Aristotle,The Categories,0,0,0,1,"Thus, a real man and a figure in a picture can...","['thus', 'real', 'man', 'and', 'figure', 'in',...",Things are said to be named 'equivocally' when...,"['thing', 'be', 'say', 'to', 'be', 'name', 'eq..."
2,Aristotle,The Categories,0,0,0,2,For should any one define in what sense each i...,"['should', 'any', 'one', 'define', 'in', 'what...",Things are said to be named 'equivocally' when...,"['thing', 'be', 'say', 'to', 'be', 'name', 'eq..."
3,Aristotle,The Categories,0,0,1,0,"On the other hand, things are said to be named...","['on', 'other', 'hand', 'thing', 'be', 'say', ...","On the other hand, things are said to be named...","['on', 'other', 'hand', 'thing', 'be', 'say', ..."
4,Aristotle,The Categories,0,0,1,1,"A man and an ox are both 'animal', and these a...","['man', 'and', 'an', 'ox', 'be', 'both', 'anim...","On the other hand, things are said to be named...","['on', 'other', 'hand', 'thing', 'be', 'say', ..."


In [4]:
word_count = sum([sent.strip().count(' ') for sent in nlp_df.sent_text.tolist()]) + nlp_df.shape[0]
word_count

1887382

In [149]:
nlp_df.a_num.value_counts()

9     9661
11    7813
4     6513
17    6295
7     6121
12    5887
13    4186
16    3520
15    3172
1     3133
14    2674
6     2598
18    2097
5     1657
2     1289
10    1181
0     1118
19    1112
3      596
8      299
Name: a_num, dtype: int64

In [148]:
nlp_df.shape

(70922, 10)

In [9]:
t_nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,Ethics,0,0,0,0,"Every art, and every science reduced to a teac...","['every', 'art', 'and', 'every', 'science', 'r...","Every art, and every science reduced to a teac...","['every', 'art', 'and', 'every', 'science', 'r..."
1,Aristotle,Ethics,0,0,1,0,Now there plainly is a difference in the Ends ...,"['now', 'there', 'plainly', 'be', 'difference'...",Now there plainly is a difference in the Ends ...,"['now', 'there', 'plainly', 'be', 'difference'..."
2,Aristotle,Ethics,0,0,1,1,"Again, since actions and arts and sciences are...","['again', 'since', 'action', 'and', 'art', 'an...",Now there plainly is a difference in the Ends ...,"['now', 'there', 'plainly', 'be', 'difference'..."
3,Aristotle,Ethics,0,0,2,0,"And whatever of such actions, arts, or science...","['and', 'whatev', 'of', 'such', 'action', 'art...","And whatever of such actions, arts, or science...","['and', 'whatev', 'of', 'such', 'action', 'art..."
4,Aristotle,Ethics,0,0,3,0,(And in this comparison it makes no difference...,"['and', 'in', 'this', 'comparison', 'make', 'n...",(And in this comparison it makes no difference...,"['and', 'in', 'this', 'comparison', 'make', 'n..."


In [10]:
t_nlp_df.shape

(8395, 10)

### 2.1.1 Build Dictionary and Corpora

**Gensim Dictionary `g_dict`**

In [68]:
stopwords = ['xxiv','xxvii','xxxi','xxxvii','xxx','xxv','liii','xlvi','xxxviii','liv']

In [69]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary([ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma])
len(g_dict)

25682

In [70]:
for word in stopwords:
    g_dict.filter_tokens(bad_ids=[g_dict.token2id[word]])

In [71]:
len(g_dict)

25672

**Remove Outliers from Dictionary**

In [72]:
g_dict.filter_extremes(no_below=5, no_above=0.85, keep_n=10000)
len(g_dict)

9075
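
The filtering step above can be illustrated with a minimal pure-Python sketch. It assumes `filter_extremes` keeps tokens whose document frequency is at least `no_below` and at most `no_above` times the number of documents, then retains the `keep_n` most frequent survivors. The function name `filter_extremes_sketch` and the toy documents are hypothetical, and the alphabetical tie-breaking is an assumption made only so the example is deterministic.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.85, keep_n=10000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: number of documents containing each token at least once
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))  # absolute cap on document frequency
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically here)
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return set(kept[:keep_n])

# Hypothetical toy corpus: 'be' is too common, 'xxiv' is too rare
toy_docs = [['be', 'virtue'], ['be', 'soul'], ['be', 'virtue', 'xxiv'], ['be', 'soul']]
print(sorted(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.85, keep_n=10)))
# ['soul', 'virtue']
```

With the notebook's settings, very common function words would only be dropped if they appear in more than 85% of sentences, which may explain why words such as "be" and "of" still appear in the topics below.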

**Bag of Words (BoW) Corpora**

Training Text Corpus

In [73]:
# Build corpus of normalized text relative to dictionary
bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in nlp_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in nlp_df.par_lemma]

In [74]:
len(bow_corpus_s)

70922

In [75]:
len(bow_corpus_s[0])

14

In [76]:
len(bow_corpus_p)

70922

In [77]:
len(bow_corpus_p[0])

41

Testing Text Corpus

In [78]:
# Build corpus of normalized text relative to dictionary
t_bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in t_nlp_df.sent_lemma]
t_bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in t_nlp_df.par_lemma]

In [79]:
len(t_bow_corpus_p)

8395

In [80]:
len(t_bow_corpus_p[0])

31
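
As a rough illustration of what each bag-of-words entry contains, the sketch below mimics the behavior of `doc2bow` in plain Python: tokens are mapped to integer ids, counted, and tokens missing from the dictionary are skipped. The helper `doc2bow_sketch` and the toy vocabulary are hypothetical and are not part of Gensim.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

token2id = {'be': 0, 'name': 1, 'thing': 2}   # hypothetical toy vocabulary
tokens = ['thing', 'be', 'say', 'to', 'be', 'name']
print(doc2bow_sketch(tokens, token2id))  # [(0, 2), (1, 1), (2, 1)]
```

This is why `len(bow_corpus_s[0])` counts distinct in-vocabulary tokens rather than the total number of words in the sentence.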

**TF-IDF Vectorization**

In [81]:
tfidf = TfidfModel(bow_corpus_s, normalize=True)

In [82]:
corpus_s = tfidf[bow_corpus_s]

In [83]:
len(corpus_s)

70922

In [84]:
corpus_p = tfidf[bow_corpus_p]

In [85]:
t_corpus_s = tfidf[t_bow_corpus_s]

In [86]:
len(t_corpus_s)

8395

In [87]:
t_corpus_p = tfidf[t_bow_corpus_p]

**Save TF-IDF Model to Disk**

In [88]:
tfidf.save('./models/tfidf')
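
The TF-IDF weighting applied above can be sketched in plain Python. This assumes Gensim's default scheme (raw term frequency multiplied by a base-2 log of inverse document frequency) followed by scaling each document to unit L2 length, which is what `normalize=True` is understood to do. The function `tfidf_sketch` and the toy counts are hypothetical.

```python
import math

def tfidf_sketch(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by tf * log2(N / df), then L2-normalize."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights]

# Hypothetical toy values: token 0 appears in every document, token 2 in only one
doc_freq = {0: 4, 1: 2, 2: 1}
weighted = tfidf_sketch([(0, 2), (1, 1), (2, 1)], doc_freq, num_docs=4)
for token_id, weight in weighted:
    print(token_id, round(weight, 3))
```

A token that occurs in every document receives a weight of zero under this scheme, so rare terms dominate the vectors passed to LDA.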

### 2.1.2 LDA Models

**Set Parameter Values**

In [100]:
sent_param= {
    'num_topics':32,
    'random_state':211,
    'chunksize':5000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

par_param= {
    'num_topics':20,
    'random_state':211,
    'chunksize':1000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

**LDA Multicore Model - Sentences**

In [101]:
# Instantiate model based on parameter values
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        eval_every=sent_param['eval_every'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers'],
                                        alpha='symmetric'
)

In [102]:
lda_multi_s.print_topics()

[(15,
  '0.009*"and" + 0.009*"to" + 0.008*"of" + 0.008*"be" + 0.007*"in" + 0.006*"that" + 0.006*"have" + 0.006*"as" + 0.005*"not" + 0.005*"which"'),
 (25,
  '0.008*"absurd" + 0.008*"of" + 0.008*"be" + 0.008*"to" + 0.007*"and" + 0.006*"partly" + 0.006*"which" + 0.006*"amount" + 0.006*"have" + 0.006*"in"'),
 (26,
  '0.010*"and" + 0.009*"of" + 0.009*"to" + 0.008*"be" + 0.007*"modification" + 0.007*"in" + 0.007*"have" + 0.006*"that" + 0.006*"not" + 0.006*"france"'),
 (4,
  '0.015*"kung" + 0.013*"soldier" + 0.011*"chang" + 0.010*"yu" + 0.008*"be" + 0.008*"to" + 0.008*"say" + 0.007*"of" + 0.007*"everyone" + 0.007*"in"'),
 (5,
  '0.009*"of" + 0.009*"in" + 0.008*"be" + 0.008*"to" + 0.007*"that" + 0.007*"and" + 0.007*"associate" + 0.006*"socrates" + 0.006*"repeat" + 0.006*"as"'),
 (29,
  '0.009*"to" + 0.009*"of" + 0.008*"be" + 0.008*"and" + 0.008*"in" + 0.007*"government" + 0.007*"citizen" + 0.007*"have" + 0.006*"as" + 0.006*"that"'),
 (18,
  '0.008*"sun" + 0.008*"be" + 0.008*"and" + 0.007*"as"

**LDA Metrics - Sentences**

In [103]:
lda_multi_s.log_perplexity(corpus_s)

-10.594897182885989
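
Note that `log_perplexity` returns a per-word likelihood bound rather than a perplexity. A minimal sketch of the conversion, assuming the bound is expressed in base 2 as in Gensim's own log messages (the helper name `bound_to_perplexity` is hypothetical):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

sentence_bound = -10.594897182885989   # value reported above for the sentence model
print(round(bound_to_perplexity(sentence_bound), 1))
```

Because the model was fit on TF-IDF weights rather than raw counts, the resulting figure is best used to compare runs within this notebook rather than against published perplexities.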

In [104]:
cm_s = CoherenceModel(model=lda_multi_s, texts=[ast.literal_eval(sent) for sent in nlp_df.sent_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_s.get_coherence()

0.47530049779106176

**LDA Multicore Model - Paragraphs**

In [94]:
# Instantiate model based on parameter values
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        eval_every=par_param['eval_every'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers'],
                                        alpha='symmetric'
)

In [95]:
lda_multi_p.print_topics()

[(0,
  '0.100*"ye" + 0.030*"christian" + 0.026*"ah" + 0.012*"spake" + 0.012*"boy" + 0.010*"council" + 0.009*"commander" + 0.009*"ordain" + 0.009*"meeting" + 0.008*"quiet"'),
 (1,
  '0.026*"local" + 0.021*"relieve" + 0.021*"heavy" + 0.021*"yea" + 0.018*"brave" + 0.016*"bar" + 0.016*"cage" + 0.016*"courageous" + 0.015*"equilibrium" + 0.013*"yellow"'),
 (2,
  '0.013*"and" + 0.009*"that" + 0.008*"to" + 0.008*"god" + 0.008*"say" + 0.008*"be" + 0.007*"intuition" + 0.007*"of" + 0.007*"day" + 0.006*"not"'),
 (3,
  '0.011*"to" + 0.010*"of" + 0.010*"be" + 0.010*"that" + 0.009*"and" + 0.008*"in" + 0.008*"have" + 0.007*"as" + 0.007*"not" + 0.007*"which"'),
 (4,
  '0.071*"shih" + 0.044*"subordinate" + 0.031*"ching" + 0.031*"snow" + 0.029*"hsien" + 0.027*"shu" + 0.020*"bitter" + 0.020*"shang" + 0.020*"wei" + 0.017*"po"'),
 (5,
  '0.025*"civil" + 0.016*"intellect" + 0.013*"wang" + 0.012*"war" + 0.012*"enemy" + 0.011*"defeat" + 0.011*"opponent" + 0.010*"contract" + 0.009*"justice" + 0.009*"hsi"'),
 (6

**LDA Metrics - Paragraphs**

In [96]:
lda_multi_p.log_perplexity(corpus_p)

-12.224433383844822

In [97]:
cm_p = CoherenceModel(model=lda_multi_p, texts=[ast.literal_eval(par) for par in nlp_df.par_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_p.get_coherence()

0.4043455432103852

### 2.1.3 Visualize with pyLDAvis

In [None]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

In [99]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.4 Optimization for Number of Topics

**Set Optimizing Parameters**

In [105]:
sent_opt_params = {
    'num_topics':[28,32,36],
    'random_state':210,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [106]:
par_opt_params = {
    'num_topics':[16,20,24],
    'random_state':210,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

**Sentence LDA**

In [107]:
# Grid search over candidate topic counts, recording perplexity bound and coherence
sent_texts = [ast.literal_eval(sent) for sent in nlp_df.sent_lemma]
results_s = []

for num_topics in tqdm(sent_opt_params['num_topics']):
    lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                            id2word=g_dict,
                                            num_topics=num_topics,
                                            random_state=sent_opt_params['random_state'],
                                            chunksize=sent_opt_params['chunksize'],
                                            passes=sent_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=sent_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_s, texts=sent_texts, dictionary=g_dict, coherence='c_v')
    # Collect one result row per candidate instead of appending DataFrames
    results_s.append({'num_topics': num_topics,
                      'perplexity': lda_multi_s.log_perplexity(corpus_s),
                      'coherence': cm.get_coherence()})

lda_multi_s_df = pd.DataFrame(results_s, columns=['num_topics','perplexity','coherence'])
lda_multi_s_df

**Paragraph LDA**

In [None]:
# Grid search over candidate topic counts, recording perplexity bound and coherence
par_texts = [ast.literal_eval(par) for par in nlp_df.par_lemma]
results_p = []

for num_topics in tqdm(par_opt_params['num_topics']):
    lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                            id2word=g_dict,
                                            num_topics=num_topics,
                                            random_state=par_opt_params['random_state'],
                                            chunksize=par_opt_params['chunksize'],
                                            passes=par_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=par_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_p, texts=par_texts, dictionary=g_dict, coherence='c_v')
    # Collect one result row per candidate instead of appending DataFrames
    results_p.append({'num_topics': num_topics,
                      'perplexity': lda_multi_p.log_perplexity(corpus_p),
                      'coherence': cm.get_coherence()})

lda_multi_p_df = pd.DataFrame(results_p, columns=['num_topics','perplexity','coherence'])
lda_multi_p_df

**Create Empty Optimizing DataFrames**

In [None]:
lda_sent = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
lda_par = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

**Add Latest Results to DataFrames**

In [None]:
lda_sent = pd.concat([lda_sent, lda_multi_s_df])

In [None]:
lda_par = pd.concat([lda_par, lda_multi_p_df])

**Final Sentence LDA Model**

In [108]:
sent_params = {
    'num_topics':32,
    'random_state':210,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [109]:
# Run model
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_params['num_topics'],
                                        random_state=sent_params['random_state'],
                                        chunksize=sent_params['chunksize'],
                                        passes=sent_params['passes'],
                                        per_word_topics=True,
                                        workers=sent_params['workers'], 
                                        alpha='symmetric'
                                       )

In [75]:
# Save model to disk
lda_multi_s.save('./models/lda_multi_s')

**Final Paragraph LDA Model**

In [110]:
par_params = {
    'num_topics':20,
    'random_state':210,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

In [111]:
# Run model
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_params['num_topics'],
                                        random_state=par_params['random_state'],
                                        chunksize=par_params['chunksize'],
                                        passes=par_params['passes'],
                                        per_word_topics=True,
                                        workers=par_params['workers'], 
                                        alpha='symmetric'
                                       )

In [77]:
# Save model to disk
lda_multi_p.save('./models/lda_multi_p')

### 2.1.5 Get Topic Values for Corpora

**Load LDA Models**

In [2]:
lda_multi_s = ldamulticore.LdaMulticore.load('./models/lda_multi_s')

In [4]:
lda_multi_p = ldamulticore.LdaMulticore.load('./models/lda_multi_p')

**Sentences**

In [112]:
corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(corpus_s))):
    for topic, proba in doc:
        corpus_topic_df_s.loc[i, topic] = proba

In [113]:
corpus_topic_df_s.fillna(0, inplace=True)

In [114]:
corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.0,0.0,0.0,0.0,0.0,0.764414,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058748,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.771878,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.838629,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Paragraphs**

In [115]:
corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(corpus_p))):
    for topic, proba in doc:
        corpus_topic_df_p.loc[i, topic] = proba

In [116]:
corpus_topic_df_p.fillna(0, inplace=True)
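
The loops above fill each DataFrame one cell at a time and then replace missing values with 0, because `get_document_topics` returns only the `(topic, probability)` pairs above its minimum-probability threshold. A possible alternative, sketched here in plain Python, is to build dense rows first. The helper `dense_topic_row` and the sample pairs are hypothetical.

```python
def dense_topic_row(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    row = [0.0] * num_topics
    for topic_id, proba in doc_topics:
        row[topic_id] = proba
    return row

# Hypothetical sparse outputs in the shape returned per document
sparse_docs = [[(5, 0.76)], [(1, 0.07), (3, 0.46)], []]
rows = [dense_topic_row(doc, num_topics=8) for doc in sparse_docs]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` in a single call would avoid repeated `.loc` assignment, which tends to be slow for tens of thousands of documents.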

**Save LDA Values**

In [117]:
corpus_topic_df_s.to_csv('../data_vec/corpus_topic_df_s.csv', index=False)

In [118]:
corpus_topic_df_p.to_csv('../data_vec/corpus_topic_df_p.csv', index=False)

**Test Data**

In [123]:
t_corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(t_corpus_s))):
    for topic, proba in doc:
        t_corpus_topic_df_s.loc[i, topic] = proba

In [132]:
t_corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.831242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.60195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.074247,0.0,0.456575,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.327097,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [125]:
t_corpus_topic_df_s.shape

(8395, 32)

In [126]:
t_corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(t_corpus_p))):
    for topic, proba in doc:
        t_corpus_topic_df_p.loc[i, topic] = proba

In [133]:
t_corpus_topic_df_p.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.835277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.818124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.818124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.828708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.826057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [128]:
t_corpus_topic_df_p.shape

(8395, 20)

In [130]:
t_corpus_topic_df_s.fillna(0, inplace=True)
t_corpus_topic_df_p.fillna(0, inplace=True)

**Save to Disk**

In [131]:
t_corpus_topic_df_s.to_csv('../data_vec/t_corpus_topic_df_s.csv', index=False)
t_corpus_topic_df_p.to_csv('../data_vec/t_corpus_topic_df_p.csv', index=False)

## 2.2 Topic Labeling with Gensim Word2Vec

### NOTE: This functionality is not included in the final model

### 2.2.1 Word2Vec

**Load Text8 Wikipedia Corpus via Gensim Downloader**

In [3]:
text8_corpus = api.load('text8')

In [4]:
wv_vecsize_s = 32
wv_vecsize_p = 16

**Train Word2Vec Model**

Sentences

In [5]:
wv_model_s = Word2Vec(text8_corpus, 
                    size=wv_vecsize_s, 
                    window=2, 
                    min_count=2, 
                    sg=0,
                    workers=3
)

Paragraphs

In [6]:
wv_model_p = Word2Vec(text8_corpus, 
                    size=wv_vecsize_p, 
                    window=2, 
                    min_count=2, 
                    sg=0,
                    workers=3
)

**Save Word2Vec Models to Disk**

In [None]:
# Use distinct paths so the paragraph model does not overwrite the sentence model
wv_model_s.save('../ec2_models/wv_model_s')
wv_model_p.save('../ec2_models/wv_model_p')

### 2.2.2 Identify Vectors for Specificity

**Get Vectors for _Tiger_ >> _Animal_ for Sentences and Paragraphs**

In [None]:
# Look up word vectors through the model's KeyedVectors (.wv)
tiger_s = wv_model_s.wv['tiger']
cat_s = wv_model_s.wv['cat']
mammal_s = wv_model_s.wv['mammal']
animal_s = wv_model_s.wv['animal']

In [None]:
tiger_p = wv_model_p.wv['tiger']
cat_p = wv_model_p.wv['cat']
mammal_p = wv_model_p.wv['mammal']
animal_p = wv_model_p.wv['animal']

**Check for Specificity Vectors**

Sentences

In [None]:
# i_min_dif and min_dif are initialized but never updated in this notebook
i_min_dif = None
min_dif = 999999

for i in range(wv_vecsize_s):
    y_vals = [tiger_s[i], cat_s[i], mammal_s[i], animal_s[i]]

    # Check for elements with unidirectionality
    if tiger_s[i] < cat_s[i] < mammal_s[i] < animal_s[i]:
        print(i, 'ascending')
        print(y_vals)
    elif tiger_s[i] > cat_s[i] > mammal_s[i] > animal_s[i]:
        print(i, 'descending')
        print(y_vals)

print('')
print(i_min_dif)
print(min_dif)
# Index the vectors only if a dimension was actually selected
if i_min_dif is not None:
    print([tiger_s[i_min_dif], cat_s[i_min_dif], mammal_s[i_min_dif], animal_s[i_min_dif]])

Paragraphs

In [None]:
# i_min_dif and min_dif are initialized but never updated in this notebook
i_min_dif = None
min_dif = 999999

for i in range(wv_vecsize_p):
    y_vals = [tiger_p[i], cat_p[i], mammal_p[i], animal_p[i]]

    # Check for elements with unidirectionality
    if tiger_p[i] < cat_p[i] < mammal_p[i] < animal_p[i]:
        print(i, 'ascending')
        print(y_vals)
    elif tiger_p[i] > cat_p[i] > mammal_p[i] > animal_p[i]:
        print(i, 'descending')
        print(y_vals)

print('')
print(i_min_dif)
print(min_dif)
# Index the vectors only if a dimension was actually selected
if i_min_dif is not None:
    print([tiger_p[i_min_dif], cat_p[i_min_dif], mammal_p[i_min_dif], animal_p[i_min_dif]])
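
The unidirectionality check used in both cells can be factored into one helper. The sketch below is a hypothetical generalization that works on plain lists: given vectors ordered from most specific to most general, it reports every dimension whose values strictly rise or strictly fall along that ordering. The name `monotonic_dimensions` and the toy vectors are assumptions for illustration only.

```python
def monotonic_dimensions(vectors):
    """Find dimensions that change strictly in one direction across the vectors.

    vectors: equal-length sequences ordered from most specific to most general.
    """
    results = []
    for i in range(len(vectors[0])):
        vals = [vec[i] for vec in vectors]
        pairs = list(zip(vals, vals[1:]))  # consecutive (specific, general) pairs
        if all(a < b for a, b in pairs):
            results.append((i, 'ascending'))
        elif all(a > b for a, b in pairs):
            results.append((i, 'descending'))
    return results

# Hypothetical 3-dimensional vectors standing in for tiger, cat, mammal, animal
toy = [[0.1, 0.9, 0.3], [0.2, 0.7, 0.1], [0.3, 0.4, 0.5], [0.4, 0.2, 0.2]]
print(monotonic_dimensions(toy))  # [(0, 'ascending'), (1, 'descending')]
```

The same function could be called once with the sentence-model vectors and once with the paragraph-model vectors.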

**Adjust Specificity Values**

### 2.2.3 Get Topic Labels

In [None]:
topics_s = lda_multi_s.get_topics()
topics_p = lda_multi_p.get_topics()

## 2.3 LDA Features for Corpora

### 2.3.1 Training Text LDA

In [134]:
colname_s = ['lda_s_'+str(i) for i in range(corpus_topic_df_s.shape[1])]

In [136]:
corpus_topic_df_s.columns = colname_s

In [120]:
corpus_topic_df_s.shape

(70922, 32)

In [137]:
colname_p = ['lda_p_'+str(i) for i in range(corpus_topic_df_p.shape[1])]

In [138]:
corpus_topic_df_p.columns = colname_p

In [122]:
corpus_topic_df_p.shape

(70922, 20)

In [139]:
lda_train = pd.merge(corpus_topic_df_s, corpus_topic_df_p, left_index=True, right_index=True)

In [151]:
lda_train = lda_train.merge(nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

In [154]:
lda_train.head()

Unnamed: 0,lda_s_0,lda_s_1,lda_s_2,lda_s_3,lda_s_4,lda_s_5,lda_s_6,lda_s_7,lda_s_8,lda_s_9,...,lda_p_13,lda_p_14,lda_p_15,lda_p_16,lda_p_17,lda_p_18,lda_p_19,a_num,p_num,s_num
0,0.0,0.0,0.0,0.0,0.0,0.764414,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058748,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.771878,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.838629,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,1


### 2.3.2 Testing Text LDA

In [141]:
t_corpus_topic_df_s.columns = colname_s

In [142]:
t_corpus_topic_df_p.columns = colname_p

In [143]:
lda_test = pd.merge(t_corpus_topic_df_s, t_corpus_topic_df_p, left_index=True, right_index=True)

In [152]:
lda_test = lda_test.merge(t_nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

In [153]:
lda_test.head()

Unnamed: 0,lda_s_0,lda_s_1,lda_s_2,lda_s_3,lda_s_4,lda_s_5,lda_s_6,lda_s_7,lda_s_8,lda_s_9,...,lda_p_13,lda_p_14,lda_p_15,lda_p_16,lda_p_17,lda_p_18,lda_p_19,a_num,p_num,s_num
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0
1,0.60195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0
2,0.0,0.074247,0.0,0.456575,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,2,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,3,0


**Save Files to Disk**

In [155]:
lda_train.to_csv('../data_vec/lda_train.csv', index=False)
lda_test.to_csv('../data_vec/lda_test.csv', index=False)

## Continue to Notebook 3: Document Vectors