# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 2 - LDA Topic Modeling

### Table of Contents

[**2.1 Gensim LDA**](#2.1-Gensim-LDA)
- [2.1.1 Build Dictionary and Corpora](#2.1.1-Build-Dictionary-and-Corpora)
- [2.1.2 LDA Models](#2.1.2-LDA-Models)
- [2.1.3 Visualize with pyLDAvis](#2.1.3-Visualize-with-pyLDAvis)
- [2.1.4 Optimization for Number of Topics](#2.1.4-Optimization-for-Number-of-Topics)

[**2.2 Topic Labeling with Gensim Word2Vec**](#2.2-Topic-Labeling-with-Gensim-Word2Vec)
- [2.2.1 Word2Vec](#2.2.1-Word2Vec)
- [2.2.2 Identify Vectors for Specificity](#2.2.2-Identify-Vectors-for-Specificity)
- [2.2.3 Get Topic Labels](#2.2.3-Get-Topic-Labels)

[**2.3 LDA Features for Corpora**](#2.3-LDA-Features-for-Corpora)
- [2.3.1 Training Text LDA](#2.3.1-Training-Text-LDA)
- [2.3.2 Testing Text LDA](#2.3.2-Testing-Text-LDA)

**Libraries**

In [1]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output

# Natural Language Processing
import spacy
from nltk.stem import PorterStemmer

# Gensim
import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec
import pyLDAvis.gensim

# Modeling Prep
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib is deprecated and removed in newer scikit-learn

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## 2.1 Gensim LDA

**Load Preprocessed Text Data**

In [2]:
nlp_df = pd.read_csv('../data_eda/nlp_df.csv')
t_nlp_df = pd.read_csv('../data_eda/t_nlp_df.csv')

In [3]:
nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,The Categories,0,0,0,0,Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
1,Aristotle,The Categories,0,0,0,1,"Thus, a real man and a figure in a picture can...","['real', 'man', 'figure', 'picture', 'lay', 'c...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
2,Aristotle,The Categories,0,0,0,2,For should any one define in what sense each i...,"['define', 'sense', 'animal', 'definition', 'c...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
3,Aristotle,The Categories,0,0,1,0,"On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',...","On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',..."
4,Aristotle,The Categories,0,0,1,1,"A man and an ox are both 'animal', and these a...","['man', 'ox', 'animal', 'univocally', 'name', ...","On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',..."


In [4]:
word_count = sum([sent.strip().count(' ') for sent in nlp_df.sent_text.tolist()]) + nlp_df.shape[0]
word_count

1875949

In [5]:
nlp_df.a_num.value_counts()

9     8417
11    7556
4     6160
7     6067
12    5497
13    3982
17    3910
16    3426
15    3171
14    2631
6     2492
1     1955
18    1715
5     1566
2     1205
10    1159
19    1119
0     1111
3      569
8      224
Name: a_num, dtype: int64

In [6]:
nlp_df.shape

(63932, 10)

In [7]:
nlp_df[nlp_df.sent_lemma.str.len()<10].shape

(945, 10)

In [8]:
t_nlp_df[t_nlp_df.sent_lemma.str.len() < 10].shape

(107, 10)

In [9]:
t_nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,Ethics,0,0,0,0,"Every art, and every science reduced to a teac...","['art', 'science', 'reduce', 'teachable', 'for...","Every art, and every science reduced to a teac...","['art', 'science', 'reduce', 'teachable', 'for..."
1,Aristotle,Ethics,0,0,1,0,Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '...",Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '..."
2,Aristotle,Ethics,0,0,1,1,"Again, since actions and arts and sciences are...","['action', 'art', 'science', 'ends', 'likewise...",Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '..."
3,Aristotle,Ethics,0,0,2,0,"And whatever of such actions, arts, or science...","['action', 'art', 'science', 'range', 'faculty...","And whatever of such actions, arts, or science...","['action', 'art', 'science', 'range', 'faculty..."
4,Aristotle,Ethics,0,0,3,0,(And in this comparison it makes no difference...,"['comparison', 'make', 'difference', 'act', 'w...",(And in this comparison it makes no difference...,"['comparison', 'make', 'difference', 'act', 'w..."


In [10]:
t_nlp_df.shape

(7935, 10)

### 2.1.1 Build Dictionary and Corpora

**Gensim Dictionary `g_dict`**

In [11]:
# Load the stopword list saved during preprocessing
with open('../data_eda/sw.pkl', 'rb') as pkl:
    stopwords = pickle.load(pkl)

In [12]:
stopwords[:10]

['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x']

In [13]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary([ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma])
len(g_dict)

25343

In [14]:
# Remove any stopwords still present in the dictionary
bad_ids = [g_dict.token2id[word] for word in stopwords if word in g_dict.token2id]
g_dict.filter_tokens(bad_ids=bad_ids)

In [15]:
len(g_dict)

25343

**Remove Outliers from Dictionary**

In [16]:
g_dict.filter_extremes(no_below=4, no_above=0.88, keep_n=10000)
len(g_dict)

10000
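
For intuition, the filtering step above can be sketched in plain Python. This is only an approximation of what `filter_extremes` does with `no_below`, `no_above`, and `keep_n`, not Gensim's implementation; the helper `filter_extremes_sketch` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Approximate Dictionary.filter_extremes on a list of token lists."""
    num_docs = len(docs)
    # Document frequency: number of documents containing each token
    dfs = Counter(token for doc in docs for token in set(doc))
    # Keep tokens that are neither too rare nor too common
    kept = [tok for tok, df in dfs.items()
            if no_below <= df <= no_above * num_docs]
    # Of the survivors, keep the keep_n most frequent (ties broken alphabetically)
    kept.sort(key=lambda tok: (-dfs[tok], tok))
    return kept[:keep_n]

# Hypothetical toy corpus of lemmatized sentences
toy_docs = [
    ['man', 'virtue', 'good'],
    ['man', 'state', 'good'],
    ['man', 'state', 'law'],
    ['man', 'soul'],
]
vocab = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.88, keep_n=10)
print(vocab)
```

In the notebook the dictionary ends at exactly 10,000 terms, which suggests that the `keep_n` cap, rather than the frequency thresholds alone, set the final vocabulary size.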

In [17]:
g_dict.save('../models/g_dict')

**Bag of Words (BoW) Corpora**

Training Text Corpus

In [18]:
# Build corpus of normalized text relative to dictionary
bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in nlp_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in nlp_df.par_lemma]

In [19]:
len(bow_corpus_s)

63932

In [20]:
len(bow_corpus_s[0])

7

In [21]:
len(bow_corpus_p)

63932

In [22]:
len(bow_corpus_p[0])

18
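
The lengths above count distinct in-vocabulary terms, not tokens. A minimal sketch of the bag-of-words conversion, assuming `doc2bow` returns sorted `(token_id, count)` pairs and silently drops tokens that are not in the dictionary; `doc2bow_sketch` and the toy vocabulary are hypothetical.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical toy vocabulary; 'real' is deliberately left out
token2id = {'man': 0, 'figure': 1, 'picture': 2}
sentence = ['real', 'man', 'figure', 'picture', 'man']
bow = doc2bow_sketch(sentence, token2id)
print(bow)
```

Because the dictionary was trimmed before the corpora were built, a sentence can contain more lemmas than its bag-of-words vector has entries.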

Testing Text Corpus

In [23]:
# Build corpus of normalized text relative to dictionary
t_bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in t_nlp_df.sent_lemma]
t_bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in t_nlp_df.par_lemma]

In [24]:
len(t_bow_corpus_p)

7935

In [25]:
len(t_bow_corpus_p[0])

18

**TF-IDF Vectorization**

In [26]:
tfidf = TfidfModel(bow_corpus_s, normalize=True)

In [27]:
corpus_s = tfidf[bow_corpus_s]

In [28]:
len(corpus_s)

63932

In [29]:
corpus_p = tfidf[bow_corpus_p]

In [30]:
t_corpus_s = tfidf[t_bow_corpus_s]

In [31]:
len(t_corpus_s)

7935

In [32]:
t_corpus_p = tfidf[t_bow_corpus_p]

**Save TF-IDF Model to Disk**

In [33]:
tfidf.save('../models/tfidf')
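
The weighting applied by the TF-IDF model can be sketched as follows. This assumes Gensim's default scheme (raw term count times log2(N / df)) followed by the L2 normalization requested with `normalize=True`; `tfidf_sketch` and the toy corpus are hypothetical.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (term_id, count) by count * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    # Document frequency of each term id
    dfs = {}
    for doc in bow_corpus:
        for term_id, _ in doc:
            dfs[term_id] = dfs.get(term_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(term_id, count * math.log2(num_docs / dfs[term_id]))
                   for term_id, count in doc]
        # Terms that occur in every document get weight 0 and are dropped
        weights = [(term_id, w) for term_id, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(term_id, w / norm) for term_id, w in weights])
    return weighted_corpus

# Hypothetical toy corpus: term 0 appears in every document
toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 1), (1, 1), (2, 3)]]
toy_tfidf = tfidf_sketch(toy_bow)
print(toy_tfidf[2])
```

Note that the model is fit on the training sentences only and then applied to the paragraph and test corpora, so all four corpora share the same document frequencies.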

### 2.1.2 LDA Models

**Set Parameter Values**

In [34]:
sent_param= {
    'num_topics':32,
    'random_state':211,
    'chunksize':5000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

par_param= {
    'num_topics':20,
    'random_state':211,
    'chunksize':1000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

**LDA Multicore Model - Sentences**

In [35]:
# Instantiate model based on parameter values
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        eval_every=sent_param['eval_every'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers'],
                                        alpha='symmetric'
)

In [36]:
lda_multi_s.print_topics()

[(28,
  '0.014*"answer" + 0.007*"experiment" + 0.007*"plan" + 0.006*"follow" + 0.005*"refer" + 0.005*"man" + 0.004*"intuitive" + 0.004*"concerned" + 0.004*"compel" + 0.004*"question"'),
 (14,
  '0.012*"thy" + 0.011*"word" + 0.010*"sovereign" + 0.008*"verily" + 0.006*"speak" + 0.006*"volition" + 0.006*"right" + 0.006*"corollary" + 0.005*"man" + 0.005*"thou"'),
 (30,
  '0.011*"thee" + 0.010*"tax" + 0.005*"pay" + 0.005*"army" + 0.005*"die" + 0.005*"man" + 0.005*"colour" + 0.005*"gold" + 0.004*"sanction" + 0.004*"yellow"'),
 (6,
  '0.007*"politic" + 0.007*"let" + 0.006*"hat" + 0.006*"enemy" + 0.006*"man" + 0.006*"forget" + 0.005*"return" + 0.005*"wait" + 0.005*"ask" + 0.004*"passage"'),
 (13,
  '0.008*"love" + 0.006*"idea" + 0.006*"active" + 0.006*"hatred" + 0.006*"old" + 0.006*"accuracy" + 0.005*"man" + 0.005*"woman" + 0.005*"voice" + 0.005*"hardly"'),
 (17,
  '0.028*"ye" + 0.009*"taxis" + 0.007*"battle" + 0.006*"thyself" + 0.006*"virtue" + 0.006*"man" + 0.005*"high" + 0.005*"shall" + 0.0

**LDA Metrics - Sentences**

In [37]:
lda_multi_s.log_perplexity(corpus_s)

-10.718886352759409
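
`log_perplexity` returns a per-word likelihood bound rather than perplexity itself. A short worked conversion, assuming Gensim's documented convention that perplexity is 2 raised to the negative of that bound:

```python
# Per-word likelihood bound copied from the log_perplexity output above
per_word_bound = -10.718886352759409

# Perplexity estimate under the base-2 convention
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

A bound closer to zero therefore corresponds to a lower perplexity.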

In [38]:
cm_s = CoherenceModel(model=lda_multi_s, texts=[ast.literal_eval(sent) for sent in nlp_df.sent_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_s.get_coherence()

0.4023148255404584

**LDA Multicore Model - Paragraphs**

In [39]:
# Instantiate model based on parameter values
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        eval_every=par_param['eval_every'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers'],
                                        alpha='symmetric'
)

In [40]:
lda_multi_p.print_topics()

[(0,
  '0.038*"prop" + 0.014*"war" + 0.009*"fellow" + 0.008*"today" + 0.007*"law" + 0.007*"enemy" + 0.007*"evil" + 0.007*"check" + 0.006*"declare" + 0.006*"punish"'),
 (1,
  '0.052*"civil" + 0.035*"morality" + 0.021*"shed" + 0.016*"phrase" + 0.012*"legitimate" + 0.010*"attributable" + 0.009*"administer" + 0.009*"vegetable" + 0.008*"melancholy" + 0.008*"fighting"'),
 (2,
  '0.010*"god" + 0.008*"man" + 0.007*"religion" + 0.007*"right" + 0.006*"king" + 0.006*"political" + 0.005*"law" + 0.005*"power" + 0.005*"church" + 0.004*"authority"'),
 (3,
  '0.016*"government" + 0.013*"tax" + 0.009*"army" + 0.009*"ye" + 0.009*"pay" + 0.008*"nation" + 0.007*"state" + 0.007*"country" + 0.006*"people" + 0.006*"man"'),
 (4,
  '0.014*"god" + 0.014*"thy" + 0.008*"dust" + 0.007*"spirit" + 0.007*"man" + 0.007*"thou" + 0.006*"wealth" + 0.005*"hath" + 0.005*"agriculture" + 0.005*"devil"'),
 (5,
  '0.011*"man" + 0.007*"state" + 0.006*"right" + 0.005*"good" + 0.005*"government" + 0.004*"society" + 0.004*"great" 

**LDA Metrics - Paragraphs**

In [41]:
lda_multi_p.log_perplexity(corpus_p)

-9.606194617283998

In [42]:
cm_p = CoherenceModel(model=lda_multi_p, texts=[ast.literal_eval(par) for par in nlp_df.par_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_p.get_coherence()

0.44013935060726117

### 2.1.3 Visualize with pyLDAvis

In [43]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

In [44]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.4 Optimization for Number of Topics

**Set Optimizing Parameters**

In [45]:
sent_opt_params = {
    'num_topics':[28,30,32,34],
    'random_state':211,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [46]:
par_opt_params = {
    'num_topics':[16,18,20,22],
    'random_state':211,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

**Sentence LDA**

*Estimated Run Time: ~6 min*

In [47]:
# Collect one row of metrics per candidate number of topics
rows = []
sent_texts = [ast.literal_eval(sent) for sent in nlp_df.sent_lemma]

for num_topics in tqdm(sent_opt_params['num_topics']):
    lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                            id2word=g_dict,
                                            num_topics=num_topics,
                                            random_state=sent_opt_params['random_state'],
                                            chunksize=sent_opt_params['chunksize'],
                                            passes=sent_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=sent_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_s, texts=sent_texts, dictionary=g_dict, coherence='c_v')
    rows.append({'num_topics': num_topics,
                 # log_perplexity returns the per-word likelihood bound
                 'perplexity': lda_multi_s.log_perplexity(corpus_s),
                 'coherence': cm.get_coherence()})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
lda_multi_s_df = pd.DataFrame(rows, columns=['num_topics','perplexity','coherence'])
lda_multi_s_df

100%|██████████| 4/4 [04:16<00:00, 64.02s/it]


Unnamed: 0,num_topics,perplexity,coherence
0,28,-10.649647,0.335278
1,30,-10.724126,0.325636
2,32,-10.910636,0.390817
3,34,-10.91029,0.345179


**Paragraph LDA**

*Estimated Run Time: ~21 min*

In [None]:
# Collect one row of metrics per candidate number of topics
rows = []
par_texts = [ast.literal_eval(par) for par in nlp_df.par_lemma]

for num_topics in tqdm(par_opt_params['num_topics']):
    lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                            id2word=g_dict,
                                            num_topics=num_topics,
                                            random_state=par_opt_params['random_state'],
                                            chunksize=par_opt_params['chunksize'],
                                            passes=par_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=par_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_p, texts=par_texts, dictionary=g_dict, coherence='c_v')
    rows.append({'num_topics': num_topics,
                 # log_perplexity returns the per-word likelihood bound
                 'perplexity': lda_multi_p.log_perplexity(corpus_p),
                 'coherence': cm.get_coherence()})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
lda_multi_p_df = pd.DataFrame(rows, columns=['num_topics','perplexity','coherence'])
lda_multi_p_df

 50%|█████     | 2/4 [07:31<07:29, 224.56s/it]

In [50]:
lda_multi_p_df

Unnamed: 0,num_topics,perplexity,coherence
0,16,-9.382303,0.467432
1,18,-9.545009,0.443164
2,20,-9.649862,0.436338
3,22,-9.772338,0.426773
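
The final models further down use 32 sentence topics and 16 paragraph topics, which match the highest c_v coherence in each table. A small sketch of that selection rule, assuming coherence alone drives the choice; `best_num_topics` is a hypothetical helper and the scores are copied from the two result tables.

```python
# (num_topics, c_v coherence) pairs copied from the two result tables above
sent_results = [(28, 0.335278), (30, 0.325636), (32, 0.390817), (34, 0.345179)]
par_results = [(16, 0.467432), (18, 0.443164), (20, 0.436338), (22, 0.426773)]

def best_num_topics(results):
    """Return the num_topics value with the highest coherence score."""
    return max(results, key=lambda pair: pair[1])[0]

best_sent = best_num_topics(sent_results)
best_par = best_num_topics(par_results)
print(best_sent, best_par)
```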


**Create Empty Optimizing DataFrames**

In [51]:
lda_sent = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
lda_par = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

**Add Latest Results to DataFrames**

In [52]:
lda_sent = pd.concat([lda_sent, lda_multi_s_df])

In [53]:
lda_par = pd.concat([lda_par, lda_multi_p_df])

**Final Sentence LDA Model**

In [54]:
sent_params = {
    'num_topics':32,
    'random_state':211,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [55]:
# Run model
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_params['num_topics'],
                                        random_state=sent_params['random_state'],
                                        chunksize=sent_params['chunksize'],
                                        passes=sent_params['passes'],
                                        per_word_topics=True,
                                        workers=sent_params['workers'], 
                                        alpha='symmetric'
                                       )

In [56]:
# Save model to disk
lda_multi_s.save('../models/lda_multi_s')

In [57]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

**Final Paragraph LDA Model**

In [58]:
par_params = {
    'num_topics':16,
    'random_state':211,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

In [59]:
# Run model
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_params['num_topics'],
                                        random_state=par_params['random_state'],
                                        chunksize=par_params['chunksize'],
                                        passes=par_params['passes'],
                                        per_word_topics=True,
                                        workers=par_params['workers'], 
                                        alpha='symmetric'
                                       )

In [60]:
# Save model to disk
lda_multi_p.save('../models/lda_multi_p')

In [61]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.5 Get Topic Values for Corpora

**Load LDA Models**

In [None]:
lda_multi_s = ldamulticore.LdaMulticore.load('../models/lda_multi_s')

In [None]:
lda_multi_p = ldamulticore.LdaMulticore.load('../models/lda_multi_p')

**Sentences**

*Estimated Run Time: ~2.5 hr (projected from the progress bar below)*

In [None]:
corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(corpus_s))):
    for topic, proba in doc:
        corpus_topic_df_s.loc[i, topic] = proba

 34%|███▍      | 21654/63932 [33:10<1:55:28,  6.10it/s]
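
Assigning one cell at a time with `.loc` accounts for much of the run time shown above. A possible alternative is to expand each document's sparse topic list into a dense row first. This is a sketch, not the approach used in this notebook; `dense_topic_rows` is a hypothetical helper, and the toy input stands in for the output of `get_document_topics`.

```python
def dense_topic_rows(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) lists into fixed-width rows of floats."""
    rows = []
    for doc in doc_topics:
        row = [0.0] * num_topics          # topics missing from a document default to 0
        for topic_id, proba in doc:
            row[topic_id] = float(proba)
        rows.append(row)
    return rows

# Toy stand-in for lda_multi_s.get_document_topics(corpus_s)
toy_doc_topics = [[(8, 0.727455)], [(22, 0.373238), (26, 0.412698)], []]
rows = dense_topic_rows(toy_doc_topics, num_topics=32)
print(len(rows), len(rows[0]), rows[0][8])
```

Passing `rows` to `pd.DataFrame` in a single call should then give a table equivalent to the loop followed by `fillna(0)`.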

In [90]:
corpus_topic_df_s.fillna(0, inplace=True)

In [91]:
corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.727455,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.373238,0.0,0.0,0.0,0.412698,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.731888,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.774193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [92]:
corpus_topic_df_s.shape

(63932, 32)

In [93]:
corpus_topic_df_s.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
63927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63928,0.0,0.0,0.0,0.0,0.705821,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.786194,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63930,0.363093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.446122,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.083538,0.0,0.0,0.0


In [94]:
len(corpus_topic_df_s)

63932

In [95]:
corpus_topic_df_s.to_csv('../data_vec/corpus_topic_df_s.csv', index=False)

**Paragraphs**

In [96]:
corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(corpus_p))):
    for topic, proba in doc:
        corpus_topic_df_p.loc[i, topic] = proba

100%|██████████| 63932/63932 [3:14:00<00:00,  4.73it/s]  


In [97]:
corpus_topic_df_p.fillna(0, inplace=True)

**Save LDA Values**

In [98]:
corpus_topic_df_p.to_csv('../data_vec/corpus_topic_df_p.csv', index=False)

**Test Data**

_Estimated Run Time: ~6 min (sentences), ~3 min (paragraphs)_

In [99]:
t_corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(t_corpus_s))):
    for topic, proba in doc:
        t_corpus_topic_df_s.loc[i, topic] = proba

100%|██████████| 7935/7935 [06:00<00:00, 14.56it/s] 


In [100]:
t_corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,0.47376,,,,
2,,,,,,,,,,,...,,,,0.357535,0.371028,,,,,
3,,,,0.244073,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [101]:
t_corpus_topic_df_s.shape

(7935, 32)

In [102]:
t_corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(t_corpus_p))):
    for topic, proba in doc:
        t_corpus_topic_df_p.loc[i, topic] = proba

100%|██████████| 7935/7935 [03:18<00:00, 39.93it/s] 


In [103]:
t_corpus_topic_df_p.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0125562,0.0125562,0.0125562,0.0125562,0.556335,0.0125562,0.0125562,0.0125562,0.0125562,0.0125562,0.0125562,0.0125562,0.0125562,0.267877,0.0125562,0.0125562
1,0.0114548,0.0114548,0.621974,0.0114548,0.0493363,0.0114549,0.0114548,0.0114548,0.0436826,0.0114548,0.011455,0.147549,0.0114549,0.0114548,0.0114548,0.0114548
2,0.0114548,0.0114548,0.621941,0.0114548,0.0492712,0.0114549,0.0114548,0.0114548,0.0436778,0.0114548,0.011455,0.147652,0.0114549,0.0114548,0.0114548,0.0114548
3,0.0126437,0.0126438,0.470218,0.0126437,0.352769,0.0126437,0.0126439,0.0126437,0.0126438,0.0126437,0.0126437,0.0126438,0.0126437,0.0126438,0.0126437,0.0126437
4,0.0151475,0.0151475,0.772788,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151475,0.0151476,0.0151475,0.0151475


In [104]:
t_corpus_topic_df_p.shape

(7935, 16)

In [105]:
t_corpus_topic_df_s.fillna(0, inplace=True)
t_corpus_topic_df_p.fillna(0, inplace=True)

**Save to Disk**

In [106]:
t_corpus_topic_df_s.to_csv('../data_vec/t_corpus_topic_df_s.csv', index=False)
t_corpus_topic_df_p.to_csv('../data_vec/t_corpus_topic_df_p.csv', index=False)

## 2.2 Topic Labeling with Gensim Word2Vec

### 2.2.1 Word2Vec

**Load Gensim Text8 Wikipedia Text Corpus**

In [107]:
text8_corpus = api.load('text8')

In [108]:
wv_vecsize = 32

**Train Word2Vec Model**

In [109]:
wv_model = Word2Vec(text8_corpus, 
                    size=wv_vecsize, 
                    window=2, 
                    min_count=2, 
                    sg=0,
                    workers=3
)

**Save Word2Vec Model to Disk**

In [110]:
wv_model.save('../models/wv_model')

**Load Word2Vec Model**

In [None]:
wv_model = Word2Vec.load('../models/wv_model')

### 2.2.2 Identify Vectors for Specificity

**Target Vectors for Specificity (i.e., _Tiger_ >> _Animal_)**

In [117]:
spec_terms = [
    ['tiger','cat','mammal','animal'],
    #['emperor','king','man','human']
]

In [118]:
spec_vecs = []
for t_list in spec_terms:
    # Look up vectors through .wv; indexing the model directly is deprecated
    spec_vecs.append([wv_model.wv[term] for term in t_list])

In [119]:
#spec_vecs[list_num][term_num][vec_num]
spec_vecs[0][0][0]

-0.6262831

**Check for Specificity Vectors**

In [120]:
spec_tups = []
asc_list = []
desc_list = []

for i in range(len(spec_vecs)):
    for j in range(wv_vecsize):
        y_vals = [spec_vecs[i][k][j] for k in range(len(spec_vecs[i]))]

        # Check for elements with unidirectionality
        for m in range(len(y_vals)-1):
            if (y_vals[m] < y_vals[m+1]) and (m == len(y_vals)-2):
                asc_list.append(j)
            elif (y_vals[m] < y_vals[m+1]):
                continue
            else:
                break

        for n in range(len(y_vals)-1):
            if (y_vals[n] > y_vals[n+1]) and (n == len(y_vals)-2):
                desc_list.append(j)
            elif (y_vals[n] > y_vals[n+1]):
                continue
            else:
                break

for j in range(wv_vecsize):
    num_tups = []
    if asc_list.count(j) == len(spec_vecs):
        print(j, 'ascending')        
        for i in range(len(spec_vecs)):
            y_vals = [spec_vecs[i][k][j] for k in range(len(spec_vecs[i]))]
            num_tups.append((j, y_vals[0], y_vals[1]))

    if desc_list.count(j) == len(spec_vecs):
        print(j, 'descending')
        for i in range(len(spec_vecs)):
            y_vals = [spec_vecs[i][k][j] for k in range(len(spec_vecs[i]))]
            num_tups.append((j, y_vals[0], y_vals[1]))

    if len(num_tups) > 0:
        spec_tups.append(num_tups)

2 ascending
4 descending
6 descending
23 ascending
25 ascending
26 descending
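
The check above looks for vector dimensions whose values move in one direction as the terms go from specific to general. For a single term list, the same idea can be sketched more compactly; `monotonic_dimensions` and the toy vectors are hypothetical.

```python
def monotonic_dimensions(vectors):
    """Split dimension indices into strictly ascending and strictly descending lists."""
    ascending, descending = [], []
    for dim in range(len(vectors[0])):
        values = [vec[dim] for vec in vectors]
        pairs = list(zip(values, values[1:]))
        if all(a < b for a, b in pairs):
            ascending.append(dim)
        elif all(a > b for a, b in pairs):
            descending.append(dim)
    return ascending, descending

# Hypothetical 3-dimensional vectors ordered from specific to general
toy_vecs = [
    [0.1, 0.9, 0.5],   # 'tiger'
    [0.2, 0.7, 0.1],   # 'cat'
    [0.4, 0.6, 0.3],   # 'mammal'
    [0.8, 0.2, 0.2],   # 'animal'
]
asc, desc = monotonic_dimensions(toy_vecs)
print(asc, desc)
```

Unlike the cell above, this sketch does not require a dimension to be monotonic across several term lists at once.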


In [124]:
spec_tups

[[(2, -0.8940467, -0.5849001)],
 [(4, 1.4190073, 1.242038)],
 [(6, 0.47140855, -0.40742072)],
 [(23, -0.3608075, -0.2950177)],
 [(25, -1.1581342, -0.46998683)],
 [(26, 0.5083703, 0.3333232)]]

In [126]:
spec_adj = []
for i in range(len(spec_tups)):
    spec_adj.append((spec_tups[i][0][0], spec_tups[i][0][2] - spec_tups[i][0][1]))
spec_adj

[(2, 0.30914664),
 (4, -0.17696929),
 (6, -0.87882924),
 (23, 0.06578982),
 (25, 0.6881474),
 (26, -0.17504707)]

**Select vector dimension 6 as proxy for Specificity (largest absolute change above)**

In [139]:
spec_adj = spec_adj[2]

In [140]:
spec_adj

(6, -0.87882924)
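
One plausible reading of this choice is that dimension 6 shows the largest absolute change between the first two terms. Under that assumption, the selection could be made without a hard-coded list index; the values below are copied from the `spec_adj` output above.

```python
# (dimension, change from the first term to the second) pairs from the output above
spec_adj_all = [(2, 0.30914664), (4, -0.17696929), (6, -0.87882924),
                (23, 0.06578982), (25, 0.6881474), (26, -0.17504707)]

# Pick the dimension with the largest absolute change
strongest = max(spec_adj_all, key=lambda pair: abs(pair[1]))
print(strongest)
```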

### 2.2.3 Get Topic Labels

In [127]:
lda_multi_s.get_topic_terms(0)

[(11, 0.007872979),
 (1584, 0.00778076),
 (1420, 0.005396178),
 (8722, 0.0048204805),
 (736, 0.0047878237),
 (6383, 0.0037499499),
 (177, 0.0037371786),
 (1720, 0.0035256222),
 (6145, 0.003511541),
 (1142, 0.003407171)]

In [128]:
lda_multi_s.id2word.id2token[11]

'man'

In [129]:
lda_multi_p.get_topic_terms(0)

[(4716, 0.016807023),
 (1447, 0.015104228),
 (11, 0.007641598),
 (6804, 0.0074021094),
 (2358, 0.0040484834),
 (182, 0.0039092633),
 (4263, 0.003810972),
 (1332, 0.0036929073),
 (6570, 0.0036659513),
 (3055, 0.0035300446)]

In [130]:
lda_multi_p.get_term_topics(4716)

[(0, 0.016787399)]

**Weighted Average of Topic Terms**

In [162]:
topics_s = lda_multi_s.get_topics()
topics_p = lda_multi_p.get_topics()

In [163]:
topics_s

array([[2.06251396e-03, 5.65889832e-06, 1.59622519e-04, ...,
        5.65889832e-06, 5.65889832e-06, 1.27046387e-05],
       [1.40270672e-03, 2.34052553e-04, 2.11674161e-03, ...,
        5.28134660e-06, 5.28134660e-06, 1.44760415e-05],
       [1.63618976e-03, 1.93558182e-04, 2.27375631e-03, ...,
        5.02633748e-06, 5.02633748e-06, 5.02633748e-06],
       ...,
       [1.21191260e-03, 3.31283169e-04, 4.28829342e-04, ...,
        4.99208818e-06, 4.99208818e-06, 2.86017341e-04],
       [1.05547241e-03, 3.58389552e-05, 3.81037389e-05, ...,
        5.35092977e-06, 5.35092977e-06, 3.49287380e-04],
       [1.28497661e-03, 1.84891454e-04, 3.30628618e-03, ...,
        4.14707165e-06, 4.14707165e-06, 1.10628944e-05]], dtype=float32)

In [164]:
len(topics_s)

32

In [165]:
len(topics_s[0])

10000

In [166]:
# topics_s[sent_topic_num][word_id] -> topic value
topics_s[0][0]

0.002062514

**Sentences**

In [186]:
title_vecs_s = []

for i in range(len(topics_s)):
    term_tups_s = lda_multi_s.get_topic_terms(i)
    term_weights = []
    term_wvs = []
    for term in term_tups_s:
        try:
            term_wvs.append(wv_model.wv[lda_multi_s.id2word.id2token[term[0]]])
            term_weights.append(term[1])
        except KeyError:
            # Skip terms missing from the Word2Vec vocabulary
            continue
    wv_cols = []
    for vec_col_num in range(len(term_wvs[0])):
        wv_cols.append([term_wvs[vec_num][vec_col_num] for vec_num in range(len(term_wvs))])
    wt_avg_vec = [np.average(col_vec, weights=term_weights) for col_vec in wv_cols]
    title_vecs_s.append(wt_avg_vec)

In [187]:
len(title_vecs_s[0])

32
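
Each title vector above is the average of a topic's top-term word vectors, weighted by the LDA term weights. A minimal plain-Python sketch of that computation; `weighted_average_vector` and the toy inputs are hypothetical.

```python
def weighted_average_vector(vectors, weights):
    """Average equal-length vectors component-wise, weighting each by its term weight."""
    total = sum(weights)
    return [sum(w * vec[dim] for vec, w in zip(vectors, weights)) / total
            for dim in range(len(vectors[0]))]

# Hypothetical 2-dimensional word vectors and LDA term weights
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [0.5, 0.25, 0.25]
avg = weighted_average_vector(toy_vectors, toy_weights)
print(avg)
```

With NumPy, `np.average(term_wvs, axis=0, weights=term_weights)` computes the same result in one call, without building the per-column lists.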

In [188]:
titles_s = []

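for vec in title_vecs_s:
    # adj_vec is the same list object as vec, so title_vecs_s is adjusted in place
    adj_vec = vec
    # Shift the Specificity dimension by half of the selected change
    adj_vec[spec_adj[0]] += (spec_adj[1] / 2)
    word = wv_model.wv.similar_by_vector(np.array(adj_vec), topn=1)
    titles_s.append(word)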

In [189]:
t_str = [t[0][0] for t in titles_s]
dup_list = []

for t_tup in titles_s:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])

**Rerun the Cells Below Until No Duplicate Titles Remain**

In [214]:
dup_list

['objection']
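
The duplicate check can also be written with a counter, which avoids mutating a copy of the title list. This is a sketch of an alternative, not the cell used above; `duplicate_titles` and the toy titles are hypothetical.

```python
from collections import Counter

def duplicate_titles(titles):
    """Return the titles that occur more than once, in order of first appearance."""
    counts = Counter(titles)
    return [title for title in counts if counts[title] > 1]

toy_titles = ['wrong', 'objection', 'reason', 'objection', 'evil']
dups = duplicate_titles(toy_titles)
print(dups)
```

Unlike the loop above, this lists each repeated title once, however many times it occurs.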

In [215]:
dup_titles = [title for title in titles_s if title[0][0] in dup_list]
dup_titles

[[('objection', 0.7961176037788391)], [('objection', 0.8729716539382935)]]

In [216]:
dup_ix = [i for i, t_tup in enumerate(titles_s) if t_tup in dup_titles]
dup_ix

[4, 8]

In [217]:
max_title_dict = {}

for di, dt in zip(dup_ix, dup_titles):
    try:
        if max_title_dict[dt[0][0]]['max_sim'] < dt[0][1]:
            max_title_dict[dt[0][0]]['max_sim'] = dt[0][1]
            max_title_dict[dt[0][0]]['ix'] = di
    except KeyError:
        max_title_dict[dt[0][0]] = {}
        max_title_dict[dt[0][0]]['max_sim'] = dt[0][1]
        max_title_dict[dt[0][0]]['ix'] = di
        
max_title_dict

{'objection': {'max_sim': 0.8729716539382935, 'ix': 8}}

In [218]:
max_ix = [d['ix'] for k, d in max_title_dict.items()]
max_ix

[8]

In [219]:
update_ix = [i for i in dup_ix if i not in max_ix]
update_ix

[4]

In [220]:
depth = 4

for ui in update_ix:
    titles_s[ui][0] = wv_model.wv.similar_by_vector(np.array(title_vecs_s[ui]), topn=depth)[depth-1]
t_str_s = [t[0][0] for t in titles_s]
t_str_s

['wrong',
 'note',
 'lives',
 'agenda',
 'reason',
 'way',
 'face',
 'subjection',
 'objection',
 'evil',
 'say',
 'certainty',
 'sense',
 'possibility',
 'guilt',
 'humanity',
 'thing',
 'thee',
 'contradiction',
 'limitation',
 'idea',
 'premise',
 'essence',
 'god',
 'question',
 'revenge',
 'happiness',
 'ours',
 'answer',
 'argument',
 'shipowner',
 'person']

In [221]:
t_str = [t[0][0] for t in titles_s]
dup_list = []

for t_tup in titles_s:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])
dup_list

[]

**Paragraphs**

In [222]:
len(topics_p)

16

In [223]:
title_vecs_p = []

for i in range(len(topics_p)):
    term_tups_p = lda_multi_p.get_topic_terms(i)
    term_weights = []
    term_wvs = []
    for term in term_tups_p:
        try:
            term_wvs.append(wv_model.wv[lda_multi_p.id2word.id2token[term[0]]])
            term_weights.append(term[1])
        except KeyError:
            # Skip terms missing from the Word2Vec vocabulary
            continue
    wv_cols = []
    for vec_col_num in range(len(term_wvs[0])):
        wv_cols.append([term_wvs[vec_num][vec_col_num] for vec_num in range(len(term_wvs))])
    wt_avg_vec = [np.average(col_vec, weights=term_weights) for col_vec in wv_cols]
    title_vecs_p.append(wt_avg_vec)
    
len(title_vecs_p)

16

In [224]:
titles_p = []

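for vec in title_vecs_p:
    # adj_vec is the same list object as vec, so title_vecs_p is adjusted in place
    adj_vec = vec
    # Shift the Specificity dimension by half of the selected change
    adj_vec[spec_adj[0]] += (spec_adj[1] / 2)
    word = wv_model.wv.similar_by_vector(np.array(adj_vec), topn=1)
    titles_p.append(word)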

titles_p

[[('god', 0.9365451335906982)],
 [('army', 0.858135461807251)],
 [('reason', 0.8370018005371094)],
 [('contention', 0.8185617923736572)],
 [('nation', 0.8717479705810547)],
 [('weakness', 0.8719848394393921)],
 [('hospitality', 0.8546667098999023)],
 [('accommodation', 0.8303316831588745)],
 [('happiness', 0.875681459903717)],
 [('civil', 0.8376868963241577)],
 [('deadites', 0.7941592931747437)],
 [('idea', 0.896360456943512)],
 [('love', 0.886397659778595)],
 [('sense', 0.8485308885574341)],
 [('biomolecular', 0.8313409090042114)],
 [('thee', 0.9111393690109253)]]

In [225]:
t_str = [t[0][0] for t in titles_p]
dup_list = []

for t_tup in titles_p:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])

In [226]:
dup_list

[]

In [227]:
dup_titles = [title for title in titles_p if title[0][0] in dup_list]
dup_titles

[]

## 2.3 LDA Features for Corpora

### 2.3.1 Training Text LDA

In [None]:
colname_s = ['s'+str(i)+'_lda_'+title for i, title in enumerate(t_str_s)]

In [None]:
corpus_topic_df_s.columns = colname_s

In [None]:
corpus_topic_df_s.shape

In [None]:
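# t_str_p was never defined above; build it from the paragraph titles
t_str_p = [t[0][0] for t in titles_p]
colname_p = ['p'+str(i)+'_lda_'+title for i, title in enumerate(t_str_p)]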

In [None]:
corpus_topic_df_p.columns = colname_p

In [None]:
corpus_topic_df_p.shape

In [None]:
lda_train = pd.merge(corpus_topic_df_s, corpus_topic_df_p, left_index=True, right_index=True)

In [None]:
lda_train = lda_train.merge(nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

### 2.3.2 Testing Text LDA

In [None]:
t_corpus_topic_df_s.columns = colname_s

In [None]:
t_corpus_topic_df_p.columns = colname_p

In [None]:
lda_test = pd.merge(t_corpus_topic_df_s, t_corpus_topic_df_p, left_index=True, right_index=True)

In [None]:
lda_test = lda_test.merge(t_nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

In [None]:
lda_test.head()

**Save Files to Disk**

In [None]:
lda_train.to_csv('../data_vec/lda_train.csv', index=False)
lda_test.to_csv('../data_vec/lda_test.csv', index=False)

## Continue to Notebook 3: Document Vectors