# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 2 - LDA Topic Modeling

### Table of Contents

[**2.1 Gensim LDA**](#2.1-Gensim-LDA)
- [2.1.1 Build Dictionary and Corpora](#2.1.1-Build-Dictionary-and-Corpora)
- [2.1.2 LDA Models](#2.1.2-LDA-Models)
- [2.1.3 Visualize with pyLDAvis](#2.1.3-Visualize-with-pyLDAvis)
- [2.1.4 Optimization for Number of Topics](#2.1.4-Optimization-for-Number-of-Topics)

[**2.2 Topic Labeling with Gensim Word2Vec**](#2.2-Topic-Labeling-with-Gensim-Word2Vec)
- [2.2.1 Word2Vec](#2.2.1-Word2Vec)
- [2.2.2 Identify Vectors for Specificity](#2.2.2-Identify-Vectors-for-Specificity)
- [2.2.3 Get Topic Labels](#2.2.3-Get-Topic-Labels)

[**2.3 LDA Features for Corpora**](#2.3-LDA-Features-for-Corpora)
- [2.3.1 Training Text LDA](#2.3.1-Training-Text-LDA)
- [2.3.2 Testing Text LDA](#2.3.2-Testing-Text-LDA)

**Libraries**

In [1]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output

# Natural Language Processing
import spacy
from nltk.stem import PorterStemmer

# Gensim
import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec
import pyLDAvis.gensim

# Modeling Prep
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## 2.1 Gensim LDA

**Load Preprocessed Text Data**

In [2]:
nlp_df = pd.read_csv('../data_eda/nlp_df.csv')
t_nlp_df = pd.read_csv('../data_eda/t_nlp_df.csv')

In [3]:
nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,The Categories,0,0,0,0,Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
1,Aristotle,The Categories,0,0,0,1,"Thus, a real man and a figure in a picture can...","['real', 'man', 'figure', 'picture', 'lay', 'c...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
2,Aristotle,The Categories,0,0,0,2,For should any one define in what sense each i...,"['define', 'sense', 'animal', 'definition', 'c...",Things are said to be named 'equivocally' when...,"['thing', 'say', 'name', 'equivocally', 'commo..."
3,Aristotle,The Categories,0,0,1,0,"On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',...","On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',..."
4,Aristotle,The Categories,0,0,1,1,"A man and an ox are both 'animal', and these a...","['man', 'ox', 'animal', 'univocally', 'name', ...","On the other hand, things are said to be named...","['hand', 'thing', 'say', 'name', 'univocally',..."


In [4]:
word_count = sum([sent.strip().count(' ') for sent in nlp_df.sent_text.tolist()]) + nlp_df.shape[0]
word_count

1889255

In [5]:
nlp_df.a_num.value_counts()

9     9871
11    8263
4     6989
12    6524
7     6466
17    5944
13    4266
16    3842
15    3289
1     3198
6     2999
14    2673
18    2287
5     1653
2     1261
10    1247
19    1195
0     1164
3      641
8      343
Name: a_num, dtype: int64

In [6]:
nlp_df.shape

(74115, 10)

In [7]:
t_nlp_df.head()

Unnamed: 0,author,work,a_num,w_num,p_num,s_num,sent_text,sent_lemma,par_text,par_lemma
0,Aristotle,Ethics,0,0,0,0,"Every art, and every science reduced to a teac...","['art', 'science', 'reduce', 'teachable', 'for...","Every art, and every science reduced to a teac...","['art', 'science', 'reduce', 'teachable', 'for..."
1,Aristotle,Ethics,0,0,0,1,"""",[],"Every art, and every science reduced to a teac...","['art', 'science', 'reduce', 'teachable', 'for..."
2,Aristotle,Ethics,0,0,1,0,Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '...",Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '..."
3,Aristotle,Ethics,0,0,1,1,"Again, since actions and arts and sciences are...","['action', 'art', 'science', 'ends', 'likewise...",Now there plainly is a difference in the Ends ...,"['plainly', 'difference', 'ends', 'propose', '..."
4,Aristotle,Ethics,0,0,2,0,"And whatever of such actions, arts, or science...","['action', 'art', 'science', 'range', 'faculty...","And whatever of such actions, arts, or science...","['action', 'art', 'science', 'range', 'faculty..."


In [8]:
t_nlp_df.shape

(8870, 10)

### 2.1.1 Build Dictionary and Corpora

**Gensim Dictionary `g_dict`**

In [9]:
# Load the stopword list; the context manager closes the file automatically
with open('../data_eda/sw.pkl', 'rb') as pkl:
    stopwords = pickle.load(pkl)

In [11]:
stopwords[:10]

['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x']

In [12]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary([ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma])
len(g_dict)

25408

In [13]:
# Remove any stopwords still present in the dictionary
bad_ids = [g_dict.token2id[word] for word in stopwords if word in g_dict.token2id]
g_dict.filter_tokens(bad_ids=bad_ids)

In [14]:
len(g_dict)

25408

**Remove Outliers from Dictionary**

In [15]:
g_dict.filter_extremes(no_below=5, no_above=0.85, keep_n=10000)
len(g_dict)

8860
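
As a rough illustration of what `filter_extremes` does with these arguments, the sketch below re-implements the document-frequency filter in plain Python. The function name `filter_extremes_sketch` and the toy documents are hypothetical, and Gensim's own implementation differs in detail (for example, it also reassigns token ids afterwards).

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.85, keep_n=10000):
    """Return the tokens that survive a document-frequency filter."""
    # Document frequency: number of documents containing each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically)
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

# Hypothetical toy corpus of lemmatized documents
toy_docs = [
    ['man', 'reason', 'virtue'],
    ['man', 'reason'],
    ['man', 'soul'],
    ['man', 'virtue'],
]
kept_tokens = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(kept_tokens)
```

In the toy corpus, `man` is dropped for appearing in too many documents and `soul` for appearing in too few, which mirrors how the dictionary above shrinks from 25408 to 8860 terms.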

**Bag of Words (BoW) Corpora**

Training Text Corpus

In [16]:
# Build corpus of normalized text relative to dictionary
bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in nlp_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in nlp_df.par_lemma]

In [17]:
len(bow_corpus_s)

74115

In [18]:
len(bow_corpus_s[0])

7

In [19]:
len(bow_corpus_p)

74115

In [20]:
len(bow_corpus_p[0])

18
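
The lengths above count distinct dictionary tokens per document, not total words. A minimal sketch of the bag-of-words mapping, assuming a hypothetical `token2id` vocabulary and helper name `doc2bow_sketch`, looks like this:

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Map a token list to sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

token2id = {'thing': 0, 'name': 1, 'common': 2}  # hypothetical vocabulary
bow = doc2bow_sketch(['thing', 'say', 'name', 'thing'], token2id)
print(bow)  # [(0, 2), (1, 1)]
```

Tokens removed from the dictionary (here `say`) simply drop out of the vector, so filtering the dictionary also shortens every document.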

Testing Text Corpus

In [21]:
# Build corpus of normalized text relative to dictionary
t_bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in t_nlp_df.sent_lemma]
t_bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in t_nlp_df.par_lemma]

In [22]:
len(t_bow_corpus_p)

8870

In [23]:
len(t_bow_corpus_p[0])

18

**TF-IDF Vectorization**

In [24]:
tfidf = TfidfModel(bow_corpus_s, normalize=True)

In [25]:
corpus_s = tfidf[bow_corpus_s]

In [26]:
len(corpus_s)

74115

In [27]:
corpus_p = tfidf[bow_corpus_p]

In [28]:
t_corpus_s = tfidf[t_bow_corpus_s]

In [29]:
len(t_corpus_s)

8870

In [30]:
t_corpus_p = tfidf[t_bow_corpus_p]

**Save TF-IDF Model to Disk**

In [31]:
tfidf.save('../models/tfidf')
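
For reference, the weighting applied by `TfidfModel` can be sketched in plain Python. This assumes Gensim's default scheme (raw term count times a base-2 log inverse document frequency, followed by L2 normalization because `normalize=True`); the helper name `tfidf_sketch` and the toy frequencies are hypothetical.

```python
import math

def tfidf_sketch(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    # Terms that occur in every document get weight 0 and are dropped
    weights = [(tid, w) for tid, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

doc_freq = {0: 1, 1: 2, 2: 4}  # hypothetical document frequencies
vec = tfidf_sketch([(0, 1), (1, 2), (2, 1)], doc_freq, num_docs=4)
print(vec)
```

Token 2 appears in all four toy documents, so it carries no weight; the remaining two terms end up with equal weight on a unit-length vector.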

### 2.1.2 LDA Models

**Set Parameter Values**

In [32]:
sent_param= {
    'num_topics':36,
    'random_state':211,
    'chunksize':5000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

par_param= {
    'num_topics':22,
    'random_state':211,
    'chunksize':1000,
    'eval_every':10,
    'passes':3,
    'workers':3
}

**LDA Multicore Model - Sentences**

In [33]:
# Instantiate model based on parameter values
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        eval_every=sent_param['eval_every'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers'],
                                        alpha='symmetric'
)

In [34]:
lda_multi_s.print_topics()

[(9,
  '0.019*"tu" + 0.014*"mu" + 0.009*"volition" + 0.008*"man" + 0.007*"ask" + 0.006*"piety" + 0.005*"say" + 0.005*"calculation" + 0.005*"accomplish" + 0.005*"native"'),
 (12,
  '0.007*"maxim" + 0.007*"verily" + 0.007*"alas" + 0.007*"help" + 0.007*"man" + 0.006*"contemplate" + 0.006*"resource" + 0.005*"cave" + 0.005*"ear" + 0.005*"dictate"'),
 (17,
  '0.010*"revolution" + 0.008*"france" + 0.008*"taxis" + 0.007*"government" + 0.007*"unto" + 0.007*"man" + 0.007*"emotion" + 0.007*"money" + 0.006*"speak" + 0.006*"people"'),
 (23,
  '0.011*"spy" + 0.010*"word" + 0.010*"premiss" + 0.007*"innate" + 0.007*"identity" + 0.006*"supposition" + 0.006*"settle" + 0.005*"show" + 0.005*"man" + 0.005*"copy"'),
 (13,
  '0.013*"passive" + 0.011*"active" + 0.011*"mind" + 0.007*"doth" + 0.007*"confused" + 0.006*"idea" + 0.006*"body" + 0.006*"guidance" + 0.006*"act" + 0.006*"reason"'),
 (11,
  '0.011*"essence" + 0.007*"slavery" + 0.007*"preservation" + 0.006*"wang" + 0.006*"freedom" + 0.006*"man" + 0.006*"

**LDA Metrics - Sentences**

In [35]:
lda_multi_s.log_perplexity(corpus_s)

-10.750533804973584
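
Gensim's `log_perplexity` returns a per-word likelihood bound rather than a perplexity, which is why the value is negative. Gensim's own log output reports the perplexity estimate as 2 raised to the negative bound; the small helper below (a hypothetical name, not part of Gensim) applies that conversion.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base 2) to a perplexity estimate."""
    return 2 ** (-per_word_bound)

sentence_bound = -10.750533804973584  # sentence-level value reported above
print(round(bound_to_perplexity(sentence_bound)))
```

A bound closer to zero therefore corresponds to a lower perplexity.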

In [36]:
cm_s = CoherenceModel(model=lda_multi_s, texts=[ast.literal_eval(sent) for sent in nlp_df.sent_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_s.get_coherence()

0.42109109245182874

**LDA Multicore Model - Paragraphs**

In [37]:
# Instantiate model based on parameter values
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        eval_every=par_param['eval_every'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers'],
                                        alpha='symmetric'
)

In [38]:
lda_multi_p.print_topics()

[(0,
  '0.014*"right" + 0.013*"man" + 0.011*"law" + 0.008*"government" + 0.007*"state" + 0.006*"people" + 0.005*"power" + 0.005*"authority" + 0.005*"god" + 0.005*"sovereign"'),
 (4,
  '0.029*"yu" + 0.025*"chang" + 0.025*"tzu" + 0.020*"say" + 0.019*"kung" + 0.015*"cheng" + 0.013*"chao" + 0.011*"gatherer" + 0.011*"tan" + 0.010*"yen"'),
 (9,
  '0.029*"emotions" + 0.024*"variation" + 0.023*"subordinate" + 0.020*"chap" + 0.016*"chapter" + 0.015*"tzu" + 0.013*"less" + 0.013*"stream" + 0.012*"edition" + 0.011*"farther"'),
 (1,
  '0.009*"cause" + 0.009*"phenomenon" + 0.008*"effect" + 0.008*"physical" + 0.006*"causation" + 0.006*"case" + 0.006*"causal" + 0.006*"meaning" + 0.005*"series" + 0.005*"relation"'),
 (6,
  '0.061*"ye" + 0.029*"zarathustra" + 0.021*"defensive" + 0.019*"identical" + 0.019*"laugh" + 0.019*"spake" + 0.016*"march" + 0.015*"defeat" + 0.012*"fight" + 0.011*"wait"'),
 (17,
  '0.035*"spy" + 0.019*"dominion" + 0.014*"proportion" + 0.013*"fellow" + 0.013*"private" + 0.011*"thysel

**LDA Metrics - Paragraphs**

In [39]:
lda_multi_p.log_perplexity(corpus_p)

-9.546337188079216

In [40]:
cm_p = CoherenceModel(model=lda_multi_p, texts=[ast.literal_eval(par) for par in nlp_df.par_lemma], 
                      dictionary=g_dict, coherence='c_v')
cm_p.get_coherence()

0.4588011582071301

### 2.1.3 Visualize with pyLDAvis

In [41]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

In [42]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.4 Optimization for Number of Topics

**Set Optimizing Parameters**

In [43]:
sent_opt_params = {
    'num_topics':[28,30,32,34,36],
    'random_state':211,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [44]:
par_opt_params = {
    'num_topics':[16,18,20,22,24],
    'random_state':211,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

**Sentence LDA**

*Estimated Run Time: ~6 min*

In [45]:
# Parse the lemma lists once rather than on every iteration
sent_texts = [ast.literal_eval(sent) for sent in nlp_df.sent_lemma]
rows = []

for nt in tqdm(sent_opt_params['num_topics']):
    lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                            id2word=g_dict,
                                            num_topics=nt,
                                            random_state=sent_opt_params['random_state'],
                                            chunksize=sent_opt_params['chunksize'],
                                            passes=sent_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=sent_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_s, texts=sent_texts, dictionary=g_dict, coherence='c_v')
    # Collect one result row per candidate number of topics
    rows.append({'num_topics': nt,
                 'perplexity': lda_multi_s.log_perplexity(corpus_s),
                 'coherence': cm.get_coherence()})

lda_multi_s_df = pd.DataFrame(rows, columns=['num_topics','perplexity','coherence'])
lda_multi_s_df

100%|██████████| 5/5 [05:35<00:00, 66.97s/it]


Unnamed: 0,num_topics,perplexity,coherence
0,28,-10.513294,0.360515
1,30,-10.619058,0.36142
2,32,-10.690512,0.386782
3,34,-10.800787,0.385941
4,36,-10.872494,0.375965


**Paragraph LDA**

*Estimated Run Time: ~21 min*

In [46]:
# Parse the lemma lists once rather than on every iteration
par_texts = [ast.literal_eval(par) for par in nlp_df.par_lemma]
rows = []

for nt in tqdm(par_opt_params['num_topics']):
    lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                            id2word=g_dict,
                                            num_topics=nt,
                                            random_state=par_opt_params['random_state'],
                                            chunksize=par_opt_params['chunksize'],
                                            passes=par_opt_params['passes'],
                                            per_word_topics=True,
                                            workers=par_opt_params['workers'],
                                            alpha='symmetric')
    cm = CoherenceModel(model=lda_multi_p, texts=par_texts, dictionary=g_dict, coherence='c_v')
    # Collect one result row per candidate number of topics
    rows.append({'num_topics': nt,
                 'perplexity': lda_multi_p.log_perplexity(corpus_p),
                 'coherence': cm.get_coherence()})

lda_multi_p_df = pd.DataFrame(rows, columns=['num_topics','perplexity','coherence'])
lda_multi_p_df

100%|██████████| 5/5 [20:41<00:00, 248.72s/it]


Unnamed: 0,num_topics,perplexity,coherence
0,16,-9.274579,0.442267
1,18,-9.37392,0.48042
2,20,-9.460482,0.459751
3,22,-9.667657,0.448647
4,24,-9.668795,0.465975
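
The final models below use the topic counts with the highest `c_v` coherence in each table. A short sketch of that selection, with the scores copied from the two result tables and a hypothetical helper name `best_num_topics`:

```python
# (num_topics, coherence) pairs copied from the sentence and paragraph tables
sent_results = [(28, 0.360515), (30, 0.36142), (32, 0.386782), (34, 0.385941), (36, 0.375965)]
par_results = [(16, 0.442267), (18, 0.48042), (20, 0.459751), (22, 0.448647), (24, 0.465975)]

def best_num_topics(results):
    """Return the num_topics value with the highest coherence score."""
    return max(results, key=lambda pair: pair[1])[0]

print(best_num_topics(sent_results), best_num_topics(par_results))  # 32 18
```

Selecting on coherence alone is one possible criterion; the perplexity bound in these tables moves further from zero as the topic count grows, so it would favor the smallest candidate instead.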


**Create Empty Optimizing DataFrames**

In [47]:
lda_sent = pd.DataFrame(columns=['num_topics','perplexity','coherence'])
lda_par = pd.DataFrame(columns=['num_topics','perplexity','coherence'])

**Add Latest Results to DataFrames**

In [48]:
lda_sent = pd.concat([lda_sent, lda_multi_s_df])

In [49]:
lda_par = pd.concat([lda_par, lda_multi_p_df])

**Final Sentence LDA Model**

In [66]:
sent_params = {
    'num_topics':32,
    'random_state':211,
    'chunksize':5000,
    'passes':2,
    'workers':3
}

In [67]:
# Run model
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_params['num_topics'],
                                        random_state=sent_params['random_state'],
                                        chunksize=sent_params['chunksize'],
                                        passes=sent_params['passes'],
                                        per_word_topics=True,
                                        workers=sent_params['workers'], 
                                        alpha='symmetric'
                                       )

In [68]:
# Save model to disk
lda_multi_s.save('../models/lda_multi_s')

In [69]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

**Final Paragraph LDA Model**

In [54]:
par_params = {
    'num_topics':18,
    'random_state':211,
    'chunksize':1000,
    'passes':2,
    'workers':3
}

In [55]:
# Run model
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_params['num_topics'],
                                        random_state=par_params['random_state'],
                                        chunksize=par_params['chunksize'],
                                        passes=par_params['passes'],
                                        per_word_topics=True,
                                        workers=par_params['workers'], 
                                        alpha='symmetric'
                                       )

In [56]:
# Save model to disk
lda_multi_p.save('../models/lda_multi_p')

In [57]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.5 Get Topic Values for Corpora

**Load LDA Models**

In [None]:
lda_multi_s = ldamulticore.LdaMulticore.load('../models/lda_multi_s')

In [None]:
lda_multi_p = ldamulticore.LdaMulticore.load('../models/lda_multi_p')

**Sentences**

*Estimated Run Time: ~10 hr*

In [None]:
corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(corpus_s))):
    for topic, proba in doc:
        corpus_topic_df_s.loc[i, topic] = proba
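
The loop above assigns one cell at a time with `.loc`, which is likely a large part of the estimated run time. One possible alternative, sketched here with a hypothetical helper `topics_to_rows` and toy data, is to expand the sparse topic lists into plain Python rows first and build the DataFrame in a single call afterwards.

```python
def topics_to_rows(doc_topics, num_topics):
    """Expand sparse [(topic_id, probability), ...] lists into dense rows of floats."""
    rows = []
    for doc in doc_topics:
        row = [0.0] * num_topics  # topics that were not returned stay at 0
        for topic_id, proba in doc:
            row[topic_id] = proba
        rows.append(row)
    return rows

# Hypothetical output in the shape yielded by get_document_topics
toy_doc_topics = [[(1, 0.73)], [(1, 0.34), (3, 0.45)], []]
dense_rows = topics_to_rows(toy_doc_topics, num_topics=4)
print(dense_rows)
```

Under this approach, `pd.DataFrame(topics_to_rows(lda_multi_s.get_document_topics(corpus_s), lda_multi_s.num_topics))` would produce the same table with zeros already filled in. This has not been timed against the original loop.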

In [83]:
corpus_topic_df_s.fillna(0, inplace=True)

In [84]:
corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.0,0.72789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.336133,0.0,0.0,0.0,0.450036,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.71593,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.732766,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.589539,0.0,0.192299,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [85]:
corpus_topic_df_s.shape

(74115, 32)

In [86]:
corpus_topic_df_s.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
74110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.241788,0.0,0.0,0.0,0.0,0.0,0.0
74111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.706538
74112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
len(corpus_topic_df_s)

74115

In [88]:
corpus_topic_df_s.to_csv('../data_vec/corpus_topic_df_s.csv', index=False)

**Paragraphs**

In [None]:
corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(corpus_p))):
    for topic, proba in doc:
        corpus_topic_df_p.loc[i, topic] = proba

In [91]:
corpus_topic_df_p.fillna(0, inplace=True)

**Save LDA Values**

In [92]:
corpus_topic_df_p.to_csv('../data_vec/corpus_topic_df_p.csv', index=False)

**Test Data**

_Estimated Run Time: ~10 min_

In [None]:
t_corpus_topic_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(t_corpus_s))):
    for topic, proba in doc:
        t_corpus_topic_df_s.loc[i, topic] = proba

In [94]:
t_corpus_topic_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,,,,0.757352,,,,,,,...,,,,,,,0.0545553,,,
1,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,...,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125
2,,,,0.0867912,,,,,,0.316206,...,,,,,,,,,,
3,,,,,,,0.0728499,0.0655444,,,...,,,,,,,0.433428,,,
4,,,,,,,,,,0.203645,...,,,,,,,,,,


In [95]:
t_corpus_topic_df_s.shape

(8870, 32)

In [None]:
t_corpus_topic_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(t_corpus_p))):
    for topic, proba in doc:
        t_corpus_topic_df_p.loc[i, topic] = proba

In [97]:
t_corpus_topic_df_p.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.810514,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462
1,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.810514,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462,0.0111462
2,0.0453188,0.010188,0.010188,0.225535,0.010188,0.010188,0.010188,0.010188,0.531825,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.054689
3,0.0453204,0.010188,0.010188,0.225411,0.010188,0.010188,0.010188,0.010188,0.531613,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.0550246
4,0.0112334,0.0112334,0.229055,0.165172,0.0112334,0.0112334,0.0112334,0.0112334,0.347701,0.100804,0.0112334,0.0112334,0.0112334,0.0112334,0.0112334,0.0112334,0.0112334,0.0112334


In [98]:
t_corpus_topic_df_p.shape

(8870, 18)

In [99]:
t_corpus_topic_df_s.fillna(0, inplace=True)
t_corpus_topic_df_p.fillna(0, inplace=True)

**Save to Disk**

In [100]:
t_corpus_topic_df_s.to_csv('../data_vec/t_corpus_topic_df_s.csv', index=False)
t_corpus_topic_df_p.to_csv('../data_vec/t_corpus_topic_df_p.csv', index=False)

## 2.2 Topic Labeling with Gensim Word2Vec

### 2.2.1 Word2Vec

**Load Gensim Wikipedia Text8 Corpus**

In [101]:
text8_corpus = api.load('text8')

In [102]:
wv_vecsize = 32

**Train Word2Vec Model**

In [103]:
wv_model = Word2Vec(text8_corpus, 
                    size=wv_vecsize, 
                    window=2, 
                    min_count=2, 
                    sg=0,
                    workers=3
)

**Save Word2Vec Model to Disk**

In [104]:
wv_model.save('../models/wv_model')

**Load Word2Vec Model**

In [None]:
wv_model = Word2Vec.load('../models/wv_model')

### 2.2.2 Identify Vectors for Specificity

**Target Vectors for Specificity (e.g., _Tiger_ >> _Animal_)**

In [259]:
spec_terms = [
    ['tiger','cat','mammal','animal','thing'],
    #['emperor','king','man','human']
]

In [260]:
# Look up the word vector for each term, ordered from most to least specific
spec_vecs = []
for t_list in spec_terms:
    spec_vecs.append([wv_model.wv[term] for term in t_list])

In [261]:
#spec_vecs[list_num][term_num][vec_num]
spec_vecs[0][0][0]

-0.33799845

**Check for Specificity Vectors**

In [262]:
spec_tups = []
asc_list = []
desc_list = []

for term_vecs in spec_vecs:
    for j in range(wv_vecsize):
        y_vals = [vec[j] for vec in term_vecs]
        pairs = list(zip(y_vals, y_vals[1:]))

        # Record dimensions whose values strictly increase or strictly decrease
        # from the most specific term to the most general term
        if all(a < b for a, b in pairs):
            asc_list.append(j)
        if all(a > b for a, b in pairs):
            desc_list.append(j)

for j in range(wv_vecsize):
    num_tups = []
    # Keep only dimensions that are monotonic for every term list
    for label, dims in (('ascending', asc_list), ('descending', desc_list)):
        if dims.count(j) == len(spec_vecs):
            print(j, label)
            for term_vecs in spec_vecs:
                num_tups.append((j, term_vecs[0][j], term_vecs[1][j]))

    if len(num_tups) > 0:
        spec_tups.append(num_tups)

13 descending


In [265]:
spec_tups[0][0]

(13, -0.030605197, -0.14433207)

In [268]:
spec_adj = (spec_tups[0][0][0], spec_tups[0][0][2] - spec_tups[0][0][1])
spec_adj

(13, -0.11372687)

### 2.2.3 Get Topic Labels

In [139]:
lda_multi_s.get_topic_terms(0)

[(3302, 0.009762119),
 (1904, 0.008175655),
 (1702, 0.007989921),
 (2988, 0.007782106),
 (2581, 0.0075615463),
 (7527, 0.007299436),
 (1462, 0.007117757),
 (5762, 0.006562113),
 (4759, 0.006415151),
 (5588, 0.0061109345)]

In [141]:
lda_multi_s.id2word.id2token[3302]

'eternity'

In [325]:
lda_multi_p.get_topic_terms(0)

[(11, 0.010391422),
 (1466, 0.004404074),
 (4815, 0.0035504263),
 (823, 0.0034957149),
 (3814, 0.00348365),
 (4712, 0.0034544116),
 (5522, 0.0033305753),
 (5943, 0.0031415748),
 (7040, 0.0031204817),
 (2047, 0.0030982427)]

In [143]:
lda_multi_p.get_term_topics(11)

[(0, 0.010382213), (8, 0.011547277), (13, 0.010863306)]

**Weighted Average of Topic Terms**

In [326]:
topics_s = lda_multi_s.get_topics()
topics_p = lda_multi_p.get_topics()

In [145]:
topics_s

array([[1.3438617e-03, 2.4746403e-05, 2.6819072e-04, ..., 5.8438509e-06,
        5.8438509e-06, 5.8438509e-06],
       [1.7041103e-03, 2.4329385e-04, 6.3801301e-03, ..., 4.7385415e-06,
        4.7385415e-06, 4.7385415e-06],
       [1.0460144e-03, 1.5074979e-04, 2.9246302e-04, ..., 5.0143212e-06,
        9.6547546e-06, 5.0143212e-06],
       ...,
       [1.3759901e-03, 1.1383430e-04, 1.5092314e-03, ..., 1.3918671e-03,
        5.0475182e-06, 1.5082081e-05],
       [5.8530300e-04, 9.8994162e-05, 3.1000783e-04, ..., 6.7997103e-06,
        6.7997103e-06, 1.9125566e-05],
       [1.9062037e-03, 1.2889551e-05, 2.6292037e-04, ..., 5.3739950e-06,
        5.3739950e-06, 5.3739950e-06]], dtype=float32)

In [146]:
len(topics_s)

32

In [149]:
len(topics_s[0])

8860

In [150]:
# topics_s[sent_topic_num][word_id] -> topic value
topics_s[0][0]

0.0013438617

**Sentences**

In [333]:
title_vecs_s = []

for i in range(len(topics_s)):
    term_tups_s = lda_multi_s.get_topic_terms(i)
    term_weights = []
    term_wvs = []
    for term_id, weight in term_tups_s:
        try:
            term_wvs.append(wv_model.wv[lda_multi_s.id2word[term_id]])
            term_weights.append(weight)
        except KeyError:
            # Skip topic terms that are missing from the Word2Vec vocabulary
            continue
    # Weighted average of the term vectors, one value per embedding dimension
    wt_avg_vec = list(np.average(term_wvs, axis=0, weights=term_weights))
    title_vecs_s.append(wt_avg_vec)

In [334]:
len(title_vecs_s[0])

32
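
Each topic vector is the average of its top terms' word vectors, weighted by the terms' topic probabilities. A small worked sketch in plain Python, with a hypothetical helper name and toy 2-dimensional vectors:

```python
def weighted_average_vector(vectors, weights):
    """Average equal-length vectors, weighting each by its topic-term probability."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * vec[d] for vec, w in zip(vectors, weights)) / total
            for d in range(dims)]

term_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical word vectors
term_probs = [0.5, 0.25, 0.25]                       # hypothetical topic-term weights
avg = weighted_average_vector(term_vectors, term_probs)
print(avg)  # [0.75, 0.5]
```

The first term carries half of the weight, so the averaged vector leans toward its direction.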

In [335]:
titles_s = []

for vec in title_vecs_s:
    adj_vec = vec
    adj_vec[spec_adj[0]] += spec_adj[1]
    word = wv_model.wv.similar_by_vector(np.array(adj_vec), topn=1)
    titles_s.append(word)
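
In the loop above, `adj_vec = vec` binds a second name to the same list, so the specificity shift is also written back into `title_vecs_s`. The sketch below shows a copy-based variant of the same idea on toy data. The helper names, the 2-dimensional vocabulary, and the shift value are hypothetical, and the cosine ranking stands in for what `similar_by_vector` does over the full vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def label_for(vec, adj, vocab):
    """Shift one dimension of a copy of vec, then return the most similar vocab word."""
    dim, delta = adj
    adj_vec = list(vec)  # copy, so the caller's vector is not modified
    adj_vec[dim] += delta
    return max(vocab, key=lambda word: cosine(adj_vec, vocab[word]))

toy_vocab = {'animal': [0.0, 1.0], 'tiger': [1.0, 0.2]}  # hypothetical embeddings
topic_vec = [0.6, 0.5]
label = label_for(topic_vec, (0, -0.5), toy_vocab)
print(label)
```

Whether the in-place update in the notebook is intended is not stated; with a copy, a later lookup on `title_vecs_s` would use the unadjusted vector.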

In [336]:
t_str = [t[0][0] for t in titles_s]
dup_list = []

for t_tup in titles_s:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])

In [337]:
dup_list

['thing']
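
An equivalent duplicate check can be written with `collections.Counter`. This is a sketch with a hypothetical helper name; unlike the loop above, it reports each repeated label only once, even if the label occurs three or more times.

```python
from collections import Counter

def duplicate_labels(labels):
    """Return labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]

dups = duplicate_labels(['idea', 'thing', 'sense', 'thing'])
print(dups)  # ['thing']
```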

In [338]:
dup_titles = [title for title in titles_s if title[0][0] in dup_list]
dup_titles

[[('thing', 0.837311863899231)], [('thing', 0.8797816038131714)]]

In [339]:
dup_ix = [i for i, t_tup in enumerate(titles_s) if t_tup in dup_titles]
dup_ix

[3, 13]

In [340]:
min_sim = 1
min_ix = None

for di, dt in zip(dup_ix, dup_titles):
    if dt[0][1] < min_sim: 
        min_sim = dt[0][1]
        min_ix = di
        
update_ix = min_ix
update_ix

3

In [341]:
titles_s[update_ix][0] = wv_model.wv.similar_by_vector(np.array(title_vecs_s[update_ix]), topn=2)[1]
t_str_s = [t[0][0] for t in titles_s]
t_str_s

['submission',
 'sense',
 'electorate',
 'contention',
 'hana',
 'redistribution',
 'misfortune',
 'idea',
 'reverence',
 'engagement',
 'setback',
 'sympathy',
 'pleasure',
 'thing',
 'sorry',
 'terminate',
 'wager',
 'note',
 'soldier',
 'suggestion',
 'grasp',
 'motivation',
 'humanity',
 'consciousness',
 'experience',
 'muse',
 'aggrandize',
 'claim',
 'opponent',
 'rationale',
 'shipowner',
 'attack']

In [342]:
t_str = [t[0][0] for t in titles_s]
dup_list = []

for t_tup in titles_s:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])
dup_list

[]

**Paragraphs**

In [330]:
len(topics_p)

18

In [343]:
title_vecs_p = []

for i in range(len(topics_p)):
    term_tups_p = lda_multi_p.get_topic_terms(i)
    term_weights = []
    term_wvs = []
    for term_id, weight in term_tups_p:
        try:
            term_wvs.append(wv_model.wv[lda_multi_p.id2word[term_id]])
            term_weights.append(weight)
        except KeyError:
            # Skip topic terms that are missing from the Word2Vec vocabulary
            continue
    # Weighted average of the term vectors, one value per embedding dimension
    wt_avg_vec = list(np.average(term_wvs, axis=0, weights=term_weights))
    title_vecs_p.append(wt_avg_vec)
    
len(title_vecs_p)

18

In [345]:
titles_p = []

for vec in title_vecs_p:
    adj_vec = vec
    adj_vec[spec_adj[0]] += spec_adj[1]
    word = wv_model.wv.similar_by_vector(np.array(adj_vec), topn=1)
    titles_p.append(word)

titles_p

[[('women', 0.7687749862670898)],
 [('belief', 0.8786320090293884)],
 [('idea', 0.92423415184021)],
 [('electorate', 0.874103307723999)],
 [('punctuate', 0.8335524797439575)],
 [('linearity', 0.8871276378631592)],
 [('attack', 0.8616031408309937)],
 [('burden', 0.8572307825088501)],
 [('nation', 0.8645256757736206)],
 [('occupiers', 0.8778695464134216)],
 [('seditions', 0.8285216689109802)],
 [('consciousness', 0.8700054287910461)],
 [('god', 0.9235993027687073)],
 [('mankind', 0.8558157682418823)],
 [('weekend', 0.861332893371582)],
 [('pleasure', 0.8944377899169922)],
 [('opacity', 0.8683845400810242)],
 [('downside', 0.8268966674804688)]]

In [346]:
t_str = [t[0][0] for t in titles_p]
dup_list = []

for t_tup in titles_p:
    t_str.remove(t_tup[0][0])
    if t_tup[0][0] in t_str:
        dup_list.append(t_tup[0][0])

In [347]:
dup_list

[]

In [348]:
dup_titles = [title for title in titles_p if title[0][0] in dup_list]
dup_titles

[]

In [349]:
dup_ix = [i for i, t_tup in enumerate(titles_p) if t_tup in dup_titles]
dup_ix

[]

In [350]:
min_sim = 1
min_ix = None

for di, dt in zip(dup_ix, dup_titles):
    if dt[0][1] < min_sim: 
        min_sim = dt[0][1]
        min_ix = di
        
update_ix = min_ix
update_ix

In [351]:
if update_ix is not None:
    titles_p[update_ix][0] = wv_model.wv.similar_by_vector(np.array(title_vecs_p[update_ix]), topn=2)[1]
t_str_p = [t[0][0] for t in titles_p]
t_str_p

['women',
 'belief',
 'idea',
 'electorate',
 'punctuate',
 'linearity',
 'attack',
 'burden',
 'nation',
 'occupiers',
 'seditions',
 'consciousness',
 'god',
 'mankind',
 'weekend',
 'pleasure',
 'opacity',
 'downside']

## 2.3 LDA Features for Corpora

### 2.3.1 Training Text LDA

In [364]:
colname_s = ['s'+str(i)+'_lda_'+title for i, title in enumerate(t_str_s)]

In [365]:
corpus_topic_df_s.columns = colname_s

In [366]:
corpus_topic_df_s.shape

(74115, 32)

In [367]:
colname_p = ['p'+str(i)+'_lda_'+title for i, title in enumerate(t_str_p)]

In [368]:
corpus_topic_df_p.columns = colname_p

In [369]:
corpus_topic_df_p.shape

(74115, 18)

In [370]:
lda_train = pd.merge(corpus_topic_df_s, corpus_topic_df_p, left_index=True, right_index=True)

In [371]:
lda_train = lda_train.merge(nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

### 2.3.2 Testing Text LDA

In [373]:
t_corpus_topic_df_s.columns = colname_s

In [374]:
t_corpus_topic_df_p.columns = colname_p

In [375]:
lda_test = pd.merge(t_corpus_topic_df_s, t_corpus_topic_df_p, left_index=True, right_index=True)

In [376]:
lda_test = lda_test.merge(t_nlp_df[['a_num','p_num','s_num']], left_index=True, right_index=True)

In [377]:
lda_test.head()

Unnamed: 0,s0_lda_submission,s1_lda_sense,s2_lda_electorate,s3_lda_contention,s4_lda_hana,s5_lda_redistribution,s6_lda_misfortune,s7_lda_idea,s8_lda_reverence,s9_lda_engagement,...,p11_lda_consciousness,p12_lda_god,p13_lda_mankind,p14_lda_weekend,p15_lda_pleasure,p16_lda_opacity,p17_lda_downside,a_num,p_num,s_num
0,0.0,0.0,0.0,0.757352,0.0,0.0,0.0,0.0,0.0,0.0,...,0.011146,0.011146,0.011146,0.011146,0.011146,0.011146,0.011146,0,0,0
1,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,...,0.011146,0.011146,0.011146,0.011146,0.011146,0.011146,0.011146,0,0,1
2,0.0,0.0,0.0,0.086791,0.0,0.0,0.0,0.0,0.0,0.316206,...,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.054689,0,1,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.07285,0.065544,0.0,0.0,...,0.010188,0.010188,0.010188,0.010188,0.010188,0.010188,0.055025,0,1,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.203645,...,0.011233,0.011233,0.011233,0.011233,0.011233,0.011233,0.011233,0,2,0


**Save Files to Disk**

In [378]:
lda_train.to_csv('../data_vec/lda_train.csv', index=False)
lda_test.to_csv('../data_vec/lda_test.csv', index=False)

## Continue to Notebook 3: Document Vectors