# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 2 - LDA Topic Modeling

### Table of Contents

[**2.1 Gensim LDA**](#2.1-Gensim-LDA)
- [2.1.1 Build Dictionary and Corpora](#2.1.1-Build-Dictionary-and-Corpora)
- [2.1.2 LDA Models](#2.1.2-LDA-Models)
- [2.1.3 Visualize with pyLDAvis](#2.1.3-Visualize-with-pyLDAvis)
- [2.1.4 Optimization](#2.1.4-Optimization)

[**2.2 Topic Labeling with Gensim Word2Vec**](#2.2-Topic-Labeling-with-Gensim-Word2Vec)

[**2.3 Combine Features**](#2.3-Combine-Features)


**Libraries**

In [45]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm
from IPython.display import clear_output

# Natural Language Processing
import spacy
from nltk.stem import PorterStemmer

# Gensim
import gensim
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel
from gensim.models.word2vec import Word2Vec

# Modeling Prep
from sklearn.preprocessing import StandardScaler
import joblib  # standalone package; sklearn.externals.joblib is no longer available

# Plotting
import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in pyLDAvis 3.x
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## 2.1 Gensim LDA

**Load Preprocessed Text Data**

In [49]:
nlp_df = pd.read_csv('./data_eda/nlp_df.csv')
t_nlp_df = pd.read_csv('./data_eda/t_nlp_df.csv')

In [50]:
nlp_df.head()

| Unnamed: 0 | author | work | a_num | w_num | p_num | s_num | sent_text | par_text | par_lemma | sent_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aristotle | The Categories | 0 | 0 | 0 | 0 | Things are said to be named 'equivocally' when... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... |
| 1 | Aristotle | The Categories | 0 | 0 | 0 | 1 | Thus, a real man and a figure in a picture can... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... | ['thus', 'real', 'man', 'and', 'figure', 'in',... |
| 2 | Aristotle | The Categories | 0 | 0 | 0 | 2 | For should any one define in what sense each i... | Things are said to be named 'equivocally' when... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... | ['should', 'any', 'one', 'define', 'in', 'what... |
| 3 | Aristotle | The Categories | 0 | 0 | 1 | 0 | On the other hand, things are said to be named... | On the other hand, things are said to be named... | ['on', 'other', 'hand', 'thing', 'be', 'say', ... | ['thing', 'be', 'say', 'to', 'be', 'name', 'eq... |
| 4 | Aristotle | The Categories | 0 | 0 | 1 | 1 | A man and an ox are both 'animal', and these a... | On the other hand, things are said to be named... | ['on', 'other', 'hand', 'thing', 'be', 'say', ... | ['thus', 'real', 'man', 'and', 'figure', 'in',... |


In [13]:
nlp_df.shape

(70922, 10)

In [5]:
t_nlp_df.head()

| Unnamed: 0 | author | work | a_num | w_num | p_num | s_num | sent_text | sent_lemma | par_text | par_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aristotle | Ethics | 0 | 0 | 0 | 0 | Every art, and every science reduced to a teac... | ['every', 'art', 'and', 'every', 'science', 'r... | Every art, and every science reduced to a teac... | ['every', 'art', 'and', 'every', 'science', 'r... |
| 1 | Aristotle | Ethics | 0 | 0 | 1 | 0 | Now there plainly is a difference in the Ends ... | ['every', 'art', 'and', 'every', 'science', 'r... | Now there plainly is a difference in the Ends ... | ['now', 'there', 'plainly', 'be', 'difference'... |
| 2 | Aristotle | Ethics | 0 | 0 | 1 | 1 | Again, since actions and arts and sciences are... | ['now', 'there', 'plainly', 'be', 'difference'... | Now there plainly is a difference in the Ends ... | ['now', 'there', 'plainly', 'be', 'difference'... |
| 3 | Aristotle | Ethics | 0 | 0 | 2 | 0 | And whatever of such actions, arts, or science... | ['every', 'art', 'and', 'every', 'science', 'r... | And whatever of such actions, arts, or science... | ['and', 'whatev', 'of', 'such', 'action', 'art... |
| 4 | Aristotle | Ethics | 0 | 0 | 3 | 0 | (And in this comparison it makes no difference... | ['every', 'art', 'and', 'every', 'science', 'r... | (And in this comparison it makes no difference... | ['and', 'in', 'this', 'comparison', 'make', 'n... |


In [14]:
t_nlp_df.shape

(8395, 10)

### 2.1.1 Build Dictionary and Corpora

**Gensim Dictionary `g_dict`**

In [35]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary([ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma])

In [31]:
len(g_dict)

70922
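
The lemma columns are read from CSV, so each cell arrives as the string form of a list rather than as a list, which is why the dictionary cell wraps every value in `ast.literal_eval`. A minimal sketch of that round trip, using a made-up cell value rather than the real data:

```python
import ast

# Hypothetical cell value, shaped like the sent_lemma column after pd.read_csv
lemma_str = "['thing', 'be', 'say', 'to', 'be', 'name']"

# Iterating over the raw string yields single characters, not tokens
chars = list(lemma_str)[:3]

# literal_eval safely parses the string back into a Python list of tokens
tokens = ast.literal_eval(lemma_str)

print(chars)
print(tokens)
```

Any step that expects tokenized documents needs the parsed list, not the raw string.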

**Remove Outliers from Dictionary**

In [None]:
g_dict.filter_extremes(no_below=3, no_above=0.9, keep_n=18000)

len(g_dict)
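
To make the three `filter_extremes` arguments concrete, the sketch below reimplements the filtering idea in plain Python on a toy corpus. It is an illustration of the documented behaviour (drop tokens found in fewer than `no_below` documents or in more than a `no_above` fraction of documents, then keep the `keep_n` most frequent), not Gensim's own code; the function name, the toy documents, and the alphabetical tie-break are assumptions.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: number of documents each token appears in
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Convert the no_above fraction into an absolute document count
    max_docs = int(no_above * len(docs))
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically here)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])

# Toy documents, not taken from the philosophy corpus
docs = [
    ["soul", "be", "form"],
    ["soul", "be", "matter"],
    ["virtue", "be", "habit"],
    ["soul", "be", "virtue"],
]
survivors = filter_extremes_sketch(docs, no_below=2, no_above=0.9, keep_n=10)
print(sorted(survivors))
```

Here `be` is dropped for appearing in every document, and the tokens that appear only once are dropped for being too rare.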

**Bag of Words (BoW) Corpora**

Training Text Corpus

In [None]:
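# Build corpora of normalized text relative to dictionary
# Lemma columns are stored as strings in the CSV, so parse them back into lists
bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in nlp_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in nlp_df.par_lemma]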

In [None]:
len(bow_corpus_s)

In [None]:
len(bow_corpus_s[0])

In [None]:
len(bow_corpus_p)

In [None]:
len(bow_corpus_p[0])
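
Each entry in a BoW corpus is a list of `(token_id, count)` pairs for one document, with tokens missing from the dictionary ignored. The sketch below mimics that mapping in plain Python; `doc2bow_sketch` and the tiny `token2id` mapping are hypothetical stand-ins for `g_dict.doc2bow` and the real dictionary.

```python
from collections import Counter

def doc2bow_sketch(token2id, tokens):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical three-word dictionary; 'say' and 'to' are deliberately absent
token2id = {"be": 0, "name": 1, "thing": 2}
bow = doc2bow_sketch(token2id, ["thing", "be", "say", "to", "be", "name"])
print(bow)
```

This is why `len(bow_corpus_s[0])` reports the number of distinct in-dictionary tokens in the first sentence rather than its word count.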

Testing Text Corpus

In [None]:
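# Build corpora of normalized test text relative to the training dictionary
# Lemma columns are stored as strings in the CSV, so parse them back into lists
t_bow_corpus_s = [g_dict.doc2bow(ast.literal_eval(sent)) for sent in t_nlp_df.sent_lemma]
t_bow_corpus_p = [g_dict.doc2bow(ast.literal_eval(par)) for par in t_nlp_df.par_lemma]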

In [None]:
len(t_bow_corpus_p)

In [None]:
len(t_bow_corpus_p[0])

**TF-IDF Vectorization**

In [None]:
tfidf = TfidfModel(bow_corpus_s, normalize=True)

In [None]:
corpus_s = tfidf[bow_corpus_s]

In [None]:
len(corpus_s)

In [None]:
corpus_p = tfidf[bow_corpus_p]

In [None]:
t_corpus_s = tfidf[t_bow_corpus_s]

In [None]:
len(t_corpus_s)

In [None]:
t_corpus_p = tfidf[t_bow_corpus_p]
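
As a rough picture of what `tfidf[...]` does to a single BoW vector, the sketch below assumes Gensim's default weighting scheme (raw count times `log2(N / df)`, followed by scaling to unit length). The helper name and the toy document frequencies are assumptions made for illustration.

```python
import math

def tfidf_sketch(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by idf = log2(N / df), then L2-normalize."""
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid]))
                for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    # Tokens present in every document get weight zero and are dropped
    return [(tid, w / norm) for tid, w in weighted if w != 0.0]

# Hypothetical document frequencies over a 4-document corpus
doc_freq = {0: 4, 1: 1, 2: 2}
vec = tfidf_sketch([(0, 2), (1, 1), (2, 1)], doc_freq, num_docs=4)
print(vec)
```

Note that the same `tfidf` model, fitted on training sentences, is applied to the paragraph and test corpora, so all four corpora share one set of IDF weights.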

**Save TF-IDF Model to Disk**

In [None]:
tfidf.save('./models/tfidf')

### 2.1.2 LDA Models

**Set Parameter Values**

In [48]:
sent_param = {
    'num_topics':16,
    'random_state':210,
    'chunksize':5000,
    'passes':5,
    'workers':3
}

par_param = {
    'num_topics':8,
    'random_state':210,
    'chunksize':1000,
    'passes':5,
    'workers':3
}

**LDA Multicore Model - Sentences**

In [None]:
# Instantiate model based on parameter values
lda_multi_s = ldamulticore.LdaMulticore(corpus=corpus_s,
                                        id2word=g_dict,
                                        num_topics=sent_param['num_topics'],
                                        random_state=sent_param['random_state'],
                                        chunksize=sent_param['chunksize'],
                                        passes=sent_param['passes'],
                                        per_word_topics=True,
                                        workers=sent_param['workers']
)

In [None]:
lda_multi_s.print_topics()

**LDA Metrics - Sentences**

In [None]:
lda_multi_s.log_perplexity(corpus_s)
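
`log_perplexity` returns a per-word likelihood bound rather than perplexity itself; Gensim's documentation gives the relationship as perplexity = 2^(-bound). A small sketch of the conversion, using a placeholder bound that is not a result from this notebook:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Placeholder value for illustration only
example_bound = -8.0
print(perplexity_from_bound(example_bound))
```

A bound closer to zero therefore corresponds to lower perplexity.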

In [None]:
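# CoherenceModel expects tokenized texts, so parse the stringified lemma lists
sent_texts = [ast.literal_eval(lemma_str) for lemma_str in nlp_df.sent_lemma]
cm_s = CoherenceModel(model=lda_multi_s, texts=sent_texts, dictionary=g_dict, coherence='c_v')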
cm_s.get_coherence()

**LDA Multicore Model - Paragraphs**

In [None]:
# Instantiate model based on parameter values
lda_multi_p = ldamulticore.LdaMulticore(corpus=corpus_p,
                                        id2word=g_dict,
                                        num_topics=par_param['num_topics'],
                                        random_state=par_param['random_state'],
                                        chunksize=par_param['chunksize'],
                                        passes=par_param['passes'],
                                        per_word_topics=True,
                                        workers=par_param['workers']
)

In [None]:
lda_multi_p.print_topics()

**LDA Metrics - Paragraphs**

In [None]:
lda_multi_p.log_perplexity(corpus_p)

In [None]:
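# CoherenceModel expects tokenized texts, so parse the stringified lemma lists
par_texts = [ast.literal_eval(lemma_str) for lemma_str in nlp_df.par_lemma]
cm_p = CoherenceModel(model=lda_multi_p, texts=par_texts, dictionary=g_dict, coherence='c_v')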
cm_p.get_coherence()

### 2.1.3 Visualize with pyLDAvis

In [None]:
lda_display_s = pyLDAvis.gensim.prepare(lda_multi_s, corpus_s, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_s)

In [None]:
lda_display_p = pyLDAvis.gensim.prepare(lda_multi_p, corpus_p, g_dict, sort_topics=True)
pyLDAvis.display(lda_display_p)

### 2.1.4 Optimization
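
One possible shape for this step, offered as a sketch rather than the method used in this project, is a sweep over candidate `num_topics` values that keeps the best-scoring one. `train_fn` and `score_fn` are hypothetical callables; in practice they might wrap `LdaMulticore` and `CoherenceModel(...).get_coherence()`. The stand-in scores below are placeholders so the sketch runs without training anything.

```python
def sweep_num_topics(candidates, train_fn, score_fn):
    """Train one model per candidate topic count and return (best_k, scores)."""
    scores = {}
    for k in candidates:
        model = train_fn(k)          # e.g. fit an LDA model with k topics
        scores[k] = score_fn(model)  # e.g. its c_v coherence
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Placeholder scores, not measured results: the "model" here is just k itself
fake_scores = {8: 0.41, 16: 0.47, 24: 0.44}
best_k, scores = sweep_num_topics(
    [8, 16, 24],
    train_fn=lambda k: k,
    score_fn=fake_scores.get,
)
print(best_k, scores)
```

Passing the training and scoring steps in as callables keeps the loop identical for the sentence and paragraph models.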

**Sentence LDA**

**Paragraph LDA**

## 2.2 Topic Labeling with Gensim Word2Vec

## 2.3 Combine Features

## Continue to Notebook 3: Document Vectors