# Capstone: Text Factorizing with NLP
## Thomas Ludlow

# 02 - Gensim LDA

This notebook contains a prototype model flow for the Exploratory Data Analysis (EDA) process to prepare raw text data from 4 works of philosophy for Latent Dirichlet Allocation (LDA).  The output of this LDA will be a ranking of conceptual differences between works along with Dirichlet-prior similarity weights.  These weights will comprise the inputs for a Recurrent Neural Net model to assess multi-class similarity probabilities.

**Libraries**

In [53]:
# Python Data Science
import re
import numpy as np
import pandas as pd

# Natural Language Processing
import spacy
import gensim
from gensim.corpora import Dictionary
from gensim.models import ldamodel, CoherenceModel

# Plotting
import matplotlib.pyplot as pyplot
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

**Import Book Data**

In [4]:
# Read files from disk
plato_file = open('./data/plato_republic.txt','r')
aristotle_file = open('./data/aristotle_categories.txt','r')
descartes_file = open('./data/descartes_principles.txt','r')
kant_file = open('./data/kant_critique.txt','r')

# Convert files into list of lines
plato = plato_file.readlines()
aristotle = aristotle_file.readlines()
descartes = descartes_file.readlines()
kant = kant_file.readlines()

# Strip publishing text from front and back
plato_lines = plato[8494:24328]
aristotle_lines = aristotle[37:1492]
descartes_lines = descartes[362:-6]
kant_lines = kant[27:-373]

In [13]:
doc_list = [plato_lines, 
            aristotle_lines, 
            descartes_lines, 
            kant_lines]

raw_corpus = [' '.join(doc).replace('\n','') for doc in doc_list]

In [18]:
num_docs = len(doc_list)

In [19]:
for i in range(num_docs):
    print(raw_corpus[i][:200], '\n')

THE REPUBLIC.     PERSONS OF THE DIALOGUE.  Socrates, who is the narrator.  Glaucon.  Adeimantus.  Polemarchus.  Cephalus.  Thrasymachus.  Cleitophon.  And others who are mute auditors.  The scene is  

The Categories   By  Aristotle   Translated by E. M. Edghill    Section 1  Part 1  Things are said to be named 'equivocally' when, though they have a common name, the definition corresponding with the 

SELECTIONS FROM THE PRINCIPLES OF PHILOSOPHY  OF  RENE DESCARTES (1596-1650)  TRANSLATED BY JOHN VEITCH, LL. D. LATE PROFESSOR OF LOGIC AND RHETORIC IN THE UNIVERSITY OF GLASGOW     From the Publisher 

 THE CRITIQUE OF PURE REASON   By Immanuel Kant    Translated by J. M. D. Meiklejohn   Contents   Preface to the First Edition (1781)    Preface to the Second Edition (1787)    Introduction    I. Of t 



**spaCy English Tokens, Lemma, Stopwords**

In [23]:
# Using smallest English library which does not include vectors
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1_500_000

In [24]:
# spaCy processing for tokens, lemma, part-of-speech, dependency
docs_nlp = []

for i in range(num_docs):
    docs_nlp.append(nlp(raw_corpus[i]))

In [30]:
docs_lemma = []

for i in range(num_docs):
    docs_lemma.append([token.lemma_ for token in docs_nlp[i] # List comprehension
                       if token.lemma_ != '-PRON-'           # Pronouns are excluded
                       and token.pos_ != 'PUNCT'             # Punctionation is excluded
                       and token.is_alpha                    # Numbers are excluded
                       and not token.is_stop])               # Stop words are excluded

In [33]:
# Pre-processed text tokens

for i in range(num_docs):
    print(docs_lemma[i][:20], '\n')

['the', 'republic', 'person', 'of', 'the', 'dialogue', 'socrates', 'narrator', 'glaucon', 'adeimantus', 'polemarchus', 'cephalus', 'thrasymachus', 'cleitophon', 'and', 'mute', 'auditor', 'the', 'scene', 'lay'] 

['the', 'categories', 'by', 'aristotle', 'translate', 'edghill', 'section', 'part', 'thing', 'say', 'name', 'equivocally', 'common', 'definition', 'correspond', 'differ', 'thus', 'real', 'man', 'figure'] 

['selection', 'from', 'the', 'principle', 'of', 'philosophy', 'of', 'rene', 'descartes', 'translate', 'by', 'john', 'veitch', 'll', 'late', 'professor', 'of', 'logic', 'and', 'rhetoric'] 

['the', 'critique', 'of', 'pure', 'reason', 'by', 'immanuel', 'kant', 'translate', 'meiklejohn', 'contents', 'preface', 'first', 'edition', 'preface', 'second', 'edition', 'introduction', 'of', 'difference'] 



**Gensim Dictionary and Corpus**

In [35]:
# Build dictionary to contain all terms from normalized text
g_dict = Dictionary(docs_lemma)

In [44]:
# Each word is given an integer index value
for i in range(6):
    print(i, g_dict[i], '\n')

0 a 

1 abate 

2 abdera 

3 abhor 

4 abhorrence 

5 abide 



In [38]:
# Build corpus of normalized text relative to dictionary
corpus = [g_dict.doc2bow(docs_lemma[i])
          for doc in docs_lemma 
          for i in range(num_docs)]

In [45]:
# Text words are indexed to source and dictionary values
for i in range(num_docs):
    print(corpus[i][:5], '\n')

[(0, 31), (1, 1), (2, 1), (3, 1), (4, 2)] 

[(0, 9), (5, 5), (6, 1), (18, 2), (35, 1)] 

[(0, 5), (6, 18), (14, 1), (17, 5), (18, 18)] 

[(0, 97), (5, 6), (6, 80), (8, 3), (12, 1)] 



**Gensim LDA**

In [54]:
# Creation of LDA model
lda_model = ldamodel.LdaModel(corpus=corpus,
                              id2word=g_dict,
                              num_topics=4, 
                              random_state=131,
                              update_every=1,
                              chunksize=100,
                              passes=10,
                              alpha='auto',
                              per_word_topics=True)

In [55]:
lda_model.print_topics()

[(0,
  '0.019*"conception" + 0.015*"reason" + 0.014*"object" + 0.010*"experience" + 0.010*"condition" + 0.009*"the" + 0.009*"time" + 0.009*"pure" + 0.009*"phenomenon" + 0.009*"thing"'),
 (1,
  '0.016*"body" + 0.015*"thing" + 0.010*"mind" + 0.009*"god" + 0.008*"substance" + 0.008*"know" + 0.008*"truth" + 0.007*"nature" + 0.007*"place" + 0.007*"motion"'),
 (2,
  '0.026*"say" + 0.018*"and" + 0.013*"good" + 0.013*"man" + 0.011*"true" + 0.010*"yes" + 0.008*"state" + 0.006*"like" + 0.006*"reply" + 0.006*"thing"'),
 (3,
  '0.017*"man" + 0.017*"contrary" + 0.015*"case" + 0.014*"thing" + 0.014*"quality" + 0.014*"substance" + 0.011*"say" + 0.010*"subject" + 0.010*"for" + 0.010*"term"')]

**Model Performance**

_Perplexity_ is a rating for model performance, with lower values scoring better.  This is calculated against the entire corpus, but can be used in a Train-Test Split to validate performance.

In [56]:
lda_model.log_perplexity(corpus)

-6.8968450046381315

_Coherence_ measures human interpretability of the LDA results, and is calculated using probability calculations around the segmented topics.  Higher values are better.

In [59]:
cm = CoherenceModel(model=lda_model, texts=docs_lemma, dictionary=g_dict, coherence='c_v')
cm.get_coherence()

0.43383330894069172