# Capstone: Philosophical Factors for NLP
**_Measuring Similarity to Philosophical Concepts in Text Data_**

## Thomas W. Ludlow, Jr.
**General Assembly Data Science Immersive DSI-NY-6**

**February 12, 2019**

# Notebook 5 - Factorizing Unseen Text

### Table of Contents

[**5.1 Prepare Models and Input Text**](#5.1-Prepare-Models-and-Input-Text)
- [5.1.1 New Text Preprocessing Function](#5.1.1-New-Text-Preprocessing-Function)

[**5.2 Factorizing**](#5.2-Factorizing)
- [5.2.1 Preprocess Unseen Text for Factorizing](#5.2.1-Preprocess-Unseen-Text-for-Factorizing)
- [5.2.2 Factorize Text](#5.2.2-Factorize-Text)
- [5.2.3 Display Results](#5.2.3-Display-Results)

**Libraries**

In [85]:
# Python Data Science
import re
import ast
import time
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm

# Natural Language Processing
import spacy
import gensim
import pyLDAvis.gensim
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamulticore, CoherenceModel

# Modeling Prep
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import joblib   # sklearn.externals.joblib was removed in scikit-learn 0.23

# Neural Net
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import EarlyStopping

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## 5.1 Prepare Models and Input Text

**Load Models**

In [2]:
rnn = keras.models.load_model('../models/rnn')



In [3]:
logreg = joblib.load('../models/logreg')
# mnb = joblib.load('./models/mnb')

In [4]:
with open('../data_eda/sw.pkl','rb') as file:
    sw = pickle.load(file)

In [6]:
# Using the medium English model, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [8]:
with open('../models/d2v_s.pkl','rb') as d2v_s_file:
    d2v_s = pickle.load(d2v_s_file)

with open('../models/d2v_p.pkl','rb') as d2v_p_file:
    d2v_p = pickle.load(d2v_p_file)

In [102]:
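# Load the LDA models through their own class rather than through Dictionary
lda_multi_s = ldamulticore.LdaMulticore.load('../models/lda_multi_s')
lda_multi_p = ldamulticore.LdaMulticore.load('../models/lda_multi_p')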

### 5.1.1 New Text Preprocessing Function

**Convert Block String to List of Paragraphs**

In [53]:
quote = """
Why give a robot an order to obey orders—why aren't the original orders enough? 
Why command a robot not to do harm—wouldn't it be easier never to command it to 
do harm in the first place? Does the universe contain a mysterious force pulling 
entities toward malevolence, so that a positronic brain must be programmed to 
withstand it? Do intelligent beings inevitably develop an attitude problem?

Now that computers really have become smarter and more powerful, the anxiety has 
waned. Today's ubiquitous, networked computers have an unprecedented ability to 
do mischief should they ever go to the bad. But the only mayhem comes from 
unpredictable chaos or from human malice in the form of viruses. We no longer 
worry about electronic serial killers or subversive silicon cabals because we 
are beginning to appreciate that malevolence—like vision, motor coordination, 
and common sense—does not come free with computation but has to be programmed in.

Aggression, like every other part of human behavior we take for granted, is a 
challenging engineering problem!

Steven Pinker
"""

In [54]:
quote

"\nWhy give a robot an order to obey orders—why aren't the original orders enough? \nWhy command a robot not to do harm—wouldn't it be easier never to command it to \ndo harm in the first place? Does the universe contain a mysterious force pulling \nentities toward malevolence, so that a positronic brain must be programmed to \nwithstand it? Do intelligent beings inevitably develop an attitude problem?\n\nNow that computers really have become smarter and more powerful, the anxiety has \nwaned. Today's ubiquitous, networked computers have an unprecedented ability to \ndo mischief should they ever go to the bad. But the only mayhem comes from \nunpredictable chaos or from human malice in the form of viruses. We no longer \nworry about electronic serial killers or subversive silicon cabals because we \nare beginning to appreciate that malevolence—like vision, motor coordination, \nand common sense—does not come free with computation but has to be programmed in.\n\nAggression, like every o

In [55]:
quote_list = quote.strip().split('\n')

In [59]:
def par_list(text_list, min_lines=0):
    pars = []
    count = 0   # lines accumulated in the current paragraph so far
    
    # Lines shorter than 67% of the longest line are treated as paragraph endings
    line_check_length = max(len(line.strip()) for line in text_list)
    
    for i, line in enumerate(text_list):
        # A blank line (length 0) or a short line marks the end of a paragraph
        if len(line.strip()) < (line_check_length * .67):
            # After enough lines, save the paragraph; after too few, discard it
            if count >= min_lines:
                pars.append(' '.join(l.strip() for l in text_list[i-count:i+1]))
            count = 0
        
        # Otherwise the paragraph continues
        else:
            count += 1
    
    # Note: trailing lines not followed by a short or blank line are not emitted.
    # Drop empty paragraphs produced by consecutive blank lines.
    return [par for par in pars if par != '']
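
When the input reliably separates paragraphs with blank lines, the short-line heuristic in `par_list` is not strictly needed. The following is a minimal sketch of that simpler alternative, not the method used in this notebook; the helper name `split_on_blank_lines` and the sample text are hypothetical.

```python
import re

def split_on_blank_lines(block):
    """Split a block string into paragraphs at one or more blank lines."""
    # A paragraph break is a newline, optional whitespace, then another newline
    chunks = re.split(r'\n\s*\n', block.strip())
    # Re-join the hard-wrapped lines of each paragraph with single spaces
    return [' '.join(line.strip() for line in chunk.split('\n'))
            for chunk in chunks if chunk.strip()]

sample = "First line of one \nparagraph.\n\nSecond paragraph.\n\n\nAuthor Name\n"
paragraphs = split_on_blank_lines(sample)
print(paragraphs)
```

Unlike `par_list`, this version never splits at a short line, so text without blank-line separators would come back as a single paragraph.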

In [60]:
c_pars = par_list(quote_list)

In [61]:
c_pars

["Why give a robot an order to obey orders—why aren't the original orders enough? Why command a robot not to do harm—wouldn't it be easier never to command it to do harm in the first place? Does the universe contain a mysterious force pulling entities toward malevolence, so that a positronic brain must be programmed to withstand it? Do intelligent beings inevitably develop an attitude problem? ",
 "Now that computers really have become smarter and more powerful, the anxiety has waned. Today's ubiquitous, networked computers have an unprecedented ability to do mischief should they ever go to the bad. But the only mayhem comes from unpredictable chaos or from human malice in the form of viruses. We no longer worry about electronic serial killers or subversive silicon cabals because we are beginning to appreciate that malevolence—like vision, motor coordination, and common sense—does not come free with computation but has to be programmed in. ",
 'Aggression, like every other part of huma

**Build DataFrame from Paragraph List**

In [65]:
par_df = pd.DataFrame(columns=['paragraph'])

# Add one row per paragraph
for i, par in enumerate(tqdm(c_pars)):
    par_df.loc[i, 'paragraph'] = par

par_df.head()

100%|██████████| 4/4 [00:00<00:00, 886.93it/s]


| | paragraph |
|---|---|
| 0 | Why give a robot an order to obey orders—why a... |
| 1 | Now that computers really have become smarter ... |
| 2 | Aggression, like every other part of human beh... |
| 3 | Steven Pinker |


**Preprocess from DataFrame with `paragraph` Series**

In [81]:
def preprocess_to_df(par_file, nlp=nlp, sw=['the','a','but','like','for'], to_stem=False):
    # Run spaCy process on each paragraph and store docs in list
    print('1/8: nlp of paragraphs...')
    par_nlp = []
    for par in tqdm(par_file.paragraph):
        par_nlp.append(nlp(par))
    
    # Store paragraph lemma from spaCy docs
    print('2/8: nlp lemmatizing, part-of-speech, stopwords...')
    par_lemma = []
    for par in tqdm(par_nlp):
        par_lemma.append([token.lemma_ for token in par     # List comprehension
                           if token.lemma_ != '-PRON-'           # Pronouns are excluded
                           and token.pos_ != 'PUNCT'             # Punctuation is excluded
                           and token.is_alpha                    # Numbers are excluded
                           and not token.is_stop                 # Stop words are excluded
                          and len(token.lemma_) > 1])
    par_lemma = [[pl[i].lower() for i in range(len(pl))] for pl in par_lemma]
    
    # Remove additional stop words; optionally stem lemma with NLTK PorterStemmer
    print('3/8: additional stopwords...')
    if to_stem:
        from nltk.stem import PorterStemmer   # only needed when stemming
        ps = PorterStemmer()
    par_lemma_sw = []
    for vec_list in tqdm(par_lemma):    
        update_list = []
        for token in vec_list:
            if token in sw: continue
            if to_stem: update_list.append(ps.stem(token))
            else: update_list.append(token)
        par_lemma_sw.append(update_list)
    
    # Store the text of each sentence found by spaCy's sentence segmentation
    print('4/8: saving sentence text...')
    sent_text = []
    for par in tqdm(par_nlp):
        sent_list = []
        for s in par.sents:
            sent_list.append(s.text)
        sent_text.append(sent_list)
    
    # Run spaCy on each sentence doc: NLP
    print('5/8: nlp of sentences...')
    sent_nlp = []
    for par in tqdm(par_nlp):
        sent_list = []
        for s in par.sents:
            sent_list.append(nlp(s.text))
        sent_nlp.append(sent_list)
    
    # Store lemma from spaCy docs
    print('6/8: nlp lemmatizing, part-of-speech, stopwords...')
    sent_lemma = []
    for par in tqdm(sent_nlp):
        for sent in par:
            sent_lemma.append([token.lemma_ for token in sent     # List comprehension
                               if token.lemma_ != '-PRON-'           # Pronouns are excluded
                               and token.pos_ != 'PUNCT'             # Punctuation is excluded
                               and token.is_alpha                    # Numbers are excluded
                               and not token.is_stop                 # Stop words are excluded
                              and len(token.lemma_) > 1])
    sent_lemma = [[sl[j].lower() for j in range(len(sl))] for sl in sent_lemma]
    
    # Remove additional stop words; optionally stem lemma with NLTK PorterStemmer
    print('7/8: additional stopwords...')
    if to_stem:
        from nltk.stem import PorterStemmer   # only needed when stemming
        ps = PorterStemmer()
    sent_lemma_sw = []
    for vec_list in tqdm(sent_lemma):    
        update_list = []
        for token in vec_list:
            if token in sw: continue
            if to_stem: update_list.append(ps.stem(token))
            else: update_list.append(token)
        sent_lemma_sw.append(update_list)
    
    print('8/8: constructing dataframe...')
    columns = ['p_num','s_num','sent_text','sent_lemma','par_text','par_lemma']
    rows = []

    s_lem_count = 0   # running index into the flat sent_lemma_sw list

    for p, sents_in_par in enumerate(tqdm(sent_text)):
        skip = 0      # sentences skipped so far in this paragraph
        for s, sent in enumerate(sents_in_par):
            # Skip very short sentences and sentences with no remaining lemma
            if len(sent) < 10 or len(sent_lemma_sw[s_lem_count]) < 1:
                skip += 1
            else:
                rows.append({'p_num':p,
                             's_num':s - skip,
                             'sent_text':sent,
                             'sent_lemma':sent_lemma_sw[s_lem_count],
                             'par_text':par_file.loc[p, 'paragraph'],
                             'par_lemma':par_lemma_sw[p]
                             })
            s_lem_count += 1

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    nlp_df = pd.DataFrame(rows, columns=columns)

    print('complete')
    return nlp_df
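
The token filtering in steps 2, 3, 6 and 7 can be looked at in isolation. The sketch below is a simplified stand-in, assuming plain strings instead of spaCy tokens, so it does no lemmatization or part-of-speech tagging; `filter_tokens` and the word lists are hypothetical and only illustrate the order of the checks.

```python
def filter_tokens(tokens, stop_words, extra_stop_words=()):
    """Keep alphabetic, non-stop-word tokens longer than one character."""
    kept = []
    for tok in tokens:
        if not tok.isalpha():          # drops punctuation and numbers
            continue
        lowered = tok.lower()
        if lowered in stop_words:      # stand-in for spaCy's token.is_stop
            continue
        if len(lowered) <= 1:          # drops single-character tokens
            continue
        kept.append(lowered)
    # Second pass: remove the additional, project-specific stop words
    extra = set(extra_stop_words)
    return [tok for tok in kept if tok not in extra]

tokens = ['Why', 'give', 'a', 'robot', 'an', 'order', 'to', 'obey',
          'orders', '—', '3', 'laws', '?']
kept = filter_tokens(tokens,
                     stop_words={'why', 'give', 'a', 'an', 'to'},
                     extra_stop_words=['laws'])
print(kept)
```

Because nothing is lemmatized here, `orders` survives as its own token, whereas the spaCy pipeline above reduces it to `order`.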

In [82]:
unseen_df = preprocess_to_df(par_df, nlp, sw)
unseen_df.head()

100%|██████████| 4/4 [00:00<00:00, 70.51it/s]
100%|██████████| 4/4 [00:00<00:00, 6713.57it/s]
100%|██████████| 4/4 [00:00<00:00, 4648.72it/s]
100%|██████████| 4/4 [00:00<00:00, 8371.86it/s]
100%|██████████| 4/4 [00:00<00:00, 32.51it/s]
100%|██████████| 4/4 [00:00<00:00, 5002.15it/s]
100%|██████████| 13/13 [00:00<00:00, 15740.75it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

1/8: nlp of paragraphs...
2/8: nlp lemmatizing, part-of-speech, stopwords...
3/8: additional stopwords...
4/8: saving sentence text...
5/8: nlp of sentences...
6/8: nlp lemmatizing, part-of-speech, stopwords...
7/8: additional stopwords...
8/8: constructing dataframe...


100%|██████████| 4/4 [00:00<00:00, 114.43it/s]

complete





| | p_num | s_num | sent_text | sent_lemma | par_text | par_lemma |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order] | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order, c... |
| 1 | 0 | 1 | Why command a robot not to do harm | [command, robot, harm] | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order, c... |
| 2 | 0 | 2 | it be easier never to command it to do harm in... | [easier, command, harm, place] | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order, c... |
| 3 | 0 | 3 | Does the universe contain a mysterious force p... | [universe, contain, mysterious, force, pull, e... | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order, c... |
| 4 | 0 | 4 | Do intelligent beings inevitably develop an at... | [intelligent, being, inevitably, develop, atti... | Why give a robot an order to obey orders—why a... | [robot, order, obey, order, original, order, c... |


In [84]:
unseen_df.shape

(11, 6)

### LDA Vectors

**BoW Dictionary**

In [103]:
g_dict = Dictionary.load('../models/g_dict')

In [104]:
bow_corpus_s = [g_dict.doc2bow(sent) for sent in unseen_df.sent_lemma]
bow_corpus_p = [g_dict.doc2bow(par) for par in unseen_df.par_lemma]
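
For unseen text, the important property of `doc2bow` is that tokens missing from the training dictionary are silently dropped. The following pure-Python sketch mimics that behavior under that assumption; `doc_to_bow` and the tiny `token2id` mapping are hypothetical and not part of the saved `g_dict`.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    # Tokens that are not in the training vocabulary are ignored
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical training vocabulary
token2id = {'command': 0, 'harm': 1, 'order': 2, 'robot': 3}

bow = doc_to_bow(['robot', 'order', 'obey', 'order', 'original', 'order'],
                 token2id)
print(bow)
```

A sentence made up entirely of out-of-vocabulary words therefore yields an empty bag of words, which is worth keeping in mind when reading the topic vectors that follow.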

**TF-IDF**

In [105]:
tfidf = TfidfModel.load('../models/tfidf')

In [106]:
corpus_s = tfidf[bow_corpus_s]
corpus_p = tfidf[bow_corpus_p]
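
The saved TF-IDF model reweights the new bag-of-words counts with document frequencies learned from the training corpus, not from the unseen text. The sketch below shows one common variant (raw count times log2(N/df), then L2 normalization), which I believe corresponds to gensim's defaults; treat that correspondence, the helper `tfidf_weights`, and the document frequencies as assumptions.

```python
import math

def tfidf_weights(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by idf and scale to unit length."""
    # idf comes from training-corpus statistics, passed in as doc_freq
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid]))
                for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

# Hypothetical training statistics: token 2 appears in 25 of 100 documents,
# token 3 in 50 of 100
doc_freq = {2: 25, 3: 50}
weights = tfidf_weights([(2, 3), (3, 1)], doc_freq, num_docs=100)
print(weights)
```

Rarer training tokens receive larger weights, so a frequent but uninformative word in the new text contributes little to the LDA input.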

**LDA Vectors**

In [107]:
lda_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])

for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(corpus_s))):
    for topic, proba in doc:
        lda_df_s.loc[i, topic] = proba

100%|██████████| 11/11 [00:00<00:00, 115.23it/s]


In [108]:
lda_df_s.fillna(0, inplace=True)
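
The loop and `fillna` above turn sparse `(topic, probability)` lists into dense rows. A minimal sketch of the same idea in plain Python follows; `dense_topic_rows` and the sample values are hypothetical. One difference from the `.loc` approach is that every document gets a row here, even if it came back with no topics at all.

```python
def dense_topic_rows(sparse_docs, num_topics):
    """Expand sparse (topic, probability) lists into fixed-width rows."""
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics       # topics that were not reported stay 0
        for topic, proba in doc:
            row[topic] = proba
        rows.append(row)
    return rows

# Hypothetical output for three documents and four topics
sparse_docs = [[(1, 0.7), (3, 0.2)], [], [(0, 0.5)]]
rows = dense_topic_rows(sparse_docs, num_topics=4)
print(rows)
```

Keeping one row per document matters because the later merges join on the row index.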

In [109]:
lda_df_s.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,...,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024
1,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,...,0.01301,0.01301,0.01301,0.283201,0.326505,0.01301,0.01301,0.01301,0.01301,0.01301
2,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,...,0.011645,0.011645,0.011645,0.011645,0.366406,0.011645,0.011645,0.011645,0.011645,0.011645
3,0.0,0.0,0.0,0.0,0.136429,0.0,0.0,0.0,0.0,0.0,...,0.482956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.140249,0.0
4,0.0,0.295557,0.0,0.0,0.0,0.0,0.0,0.0,0.141469,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [110]:
lda_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])

for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(corpus_p))):
    for topic, proba in doc:
        lda_df_p.loc[i, topic] = proba

100%|██████████| 11/11 [00:00<00:00, 150.54it/s]


In [111]:
lda_df_p.fillna(0, inplace=True)

In [112]:
lda_df = pd.merge(lda_df_s, lda_df_p, left_index=True, right_index=True)
lda_df.shape

(11, 50)

In [113]:
lda_train = pd.read_csv('../data_vec/lda_train.csv')

In [120]:
lda_df.columns = lda_train.columns[:-3]

In [141]:
with open('../models/lda_col.pkl','wb') as lda_col_f:
    pickle.dump(lda_train.columns[:-3].tolist(), lda_col_f)

In [143]:
lda_col = lda_train.columns[:-3].tolist()
lda_col[:5]

['s0_lda_submission',
 's1_lda_sense',
 's2_lda_electorate',
 's3_lda_contention',
 's4_lda_hana']

In [121]:
lda_df.head()

Unnamed: 0,s0_lda_submission,s1_lda_sense,s2_lda_electorate,s3_lda_contention,s4_lda_hana,s5_lda_redistribution,s6_lda_misfortune,s7_lda_idea,s8_lda_reverence,s9_lda_engagement,...,p8_lda_nation,p9_lda_occupiers,p10_lda_seditions,p11_lda_consciousness,p12_lda_god,p13_lda_mankind,p14_lda_weekend,p15_lda_pleasure,p16_lda_opacity,p17_lda_downside
0,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,...,0.202096,0.010599,0.010599,0.056304,0.010599,0.010599,0.010599,0.010599,0.010599,0.010599
1,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,...,0.202143,0.010599,0.010599,0.056309,0.010599,0.010599,0.010599,0.010599,0.010599,0.010599
2,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,...,0.202521,0.010599,0.010599,0.056291,0.010599,0.010599,0.010599,0.010599,0.010599,0.010599
3,0.0,0.0,0.0,0.0,0.136429,0.0,0.0,0.0,0.0,0.0,...,0.202293,0.010599,0.010599,0.056301,0.010599,0.010599,0.010599,0.010599,0.010599,0.010599
4,0.0,0.295557,0.0,0.0,0.0,0.0,0.0,0.0,0.141469,0.0,...,0.202606,0.010599,0.010599,0.056288,0.010599,0.010599,0.010599,0.010599,0.010599,0.010599


**Doc2Vec Models**

In [90]:
unseen_s_vecs = []

for _, sent_vec in unseen_df.sent_lemma.items():   # Series.iteritems was removed in pandas 2.0
    unseen_s_vecs.append(d2v_s.infer_vector(sent_vec))

In [94]:
len(unseen_s_vecs)

11

In [95]:
unseen_p_vecs = []

for _, par_vec in unseen_df.par_lemma.items():   # Series.iteritems was removed in pandas 2.0
    unseen_p_vecs.append(d2v_p.infer_vector(par_vec))

In [96]:
len(unseen_p_vecs)

11

**Combine LDA and Doc2Vec**

In [98]:
# Feature names
s_vec_cols = ['s_vec_'+str(i) for i in range(len(unseen_s_vecs[0]))]
p_vec_cols = ['p_vec_'+str(j) for j in range(len(unseen_p_vecs[0]))]

In [99]:
vec_df = pd.DataFrame(unseen_s_vecs, columns=s_vec_cols)
vec_df.head()

Unnamed: 0,s_vec_0,s_vec_1,s_vec_2,s_vec_3,s_vec_4,s_vec_5,s_vec_6,s_vec_7,s_vec_8,s_vec_9,...,s_vec_22,s_vec_23,s_vec_24,s_vec_25,s_vec_26,s_vec_27,s_vec_28,s_vec_29,s_vec_30,s_vec_31
0,-0.00997,-0.009587,-0.013361,-0.010939,-0.004022,0.011073,-0.01021,-0.007418,0.007176,0.011053,...,-0.001503,0.00109,0.012897,-0.015453,-0.007155,0.01204,-0.006791,-0.011542,0.007764,0.010313
1,0.014659,-0.013282,-0.015417,-0.002377,-0.01493,-0.006043,0.007101,-0.005399,0.013152,0.014586,...,0.005967,-0.012714,0.001268,0.000655,0.007667,-0.012698,-0.011775,-0.004873,-0.015248,0.013013
2,0.001413,0.003536,-0.002364,0.012488,0.004198,0.009308,-0.005033,-0.009144,-0.012357,-0.005746,...,-0.00134,0.00954,0.003944,-0.008543,0.013252,0.001067,0.002793,0.010548,-0.003157,-0.005351
3,0.010007,0.001649,-0.015046,-0.007858,0.006602,0.007073,0.011532,0.014023,-0.007773,0.003398,...,-0.000625,-0.014121,0.004763,-0.000329,-0.012343,0.006974,-0.004036,-0.009591,0.006082,0.011071
4,-0.004119,-0.004192,-0.01338,-0.013126,0.009023,-0.014792,0.013927,4.4e-05,0.003122,0.005116,...,-0.010248,-0.010904,0.011992,0.006625,0.005886,-0.00373,0.000674,-0.009912,0.008087,-0.012772


In [100]:
p_vec_df = pd.DataFrame(unseen_p_vecs, columns=p_vec_cols)
p_vec_df.head()

Unnamed: 0,p_vec_0,p_vec_1,p_vec_2,p_vec_3,p_vec_4,p_vec_5,p_vec_6,p_vec_7,p_vec_8,p_vec_9,p_vec_10,p_vec_11,p_vec_12,p_vec_13,p_vec_14,p_vec_15,p_vec_16,p_vec_17
0,0.024637,-0.026958,-0.022527,-0.001932,-0.009715,-0.001683,0.00676,0.012981,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
1,0.024637,-0.026958,-0.022527,-0.001932,-0.009715,-0.001683,0.00676,0.012981,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
2,0.024637,-0.026958,-0.022527,-0.001932,-0.009715,-0.001683,0.00676,0.012981,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
3,0.024637,-0.026958,-0.022527,-0.001932,-0.009715,-0.001683,0.00676,0.012981,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
4,0.024637,-0.026958,-0.022527,-0.001932,-0.009715,-0.001683,0.00676,0.012981,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315


In [101]:
for col_name in p_vec_cols:
    vec_df[col_name] = p_vec_df[col_name]

In [122]:
unseen_vec_df = pd.merge(lda_df, vec_df, left_index=True, right_index=True)
unseen_vec_df.head()

Unnamed: 0,s0_lda_submission,s1_lda_sense,s2_lda_electorate,s3_lda_contention,s4_lda_hana,s5_lda_redistribution,s6_lda_misfortune,s7_lda_idea,s8_lda_reverence,s9_lda_engagement,...,p_vec_8,p_vec_9,p_vec_10,p_vec_11,p_vec_12,p_vec_13,p_vec_14,p_vec_15,p_vec_16,p_vec_17
0,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,...,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
1,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,...,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
2,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,...,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
3,0.0,0.0,0.0,0.0,0.136429,0.0,0.0,0.0,0.0,0.0,...,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315
4,0.0,0.295557,0.0,0.0,0.0,0.0,0.0,0.0,0.141469,0.0,...,-0.000117,0.004312,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315


In [123]:
unseen_vec_df['p_num'] = unseen_df.p_num
unseen_vec_df['s_num'] = unseen_df.s_num
unseen_vec_df.head()

Unnamed: 0,s0_lda_submission,s1_lda_sense,s2_lda_electorate,s3_lda_contention,s4_lda_hana,s5_lda_redistribution,s6_lda_misfortune,s7_lda_idea,s8_lda_reverence,s9_lda_engagement,...,p_vec_10,p_vec_11,p_vec_12,p_vec_13,p_vec_14,p_vec_15,p_vec_16,p_vec_17,p_num,s_num
0,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,0.012024,...,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315,0,0
1,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,0.01301,...,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315,0,1
2,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,0.011645,...,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315,0,2
3,0.0,0.0,0.0,0.0,0.136429,0.0,0.0,0.0,0.0,0.0,...,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315,0,3
4,0.0,0.295557,0.0,0.0,0.0,0.0,0.0,0.0,0.141469,0.0,...,-0.018036,-0.018227,0.015383,0.017617,-0.021467,0.003174,-0.016102,-0.001315,0,4


## 5.2 Factorizing

In [129]:
text_meta = pd.read_csv('./data_source_text/text_meta.csv')
text_meta.head()

Unnamed: 0,Title,Author,Filename,Start Key,End Key,Category,Bumper Sticker,Original Language,Country,Year,Year Val,Wiki Link,Wiki Text,Paragraphs
0,The Categories,Aristotle,aristotle_categories.txt,*** START OF THIS PROJECT GUTENBERG EBOOK THE ...,End of the Project Gutenberg EBook of The Cate...,Hylomorphism,Being is a compound of matter and form,Greek,Greece,~335 BC,-335,https://en.wikipedia.org/wiki/Categories_(Aris...,The Categories (Greek Κατηγορίαι Katēgoriai; L...,132
1,The Poetics,Aristotle,aristotle_poetics.txt,ARISTOTLE ON THE ART OF POETRY,End of the Project Gutenberg EBook of The Poet...,Dramatic and Literary Theory,"Dramatic works imitate but vary in music, char...",Greek,Greece,335 BC,-335,https://en.wikipedia.org/wiki/Poetics_(Aristotle),Aristotle's Poetics (Greek: Περὶ ποιητικῆς; La...,72
2,Analects,Confucius,confucius_analects.txt,THE CHINESE CLASSICS,End of Project Gutenberg Etext THE CHINESE CLA...,Confucianism,Moral welfare and human virtue spring from alt...,Chinese,China,~475-206 BC,-206,https://en.wikipedia.org/wiki/Analects,The Analects (Chinese: 論語; pinyin: Lúnyǔ; Old ...,509
3,The Doctrine of the Mean,Confucius,confucius_mean.txt,THE DOCTRINE OF THE MEAN,THE END,Confucianism,"The superior man uses self-watchfulness, lenie...",Chinese,China,~500 BC,-500,https://en.wikipedia.org/wiki/Doctrine_of_the_...,The Doctrine of the Mean or Zhongyong is both ...,70
4,Meditations on First Philosophy,"Descartes, Rene",descartes_meditations.txt,TO THE MOST WISE AND ILLUSTRIOUS THE,"1Copyright: 1996, James Fieser (jfieser@utm.ed...",Skepticism,Man is a thinking being capable of understandi...,Latin,France,1641,1641,https://en.wikipedia.org/wiki/Meditations_on_F...,Meditations on First Philosophy in which the e...,144


In [133]:
author_df = pd.DataFrame(columns=['author','titles','categories','bumper_stickers','country','years'])

a_num = -1
prev_author = None

for i, row in text_meta.iterrows():
    if row.Author != prev_author:
        a_num += 1
        author_df.loc[a_num, 'author'] = row.Author
        author_df.loc[a_num, 'titles'] = row.Title
        author_df.loc[a_num, 'categories'] = row.Category
        author_df.loc[a_num, 'bumper_stickers'] = row['Bumper Sticker']
        author_df.loc[a_num, 'country'] = row.Country
        author_df.loc[a_num, 'years'] = row.Year
        prev_author = row.Author
    else:
        author_df.loc[a_num, 'titles'] += str('\n'+row.Title)
        if row.Category != author_df.loc[a_num, 'categories']:
            author_df.loc[a_num, 'categories'] += str('\n'+row.Category)
        author_df.loc[a_num, 'bumper_stickers'] += str('\n'+row['Bumper Sticker'])
        author_df.loc[a_num, 'years'] += str('\n'+row.Year)

author_df.head()

| | author | titles | categories | bumper_stickers | country | years |
|---|---|---|---|---|---|---|
| 0 | Aristotle | The Categories\nThe Poetics | Hylomorphism\nDramatic and Literary Theory | Being is a compound of matter and form\nDramat... | Greece | ~335 BC\n335 BC |
| 1 | Confucius | Analects\nThe Doctrine of the Mean | Confucianism | Moral welfare and human virtue spring from alt... | China | ~475-206 BC\n~500 BC |
| 2 | Descartes, Rene | Meditations on First Philosophy\nThe Principle... | Skepticism\nRationalism | Man is a thinking being capable of understandi... | France | 1641\n1644 |
| 3 | Gibran, Khalil | The Prophet | Mysticism | The human condition is linked to union with th... | United States | 1923 |
| 4 | Hobbes, Thomas | Leviathan | Political Philosophy | Only strong unified government can save from "... | England | 1651 |
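
The loop above relies on `text_meta` rows for the same author being adjacent. As a sketch of an order-independent alternative, the grouping can be done with a dictionary keyed by author; `group_by_author` is a hypothetical helper, and the records reuse a few values shown in `text_meta.head()`.

```python
def group_by_author(records):
    """Collect titles, unique categories and years for each author."""
    grouped = {}
    for rec in records:
        entry = grouped.setdefault(
            rec['Author'], {'titles': [], 'categories': [], 'years': []})
        entry['titles'].append(rec['Title'])
        # Keep each category once, in first-seen order
        if rec['Category'] not in entry['categories']:
            entry['categories'].append(rec['Category'])
        entry['years'].append(rec['Year'])
    return grouped

records = [
    {'Author': 'Aristotle', 'Title': 'The Categories',
     'Category': 'Hylomorphism', 'Year': '~335 BC'},
    {'Author': 'Confucius', 'Title': 'Analects',
     'Category': 'Confucianism', 'Year': '~475-206 BC'},
    {'Author': 'Aristotle', 'Title': 'The Poetics',
     'Category': 'Dramatic and Literary Theory', 'Year': '335 BC'},
    {'Author': 'Confucius', 'Title': 'The Doctrine of the Mean',
     'Category': 'Confucianism', 'Year': '~500 BC'},
]
authors = group_by_author(records)
print(authors['Aristotle']['titles'])
```

Storing lists rather than newline-joined strings also avoids the later `split('\n')` calls when the details are displayed.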


In [134]:
author_df.shape

(20, 6)

In [136]:
author_df.to_csv('./data_source_text/author_df.csv', index=False)

### 5.2.2 Factorize Text

In [124]:
preds = rnn.predict(unseen_vec_df)

In [137]:
pred_df = pd.DataFrame(preds, columns=author_df.author)
pred_df.head()

author,Aristotle,Confucius,"Descartes, Rene","Gibran, Khalil","Hobbes, Thomas","Hume, David","James, William","Kant, Immanuel","Khyyam, Omar","Locke, John","Machiavelli, Nicolo","Mill, John Stuart","Nietzsche, Friedrich","Paine, Thomas",Plato,"Rousseau, Jean-Jacques","Russell, Bertrand","Spinoza, Baruch",Sun Tzu,"Thoreau, Henry David"
0,0.036581,0.035552,0.040636,0.034015,0.039961,0.058626,0.037992,0.049685,0.031382,0.056295,0.047755,0.062128,0.037839,0.075129,0.062738,0.073575,0.056453,0.065341,0.056272,0.042048
1,0.028503,0.027911,0.033767,0.021311,0.028246,0.058792,0.034723,0.046848,0.017751,0.076751,0.049557,0.077209,0.026182,0.096408,0.056765,0.075493,0.070795,0.070092,0.060988,0.041907
2,0.012231,0.018957,0.018673,0.010963,0.018072,0.042943,0.016431,0.022834,0.005275,0.126913,0.049547,0.095843,0.013553,0.197972,0.054076,0.085611,0.046764,0.067497,0.051352,0.044493
3,0.005336,0.01281,0.010049,0.006452,0.013198,0.029149,0.008636,0.012743,0.001967,0.15555,0.037519,0.097138,0.007801,0.302998,0.049128,0.088315,0.028108,0.067123,0.032924,0.033057
4,0.00205,0.009934,0.005259,0.00472,0.01171,0.018551,0.004283,0.005404,0.000937,0.165122,0.026425,0.089822,0.004982,0.404112,0.046731,0.082074,0.012913,0.062586,0.016544,0.025842


In [144]:
pred_df.shape

(11, 20)

In [145]:
factors = pred_df.mean().sort_values(ascending=False)

In [157]:
factors.index[0]

'Nietzsche, Friedrich'
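
The factor scores are the column means of the per-sentence predictions, so every sentence carries equal weight regardless of its length. A small sketch of that aggregation, assuming each prediction row is a probability distribution over authors; `rank_factors`, the labels and the numbers are hypothetical.

```python
def rank_factors(pred_rows, labels):
    """Average per-sentence probabilities and sort labels by mean score."""
    n_rows = len(pred_rows)
    means = [sum(row[k] for row in pred_rows) / n_rows
             for k in range(len(labels))]
    return sorted(zip(labels, means), key=lambda pair: pair[1], reverse=True)

labels = ['Author A', 'Author B', 'Author C']   # hypothetical classes
pred_rows = [
    [0.2, 0.5, 0.3],   # sentence 1
    [0.1, 0.3, 0.6],   # sentence 2
    [0.3, 0.1, 0.6],   # sentence 3
]
ranking = rank_factors(pred_rows, labels)
for name, score in ranking:
    print('{}: {:.3f}'.format(name, score))
```

If each row sums to 1, the averaged scores also sum to 1, so they can be read as an overall mixture across authors.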

In [None]:
# def nlp_factorize(lda_df, model=model_full, ss=ss):
#     lda_sc = ss.transform(lda_df.values)    
#     preds = model.predict_proba(lda_sc)
#     return preds

In [None]:
# preds = nlp_factorize(bs_df)

### 5.2.3 Display Results

In [233]:
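# Relies on globals defined above: spaCy `nlp`, stopwords list `sw`, `g_dict`, `tfidf`,
# `lda_multi_s`, `lda_multi_p`, `lda_col`, `d2v_s`, `d2v_p`, `rnn`, and `author_df`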

def display_nlp_factors(block_str, n_factors=5, n_details=1):
    # convert block string to list of paragraph strings
    c_pars = par_list(block_str.strip().split('\n'))
    
    # convert list of paragraph strings to dataframe
    par_df = pd.DataFrame(columns=['paragraph'])
    for i, book in enumerate(tqdm(c_pars)):
        par_df.loc[i, 'paragraph'] = book
    
    # spaCy preprocessing to dataframe
    unseen_df = preprocess_to_df(par_df, nlp, sw)
    
    # bag of words corpus using Gensim
    bow_corpus_s = [g_dict.doc2bow(sent) for sent in unseen_df.sent_lemma]
    bow_corpus_p = [g_dict.doc2bow(par) for par in unseen_df.par_lemma]
    
    # TFIDF vectorization with Gensim
    corpus_s = tfidf[bow_corpus_s]
    corpus_p = tfidf[bow_corpus_p]
    
    # LDA vectors
    lda_df_s = pd.DataFrame(columns=[n for n in range(lda_multi_s.num_topics)])
    for i, doc in enumerate(lda_multi_s.get_document_topics(tqdm(corpus_s))):
        for topic, proba in doc:
            lda_df_s.loc[i, topic] = proba
    lda_df_s.fillna(0, inplace=True)
    
    lda_df_p = pd.DataFrame(columns=[n for n in range(lda_multi_p.num_topics)])
    for i, doc in enumerate(lda_multi_p.get_document_topics(tqdm(corpus_p))):
        for topic, proba in doc:
            lda_df_p.loc[i, topic] = proba
    lda_df_p.fillna(0, inplace=True)
    
    # LDA dataframe
    lda_df = pd.merge(lda_df_s, lda_df_p, left_index=True, right_index=True)
    lda_df.columns = lda_col
    
    # Doc2Vec
    unseen_s_vecs = []
    for _, sent_vec in unseen_df.sent_lemma.items():
        unseen_s_vecs.append(d2v_s.infer_vector(sent_vec))
        
    unseen_p_vecs = []
    for _, par_vec in unseen_df.par_lemma.items():
        unseen_p_vecs.append(d2v_p.infer_vector(par_vec))
    
    # column names for Doc2Vec
    s_vec_cols = ['s_vec_'+str(i) for i in range(len(unseen_s_vecs[0]))]
    p_vec_cols = ['p_vec_'+str(j) for j in range(len(unseen_p_vecs[0]))]
    
    # dataframe from d2v vectors
    vec_df = pd.DataFrame(unseen_s_vecs, columns=s_vec_cols)
    p_vec_df = pd.DataFrame(unseen_p_vecs, columns=p_vec_cols)
    for col_name in p_vec_cols:
        vec_df[col_name] = p_vec_df[col_name]
    
    # merge with LDA dataframe
    unseen_vec_df = pd.merge(lda_df, vec_df, left_index=True, right_index=True)
    unseen_vec_df['p_num'] = unseen_df.p_num
    unseen_vec_df['s_num'] = unseen_df.s_num
    
    # predict on input
    preds = rnn.predict(unseen_vec_df)
    pred_df = pd.DataFrame(preds, columns=author_df.author)
    
    # combine predicted results and sort
    factors = pred_df.mean().sort_values(ascending=False)
    
    # format output display
    print('-'*30)
    print('Text:\n', '"'+block_str+'"')
    print('-'*30)
    print('\nTop Authors:\n')
    
    for i in range(n_details):
        a_name = factors.index[i]
        titles = author_df[author_df.author==factors.index[i]].titles.item().split('\n')
        cats = author_df[author_df.author==factors.index[i]].categories.item().split('\n')
        bs = author_df[author_df.author==factors.index[i]].bumper_stickers.item().split('\n')
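        # Wrap long themes at the last space before column 65
        for k, b in enumerate(bs):
            if len(b) >= 65:
                spl_ix = b[:65].rfind(' ')
                if spl_ix > 0:
                    # Skip the space itself so the continuation line is not indented by it
                    bs[k] = ''.join((b[:spl_ix],'\n\t\t\t',b[spl_ix+1:]))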
        country = author_df[author_df.author==factors.index[i]].country.item()
        years = author_df[author_df.author==factors.index[i]].years.item().split('\n')
        
        print('\tAuthor:\t\t{}'.format(a_name))
        print('\tCountry:\t{}'.format(country))
        if len(titles) > 1:
            for t, title in enumerate(titles):
                if t == 0: print('\tWorks:\t\t{} ({})'.format(titles[t], years[t]))
                else: print('\t\t\t{} ({})'.format(titles[t], years[t]))
        else: print('\tWork:\t\t{} ({})'.format(titles[0], years[0]))
        if len(cats) > 1:
            for c, cat in enumerate(cats):
                if c == 0: print('\tCategory:\t{}'.format(cats[c]))
                else: print('\t\t\t{}'.format(cats[c]))
        else: print('\tCategory:\t{}'.format(cats[0]))
        if len(bs) > 1:
            for b, bstkr in enumerate(bs):
                if b == 0: print('\tThemes:\t\t{}'.format(bs[b]))
                else: print('\t\t\t{}'.format(bs[b]))
        else: print('\tTheme:\t\t{}'.format(bs[0]))
        print('')
    
    print('-'*30)
    print('\nTop Factors:')
    for j in range(n_factors):
        # .iloc makes the positional lookup explicit on the author-labelled Series
        print('\t{}:{}{:.3f}'.format(factors.index[j], (' '*(30-len(factors.index[j]))), factors.iloc[j]))
    print('')
    print('-'*30)
    return factors
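
The aggregation step in the function above, `pred_df.mean().sort_values(ascending=False)`, averages the model's per-row predictions column by column and ranks the authors by that average. A minimal standard-library sketch of the same idea follows. The helper name `mean_factors`, the sample rows, and the three author labels are hypothetical stand-ins for illustration, not values produced by the notebook's model.

```python
def mean_factors(pred_rows, labels):
    """Average each label's score over all rows, then rank high to low.

    pred_rows: list of per-sentence score lists (one score per label).
    labels:    label names, in the same order as the scores.
    """
    if not pred_rows:
        raise ValueError("need at least one prediction row")
    # zip(*rows) transposes the rows into per-label columns
    means = [sum(col) / len(col) for col in zip(*pred_rows)]
    # sort (label, mean) pairs by mean score, descending
    return sorted(zip(labels, means), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for two sentences over three authors
rows = [
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
ranked = mean_factors(rows, ["Plato", "Hume, David", "Kant, Immanuel"])
for name, score in ranked:
    # same padded layout as the "Top Factors" printout
    print("{}:{}{:.3f}".format(name, " " * (30 - len(name)), score))
```

Because every sentence row carries equal weight in the mean, a long input is not dominated by any single sentence. Whether that weighting is the right one for a given text is a design choice, not something this sketch settles.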

In [234]:
quote

"\nWhy give a robot an order to obey orders—why aren't the original orders enough? \nWhy command a robot not to do harm—wouldn't it be easier never to command it to \ndo harm in the first place? Does the universe contain a mysterious force pulling \nentities toward malevolence, so that a positronic brain must be programmed to \nwithstand it? Do intelligent beings inevitably develop an attitude problem?\n\nNow that computers really have become smarter and more powerful, the anxiety has \nwaned. Today's ubiquitous, networked computers have an unprecedented ability to \ndo mischief should they ever go to the bad. But the only mayhem comes from \nunpredictable chaos or from human malice in the form of viruses. We no longer \nworry about electronic serial killers or subversive silicon cabals because we \nare beginning to appreciate that malevolence—like vision, motor coordination, \nand common sense—does not come free with computation but has to be programmed in.\n\nAggression, like every o

In [235]:
results = display_nlp_factors(quote, 3, 3)

1/8: nlp of paragraphs...
2/8: nlp lemmatizing, part-of-speech, stopwords...
3/8: additional stopwords...
4/8: saving sentence text...
5/8: nlp of sentences...
6/8: nlp lemmatizing, part-of-speech, stopwords...
7/8: additional stopwords...
8/8: constructing dataframe...
complete
------------------------------
Text:
 "
Why give a robot an order to obey orders—why aren't the original orders enough? 
Why command a robot not to do harm—wouldn't it be easier never to command it to 
do harm in the first place? Does the universe contain a mysterious force pulling 
entities toward malevolence, so that a positronic brain must be programmed to 
withstand it? Do intelligent beings inevitably develop an attitude problem?

Now that computers really have become smarter and more powerful, the anxiety has 
waned. Today's ubiquitous, networked computers have an unprecedented ability to 
do mischief should they ever go to the bad. But the only mayhem comes from 
unpredictable chaos or from human malice in the form of viruses. We no longer 
worry about electronic serial killers or subversive silicon cabals because we 
are beginning to appreciate that malevolence—like vision, motor coordination, 
and common sense—does not come free with computation but has to be programmed in.

A




In [238]:
q = """
Doubt as sin. — Christianity has done its utmost to close the circle and 
declared even doubt to be sin. One is supposed to be cast into belief 
without reason, by a miracle, and from then on to swim in it as in the 
brightest and least ambiguous of elements: even a glance towards land, 
even the thought that one perhaps exists for something else as well as 
swimming, even the slightest impulse of our amphibious nature — is sin! 
And notice that all this means that the foundation of belief and all 
reflection on its origin is likewise excluded as sinful. What is wanted 
are blindness and intoxication and an eternal song over the waves in which 
reason has drowned.
"""

In [242]:
q = """
To teach how to live without certainty, and yet without being paralyzed 
by hesitation, is perhaps the chief thing that philosophy, in our age, 
can still do for those who study it.
"""

In [243]:
display_nlp_factors(q)

1/8: nlp of paragraphs...
2/8: nlp lemmatizing, part-of-speech, stopwords...
3/8: additional stopwords...
4/8: saving sentence text...
5/8: nlp of sentences...
6/8: nlp lemmatizing, part-of-speech, stopwords...
7/8: additional stopwords...
8/8: constructing dataframe...
complete
------------------------------
Text:
 "
To teach how to live without certainty, and yet without being paralyzed 
by hesitation, is perhaps the chief thing that philosophy, in our age, 
can still do for those who study it.
"
------------------------------

Top Authors:

	Author:		Paine, Thomas
	Country:	United States
	Works:		Common Sense (1776)
			The Rights of Man (1791)
	Category:	Political Philosophy
	Themes:		American colonists should revolt against inevitable British
			 oppression in pursuit of egalitarian government
			Political revolution is permissible when a government does not
			 safeguard the natural rights of its people

------------------------------

Top Factors:
	Paine, Thomas:                 




author
Paine, Thomas             0.075001
Rousseau, Jean-Jacques    0.074860
Spinoza, Baruch           0.066847
Plato                     0.063904
Mill, John Stuart         0.063293
Sun Tzu                   0.058335
Hume, David               0.058238
Russell, Bertrand         0.057609
Locke, John               0.057511
Kant, Immanuel            0.051169
Machiavelli, Nicolo       0.046680
Descartes, Rene           0.040062
Thoreau, Henry David      0.039974
Hobbes, Thomas            0.038967
Nietzsche, Friedrich      0.037736
James, William            0.037654
Aristotle                 0.036175
Confucius                 0.034795
Gibran, Khalil            0.031862
Khyyam, Omar              0.029330
dtype: float32
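
One way to read the scores above: the 20 values sum to roughly 1, which is what a softmax output layer would produce (an assumption here, since the model definition is not shown in this notebook). Under that reading, a model with no preference at all would give each author 1/20 = 0.05. The sketch below compares the printed scores to that uniform baseline. The helper name `lift_over_uniform` is hypothetical; the scores are copied from the output above.

```python
# Scores copied from the printed Series above (20 authors)
scores = {
    "Paine, Thomas": 0.075001,
    "Rousseau, Jean-Jacques": 0.074860,
    "Spinoza, Baruch": 0.066847,
    "Plato": 0.063904,
    "Mill, John Stuart": 0.063293,
    "Sun Tzu": 0.058335,
    "Hume, David": 0.058238,
    "Russell, Bertrand": 0.057609,
    "Locke, John": 0.057511,
    "Kant, Immanuel": 0.051169,
    "Machiavelli, Nicolo": 0.046680,
    "Descartes, Rene": 0.040062,
    "Thoreau, Henry David": 0.039974,
    "Hobbes, Thomas": 0.038967,
    "Nietzsche, Friedrich": 0.037736,
    "James, William": 0.037654,
    "Aristotle": 0.036175,
    "Confucius": 0.034795,
    "Gibran, Khalil": 0.031862,
    "Khyyam, Omar": 0.029330,
}


def lift_over_uniform(scores):
    """Divide each score by the uniform share 1 / len(scores)."""
    baseline = 1.0 / len(scores)
    return {name: value / baseline for name, value in scores.items()}


total = sum(scores.values())
lift = lift_over_uniform(scores)
print("sum of scores: {:.3f}".format(total))
print("uniform baseline: {:.3f}".format(1.0 / len(scores)))
# dicts keep insertion order, so the first three keys are the top three scores
for name in list(scores)[:3]:
    print("{}: {:.2f}x baseline".format(name, lift[name]))
```

The top score is only about 1.5 times the uniform share, so for this short quotation the ranking is probably better read as a mild lean than as a confident attribution.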