# Capstone: Text Factorizing with NLP
## Thomas W Ludlow Jr
### General Assembly DSI-NY-6, 2019

#### Problem Statement

**I will use Natural Language Processing (NLP) tools such as Python's NLTK to "factorize" a passage of text by identifying and quantifying its similarities to historical texts and literature. Starting with sentences and then expanding to paragraphs and longer passages, I will identify grammatical patterns and word similarities and measure how closely the passage aligns, in style and content, with historical and philosophical texts and classic literature available online from Project Gutenberg.**

#### Modeling Ideas

_Tools_
 - scikit-learn
 - Natural Language Toolkit (NLTK)
 - Keras
 - TensorFlow
 - Gensim
 - spaCy

_Preprocessing_
 - Tokenizing
 - Stemming/Lemmatizing
 - n-grams of 2+
 - Part of Speech Tagging
 - Dependency Modeling
 - Topic Modeling

_Models_
 - Literal similarity
  - Lemmatize and compare n-grams to corpora with string functionality
  - Token Vectors for word counts
  - Part of Speech Tagging to identify most-common word orders of different lengths
  - Dependency modeling to group connected words
 - Sentiment Analysis
 - Word2Vec
  - Cosine similarity to identify similarity between passage and corpora
 - Keras Neural Net
  - Multi-class model with one class per corpus plus one extra class
  - Sigmoid outputs showing the level of similarity to each corpus, with the extra class representing dissimilarity
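
As a minimal sketch of the literal-similarity idea above, shared word n-grams between a passage and a corpus can be scored with Jaccard overlap. The helper names `ngrams` and `jaccard` and the two sample strings are illustrative assumptions, and plain whitespace tokens stand in for lemmas.

```python
def ngrams(tokens, n=2):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative token lists; a real run would use lemmatized text
passage = "the knowledge of grammar is not relative".split()
corpus = "the knowledge of music is not relative".split()

shared = ngrams(passage) & ngrams(corpus)            # bigrams found in both
score = jaccard(ngrams(passage), ngrams(corpus))     # share of all bigrams that overlap
```

Raising `n` makes matches rarer but more specific, which is one way to trade recall for precision when comparing against a large corpus.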
 

# 01 - spaCy, sklearn for Keras RNN

This section explores spaCy's pipeline components and methods. Once the behavior of each tool is established, the next step is an approach for processing large bodies of text on hosted compute (AWS).

#### Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import spacy

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### Philosophy Texts

In [4]:
def read_lines(path):
    # read a text file into a list of lines and close the file afterwards
    with open(path, 'r') as f:
        return f.readlines()

plato = read_lines('./data/plato_republic.txt')
aristotle = read_lines('./data/aristotle_categories.txt')
descartes = read_lines('./data/descartes_principles.txt')
kant = read_lines('./data/kant_critique.txt')

In [5]:
plato_lines = plato[8494:24328] # contents of work
aristotle_lines = aristotle[37:1492] # contents of work
descartes_lines = descartes[362:-6] # contents of work
kant_lines = kant[27:-373] # contents of work
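
The hard-coded slice indices above depend on the exact files used. As a hedged alternative, Project Gutenberg plain-text files usually carry `*** START OF ...` and `*** END OF ...` marker lines, which can be located programmatically. The helper `strip_gutenberg` and the sample lines below are hypothetical, and the marker wording varies between files, so this is a sketch under that assumption rather than a drop-in replacement. It also would not skip a translator's introduction the way the manual slices can.

```python
def strip_gutenberg(lines):
    """Return the lines between the '*** START OF' and '*** END OF' markers.

    Assumes the markers begin with these prefixes; falls back to the
    start or end of the file when a marker is missing.
    """
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        text = line.strip().upper()
        if text.startswith('*** START OF'):
            start = i + 1          # body begins after the start marker
        elif text.startswith('*** END OF'):
            end = i                # body stops before the end marker
            break
    return lines[start:end]


# Made-up sample standing in for the result of readlines()
sample = [
    'Title: Example\n',
    '*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'First line of the work.\n',
    'Second line of the work.\n',
    '*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'License text.\n',
]
body = strip_gutenberg(sample)
```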

**Paragraph Selection**

In [90]:
for n, line in enumerate(plato_lines):
    if "knowledge" in line.lower() and "virtue" in line.lower(): 
        print(n)

4552
14029


In [98]:
plato_lines[14077:14090]

['Those then who know not wisdom and virtue, and are always busy with\n',
 'gluttony and sensuality, go down and up again as far as the mean; and\n',
 'in this region they move at random throughout life, but they never pass\n',
 'into the true upper world; thither they neither look, nor do they ever\n',
 'find their way, neither are they truly filled with true being, nor do\n',
 'they taste of pure and abiding pleasure. Like cattle, with their eyes\n',
 'always looking down and their heads stooping to the earth, that is,\n',
 'to the dining-table, they fatten and feed and breed, and, in their\n',
 'excessive love of these delights, they kick and butt at one another with\n',
 'horns and hoofs which are made of iron; and they kill one another by\n',
 'reason of their insatiable lust. For they fill themselves with that\n',
 'which is not substantial, and the part of themselves which they fill is\n',
 'also unsubstantial and incontinent.\n']

In [6]:
plato_par = plato_lines[14077:14090]
aristotle_par = aristotle_lines[983:995]
descartes_par = descartes_lines[3052:3077]
kant_par = kant_lines[20381:20400]

## spaCy Exploration

In [7]:
nlp = spacy.load('en_core_web_lg')

In [8]:
ar_str = ' '.join(aristotle_par).replace('\n','') # convert list of strings sep by '\n' into single string
pl_str = ' '.join(plato_par).replace('\n','')
de_str = ' '.join(descartes_par).replace('\n','')
ka_str = ' '.join(kant_par).replace('\n','')


In [173]:
ar_str = ' '.join(aristotle_lines).replace('\n','') # convert list of strings sep by '\n' into single string
pl_str = ' '.join(plato_lines).replace('\n','')
de_str = ' '.join(descartes_lines).replace('\n','')
ka_str = ' '.join(kant_lines).replace('\n','')

In [175]:
nlp.max_length = 1_500_000
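
Raising `nlp.max_length` above spaCy's default of 1,000,000 characters lets each full text go through in one call, at the cost of memory. A possible alternative is to split a long string into pieces that each fit under the limit and process them separately, for example with `nlp.pipe`. The helper `chunk_text` below is a hypothetical sketch that breaks at whitespace; a single word longer than `limit` would still produce an oversized chunk.

```python
def chunk_text(text, limit):
    """Split text into pieces of at most `limit` characters, breaking at spaces."""
    chunks, current, size = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and size + extra > limit:
            chunks.append(' '.join(current))   # close the full chunk
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks


# Tiny limit chosen only to make the behavior visible
chunks = chunk_text('all rational cognition is based on conceptions', limit=20)
```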

In [176]:
doc_a = nlp(ar_str)
doc_p = nlp(pl_str)
doc_d = nlp(de_str)
doc_k = nlp(ka_str)

In [177]:
docs = [doc_a, doc_p, doc_d, doc_k]
doc_names = ['Aristotle', 'Plato', 'Descartes', 'Kant']

for i in range(len(docs)):
    for j in range(len(docs)):
        print(doc_names[i], doc_names[j], ":", docs[i].similarity(docs[j]))
    print('')

Aristotle Aristotle : 1.0
Aristotle Plato : 0.986406569347
Aristotle Descartes : 0.99242321186
Aristotle Kant : 0.990772543316

Plato Aristotle : 0.986406569347
Plato Plato : 1.0
Plato Descartes : 0.993415082399
Plato Kant : 0.980461739495

Descartes Aristotle : 0.99242321186
Descartes Plato : 0.993415082399
Descartes Descartes : 1.0
Descartes Kant : 0.993687672486

Kant Aristotle : 0.990772543316
Kant Plato : 0.980461739495
Kant Descartes : 0.993687672486
Kant Kant : 1.0
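
One plausible reading of the near-identical scores above: by default, `Doc.similarity` is the cosine similarity between the average word vectors of the two documents, and over book-length texts those averages are dominated by words every author uses. The sketch below illustrates the effect with invented three-dimensional vectors; the values are assumptions for illustration, not spaCy vectors.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: 'common' stands for frequent words shared by both texts
common = [1.0, 1.0, 0.0]
topic_a = [0.0, 0.0, 1.0]
topic_b = [0.0, 1.0, 0.0]

# Each toy "document" is mostly common words plus one distinctive word
doc_1 = [common] * 9 + [topic_a]
doc_2 = [common] * 9 + [topic_b]

distinct = cosine(topic_a, topic_b)                         # distinctive words alone
averaged = cosine(mean_vector(doc_1), mean_vector(doc_2))   # whole-document averages
```

Here the distinctive words are orthogonal, yet the averaged documents score above 0.99, which suggests that paragraph-level comparisons or stop-word filtering may separate authors better than whole-book averages.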



**spaCy Tokens**

In [10]:
for token in doc_a:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

We -PRON- PRON PRP nsubjpass Xx True False
must must VERB MD aux xxxx True False
not not ADV RB neg xxx True False
be be VERB VB auxpass xx True False
disturbed disturb VERB VBN ROOT xxxx True False
because because ADP IN mark xxxx True False
it -PRON- PRON PRP nsubjpass xx True False
may may VERB MD aux xxx True False
be be VERB VB auxpass xx True False
argued argue VERB VBN advcl xxxx True False
that that ADP IN mark xxxx True False
, , PUNCT , punct , False False
though though ADP IN mark xxxx True False
proposing propose VERB VBG advcl xxxx True False
to to PART TO aux xx True False
discuss discuss VERB VB xcomp xxxx True False
the the DET DT det xxx True False
category category NOUN NN dobj xxxx True False
of of ADP IN prep xx True False
quality quality NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
we -PRON- PRON PRP nsubj xx True False
have have VERB VBP aux xxxx True False
included include VERB VBN ccomp xxxx True False
in in ADP IN prep xx True False
it -PRON- PR

**spaCy Noun Chunks**

In [11]:
for chunk in doc_d.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

I I nsubj wrong
the truth truth dobj wrong
it it nsubj certain
it it nsubj is
I I nsubj distinguish
two kinds kinds dobj distinguish
certitude certitude pobj of
a certainty certainty nsubj sufficient
the conduct conduct pobj for
life life pobj of
we we nsubj look
the absolute power power pobj to
God God pobj of
what what nsubj is
who who nsubj visited
Rome Rome dobj visited
it it nsubj is
a city city attr is
Italy Italy pobj of
it it nsubj be
whom whom pobj from
they they nsubj got
their information information nsubjpass deceived
any one one nsubj bethinks
a letter letter dobj decipher
Latin characters characters pobj in
regular order order pobj in
himself himself dobj bethinks
a B B dobj reading
an A A nsubjpass found
a B B attr is
place place pobj in
each letter letter pobj of
the one one dobj substituting
it it dobj follows
the order order pobj in
the alphabet alphabet pobj of
he he nsubj finds
certain Latin words words attr are
he he nsubj doubt
the true meaning meaning nsubjpass c

**Entities**

In [12]:
for ent in doc_d.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

two 114 117 CARDINAL
first 142 147 ORDINAL
Rome 340 344 GPE
Italy 379 384 GPE
Latin 526 531 LANGUAGE
Latin 825 830 LANGUAGE


**Sentences**

In [13]:
for sent in doc_k.sents:
    print(sent.text)

All rational cognition is, again, based either on conceptions, or on the construction of conceptions.
The former is termed philosophical, the latter mathematical.
I have already shown the essential difference of these two methods of cognition in the first chapter.
A cognition may be objectively philosophical and subjectively historical—as is the case with the majority of scholars and those who cannot look beyond the limits of their system, and who remain in a state of pupilage all their lives.
But it is remarkable that mathematical knowledge, when committed to memory, is valid, from the subjective point of view, as rational knowledge also, and that the same distinction cannot be drawn here as in the case of philosophical cognition.
The reason is that the only way of arriving at this knowledge is through the essential principles of reason, and thus it is always certain and indisputable; because reason is employed in concreto—but at the same time a priori—that is, in pure and, therefore,

# TF-IDF Vectorizing

1. Fit TF-IDF on all corpora combined
2. Create a DataFrame of the transformed values for each corpus
3. Combine the DataFrames and add the author label used for classification
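
The steps above can be sketched without any libraries. The hypothetical helper `tfidf_rows` below uses the smoothed-idf, L2-normalized weighting that `TfidfVectorizer` applies by default, but it tokenizes on whitespace only, so its numbers will not match scikit-learn exactly. The three toy strings and author names are assumptions standing in for the lemma text of each corpus.

```python
import math
from collections import Counter


def tfidf_rows(texts):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*counts))
    n_docs = len(texts)
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + sum(t in c for c in counts))) + 1
           for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Step 1: fit on all corpora together (toy strings stand in for the lemma text)
texts = ['knowledge of grammar', 'knowledge of virtue', 'truth of certitude']
names = ['Aristotle', 'Plato', 'Descartes']
vocab, rows = tfidf_rows(texts)

# Steps 2-3: one labeled record per corpus
records = [dict(zip(vocab, row), AUTHOR=name) for row, name in zip(rows, names)]
```

In the first record, `grammar` (found in one text) receives a larger weight than `of` (found in all three), which is the property that makes TF-IDF useful for telling authors apart.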

In [178]:
docs = [doc_a, doc_p, doc_d, doc_k]
doc_names = ['Aristotle', 'Plato', 'Descartes', 'Kant', 'No Author']

**Filter and combine lemma tokens for TF-IDF fitting**

In [179]:
lem_tokens_all = []  # lemmas from every corpus in one flat list
lem_tokens_ind = []  # one space-joined lemma string per corpus
for doc in docs:
    # keep lemmas that are not pronoun placeholders, punctuation, or single characters
    lemmas = [token.lemma_ for token in doc
              if token.lemma_ != '-PRON-'
              and token.pos_ != 'PUNCT'
              and len(token.lemma_) > 1]
    lem_tokens_all.extend(lemmas)
    lem_tokens_ind.append(' '.join(lemmas))


In [180]:
print(len(lem_tokens_all))
print(len(lem_tokens_ind))
for i in range(4):
    print(len(lem_tokens_ind[i]))

342217
4
74093
553000
156768
1152929
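
Note that the first figure above is a token count (`lem_tokens_all` is a list of lemmas), while the four per-corpus figures are character counts, because `len()` of a string counts characters. A short sketch of the difference, using a sample string in place of `lem_tokens_ind[i]`:

```python
# Sample lemma string; stands in for one entry of lem_tokens_ind
sample = 'must not be disturb because may be argue'

n_chars = len(sample)            # characters, including spaces
n_tokens = len(sample.split())   # whitespace-separated lemma tokens
```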


In [31]:
lem_tokens_ind[0]

'must not be disturb because may be argue that though propose to discuss the category of quality have include in many relative term do say that habit and disposition be relative in practically all such case the genus be relative the individual not thus knowledge as genus be explain by reference to something else for mean knowledge of something but particular branch of knowledge be not thus explain the knowledge of grammar be not relative to anything external nor be the knowledge of music but these if relative at all be relative only in virtue of genus thus grammar be say be the knowledge of something not the grammar of something similarly music be the knowledge of something not the music of something'

**Fit TF-IDF**

In [181]:
tfidf = TfidfVectorizer()

In [182]:
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
corp_df = pd.DataFrame(tfidf.fit_transform(lem_tokens_ind).toarray(), columns=tfidf.get_feature_names_out())


In [183]:
corp_df

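| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 1596 | 16 | ... | youngster | yours | yourself | youth | youthful | zeal | zealous | zeno | zero | zeus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.000628 | 0.0 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000108 | 8.7e-05 | 0.001234 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000137 | 0.000137 | 0.000432 | 0.005249 | 0.000216 | 0.000175 | 0.0 | 0.0 | 0.0 | 0.002193 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000522 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000999 | 0.0 | 0.000666 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 5.6e-05 | 0.000182 | 0.0 | 0.000112 | 0.000337 | 0.000169 | 0.000112 | 0.000169 | 0.0 | 0.000285 | ... | 0.0 | 0.0 | 0.000225 | 0.000182 | 5.6e-05 | 4.6e-05 | 7.1e-05 | 7.1e-05 | 0.000285 | 0.0 |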


In [184]:
# append an all-zero row to stand for text that matches no author
corp_df.loc[corp_df.shape[0], :] = 0

In [185]:
corp_df['AUTHOR'] = doc_names

In [186]:
corp_df

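| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 1596 | 16 | ... | yours | yourself | youth | youthful | zeal | zealous | zeno | zero | zeus | AUTHOR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.000628 | 0.0 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Aristotle |
| 1 | 0.000108 | 8.7e-05 | 0.001234 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000137 | 0.000432 | 0.005249 | 0.000216 | 0.000175 | 0.0 | 0.0 | 0.0 | 0.002193 | Plato |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000522 | 0.0 | ... | 0.0 | 0.0 | 0.000999 | 0.0 | 0.000666 | 0.0 | 0.0 | 0.0 | 0.0 | Descartes |
| 3 | 5.6e-05 | 0.000182 | 0.0 | 0.000112 | 0.000337 | 0.000169 | 0.000112 | 0.000169 | 0.0 | 0.000285 | ... | 0.0 | 0.000225 | 0.000182 | 5.6e-05 | 4.6e-05 | 7.1e-05 | 7.1e-05 | 0.000285 | 0.0 | Kant |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | No Author |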


# X and y definitions

In [187]:
X = corp_df.iloc[:, :-1]

In [188]:
X

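| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 1596 | 16 | ... | youngster | yours | yourself | youth | youthful | zeal | zealous | zeno | zero | zeus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.000628 | 0.0 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.000775 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000108 | 8.7e-05 | 0.001234 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000137 | 0.000137 | 0.000432 | 0.005249 | 0.000216 | 0.000175 | 0.0 | 0.0 | 0.0 | 0.002193 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000522 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.000999 | 0.0 | 0.000666 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 5.6e-05 | 0.000182 | 0.0 | 0.000112 | 0.000337 | 0.000169 | 0.000112 | 0.000169 | 0.0 | 0.000285 | ... | 0.0 | 0.0 | 0.000225 | 0.000182 | 5.6e-05 | 4.6e-05 | 7.1e-05 | 7.1e-05 | 0.000285 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |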


In [189]:
X.shape

(5, 8198)

In [190]:
y = corp_df.iloc[:, -1:]

In [191]:
y_cat = pd.get_dummies(y).values

In [192]:
y_cat

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=uint8)
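
The columns of `y_cat` follow the alphabetical order of the author labels, which is how `pd.get_dummies` orders its output columns, not the order of `doc_names`. That is why Plato (row 1) has its 1 in the last column. A library-free sketch that reproduces the same matrix, with the variable names `classes`, `col_of`, and `one_hot` chosen here for illustration:

```python
doc_names = ['Aristotle', 'Plato', 'Descartes', 'Kant', 'No Author']

# Column order: sorted unique labels
classes = sorted(set(doc_names))
col_of = {name: i for i, name in enumerate(classes)}

# One row per document, with a 1 in that document's class column
one_hot = [[int(col_of[name] == j) for j in range(len(classes))]
           for name in doc_names]
```

Keeping `classes` around makes it possible to map a predicted column index back to an author name later.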

# Keras Models

In [193]:
import keras
import tensorflow as tf

In [194]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical

In [205]:
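model = Sequential()
# first hidden layer: width is X.shape[0] (the number of rows, 5), not a tuned size
model.add(Dense(X.shape[0], input_dim=X.shape[1], activation='relu'))
model.add(Dropout(.2))
model.add(Dense(20, activation='relu'))
# output layer: one unit per class in y_cat, softmax turns scores into probabilities
model.add(Dense(5, activation='softmax'))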
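
The final layer applies softmax, so the five outputs for each row are non-negative and sum to 1, and a model with no preference among the classes returns about 0.2 for each. A minimal sketch of the function itself, with made-up input scores:

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


uniform = softmax([0.0] * 5)                   # no preference among 5 classes
peaked = softmax([2.0, 0.0, 0.0, 0.0, 0.0])    # first class favored
```

Because softmax makes the classes compete, it differs from the independent per-class sigmoid scores mentioned in the modeling ideas; which of the two fits the "level of similarity" goal better is a design choice.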

In [206]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [207]:
y_cat

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=uint8)

In [208]:
model.fit(X, y_cat, epochs=100, batch_size=10)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1a93aec588>

In [209]:
model.predict(X)

array([[ 0.23568825,  0.17034306,  0.22159944,  0.19413988,  0.17822932],
       [ 0.16509713,  0.17328736,  0.16697466,  0.20128256,  0.2933583 ],
       [ 0.1489836 ,  0.25180736,  0.24150878,  0.16304421,  0.19465606],
       [ 0.1617444 ,  0.17601646,  0.41114497,  0.13117988,  0.11991429],
       [ 0.22483061,  0.18815619,  0.1821624 ,  0.2051661 ,  0.19968472]], dtype=float32)

In [210]:
y_cat

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=uint8)
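
To compare the predictions with the labels, take the highest-probability column in each row. The sketch below copies the two arrays printed above into plain lists; the helper `argmax` is defined here for illustration. Since the model is scored on the same five rows it was trained on, the result says nothing about generalization.

```python
# Values copied from the model.predict(X) output above
preds = [
    [0.23568825, 0.17034306, 0.22159944, 0.19413988, 0.17822932],
    [0.16509713, 0.17328736, 0.16697466, 0.20128256, 0.2933583],
    [0.1489836, 0.25180736, 0.24150878, 0.16304421, 0.19465606],
    [0.1617444, 0.17601646, 0.41114497, 0.13117988, 0.11991429],
    [0.22483061, 0.18815619, 0.1821624, 0.2051661, 0.19968472],
]
# Values copied from y_cat above
y_true = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


pred_classes = [argmax(r) for r in preds]
true_classes = [argmax(r) for r in y_true]
accuracy = sum(p == t for p, t in zip(pred_classes, true_classes)) / len(preds)
```

Four of the five rows match; the all-zero "No Author" row is assigned to column 0, and most winning probabilities sit only slightly above the 0.2 baseline.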