# Capstone Project: Part 2 - Problem Statement, EDA, Datasets

## Thomas Ludlow, DSI-NY-6 1/16/19

For Part 1 of our Capstone Project preparation, we shared 3 project ideas in a lightning talk format.  My initial ideas included:
 - Music feature extraction
 - Spotify music feature manipulation
 - NYC taxi/ride-sharing optimization
 
After discussion with DSI instructors and reflection on potential topics, I have decided to pursue a project not included among my initial ideas:

**Using Natural Language Processing (NLP) to identify similarities in text passages to historical texts and literature**

I find this area of machine learning fascinating, and I believe that this project will be both scalable and adaptable enough to complete within the roughly 5 weeks remaining in the course.

### Problem Statement

**I will use Natural Language Processing (NLP) tools such as Python's NLTK to "factorize" a passage of text by identifying and quantifying similarities between the passage and historical texts and literature.  Starting with sentences, then expanding to paragraphs and longer passages, I will identify grammatical patterns and word similarities and measure the degree of stylistic and content alignment with historical/philosophical texts and classic literature, available online at Project Gutenberg.**

### Methods and Models

Initial data processing approaches will include:
 - Tokenizing
 - Stemming
 - Lemmatizing
 - Part of Speech Tagging
 - Dependency Parsing
 - n-grams, with n from 2 to 10
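
As a rough illustration of the tokenizing and n-gram steps listed above, here is a minimal sketch that uses only the Python standard library. The regex tokenizer and the helper names `tokenize`, `ngrams`, and `ngram_counts` are assumptions made for this example, not part of the project code; in practice NLTK's `word_tokenize` and `nltk.util.ngrams` would likely take their place.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n_min=2, n_max=10):
    """Count n-grams for each n from n_min to n_max (inclusive)."""
    tokens = tokenize(text)
    return {n: Counter(ngrams(tokens, n)) for n in range(n_min, n_max + 1)}

# Short sample taken from the Plato excerpt used later in this notebook
sample = "Men of my age flock together; we are birds of a feather"
counts = ngram_counts(sample, n_min=2, n_max=3)
print(counts[2].most_common(3))
```

Counting n-grams per value of n keeps short and long patterns separate, so they can be weighted differently when passages are compared.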
 
Models and tools will include:
 - Bayesian Prediction
 - Natural Language Tool Kit (NLTK) library
 - WordNet lexical database (accessed through NLTK)
 - Word2Vec, with/without Wikipedia training
 - TensorFlow
 - Vector Space Models
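
To make the vector space idea concrete, the sketch below represents each passage as a bag-of-words count vector and compares two passages by cosine similarity. This is only one simple instance of a vector space model, chosen for illustration; the function names `bag_of_words` and `cosine_similarity` are hypothetical, and the project itself may rely on Word2Vec embeddings or another representation.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercase word in the text to its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two phrases adapted from the Aristotle excerpt, plus one from Plato
first = bag_of_words("primary substance is neither present in a subject")
second = bag_of_words("substance is never present in a subject")
third = bag_of_words("we are birds of a feather")

print(cosine_similarity(first, second))
print(cosine_similarity(first, third))
```

Raw counts ignore word order and grammar, so a measure like this captures content overlap rather than style; the part-of-speech and dependency features listed earlier would be needed for the stylistic side.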
 
The approach for this project will begin with a single training corpus from the Project Gutenberg Top 100 list, on which I will build a neural network model.  Once a method of identifying stylistic and word similarity is established and calibrated, additional works will be added so that similarity results can be compared across different training texts.

Success will be judged by comparing model output across test sets of increasing size that contain identical content, highly similar content, somewhat similar content, and totally dissimilar content.  When a second model is added, the same test sets will be run against it, and the degree of difference between the two models' results will be compared.  These results will indicate the quality of the factorizing model.
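
One way to read the evaluation plan above is as an ordering check: a useful similarity score should not increase as the test content moves from identical to totally dissimilar. The sketch below illustrates that check with a stand-in Jaccard word-overlap score; the tier texts, the `jaccard` stand-in, and the helper names are all assumptions for illustration, and the real model's scoring function would be passed in instead.

```python
import re

def jaccard(a, b):
    """Stand-in similarity: shared unique words divided by total unique words."""
    words_a = set(re.findall(r"[a-z']+", a.lower()))
    words_b = set(re.findall(r"[a-z']+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def scores_by_tier(reference, tiers, similarity):
    """Score each (label, text) tier against the reference passage."""
    return [(label, similarity(reference, text)) for label, text in tiers]

def is_non_increasing(scores):
    """True if each tier scores no higher than the tier before it."""
    values = [score for _, score in scores]
    return all(x >= y for x, y in zip(values, values[1:]))

reference = "old age is the cause of many evils"
tiers = [
    ("identical", "old age is the cause of many evils"),
    ("highly similar", "old age is the cause of some evils"),
    ("somewhat similar", "age is a cause"),
    ("totally dissimilar", "substance is never present in a subject"),
]

scores = scores_by_tier(reference, tiers, jaccard)
for label, score in scores:
    print(f"{label}: {score:.3f}")
print("ordering holds:", is_non_increasing(scores))
```

Because `scores_by_tier` accepts any two-argument similarity function, the same tiers can be rerun against a second model to compare how sharply each one separates the categories.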


### Data and Sources

Project Gutenberg
 - Philosophy texts: https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)
 - Classic literature: https://www.gutenberg.org/browse/scores/top
 
I will start with "The Republic" by Plato, and "The Categories" by Aristotle.

Textfiles.com: http://www.textfiles.com/etext/FICTION/

### Risks and Assumptions

The main risk associated with this project is that its scale will be insufficient to achieve the goal of multiple models comparing similarity to multiple works.  Another risk is that the measured similarities will be too small to indicate meaningful relationships between texts.  

I am assuming that neural network models will be able to identify these types of connections between works, and that they will be able to perform at scale.  

### Training Corpus Samples, EDA

In [2]:
import nltk
import pandas as pd

In [11]:
plato_file = open('../data/plato_republic.txt','r')
aristotle_file = open('../data/aristotle_categories.txt','r')

In [12]:
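plato = plato_file.readlines()
aristotle = aristotle_file.readlines()
# Close the file handles now that all lines are in memory
plato_file.close()
aristotle_file.close()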

In [13]:
len(plato)

24693

In [14]:
len(aristotle)

1861

In [17]:
plato_lines = plato[8494:24328] # contents of work

In [28]:
aristotle_lines = aristotle[37:1492] # contents of work
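
The hard-coded slice indices above were found by inspection and will differ for every file. As a possible first pass for additional works, the sketch below trims the license header and footer by searching for the `*** START OF` and `*** END OF` marker lines. This assumes the file carries those markers, which may not hold for every Project Gutenberg text; the function name is hypothetical. It also will not remove a translator's introduction or index, so manual offsets like the ones above may still be needed.

```python
def strip_gutenberg_boilerplate(lines):
    """Return the lines between the START and END marker lines, if present.

    Falls back to the full list when no markers are found.
    """
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        marker = line.strip().upper()
        if marker.startswith("*** START OF"):
            start = i + 1  # body begins on the line after the marker
        elif marker.startswith("*** END OF"):
            end = i        # body stops just before the marker
            break
    return lines[start:end]

# Tiny made-up file layout used only to demonstrate the helper
demo = [
    "Header text\n",
    "*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n",
    "Line one\n",
    "Line two\n",
    "*** END OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n",
    "License text\n",
]
body = strip_gutenberg_boilerplate(demo)
print(body)
```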

In [27]:
plato_par = plato_lines[116:141]

In [30]:
plato_par

['I will tell you, Socrates, he said, what my own feeling is. Men of my\n',
 'age flock together; we are birds of a feather, as the old proverb says;\n',
 'and at our meetings the tale of my acquaintance commonly is--I cannot\n',
 'eat, I cannot drink; the pleasures of youth and love are fled away:\n',
 'there was a good time once, but now that is gone, and life is no longer\n',
 'life. Some complain of the slights which are put upon them by relations,\n',
 'and they will tell you sadly of how many evils their old age is the\n',
 'cause. But to me, Socrates, these complainers seem to blame that which\n',
 'is not really in fault. For if old age were the cause, I too being old,\n',
 'and every other old man, would have felt as they do. But this is not\n',
 'my own experience, nor that of others whom I have known. How well I\n',
 'remember the aged poet Sophocles, when in answer to the question, How\n',
 'does love suit with age, Sophocles,--are you still the man you were?\n',
 'Peace, h

In [36]:
aristotle_par = aristotle_lines[208:222]
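
Because each sampled paragraph is a list of hard-wrapped lines, it will probably need to be joined into a single string before sentence-level analysis. The sketch below shows one way to do that with the standard library, using the first three lines of the Aristotle paragraph as input. The helper names and the punctuation-based sentence splitter are assumptions for illustration; NLTK's `sent_tokenize` would likely be more robust on real text.

```python
import re
from collections import Counter

def lines_to_passage(lines):
    """Join wrapped lines into one string with single spaces and no newlines."""
    return " ".join(" ".join(lines).split())

def split_sentences(passage):
    """Naive splitter: break after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", passage) if s]

demo_lines = [
    'It is a common characteristic of all substance that it is never present\n',
    'in a subject. For primary substance is neither present in a subject nor\n',
    'predicated of a subject; while, with regard to secondary substances, it\n',
]

passage = lines_to_passage(demo_lines)
sentences = split_sentences(passage)
word_counts = Counter(re.findall(r"[a-z']+", passage.lower()))

print(len(sentences))
print(word_counts.most_common(5))
```

The second sentence here is incomplete only because the demo input stops after three lines; running the helpers on the full `aristotle_par` or `plato_par` list would cover the whole paragraph.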

In [37]:
aristotle_par

['It is a common characteristic of all substance that it is never present\n',
 'in a subject. For primary substance is neither present in a subject nor\n',
 'predicated of a subject; while, with regard to secondary substances, it\n',
 'is clear from the following arguments (apart from others) that they are\n',
 "not present in a subject. For 'man' is predicated of the individual\n",
 'man, but is not present in any subject: for manhood is not present in\n',
 "the individual man. In the same way, 'animal' is also predicated of the\n",
 'individual man, but is not present in him. Again, when a thing is\n',
 'present in a subject, though the name may quite well be applied to that\n',
 'in which it is present, the definition cannot be applied. Yet of\n',
 'secondary substances, not only the name, but also the definition,\n',
 'applies to the subject: we should use both the definition of the\n',
 'species and that of the genus with reference to the individual man.\n',
 'Thus substance canno