<h1 align='center'>It Starts with a Humanistic Research Question...</h1>
<img src='Long, So 263, Fig 8.png' width="66%" height="66%">

# Literary Patterns (Probably)
<ul><li>Preview</li>
<li>Review</li>
<li>Import Corpus</li>
<li>Pre-Processing</li>
<ul><li>Tokenize</li>
<li>Stop Words</li>
<li>Lemmatize</li>
<li>Most Frequent Terms</li></ul>
<li>Classification</li>
<ul><li>Featurize</li>
<li>Training</li>
<li>Prediction</li>
<li>Extra: Cross-Validation</li></ul>
<li>Literary Distinction</li>
</ul>

# 0. Preview

In [None]:
import nltk
nltk.download(['punkt','wordnet', 'stopwords'])

In [None]:
# Get texts of interest that belong to identifiably different categories

unladen_swallow = 'high air-speed velocity'
swallow_grasping_coconut = 'low air-speed velocity'

# Transform them into the format NLTK expects

unladen_features_tagged = ({'high':True, 'air-speed': True, 'velocity': True}, 'unladen')
coconut_features_tagged = ({'low':True, 'air-speed': True, 'velocity': True}, 'coconut')

# Train a classifier to learn distinguishing features

classifier = nltk.NaiveBayesClassifier.train([unladen_features_tagged, coconut_features_tagged])

In [None]:
# It's a simple question of weight ratios!
# A five ounce bird could not carry a one pound coconut. 

unknown_swallow = "high velocity"
unknown_features = { 'high': True, 'velocity':True}

classifier.classify(unknown_features)

# 1. Review

In [None]:
# Read Moby Dick
moby_string = open('Melville - Moby Dick.txt','r').read()

In [None]:
# Inspect the text
moby_string

In [None]:
# Make the text lower case
moby_lower = moby_string.lower()

In [None]:
# Tokenize Moby Dick
from nltk import word_tokenize
moby_tokens = word_tokenize(moby_lower)

In [None]:
# Check out the tokens
moby_tokens

In [None]:
# Just how long is Moby Dick anyway?
len(moby_tokens)

In [None]:
# Create a dictionary that counts token frequencies
from collections import Counter
moby_dict = Counter(moby_tokens)

In [None]:
# Dictionaries pair keys with values
moby_dict

In [None]:
# Report the ten most common tokens in the novel
moby_dict.most_common(10)

In [None]:
# Get the frequency of a specific word
moby_dict['whale']

In [None]:
# Create a list comprehension, including an 'if' statement
just_whales = [token for token in moby_tokens if token=='whale']

In [None]:
# Hast seen the White Whale?
just_whales

# 2. Import Corpus

## Operating System Interface!

Up to this point, we have only worked with one text at a time. It is simple enough to read in a single plaintext file, but we often find ourselves with many files residing in a folder on our hard drive. In order to access these, we will have to instruct Python to navigate our computer and access the files sequentially.

Even though it sounds banal, this is the moment your computer ceases to be an appliance and transforms into a tool. The <i>os</i> package allows Python to speak with the rest of your computer's systems and file storage. You now have access to the files on your computer and can manipulate them using the code you have learned so far. With great power comes great responsibility!

For now, we will look at just one function from <i>os</i> that will return a list of the files in a given directory.
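As a minimal sketch of what such a listing looks like, the code below builds a throwaway folder with Python's <i>tempfile</i> module, so it does not depend on any files on your own machine, and then lists it. The file names (`review_a.txt`, `review_b.txt`, `.hidden_file`) are invented for illustration. The sketch also uses `os.path.join`, which glues a folder and a file name together and is one alternative to adding the strings by hand.

```python
import os
import tempfile

# Build a temporary folder holding two hypothetical review files
# and one hidden file of the kind some operating systems create.
with tempfile.TemporaryDirectory() as folder:
    for name in ['review_a.txt', 'review_b.txt', '.hidden_file']:
        with open(os.path.join(folder, name), 'w') as handle:
            handle.write('contents of ' + name)

    # os.listdir returns file names only, in no guaranteed order,
    # so we sort them to get a stable listing
    listed = sorted(os.listdir(folder))

    # Keep just the plaintext files and rebuild their full paths
    text_names = [name for name in listed if name.endswith('.txt')]
    text_paths = [os.path.join(folder, name) for name in text_names]

    # Read each file while the temporary folder still exists
    texts = []
    for path in text_paths:
        with open(path) as handle:
            texts.append(handle.read())

print(listed)
print(texts)
```

Note that the listing contains bare file names, not full paths. To open a file we still have to say which folder it lives in.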

In [None]:
import os

In [None]:
# Report the files in the current folder
os.listdir()

In [None]:
# Follow one of the reported folders
os.listdir('movie_reviews')

In [None]:
# And follow deeper
os.listdir('movie_reviews/negative')

In [None]:
# Assign that list to a variable
negative_files = os.listdir('movie_reviews/negative')

In [None]:
# Inspect first element in the list
negative_files[0]

In [None]:
## EX. How many reviews are there in the 'positive' folder?
##     How many in the 'negative' one?

negative_path = "movie_reviews/negative/"
positive_path = "movie_reviews/positive/"

## CHALLENGE: Find a list of files and folders on your desktop.

## Corpus

Although Long and So's study of modernist haiku motivates this lesson, a substantial portion of their corpus remains under copyright, so they have not made it publicly available. Instead, we will apply their methods to a toy corpus distributed with NLTK: movie reviews. These have been divided into positive and negative categories, with one thousand of each.

In essence, our task will be to learn the vocabulary of positive and negative evaluation, rather than the poetic genre of haiku.

In [None]:
# Open the first file from 'negative_files'
open('movie_reviews/negative/cv000_29416.txt').read()

In [None]:
# When opening others, filenames change but the path doesn't!
negative_path = "movie_reviews/negative/"
open(negative_path+'cv000_29416.txt').read()

In [None]:
# Read all files and assign to a variable
negative_reviews = [open(negative_path+name,'r').read() for name in negative_files]

In [None]:
# NOTE: If you are using OSX, your operating system may sometimes
# include hidden files in your folders that confuse Python.

# If you get an error while running the above line, try including an 'if' condition
# in your list comprehension to prevent Python from tripping over these.

# For example:
negative_reviews = [open(negative_path+name,'r').read() for name in negative_files if name.endswith('.txt')]

In [None]:
# Inspect

negative_reviews[0]

In [None]:
# Repeat process for positive reviews

positive_path = 'movie_reviews/positive/'
positive_files = os.listdir(positive_path)
positive_reviews = [open(positive_path+name,'r').read() for name in positive_files]

In [None]:
# Inspect first element in list
positive_reviews[0]

In [None]:
## EX. How long is the list of positive movie reviews? Negative reviews?
##     Do these match the number of files you had observed in the folders earlier?

# 3. Pre-Processing

The pre-processing phase of our workflow transforms the strings that Python has read from the plaintext files into useful sets of features. Not only will we tokenize the texts, as we have previously, but we will perform three further steps described by Long and So: <i>stop word</i> removal, <i>lemmatization</i> of nouns, and low-frequency word removal.

Although pre-processing feels like a nitty-gritty task, it is important to recognize that how we pre-process our texts depends on the questions we are trying to answer. Not every project lemmatizes or stems its vocabulary. Perhaps we can imagine research questions in which the grammatical functions indicated by word endings might be useful.

## Tokenize

In [None]:
from nltk import word_tokenize

In [None]:
# Tokenize first negative review

word_tokenize(negative_reviews[0])

In [None]:
# Tokenize our sets of reviews; tokens remain grouped by review

negative_tokenized = [word_tokenize(review) for review in negative_reviews]
positive_tokenized = [word_tokenize(review) for review in positive_reviews]

In [None]:
## Q.  The texts in the movie review corpus are already in lower case; however, many
##     texts found in the wild are not. How would you change the code in the cell above
##     to produce a corpus of tokenized and lower-case text?

## EX. How many tokens are there in the first negative movie review? positive?

## CHALLENGE: How many tokens are there on average in each negative movie review?

## Stop Words

Stop words, sometimes referred to as <i>function words</i>, include articles, prepositions, pronouns, and conjunctions among others. Although their frequencies encode information about textual features like authorship, they do not convey semantic meanings and are often removed before analysis.
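To make the idea concrete before bringing in NLTK, here is a small sketch that filters one sentence against a stop word list. The list `toy_stop_words` is a hand-picked stand-in invented for this example, not NLTK's list, and it is stored as a set because sets are quick to check for membership.

```python
# A hand-picked stand-in for a real stop word list (an assumption for this sketch)
toy_stop_words = {'what', 'is', 'the', 'of', 'an'}

tokens = ['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow']

# Split the tokens into those we keep and those we drop
content_words = [token for token in tokens if token not in toy_stop_words]
removed_words = [token for token in tokens if token in toy_stop_words]

print(content_words)
print(removed_words)
```

More than half of this short sentence consists of stop words, which gives a sense of how much of a text this step can remove.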

In [None]:
# Import our list of stop words

from nltk.corpus import stopwords

In [None]:
# Pull up NLTK's list of English-language stop words

stopwords.words('english')

In [None]:
# How many stop words are in the list?

len(stopwords.words('english'))

In [None]:
# NLTK has stopwords for many Western languages

stopwords.words('spanish')

In [None]:
tokenized_sentence = ['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow']

In [None]:
# Remove stopwords from tokenized review

for word in tokenized_sentence:
    if word not in stopwords.words('english'):
        print(word)

In [None]:
# As a list comprehension

[word for word in tokenized_sentence if word not in stopwords.words('english')]

In [None]:
# But what if we have more than one review at a time?

two_sentences = [['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow'],\
               ['what', 'do', 'you', 'mean', 'african', 'or', 'european']]

In [None]:
# Use a nested for-loop!

for sentence in two_sentences:
    for word in sentence:
        if word not in stopwords.words('english'):
            print(word)

In [None]:
# As a nested list comprehension: produces a list of lists!

[[word for word in sentence if word not in stopwords.words('english')] for sentence in two_sentences]

In [None]:
# The example sentences were short so it's hard to tell, but looking up
# whether a word is 'in' a list takes a pretty long time

# We can improve the speed of our task by converting the list to a set!

stopword_set = set(stopwords.words('english'))

In [None]:
# Inspect

stopword_set

In [None]:
# And we can remove the stop words from our movie reviews

negative_no_stops = [[word for word in review if word not in stopword_set] for review in negative_tokenized]
positive_no_stops = [[word for word in review if word not in stopword_set] for review in positive_tokenized]

In [None]:
## Q.  Stop words are typically the most frequent words in a language, yet do not convey semantic meaning.
##     Does this make sense based on the words in NLTK's list of English stop words?
##     What about other languages with which you are familiar?

stopword_languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian',\
                      'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

## EX. How many tokens did we remove from the first negative review in total? What percent were removed?

## EX. Try rewriting the list comprehensions in the above cell as plain for-loops.

## CHALLENGE.  Stop words are often instrumental in language detection for unknown texts.
##             How might you write a program to do this?

## Lemmatize

Lemmatization transforms an inflected word into its dictionary form, or <i>lemma</i>. For nouns, this often converts plural form to singular. For verbs, this often produces the infinitive. Long and So lemmatize only nouns for their project -- which is just as well because NLTK's lemmatization function assumes that all words are nouns by default and leaves most non-nouns untouched!

<h3>WordNet Abbreviations</h3>
<table align='left'>
<tr><td>Noun</td><td>'n'</td><td>wordnet.NOUN</td></tr>
<tr><td>Verb</td><td>'v'</td><td>wordnet.VERB</td></tr>
<tr><td>Adjective</td><td>'a'</td><td>wordnet.ADJ</td></tr>
<tr><td>Adverb</td><td>'r'</td><td>wordnet.ADV</td></tr>
</table>
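NLTK's part-of-speech tagger reports Penn Treebank tags such as 'NNS' or 'VBD' rather than the one-letter WordNet abbreviations in the table. One possible way to bridge the two is sketched below. The helper name `penn_to_wordnet` is hypothetical, the mapping is deliberately coarse, and the tags in `hand_tagged` were typed by hand for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet code (e.g. 'v')."""
    if penn_tag.startswith('JJ'):
        return 'a'   # adjective
    if penn_tag.startswith('VB'):
        return 'v'   # verb
    if penn_tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # noun, which is also our fallback for every other tag

# Hand-written (word, tag) pairs standing in for real tagger output
hand_tagged = [('danger', 'NN'), ('reared', 'VBD'), ('ugly', 'JJ'), ('bravely', 'RB')]

wordnet_codes = [(word, penn_to_wordnet(tag)) for word, tag in hand_tagged]
print(wordnet_codes)
```

Each resulting code could then be handed to the lemmatizer through its `pos` argument. Falling back to 'n' mirrors the lemmatizer's own default of treating words as nouns.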

In [None]:
# Import NLTK's lemmatizer

from nltk.stem import WordNetLemmatizer

In [None]:
# Initialize the lemmatizer and assign it to a variable

wnl = WordNetLemmatizer()

In [None]:
# The lemmatization is called as a method: .lemmatize()

wnl.lemmatize('dogs')

In [None]:
# Doesn't seem to work properly on 'running' because 'wnl' assumes it is seeing a noun

wnl.lemmatize('running')

In [None]:
# Fortunately we can pass in a label for a different part of speech
# Perhaps one might use this along with a POS tagger!

wnl.lemmatize('running', pos='v')

In [None]:
# A new list of tokens

famous_sketch = ['ministry','of', 'silly', 'walks']

In [None]:
# Use a for-loop to lemmatize each word sequentially

for word in famous_sketch:
    print(wnl.lemmatize(word))

In [None]:
# As a list comprehension!

[wnl.lemmatize(word) for word in famous_sketch]

In [None]:
# Now, two lists of tokens

two_sketches = [['ministry','of', 'silly', 'walks'],['musical','mice']]

In [None]:
# As a nested list comprehension!

[[wnl.lemmatize(word) for word in sketch] for sketch in two_sketches]

In [None]:
# Let's lemmatize the nouns in our movie reviews

negative_lemmatized = [[wnl.lemmatize(word) for word in review] for review in negative_no_stops]
positive_lemmatized = [[wnl.lemmatize(word) for word in review] for review in positive_no_stops]

In [None]:
## EX. Lemmatize the list of plural nouns below

plural_nouns = ['parrots', 'witches', 'volcanoes', 'soliloquies', 'cherries', 'addenda', \
                'baths', 'knives', 'oxen', 'lice', 'brethren', 'alumni', 'alumnae', 'matrices']

## CHALLENGE: Use the part-of-speech tagger we looked at yesterday and include
##            a POS argument while lemmatizing the following sentence.

brave_sir_robin = "When danger reared its ugly head, he bravely turned his tail and fled!"

## Minimum Document Frequency

Intuitively, not all words in the corpus will convey the same amount of information about whether a movie review is positive or negative or whether a poem is a haiku. At the extreme, if a word appears in just a single text out of thousands, it doesn't tell us much either way about whether that word is associated with a category. By removing infrequent terms from our model, we can also save computational time.

When we measure document frequency, we do not need to know how many times a token appears in a text. We simply need to know which tokens appear in each. Python has an easy, built-in data type that tells us the unique elements appearing in a list: <i>set</i>. As a data type, a set is like a list but it does not retain information about the order of elements. Also, it is very efficient when we want to check 'if' a particular element is contained 'in' a group of words.

In order to count the document frequency for each word in the corpus, we will produce a set of unique words for each text. Then we will pull out each word from each set and put these into a single list. Words belonging to multiple sets will appear multiple times; words belonging to just a single set will appear only once. Finally, we will use the <i>Counter</i> to tally how many times each word appears in the term-document frequency list. 
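The three steps just described can be sketched on a tiny, made-up corpus before we apply them to the movie reviews. The texts in `toy_corpus` and the variable names are invented for this illustration.

```python
from collections import Counter

# Three very short, already-tokenized "texts"
toy_corpus = [
    ['rose', 'is', 'a', 'rose'],
    ['a', 'parrot', 'is', 'a', 'parrot'],
    ['silly', 'walks'],
]

# Step 1: reduce each text to its set of unique words
unique_per_text = [set(text) for text in toy_corpus]

# Step 2: pull the words out of every set into one long list
flattened = [word for unique_words in unique_per_text for word in unique_words]

# Step 3: count how many texts each word appears in
document_frequency = Counter(flattened)

print(document_frequency['rose'])   # appears twice in one text, so it counts once
print(document_frequency['a'])      # appears in two different texts

# Keep only the words found in more than one text
kept = sorted(word for word, count in document_frequency.items() if count > 1)
print(kept)
```

Notice that 'rose' has a document frequency of 1 even though it occurs twice, because both occurrences fall in the same text.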

### Set: Function/Data-Type

In [None]:
# Here's a text that reuses many of its words

gertrude_stein = ['rose','is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

In [None]:
# Return a 'set' of the unique tokens in the text
# A set is like a list but does not retain information about order

set(gertrude_stein)

In [None]:
# The size of the set is much smaller than the text itself

len(gertrude_stein), len(set(gertrude_stein))

In [None]:
# Produce a set of unique tokens contained in each review

negative_sets = [set(review) for review in negative_lemmatized]
positive_sets = [set(review) for review in positive_lemmatized]

### Document Frequency

In [None]:
# We want frequencies for the whole corpus, so we'll put our sets of words together now

all_sets = negative_sets + positive_sets

In [None]:
# Check that we got them all

len(negative_sets), len(positive_sets), len(all_sets)

In [None]:
# And we'll pull out the words from each review set
# This produces a list in which each token appears as many times as the number of documents to which it belongs

term_document_frequency_list = [word for review in all_sets for word in review]

In [None]:
# We want to count the number of times each token appears, so we'll use 'Counter'

from collections import Counter

tdf_counts = Counter(term_document_frequency_list)

In [None]:
# Inspect

tdf_counts

In [None]:
# Let's refresh ourselves on Counter's methods

tdf_counts.keys()

In [None]:
# How many reviews refer to the larger film industry?

tdf_counts['hollywood']

In [None]:
# Produce a list of words whose tdf-count is greater than 1

more_than_once = [key for key in tdf_counts.keys() if tdf_counts[key]>1]

In [None]:
# Inspect

more_than_once

In [None]:
# Just how many words did we remove from our vocabulary?

len(more_than_once), len(tdf_counts.keys())

In [None]:
# Now we can go back through our movie reviews and remove
# any words that were not included in 'more_than_once'

# As mentioned above, it is much more efficient to perform that task using a 'set'

more_than_once_set = set(more_than_once)

In [None]:
negative_min_df = [[word for word in review if word in more_than_once_set] for review in negative_sets]
positive_min_df = [[word for word in review if word in more_than_once_set] for review in positive_sets]

In [None]:
# Inspect
negative_min_df[0]

In [None]:
## EX.  Get a list of the 500 words with the highest document frequencies.
##      Only words should be contained in the list.

## CHALLENGE: Long and So remove low-frequency words based on document frequency.
##            Another popular method is to retain only high-frequency
##            words based on the raw number of times they appear in the corpus
##            (i.e. the sum of their counts in all texts).

##            Use this method to find the 500 most common terms in the movie review corpus.
##            Does this match with the list in the previous exercise?

# 4. Classification

### Featurize

For humans, reading a string of text is a relatively easy task, but for the computer to learn about language, text has to be represented in very particular ways. We refer to this as <i>featurization</i>: the transformation of a text into a quantitative feature set.

In order for the NLTK classifier to work, we have to represent each text as a set of True/False values: Is a given word from our high-frequency vocabulary present in this review? More specifically, these values will be contained in a <i>dictionary</i>, where each key is a vocabulary word and its value is whether or not it is present. In fact, we need only to include terms that are present, so all of our values will be True.

Once we have processed each text according to this rubric, we will then attach a label for the text's category ('positive'/'negative'). The classifier will use this to identify which features are associated with each.
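Put together, the featurization rubric might look like the sketch below. The function `featurize` and the small `toy_vocabulary` are hypothetical names made up for this example. The output has the same shape as the tagged features we built by hand in the Preview.

```python
def featurize(tokens, vocabulary, label):
    """Return a (features, label) pair in the format the classifier expects."""
    # Only words from the chosen vocabulary become features, each marked True
    features = {token: True for token in tokens if token in vocabulary}
    return (features, label)

# A made-up vocabulary; in the lesson this role is played by the
# words that survive the minimum document frequency cut
toy_vocabulary = {'high', 'low', 'velocity'}

tagged = featurize(['high', 'air-speed', 'velocity'], toy_vocabulary, 'unladen')
print(tagged)
```

The token 'air-speed' is dropped because it is not in the toy vocabulary, so it contributes nothing to the feature dictionary.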

In [None]:
# Let's revisit some earlier tokens

unladen_tokens = ['high','air-speed','velocity']

In [None]:
# In order to represent our tokens to the classifier, we need to
# associate them with a 'True' value in a dictionary

# Sure looks like a list comprehension!

{token:True for token in unladen_tokens}

In [None]:
# In general, we are not limited to True/False values to our dictionary entries
# For example:

{token:len(token) for token in unladen_tokens}

In [None]:
# Turn our reviews into dictionaries that indicate whether a word is present

negative_featurized = [{word:True for word in review} for review in negative_min_df]
positive_featurized = [{word:True for word in review} for review in positive_min_df]

In [None]:
# Inspect first review

negative_featurized[0]

In [None]:
# Attach a label to each review
negative_tagged = [(review,'negative') for review in negative_featurized]
positive_tagged = [(review,'positive') for review in positive_featurized]

In [None]:
# Inspect
negative_tagged[0]

In [None]:
# Combine these lists of featurized, tagged reviews
all_tagged = negative_tagged + positive_tagged

In [None]:
## NOTE. We'll spend more time with dictionaries tomorrow, so let's hand wave
##       it as a formatting step for now and move on to classification!

### Classification

We have selected an algorithm that specifically relies on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> to model relationships between textual features and categories in our corpus of movie reviews. (See link for more information about the method and its assumptions.)

Two ways that we learn about the model are its feature weights and predictions on new texts. The algorithm can explicitly report to us which direction each word leans category-wise and how strongly. Based on those weights, it makes further predictions about the valences of previously unseen movie reviews.
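As a rough sketch of the reasoning behind those weights, a naive Bayes model scores a category $c$ for a text whose present features are the words $w_1, \dots, w_n$ as shown below. The symbols are our own notation, and the formula sets aside implementation details such as how the probabilities are estimated and smoothed.

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The "naive" part is the product: each word is treated as independent of the others once the category is known. A word leans toward a category when $P(w_i \mid c)$ is much larger for that category than for the other.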

In [None]:
# Train the classifier and assign it to a variable

from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(all_tagged)

In [None]:
# Report feature information
classifier.show_most_informative_features(10)

In [None]:
# Roger Ebert's reviews of a couple family films

# "This movie made my heart glad. It is filled with innocence, hope, and good cheer."
#    -Roger Ebert, on ET

positive_ET = {'best': True, 'baffled': True, 'space': True, 'relationship': True, 'people': True, 'little': True,\
            'friendship': True, 'love': True, 'story': True, 'becomes': True,'hear': True, 'outer': True,\
            'boy': True, 'friend': True, 'tells': True, 'creature': True, 'described': True}

# "I hated this movie. Hated, hated, hated, hated, hated this movie. Hated it.
#  Hated every simpering stupid vacant audience-insulting moment of it."
#     -Roger Ebert, on "North"

negative_north = {'simpering': True, 'belief': True, 'every': True, 'thought': True, 'implied': True,\
                  'entertained': True, 'insulting': True, 'vacant': True, 'sensibility': True, 'stupid': True,\
                  'insult': True, 'audience': True, 'anyone': True, 'movie': True, 'hated': True, 'moment': True}

In [None]:
# What does the classifier think?

classifier.classify(positive_ET)

In [None]:
classifier.classify(negative_north)

In [None]:
# Predictions for a list of reviews

review_list = [positive_ET, negative_north]
classifier.classify_many(review_list)

In [None]:
# Although our classification is binary, Bayes' theorem assigns
# a probability of membership in either category

# Just how confident is our classifier of its predictions?

classifier.prob_classify(positive_ET).prob('positive')

In [None]:
classifier.prob_classify(negative_north).prob('negative')

In [None]:
## EX. There are two movie reviews that the classifier has not yet seen in the folder below.
##     Use the classifier to predict whether they are positive or negative.

## Q.  What kinds of patterns do you notice among the 'most informative features'
##     in the movie review corpus? Where are critics focusing their attention?
##     Try looking at the top several hundred most informative words.

## Extra: Cross-Validation

Just how good is our classifier? We can evaluate it by randomly selecting reviews from each category and setting them aside before training. We then see how well the classifier predicts their (known) categories.

Remember that if the classifier is trying to predict membership for just two categories, we would expect it to be correct about 50% of the time based on random chance. As a rule of thumb, if this kind of classifier has 65% accuracy or better under cross-validation, it has often identified a meaningful pattern.
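The code in this section sets aside a single group of reviews. A common extension, usually called k-fold cross-validation, repeats that procedure so that every item is held out exactly once. The sketch below shows only the splitting step, on a list of numbers standing in for tagged reviews. The function name `k_fold_splits` and its parameters are hypothetical, and this is one of several reasonable ways to form the folds.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (training, validation) pairs, one per fold."""
    shuffled = list(items)                 # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    for fold in range(k):
        # Every k-th item, starting at position 'fold', is held out
        validation = shuffled[fold::k]
        training = [item for index, item in enumerate(shuffled) if index % k != fold]
        yield training, validation

# Ten numbers stand in for ten tagged reviews
toy_items = list(range(10))
splits = list(k_fold_splits(toy_items, k=5))

for training, validation in splits:
    print(len(training), len(validation))
```

With real data, one would train a fresh classifier on each training portion, measure accuracy on the matching validation portion, and average the results across folds.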

In [None]:
# Randomize our list of movie reviews (in place)

import numpy
numpy.random.shuffle(all_tagged)

In [None]:
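# We'll train our classifier on the first 90% of reviews
# and validate using the last 10%

# Work out the size of the validation set from the corpus itself
n_validation = len(all_tagged) // 10
training_set = all_tagged[:-n_validation]
validation_set = all_tagged[-n_validation:]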

In [None]:
# Train, validate

classifier = nltk.NaiveBayesClassifier.train(training_set)
nltk.classify.accuracy(classifier, validation_set)

In [None]:
## EX. In fact, this is not the best implementation of cross-validation, since we used
##     the entire corpus to identify words that appear more than once. In effect, we have
##     passed information from our validation set into the classifier we wish to test.

##     Repeat the processing of the corpus based only on the training set and use this to
##     make predictions about the validation set. How much does the classifier's accuracy change? Why?