In [None]:
<h1 align='center'>It Starts with a Humanistic Research Question...</h1>
<img src='Long, So 263, Fig 8.png' width="66%" height="66%">

# 0. Preview

In [None]:
import nltk
nltk.download('punkt')

In [None]:
# Get texts of interest that belong to identifiably different categories

unladen_swallow = 'high air-speed velocity'
swallow_grasping_coconut = 'low air-speed velocity'

# Transform them into the format NLTK expects

unladen_features_tagged = ({'high':True, 'low': False, 'air-speed': True, 'velocity': True},\
                           'unladen')
coconut_features_tagged = ({'high': False, 'low':True, 'air-speed': True, 'velocity': True},\
                           'coconut')

# Train a classifier to learn distinguishing features

classifier = nltk.NaiveBayesClassifier.train([unladen_features_tagged, coconut_features_tagged])

In [None]:
# It's a simple question of weight ratios!
# A five ounce bird could not carry a one pound coconut. 

unknown_swallow = "high velocity"
unknown_features = {'high':True, 'low': False, 'air-speed': False, 'velocity':True}

classifier.classify(unknown_features)

# 1. Review

In [None]:
# Read Moby Dick
with open('Melville - Moby Dick.txt','r') as file_in:
    moby_string = file_in.read()

In [None]:
# Inspect the text
moby_string

In [None]:
# Make the text lower case
moby_lower = moby_string.lower()

In [None]:
# Tokenize Moby Dick
from nltk import word_tokenize
moby_tokens = word_tokenize(moby_lower)

In [None]:
# Check out the tokens
moby_tokens

In [None]:
# Just how long is Moby Dick anyway?
len(moby_tokens)

In [None]:
# Create a dictionary that counts token frequencies
from collections import Counter
moby_dict = Counter(moby_tokens)

In [None]:
# Dictionaries pair keys with values
moby_dict

In [None]:
# Report the ten most common tokens in the novel
moby_dict.most_common(10)

In [None]:
# Get the frequency of a specific word
moby_dict['whale']

In [None]:
# Create a list comprehension, including an 'if' statement
just_whales = [token for token in moby_tokens if token=='whale']

In [None]:
# Hast seen the White Whale?
just_whales

# 2. Import Corpus

### Operating System Interface!

Even though it sounds banal, this is the moment your computer ceases to be an appliance and transforms into a tool. The <i>os</i> package allows Python to speak with the rest of your computer's systems and file storage. You now have access to any file on your computer and can manipulate it using the code you have learned so far. With great power comes great responsibility!

For now, we will look at just one function from <i>os</i> that gives us access to our corpora when they are separated into individual, plaintext files.

In [None]:
import os

In [None]:
# Report the files in the current folder
os.listdir()

In [None]:
# Follow one of the reported folders
os.listdir('movie_reviews')

In [None]:
# And follow deeper
os.listdir('movie_reviews/negative')

In [None]:
# Assign that list to a variable
negative_files = os.listdir('movie_reviews/negative')

In [None]:
# Inspect first element in a list
negative_files[0]

In [None]:
## EX. How many reviews are there in the 'positive' folder?
##     How many in the 'negative' one?

## EX. Get a list of files in each of the following paths.
##     Assign these to separate variables.

review_path = 'poems/reviewed/'
random_path = 'poems/random/'

## CHALLENGE: Find a list of files and folders on your desktop.

### Corpora

The main corpus we will use for our exercises is a set of positive and negative movie reviews made available through NLTK. Each review is contained in its own <i>.txt</i> file, and these reside in their respective folders, "positive" and "negative".

Although positive and negative reviews capture rough ideas of taste and distinction, Ted Underwood and Jordan Sellers have done a literary-historical study of nineteenth- and early-twentieth-century volumes of poetry that were reviewed in prestigious magazines versus those that were not reviewed at all. The idea is that even a negative review indicates valuable, critical engagement.

Underwood and Sellers have made their corpus publicly available, so we will apply the techniques we learn to these as we proceed. Note that due to issues of copyright, volumes' word order has not been retained, although their total word counts have been. Fortunately, our methods do not require word-order information.
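
To see why losing word order is harmless for the kind of features used in this notebook, here is a minimal sketch. The sentence and the three-word vocabulary are made up for illustration, and sorting the tokens is only a stand-in for however the published volumes were actually scrambled.

```python
from collections import Counter

# A made-up "volume" and a version of it with the word order discarded
original = "the whale swam past the ship".split()
scrambled = sorted(original)  # stand-in for a volume without its word order

# Word counts do not depend on order
print(Counter(original) == Counter(scrambled))

# Neither do presence/absence features over a (hypothetical) vocabulary
vocabulary = ["whale", "ship", "duck"]
features_original = {word: word in original for word in vocabulary}
features_scrambled = {word: word in scrambled for word in vocabulary}
print(features_original == features_scrambled)
```

Both comparisons print `True`: counting words and checking whether a word is present never look at where the word occurs.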

Their literary corpus has been divided into three folders: "reviewed", "random", and "canonic". (The last of these contains volumes by poets who are now canonical but who did not have the opportunity to be reviewed, such as Emily Dickinson.)

In [None]:
# Open the first file from negative_files
open('movie_reviews/negative/cv000_29416.txt').read()

In [None]:
# When opening others, filenames change but the path doesn't!
path = "movie_reviews/negative/"
open(path+'cv000_29416.txt').read()

In [None]:
# Read all files and assign to a variable
negative_reviews = [open(path+name,'r').read() for name in negative_files]

In [None]:
# NOTE: If you are using OSX, your operating system may sometimes
# include hidden files in your folders that confuse Python.

# If you get an error while running this line, try including an 'if' condition
# in your list comprehension to prevent Python from tripping over these.

# For example:
negative_reviews = [open(path+name,'r').read() for name in negative_files if name[-4:]=='.txt']
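
If you prefer to wrap this pattern in a reusable function, one possible sketch follows. The name `read_txt_files` is hypothetical (it is not part of <i>os</i> or NLTK), and the demonstration writes a few made-up files to a temporary folder rather than touching the real corpus. It uses `os.path.join` to build paths and a `with` block so that each file is closed after reading.

```python
import os
import tempfile

def read_txt_files(folder):
    """Return the text of every .txt file in folder, in sorted filename order."""
    texts = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.txt'):
            continue  # skip hidden or non-text files such as .DS_Store
        with open(os.path.join(folder, name), 'r') as file_in:
            texts.append(file_in.read())
    return texts

# Demonstrate on a temporary folder with made-up filenames
with tempfile.TemporaryDirectory() as demo_folder:
    for name, text in [('a.txt', 'loved it'), ('b.txt', 'hated it'), ('.DS_Store', 'junk')]:
        with open(os.path.join(demo_folder, name), 'w') as file_out:
            file_out.write(text)
    demo_reviews = read_txt_files(demo_folder)

print(demo_reviews)
```

Only the two `.txt` files are read; the hidden file is skipped.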

In [None]:
# Repeat process for positive reviews
path = 'movie_reviews/positive/'
positive_files = os.listdir(path)
positive_reviews = [open(path+name,'r').read() for name in positive_files]

In [None]:
# Inspect first element in list
positive_reviews[0]

In [None]:
## EX. How long is the list of positive movie reviews? Negative reviews?
##     Do these match the number of files you had observed in the folders?

## EX. Retrieve the files from the 'reviewed' and 'random' folders of poetry.
##     Create a separate list for each category, in which each element is a
##     string with a file's text.

# 3. Classification

### Feature Set

The feature set we will use is a simple one: Does the review (or volume of poetry) contain a given word? True or False. The idea is that one category of movie review (or poetry volume) may be distinguishable from the other based on their vocabularies. For example, words like "terrible" or "mediocre" are presumably less likely to appear in positive reviews.

It is often useful to look at high-frequency terms in a corpus. Intuitively, not all words in the corpus will convey the same amount of information about whether a review is positive or negative. At the extreme, if a word appears in just a single review out of thousands, it doesn't tell us much either way about whether that word is associated with a category. By removing infrequent terms from our model, we can also save computational time.

In this case, we will include just the 1000 most frequent words for our model.
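
Before running this on the full corpus, it may help to see the selection step at toy scale. The three tiny "reviews" below are invented for illustration, and we keep only the 2 most frequent words instead of 1000.

```python
from collections import Counter

# Three made-up, already-tokenized reviews
toy_reviews = [
    ["a", "great", "movie"],
    ["a", "terrible", "movie"],
    ["a", "cast"],
]

# Flatten into one list of tokens, then count them
toy_words = [token for review in toy_reviews for token in review]
toy_counts = Counter(toy_words)

# Keep just the words, dropping their counts
toy_common = [word for word, count in toy_counts.most_common(2)]
print(toy_common)
```

Words that appear only once ("great", "terrible", "cast") fall outside the cutoff and would not be used as features.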

In [None]:
# Tokenize our sets of reviews; tokens remain grouped by review

negative_tokenized = [word_tokenize(review.lower()) for review in negative_reviews]
positive_tokenized = [word_tokenize(review.lower()) for review in positive_reviews]

In [None]:
# Inspect
positive_tokenized[0]

In [None]:
# Ungroup words from their reviews; returns a flat list of words

negative_words = [token for review in negative_tokenized for token in review]
positive_words = [token for review in positive_tokenized for token in review]

# Combine lists
all_words = negative_words + positive_words

In [None]:
# Create a dictionary where keys are unique words
# and entries are the number of times they appear

from collections import Counter
all_words_counted = Counter(all_words)

In [None]:
# Inspect
all_words_counted

In [None]:
# Get a list of the most common words and their frequencies
all_words_counted.most_common(1000)

In [None]:
# Assign the list to a variable
common_words = [word for word,count in all_words_counted.most_common(1000)]

In [None]:
# Inspect
common_words

In [None]:
## EX. How many words are there total in the movie review corpus?

## CHALLENGE: How many unique words are there in the movie review corpus?
##            What is the average number of times each word appears?

## EX. Tokenize each poetry volume. Create a list for each category ('reviewed'/'random').

## EX. Get a list of the 500 most frequent terms in the poetry corpus.

### Featurize Texts

For humans, reading a string of text is a relatively easy task, but for the computer to learn about language, text has to be represented in very particular ways. We refer to this as <i>featurization</i>: the transformation of a text into a quantitative feature set.

In order for the NLTK classifier to work, we have to represent each text as a set of True/False values: Is a given word from our high-frequency vocabulary present in this review? More specifically, these values will be contained in a dictionary, where each key is a vocabulary word and its value is whether or not it is present.

Once we have processed each text according to this rubric, we will then attach a label for the text's category ('positive'/'negative'). The classifier will use this to identify which features are associated with each.

In [None]:
# Dummy review tokens
test_review = ["i","loved","this","movie","!"]

In [None]:
# Is the word 'this' in the test_review?
'this' in test_review

In [None]:
# Is the word 'duck' in the test_review?
'duck' in test_review

In [None]:
# Iterate through high-frequency vocabulary to test words
for word in common_words:
    print(word, word in test_review)

In [None]:
# Record these values in a dictionary comprehension!
{word:word in test_review for word in common_words}

In [None]:
# Turn our reviews into dictionaries that indicate whether a word is present
negative_featurized = [{word:word in review for word in common_words} \
                       for review in negative_tokenized]

positive_featurized = [{word:word in review for word in common_words} \
                       for review in positive_tokenized]

In [None]:
# Inspect
negative_featurized[0]

In [None]:
# Attach a label to each review
negative_tagged = [(review,'negative') for review in negative_featurized]
positive_tagged = [(review,'positive') for review in positive_featurized]

In [None]:
# Inspect
negative_tagged[0]

In [None]:
# Combine these lists of featurized, tagged reviews
all_tagged = negative_tagged + positive_tagged

In [None]:
## EX. Featurize and tag the volumes of 'reviewed' and 'random' poetry.
##     Return a list containing all poems in the corpus.

### Classification

We have selected an algorithm that specifically relies on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> to model relationships between textual features and categories in our corpus of movie reviews. (See link for more information about the method and its assumptions.)

Two ways that we learn about the model are its feature weights and its predictions on new texts. The algorithm can explicitly report to us which direction each word leans category-wise and how strongly. Based on those weights, it makes further predictions about the valences of previously unseen movie reviews.
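
To make the role of Bayes' Theorem concrete, here is a hand-rolled sketch on four invented training texts. It is not NLTK's implementation: the function `toy_posterior` is hypothetical, and the add-one smoothing it uses is an assumption of this sketch, so NLTK's numbers for the same data could differ. The logic is the same in spirit: start from how common each category is, then multiply in how often each feature value occurs within that category, treating features as independent.

```python
# Made-up training data in the (features, label) format the classifier expects
toy_training = [
    ({"great": True, "terrible": False}, "positive"),
    ({"great": True, "terrible": False}, "positive"),
    ({"great": False, "terrible": True}, "negative"),
    ({"great": True, "terrible": True}, "negative"),
]

def toy_posterior(features, training):
    """Return P(label | features) for each label, assuming that
    features are independent given the label (the 'naive' assumption)."""
    labels = sorted({label for _, label in training})
    scores = {}
    for label in labels:
        examples = [feats for feats, lab in training if lab == label]
        # Prior: share of training texts that carry this label
        score = len(examples) / len(training)
        for word, value in features.items():
            matches = sum(1 for feats in examples if feats[word] == value)
            # Add-one smoothing (an assumption here) avoids zero probabilities
            score *= (matches + 1) / (len(examples) + 2)
        scores[label] = score
    # Normalize so the probabilities sum to one
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posterior = toy_posterior({"great": True, "terrible": False}, toy_training)
print(posterior)
```

For a text that contains "great" but not "terrible", the sketch assigns a probability of 9/11 (about 0.82) to 'positive' and 2/11 to 'negative'.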

In [None]:
# Train the classifier and assign it to a variable

from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(all_tagged)

In [None]:
# Report feature information
classifier.show_most_informative_features()

In [None]:
# Import, featurize test set of reviews
# (lower-case the text first, to match how the training reviews were tokenized)

positive_test_string = open('movie_reviews/test/New York Times - Zootopia (positive review).txt').read()
positive_test_tokens = word_tokenize(positive_test_string.lower())
positive_test_features = {word:word in positive_test_tokens for word in common_words}

negative_test_string = open('movie_reviews/test/Cinemixtape - Zootopia (negative review).txt').read()
negative_test_tokens = word_tokenize(negative_test_string.lower())
negative_test_features = {word:word in negative_test_tokens for word in common_words}

In [None]:
# Predict whether new reviews are positive or negative
classifier.classify_many([positive_test_features,negative_test_features])

In [None]:
# Just how confident is our classifier of its predictions?
classifier.prob_classify(positive_test_features).prob('positive')

In [None]:
classifier.prob_classify(negative_test_features).prob('negative')

In [None]:
## EX. Train a classifier on the featurized, tagged poetry corpus.

### Extra: Validation

Just how good is our classifier? We can evaluate it by randomly selecting reviews from each category and setting them aside before training. We then see how well the classifier predicts their (known) categories.

In [None]:
# Randomize our list of movie reviews (in place)

import numpy
numpy.random.shuffle(all_tagged)

In [None]:
# We'll train our classifier on the first 90% of reviews
# and validate using the last 10%

cutoff = len(all_tagged) // 10
training_set = all_tagged[:-cutoff]
validation_set = all_tagged[-cutoff:]

In [None]:
# Train, validate

classifier = nltk.NaiveBayesClassifier.train(training_set)
nltk.classify.accuracy(classifier, validation_set)
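
The accuracy reported above is the share of held-out reviews whose predicted label matches the known one. The sketch below reproduces that bookkeeping on invented data. The list `toy_tagged` stands in for `all_tagged`, and `toy_predict` is a hypothetical one-rule predictor standing in for a trained classifier.

```python
import random

# Made-up stand-in for all_tagged: 20 (features, label) pairs
toy_tagged = [({"great": i % 2 == 0}, "positive" if i % 2 == 0 else "negative")
              for i in range(20)]

random.seed(0)              # fix the shuffle so the split is reproducible
random.shuffle(toy_tagged)  # shuffles the list in place

holdout = len(toy_tagged) // 10  # set aside 10% for validation
toy_training = toy_tagged[:-holdout]
toy_validation = toy_tagged[-holdout:]

def toy_predict(features):
    # Hypothetical one-rule "classifier" standing in for a trained model
    return "positive" if features["great"] else "negative"

# Accuracy: correct predictions divided by the number of held-out texts
correct = sum(1 for features, label in toy_validation
              if toy_predict(features) == label)
toy_accuracy = correct / len(toy_validation)
print(len(toy_training), len(toy_validation), toy_accuracy)
```

Because the toy labels were built from the very rule the predictor uses, its accuracy here is 1.0 by construction; a real classifier on real reviews will score lower.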

In [None]:
## CHALLENGE: Validate the model you trained on the poetry corpus.

# 4. Literary Distinction

In their study of critical taste, Underwood and Sellers find not only that literary standards change very slowly, but that contemporary metrics of 'canonicity' resemble those of the nineteenth century.

In order to test this idea, the authors trained a classifier on nineteenth- and early twentieth-century volumes of poetry that received reviews in a prestigious magazine versus those that didn't. The authors then used the classifier to predict a category for volumes of poetry that went unreviewed, in several cases because they were unpublished, but are now included in Norton anthologies.

How closely does critical evaluation today match that of a century ago?

In [None]:
## EX. If you have not already, train a classifier on the featurized, tagged poetry corpus.

## EX. Import and process the 'canonic' (albeit unreviewed) volumes of poetry.
##     Use the poetry classifier to predict whether they might have been reviewed.

canonic_path = 'poems/canonic/'