<h1 align='center'>It Starts with the Very Idea of a Humanistic Research Question...</h1>
<img src='Moretti 7, Fig 7.png' width="66%" height="66%">
<br>
<img src='Moretti 11, excerpt.png' width="66%" height="66%">

# Operationalizing

<ul>
<li>New Methods
<ul>
<li>String</li>
<li>Dictionary</li>
</ul>
</li>
<li>Import Corpus</li>
<li>Pre-Process Corpus</li>
<li>Pandas</li>
<li>Statistics
<ul>
<li>Character Space</li>
<li>Most Distinctive Words</li>
</ul>
</li>
</ul>

# 0. Review/Preview

In [None]:
# Collect the texts of a set of novels, using yesterday's method

import os
novel_corpus_path = 'txtalb_Novel150_English/'
novel_file_names = os.listdir(novel_corpus_path)
novel_texts = [open(novel_corpus_path+file_name).read() for file_name in novel_file_names]

In [None]:
# Take a peek at the first novel

novel_texts[0]

In [None]:
# Tokenize, remove stop words, remove low frequency tokens, featurize as True-False

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english', min_df = 2, binary=True)

novel_dtm = cv.fit_transform(novel_texts)

In [None]:
# Use 'pandas' package to get the output into human-readable format

import pandas

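# Note: 'get_feature_names_out' replaces the older 'get_feature_names' in recent scikit-learn releases
pandas.DataFrame(novel_dtm.toarray(), columns = cv.get_feature_names_out(), index=novel_file_names)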

# 1. Extending our Methods

### Strings

Strings and string methods have been our bread and butter throughout the workshop. We have already seen them assigned to variables, split over white spaces, added together, and sliced by index. Let's review those techniques and try out a couple of variations.

In [None]:
# Let's assign a string to a new variable
# Using the triple quotation mark, we can simply paste a passage in between
# and Python will treat it as a continuous string

first_sonnet = """From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory"""

In [None]:
# Note that when we print the 'first_sonnet', we see the character
# that represents a line break: '\n'

first_sonnet

In [None]:
# A familiar string method

first_sonnet.split()

In [None]:
# With a twist!

first_sonnet.split('\n')

In [None]:
# In fact, we can split over any character

first_sonnet.split('b')

In [None]:
# Of course, we can find the length of a string in characters

len(first_sonnet)

In [None]:
# And even slice it by character position!

first_sonnet[13:22]

In [None]:
# We can also reverse engineer the location of string patterns

first_sonnet.index('creatures')

In [None]:
# This method only returns the first instance of the string pattern

first_sonnet.index('b')

In [None]:
# A handy trick

first_sonnet.index('\n')

In [None]:
# Let's assign another string to a variable, shall we?

second_quatrain = """But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies, 
Thyself thy foe, to thy sweet self too cruel."""

In [None]:
# Remember that we can add strings together

first_sonnet + second_quatrain

In [None]:
# We can also assign them back to one of the variables!

first_sonnet = first_sonnet + "\n" + second_quatrain

In [None]:
# Et voila!

first_sonnet

In [None]:
## EX. How long is the first word of each line in 'first_sonnet'?
##     Hint: The first word ends at the first space!

### Dictionaries

Although we used dictionaries extensively yesterday in our applications, we didn't talk about them formally. Let's rectify that!

A <i>dictionary</i> is a data type in Python that is used to contain or organize other pieces of data. This is similar to the <i>list</i> data type that we have used extensively and that contains numbers, strings, and even other lists. Whereas lists organize data by keeping track of their order, a dictionary organizes data by means of a labeling system.

In a real-world dictionary, an entry can be found by its word-label and within the entry is a definition. In a Python dictionary, entries are labeled with <i>keys</i> and they contain <i>values</i>.

In [None]:
# Yesterday's first dictionary
# Keys were tokens from the sentence; Values were a True value representing its presence

{'high': True, 'air-speed': True, 'velocity': True}

In [None]:
# Assign to a variable

old_dictionary = {'high': True, 'air-speed': True, 'velocity': True}

In [None]:
# Call up the value for a given key

old_dictionary['high']

In [None]:
# What about a key that isn't in the dictionary? (Python raises a KeyError)

old_dictionary['low']
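
Looking up a missing key stops the cell with an error. As a minimal sketch, here are two common ways to guard against that: the `in` operator and the built-in `dict.get` method. The name `example_dictionary` is used only for illustration.

```python
# Illustrative dictionary, mirroring the one above
example_dictionary = {'high': True, 'air-speed': True, 'velocity': True}

# Test for membership before looking up a key
has_low = 'low' in example_dictionary

# 'get' returns a fallback value (here, False) instead of raising a KeyError
low_value = example_dictionary.get('low', False)
high_value = example_dictionary.get('high', False)

print(has_low, low_value, high_value)
```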

In [None]:
# Let's add it to the dictionary by giving it a value

old_dictionary['low'] = False

In [None]:
# Inspect

old_dictionary

In [None]:
# Get a list of all keys

old_dictionary.keys()

In [None]:
# Get a list of all values (same order as the key list!)

old_dictionary.values()

In [None]:
# Create a new, empty dictionary

new_dictionary = {}

In [None]:
# Add an entry
# In this case, both key and value are strings!

new_dictionary['breakfast'] = 'egg and spam; egg bacon and spam; egg bacon sausage and spam'

In [None]:
new_dictionary

In [None]:
## EX. Create a dictionary in which each key is a unique word from this
##     famous line, along with the value 'True'

## EX. Create a new dictionary in which each key is a unique word from this
##     famous line and each value is the number of letters in the word

famous_line = 'To be or not to be that is the question'

# 2. Import Corpus

Moretti performed his study of <i>Antigone</i> by collecting and dividing the speech belonging to each character. There are many ways to do this, but one elegant way is to create a dictionary in which each entry belongs to a unique character: the key is the character's name and the value is a string containing all of the words that character speaks.

In [None]:
# Read the text of Antigone from a file on your hard drive

antigone_text = open('antigone.txt', 'r').read()

In [None]:
# Inspect

antigone_text

In [None]:
# Create a list by splitting the string wherever a double line break occurs

antigone_list = antigone_text.split('\n\n')

In [None]:
# Inspect

antigone_list

In [None]:
# First line

antigone_list[0]

In [None]:
# Let's assign it to a variable and get a feel for an important property

first_line = antigone_list[0]

In [None]:
# Find the first space

first_line.index(' ')

In [None]:
# Slice the line before that space

first_line[:8]

In [None]:
# Slice the line after that space

first_line[8:]

In [None]:
# Create a new, empty dictionary

dialogue_dict = {}

In [None]:
# Remember the for-loop with conditional statements?

# Iterate through each of the play's lines
for line in antigone_list:
    
    # Find the first space in each line
    index_first_space = line.index(' ')
    
    # Slice the line, preceding the first space
    character_name = line[:index_first_space]
    
    # Check whether the character is in our dictionary yet
    if character_name not in dialogue_dict.keys():
        
        # If not, create a new entry whose value is a slice of the line *after* the first space
        dialogue_dict[character_name] = line[index_first_space:]
        
    else:
        
        # If so, add the slice of line to the existing value
        dialogue_dict[character_name] = dialogue_dict[character_name] + line[index_first_space:]
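
The loop above assumes that every chunk contains a space; `line.index(' ')` raises a `ValueError` otherwise (for example, on an empty chunk at the end of a file). Below is a more defensive sketch of the same idea, assuming the same 'NAME speech' layout. The sample text and the helper name `group_dialogue` are hypothetical, not part of the workshop materials.

```python
def group_dialogue(chunks):
    """Collect speech by speaker from chunks shaped like 'NAME speech...'."""
    grouped = {}
    for chunk in chunks:
        # partition splits at the first space; 'speech' is '' if there is none
        name, _, speech = chunk.strip().partition(' ')
        if not name or not speech:
            # Skip empty chunks and chunks with no speech after the name
            continue
        # Join successive speeches with a space so words do not run together
        grouped[name] = (grouped.get(name, '') + ' ' + speech).strip()
    return grouped

# Made-up sample in the same layout, not the actual text of the play
sample_text = "ANTIGONE First speech.\n\nISMENE A reply.\n\nANTIGONE Second speech.\n\n"
sample_dict = group_dialogue(sample_text.split('\n\n'))
print(sample_dict)
```

Using `partition` rather than `index` is a design choice here: it never raises an error, so malformed chunks can simply be skipped.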

In [None]:
# Inspect

dialogue_dict

In [None]:
# Single character

dialogue_dict['ANTIGONE']

In [None]:
## Choose one of the following:

## EX. Create a new dictionary in which each key is the name of a character from Antigone
##     and each value is the total number of words spoken.

## EX. Create a dictionary in which each entry is the dialogue belonging to an individual character in Hamlet.
##     Note that the text of Hamlet is formatted slightly differently from that of Antigone.

In [None]:
hamlet_text = open('hamlet.txt', 'r').read()

# 3. Pre-Process Corpus

In yesterday's lesson, pre-processing was an arduous task that took most of our effort. As it turns out, there is a popular package, <i>scikit-learn</i>, containing a tool that makes pre-processing very easy. We can use it to tokenize, remove stop words, select common words, and count their frequencies in a single line of code!
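
To get a feel for what happens inside that single line, here is a rough pure-Python approximation for one document. It is only a sketch: it assumes the vectorizer's default settings (lowercasing, and tokens made of two or more word characters), and the helper name `count_tokens` is hypothetical.

```python
import re
from collections import Counter

def count_tokens(text, stop_words=()):
    """Roughly mimic the default tokenize-and-count step for one document."""
    # Lowercase, then keep runs of two or more word characters as tokens
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    # Drop stop words and count what remains
    return Counter(token for token in tokens if token not in stop_words)

toy_counts = count_tokens("The spam, the eggs and the SPAM!", stop_words={'the', 'and'})
print(toy_counts)
```

The real tool does this for every document at once and lines the counts up in shared columns, which is what makes the result a matrix.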

In [None]:
# We used the dictionary to keep our strings organized by label above,
# but now we'll keep track of them by list order

# Wrap in list() so that we get true lists rather than dictionary views
dialogue_list = list(dialogue_dict.values())
character_list = list(dialogue_dict.keys())

In [None]:
# Inspect

character_list

In [None]:
# Inspect

dialogue_list

In [None]:
# We're going to remove stop words from our text, but we won't just use the basic list,
# since our translation of 'Antigone' affects an archaic diction

# Get NLTK's stopword list

from nltk.corpus import stopwords

english_stop_words = stopwords.words('english')

In [None]:
# Let's remind ourselves what the list contains

english_stop_words

In [None]:
# Create a custom list of stop words
ye_olde_stop_words = ['thou','thy','thee', 'ye', 'hath','hast', 'wilt',\
                      'art', 'dost','doth','shalt','tis','canst','thyself']

In [None]:
# Combine these lists

all_stop_words = english_stop_words + ye_olde_stop_words

In [None]:
# Import the pre-processing function 'CountVectorizer'

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Initialize the function to remove stop words; assign to a variable

cv = CountVectorizer(stop_words=all_stop_words)

In [None]:
# Pre-process the text of the play
# Produces a document-term matrix

dtm = cv.fit_transform(dialogue_list)

In [None]:
# Inspect

dtm

In [None]:
# Inspect a *slightly* more readable format

dtm.toarray()

In [None]:
# Get a list of vocabulary words from the pre-processor
# They double as column labels!

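# Note: 'get_feature_names_out' replaces the older 'get_feature_names' in recent scikit-learn releases
vocabulary_list = list(cv.get_feature_names_out())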

In [None]:
# Inspect

vocabulary_list

In [None]:
# And earlier, we created a list of characters that now match our rows!

character_list

In [None]:
# Import our dataframe package!

import pandas

In [None]:
# Create a human-readable document-term matrix!

pandas.DataFrame(dtm.toarray(), columns = vocabulary_list, index = character_list)

In [None]:
# Assign to a variable for later

dtm_df = pandas.DataFrame(dtm.toarray(), columns = vocabulary_list, index = character_list)

In [None]:
## Choose one of the following:

## EX. Try initializing the CountVectorizer function so that it removes words that appear
##     in just one document. (See example at beginning of lesson.)
##     How many columns remain in the document-term matrix?

## EX. Create a document-term matrix for Hamlet, in which each row is a character
##     and each column a unique word. Do not include stop words.

# 4. Pandas

Pandas is a popular and flexible package whose primary use is its datatype: the <i>DataFrame</i>. The dataframe is essentially a spreadsheet, like you would find in Excel, but it integrates seamlessly into a Natural Language Processing workflow and it has a few tricks up its sleeve.

In [None]:
# Create a list of three sub-lists, each with three entries

square_list = [[1,2,3],[4,5,6],[7,8,9]]

In [None]:
# Create a dataframe from that list

pandas.DataFrame(square_list)

In [None]:
# Let's create a couple of lists for our column and row labels

column_names = ['Eggs', 'Bacon', 'Sausage']
row_names = ['Served','With','Spam']

In [None]:
# A-ha!

pandas.DataFrame(square_list, columns = column_names, index=row_names)

In [None]:
# Assign this to a variable

spam_df = pandas.DataFrame(square_list, columns = column_names, index=row_names)

In [None]:
# Call up a column of the dataframe

spam_df['Eggs']

In [None]:
# Make that column into a list

list(spam_df['Eggs'])

In [None]:
# Get the indices for the entries in the column

spam_df['Eggs'].index

In [None]:
# Call up a row from the indices

spam_df.loc['Served']

In [None]:
# Call up a couple of rows, using a list of indices!

spam_df.loc[['Spam','Served']]

In [None]:
# Get a specific entry by calling both row and column

spam_df.loc['Spam', 'Eggs']

In [None]:
# Create a new column

spam_df['Lobster Thermidor aux crevettes'] = [10,11,12]

In [None]:
# Inspect

spam_df

In [None]:
## EX. Call up the entries (5 and 6) from the middle of the dataframe 'spam_df' individually

## CHALLENGE: Call up both entries at the same time

### DataFrame Slicing

In [None]:
# Slice out a column

spam_df['Bacon']

In [None]:
# Evaluate whether each element in the column is equal to 5

spam_df['Bacon']==5

In [None]:
# Use that evaluation to subset the table

spam_df[spam_df['Bacon']==5]
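
The expression inside the brackets is just a column of True/False values, one per row, so any comparison can be used to build it. A small sketch on a made-up table (`toy_df` is hypothetical and unrelated to `spam_df`):

```python
import pandas

# Hypothetical table, unrelated to 'spam_df'
toy_df = pandas.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8]}, index=['x', 'y', 'z'])

# A comparison yields one True/False value per row
mask = toy_df['B'] > 2

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
combined = toy_df[(toy_df['B'] > 2) & (toy_df['A'] < 7)]

print(list(mask))
print(combined)
```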

In [None]:
## EX. Slice 'spam_df' to contain only rows in which 'Sausage' is greater than 5

# 5. Statistics!

A mentor of mine once joked that the only training one needs for digital literary analysis is how to construct a document-term matrix. After that you simply hand the trainee a statistics textbook.

The DTM is the basis for not only Moretti's study of dramatic character but the vast majority of studies in the field. Sometimes the rows are characters in a play; sometimes they are individual poems or novels. The columns are very often the unique words contained in a corpus. If we think creatively, the text processing from each of the three previous workshops in this series can be represented in a DTM.

The digital humanist's call to arms: If words are as powerful and multivalent as humanists like to believe, would we not expect to find patterns in the matrix of their frequencies? Would those patterns not have essential interpretive value? Statistics will be the method for identifying these patterns.

In [None]:
# Our document-term matrix

spam_df

In [None]:
# Pandas will produce a few descriptive statistics for each column

spam_df.describe()

In [None]:
# Multiply entries of the DTM by 10

spam_df*10

In [None]:
# Add 10 to each entry

spam_df+10

In [None]:
# Of course our dataframe hasn't changed

spam_df

In [None]:
# We can also perform operations among columns
# Pandas knows to match up individual entries in each column

spam_df['Bacon']/spam_df['Eggs']

### Character Space

In his study, Moretti offers several measures of the concept of character-space. The simplest of these is to measure the relative share of dialogue belonging to each character in a play. Presumably the main characters will speak more and peripheral characters will speak less.

The statistical moves we will make here are not only counting the raw number of words spoken by each character but also normalizing those counts. That is, we will convert them into a fraction of all words in the play.
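
In symbols (the notation is introduced here only for illustration): if $n_c$ is the number of counted words spoken by character $c$, then that character's normalized share $s_c$ is

$$ s_c = \frac{n_c}{\sum_{k} n_k}, \qquad \sum_{c} s_c = 1, $$

where the sum in the denominator runs over every character in the play. Because our document-term matrix excludes stop words, "all words" here means all words that survived pre-processing.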

In [None]:
# Get a sum of each column

dtm_df.sum()

In [None]:
# Get a sum of each row

dtm_df.sum(axis=1)

In [None]:
# Assign this to a variable

raw_counts = dtm_df.sum(axis=1)

In [None]:
# Let's visualize!

# Tells Jupyter to produce images in notebook
%matplotlib inline
import matplotlib.pyplot as plt

# Makes images look good
plt.style.use('ggplot')

In [None]:
# Visualize using the 'plot' method from Pandas

raw_counts.plot(kind='bar')

In [None]:
# Get the total number of words

sum(raw_counts)

In [None]:
# Assign to variable

total_counts = sum(raw_counts)

In [None]:
# Use that total to normalize the share of words belonging to each character

raw_counts/total_counts

In [None]:
# Assign to a variable

normed_counts = raw_counts/total_counts

In [None]:
# Get them in order of most prominent speaker

normed_counts.sort_values(ascending=False)

In [None]:
# Reassign variable, for convenience

normed_counts = normed_counts.sort_values(ascending=False)

In [None]:
# Visualize

normed_counts.plot(kind='bar')

# 6. Most Distinctive Words

This is a clever technique that can be used to determine which words are most prominently associated with a single character in a text (or with a single text in a corpus), and it is a kind of house method that appears regularly in Stanford's LitLab pamphlet series.

The MDW method relies on two measurements: first it observes the number of times a given word is spoken by a character, and second it constructs an expected number of times the word would be spoken by the character if all things were equal. Finally, these measurements get compared as the ratio of Observed Frequency divided by Expected Frequency.

<table>
<tr><th>Word Spoken by Antigone</th><th>Observed</th><th>Expected</th><th>O/E Ratio</th></tr>
<tr><td>brother</td><td>10</td><td>2.1</td><td>4.7</td></tr>
</table>

Expected Frequency is determined by looking at the total share of words spoken by a character and multiplying that fraction by the total number of times a given word appears in the text. This produces a kind of weighted average.

For example, the word "brother" appears in the play 14 times. Antigone speaks about 15% of all words in the play, so we would expect that -- if she were totally average -- she would use the word about 2 times. (14 x 0.15 = 2.1) However, she says the word 10 times, or nearly five times more often than we expected. (10 / 2.1 = 4.7) Perhaps her relationship to her brother is a distinctive characteristic! In general, a character's MDWs have an O/E ratio greater than 1.

Note that this method is sensitive to low frequency words. For the sake of validity, we typically throw out words that are spoken fewer than five times by a character.
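
The calculation described above can be sketched on a toy table before trying it on the play. Everything in this example is made up for illustration (the speakers, the words, the counts, and the column names); it is one possible way to arrange the arithmetic, not the only one.

```python
import pandas

# Made-up counts for two speakers; rows are speakers, columns are words
toy_dtm = pandas.DataFrame(
    [[10, 2, 8], [4, 12, 4]],
    index=['SPEAKER_A', 'SPEAKER_B'],
    columns=['brother', 'law', 'city'],
)

# Share of all counted words that belongs to SPEAKER_A
speaker_share = toy_dtm.loc['SPEAKER_A'].sum() / toy_dtm.sum().sum()

toy_mdw = pandas.DataFrame()
toy_mdw['OBSERVED'] = toy_dtm.loc['SPEAKER_A']
toy_mdw['WORD_TOTAL'] = toy_dtm.sum()
# Expected count if the speaker used every word at their overall rate
toy_mdw['EXPECTED'] = toy_mdw['WORD_TOTAL'] * speaker_share
toy_mdw['OE_RATIO'] = toy_mdw['OBSERVED'] / toy_mdw['EXPECTED']

# Keep words spoken at least five times, most distinctive first
toy_mdw = toy_mdw[toy_mdw['OBSERVED'] >= 5].sort_values('OE_RATIO', ascending=False)
print(toy_mdw)
```

Filtering on the observed count before sorting keeps rare words, whose ratios are unstable, from crowding the top of the list.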

In [None]:
## EX. Produce a list of the Most Distinctive Words Spoken by Antigone.

##     Using the empty dataframe below, add columns for the observed word frequencies
##     of relevance to this problem and perform operations on these as described above.

In [None]:
# Create new, empty dataframe

mdw_df = pandas.DataFrame()

In [None]:
# Hint: Depending on how you approach the problem, you may wish to add columns for
#       Antigone's observed word counts and the total counts of each word in the play

mdw_df['ANTIGONE'] = dtm_df.loc['ANTIGONE']
mdw_df['WORD_TOTAL'] = dtm_df.sum()