<h1 align='center'>It Starts with a Humanistic Research Question...</h1>
<img src='Thornbury 170, Fig 4.5.png' width="66%" height="66%">

# Introduction to Python for Humanists

### Motivating Project

### Programming in Python
<ul>
    <li>Basic Data Types & Operations</li>
    <ul>
        <li>Arithmetic</li>
        <li>Variable Assignment</li>
        <li>Strings</li>
        <li>Lists</li>
    </ul>
    <li>A few tricks up your sleeve</li>
    <ul>
        <li>String Methods</li>
        <li>List Comprehension</li>
    </ul>
</ul>

### Introduction to NLTK
<ul><li>Modules & Corpora</li>
<li>Concordance Building</li>
<li>Word Frequencies</li>
</ul>

# 0. Building Intuition

Inside the folder that contains this script, there is a plaintext file with notes that I took during Professor Emily Thornbury's talk "Stop Having Ideas and Start Counting." The title contains two present progressive (<i>-ing</i>) verbs, which might suggest we look for others in the body of the text, if we were to do something like a literary analysis.

Part of the reason people use Python to work on human-language texts (<i>natural language processing</i>) is that it makes tasks like this relatively simple.

In [None]:
# (don't worry about understanding everything here)
for line in open('lecture notes 09-22-15.txt'):
    for word in line.split():
        if word.endswith('ing'):
            print(word)

# 1. Basic Data Types & Operations
## Arithmetic

Before doing any more NLP, let's start with the basics. Any time you work with computers, it is essential to remember that they are simply counting machines. Place-holders with zeros and ones represent numbers that get added and subtracted from one another. This is true even for language -- but let's not get ahead of ourselves.

In [None]:
# Addition

2+5

In [None]:
# Let's have Python report the results from three operations at the same time

print(2-5)
print(2*5)
print(2/5)

In [None]:
# If we have all of our operations in the last line of the cell, Jupyter will print them together

2-5, 2*5, 2/5

In [None]:
# And let's compare values

2>5

## Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where <i>x</i> represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
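
As a small illustration of that direction (the variable names here are made up for the example): Python first works out the value on the right-hand side, then stores the result under the label on the left. That is why a line like `count = count + 1` makes sense in Python even though it would be nonsense in algebra.

```python
# Hypothetical variable names, chosen only for this illustration
count = 2          # the label 'count' now holds the value 2
count = count + 1  # the right side is computed first (2 + 1), then stored back in 'count'

total = count * 5  # 'total' is given the result of 3 * 5
print(count, total)
```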

In [None]:
# 'a' is being given the value 2; 'b' is given 5

a = 2
b = 5

In [None]:
# Let's perform an operation on the variables

a+b

In [None]:
# Variables can have many different kinds of names

this_number = 2
b/this_number

## Strings

In Python, human language text gets represented as a <i>string</i>. A string is an ordered sequence of characters, set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
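
To get a rough sense of that numerical view, here is a small sketch using Python's built-in `ord` and `chr` functions, which convert between a character and the number Python associates with it (its Unicode code point). The variable names are invented for the illustration.

```python
# Every character corresponds to a number (its Unicode code point)
upper_h = ord("H")    # the number behind the capital letter H
lower_h = ord("h")    # lowercase h is a different character, so a different number
print(upper_h, lower_h)

# 'chr' goes the other way, from a number back to a character
back_again = chr(upper_h)
print(back_again)
```

This is also why, to the computer, "Hello" and "hello" are different strings: they are built from different numbers.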

In [None]:
# The iconic string

print("Hello, World!")

In [None]:
# Assign these strings to variables

a = "Hello"
b = 'World'

In [None]:
# Try out arithmetic operations.
# When we add strings we call it 'concatenation'

print(a+b)
print(a*5)

In [None]:
# Unlike a number that consists of a single value, a string is an ordered
# sequence of characters. We can find out the length of that sequence.

len("Hello, World!")

In [None]:
## EX. How long is the string below?

this_string = "It was the best of times; it was the worst of times."

## Lists

The <i>numbers</i> and <i>strings</i> we have just looked at are the two basic data types that we will focus our attention on in this workshop. (In a few days, we will look at a third data type, <i>boolean</i>, which consists of just True/False values.) When we are working with just a few numbers or strings, it is easy to keep track of them, but as we collect more we will want a system to organize them.

One such organizational system is a <i>list</i>. This contains values (regardless of type) in order, and we can perform operations on it very similarly to the way we did with numbers.

In [None]:
# A list in which each element is a string

['Call', 'me', 'Ishmael']

In [None]:
# Let's assign a couple of lists to variables

list1 = ['Call', 'me', 'Ishmael']
list2 = ['In', 'the', 'beginning']

In [None]:
## Q. Predict what will happen when we perform the following operations

print(list1+list2)
print(list1*5)

In [None]:
# Like a string, we can find out the length of a list

len(list1)

In [None]:
# Sometimes we just want a single value from the list at a time

print(list1[0])
print(list1[1])
print(list1[2])

In [None]:
# Or maybe we want the first few

print(list1[0:2])
print(list1[:2])

In [None]:
## EX. Concatenate 'list1' and 'list2' into a single list.
##     Retrieve the third element from the combined list.
##     Retrieve the fourth through sixth elements from the combined list.

## String Methods

The creators of Python recognize that human language has many important yet idiosyncratic features, so they have tried to make it easy for us to identify and manipulate them. For example, in the demonstration at the very beginning of the workshop, we referred to the idea of the suffix: the final letters of a word tell us something about its grammatical role and potentially semantics.

We can analyze or manipulate certain features of a string using its <i>methods</i>. These are basically internal functions that every string automatically gets assigned. Note that even though a method may return a transformed version of the string at hand, it doesn't change the original string permanently!

In [None]:
# Let's assign a variable to perform methods upon

greeting = "Hello, World!"

In [None]:
# We saw the 'endswith' method at the very beginning
# Note the type of output that gets printed

greeting.startswith('H'), greeting.endswith('d')

In [None]:
# We can check whether the string is a letter or a number

# When there are multiple characters, it checks whether *all*
# of the characters belong to that category

greeting.isalpha(), greeting.isdigit()

In [None]:
# Similarly, we can check whether the string is lower or upper case

greeting.islower(), greeting.isupper(), greeting.istitle()

In [None]:
# Sometimes we want not just to check, but to change the string

greeting.lower(), greeting.upper()

In [None]:
# The case of the string hasn't changed!

greeting

In [None]:
# But if we want to permanently make it lower case we re-assign it

greeting = greeting.lower()

greeting

In [None]:
# Oh hey. And strings are kind of like lists, so we can slice them similarly

greeting[:3]

In [None]:
# Strings may be like lists of characters, but as humans we treat them as
# lists of words. We can perform that conversion for the computer.

greeting.split()

In [None]:
## EX. Return the second through eighth characters in 'greeting'

## EX. Split the string below into a list of words and assign this to a new variable
## Note: A backslash at the end of a line allows a string to continue unbroken onto the next

## Challenge: Return the characters from the first half of 'greeting'

In [None]:
new_string = "It is a truth universally acknowledged, that a single \
man in possession of a good fortune must be in want of a wife."

## List Comprehension

List comprehensions are a fairly advanced programming technique that we will spend more time talking about tomorrow. For now, you can think of them as list filters. Often, we don't need every value in a list, just a few that fulfill certain criteria.
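
One way to read a list comprehension is as a compressed loop. The sketch below, using a made-up list of words, builds the same filtered list twice: once with an explicit loop and once as a comprehension. The variable names are chosen only for this illustration.

```python
words = ['Call', 'me', 'Ishmael', 'some', 'years', 'ago']  # example data

# Long form: start with an empty list and add each word that passes the test
lowercase_loop = []
for word in words:
    if word.islower():
        lowercase_loop.append(word)

# The same filter written as a list comprehension
lowercase_words = [word for word in words if word.islower()]

print(lowercase_loop)
print(lowercase_words)
```

Both versions produce the same list; the comprehension simply states the filter in a single line.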

In [None]:
# 'list1' contains three words, two of which are in title case.
# We can automatically return those words using a list comprehension

[word for word in list1 if word.istitle()]

In [None]:
# Or we can include all the words in the list but just take their first letters

[word[0] for word in list1]

In [None]:
## EX. Using the list of words you produced by splitting 'new_string', create
##     a new list that contains only the words whose last letter is "e" 

## EX. Create a new list that contains the first letter of each word.

## EX. Create a new list that contains only words longer than two letters.

# Introduction to NLTK

The Natural Language Toolkit (NLTK) is a suite of Natural Language Processing tools that are designed to be easy to use while achieving near state-of-the-art performance. These will allow us to do things like tag the words in our strings with their parts of speech, identify synonyms and other semantic relationships, and perform useful statistics on our collections of words.

We will look at NLTK's most essential functions in the next few days in order to start building our tools from the ground up. Today, however, we will focus our attention on a few popular, higher order functions that NLTK makes available, including producing concordances and counting word frequencies.

To be more precise, NLTK is a <i>package</i>, which expands the functionality of Python by making new functions, methods, and data available. In order to access a package, we must first download it to our computers (which you have already done if you installed Anaconda) and then activate it by <i>importing</i> the package into our script. 
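
To make the idea of a concordance concrete before handing the job to NLTK, here is a rough sketch in plain Python. It is not how NLTK builds its concordances; the function name `simple_concordance`, its `window` parameter, and the sample sentence are all invented for this illustration. It lists every occurrence of a word along with a few words of context on either side, and it matches the word exactly as typed, including capitalization.

```python
def simple_concordance(words, target, window=2):
    """Return each occurrence of 'target' with up to 'window' words of context per side."""
    lines = []
    for position, word in enumerate(words):
        if word == target:
            start = max(0, position - window)  # don't run off the front of the list
            context = words[start:position + window + 1]
            lines.append(' '.join(context))
    return lines

# Made-up sample text, split into a list of words
sample = "the whale swam and the ship followed the whale".split()

for line in simple_concordance(sample, 'whale'):
    print(line)
```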

### Essential NLTK functions

<table align='left'>
    <tr>
        <td>Processing raw text</td>
        <td>nltk.tokenize, nltk.stem</td>
    </tr>
    <tr>
        <td>Part of Speech and Grammar</td>
        <td>nltk.tag, nltk.chunk, nltk.parse</td>
    </tr>
    <tr>
        <td>Semantic Meaning</td>
        <td>nltk.stem.wordnet</td>
    </tr>
    <tr>
        <td>Statistical measures</td>
        <td>nltk.metrics, nltk.probability, nltk.collocations</td>
    </tr>
</table>

### Example NLTK corpora of humanistic interest

<table align='left'>
    <tr>
        <td>Project Gutenberg: Lit</td>
        <td>nltk.corpus.gutenberg</td>
    </tr>
    <tr>
        <td>Webpages, chat</td>
        <td>nltk.corpus.webtext</td>
    </tr>
    <tr>
        <td>Twitter</td>
        <td>nltk.corpus.twitter_samples</td>
    </tr>
    <tr>
        <td>Brown Corpus</td>
        <td>nltk.corpus.brown</td>
    </tr>
    <tr>
        <td>Reuters: News</td>
        <td>nltk.corpus.reuters</td>
    </tr>
    <tr>
        <td>Inaugural Addr. (US)</td>
        <td>nltk.corpus.inaugural</td>
    </tr>
</table>

In [None]:
# Read Moby Dick from a plaintext file in the current folder

moby_string = open('Melville - Moby Dick.txt').read()

In [None]:
# Take a look at the string!

moby_string

In [None]:
# Split the string into a list of words

moby_list = moby_string.split()

In [None]:
# Inspect the beginning of the list

moby_list[:10]

In [None]:
# Import NLTK!

import nltk

In [None]:
# Use NLTK's 'Text' function to give 'moby_list' some new methods

moby_text = nltk.Text(moby_list)

In [None]:
moby_text

In [None]:
# Has the familiar list slicing method

moby_text[:10]

In [None]:
# Has the same length as our old 'moby_list'

len(moby_list), len(moby_text)

In [None]:
# But it's now very easy to produce a concordance

moby_text.concordance('whale')

In [None]:
# We can also find words that appear in similar contexts

moby_text.similar('whale')

In [None]:
# And we can find out what those shared contexts are

moby_text.common_contexts(["whale", 'ship'])

# Word Frequencies
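
A frequency distribution is, at bottom, a tally of how many times each word occurs. As a minimal sketch of the idea, using Python's standard `collections.Counter` rather than NLTK and a made-up sample sentence, we can count the words and then express one count as a proportion of all the words in the text.

```python
from collections import Counter

sample_words = "the whale and the ship and the sea".split()  # example data

counts = Counter(sample_words)   # tally how often each word occurs
top_two = counts.most_common(2)  # the two most frequent words, with their counts
print(top_two)

# Proportion of the text taken up by 'the': its count divided by the total number of words
proportion = counts['the'] / len(sample_words)
print(proportion)
```

NLTK's `FreqDist` works along the same lines, with extra conveniences for text analysis.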

In [None]:
# Get a report of word frequencies in the text

nltk.FreqDist(moby_text)

In [None]:
# Assign these to a variable

fdist = nltk.FreqDist(moby_text)

In [None]:
# Ten most frequent words in Moby Dick

fdist.most_common(10)

In [None]:
# Frequency of 'whale' in Moby Dick

fdist['whale']

In [None]:
# Often we want to normalize the frequency as a proportion of all words in the text

fdist.freq('whale')

In [None]:
# Let's get a simple list of unique words in Moby Dick

fdist.keys()

In [None]:
## EX. A common measure of lexical diversity for a given text is its Type-Token Ratio:
##     the number of unique words (types) divided by the total number of words (tokens).
##     Calculate the Type-Token Ratio for Moby Dick.

# Visualization

In [None]:
# This is a special Jupyter command that makes graphs appear inside the notebook

%pylab inline

In [None]:
# Let's graph the top 50 words in Moby Dick by descending frequency

fdist.plot(50)

In [None]:
# Alternatively, we can plot the words cumulatively so we can
# see how much of the text they account for

fdist.plot(50, cumulative=True)

In [None]:
## EX. Transform the script below into a list of words, then plot their frequencies.

In [None]:
script = "Man: Well, what've you got? Waitress: Well, there's egg and bacon; egg sausage and bacon; \
egg and spam; egg bacon and spam; egg bacon sausage and spam; spam bacon sausage and spam; \
spam egg spam spam bacon and spam; spam sausage spam spam bacon spam tomato and spam; \
spam spam spam egg and spam; spam spam spam spam spam spam baked beans spam spam spam; \
...or Lobster Thermidor au Crevette with a Mornay sauce served in a Provencale manner with shallots \
and aubergines garnished with truffle pate, brandy and with a fried egg on top and spam."

# Visualizing Alliterative Frequency

Thornbury's argument for dual authorship of "Christ and Satan" relied upon the fact that there was a large, anomalous divide between fitts with alliterative distributions above and below the average frequency for the text. Using general Python techniques and the specific NLTK tools that we have seen, we can now approximate that average alliterative distribution curve for "Christ and Satan" by collecting the first letter of each word used in the poem.

The last cell below reads in the text of "Christ and Satan" from a file that resides on your hard drive and assigns the text as a string to the variable 'cs_text'.

In [None]:
## EX. Plot a frequency graph, not for its words, but for the first letter of each word in "Christ and Satan."
## Hint: You will need to use a list comprehension.

In [None]:
cs_text = open('christ-and-satan.txt').read()