<h1 align='center'>It Starts with a Humanistic Research Question...</h1>
<img src='Wilkens 807, Table 1.png' width="66%" height="66%">

## Geographic Imagination
<ul><li>Tokenize</li>
<ul><li>Functions</li>
<li>For-Loops</li></ul>
<li>Part of Speech Tags</li>
<ul><li>Conditional Statements</li></ul>
<li>Named Entity Recognition</li>
<li>Geographic Imagination</li></ul>

# 0. Preparation

Python has many basic, out-of-the-box functions that we use all the time for programming. When we want to extend our reach beyond the basics, new functions are made available through <i>packages</i> like NLTK.

Packages typically need to be downloaded individually. However, if you are using a platform like Anaconda (https://www.continuum.io/downloads), then many common packages are already on your computer.

In order to access the new functions contained within a package, we have to <i>import</i> it into our programming environment.

In [None]:
# Load up the Natural Language Toolkit

import nltk

In [None]:
# Check that NLTK has access to appropriate models for our project

# "words" is a word list that the named-entity chunker relies on
modules = ["averaged_perceptron_tagger", "maxent_ne_chunker", "punkt", "words"]

for module in modules:
    nltk.download(module)

In [None]:
# We'll use the opening paragraph of Pride and Prejudice throughout for our exercises

paragraph = 'It is a truth universally acknowledged, that a single man \
in possession of a good fortune, must be in want of a wife. \
However little known the feelings or views of such a man may be \
on his first entering a neighbourhood, this truth is so well fixed \
in the minds of the surrounding families, that he is considered as \
the rightful property of some one or other of their daughters. \
"My dear Mr. Bennet," said his lady to him one day, "have you heard \
that Netherfield Park is let at last?"'

# 1. Tokenize

Natural Language Processing is the field and set of methods dedicated to converting human language into something that the computer can read. It's important to keep in mind that a computer does not even know what a <i>word</i> is without receiving direct instructions from a human.

Fortunately NLTK has an easy-to-implement set of instructions encoded in its function <i>word_tokenize()</i>. The idea with this function is that we can put a string of human-language text in between its parentheses and it will return a list of the individual words from that text. NLTK has a similar function, as well, called <i>sent_tokenize()</i> that does the same thing, but returns a list of individual sentences.

Very often we want to tokenize our texts by word while retaining information about the boundaries between sentences. In order to do this, we will first use <i>sent_tokenize()</i> and then iterate through our list of sentences with <i>word_tokenize()</i>.
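
To make the point about explicit instructions concrete, here is a deliberately naive tokenizer written with Python's built-in `re` module. It is only an illustrative sketch: the function name `naive_word_tokenize` and its single rule are assumptions made for this example, not NLTK's algorithm, and <i>word_tokenize()</i> handles many cases (such as contractions and abbreviations) that this rule does not.

```python
import re

def naive_word_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # A token is either a run of word characters (optionally joined by
    # hyphens or apostrophes) or a single punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = naive_word_tokenize("An African or European swallow?")
print(tokens)
# ['An', 'African', 'or', 'European', 'swallow', '?']
```

Even this small rule encodes decisions, such as whether a hyphenated word is one token or two, that a human had to make on the computer's behalf.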

### Functions

In [None]:
# Import the functions we will use directly

from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
word_tokenize("What... is the air-speed velocity of an unladen swallow?")

In [None]:
# Assign our sentence to a variable

velocity = "What... is the air-speed velocity of an unladen swallow?"

In [None]:
# Inspect our new variable

velocity

In [None]:
# Feed our new variable into the function

word_tokenize(velocity)

In [None]:
# We can also assign the output of a function to a variable

velocity_list = word_tokenize(velocity)

In [None]:
# Inspect the output variable

velocity_list

In [None]:
# And we can input that new variable into other functions and so on

len(velocity_list)

In [None]:
# Assign three sentences of dialogue to a new variable

three_sentences = "What... is the air-speed velocity of an unladen swallow? What do you mean? An African or European swallow?"

In [None]:
# Inspect the newest variable

three_sentences

In [None]:
# Tokenize text by word

word_tokenize(three_sentences)

In [None]:
# Tokenize text by sentence

sent_tokenize(three_sentences)

In [None]:
## EX. Use the function word_tokenize() in order to get a list of words
##     from the paragraph above. How many tokens does the paragraph contain?

## EX. Use the function sent_tokenize() in order to get a list of sentences
##     from the paragraph above.

## Bonus: What is the average number of words per sentence in the paragraph?

### For-Loops

We iterate through the elements in a list using the "for" and "in" syntax. You can tell those words do something special because they appear in green!
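
As a warm-up that needs no NLTK, the sketch below loops over a short list of ordinary strings. The list `knights` and the variable `lengths` are made up for this illustration.

```python
# A small list to iterate over (hypothetical example data)
knights = ["Arthur", "Lancelot", "Galahad"]

# Start with an empty list to hold our results
lengths = []

# The indented line runs once for each element in the list
for knight in knights:
    lengths.append(len(knight))

print(lengths)
# [6, 8, 7]
```

The name `knight` is just a temporary label for whichever element is currently being handled; any valid variable name would work.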

In [None]:
# Combine sentence- and word-level tokenization

# The line below gets indented, so that our script knows what to do
# to each element in the list when it comes up

sentence_list = sent_tokenize(three_sentences)

for sentence in sentence_list:
    print(word_tokenize(sentence))

In [None]:
# Alternate format: List Comprehension

# Collects the output from our for-loop into a new list!
tokenized_sentences = [word_tokenize(sent) for sent in sentence_list]

In [None]:
# Inspect the new list
tokenized_sentences

In [None]:
## EX. For the 'paragraph' from earlier, use a for-loop to get
##     a list of words from each sentence individually.

## EX. Rewrite the for-loop as a list comprehension

# Detour: Word Frequency

For those not yet familiar with Natural Language Processing, it often comes as a surprise how powerful word frequencies are. Simply listing the unique words in a text and tallying the number of times each appears encodes information about authorship, genre, time period, and author nationality, among other features. Frankly, this is mind-boggling!

It is exceptionally easy to create this kind of tally in Python. There is a simple out-of-the-box function that we can use to count the number of times a token appears in a list. Yesterday, we looked at a function from NLTK called <i>FreqDist</i> that is a special version of the one we will look at today, <i>Counter</i>.

In [None]:
# Import a handy counting function
# Reports number of times each unique element appears in a list

from collections import Counter

In [None]:
# Create a list of tokens
not_a_briton = "I didn't know we had a king; I thought we were an autonomous collective."
not_british_tokens = word_tokenize(not_a_briton)

In [None]:
# Inspect token list
not_british_tokens

In [None]:
# Tally the appearances of each unique token
Counter(not_british_tokens)

In [None]:
# Assign the tally to a new variable
tokens_counted = Counter(not_british_tokens)

In [None]:
# Return unique tokens, sorted by number of appearances in list
tokens_counted.most_common()
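
Two further behaviours of `Counter` may be useful for the exercise that follows: looking up a single token, and limiting `most_common()` to the top few entries. The token list below is hand-typed for illustration.

```python
from collections import Counter

# Hypothetical token list, typed by hand
shrubbery_tokens = ["ni", "ni", "ni", "shrubbery", "ni", "shrubbery"]
tally = Counter(shrubbery_tokens)

# A Counter behaves like a dictionary: look up a token to get its count
print(tally["ni"])           # 4

# Tokens that never appeared count as zero rather than raising an error
print(tally["herring"])      # 0

# most_common(n) returns only the n most frequent tokens
print(tally.most_common(1))  # [('ni', 4)]
```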

In [None]:
## EX. What is the most common word in the 'paragraph' from earlier?
##     How often does 'truth' appear?

# 2. Part of Speech

As trained readers, we know that language partly operates according to (or sometimes against!) abstract, underlying structures, such as grammar. Identifying a word's part of speech, or tagging it, is an extremely sophisticated task that remains an open problem in the Natural Language Processing world. At this point, state-of-the-art taggers have somewhere in the neighborhood of 98% accuracy.

NLTK's default tagger, <i>pos_tag()</i>, has an accuracy just shy of that with the trade-off that it is comparatively fast. Simply place a list of tokens between its parentheses and it returns a new list where each item is the original word alongside its predicted part of speech.

The tags themselves come from the Penn Treebank and a full list of them can be found here: <a href="http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</a>

### Common POS taggers
<table align='left'>
    <tr>
        <td>from nltk.tag.perceptron</td>
        <td>import PerceptronTagger</td>
    </tr>
    <tr>
        <td>from nltk.tag.brill</td>
        <td>import BrillTagger</td>
    </tr>
    <tr>
        <td>from nltk.tag.stanford</td>
        <td>import StanfordTagger, StanfordPOSTagger, StanfordNERTagger</td>
    </tr>
</table>

Note: NLTK simply offers a wrapper for the Stanford taggers, which allows you to use them in Python, rather than their native Java. Stanford models must be downloaded from here: http://nlp.stanford.edu/software/

In [None]:
# NLTK's current default POS tagger is the 'averaged perceptron' as described here:
# https://spacy.io/blog/part-of-speech-POS-tagger-in-python

from nltk import pos_tag

# Create variable for new sentence
new_sentence = "Once the number three, being the third number, be reached, \
then lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, \
who, being naughty in My sight, shall snuff it."

# Create list of word tokens
new_tokens = word_tokenize(new_sentence)

# Assign a POS tag to each token
pos_tag(new_tokens)

In [None]:
# Let's refresh ourselves on the functions and for-loops from earlier

# An old variable revisited!
three_sentences = "What... is the air-speed velocity of an unladen swallow? What do you mean? An African or European swallow?"

# Re-make the list of sentences from the text
sentence_list = sent_tokenize(three_sentences)

# Re-tokenize each sentence by word
tokenized_sentences = [word_tokenize(sent) for sent in sentence_list]

In [None]:
# Inspect the list of lists of tokens
tokenized_sentences

In [None]:
# Now iterate through the tokenized sentences and POS tag them
for sentence in tokenized_sentences:
    print(pos_tag(sentence))

In [None]:
# Collect the tagged sentences in a list
tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]

In [None]:
## EX. Get POS tags for the very new sentence below.

## EX. Get POS tags for the 'paragraph' from Pride and Prejudice.
##     Collect these tags using a list comprehension

In [None]:
very_new_sentence = "On second thought, let's not go to Camelot."

### Conditional Statements

In [None]:
# The entries in each tagged sentence consist of a token-tag pair.
# Sometimes we just want one of those values.

# When the entries in a list are paired like the (token,tag) format above,
# we can label the elements separately while we iterate through

for sentence in tagged_sentences:
    for token, tag in sentence:
        print(token)

In [None]:
# Of course, we can access either value in the pair

for sentence in tagged_sentences:
    for token, tag in sentence:
        print(tag)

In [None]:
# We can also add a condition: IF the condition is TRUE,
# then the script continues with the next indented line.
# Otherwise, it gets skipped!

# Calling the noun tag for our IF statement

for sentence in tagged_sentences:
    for token, tag in sentence:
        if tag=='NN':
            print(token)

In [None]:
# Calling the adjective tag for our IF statement

for sentence in tagged_sentences:
    for token, tag in sentence:
        if tag=='JJ':
            print(token)

In [None]:
# The double equals sign is a test of equality NOT a variable assignment

5 == 3
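
One caveat about the noun example: in the Penn Treebank tag set, `NN` marks only singular common nouns, while plural and proper nouns are tagged `NNS`, `NNP`, and `NNPS`. The sketch below compares an exact `==` test with the string method `startswith()`. The (token, tag) pairs are typed by hand for illustration and are not actual <i>pos_tag()</i> output.

```python
# Hand-typed (token, tag) pairs; not real tagger output
tagged = [("The", "DT"), ("swallow", "NN"), ("carries", "VBZ"),
          ("coconuts", "NNS"), ("to", "TO"), ("Camelot", "NNP")]

# == matches one exact tag
singular_nouns = []
for token, tag in tagged:
    if tag == "NN":
        singular_nouns.append(token)

# startswith() matches the whole noun family: NN, NNS, NNP, NNPS
all_nouns = []
for token, tag in tagged:
    if tag.startswith("NN"):
        all_nouns.append(token)

print(singular_nouns)  # ['swallow']
print(all_nouns)       # ['swallow', 'coconuts', 'Camelot']
```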

In [None]:
## EX. Return the nouns from the opening paragraph of Pride and Prejudice.

# 3. Named Entity Recognition

Among parts of speech, names and proper nouns are of particular significance, since they are the more-or-less unique keywords that identify phenomena of social relevance (including people, places, and institutions). After all, there is just one <i>World War II</i>, and in a novel, a name like <i>Mr. Darcy</i> typically acts as a more-or-less stable referent over the course of the text. (Or perhaps we are interested in thinking about the degree of instability with which it is used!)

The identification of these kinds of names is referred to as Named Entity Recognition, or NER. The challenge is twofold. First, it has to be determined whether a name spans multiple tokens. (These multi-token grammatical units are referred to as <i>chunks</i>; the process, <i>chunking</i>.) Second, we would ideally distinguish among categories of entity. Is <i>Mr. Darcy</i> a geographic location? Just who is this <i>World War II</i> I hear so much about?

To this end, the function ne_chunk() receives a list of tokens including their parts of speech and returns a nested list where named entities' tokens are chunked together, along with their category as predicted by the computer.
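
To picture that nested structure before running the chunker, the sketch below builds a small tree by hand with NLTK's `Tree` class. The labels and tags here are typed in rather than predicted, so this shows only the shape of the data, not what ne_chunk() would actually return for this sentence.

```python
from nltk.tree import Tree

# Hand-built stand-in for chunker output; labels are typed in, not predicted
example_chunks = Tree("S", [
    Tree("GPE", [("Britain", "NNP")]),
    ("is", "VBZ"),
    ("an", "DT"),
    ("island", "NN"),
])

for item in example_chunks:
    if isinstance(item, Tree):
        # A chunk: it has a category label and its own (token, tag) leaves
        print(item.label(), item.leaves())
    else:
        # An ordinary (token, tag) pair outside any named entity
        print(item)
```

Ordinary tokens stay as (token, tag) pairs at the top level, while each named entity is wrapped in its own sub-tree carrying a category label.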

In [None]:
# Let's start with a fresh sentence containing several proper names

ner_sentence = 'King Arthur is the sovereign over Britain and lord of the Round Table.'
ner_tokens = word_tokenize(ner_sentence)
ner_tags = pos_tag(ner_tokens)

In [None]:
# Inspect the POS tags
ner_tags

In [None]:
# Import the NER function

from nltk import ne_chunk

chunks = ne_chunk(ner_tags)

In [None]:
# NLTK is finicky here, so we need to use 'print' to inspect
print(chunks)

In [None]:
# We'll iterate through our list of chunks. Named Entities are grouped
# together into 'nltk.tree.Tree'. (This is an under-the-hood data type.)

for chunk in chunks:
    if type(chunk)==nltk.tree.Tree:
        print(chunk.leaves())

In [None]:
# Let's select just ones with the 'GPE' (Geo-Political Entity) designation

for chunk in chunks:
    if type(chunk)==nltk.tree.Tree:
        if chunk.label()=='GPE':
            print(chunk.leaves())

In [None]:
# When we have multiple conditions -- i.e. multiple 'if' statements --
# we can put them together on a line using 'and'.

for chunk in chunks:
    if type(chunk)==nltk.tree.Tree and chunk.label()=='GPE':
        print(chunk.leaves())

In [None]:
# Rewrite it as a list comprehension!
# Note that the 'if' statement goes *after* the 'for'-'in' syntax

gpe_chunks = [chunk.leaves() for chunk in chunks if type(chunk)==nltk.tree.Tree and chunk.label()=='GPE']

In [None]:
# Inspect the new list

gpe_chunks

In [None]:
# We ultimately don't need the POS tag along with the place
# name, so we can iterate through and pull names out

for gpe in gpe_chunks:
    for name,tag in gpe:
        print(name)

In [None]:
# Now as a list comprehension!

names_only = [name for gpe in gpe_chunks for name,tag in gpe]

In [None]:
# Inspect the newest list

names_only

In [None]:
## EX. Retrieve the place names (excluding POS tags) from the sentence below.

## EX. Rewrite the previous exercise as a list comprehension.

In [None]:
swallow_skeptic = "Oh yeah, an African swallow, maybe, but not a European swallow."

# 4. Geographic Imagination

In order to study the changing attention of American literature during the nineteenth century, Matthew Wilkens counts the frequencies of place names and compares them before and after the Civil War. For now, we will limit our study to a single text, Kate Chopin's <i>The Awakening</i>. Using the techniques from this lesson, count the number of times each place name appears in this text. Return a list of the most common place names.


Q. Does the list of the most common names make sense? Are there things that don't? How might you change the program to handle names differently?

<i>Note: Wilkens includes notes on his own NER process on pp. 833-835.</i>
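
One possible refinement when counting: a place name that spans several tokens can be joined back into a single string before tallying, so that a two-word name is counted once rather than as two separate words. The sketch below assumes a list shaped like `gpe_chunks` from the previous section. The data is typed by hand and is not real output from the novel.

```python
from collections import Counter

# Hand-typed stand-in for a list like gpe_chunks; not real results
example_gpe_chunks = [
    [("New", "NNP"), ("Orleans", "NNP")],
    [("Mexico", "NNP")],
    [("New", "NNP"), ("Orleans", "NNP")],
]

# Join the tokens of each chunk so multi-word names stay together
place_names = [" ".join(token for token, tag in chunk)
               for chunk in example_gpe_chunks]

# Tally the joined names and sort by frequency
place_counts = Counter(place_names)
print(place_counts.most_common())
# [('New Orleans', 2), ('Mexico', 1)]
```

Whether joining is the right choice depends on the research question, which is exactly the kind of judgment the question above asks you to consider.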

In [None]:
# Read text of The Awakening from file
# Creates variable 'chopin_text' with whole text in single string

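# Using 'with' closes the file automatically once it has been read
with open('Chopin - The Awakening & Selected Short Stories.txt') as chopin_file:
    chopin_text = chopin_file.read()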