# 02. Textual geography 

This session will walk through some more advanced text processing and data wrangling to produce a map of the locations mentioned in our corpus. Topics covered include:

* Named entity recognition using Stanford's CRF-NER package.
* Geolocation using Google's mapping APIs.
* Cartographic visualization, both static (for print publication) and interactive (for online use).

## Named entity recognition

There are several approaches to identifying the places mentioned in a piece of text. We could rely on a dictionary or gazetteer, which would tell us that Edinburgh is a city in Scotland, but would also tell us that Charlotte Brontë is a city in the United States.
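To make the over-matching problem concrete, here is a minimal sketch of a naive gazetteer lookup. The two-entry `GAZETTEER` dictionary and the `gazetteer_lookup` helper are hypothetical, invented for illustration only; a real gazetteer would hold many thousands of entries.

```python
import re

# Hypothetical toy gazetteer: place name -> description (illustration only)
GAZETTEER = {
    'Edinburgh': 'city in Scotland',
    'Charlotte': 'city in the United States',
}

def gazetteer_lookup(text, gazetteer=GAZETTEER):
    '''Return (token, description) pairs for every token found in the gazetteer.'''
    tokens = re.findall(r'\w+', text)  # Crude word tokenization
    return [(tok, gazetteer[tok]) for tok in tokens if tok in gazetteer]

sentence = 'Charlotte Brontë never lived in Edinburgh.'
matches = gazetteer_lookup(sentence)
print(matches)
```

The lookup has no notion of context, so the author's first name is reported as a place alongside the genuine match for Edinburgh.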

We'll instead use statistical machine learning methods. While we *will* get an intro to machine learning this afternoon, for now we'll rely on the [implementation by the Stanford NLP group](http://nlp.stanford.edu/software/CRF-NER.html).

This is a Java package. It's possible -- with a lot of work -- to invoke Java code from Python. But there's no point in this case; it's much easier to invoke the NER package from the command line and to read in the plain text output.

Note that there do exist NER, POS, and other NLP packages for Python. NLTK -- the Natural Language Toolkit, which we met in the last notebook when we used it for corpus processing -- is one of the most diverse and well conceived. But it's not optimized for speed and isn't notably accurate compared to more production-oriented offerings.

There's a guide to using the NER package on Stanford's site. Here's the short version, for reference:

```
java -mx1g -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
  -outputFormat tabbedEntities \
  -textFile file.txt > file.tsv
```

This runs the English-language classifier over a single text file. Note that we need to have a classifier trained for the language of the text we're processing. Stanford has trained models for English, Spanish, German, Chinese, and other major languages, but they're lacking French and a great many other languages.
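Since the command handles one file at a time, tagging a whole corpus means repeating it per file. The following is a rough sketch of one way to do that from Python with `subprocess`. The install directory `stanford-ner`, the output directory, the `.tsv` naming, and both helper functions are assumptions made for illustration, not part of the Stanford distribution. The `:` classpath separator assumes a Unix-like system.

```python
import subprocess
from pathlib import Path

# Assumed locations; adjust to wherever the NER distribution and texts live
NER_DIR = Path('stanford-ner')   # Hypothetical install directory
TEXT_DIR = Path('../Data/Texts')
OUT_DIR = Path('../Data/NER')    # Assumed output location

def build_ner_command(text_file):
    '''Return the java command (as an argument list) that tags one text file.'''
    return [
        'java', '-mx1g', '-cp', '*:lib/*',
        'edu.stanford.nlp.ie.crf.CRFClassifier',
        '-loadClassifier', 'classifiers/english.all.3class.distsim.crf.ser.gz',
        '-outputFormat', 'tabbedEntities',
        '-textFile', str(text_file),
    ]

def tag_corpus(text_dir=TEXT_DIR, out_dir=OUT_DIR, ner_dir=NER_DIR):
    '''Run the classifier over every .txt file, writing one .tsv per input.'''
    out_dir.mkdir(parents=True, exist_ok=True)
    for text_file in sorted(text_dir.glob('*.txt')):
        out_file = out_dir / (text_file.stem + '.tsv')
        with open(out_file, 'w', encoding='utf-8') as out:
            # Run from ner_dir so the relative classpath and classifier path resolve
            subprocess.run(build_ner_command(text_file.resolve()),
                           cwd=ner_dir, stdout=out, check=True)
```

Calling `tag_corpus()` would then produce one tagged file per text. Each call starts a fresh JVM and reloads the classifier, so this is simple rather than fast.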

The classifier's output follows a tabbed format that looks like this:

```
                Why did the poor poet of
Tennessee       LOCATION        , upon suddenly receiving two handfuls of silver , deliberate whether to buy him a coat , which he sadly needed , or invest his money in a pedestrian trip to
Rockaway Beach  LOCATION        ?
```

The thing to notice is that every line contains two tabs. The text before the first tab is a recognized entity (it is empty when the line opens with body text). The text between the first and second tabs indicates the type of entity: PERSON, ORGANIZATION, or LOCATION. Any text following the second tab is body text, presumed not to contain any named entities.
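As a quick illustration of the format, the sketch below splits each line on its first two tabs. The helper name `parse_tabbed_line` is hypothetical, and the sample lines are abbreviated from the output above. It assumes every line really does contain two tabs; a line with fewer would raise a `ValueError`.

```python
def parse_tabbed_line(line):
    '''Split one tabbedEntities line into (entity, entity_type, text).
       Entity and type are empty strings when the line starts with two tabs.'''
    entity, entity_type, text = line.rstrip('\n').split('\t', 2)
    return entity.strip(), entity_type.strip(), text.strip()

# Sample lines abbreviated from the classifier output shown above
sample = [
    '\t\tWhy did the poor poet of',
    'Tennessee\tLOCATION\t, upon suddenly receiving two handfuls of silver',
    'Rockaway Beach\tLOCATION\t?',
]

locations = [entity for entity, entity_type, _ in map(parse_tabbed_line, sample)
             if entity_type == 'LOCATION']
print(locations)
```

Filtering on the second field is all it takes to keep locations and drop people and organizations.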

So let's recreate our corpus, then read in the tagged files and get a list of the locations used in the corpus.

### Recreate the corpus

```python
import os
import pandas as pd
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

text_dir = '../Data/Texts/'
corpus = PlaintextCorpusReader(text_dir, r'.*\.txt')

# A function to turn fileids into a table of metadata
def parse_fileids(fileids):
    '''Takes a list of file names formatted like A-Cather-Antonia-1918-F.txt.
       Returns a pandas dataframe of derived metadata.'''
    meta = {}
    for fileid in fileids:
        # Drop the file suffix (str.strip would remove characters, not the suffix)
        file = os.path.splitext(fileid)[0]
        fields = file.split('-') # Split on dashes
        fields[2] = fields[2].replace('_', ' ') # Replace underscores in titles with spaces
        fields[3] = int(fields[3]) # Publication year as an integer
        meta[fileid] = fields
    metadata = pd.DataFrame.from_dict(meta, orient='index') # Build dataframe
    metadata.columns = ['nation', 'author', 'title', 'pubdate', 'gender'] # Col names
    return metadata.sort_index() # Note we need to sort b/c dataframe built from dictionary

def collect_stats(corpus):
    '''Takes an NLTK corpus as input.
       Returns a pandas dataframe of stats indexed to fileid.'''
    stats = {}
    for fileid in corpus.fileids():
        word_count = len(corpus.words(fileid))
        stats[fileid] = {'wordcount': word_count}
    statistics = pd.DataFrame.from_dict(stats, orient='index')
    return statistics.sort_index()

books = parse_fileids(corpus.fileids())
stats = collect_stats(corpus)
books = books.join(stats)
books.head()
```

| | nation | author | title | pubdate | gender | wordcount |
|---|---|---|---|---|---|---|
| A-Cather-Antonia-1918-F.txt | A | Cather | Antonia | 1918 | F | 97574 |
| A-Chesnutt-Marrow-1901-M.txt | A | Chesnutt | Marrow | 1901 | M | 110288 |
| A-Crane-Maggie-1893-M.txt | A | Crane | Maggie | 1893 | M | 28628 |
| A-Davis-Life_Iron_mills-1861-F.txt | A | Davis | Life Iron mills | 1861 | F | 18789 |
| A-Dreiser-Sister_Carrie-1900-M.txt | A | Dreiser | Sister Carrie | 1900 | M | 194062 |


### Read and parse NER-tagged files

The tagged NER files are in the `../Data/NER/` directory. We need a function that will parse each one and return just the locations for further processing.