# 02. Textual geography 

This session will walk through some more advanced text processing and data wrangling to produce a map of the locations mentioned in our corpus. Topics covered include:

* Named entity recognition using Stanford's CRF-NER package.
* Geolocation using Google's mapping APIs.
* Cartographic visualization, both static (for print publication) and interactive (for online use).

## Named entity recognition

There are several approaches to identifying the places used in a piece of text. We could rely on a dictionary or gazetteer, which would tell us that Edinburgh is a city in Scotland, but would also tell us that the "Charlotte" in Charlotte Brontë is a city in the United States.
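
To make the gazetteer problem concrete, here's a minimal sketch of a dictionary lookup. The two-entry `GAZETTEER` and the `gazetteer_lookup` helper are made up for illustration; a real gazetteer would have many thousands of entries, but the failure mode is the same.

```python
import re

# Toy gazetteer (hypothetical, for illustration only)
GAZETTEER = {
    'Edinburgh': 'city in Scotland',
    'Charlotte': 'city in the United States',
}

def gazetteer_lookup(text, gazetteer=GAZETTEER):
    '''Return (token, description) for every token found in the gazetteer.'''
    tokens = re.findall(r'\w+', text)
    return [(tok, gazetteer[tok]) for tok in tokens if tok in gazetteer]

hits = gazetteer_lookup('Charlotte Brontë never lived in Edinburgh.')
for token, description in hits:
    print(token, '->', description)
```

The lookup has no notion of context, so the author's first name is tagged as a place right alongside the genuine location.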

We'll instead use statistical machine learning methods. While we *will* get an intro to machine learning this afternoon, for now we'll rely on the [implementation by the Stanford NLP group](http://nlp.stanford.edu/software/CRF-NER.html).

This is a Java package. It's possible -- with a lot of work -- to invoke Java code from Python. But there's no point in this case; it's much easier to invoke the NER package from the command line and to read in the plain text output.

Note that NER, POS, and other NLP packages do exist for Python. NLTK -- the Natural Language Toolkit, which we met in the last notebook when we used it for corpus processing -- is one of the most diverse and well conceived. But it isn't optimized for speed and isn't notably accurate compared to more production-oriented offerings.

There's a guide to using the NER package on Stanford's site. Here's the short version, for reference:

```
java -mx1g -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
  -outputFormat tabbedEntities \
  -textFile file.txt > file.tsv
```

This runs the English-language classifier over a single text file. Note that we need to have a classifier trained for the language of the text we're processing. Stanford has trained models for English, Spanish, German, Chinese, and other major languages, but they're lacking French and a great many other languages.
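
To tag a whole directory rather than a single file, the same command can be assembled and run from Python. This is a sketch under assumptions, not part of the Stanford guide: `build_ner_command` and `tag_directory` are hypothetical helpers, `ner_home` is wherever you unpacked the NER package, and each output file reuses its input file's name so that tagged files can be matched to corpus files later. The `:` classpath separator is for macOS/Linux; Windows uses `;`.

```python
import subprocess
from pathlib import Path

CLASSIFIER = 'classifiers/english.all.3class.distsim.crf.ser.gz'

def build_ner_command(text_file, classifier=CLASSIFIER, memory='1g'):
    '''Return the NER command line as a list of arguments.'''
    return [
        'java', f'-mx{memory}', '-cp', '*:lib/*',
        'edu.stanford.nlp.ie.crf.CRFClassifier',
        '-loadClassifier', classifier,
        '-outputFormat', 'tabbedEntities',
        '-textFile', str(text_file),
    ]

def tag_directory(text_dir, out_dir, ner_home):
    '''Run the classifier over every .txt file in text_dir.
       Writes one tagged file per input file into out_dir.'''
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for text_file in sorted(Path(text_dir).glob('*.txt')):
        with open(out_dir / text_file.name, 'w') as out:
            # Run from ner_home so the relative classpath and model path resolve
            subprocess.run(build_ner_command(text_file.resolve()),
                           cwd=ner_home, stdout=out, check=True)

print(' '.join(build_ner_command('file.txt')))
```

Nothing is executed until `tag_directory` is called; the final line only prints the assembled command so you can check it against the shell version.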

The classifier's output follows a tabbed format that looks like this:

```
                Why did the poor poet of
Tennessee       LOCATION        , upon suddenly receiving two handfuls of silver , deliberate whether to buy him a coat , which he sadly needed , or invest his money in a pedestrian trip to
Rockaway Beach  LOCATION        ?
```

The thing to notice is that every line contains two tabs, which split it into three fields. The text before the first tab, if any, is a recognized entity. The text between the first and second tabs indicates the type of entity: PERSON, ORGANIZATION, or LOCATION. Any text following the second tab is body text, presumed not to contain any named entities. Where a line has no entity, the first two fields are empty and the line simply begins with two tabs.
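
As a quick check on that reading of the format, here's a small sketch that splits each line into its three fields. `split_ner_line` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of the output shown above with the tabs written out explicitly.

```python
SAMPLE = (
    "\t\tWhy did the poor poet of\n"
    "Tennessee\tLOCATION\t, upon suddenly receiving two handfuls of silver\n"
    "Rockaway Beach\tLOCATION\t?\n"
)

def split_ner_line(line):
    '''Split one tabbedEntities line into (entity, entity_type, text).
       Entity and type are None when the line has no tagged entity.'''
    fields = line.rstrip('\n').split('\t')
    fields += [''] * (3 - len(fields))  # Pad short or blank lines
    entity, entity_type = fields[0], fields[1]
    text = '\t'.join(fields[2:])
    return (entity or None, entity_type or None, text)

rows = [split_ner_line(line) for line in SAMPLE.splitlines()]
for row in rows:
    print(row)
```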

So let's recreate our corpus, then read in the tagged files and get a list of the locations used in the corpus.

### Recreate the corpus

In [2]:
import pandas as pd
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

text_dir = '../Data/Texts/'
corpus = PlaintextCorpusReader(text_dir, r'.*\.txt') # Raw string so \. reaches the regex engine intact

# A function to turn fileids into a table of metadata
def parse_fileids(fileids):
    '''Takes a list of file names formatted like A-Cather-Antonia-1918-F.txt.
       Returns a pandas dataframe of derived metadata.'''
    import os
    import pandas as pd
    meta = {}
    for fileid in fileids:
        file = os.path.splitext(fileid)[0] # Drop the .txt suffix (str.strip removes characters, not a suffix)
        fields = file.split('-') # Split on dashes
        fields[2] = fields[2].replace('_', ' ') # Remove underscore from titles
        fields[3] = int(fields[3])
        meta[file] = fields
    metadata = pd.DataFrame.from_dict(meta, orient='index') # Build dataframe
    metadata.columns = ['nation', 'author', 'title', 'pubdate', 'gender'] # Col names
    return metadata.sort_index() # Note we need to sort b/c dataframe built from dictionary

def collect_stats(corpus):
    '''Takes an NLTK corpus as input. 
       Returns a pandas dataframe of stats indexed to fileid.'''
    import os
    import pandas as pd
    stats = {}
    for fileid in corpus.fileids():
        word_count = len(corpus.words(fileid))
        stats[os.path.splitext(fileid)[0]] = {'wordcount': word_count} # Key on file name without suffix
    statistics = pd.DataFrame.from_dict(stats, orient='index')
    return statistics.sort_index()

books = parse_fileids(corpus.fileids())
stats = collect_stats(corpus)
books = books.join(stats)
books.index.set_names('file', inplace=True)
books.head()



| file | nation | author | title | pubdate | gender | wordcount |
|---|---|---|---|---|---|---|
| A-Cather-Antonia-1918-F | A | Cather | Antonia | 1918 | F | 97574 |
| A-Chesnutt-Marrow-1901-M | A | Chesnutt | Marrow | 1901 | M | 110288 |
| A-Crane-Maggie-1893-M | A | Crane | Maggie | 1893 | M | 28628 |
| A-Davis-Life_Iron_mills-1861-F | A | Davis | Life Iron mills | 1861 | F | 18789 |
| A-Dreiser-Sister_Carrie-1900-M | A | Dreiser | Sister Carrie | 1900 | M | 194062 |


### Read and parse NER-tagged files

The tagged NER files are in the `../Data/NER/` directory. We need a function that will parse each one and return just the locations for further processing.

In [3]:
import pandas as pd
import numpy as np

def get_loc(line):
    '''Takes a string of NER output. 
       Returns a location if found, else None.'''
    fields = line.split('\t')
    # Field 0 is the entity, field 1 its type; lines without tabs have no type field
    if len(fields) > 1 and fields[1] == 'LOCATION':
        return fields[0]
    return None

def ingest_ner(f):
    '''Take a file handle for an NER output file.
       Returns a dict of locations and counts.'''
    from collections import defaultdict
    locations = defaultdict(int) # Missing keys start at 0
    for line in f:
        loc = get_loc(line)
        if loc:
            locations[loc] += 1
    return locations

In [4]:
import os
ner_dir = '../Data/NER/'

files_list = []
locs_list = []
occurs_list = []

for fileid in sorted(corpus.fileids()):
    file = os.path.join(ner_dir, fileid)
    with open(file) as f:
        locations = ingest_ner(f)
        for loc in sorted(locations, key=locations.get, reverse=True):
            files_list.append(os.path.splitext(fileid)[0]) # File name without the .txt suffix
            locs_list.append(loc)
            occurs_list.append(locations[loc])

In [5]:
d = {'file': files_list, 
     'location': locs_list,
     'occurs': occurs_list}
geo = pd.DataFrame(d)
print(geo.describe())
geo.head()

            occurs
count  3439.000000
mean      4.178831
std      15.361014
min       1.000000
25%       1.000000
50%       1.000000
75%       2.000000
max     495.000000


|   | file | location | occurs |
|---|---|---|---|
| 0 | A-Cather-Antonia-1918-F | Black Hawk | 21 |
| 1 | A-Cather-Antonia-1918-F | Nebraska | 16 |
| 2 | A-Cather-Antonia-1918-F | Virginia | 13 |
| 3 | A-Cather-Antonia-1918-F | Omaha | 10 |
| 4 | A-Cather-Antonia-1918-F | Chicago | 10 |


In [17]:
# Total number of named location occurrences in corpus
geo.occurs.sum()

14371

In [6]:
geo = geo.join(books, on='file')
geo.head()

|   | file | location | occurs | nation | author | title | pubdate | gender | wordcount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A-Cather-Antonia-1918-F | Black Hawk | 21 | A | Cather | Antonia | 1918 | F | 97574 |
| 1 | A-Cather-Antonia-1918-F | Nebraska | 16 | A | Cather | Antonia | 1918 | F | 97574 |
| 2 | A-Cather-Antonia-1918-F | Virginia | 13 | A | Cather | Antonia | 1918 | F | 97574 |
| 3 | A-Cather-Antonia-1918-F | Omaha | 10 | A | Cather | Antonia | 1918 | F | 97574 |
| 4 | A-Cather-Antonia-1918-F | Chicago | 10 | A | Cather | Antonia | 1918 | F | 97574 |


In [7]:
places = geo.groupby('location')

In [26]:
places.occurs.sum().head() # Total mentions of each place across the corpus

location
ASIA                  1
AUGUSTUS MELMOTTE     1
Abchurch Lane        27
Abingdon              1
Abingdon Street       1
Name: occurs, dtype: int64

In [27]:
places.occurs.size().head()

location
ASIA                 1
AUGUSTUS MELMOTTE    1
Abchurch Lane        1
Abingdon             1
Abingdon Street      1
dtype: int64

In [45]:
place_counts = pd.DataFrame(places.occurs.sum()) # Total mentions per place
place_counts['volumes'] = places.occurs.size()
print(place_counts.describe())
place_counts.head()

            occurs      volumes
count  2365.000000  2365.000000
mean      6.076533     1.454123
std      30.206909     1.580908
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.000000     1.000000
75%       3.000000     1.000000
max    1042.000000    23.000000


| location | occurs | volumes |
|---|---|---|
| ASIA | 1 | 1 |
| AUGUSTUS MELMOTTE | 1 | 1 |
| Abchurch Lane | 27 | 1 |
| Abingdon | 1 | 1 |
| Abingdon Street | 1 | 1 |
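
The two columns measure different things: `occurs` adds up every mention of a place across the corpus, while `volumes` counts how many books mention it at all. Here's a plain-Python sketch of the same two aggregates on made-up rows (the book names and counts are invented for illustration). It relies on each (file, location) pair appearing only once, which holds for `geo` because it was built from one dictionary per file.

```python
from collections import defaultdict

# (file, location, occurs) rows, shaped like the geo table; values are made up
rows = [
    ('book_a', 'London', 12),
    ('book_a', 'Paris', 2),
    ('book_b', 'London', 5),
    ('book_c', 'London', 1),
]

occurs = defaultdict(int)   # Total mentions per place (like .sum())
volumes = defaultdict(int)  # Number of books mentioning the place (like .size())
for file, location, count in rows:
    occurs[location] += count
    volumes[location] += 1

for location in sorted(occurs):
    print(location, occurs[location], volumes[location])
```

A place like Abchurch Lane, with 27 mentions in a single volume, scores high on the first measure and low on the second, which is why both are worth keeping.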


In [47]:
lookups = place_counts[(place_counts['occurs']>5) | (place_counts['volumes']>2)]
lookups.describe()

|   | occurs | volumes |
|---|---|---|
| count | 441.0 | 441.0 |
| mean | 25.795918 | 3.031746 |
| std | 66.475861 | 3.161399 |
| min | 3.0 | 1.0 |
| 25% | 7.0 | 1.0 |
| 50% | 11.0 | 2.0 |
| 75% | 22.0 | 4.0 |
| max | 1042.0 | 23.0 |


In [54]:
lookups.head()

| location | occurs | volumes |
|---|---|---|
| Abchurch Lane | 27 | 1 |
| Adele | 14 | 1 |
| Africa | 36 | 8 |
| Alabama | 6 | 1 |
| Albany | 7 | 4 |


In [57]:
# Parameters for geocoding clients
# NOTE: Per-second query rates will quickly use up daily API quota.
#   Need to be careful not to exceed daily quota

import sys
import googlemaps

gc_rate  =     50 # Geocoding queries per second
pl_rate  =      5 # Places queries per second
api_key_file = '/Users/mwilkens/Google Drive/Private/google-geo-api-key.txt'

# Get API key from file
try:
    with open(api_key_file) as f:
        api_key = f.read().strip()
except OSError:
    sys.exit('Cannot get Google geocoding API key. Exiting.')

gc_client = googlemaps.Client(key=api_key, queries_per_second=gc_rate) # For Geocoding API
pl_client = googlemaps.Client(key=api_key, queries_per_second=pl_rate) # For Places API
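
Since the daily quota is the real constraint, it can help to cache answers and count calls so that a repeated place name never costs a second query. What follows is a sketch, not part of the `googlemaps` library: `CachedLookup` and `QuotaExceeded` are hypothetical names, the quota figure is a placeholder you'd set from your own account limits, and the counter doesn't reset itself at midnight.

```python
class QuotaExceeded(Exception):
    '''Raised when the call budget for this session is used up.'''

class CachedLookup:
    '''Wrap any lookup function with a cache and a hard cap on real calls.'''
    def __init__(self, lookup_fn, daily_quota):
        self.lookup_fn = lookup_fn
        self.daily_quota = daily_quota
        self.calls_made = 0
        self.cache = {}

    def __call__(self, query):
        if query in self.cache:            # Cached answers are free
            return self.cache[query]
        if self.calls_made >= self.daily_quota:
            raise QuotaExceeded('Budget of %d calls used up' % self.daily_quota)
        self.calls_made += 1
        result = self.lookup_fn(query)
        self.cache[query] = result
        return result

# Stand-in for a real API call, so the sketch runs without a key
def fake_lookup(name):
    return 'id-for-' + name

lookup = CachedLookup(fake_lookup, daily_quota=2)
print(lookup('London'), lookup('London'), lookup('Paris'))
print('Real calls made:', lookup.calls_made)
```

In the notebook, the wrapped function would be something like `lambda s: get_placeid(s, pl_client)`, once that function is defined.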

In [58]:
# A function to get place_ids from strings
def get_placeid(string, api_client):
    '''Takes a string and an established googlemaps Places API client.
       Returns the first place_id associated with that string.
       If no place_id is found (zero results, any other status, or a failed request), returns None.'''
    try:
        place = api_client.places(string)
        status = place['status']
        if status == 'OK':
            place_id = place['results'][0]['place_id']
        elif status == 'ZERO_RESULTS':
            place_id = None # Query succeeded but matched nothing
        else:
            place_id = None # Any other status code
    except Exception:
        place_id = None # Request failed or response was malformed
    return place_id

def process_id(placeid, api_client):
    '''Takes a Google place_id and an established googlemaps geocoding API client.
        Looks up and parses geo data for placeid.
        Returns int code on error, else dictionary of geo data keyed to placeid.
    '''
    # Define all variables, initialized to None
    formatted_address = None
    location_type = None
    country_long = None
    country_short = None
    admin_1_long = None
    admin_1_short = None
    admin_2 = None
    locality = None
    colloquial_area = None
    continent = None
    natural_feature = None
    point_of_interest = None
    lat = None
    lon = None
    partial = None
    
    # Perform reverse geocode. Note this needs googlemaps v 2.4.2-dev0 or higher
    try:
        data = api_client.reverse_geocode(placeid) # Use the client passed in, not a global
    except Exception:
        return 1 # Problem with geocoding API call
    
    # Use the first result. Should only be one when reverse geocoding with place_id.
    try:
        data = data[0]
        formatted_address = data['formatted_address']
        location_type = data['types'][0]
        lat = data['geometry']['location']['lat']
        lon = data['geometry']['location']['lng']
        partial = data.get('partial_match', False) # Key is only present on partial matches
    except Exception:
        print("   Bad geocode for place_id %s" % (placeid))
        return 2 # Problem with basic geocode result
    
    try:
        for addr_comp in data['address_components']:
            comp_type = addr_comp['types'][0]
            if comp_type == 'locality':
                locality = addr_comp['long_name']
            elif comp_type == 'country':
                country_long = addr_comp['long_name']
                country_short = addr_comp['short_name']
            elif comp_type == 'administrative_area_level_1':
                admin_1_long = addr_comp['long_name']
                admin_1_short = addr_comp['short_name']
            elif comp_type == 'administrative_area_level_2':
                admin_2 = addr_comp['long_name']
            elif comp_type == 'colloquial_area':
                colloquial_area = addr_comp['long_name']
            elif comp_type == 'natural_feature':
                natural_feature = addr_comp['long_name']
            elif comp_type == 'point_of_interest':
                point_of_interest = addr_comp['long_name']
            elif comp_type == 'continent':
                continent = addr_comp['long_name']
    except Exception:
        return 3 # Problem with address components
     
    ### Code to build results dictionary here
    
    return 0
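
The dictionary-building step is left as a placeholder above. One possible way to fill it in, offered as a sketch rather than the notebook's actual code, is to drive the parsing from a lookup table instead of an `elif` chain. `COMPONENT_FIELDS` and `build_geo_record` are hypothetical names, and `sample` is a hand-written stand-in shaped like a single geocoding result, with illustrative values rather than real API output.

```python
# Which output fields each address-component type fills, and from which name
COMPONENT_FIELDS = {
    'locality': [('locality', 'long_name')],
    'country': [('country_long', 'long_name'), ('country_short', 'short_name')],
    'administrative_area_level_1': [('admin_1_long', 'long_name'),
                                    ('admin_1_short', 'short_name')],
    'administrative_area_level_2': [('admin_2', 'long_name')],
    'colloquial_area': [('colloquial_area', 'long_name')],
    'natural_feature': [('natural_feature', 'long_name')],
    'point_of_interest': [('point_of_interest', 'long_name')],
    'continent': [('continent', 'long_name')],
}

def build_geo_record(placeid, data):
    '''Flatten one geocoding result into a dict of geo data keyed to placeid.'''
    # Start with every component field set to None
    record = {key: None for pairs in COMPONENT_FIELDS.values() for key, _ in pairs}
    location = data['geometry']['location']
    record.update({
        'formatted_address': data['formatted_address'],
        'location_type': data['types'][0],
        'lat': location['lat'],
        'lon': location['lng'],
        'partial': data.get('partial_match', False),
    })
    # Fill in whichever components this result actually has
    for comp in data['address_components']:
        for key, name in COMPONENT_FIELDS.get(comp['types'][0], []):
            record[key] = comp[name]
    return {placeid: record}

# Hand-written stand-in for one geocoding result (values are illustrative)
sample = {
    'formatted_address': 'Edinburgh, UK',
    'types': ['locality', 'political'],
    'geometry': {'location': {'lat': 55.95, 'lng': -3.19}},
    'address_components': [
        {'long_name': 'Edinburgh', 'short_name': 'Edinburgh',
         'types': ['locality', 'political']},
        {'long_name': 'United Kingdom', 'short_name': 'GB',
         'types': ['country', 'political']},
    ],
}

geo_data = build_geo_record('EXAMPLE_PLACE_ID', sample)
print(geo_data['EXAMPLE_PLACE_ID'])
```

Keying the result to the place_id matches what the `process_id` docstring promises, and makes it easy to merge many results into one dictionary before building a dataframe.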