## Named Entity Recognition in U.S. State of the Union Addresses, 1945-2006

First, we'll use Anaconda on a single machine to process a small corpus of State of the Union (SOTU) addresses. This task is "embarrassingly parallel": it can be split into independent subtasks that run in parallel threads, with no communication required among them. Next, we'll load a much larger corpus into a Hadoop cluster and analyze it using Apache Spark. To drive the parallel analysis, we'll use Anaconda for Cluster Management to deploy the necessary Python packages across cluster nodes.
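
As a rough illustration of what "embarrassingly parallel" means here, the sketch below maps a per-document function over several texts with a thread pool. The `count_words` function and the sample texts are hypothetical stand-ins, not the notebook's NER step. Each call depends only on its own input, so the workers need no coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """Stand-in for per-document work; depends only on its own input."""
    return len(text.split())

# Hypothetical sample documents (not from the SOTU corpus)
documents = [
    "the state of the union is strong",
    "we will work with our allies",
    "peace and prosperity",
]

# Each document is processed independently; map() returns results in input order
with ThreadPoolExecutor(max_workers=3) as pool:
    word_counts = list(pool.map(count_words, documents))

print(word_counts)
```

In CPython, threads mainly help with I/O-bound work. For CPU-bound steps, `ProcessPoolExecutor` offers the same `map` interface with separate processes.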

In [1]:
import nltk
import pandas as pd

import geograpy

### Setup and Mapping

The data directory contains past SOTU addresses, which can be downloaded with the NLTK corpus utility.

In [2]:
DIR = "/Users/zcarwile/nltk_data/corpora/state_union/"

In [3]:
words_df = pd.DataFrame(
    columns=['year','president','words'],
)
words_df[['year', 'words']] = words_df[['year', 'words']].astype(int)

countries_df = pd.DataFrame(
    columns=['year','country','region','mentions']
)
countries_df[['year', 'mentions']] = countries_df[['year', 'mentions']].astype(int)

Geograpy identifies country mentions, and we would like to group the mentions into regions.  This process is somewhat arbitrary.  Let's make a custom dictionary.

In [4]:
country_region_map = {
    'Canada': 'North America', 
    'Cambodia': 'Asia', 
    'Ethiopia': 'Africa', 
    'Argentina': 'South America', 
    'Bahrain': 'Middle East', 
    'Saudi Arabia': 'Middle East', 
    'Guatemala': 'North America', 
    'Bosnia and Herzegovina': 'Europe', 
    'Russian Federation': 'Europe', 
    'Germany': 'Europe', 
    'Spain': 'Europe', 
    'Netherlands': 'Europe', 
    'Christmas Island': 'Other', 
    'New Zealand': 'Other', 
    'Yemen': 'Middle East', 
    'Pakistan': 'Middle East', 
    'Viet Nam': 'Asia', 
    'Saint Vincent and the Grenadines': 'Other', 
    'Kenya': 'Africa', 
    'Turkey': 'Europe', 
    'Afghanistan': 'Middle East', 
    'Czech Republic': 'Europe', 
    'Solomon Islands': 'Other', 
    'India': 'Asia', 
    'France': 'Europe', 
    'Somalia': 'Africa', 
    'Peru': 'South America', 
    'Norway': 'Europe', 
    'Singapore': 'Asia', 
    'Iran, Islamic Republic of': 'Middle East', 
    'China': 'Asia', 
    'Micronesia, Federated States of': 'Other', 
    'Ukraine': 'Europe', 
    'Finland': 'Europe', 
    'Indonesia': 'Asia', 
    'Central African Republic': 'Africa', 
    'United States': 'North America', 
    'Sweden': 'Europe', 
    'Belarus': 'Europe', 
    'Bulgaria': 'Europe', 
    'Romania': 'Europe', 
    'Angola': 'Africa', 
    'French Southern Territories': 'Other', 
    'Portugal': 'Europe', 
    'South Africa': 'Africa', 
    'Cyprus': 'Middle East', 
    'Venezuela, Bolivarian Republic of': 'South America', 
    'Austria': 'Europe', 
    'Japan': 'Asia', 
    'Brazil': 'South America', 
    'Kuwait': 'Middle East', 
    'Panama': 'North America', 
    'Korea, Republic of': 'Asia', 
    'Costa Rica': 'North America', 
    'Bahamas': 'North America', 
    'Ireland': 'Europe', 
    'Nigeria': 'Africa', 
    'Australia': 'Other', 
    'Chile': 'South America', 
    'Puerto Rico': 'North America', 
    'Belgium': 'Europe', 
    'Thailand': 'Asia', 
    'Haiti': 'North America', 
    'Iraq': 'Middle East', 
    'Georgia': 'Europe', 
    'Denmark': 'Europe', 
    'Poland': 'Europe', 
    'Morocco': 'Africa', 
    'Namibia': 'Africa', 
    'Switzerland': 'Europe', 
    'Grenada': 'Other', 
    'Tanzania, United Republic of': 'Africa', 
    'Uruguay': 'South America', 
    'Lebanon': 'Middle East', 
    'Uzbekistan': 'Asia', 
    'Colombia': 'South America', 
    'Nicaragua': 'North America', 
    'Italy': 'Europe', 
    'Israel': 'Middle East', 
    'Iceland': 'Europe', 
    'Zimbabwe': 'Africa', 
    'Jordan': 'Middle East', 
    'Philippines': 'Asia', 
    'British Indian Ocean Territory': 'Other', 
    "Korea, Democratic People's Republic of": 'Asia', 
    'Trinidad and Tobago': 'Other', 
    'Hungary': 'Europe', 
    'Mexico': 'North America', 
    'Egypt': 'Middle East', 
    'Cuba': 'North America', 
    'United Kingdom': 'Europe', 
    'Antarctica': 'Other', 
    'Congo': 'Africa', 
    'Greece': 'Europe'
}
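
Because the map is written by hand, two things are worth checking: that every region label belongs to the intended set, and what happens for a country that is missing from the map. The sketch below uses a tiny hypothetical map (`sample_map`, with one deliberately misspelled label) and an assumed set of allowed labels to show both checks.

```python
# Assumed set of region labels; adjust to match the grouping you choose
ALLOWED_REGIONS = {
    "North America", "South America", "Europe", "Asia",
    "Africa", "Middle East", "Other",
}

# Hypothetical sample with one deliberately misspelled label
sample_map = {
    "Canada": "North America",
    "Japan": "Asia",
    "Oman": "Middle Eest",
}

# Report any entry whose region is not in the allowed set
bad_labels = {c: r for c, r in sample_map.items() if r not in ALLOWED_REGIONS}
print(bad_labels)

# dict.get returns None for unmapped countries unless a default is given
print(sample_map.get("Atlantis"))
print(sample_map.get("Atlantis", "Other"))
```

Passing an explicit default such as `"Other"` keeps unmapped countries from ending up with a missing region in the results.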

### Entity Extraction

To identify country mentions in text, we use a library called **geograpy**.

In [None]:
def getWordCount(text):
    return len(nltk.word_tokenize(text))

In [None]:
def getPlaces(text):
    # returns a place context object; its country_mentions attribute is used below
    places = geograpy.get_place_context(text=text)
    return places

In [5]:
import time
start_time = time.time()

# separate into 3 sections:
# 1. NER
# 2. Word count
# 3. Mapping

i = 0
j = 0
for id in nltk.corpus.state_union.fileids():
    
    # add word count to DataFrame
    year = id[0:4]
    year_int = int(year)
    president = id.split("-")[1].split(".")[0]
    words = len(nltk.corpus.state_union.words(id))
    words_df.loc[i] = [year_int,president,words]
    i = i + 1
    
    # extract places and add to DataFrame
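    # Python 3: decode as UTF-8 while reading, skipping undecodable bytes
    with open(DIR + id, 'r', encoding='utf-8', errors='ignore') as myfile:
        # replace newlines with spaces so words on adjacent lines are not fused
        text = myfile.read().replace('\n', ' ')
        places = geograpy.get_place_context(text=text)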

        for country in places.country_mentions:
            region = country_region_map.get(str(country[0]))
            countries_df.loc[j] = [year_int,country[0],region,country[1]]
            j = j + 1
    
    # Execution control -- show message every ~5 files
    if year[3] in ['0','5']:
        print(year_int)
    # for testing only
    # if i > 3:
    #    break
    
print("--- %s seconds ---" % (time.time() - start_time))

1945
1950
1955
1960
1965
1965
1970
1975
1980
1985
1990
1995
2000
2005
--- 338.850814819 seconds ---
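
The loop above derives the year and president from each file id by slicing and splitting. A minimal sketch of that parsing as a standalone function follows. `parse_fileid` is a hypothetical helper, and the sample ids are assumed to follow the `YEAR-President[-N].txt` pattern that the slicing code implies.

```python
def parse_fileid(fileid):
    """Split an id such as '1991-Bush-1.txt' into (year, president)."""
    stem = fileid.rsplit(".", 1)[0]      # drop the extension
    parts = stem.split("-")              # e.g. ['1991', 'Bush', '1']
    return int(parts[0]), parts[1]

# Assumed example ids; the optional numeric suffix is ignored by the parser
for fileid in ["1945-Truman.txt", "1991-Bush-1.txt", "1991-Bush-2.txt"]:
    print(parse_fileid(fileid))
```

Two ids that share a year parse to the same `(year, president)` pair, so such a year contributes two rows with the same `year` value.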



### Document statistics

In [4]:
words_df[words_df['year'] == 1991]

|    | year | president | words |
|---:|-----:|:----------|------:|
| 47 | 1991 | Bush | 4661 |
| 48 | 1991 | Bush | 3357 |


In [13]:
countries_df[countries_df['year'] == 1991]

|     | year | country | region | mentions |
|----:|-----:|:--------|:-------|---------:|
| 469 | 1991 | Iraq | Middle East | 6 |
| 470 | 1991 | Kuwait | Middle East | 2 |
| 471 | 1991 | Bahrain | Middle East | 1 |
| 472 | 1991 | Israel | Middle East | 1 |
| 473 | 1991 | French Southern Territories | Other | 1 |
| 474 | 1991 | United Kingdom | Europe | 1 |
| 475 | 1991 | United States | North America | 1 |
| 476 | 1991 | Spain | Europe | 1 |
| 477 | 1991 | India | Asia | 1 |
| 478 | 1991 | Costa Rica | North America | 1 |


In [6]:
whole_df = words_df.merge(countries_df, on='year', how='left')
whole_df[whole_df['year'] == 1991]

|     | year | president | words | country | region | mentions |
|----:|-----:|:----------|------:|:--------|:-------|---------:|
| 502 | 1991 | Bush | 4661 | Iraq | Middle East | 6 |
| 503 | 1991 | Bush | 4661 | Kuwait | Middle East | 2 |
| 504 | 1991 | Bush | 4661 | Bahrain | Middle East | 1 |
| 505 | 1991 | Bush | 4661 | Israel | Middle East | 1 |
| 506 | 1991 | Bush | 4661 | French Southern Territories | Other | 1 |
| 507 | 1991 | Bush | 4661 | United Kingdom | Europe | 1 |
| 508 | 1991 | Bush | 4661 | United States | North America | 1 |
| 509 | 1991 | Bush | 4661 | Spain | Europe | 1 |
| 510 | 1991 | Bush | 4661 | India | Asia | 1 |
| 511 | 1991 | Bush | 4661 | Costa Rica | North America | 1 |
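
A left merge on `year` keeps every row of the left frame and pairs it with every matching row of the right frame, so a year with two addresses and several country rows fans out into all combinations. The sketch below shows this on made-up numbers. The frames `words` and `countries` are hypothetical, not the notebook's data.

```python
import pandas as pd

# Hypothetical word counts: 1991 appears twice (two addresses in one year)
words = pd.DataFrame({
    "year": [1990, 1991, 1991],
    "president": ["Bush", "Bush", "Bush"],
    "words": [100, 200, 300],
})

# Hypothetical country mentions, recorded for 1991 only
countries = pd.DataFrame({
    "year": [1991, 1991],
    "country": ["Iraq", "Kuwait"],
    "mentions": [6, 2],
})

merged = words.merge(countries, on="year", how="left")

# 1990 has no match and keeps one row with missing country/mentions;
# each of the two 1991 rows is paired with both country rows
print(merged)
print(len(merged))
```

If mentions should be tied to a single address rather than to a year, merging on a per-file key instead of `year` avoids this fan-out.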


### Visualization with Bokeh

In [None]:
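# bokeh.charts requires an older Bokeh release; it was later split out into the separate bkcharts package
from bokeh.charts import Bar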
from bokeh.plotting import ColumnDataSource
from bokeh.io import output_notebook, show
from bokeh.models import HoverTool
output_notebook()

In [7]:


TOOLS = 'box_zoom,box_select,hover,resize,reset'

p = Bar(words_df, label='year',values='words', color='president', tools=TOOLS, width=1000, height=400, ylabel="Words")
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
                    ("year", "@year"),
                    ("president", "@president")
                 ]

show(p)

In [8]:
p = Bar(countries_df, label='year', values='mentions', stack='region', agg='sum', legend='top_left', 
        tools=TOOLS, width=1000, height=400, ylabel="Country Mentions")

hover = p.select(dict(type=HoverTool))
hover.tooltips = [
                    ("year", "@year"),
                    ("region", "@region"),
                    ("president", "@president")  # '@' reads a data column; countries_df itself has no 'president' column
                 ]

show(p)
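
The `stack='region'` and `agg='sum'` arguments appear to ask the chart to total mentions per year and region before stacking. As a sketch of that aggregation outside the plotting call, the following plain-Python version sums hypothetical `(year, region, mentions)` rows. The sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (year, region, mentions) rows, not taken from the corpus
rows = [
    (1990, "Europe", 3),
    (1990, "Europe", 2),
    (1990, "Asia", 1),
    (1991, "Middle East", 6),
    (1991, "Middle East", 2),
]

# Sum mentions for each (year, region) pair
totals = defaultdict(int)
for year, region, mentions in rows:
    totals[(year, region)] += mentions

# Print the totals in (year, region) order
for key in sorted(totals):
    print(key, totals[key])
```

Checking the totals this way gives a quick sanity test for bar heights that is independent of the plotting library.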

### Attic

In [1]:
#countries_df.loc[whole_df['year'] == 1991]

In [2]:
#nltk.corpus.state_union.words()

In [3]:

#FILE='1991-GWBush.txt'
#with open(DIR + FILE, 'r') as myfile:
#    text=myfile.read().replace('\n', '')
#print(text)