### Steps

1. Load in the headline data and examine it for any data quality issues.
  * Use any library/data structure to read in the headlines.
  * Read through some of the headlines and identify potential problems.
1. Using regular expressions and the cities and countries within the `geonamescache` library, match any cities/countries within each headline.
  * Make sure to normalize headlines and city/country names by removing accent marks. This can be done with the `unidecode` library.
  * Watch out for multiple cities in a headline and for matches on short words. The match should cover the entire city name (for example, San Marino), not just part of it (San).
1. Put the extracted data into a pandas `DataFrame` with three columns: `headline`, `city`, and `country`.
1. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
  * One method for finding problems is to look for the most common names and see if there are any issues.
1. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.
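
The normalization and whole-name matching described in step 2 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it uses the standard-library `unicodedata` module as a stand-in for `unidecode` (the two do not behave identically on every character), and `strip_accents`, `first_match`, and the tiny name list are hypothetical.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining accent marks (stdlib stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_match(headline: str, names: list) -> str:
    """Return the longest name that appears as a whole word, else ''."""
    for name in sorted(names, key=len, reverse=True):
        # re.escape keeps any punctuation in the name literal
        if re.search(r"\b" + re.escape(name) + r"\b", headline):
            return name
    return ""


# Hypothetical names; the real list would come from geonamescache.
names = [strip_accents(n) for n in ["San", "San Marino", "São Paulo", "Man"]]

print(first_match(strip_accents("Zika arrives in San Marino"), names))
print(repr(first_match(strip_accents("Zika alert in Manila"), names)))
```

Trying longer names first is what lets "San Marino" win over "San", and the word boundaries are what stop "Man" from matching inside "Manila".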

### References

* https://pypi.org/project/geonamescache/

In [84]:
import os
import urllib.request
import pandas as pd
import re
import geonamescache
from unidecode import unidecode
from typing import List



In [25]:
HEADLINES = 'https://livevideo-resources.s3-us-west-2.amazonaws.com/course-data/93/headlines.txt'
HD_LOCAL  = 'data/headlines.txt'

In [26]:
os.makedirs('./data', exist_ok=True)

In [27]:
urllib.request.urlretrieve(HEADLINES, HD_LOCAL)

('data/headlines.txt', <http.client.HTTPMessage at 0x7fad647b9400>)

In [28]:
with open(HD_LOCAL, encoding='utf-8') as f:  # close the file once it has been read
    headlines_list = [line.strip() for line in f]

In [29]:
headlines_list[:10]

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil',
 'Dallas man comes down with case of Zika',
 'Trinidad confirms first Zika case',
 'Zika Concerns are Spreading in Houston',
 'Geneve Scientists Battle to Find Cure',
 'The CDC in Atlanta is Growing Worried',
 'Zika Infested Monkeys in Sao Paulo']

In [30]:
headlines_list[-10:]

['Authorities are Worried about the Spread of Varicella in Clovis',
 'More Zika patients reported in Fort Worth',
 'Zika symptoms spotted in Boynton Beach',
 'Outbreak of Zika in Portoviejo',
 'Influenza Exposure in Muscat',
 'Rumors about Rabies spreading in Jerusalem have been refuted',
 'More Zika patients reported in Indang',
 'Suva authorities confirmed the spread of Rotavirus',
 'More Zika patients reported in Bella Vista',
 'Zika Outbreak in Wichita Falls']

In [31]:
zika_headlines = [hl for hl in headlines_list if re.search(r'zika', hl, re.I)]
flu_headlines = [hl for hl in headlines_list if re.search(r'flu', hl, re.I)]

In [32]:
len(zika_headlines)

227

In [33]:
len(flu_headlines)

28

In [34]:
len(headlines_list)

650

### Possible Issues

* Heavily unbalanced with respect to disease: 227 of the 650 headlines mention Zika, but only 28 match `flu`.
* Some headlines with no locations
   - 'The CDC in Atlanta is Growing Worried'
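
A few other quality checks can be automated. The sketch below looks for duplicate, non-ASCII, and empty headlines. The `sample` list is hypothetical; in the notebook the same checks would presumably be run on `headlines_list`.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be headlines_list.
sample = [
    "Zika Outbreak Hits Miami",
    "Zika alert – Manila now threatened",
    "Zika Outbreak Hits Miami",
    "",
]

# Headlines that occur more than once
duplicates = [h for h, n in Counter(sample).items() if n > 1]
# Headlines containing characters outside ASCII (accents, dashes, ...)
non_ascii = [h for h in sample if not h.isascii()]
# Positions of blank lines
empty = [i for i, h in enumerate(sample) if not h.strip()]

print(duplicates)
print(non_ascii)
print(empty)
```

Non-ASCII headlines are not errors in themselves, but they show where normalization will change the text.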

### Parse out Cities and Countries

In [85]:
#convert to ascii
headlines_list = [ unidecode(hl) for hl in headlines_list]

In [4]:
gc = geonamescache.GeonamesCache() 

In [86]:
city_names = [ unidecode(v['name']) for v in gc.get_cities().values() ]
city_names.sort() # alphabetical order (only breaks ties between names of equal length)
city_names.sort(key=len, reverse=True) # longest first, so 'Miami Beach' is tried before 'Miami'

In [87]:
country_names = [ unidecode(v['name']) for v in gc.get_countries().values() ]
country_names.sort() # alphabetical order (only breaks ties between names of equal length)
country_names.sort(key=len, reverse=True) # longest first, so the most specific name wins

In [50]:
country_names[:20]

['South Georgia and the South Sandwich Islands',
 'United States Minor Outlying Islands',
 'Bonaire, Saint Eustatius and Saba ',
 'Heard Island and McDonald Islands',
 'Democratic Republic of the Congo',
 'Saint Vincent and the Grenadines',
 'British Indian Ocean Territory',
 'French Southern Territories',
 'Saint Pierre and Miquelon',
 'Central African Republic',
 'Northern Mariana Islands',
 'Turks and Caicos Islands',
 'Bosnia and Herzegovina',
 'British Virgin Islands',
 'Svalbard and Jan Mayen',
 'Palestinian Territory',
 'Republic of the Congo',
 'Saint Kitts and Nevis',
 'Sao Tome and Principe',
 'Serbia and Montenegro']

In [54]:
city_names[:20]

['Chak Two Hundred Forty-nine Thal Development Authority',
 'Dolores Hidalgo Cuna de la Independencia Nacional',
 'Ampliación San Mateo (Colonia Solidaridad)',
 'Licenciado Benito Juárez (Campo Gobierno)',
 'Sant Pere, Santa Caterina i La Ribera',
 'Nanchital de Lázaro Cárdenas del Río',
 'Palikir - National Government Center',
 'San Fernando del Valle de Catamarca',
 'San Martin Texmelucan de Labastida',
 "el Camp d'en Grassot i Gràcia Nova",
 'Acilia-Castel Fusano-Ostia Antica',
 'Chak One Hundred Twenty Nine Left',
 'Brandýs nad Labem-Stará Boleslav',
 'Delegación Cuajimalpa de Morelos',
 'Montecchio Maggiore-Alte Ceccato',
 'Rosignano Solvay-Castiglioncello',
 'Socorro Mission Number 1 Colonia',
 'Sydney Central Business District',
 'Jardines de la Silla (Jardines)',
 'Marina di Ardea-Tor San Lorenzo']

In [88]:
def entity_in_headline(h: str, names_list: list) -> str:
    """Return the first name in names_list found in h as a whole word, else ''."""
    for c in names_list:
        # re.escape keeps characters such as '.' and '(' in a name literal
        pat = r'\b{}\b'.format(re.escape(c))
        if re.search(pat, h): #this takes a long time (~10 mins)
            return c
    return ''

In [65]:
#debugging
#city_names.insert(0, 'New York City')
print(entity_in_headline(headlines_list[1], city_names))


New York City


In [89]:
#use the same index as headlines_list
cities_found = list()
countries_found = list()

for h in headlines_list:
    cities_found.append(entity_in_headline(h, city_names))
    countries_found.append(entity_in_headline(h, country_names))
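
Running one regular expression search per name is slow. One possible alternative, sketched below, is to compile all names into a single alternation pattern, which may run considerably faster (this has not been measured on this data). `findall` then also reports every location in a headline instead of only the first. `build_matcher` and the miniature `sample_cities` list are hypothetical.

```python
import re


def build_matcher(names):
    """Compile one regex that matches any of the names as a whole word."""
    # Longest names first, so 'Miami Beach' is preferred over 'Miami'.
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


# Hypothetical miniature name list; the real one comes from geonamescache.
sample_cities = ["Miami", "Miami Beach", "Recife", "Man", "Manila"]
matcher = build_matcher(sample_cities)

print(matcher.findall("First Case of Zika in Miami Beach"))
print(matcher.findall("Zika spreads from Recife to Manila"))
```

The pattern is compiled once and reused for every headline. Names that end in a non-word character, such as a closing parenthesis, would need different handling than a trailing `\b`.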

In [90]:
# Matches are anchored at word boundaries, so "Man" does not match inside "Manila"
cities_found[:15]

['Miami',
 'New York City',
 'Miami Beach',
 'Recife',
 'Dallas',
 'Trinidad',
 'Houston',
 'Geneve',
 'Atlanta',
 'Sao Paulo',
 'Brownsville',
 'St. Louis',
 'San Juan',
 'Galveston',
 'Manila']

In [47]:
headlines_list[:15]

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil',
 'Dallas man comes down with case of Zika',
 'Trinidad confirms first Zika case',
 'Zika Concerns are Spreading in Houston',
 'Geneve Scientists Battle to Find Cure',
 'The CDC in Atlanta is Growing Worried',
 'Zika Infested Monkeys in Sao Paulo',
 'Brownsville teen contracts Zika virus',
 'Mosquito control efforts in St. Louis take new tactics with Zika threat',
 'San Juan reports 1st U.S. Zika-related death amid outbreak',
 'Flu outbreak in Galveston, Texas',
 'Zika alert – Manila now threatened']

### Create the Data Frame

In [92]:
df = pd.DataFrame(list(zip(headlines_list, cities_found, countries_found)), columns=['Headline', 'City', 'Country'])
df  # keep a reference so the frame can be checked and saved later

| | Headline | City | Country |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | Miami | |
| 1 | Could Zika Reach New York City? | New York City | |
| 2 | First Case of Zika in Miami Beach | Miami Beach | |
| 3 | Mystery Virus Spreads in Recife, Brazil | Recife | Brazil |
| 4 | Dallas man comes down with case of Zika | Dallas | |
| ... | ... | ... | ... |
| 645 | Rumors about Rabies spreading in Jerusalem hav... | Jerusalem | |
| 646 | More Zika patients reported in Indang | Indang | |
| 647 | Suva authorities confirmed the spread of Rotav... | Suva | |
| 648 | More Zika patients reported in Bella Vista | Bella Vista | |
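
To check the extraction as step 4 suggests, the most common names and any very short names can be listed for manual review. The sketch below uses `collections.Counter` on a hypothetical `cities` list; in the notebook the input would presumably be `cities_found`.

```python
from collections import Counter

# Hypothetical extraction results; '' marks a headline with no match.
cities = ["Miami", "Of", "Miami Beach", "Of", "", "Recife", "Of"]

# Most frequent non-empty names: an unexpected leader hints at false matches
counts = Counter(c for c in cities if c)
for name, n in counts.most_common(3):
    print(name, n)

# Very short names are worth checking by hand
suspicious = sorted({c for c in cities if 0 < len(c) <= 3})
print(suspicious)

# Share of headlines with no city found
unmatched = sum(1 for c in cities if not c) / len(cities)
print(round(unmatched, 2))
```

A short name that dominates the counts is a hint that an ordinary word in the headline was mistaken for a place name.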
