### Steps

1. Load in the headline data and examine it for any data quality issues.
   * Use any library/data structure to read in the headlines.
   * Read through some of the headlines and identify potential problems.
1. Using regular expressions and the cities and countries within the `geonamescache` library, match any cities/countries within each headline.
   * Normalize both the headlines and the city/country names by removing accent marks, for example with the `unidecode` library.
   * Watch out for multiple cities in a headline and for matches on short words. The match should cover the entire city name (for example, San Marino), not a fragment of it (San).
1. Put the extracted data into a pandas `DataFrame` with three columns: `headline`, `city`, `country`.
1. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
   * One method for finding problems is to look for the most common names and see if there are any issues.
1. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.
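
The matching step above can be sketched as follows. This is a minimal illustration, not the project's reference solution: `sample_cities` is a hypothetical hand-written list standing in for the names from `geonamescache`, and the standard library's `unicodedata` is swapped in for `unidecode` so the snippet has no third-party dependencies. Sorting names longest first and wrapping the alternation in word boundaries is one way to prefer "San Marino" over "San".

```python
import re
import unicodedata


def strip_accents(text):
    """Remove accent marks by dropping Unicode combining characters."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample; the real names would come from geonamescache.
sample_cities = ['San', 'San Marino', 'Miami', 'Miami Beach', 'São Paulo']


def build_pattern(names):
    """Compile one regex that prefers the longest name at each position."""
    normalized = {strip_accents(name) for name in names}
    # Longest first, so 'Miami Beach' is tried before 'Miami'.
    ordered = sorted(normalized, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in ordered)
    # Word boundaries stop 'San' from matching inside 'Santa'.
    return re.compile(r'\b(?:' + alternation + r')\b')


city_pattern = build_pattern(sample_cities)


def find_cities(headline):
    """Return every city name found in the accent-stripped headline."""
    return city_pattern.findall(strip_accents(headline))
```

Because `find_cities` returns a list, a headline that mentions several cities stays visible, and deciding which one to keep becomes an explicit choice.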

In [11]:
import os
import urllib.request
import pandas as pd
import re



In [3]:
HEADLINES = 'https://livevideo-resources.s3-us-west-2.amazonaws.com/course-data/93/headlines.txt'
HD_LOCAL  = 'data/headlines.txt'

In [2]:
os.makedirs('./data', exist_ok=True)

In [4]:
urllib.request.urlretrieve(HEADLINES, HD_LOCAL)

('data/headlines.txt', <http.client.HTTPMessage at 0x7f5347005588>)

In [5]:
# Read one headline per line; the context manager closes the file afterwards.
with open(HD_LOCAL) as f:
    headlines_list = [line.strip() for line in f]

In [8]:
headlines_list[:10]

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil',
 'Dallas man comes down with case of Zika',
 'Trinidad confirms first Zika case',
 'Zika Concerns are Spreading in Houston',
 'Geneve Scientists Battle to Find Cure',
 'The CDC in Atlanta is Growing Worried',
 'Zika Infested Monkeys in Sao Paulo']

In [10]:
headlines_list[-10:]

['Authorities are Worried about the Spread of Varicella in Clovis',
 'More Zika patients reported in Fort Worth',
 'Zika symptoms spotted in Boynton Beach',
 'Outbreak of Zika in Portoviejo',
 'Influenza Exposure in Muscat',
 'Rumors about Rabies spreading in Jerusalem have been refuted',
 'More Zika patients reported in Indang',
 'Suva authorities confirmed the spread of Rotavirus',
 'More Zika patients reported in Bella Vista',
 'Zika Outbreak in Wichita Falls']
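
Reading headlines by eye can be backed up with a few mechanical checks. The sketch below is only one possible approach: `quality_report` is a hypothetical helper, and the `sample` list is made up for illustration rather than taken from `headlines.txt`. It counts blank lines, lists duplicated headlines, and collects headlines containing non-ASCII characters, which are candidates for accent normalization.

```python
from collections import Counter


def quality_report(headlines):
    """Summarize simple data quality signals for a list of headlines."""
    stripped = [hl.strip() for hl in headlines]
    counts = Counter(stripped)
    return {
        'total': len(stripped),
        'blank': sum(1 for hl in stripped if not hl),
        # Headlines that appear more than once (blank lines excluded).
        'duplicates': sorted(hl for hl, n in counts.items() if hl and n > 1),
        # Headlines with accented or other non-ASCII characters.
        'non_ascii': [hl for hl in stripped if not hl.isascii()],
    }


# Made-up sample for illustration only.
sample = [
    'Zika Outbreak Hits Miami',
    'Zika Outbreak Hits Miami',
    '',
    'Zika Infested Monkeys in São Paulo',
]
report = quality_report(sample)
```

The same function could be called on `headlines_list` to see whether the real file has any of these problems.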

In [16]:
zika_headlines = [hl for hl in headlines_list if re.search(r'zika', hl, re.I)]
flu_headlines = [hl for hl in headlines_list if re.search(r'flu', hl, re.I)]

In [14]:
len(zika_headlines)

227

In [17]:
len(flu_headlines)

28

In [15]:
len(headlines_list)

650

### Possible Issues

* The headlines are unbalanced with respect to disease: 227 of the 650 headlines mention Zika, while only 28 match `flu`.
* Some headlines with no locations
   - 'The CDC in Atlanta is Growing Worried'
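
The disease imbalance can be quantified with a quick tally. The sketch below is an assumption-laden illustration: `disease_keywords` is a hypothetical list built from the disease names visible in the last ten headlines shown earlier, not a complete vocabulary, and `sample_headlines` reuses those ten headlines so the snippet runs without the data file.

```python
import re
from collections import Counter

# The last ten headlines shown earlier in the notebook.
sample_headlines = [
    'Authorities are Worried about the Spread of Varicella in Clovis',
    'More Zika patients reported in Fort Worth',
    'Zika symptoms spotted in Boynton Beach',
    'Outbreak of Zika in Portoviejo',
    'Influenza Exposure in Muscat',
    'Rumors about Rabies spreading in Jerusalem have been refuted',
    'More Zika patients reported in Indang',
    'Suva authorities confirmed the spread of Rotavirus',
    'More Zika patients reported in Bella Vista',
    'Zika Outbreak in Wichita Falls',
]

# Hypothetical keyword list; the full file likely needs more entries.
disease_keywords = ['Zika', 'Varicella', 'Influenza', 'Rabies', 'Rotavirus']
disease_pattern = re.compile('|'.join(disease_keywords), re.I)


def tally_diseases(headlines):
    """Count headlines per first disease keyword found, case-insensitively."""
    counts = Counter()
    for hl in headlines:
        match = disease_pattern.search(hl)
        counts[match.group(0).title() if match else 'unmatched'] += 1
    return counts


disease_counts = tally_diseases(sample_headlines)
```

Running `tally_diseases(headlines_list)` and inspecting the `'unmatched'` bucket would also surface headlines that name a disease missing from the keyword list.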