# Parsing News Headlines

The objective is to find any city and/or country names mentioned in each of the news headlines.

In [1]:
import pandas as pd

with open('headlines.txt') as fin:
    lines = fin.readlines()

headlines = [v.strip() for v in lines]

df = pd.DataFrame({'headline': headlines})
df

Unnamed: 0,headline
0,Zika Outbreak Hits Miami
1,Could Zika Reach New York City?
2,First Case of Zika in Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil"
4,Dallas man comes down with case of Zika
...,...
645,Rumors about Rabies spreading in Jerusalem hav...
646,More Zika patients reported in Indang
647,Suva authorities confirmed the spread of Rotav...
648,More Zika patients reported in Bella Vista


Looks good. Next, load the countries and cities from `geonamescache`:

In [2]:
import geonamescache

gnc = geonamescache.GeonamesCache()
countries = gnc.get_countries()

{k: v for k, v in list(countries.items())[:2]}
    

{'AD': {'geonameid': 3041565,
  'name': 'Andorra',
  'iso': 'AD',
  'iso3': 'AND',
  'isonumeric': 20,
  'fips': 'AN',
  'continentcode': 'EU',
  'capital': 'Andorra la Vella',
  'areakm2': 468,
  'population': 84000,
  'tld': '.ad',
  'currencycode': 'EUR',
  'currencyname': 'Euro',
  'phone': '376',
  'postalcoderegex': '^(?:AD)*(\\d{3})$',
  'languages': 'ca',
  'neighbours': 'ES,FR'},
 'AE': {'geonameid': 290557,
  'name': 'United Arab Emirates',
  'iso': 'AE',
  'iso3': 'ARE',
  'isonumeric': 784,
  'fips': 'AE',
  'continentcode': 'AS',
  'capital': 'Abu Dhabi',
  'areakm2': 82880,
  'population': 4975593,
  'tld': '.ae',
  'currencycode': 'AED',
  'currencyname': 'Dirham',
  'phone': '971',
  'postalcoderegex': '',
  'languages': 'ar-AE,fa,en,hi,ur',
  'neighbours': 'SA,OM'}}

The name of the country is buried in the values of the returned dictionary, so let's extract that:

In [3]:
countries = [v['name'] for v in gnc.get_countries().values()]
countries[:10]

['Andorra',
 'United Arab Emirates',
 'Afghanistan',
 'Antigua and Barbuda',
 'Anguilla',
 'Albania',
 'Armenia',
 'Angola',
 'Antarctica',
 'Argentina']

Now, do the same for cities:

In [4]:
{k: v for k, v in list(gnc.get_cities().items())[:2]}

{'3041563': {'geonameid': 3041563,
  'name': 'Andorra la Vella',
  'latitude': 42.50779,
  'longitude': 1.52109,
  'countrycode': 'AD',
  'population': 20430,
  'timezone': 'Europe/Andorra',
  'admin1code': '07'},
 '290594': {'geonameid': 290594,
  'name': 'Umm Al Quwain City',
  'latitude': 25.56473,
  'longitude': 55.55517,
  'countrycode': 'AE',
  'population': 62747,
  'timezone': 'Asia/Dubai',
  'admin1code': '07'}}

In [5]:
cities = [v['name'] for v in gnc.get_cities().values()]
cities[:10]

['Andorra la Vella',
 'Umm Al Quwain City',
 'Ras Al Khaimah City',
 'Zayed City',
 'Khawr Fakkān',
 'Dubai',
 'Dibba Al-Fujairah',
 'Dibba Al-Hisn',
 'Sharjah',
 'Ar Ruways']

One of the first 10 city names, `Khawr Fakkān`, contains an accented character. This is a problem if a city or country name uses non-ASCII characters but the headline that mentions it does not, or vice versa. To avoid such mismatches, let's replace non-ASCII characters with their closest ASCII equivalents in the headlines, countries, and cities using the `text_unidecode` library:

In [6]:
from text_unidecode import unidecode

df.headline = df.headline.apply(unidecode)
df

Unnamed: 0,headline
0,Zika Outbreak Hits Miami
1,Could Zika Reach New York City?
2,First Case of Zika in Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil"
4,Dallas man comes down with case of Zika
...,...
645,Rumors about Rabies spreading in Jerusalem hav...
646,More Zika patients reported in Indang
647,Suva authorities confirmed the spread of Rotav...
648,More Zika patients reported in Bella Vista


Do the same for countries and cities:

In [7]:
countries = [unidecode(v) for v in countries]
countries[:10]

['Andorra',
 'United Arab Emirates',
 'Afghanistan',
 'Antigua and Barbuda',
 'Anguilla',
 'Albania',
 'Armenia',
 'Angola',
 'Antarctica',
 'Argentina']

In [8]:
cities = [unidecode(v) for v in cities]
cities[:10]

['Andorra la Vella',
 'Umm Al Quwain City',
 'Ras Al Khaimah City',
 'Zayed City',
 'Khawr Fakkan',
 'Dubai',
 'Dibba Al-Fujairah',
 'Dibba Al-Hisn',
 'Sharjah',
 'Ar Ruways']

The city name `Khawr Fakkān` is now `Khawr Fakkan`, without the accent.
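
For comparison, here is a minimal sketch of ASCII folding that uses only the standard library. It is not what `text_unidecode` does internally; `ascii_fold` is a hypothetical helper written for this illustration, and it assumes that dropping combining marks is good enough for the names involved.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding with the standard library (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining mark,
    # e.g. 'ā' becomes 'a' followed by U+0304 (combining macron).
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping everything that cannot be encoded as ASCII removes the marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(ascii_fold('Khawr Fakkān'))  # Khawr Fakkan
```

Unlike `unidecode`, this approach silently drops letters that have no decomposition (for example `ø` or `ß`) instead of transliterating them, which is one reason to prefer the library for place names.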

In [9]:
import re

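# Escape each name so that any regex metacharacters in it are matched literally
country_regex = re.compile(f"\\b({'|'.join(re.escape(v) for v in countries)})\\b")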
df.loc[:, 'country'] = df.headline.str.extract(country_regex, expand=False)
df

Unnamed: 0,headline,country
0,Zika Outbreak Hits Miami,
1,Could Zika Reach New York City?,
2,First Case of Zika in Miami Beach,
3,"Mystery Virus Spreads in Recife, Brazil",Brazil
4,Dallas man comes down with case of Zika,
...,...,...
645,Rumors about Rabies spreading in Jerusalem hav...,
646,More Zika patients reported in Indang,
647,Suva authorities confirmed the spread of Rotav...,
648,More Zika patients reported in Bella Vista,
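
The `\b` anchors around the group keep a name from matching inside a longer word. A small sketch of the effect, using a hypothetical headline and a single hard-coded name rather than the full country list:

```python
import re

headline = 'Outbreak reported in Nigeria'  # hypothetical headline

loose = re.compile(r"(Niger)")       # no word boundaries
strict = re.compile(r"\b(Niger)\b")  # same name, wrapped in \b as in the cell above

# Without boundaries, 'Niger' is found inside 'Nigeria'.
print(loose.search(headline).group(1))  # Niger

# With boundaries, the trailing \b fails because 'i' follows 'Niger'.
print(strict.search(headline))  # None
```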


The country extraction looks promising. Now, do the same for cities:

In [10]:
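# Escape each name so that any regex metacharacters in it are matched literally
city_regex = re.compile(f"\\b({'|'.join(re.escape(v) for v in cities)})\\b")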
df.loc[:, 'city'] = df.headline.str.extract(city_regex, expand=False)
df

Unnamed: 0,headline,country,city
0,Zika Outbreak Hits Miami,,
1,Could Zika Reach New York City?,,
2,First Case of Zika in Miami Beach,,
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,
4,Dallas man comes down with case of Zika,,
...,...,...,...
645,Rumors about Rabies spreading in Jerusalem hav...,,
646,More Zika patients reported in Indang,,
647,Suva authorities confirmed the spread of Rotav...,,
648,More Zika patients reported in Bella Vista,,


Instead of "Miami Beach", the regex extracted only "Miami". Let's look at `cities` a little closer:

In [11]:
[v for v in cities if v.startswith('Miami')]

['Miami', 'Miami Beach', 'Miami Gardens', 'Miami Lakes', 'Miamisburg']

Because a regex alternation returns the first alternative that matches, let's sort the cities by name length in descending order so that longer names are tried first, then create a new regex and try extracting again:

In [12]:
cities2 = sorted(cities, key=len, reverse=True)
[v for v in cities2 if v.startswith('Miami')]

['Miami Gardens', 'Miami Beach', 'Miami Lakes', 'Miamisburg', 'Miami']
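
To see why the order of alternatives matters, here is a minimal sketch on a two-name list. `first_match` is a hypothetical helper for this illustration; it builds the same kind of `\b(...)\b` pattern as the cells in this notebook.

```python
import re

headline = 'First Case of Zika in Miami Beach'
names = ['Miami', 'Miami Beach']  # shorter name listed first


def first_match(names, text):
    """Return the first name matched in text, or None (hypothetical helper)."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    found = pattern.search(text)
    return found.group(1) if found else None


# 'Miami' is tried first and succeeds, so 'Miami Beach' is never attempted.
print(first_match(names, headline))  # Miami

# Longest-first ordering lets the more specific name win.
print(first_match(sorted(names, key=len, reverse=True), headline))  # Miami Beach
```

Note that `\b` does not help here: the space after "Miami" is a valid word boundary, so only the ordering changes the result.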

In [13]:
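# Escape each name so that any regex metacharacters in it are matched literally
cities2_regex = re.compile(f"\\b({'|'.join(re.escape(v) for v in cities2)})\\b")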
df.city = df.headline.str.extract(cities2_regex, expand=False)
df

Unnamed: 0,headline,country,city
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas
...,...,...,...
645,Rumors about Rabies spreading in Jerusalem hav...,,Jerusalem
646,More Zika patients reported in Indang,,Indang
647,Suva authorities confirmed the spread of Rotav...,,Suva
648,More Zika patients reported in Bella Vista,,Bella Vista


Looks good. Let's apply the same longest-first ordering to countries, just in case:

In [14]:
countries2 = sorted(countries, key=len, reverse=True)
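# Escape each name so that any regex metacharacters in it are matched literally
countries2_regex = re.compile(f"\\b({'|'.join(re.escape(v) for v in countries2)})\\b")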
df.country = df.headline.str.extract(countries2_regex, expand=False)
df

Unnamed: 0,headline,country,city
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas
...,...,...,...
645,Rumors about Rabies spreading in Jerusalem hav...,,Jerusalem
646,More Zika patients reported in Indang,,Indang
647,Suva authorities confirmed the spread of Rotav...,,Suva
648,More Zika patients reported in Bella Vista,,Bella Vista
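
`str.extract` keeps only the first match in each headline. If a headline could mention more than one place, one option is to collect every match instead. The sketch below assumes a hypothetical headline and a small hard-coded subset of names, and uses `re.findall` from the standard library; pandas offers `Series.str.findall` for the column-wise equivalent.

```python
import re

# Hypothetical subset of city names, sorted longest first as above
names = sorted(['Miami', 'Miami Beach', 'Dallas'], key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

headline = 'Zika spreads from Miami to Miami Beach'  # hypothetical headline

# findall returns the captured group for every non-overlapping match, in order
all_matches = pattern.findall(headline)
print(all_matches)  # ['Miami', 'Miami Beach']
```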


Looks good. Save the data frame for the next exercise.

In [15]:
# Let pandas write the file directly; index=False omits the row index column
df.to_csv('exercise1-output.csv', index=False)