### Steps

1. Load the pandas DataFrame containing the headlines, countries, and cities.
  + If a headline contains multiple cities/countries, decide which single one to keep.
2. For each city/country, match the name to the latitude and longitude in `geonamescache`.
  + You can use the function `gc.get_cities_by_name("city_name")`.
  + Some city names return multiple matches from this function, in different countries. You’ll have to decide which city to keep based on a heuristic (rule of thumb).
  + If you have trouble, work with a single problematic city until you figure it out, then write a function to apply on all headlines.
3. Add longitude and latitude coordinates to your DataFrame for each headline.
  + It will be helpful to get the `countrycode` of each headline at this point.
  + If you were not able to find many countries, think about **dropping the column**. You also need to decide what to do with headlines that have no coordinates.
  + You should end up with over 600 headlines that have geographic coordinates.
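
For the first step, one possible heuristic (a sketch, not the project's prescribed method) is to keep the longest matched name, on the assumption that a shorter match is often a fragment of the intended one ("Miami" inside "Miami Beach"). The function name `pick_longest_match` and the candidate lists are hypothetical.

```python
def pick_longest_match(candidates):
    """Return the longest candidate name, or None if there are no candidates.

    Assumption: a longer match is more specific ("Miami Beach" vs. "Miami").
    On ties, the candidate that appears first is kept.
    """
    if not candidates:
        return None
    return max(candidates, key=len)


# Hypothetical candidates extracted from "First Case of Zika in Miami Beach".
print(pick_longest_match(["Miami", "Miami Beach"]))
```

A headline with no matches yields `None`, which leaves the decision about location-free headlines to a later step.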

### Notes

* Make sure you are inspecting your data statistically and visually at each step. Compute summary statistics and make basic distribution plots (histograms) to search for anomalies.
* When you find issues (we all make mistakes), go back and correct them. You may find yourself repeatedly carrying out data cleaning operations; this is normal for a data science project!
* Keep in mind that the more accurate the geographic locations of the headlines, the better the clustering and the overall conclusions will be. Checking your data one more time is worth it when lives are at stake.
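
A cheap first check along these lines is to confirm that every coordinate falls in a valid range before plotting anything. The sketch below assumes coordinates are available as `(latitude, longitude)` pairs; `coordinate_anomalies` is a hypothetical helper, and the last sample pair is deliberately invalid.

```python
import statistics


def coordinate_anomalies(points):
    """Return the (latitude, longitude) pairs that fall outside valid ranges."""
    return [
        (lat, lon)
        for lat, lon in points
        if not (-90 <= lat <= 90 and -180 <= lon <= 180)
    ]


# Three pairs taken from the notebook output, plus one deliberately bad pair.
points = [
    (25.77427, -80.19366),
    (40.71427, -74.00597),
    (-8.05389, -34.88111),
    (123.0, 45.0),
]

bad = coordinate_anomalies(points)
valid = [p for p in points if p not in bad]

print("anomalies:", bad)
print("mean latitude of valid points:", statistics.mean(lat for lat, _ in valid))
```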

### Questions

* What does "dropping the column" mean? 
* What type of histograms would I make with this data?
    + Has lat / long, missing lat long
    + Headlines by country
    + Bucketize headlines by population of city
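
The three ideas above come down to counting rows per category, which can be sketched with `collections.Counter` before any plotting. The row layout, the `located` flag, and the bucket boundaries are assumptions made for illustration; only the two Boston populations come from the notebook output.

```python
from collections import Counter

# Bucket boundaries are an arbitrary choice for this sketch.
BUCKETS = [(1_000_000, "1M+"), (100_000, "100k-1M"), (10_000, "10k-100k")]


def population_bucket(population):
    """Map a city population to a coarse size label."""
    for threshold, label in BUCKETS:
        if population >= threshold:
            return label
    return "<10k"


# Hypothetical rows; one has no location at all.
rows = [
    {"countrycode": "US", "population": 667137, "located": True},
    {"countrycode": "GB", "population": 41340, "located": True},
    {"countrycode": "BR", "population": 1_500_000, "located": True},
    {"countrycode": None, "population": None, "located": False},
]

coords_counts = Counter("located" if r["located"] else "missing" for r in rows)
by_country = Counter(r["countrycode"] for r in rows if r["located"])
by_bucket = Counter(population_bucket(r["population"]) for r in rows if r["located"])

print(coords_counts, by_country, by_bucket, sep="\n")
```

Each counter maps a category to a headline count, so any of them could be handed to a bar chart once a plotting library is chosen.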



In [15]:
import pandas as pd
import re
import geonamescache

In [16]:
gc = geonamescache.GeonamesCache() 

In [17]:
#DF_PICK = './data/headlines_cities_countries.pkl'
DF_PICK = './data/headlines_cities_countries_unicode.pkl'

In [18]:
df = pd.read_pickle(DF_PICK)

In [19]:
countries = gc.get_countries_by_names()
cities = gc.get_cities()

In [20]:
def unpack_city_results(cities):
    out = []
    for c in cities:
        for k,v in c.items():
            out.append(v)
    return out

def pick_city(city_results):
    city_results = unpack_city_results(city_results)
    cr = sorted(city_results, key=lambda x: x['population'], reverse=True)
    return cr[0]
    
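def find_country_coordinates(country):
    # Fall back to the most populous city in the country. `country` is a
    # record from `countries`; `cities` maps geonameid -> city record.
    cc = country['iso']
    cities_in_country = [c for c in cities.values() if c['countrycode'] == cc]
    if not cities_in_country:
        return ''
    largest = max(cities_in_country, key=lambda c: c['population'])
    return { k:largest[k] for k in ['latitude', 'longitude'] }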

def find_coords(row):
    global countries
    
    if row.City:
        city = pick_city(gc.get_cities_by_name(row.City))
        return { k:city[k] for k in ['latitude', 'longitude']}
    elif row.Country:
        country = countries[row.Country]
        return find_country_coordinates(country)
    return ''


In [21]:
df['geo'] = df.apply(find_coords, axis=1)

In [22]:
df.head(40)

| | Headline | City | Country | geo |
|---|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | Miami | | {'latitude': 25.77427, 'longitude': -80.19366} |
| 1 | Could Zika Reach New York City? | New York City | | {'latitude': 40.71427, 'longitude': -74.00597} |
| 2 | First Case of Zika in Miami Beach | Miami Beach | | {'latitude': 25.79065, 'longitude': -80.13005} |
| 3 | Mystery Virus Spreads in Recife, Brazil | Recife | Brazil | {'latitude': -8.05389, 'longitude': -34.88111} |
| 4 | Dallas man comes down with case of Zika | Dallas | | {'latitude': 32.78306, 'longitude': -96.80667} |
| 5 | Trinidad confirms first Zika case | Trinidad | | {'latitude': -14.83333, 'longitude': -64.9} |
| 6 | Zika Concerns are Spreading in Houston | Houston | | {'latitude': 29.76328, 'longitude': -95.36327} |
| 7 | Geneve Scientists Battle to Find Cure | Genève | | {'latitude': 46.20222, 'longitude': 6.14569} |
| 8 | The CDC in Atlanta is Growing Worried | Atlanta | | {'latitude': 33.749, 'longitude': -84.38798} |
| 9 | Zika Infested Monkeys in Sao Paulo | São Paulo | | {'latitude': -23.5475, 'longitude': -46.63611} |


In [23]:
#How many lat / long values?
(df['geo'].values != '').sum()

608

In [None]:
### LEFT OFF HERE.

# * reformat the `geo` column
# * remove rows without the required data
# * Add a country code column. 
# * bonus - make histograms
# * double bonus - try to create an R-based notebook and make histograms there. 
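
A possible starting point for the first two items, sketched on plain dicts so it runs without the pickle file. It assumes the `geo` column holds either a dict with `latitude`/`longitude` keys or `''`, as produced by `find_coords`; `split_geo` and the second sample row are hypothetical.

```python
def split_geo(geo):
    """Turn a geo value into a (latitude, longitude) tuple.

    A dict contributes its 'latitude' and 'longitude' values; the empty
    string (or any other non-dict value) maps to (None, None).
    """
    if isinstance(geo, dict):
        return geo.get("latitude"), geo.get("longitude")
    return None, None


rows = [
    {"Headline": "Zika Outbreak Hits Miami",
     "geo": {"latitude": 25.77427, "longitude": -80.19366}},
    {"Headline": "Hypothetical headline without a location", "geo": ""},
]

located = []
for row in rows:
    lat, lon = split_geo(row["geo"])
    if lat is None or lon is None:
        continue  # drop rows without coordinates
    located.append({"Headline": row["Headline"], "latitude": lat, "longitude": lon})

print(located)
```

On the real DataFrame, `df['geo'].apply(split_geo)` would produce the same tuples row by row, which could then be split into two columns.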

### Testing

In [25]:
gc.get_countries()


{'AD': {'geonameid': 3041565,
  'name': 'Andorra',
  'iso': 'AD',
  'iso3': 'AND',
  'isonumeric': 20,
  'fips': 'AN',
  'continentcode': 'EU',
  'capital': 'Andorra la Vella',
  'areakm2': 468,
  'population': 84000,
  'tld': '.ad',
  'currencycode': 'EUR',
  'currencyname': 'Euro',
  'phone': '376',
  'postalcoderegex': '^(?:AD)*(\\d{3})$',
  'languages': 'ca',
  'neighbours': 'ES,FR'},
 'AE': {'geonameid': 290557,
  'name': 'United Arab Emirates',
  'iso': 'AE',
  'iso3': 'ARE',
  'isonumeric': 784,
  'fips': 'AE',
  'continentcode': 'AS',
  'capital': 'Abu Dhabi',
  'areakm2': 82880,
  'population': 4975593,
  'tld': '.ae',
  'currencycode': 'AED',
  'currencyname': 'Dirham',
  'phone': '971',
  'postalcoderegex': '',
  'languages': 'ar-AE,fa,en,hi,ur',
  'neighbours': 'SA,OM'},
 'AF': {'geonameid': 1149361,
  'name': 'Afghanistan',
  'iso': 'AF',
  'iso3': 'AFG',
  'isonumeric': 4,
  'fips': 'AF',
  'continentcode': 'AS',
  'capital': 'Kabul',
  'areakm2': 647500,
  'population': 

In [7]:
gc.get_cities_by_name('Boston')

[{'2655138': {'geonameid': 2655138,
   'name': 'Boston',
   'latitude': 52.97633,
   'longitude': -0.02664,
   'countrycode': 'GB',
   'population': 41340,
   'timezone': 'Europe/London',
   'admin1code': 'ENG'}},
 {'4930956': {'geonameid': 4930956,
   'name': 'Boston',
   'latitude': 42.35843,
   'longitude': -71.05977,
   'countrycode': 'US',
   'population': 667137,
   'timezone': 'America/New_York',
   'admin1code': 'MA'}}]
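
The Boston result shows why a tie-breaking rule is needed. One option, sketched here under assumptions, is to prefer a match in the headline's country when that country is known and otherwise fall back to the largest population. `pick_city_with_hint` is a hypothetical variant of `pick_city`, and the sample records are trimmed copies of the output above.

```python
def pick_city_with_hint(matches, countrycode=None):
    """Pick one record from get_cities_by_name-style results.

    matches: list of {geonameid: record} dicts.
    countrycode: optional ISO code; if any record matches it, only those
    records are considered. Otherwise all records are considered.
    Returns the most populous record in the pool, or None if there are none.
    """
    records = [rec for match in matches for rec in match.values()]
    if not records:
        return None
    in_country = [r for r in records if r["countrycode"] == countrycode]
    pool = in_country or records
    return max(pool, key=lambda r: r["population"])


# Trimmed copies of the two Boston records shown above.
boston = [
    {"2655138": {"name": "Boston", "latitude": 52.97633, "longitude": -0.02664,
                 "countrycode": "GB", "population": 41340}},
    {"4930956": {"name": "Boston", "latitude": 42.35843, "longitude": -71.05977,
                 "countrycode": "US", "population": 667137}},
]

print(pick_city_with_hint(boston)["countrycode"])
print(pick_city_with_hint(boston, "GB")["countrycode"])
```

Returning `None` for an empty result avoids the `IndexError` that indexing an empty list would raise, at the cost of the caller having to check for it.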

In [13]:
countries['Brazil']

{'geonameid': 3469034,
 'name': 'Brazil',
 'iso': 'BR',
 'iso3': 'BRA',
 'isonumeric': 76,
 'fips': 'BR',
 'continentcode': 'SA',
 'capital': 'Brasilia',
 'areakm2': 8511965,
 'population': 201103330,
 'tld': '.br',
 'currencycode': 'BRL',
 'currencyname': 'Real',
 'phone': '55',
 'postalcoderegex': '^\\d{5}-\\d{3}$',
 'languages': 'pt-BR,es,en,fr',
 'neighbours': 'SR,PE,BO,UY,GY,PY,GF,VE,CO,AR'}

In [14]:
countries['United States']

{'geonameid': 6252001,
 'name': 'United States',
 'iso': 'US',
 'iso3': 'USA',
 'isonumeric': 840,
 'fips': 'US',
 'continentcode': 'NA',
 'capital': 'Washington',
 'areakm2': 9629091,
 'population': 310232863,
 'tld': '.us',
 'currencycode': 'USD',
 'currencyname': 'Dollar',
 'phone': '1',
 'postalcoderegex': '^\\d{5}(-\\d{4})?$',
 'languages': 'en-US,es-US,haw,fr',
 'neighbours': 'CA,MX,CU'}

In [15]:
countries['United States']['iso'] #countrycode

'US'
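
For the country code column, a headline can get its code from the matched city record when there is one, and from the country lookup otherwise. The sketch below assumes both sources look like the outputs above; `headline_countrycode` is a hypothetical helper and the lookup table is trimmed to the `iso` field.

```python
def headline_countrycode(city_record, country_name, countries_by_name):
    """Prefer the matched city's countrycode, else look up the country name.

    Returns None when neither source yields a code.
    """
    if city_record:
        return city_record.get("countrycode")
    country = countries_by_name.get(country_name)
    return country["iso"] if country else None


# Trimmed stand-in for gc.get_countries_by_names().
countries_by_name = {"Brazil": {"iso": "BR"}, "United States": {"iso": "US"}}

print(headline_countrycode({"countrycode": "GB"}, None, countries_by_name))
print(headline_countrycode(None, "Brazil", countries_by_name))
print(headline_countrycode(None, "Atlantis", countries_by_name))
```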

AttributeError: 'GeonamesCache' object has no attribute 'get_cities_by_names'