# Outreach Data

This notebook downloads the raw outreach data, inspects it, and gets it into the right shape for analysis.

#### Requirements

*Note-to-self:* Here's a list of things to keep in mind when going through the data/analysis.

- Gaza, Tel Aviv, Israel, and Palestine need segmentation. (To be done by Natasa)
- Some projects/classifications will need links added.

In [1]:
# Project variables
data_dir = '../data/'

# data_dir already ends with a slash, so the relative part must not start with one
outreach_f = data_dir + 'original/IWP_map.csv'

In [2]:
# Libraries
import pandas as pd
import json

%matplotlib inline

#### Extract Outreach data from CSV

In [69]:
df = pd.read_csv(outreach_f)

# Report the number of rows in the dataframe and its column names
print("{} rows".format(len(df)))
print(df.columns.tolist())
df.head(20)

['Category', 'Country', 'Year', 'link']


Unnamed: 0,Category,Country,Year,link
0,International Conference,Greece,2006,https://iwp.uiowa.edu/programs/international-c...
1,Reading tour,Syria - Aleppo,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
2,Reading tour,Syria - Damascus,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
3,Reading tour,Jordan,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
4,Reading tour,Jerusalem,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
5,Reading tour,Palestinian Territories,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
6,Reading tour,Turkey,2007,https://iwp.uiowa.edu/programs/reading-abroad/...
7,International Conference,Greece,2007,https://iwp.uiowa.edu/programs/international-c...
8,Reading tour,Cyprus,2008,http://iwp.uiowa.edu/programs/reading-abroad/2...
9,Reading tour,Oman,2008,http://iwp.uiowa.edu/programs/reading-abroad/2...


In [70]:
df.Country.unique().tolist()

['Greece',
 'Syria - Aleppo',
 'Syria - Damascus',
 'Jordan ',
 'Jerusalem',
 'Palestinian Territories',
 'Turkey',
 'Cyprus',
 'Oman',
 'Saudi Arabia - Riyadh',
 'Saudi Arabia - Dhahran',
 'Saudi Arabia - Jeddah',
 'Yemen',
 'Kenya',
 'Morocco - Fes',
 'China',
 'Morocco - Casablanca',
 'Tunisia',
 'Morocco - Essaouira',
 'Nepal',
 'Pakistan',
 'United Arab Emirates',
 'Afghanistan',
 'Uruguay',
 'Bolivia',
 'Congo - Kinshasa',
 'Congo - Brazzaville',
 'Brazil - Fortaleza',
 'Brazil - Sao Paolo',
 'Brazil - Brasilia',
 'Cambodia - Phnom Penh',
 'Cambodia - Kampong Speu',
 'Cambodia - Krong Kampong Chhnang',
 'Cambodia - Krong Siem Reap',
 'Vietnam - Hanoi',
 'Mozambique',
 'Zimbabwe - Harare',
 'Zimbabwe - Masvingo',
 'Iraq - Basrah',
 'Iraq - Baghdad',
 'Iraq - Erbil',
 'Sudan',
 'South Sudan',
 'Uzbekistan',
 'Turkmenistan',
 'United Arab Emirates - Abu Dhabi',
 'United Arab Emirates - Dubai',
 'United Arab Emirates - Sharjah',
 'Burma',
 'Maldives',
 'Haiti',
 'Cuba',
 'Armenia',
 

#### Add Geocoded lat/longs to Dataframe

- City column
    * Split the city out of the existing `Country` column
    * Fill in the capital city for countries that have no city listed
- Geocode lat/longs
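
The splitting step in this list can be sketched in plain Python. This is only an illustration: `split_location` is a hypothetical helper (not part of the notebook), and it assumes every label uses `' - '` as the country/city separator.

```python
def split_location(label):
    """Split a 'Country - City' label into (country, city).

    Hypothetical helper: assumes ' - ' separates country and city.
    Labels without a city return an empty string for the city.
    """
    country, _, city = label.partition(' - ')
    return country.strip(), city.strip()


print(split_location('Syria - Aleppo'))
print(split_location('Jordan '))
```

Stripping both parts also removes stray whitespace such as the trailing space in `'Jordan '`.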

In [71]:
# add city column to dataframe
df['city'] = df['Country'].apply(lambda x: x.split(' - ')[1] if len(x.split(' - '))>1 else '')

In [72]:
# Clean the country column
df['Country'] = df['Country'].apply(lambda x: x.split(' - ')[0] if len(x.split(' - '))>1 else x.strip())

Download country info from GitHub to get the capitals.

In [73]:
capitals_url = "https://raw.githubusercontent.com/mledoze/countries/master/dist/countries.csv"
country_info = pd.read_csv(capitals_url, sep=';')

print(country_info.columns.tolist())
country_info.head()

['name', 'tld', 'cca2', 'ccn3', 'cca3', 'cioc', 'currency', 'callingCode', 'capital', 'altSpellings', 'region', 'subregion', 'languages', 'translations', 'latlng', 'demonym', 'landlocked', 'borders', 'area']


Unnamed: 0,name,tld,cca2,ccn3,cca3,cioc,currency,callingCode,capital,altSpellings,region,subregion,languages,translations,latlng,demonym,landlocked,borders,area
0,"Aruba,Aruba,Aruba,Aruba,Aruba,Aruba",.aw,AW,533.0,ABW,ARU,AWG,297,Oranjestad,AW,Americas,Caribbean,"Dutch,Papiamento","Aruba,Aruba,Aruba,Aruba,Aruba,Aruba,Aruba,Arub...","12.5,-69.96666666",Aruban,,,180.0
1,"Afghanistan,Islamic Republic of Afghanistan,جم...",.af,AF,4.0,AFG,AFG,AFN,93,Kabul,"AF,Afġānistān",Asia,Southern Asia,"Dari,Pashto,Turkmen","Islamic Republic of Afghanistan,Affganistan,Is...",3365,Afghan,1.0,"IRN,PAK,TKM,UZB,TJK,CHN",652230.0
2,"Angola,Republic of Angola,República de Angola,...",.ao,AO,24.0,AGO,ANG,AOA,244,Luanda,"AO,República de Angola,ʁɛpublika de an'ɡɔla",Africa,Middle Africa,Portuguese,"Republic of Angola,Angola,Republik Angola,Ango...","-12.5,18.5",Angolan,,"COG,COD,ZMB,NAM",1246700.0
3,"Anguilla,Anguilla,Anguilla,Anguilla",.ai,AI,660.0,AIA,,XCD,1264,The Valley,AI,Americas,Caribbean,English,"Anguilla,Anguilla,Anguilla,Anguilla,Anguilla,A...","18.25,-63.16666666",Anguillian,,,91.0
4,"Åland Islands,Åland Islands,Landskapet Åland,Å...",.ax,AX,248.0,ALA,,EUR,358,Mariehamn,"AX,Aaland,Aland,Ahvenanmaa",Europe,Northern Europe,Swedish,"Åland-Inseln,Åland,Ahvenanmaa,Ahvenanmaa,Aland...","60.116667,19.9",Ålandish,,,1580.0


In [74]:
country_info['name'] = country_info.name.apply(lambda x: x.split(',')[0])

In [75]:
country_names = country_info['name'].tolist()

for x in df.Country.unique().tolist():
    if x not in country_names:
        print(x)

Jerusalem
Palestinian Territories
Congo
Burma


In [76]:
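# Rename offending countries to merge with capitals

## NOTE: the data only has 'Palestinian Territories', not 'Palestine'.
## The lookup table lists Jerusalem as the capital of Israel, so renaming
## 'Jerusalem' to 'Israel' still yields Jerusalem as the city.
## NOTE: 'Congo' covers both the Kinshasa and the Brazzaville rows. Brazzaville is the
## capital of the Republic of the Congo, so that row may need its own mapping.

rename_capitals = {
    'Palestinian Territories': 'Palestine',
    'Congo': 'DR Congo',
    'Burma': 'Myanmar',
    'Jerusalem': 'Israel'
}
df['Country'] = df['Country'].apply(lambda x: rename_capitals.get(x, x))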
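
The check, rename, check-again cycle around this cell could be wrapped in one small helper. A minimal sketch, assuming a hypothetical function `normalize_countries` and toy stand-ins for the real name lists:

```python
def normalize_countries(names, aliases, known):
    """Apply alias renames, then report names still missing from `known`.

    Hypothetical helper: `aliases` plays the role of the rename dict and
    `known` the list of names taken from the country table.
    """
    renamed = [aliases.get(name, name) for name in names]
    unmatched = sorted(set(renamed) - set(known))
    return renamed, unmatched


# Toy stand-ins for the real data
aliases = {'Burma': 'Myanmar', 'Jerusalem': 'Israel'}
known = ['Greece', 'Israel', 'Myanmar']

renamed, unmatched = normalize_countries(
    ['Greece', 'Burma', 'Jerusalem', 'Congo'], aliases, known
)
print(renamed)
print(unmatched)
```

An empty `unmatched` list would mean every country can be joined to the capitals table.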

In [77]:
for x in df.Country.unique().tolist():
    if x not in country_names:
        print(x)
#### All clear

In [78]:
capital_lookup = country_info[['name', 'capital']].set_index('name')

def get_country_capital(country, country_df):
    # look up the capital for a country name in the name-indexed table
    return country_df.loc[country, 'capital']

In [79]:
# repopulate the empty city cells
df['city'] = df.apply(lambda x: get_country_capital(x['Country'], capital_lookup) if (x['city'] == '') else x['city'], axis=1)
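
The row-wise `apply` could also be written with vectorized pandas operations. A minimal sketch on toy data, assuming a name-to-capital Series (here called `capitals`, a made-up stand-in for the lookup table):

```python
import pandas as pd

# Toy frame: an empty string marks a row with no city listed
df_demo = pd.DataFrame({
    'Country': ['Greece', 'Syria', 'Jordan'],
    'city': ['', 'Aleppo', ''],
})

# Hypothetical stand-in for the capitals lookup table
capitals = pd.Series({'Greece': 'Athens', 'Syria': 'Damascus', 'Jordan': 'Amman'})

# Map each country to its capital, then keep existing cities where present
fallback = df_demo['Country'].map(capitals)
df_demo['city'] = df_demo['city'].where(df_demo['city'] != '', fallback)

print(df_demo['city'].tolist())
```

`Series.where` keeps a value where the condition is true and takes the aligned value from `fallback` otherwise, so cities that were already filled in are left alone.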

#### Geocode the Outreach programs

In [91]:
from geopy.geocoders import GoogleV3
from geopy.exc import GeocoderTimedOut
from time import sleep

# NOTE: Google's geocoding API requires a key; recent geopy versions expect GoogleV3(api_key=...)
google_geocoder = GoogleV3()

def find_map_coordinates(full_address, geolocator):
    """
    Function to find latitude and longitude for the address.
    Input
    :full_address: input string with full address (like "55 rue du Faubourg Saint-Honore, 75008, Paris")
    :geolocator: geocoder object from geopy module
    
    Output
    :latitude, longitude: values of latitude and longitude for the input address;
        (0, 0) if the request timed out or no match was found
    """
    timeout_secs = 15
    try:
        location = geolocator.geocode(full_address, timeout=timeout_secs)
        sleep(5)  # pause between requests to stay under the rate limit
    except GeocoderTimedOut:
        sleep(timeout_secs)
        return 0, 0
    # geocode() returns None when the address cannot be resolved
    if location is None:
        return 0, 0
    return location.latitude, location.longitude
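
Several rows share the same address (for example, "Athens, Greece" appears for both 2006 and 2007), so each repeat costs an extra request and another pause. One option is to cache results per address. The sketch below uses `functools.lru_cache`; `make_cached_geocoder` and `fake_lookup` are hypothetical names, and the fake lookup stands in for the real geocoder so the example runs offline.

```python
from functools import lru_cache


def make_cached_geocoder(lookup):
    """Wrap an address -> (lat, lng) function so each address is looked up once."""
    @lru_cache(maxsize=None)
    def cached(address):
        return lookup(address)
    return cached


calls = []  # records which addresses actually reach the lookup

def fake_lookup(address):
    # Offline stand-in for the real geocoder; unknown addresses give (0, 0)
    calls.append(address)
    return {'Athens, Greece': (37.98381, 23.727539)}.get(address, (0, 0))


geocode_once = make_cached_geocoder(fake_lookup)
addresses = ['Athens, Greece', 'Amman, Jordan', 'Athens, Greece']
coords = [geocode_once(a) for a in addresses]

print(calls)
```

In the notebook, `lookup` could be something like `lambda a: find_map_coordinates(a, google_geocoder)`. Note that failed lookups, returned as `(0, 0)`, would be cached as well.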

In [83]:
df['fulladdress'] = df['city'] + ', ' + df['Country']

In [92]:
# Find coordinates by applying the helper function. We also pass the geolocator object
locs = df['fulladdress'].apply(find_map_coordinates, args=(google_geocoder,)).tolist()

# Add new columns to the dataframe with latitude and longitude values
df['lat'], df['lng'] = zip(*locs)
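
The `zip(*locs)` idiom transposes a list of `(lat, lng)` pairs into one tuple of latitudes and one of longitudes. A small standalone illustration, using two coordinate pairs from this dataset:

```python
# Each entry is a (lat, lng) pair, as returned per address
locs_demo = [(37.98381, 23.727539), (36.202105, 37.13426)]

# zip(*locs_demo) unpacks the pairs and regroups them by position
lats, lngs = zip(*locs_demo)

print(lats)
print(lngs)
```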

In [94]:
df

Unnamed: 0,Category,Country,Year,link,city,fulladdress,lat,lng
0,International Conference,Greece,2006,https://iwp.uiowa.edu/programs/international-c...,Athens,"Athens, Greece",37.983810,23.727539
1,Reading tour,Syria,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Aleppo,"Aleppo, Syria",36.202105,37.134260
2,Reading tour,Syria,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Damascus,"Damascus, Syria",33.513807,36.276528
3,Reading tour,Jordan,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Amman,"Amman, Jordan",31.945367,35.928372
4,Reading tour,Israel,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Jerusalem,"Jerusalem, Israel",31.768319,35.213710
5,Reading tour,Palestine,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Ramallah,"Ramallah, Palestine",31.903764,35.203418
6,Reading tour,Turkey,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Ankara,"Ankara, Turkey",39.933363,32.859742
7,International Conference,Greece,2007,https://iwp.uiowa.edu/programs/international-c...,Athens,"Athens, Greece",37.983810,23.727539
8,Reading tour,Cyprus,2008,http://iwp.uiowa.edu/programs/reading-abroad/2...,Nicosia,"Nicosia, Cyprus",35.185566,33.382276
9,Reading tour,Oman,2008,http://iwp.uiowa.edu/programs/reading-abroad/2...,Muscat,"Muscat, Oman",23.585890,58.405923


In [95]:
df.columns

Index(['Category', 'Country', 'Year', 'link', 'city', 'fulladdress', 'lat',
       'lng'],
      dtype='object')

#### Rename Columns

Something human-readable, preferably.

In [96]:
new_cols = ['category', 'country', 'year', 'link', 'city', 'full', 'lat', 'lng']
df.columns = new_cols

df.head(2)

Unnamed: 0,category,country,year,link,city,full,lat,lng
0,International Conference,Greece,2006,https://iwp.uiowa.edu/programs/international-c...,Athens,"Athens, Greece",37.98381,23.727539
1,Reading tour,Syria,2007,https://iwp.uiowa.edu/programs/reading-abroad/...,Aleppo,"Aleppo, Syria",36.202105,37.13426


In [97]:
# drop the full address column
df = df[['category', 'country', 'year', 'link', 'city', 'lat', 'lng']]
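
Because `find_map_coordinates` returns `(0, 0)` when a request times out, it can be worth checking for those placeholder coordinates before saving. A minimal sketch on toy data (the "Nowhere" row is invented to show a failure):

```python
import pandas as pd

df_check = pd.DataFrame({
    'city': ['Athens', 'Nowhere'],
    'lat': [37.98381, 0.0],
    'lng': [23.727539, 0.0],
})

# Rows where both coordinates are the (0, 0) placeholder
failed = df_check[(df_check['lat'] == 0) & (df_check['lng'] == 0)]
print(failed['city'].tolist())
```

Any cities listed here would need to be geocoded again or corrected by hand.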

---

In [98]:
outreach_canon_f = '../data/canonical/outreach-programs.csv'
df.to_csv(outreach_canon_f, index=False)

#### **OLD**: *Rename Countries to match participant naming.*

In [12]:
# name_sort aliases
name_sort = {
    'Russia': 'Russian Federation',
    #
    'USA': 'United States of America',
    'U.S.': 'United States of America',
    'US': 'United States of America',
    #
    'Burma': 'Myanmar',
    #
    'Kyrgyzstan': 'Kyrgyz Republic',
    #
    'Kuwait (but she is Lebanese)': 'Kuwait',
    #
    'Syria': 'Syrian Arab Republic',
    'Syria (Palestinian Syrian)': 'Syrian Arab Republic',
    #
    'Egypt': 'Egypt, Arab Rep.',
    #
    'Palestine': 'Palestine (West Bank and Gaza)',
    'West Bank': 'Palestine (West Bank and Gaza)',
    'Gaza': 'Palestine (West Bank and Gaza)',
    'Palestinian Territories': 'Palestine (West Bank and Gaza)',
    'Palestinian Territories (Gaza)': 'Palestine (West Bank and Gaza)',
    'Palestinian Territories (West Bank)': 'Palestine (West Bank and Gaza)',
    #
    'Venezuela': 'Venezuela, RB',
    #
    'Yemen': 'Yemen, Rep.'
    
}

# strip trailing spaces before dict lookup
df['Country'] = df['Country'].str.strip()

# dict lookup
df['Country'] = df['Country'].map(lambda x: name_sort[x] if x in name_sort else x)