In [188]:
%matplotlib inline

import numpy as np
import requests
import re
import pickle
import pandas as pd
import matplotlib.pyplot as pl

The API key below is masked: we no longer need to query the API when running the notebook, since the data we fetched from it is cached in pickle files.

In [189]:
GAPI_KEY = '#################################'

Let's first load the raw data, keeping only the columns we need and specifying which values should be treated as N/A.

In [190]:
cols = ['Project Number', 'Institution', 'University', 'Approved Amount']
na_values = ['data not included in P3', 'Nicht zuteilbar - NA']

dtypes = {
    'Approved Amount': np.float64
}

raw = pd.read_csv(
    'P3_GrantExport.csv',
    sep=';',
    na_values=na_values,
    index_col='Project Number',
    dtype=dtypes,
    usecols=cols
)

df = raw.dropna()
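
As a minimal sketch of what `na_values` and `dropna()` do together, here is the same call on a tiny in-memory CSV. The rows are made up for illustration and only mimic the layout of `P3_GrantExport.csv`.

```python
import io
import pandas as pd

# Made-up rows in the same layout as P3_GrantExport.csv (not real grant data)
sample_csv = io.StringIO(
    "Project Number;Institution;University;Approved Amount\n"
    "1;Institute A;University X - ZH;1000.00\n"
    "2;Institute B;Nicht zuteilbar - NA;2000.00\n"
    "3;Institute C;University Y - BE;data not included in P3\n"
)

sample = pd.read_csv(
    sample_csv,
    sep=';',
    na_values=['data not included in P3', 'Nicht zuteilbar - NA'],
    index_col='Project Number',
)

# Rows 2 and 3 each contain a placeholder that was read as NaN,
# so dropna() keeps only row 1
cleaned = sample.dropna()
print(cleaned)
```

Because the placeholder strings become NaN at load time, the 'Approved Amount' column can be parsed as a float column without further cleaning.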

Let's peek at the data:

In [191]:
df.sample(10)

Project Number,Institution,University,Approved Amount
100417,Institut für Physikalische Chemie Universität ...,Universität Zürich - ZH,63100.0
589,Commissione scientifica del VDSI,"NPO (Biblioth., Museen, Verwalt.) - NPO",256503.0
31299,Département de Physique Théorique Université d...,Université de Genève - GE,546585.0
154077,Stiftung Landschaftsschutz Schweiz,"NPO (Biblioth., Museen, Verwalt.) - NPO",4100.0
102880,Centre Integratif de Genomique Faculté de Biol...,Université de Lausanne - LA,1319453.0
161284,Departement Physik Universität Basel,Universität Basel - BS,426119.0
40769,Laboratoire de biophysique cellulaire EPFL - S...,EPF Lausanne - EPFL,71689.0
40335,Jacob Burckhardt-Stiftung,"NPO (Biblioth., Museen, Verwalt.) - NPO",74929.0
5694,Dépt des Neurosciences Fondamentales Faculté d...,Université de Genève - GE,100000.0
127212,Section d'histoire et esthétique du cinéma Fac...,Université de Lausanne - LA,558644.0


Let's load a map of Switzerland through `folium`, scaling it to see the entire country.

Let's see if the index is unique:

In [153]:
df.index.is_unique

True

Good news, every project number is unique!

Let's now try to guess the canton straight from the 'University' column:

In [158]:
with_canton = df.copy()

word_to_canton = {
    'bern': 'BE',
    'lausanne': 'VD',
    'genève': 'GE',
    'geneva': 'GE',
    'luzern': 'LU',
    'zürich': 'ZH',
    'lugano': 'TI',
    'basel': 'BS',
    'vaud': 'VD',
    'fribourg': 'FR',
    'davos': 'GR',
    'sagw': 'BE'
}

cantons = [
    'ZH','BE','LU','UR','SZ','OW','NW','GL','ZG','FR','SO','BS','BL',
    'SH','AR','AI','SG','GR','AG','TG','TI','VD','VS','NE','GE','JU'
]

# Tries to guess the canton by seeing if the given text
# contains a word defined in the dict above.
def guess_canton(text):
    lower = text.lower()
    for word in word_to_canton:
        if word in lower:
            return word_to_canton[word]
        
    return ''

# Extract an abbreviated canton name from the given string
def ex_canton_str(s):
    m = re.search(r'\b([A-Z]+)\b$', s.strip())
    if m is not None and m.group(1) in cantons:
        return m.group(1)
    else:
        return ''

# Extract the canton from the given string, using the functions above
def ex_canton(text):
    guess = guess_canton(text)
    if guess:
        return guess

    res = text.split('-')

    if len(res) < 2:
        return text.strip()
    else:
        return ex_canton_str(res[1])

# Extract the university name (dropping the canton suffix if any)
def ex_uni(text):
    res = text.split('-')

    if len(res) < 2 or ex_canton_str(res[1]) == '':
        return text.strip()
    else:
        return res[0].strip()

# Add the guessed cantons and the refined university names to the dataframe
# (Series.apply calls the function once per value, so no axis argument is needed)
with_canton['Canton']     = with_canton['University'].apply(ex_canton)
with_canton['University'] = with_canton['University'].apply(ex_uni)

# Peek at it
with_canton.sample(10)

Project Number,Institution,University,Approved Amount,Canton
34146,Département de Pharmacologie & Toxicologie Fac...,Université de Lausanne - LA,10000.0,VD
122269,Institut de Recherche en Ophtalmologie,Forschungsinstitut für Opthalmologie - IRO,331000.0,
108072,Zoologisches Institut Universität Zürich-Irchel,Universität Zürich,919000.0,ZH
38527,UNI: Agricultural Center Department of Anima l...,Universität Bern,20000.0,BE
30766,Institut für Geologie Universität Bern,Universität Bern,6000.0,BE
6429,Institut für Experimentalphysik Fachbereich Ph...,Universität Basel,174765.0,BS
54124,Institut de microtechnique EPFL - STI - IMT,Université de Neuchâtel,72044.0,NE
6933,Abteilung Mikrobiologie Biozentrum Universität...,Universität Basel,532909.0,BS
67858,Département des langues et littératures médite...,Université de Genève,231218.0,GE
32311,Service de Radiodiagnostic et de Radiologie In...,Université de Genève,49116.0,GE
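
The suffix extraction above splits on every `-`, so a name that itself contains a hyphen would have the wrong piece inspected. The sketch below shows one possible alternative that splits only on the last ` - ` separator. `split_university` is a hypothetical helper, not part of the notebook, and the hyphenated institute name is made up.

```python
import re

CANTONS = {
    'ZH', 'BE', 'LU', 'UR', 'SZ', 'OW', 'NW', 'GL', 'ZG', 'FR', 'SO', 'BS', 'BL',
    'SH', 'AR', 'AI', 'SG', 'GR', 'AG', 'TG', 'TI', 'VD', 'VS', 'NE', 'GE', 'JU',
}

def split_university(text):
    """Split 'Name - XX' into (name, canton).

    The canton is '' (and the name is left untouched) when the suffix
    is not a known canton abbreviation.
    """
    name, sep, suffix = text.rpartition(' - ')
    match = re.search(r'\b([A-Z]+)\b$', suffix.strip())
    if sep and match is not None and match.group(1) in CANTONS:
        return name.strip(), match.group(1)
    return text.strip(), ''

print(split_university('Universität Zürich - ZH'))
print(split_university('Université de Lausanne - LA'))     # 'LA' is not a canton
print(split_university('Some-Hyphenated Institute - ZH'))  # made-up name
```

Suffixes such as `LA` or `EPFL` are not canton abbreviations, which is why the keyword lookup in `guess_canton` is still needed for those rows.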


Let's now group those rows by canton and university, while summing the approved amounts.

In [160]:
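# Sum only the numeric column; recent pandas versions would otherwise also try to "sum" the text columns
grouped = with_canton.groupby(['Canton', 'University'])[['Approved Amount']].sum().reset_index()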
grouped

,Canton,University,Approved Amount
0,,AO Research Institute - AORI,3.435621e+06
1,,Allergie- und Asthmaforschung - SIAF,1.916996e+07
2,,Biotechnologie Institut Thurgau - BITG,2.492535e+06
3,,Centre de rech. sur l'environnement alpin - CR...,1.567678e+06
4,,Eidg. Anstalt für Wasserversorgung - EAWAG,7.397585e+07
5,,"Eidg. Forschungsanstalt für Wald,Schnee,Land -...",4.836039e+07
6,,Eidg. Hochschulinstitut für Berufsbildung - EHB,2.086572e+06
7,,Eidg. Material und Prüfungsanstalt - EMPA,5.793069e+07
8,,Ente Ospedaliero Cantonale - EOC,5.067172e+06
9,,Fachhochschule Kalaidos - FHKD,1.090280e+06
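
A small sketch of the grouping step on made-up rows. Selecting 'Approved Amount' explicitly before `sum()` is an assumption about what is wanted here: depending on the pandas version, summing every column may drop text columns such as 'Institution' or concatenate their strings.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Institution': ['Inst A', 'Inst B', 'Inst C'],
    'University': ['Uni X', 'Uni X', 'Uni Y'],
    'Canton': ['ZH', 'ZH', 'BE'],
    'Approved Amount': [100.0, 50.0, 10.0],
})

# One row per (Canton, University) pair, with the amounts added up
totals = (
    toy.groupby(['Canton', 'University'])['Approved Amount']
       .sum()
       .reset_index()
)
print(totals)
```

The result is sorted by the group keys, so rows with an empty canton string come first, as in the output above.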


To ease subsequent manipulation of the data, we now add a column that tells whether a row has a known canton associated with it.

In [161]:
def is_known_canton(x, axis):
    return x.strip() in cantons

wc = grouped.copy()
wc['IsCanton'] = wc['Canton'].apply(is_known_canton, axis=1)

In [163]:
len(wc[wc['IsCanton'] == False])

53

In [164]:
wc[wc['IsCanton'] == False]

,Canton,University,Approved Amount,IsCanton
0,,AO Research Institute - AORI,3435621.0,False
1,,Allergie- und Asthmaforschung - SIAF,19169960.0,False
2,,Biotechnologie Institut Thurgau - BITG,2492535.0,False
3,,Centre de rech. sur l'environnement alpin - CR...,1567678.0,False
4,,Eidg. Anstalt für Wasserversorgung - EAWAG,73975850.0,False
5,,"Eidg. Forschungsanstalt für Wald,Schnee,Land -...",48360390.0,False
6,,Eidg. Hochschulinstitut für Berufsbildung - EHB,2086572.0,False
7,,Eidg. Material und Prüfungsanstalt - EMPA,57930690.0,False
8,,Ente Ospedaliero Cantonale - EOC,5067172.0,False
9,,Fachhochschule Kalaidos - FHKD,1090280.0,False


In [166]:
wc[wc['IsCanton']]

,Canton,University,Approved Amount,IsCanton
53,BE,Berner Fachhochschule - BFH,31028700.0,True
54,BE,Forschungskommission SAGW,100000.0,True
55,BE,Pädagogische Hochschule Bern - PHBern,1836136.0,True
56,BE,Robert Walser-Stiftung Bern - RWS,569579.0,True
57,BE,Universität Bern,1490646000.0,True
58,BS,Staatsunabh. Theologische Hochschule Basel - STHB,17300.0,True
59,BS,Universität Basel,1326427000.0,True
60,FR,Haute école pédagogique fribourgeoise - HEPFR,1547498.0,True
61,FR,Université de Fribourg,448092400.0,True
62,GE,Université de Genève,1810170000.0,True


We initially used the Geonames API but eventually switched to Google Places. We keep the code below for posterity.

In [167]:
def load_geo():
    '''
    params = {
        'username': 'ada_drs3',
        'country': 'CH',
        'type': 'json'
    }

    def geoname_query(q):
        params['q'] = q
        # print('Searching for %s...' % q)
        return requests.get('http://api.geonames.org/search', params)

    def search_by(col):
        for i in wc[wc['IsCanton'] == False].index:
            row = wc.iloc[i]
            res = geoname_query(row[col].strip())
            json = res.json()

            if json['totalResultsCount'] > 0:
                canton = json['geonames'][0]['adminCode1']
                print('=> Found ' + canton)
                wc.set_value(i,'Canton', canton)

    #search_by('University')
    #search_by('Canton')
    '''

Okay, let's get to business, and ask the almighty Google what they think of our data:

In [114]:
# Get the Google Place ID from the given university name
def get_place_id(uni):
    url = 'https://maps.googleapis.com/maps/api/place/textsearch/json?'
    params = {
        'query': uni,
        'key': GAPI_KEY
    }
    res = requests.get(url, params=params).json()
    if res['status'] == 'OK':
        return res['results'][0]['place_id']
    else:
        print(res)
        return None

In [185]:
# Query the Google Geocode API for information about the given Place ID.
def get_geocode_info(place_id):
    url = 'https://maps.googleapis.com/maps/api/geocode/json?'
    params = {
        'place_id': place_id,
        'key': GAPI_KEY
    }
    res = requests.get(url, params=params).json()
    if res['status'] == 'OK':
        return res['results']
    else:
        print(res)
        # Return None (not '') so that get_geo_info can skip failed lookups
        return None

We can now either query the Google API or load the data we gathered last time from disk, which avoids having to renew our API key every time we rerun the whole notebook.

In [184]:
query_api = False

place_ids = {}
geocodes = {}

if query_api:
    for uni in wc['University']:
        print('GMap request for %s' % uni)
        place_ids[uni] = get_place_id(uni)
        if place_ids[uni] is not None:
            geocodes[uni] = get_geocode_info(place_ids[uni])
        else:
            geocodes[uni] = None

    # Cache the responses on disk for later runs
    with open('place_ids.p', 'wb') as f:
        pickle.dump(place_ids, f)
    with open('geocodes.p', 'wb') as f:
        pickle.dump(geocodes, f)

else:
    with open('place_ids.p', 'rb') as f:
        place_ids = pickle.load(f)
    with open('geocodes.p', 'rb') as f:
        geocodes = pickle.load(f)
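
The fetch-or-load logic above can also be wrapped in a small helper. The sketch below is one way to do it: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in for the real API calls.

```python
import os
import pickle
import tempfile

def load_or_fetch(path, fetch):
    """Return the object pickled at `path`.

    If the file does not exist yet, call `fetch()`, pickle its result
    to `path`, and return it.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    data = fetch()
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return data

# Stand-in for the real API call; records how often it is invoked
calls = []
def fake_fetch():
    calls.append(1)
    return {'Uni X': 'place-id-123'}

cache_path = os.path.join(tempfile.mkdtemp(), 'place_ids.p')
first = load_or_fetch(cache_path, fake_fetch)   # calls fake_fetch and writes the file
second = load_or_fetch(cache_path, fake_fetch)  # reads the file, no second call
print(first == second, len(calls))
```

Unlike the explicit `query_api` flag, this variant decides by the presence of the cache file, so deleting the file is what forces a refresh.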

Extract the relevant geographical information from the retrieved data:

In [183]:
# Get the canton's abbreviation from the given geocode object
def get_short_name(geocode):
    short_names = [
        comp['short_name']
        for comp in geocode['address_components']
        if 'administrative_area_level_1' in comp['types']
    ]
    
    if len(short_names) > 0:
        return short_names[0]
    else:
        return None

# Get the locality's name from the given geocode object
def get_locality(geocode):
    localities = [
        comp['long_name']
        for comp in geocode['address_components']
        if 'locality' in comp['types']
    ]
    
    if len(localities) > 0:
        return localities[0]
    else:
        return None

# Get the position (lat, long) from the given geocode object
def get_location(geocode):
    return geocode['geometry']['location']
    
# Aggregate the data fetched with the functions above into a dict.
def get_geo_info(geocode):
    # Skip lookups that failed or returned no result
    if not geocode or geocode[0] is None:
        return None

    return {
        'canton':   get_short_name(geocode[0]),
        'locality': get_locality(geocode[0]),
        'location': get_location(geocode[0])
    }

# To each university, associate the Geo info computed above
uni_geo_infos = {}

for uni in geocodes:
    uni_geo_infos[uni] = get_geo_info(geocodes[uni])
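
To make the extraction concrete, here is a hand-written dictionary in the shape of a single geocoding result, reduced to the fields the functions above read. The values are illustrative (the coordinates are rounded), and `first_component` is a hypothetical helper that generalises `get_short_name` and `get_locality`.

```python
# Hand-written example of one geocoding result (illustrative values only)
sample_geocode = {
    'address_components': [
        {'long_name': 'Bern', 'short_name': 'Bern',
         'types': ['locality', 'political']},
        {'long_name': 'Bern', 'short_name': 'BE',
         'types': ['administrative_area_level_1', 'political']},
        {'long_name': 'Switzerland', 'short_name': 'CH',
         'types': ['country', 'political']},
    ],
    'geometry': {'location': {'lat': 46.95, 'lng': 7.45}},
}

def first_component(geocode, kind, field):
    """Return `field` of the first address component tagged `kind`, or None."""
    return next(
        (comp[field] for comp in geocode['address_components']
         if kind in comp['types']),
        None,
    )

info = {
    'canton': first_component(sample_geocode, 'administrative_area_level_1', 'short_name'),
    'locality': first_component(sample_geocode, 'locality', 'long_name'),
    'location': sample_geocode['geometry']['location'],
}
print(info)
```

For places outside Switzerland, the first-level administrative area is not a canton, so the resulting abbreviation has to be checked against the list of cantons afterwards.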

Merge the geographic information we gathered manually with the results retrieved from the Google API.

In [182]:
from uni_geo_infos_manual import uni_geo_infos_manual

for uni in uni_geo_infos_manual:
    uni_geo_infos[uni] = uni_geo_infos_manual[uni]

And merge this information with our main dataframe:

In [179]:
def load_from_uni_geo_info(uni, axis=None):
    if uni in uni_geo_infos and uni_geo_infos[uni] is not None:
        return uni_geo_infos[uni]['canton']
    return ''

wc['Canton']   = wc['University'].apply(load_from_uni_geo_info, axis=1)
wc['IsCanton'] = wc['Canton'].apply(is_known_canton, axis=1)
wc[wc['IsCanton'] == False]

,Canton,University,Approved Amount,IsCanton
13,,Firmen/Privatwirtschaft - FP,109180100.0,False
16,HE,Forschungsinstitut für biologischen Landbau - ...,7442410.0,False
29,Lazio,Istituto Svizzero di Roma - ISR,141000.0,False
31,,"NPO (Biblioth., Museen, Verwalt.) - NPO",322996000.0,False
45,,Schweizer Kompetenzzentrum Sozialwissensch. - ...,34732820.0,False
50,,Weitere Institute - FINST,9256736.0,False
51,,Weitere Spitäler - ASPIT,10749810.0,False


Let's now see how we're doing at the canton-guessing game:

In [180]:
from __future__ import division

n_total = len(wc)
n_known = int(wc['IsCanton'].sum())

print('Number of entries: %d' % n_total)
print('Number of entries with known canton: %d' % n_known)
print('Ratio of known cantons to total number of entries: ' + repr(n_known / n_total))

Number of entries: 76
Number of entries with known canton: 69
Ratio of known cantons to total number of entries: 0.9078947368421053
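
Note that 69/76 ≈ 0.908 is the share of entries with a known canton; the share of missing cantons is 7/76 ≈ 0.092. Since 'IsCanton' is a boolean column, both can be read off directly, as this sketch on toy data shows.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Canton': ['BE', '', 'GE', 'Lazio'],
    'IsCanton': [True, False, True, False],
})

n_total = len(toy)
n_known = int(toy['IsCanton'].sum())     # True counts as 1
share_known = toy['IsCanton'].mean()     # fraction of rows that are True
share_missing = 1 - share_known

print(n_total, n_known, share_known, share_missing)
```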


Let's drop the rows with no matching canton (displayed above): they either do not correspond to a university or are located outside Switzerland.

In [174]:
final_wc = wc[wc['IsCanton'] == True].drop(['IsCanton'], axis=1)
final_wc

,Canton,University,Approved Amount
53,BE,Berner Fachhochschule - BFH,31028700.0
54,BE,Forschungskommission SAGW,100000.0
55,BE,Pädagogische Hochschule Bern - PHBern,1836136.0
56,BE,Robert Walser-Stiftung Bern - RWS,569579.0
57,BE,Universität Bern,1490646000.0
58,BS,Staatsunabh. Theologische Hochschule Basel - STHB,17300.0
59,BS,Universität Basel,1326427000.0
60,FR,Haute école pédagogique fribourgeoise - HEPFR,1547498.0
61,FR,Université de Fribourg,448092400.0
62,GE,Université de Genève,1810170000.0


Now that we've linked a canton to every university above, we can group them by canton and sum the amount of money allocated.

In [171]:
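# Sum only the numeric column, leaving the 'University' text column out of the aggregation
grouped_wc = final_wc.groupby('Canton')[['Approved Amount']].sum().reset_index()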
grouped_wc

,Canton,Approved Amount
0,AG,115269000.0
1,BE,1526267000.0
2,BL,3476142.0
3,BS,1366673000.0
4,FR,449639900.0
5,GE,1857647000.0
6,GR,36538320.0
7,JU,34790350.0
8,LU,48820480.0
9,NE,398615800.0


Some cantons do not appear in the above dataframe, either because they don't host a university, or because we failed to process the relevant information (though given which cantons are missing, the former is much more likely).

We're nonetheless going to add them to the dataframe, as Folium requires a corresponding row for each canton.

In [168]:
missing_cantons = [canton for canton in cantons if canton not in grouped_wc['Canton'].values]

# One row with a zero amount for each canton that has no entry yet
missing_rows = pd.DataFrame({
    'Canton': missing_cantons,
    'Approved Amount': 0.0
})

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
# (and avoids overwriting the `df` variable defined at the top of the notebook)
with_all_cantons = pd.concat([grouped_wc, missing_rows], ignore_index=True)

with_all_cantons

,Approved Amount,Canton
0,115269000.0,AG
1,1526267000.0,BE
2,3476142.0,BL
3,1366673000.0,BS
4,449639900.0,FR
5,1857647000.0,GE
6,36538320.0,GR
7,34790350.0,JU
8,48820480.0,LU
9,398615800.0,NE
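
A possible alternative to adding the missing rows explicitly is to reindex on the full list of cantons. The sketch below uses a shortened canton list and made-up amounts.

```python
import pandas as pd

CANTONS = ['ZH', 'BE', 'GE', 'UR']   # shortened list, for illustration only

# Toy totals for two cantons
toy = pd.DataFrame({'Canton': ['BE', 'GE'], 'Approved Amount': [1.5, 2.5]})

filled = (
    toy.set_index('Canton')
       .reindex(CANTONS, fill_value=0.0)   # adds a zero row for each absent canton
       .rename_axis('Canton')              # make sure the index keeps its name
       .reset_index()
)
print(filled)
```

With this approach the rows come out in the order of the canton list rather than with the missing cantons appended at the end.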


We now express the approved amount per canton in millions of Swiss francs:

In [170]:
scaled_cantons = with_all_cantons.copy()
scaled_cantons['Approved Amount'] = with_all_cantons['Approved Amount'] / 1e6
scaled_cantons

,Approved Amount,Canton
0,115.268969,AG
1,1526.266616,BE
2,3.476142,BL
3,1366.673453,BS
4,449.639858,FR
5,1857.646558,GE
6,36.538316,GR
7,34.790345,JU
8,48.820483,LU
9,398.61578,NE


And write the data down to a file, just in case.

In [147]:
with open('all_cantons.p', 'wb') as f:
    pickle.dump(scaled_cantons, f)