# 03 - Interactive Viz

## Deadline

Wednesday November 8th, 2017 at 11:59PM

## Important Notes

- Make sure you push your notebook to GitHub with all the cells already evaluated.
- Note that maps do not render in a standard GitHub environment: you should export them to HTML and link them in your notebook.
- Remember that `.csv` is not the only data format. Though they might require additional processing, some formats provide better encoding support.
- Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
- Please write all your comments in English, and use meaningful variable names in your code

## Background

In this homework we will be exploring interactive visualization, which is a key ingredient of many successful data visualizations (especially when it comes to infographics).

Unemployment rates are major economic metrics and a matter of concern for governments around the world. Though the definition may seem straightforward at first glance (usually the number of unemployed people divided by the active population), it can be tricky to define consistently. For example, one must define what exactly "unemployed" means: looking for a job? Having declared their unemployment? Currently without a job? Should students or recent graduates be included? We could also ask what the active population is: everyone in an age category (e.g. `16-64`)? Anyone interested in finding a job? Though these questions may seem subtle, they can have a large impact on the interpretation of the results: `3%` unemployment doesn't mean much if we don't know who is included in this percentage.
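
To make the point about definitions concrete, here is a minimal sketch with made-up head counts (the numbers and variable names are assumptions for illustration, not real statistics). It shows how the same region gets two different "unemployment rates" depending on who is counted in the numerator.

```python
# Hypothetical head counts for one region (illustrative only, not real data).
registered_unemployed = 3_000   # registered as unemployed and without a job
employed_job_seekers = 1_500    # already have a job but are looking for another
active_population = 100_000     # people counted as economically active


def rate(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Definition A: only registered unemployed people count.
narrow_rate = rate(registered_unemployed, active_population)
# Definition B: every job seeker counts, employed or not.
broad_rate = rate(registered_unemployed + employed_job_seekers, active_population)

print(f"Registered unemployed only: {narrow_rate}%")
print(f"All job seekers:            {broad_rate}%")
```

With these assumed counts the two definitions differ by 1.5 percentage points, even though nothing about the underlying population changed.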

In this homework you will be dealing with two different datasets from the statistics offices of the European Commission ([eurostat](http://ec.europa.eu/eurostat/data/database)) and the Swiss Confederation ([amstat](https://www.amstat.ch)). They provide a variety of datasets with plenty of information on many different statistics and demographics at their respective scales. Unfortunately, as is often the case in data analysis, these websites are not always straightforward to navigate. They may include a lot of obscure categories, not always be translated into your native language, have strange link structures, … Navigating this complexity is part of a data scientist's job: you will have to use a few tricks to get the right data for this homework.

For the visualization part, install [Folium](https://github.com/python-visualization/folium) (*HINT*: it is not available in your standard Anaconda environment, so search the Web for how to install it easily!). Folium's `README` comes with very clear examples, and links to their own IPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find two `.topojson` files, containing the geo-coordinates of:

- European countries (*liberal definition of EU*) (`topojson/europe.topojson.json`, [source](https://github.com/leakyMirror/map-of-europe))
- Swiss cantons (`topojson/ch-cantons.topojson.json`) 

These will be used as an overlay on the Folium maps.

## Assignment

1. Go to the [eurostat](http://ec.europa.eu/eurostat/data/database) website and try to find a dataset that includes the European unemployment rates at a recent date.

   Use this data to build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows the unemployment rate in Europe at a country level. Think about [the colors you use](https://carto.com/academy/courses/intermediate-design/choose-colors-1/), how you decide to [split the intervals into data classes](http://gisgeography.com/choropleth-maps-data-classification/), and which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

2. Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through. 

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of Swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time exclude from your statistics the people who already have a job and are looking for a new one. How do your observations change? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

3. Use the [amstat](https://www.amstat.ch) website again to find a dataset that includes the unemployment rates in Switzerland at a recent date, this time making a distinction between *Swiss* and *foreign* workers.

   The State Secretariat for Economic Affairs (SECO) releases [a monthly report](https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html) on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for *foreign* (`5.1%`) and *Swiss* (`2.2%`) workers.

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (*HINT*: the easy way is to show two separate maps, but can you think of something better?). Where are the differences most visible? Why do you think that is?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

4. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in unemployment rates between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?
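
One possible way to approach the rough estimate asked for in the bonus question is a population-weighted average of the cantonal rates on each side. The sketch below is only an illustration: the rates, the active population figures, the set of cantons, and the east/west assignment are all assumptions, and bilingual cantons would need a judgment call.

```python
# Hypothetical per-canton figures: (unemployment rate in %, estimated active population).
# These numbers are illustrative only and are not taken from amstat.
cantons = {
    "GE": (5.0, 200_000),
    "VD": (4.0, 400_000),
    "ZH": (3.0, 800_000),
    "UR": (1.0, 20_000),
}

# Assumed split: which cantons are treated as lying west of the Röstigraben.
west_ids = {"GE", "VD"}
east_ids = set(cantons) - west_ids


def weighted_rate(ids):
    """Unemployment rate of a group of cantons, weighted by active population."""
    unemployed = sum(cantons[c][0] * cantons[c][1] / 100 for c in ids)
    active = sum(cantons[c][1] for c in ids)
    return 100 * unemployed / active


west_rate = weighted_rate(west_ids)
east_rate = weighted_rate(east_ids)
print(f"West: {west_rate:.2f}%  East: {east_rate:.2f}%  "
      f"Gap: {west_rate - east_rate:.2f} points")
```

Weighting by active population avoids giving a small canton the same influence as a large one, which a plain average of the cantonal rates would do.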

In [None]:
import os
import pandas as pd
import json
import folium
import branca
from branca.colormap import ColorMap
import numpy as np
from jenks import jenks
print(folium.__version__ == '0.5.0')

In [None]:
DATA_FOLDER = './Data'
EU_DATA_FOLDER=DATA_FOLDER + '/LFS-quaterly-TotalUnemployment'

# LFS: Labour Force Survey
# URGAN: unemployment rates by gender, age and nationality
# the data is split across sheets; use the 'Data' sheet with the total (male + female)
urgan_data = pd.read_excel(EU_DATA_FOLDER + '/lfsq_urgan-2.xls', sheetname='Data', header=11)

# keep the most recent unemployment rates, i.e. from the second quarter of 2017
urgan = urgan_data[['GEO/TIME','2017Q2']]

# drop the Europe-wide stats, the ":" and NaN rows, and the special values
urgan = urgan.drop([0,1,2,3,4,5,41,39,40],axis=0)
# rename the columns
urgan = urgan.rename(columns={'GEO/TIME': 'Country', '2017Q2': 'Unemployement (%)'})

# change country names so they can be merged with the topojson data later
urgan = urgan.set_value(10,'Country','Germany')
urgan = urgan.set_value(37,'Country','The former Yugoslav Republic of Macedonia')

#create data frame with countries id  and names from the topo
topo= pd.read_json('./topojson/europe.topojson.json',typ= 'series')
eu_topo= pd.DataFrame(topo['objects']['europe']['geometries'])
#get ids and names
countries_id=eu_topo['id']
countries_name=pd.DataFrame(list(eu_topo['properties'].values))
#create df with id and name
eu_code=pd.DataFrame()
eu_code['id']=countries_id
eu_code['Country']=countries_name['NAME']
eu_code =eu_code.sort_values('Country')
eu_code.head(100)

# merge ids and unemployment rates using the country's name
urgan = pd.merge(eu_code,urgan, 'inner')
urgan = urgan.set_index('id')
urgan

In [None]:
m_eu = folium.Map([46, 6], tiles='Mapbox Bright', zoom_start=4)

unemployed_series = urgan['Unemployement (%)']
#sequential colormap
colorscale = branca.colormap.linear.YlGnBu.scale(min(unemployed_series), max(unemployed_series))
colorscale.caption= 'Unemployment Rate (%)'


def style_function(feature):
    unemployed = unemployed_series.get(feature['id'][-5:], None)
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        # 'black' is a valid CSS color name ('#black' is not)
        'fillColor': 'black' if unemployed is None else colorscale(unemployed),
    }
    
folium.TopoJson(open('./topojson/europe.topojson.json'),
                'objects.europe',style_function=style_function).add_to(m_eu)


m_eu.add_child(colorscale)


m_eu

In [None]:
m_eu2 = folium.Map([46, 6], tiles='Mapbox Bright', zoom_start=4)

# Jenks natural breaks
bins = jenks(list(urgan['Unemployement (%)'].values), 5)


def highlight_function(feature):
    return {
        'weight': 2,
        'fillOpacity': 1
    }

eu_geo = json.load(open(r'./topojson/europe.topojson.json'))

m_eu2.choropleth(geo_data=eu_geo, topojson='objects.europe', data=unemployed_series,
                 columns=['Unemployement (%)'],
                 key_on='feature.id',
                 fill_color='YlGnBu',
                 fill_opacity=0.5,
                 line_opacity=0.2,
                 legend_name='Unemployment Rate (%)',
                 threshold_scale=bins,
                 highlight=True)

m_eu2
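
The map above uses Jenks natural breaks to split the rates into classes. As a point of comparison, the sketch below computes two simpler classification schemes, equal intervals and quantiles, with the standard library only. The list of rates is hypothetical, and the helper names `equal_interval_breaks` and `quantile_breaks` are ours, not part of Folium or `jenks`.

```python
import statistics

# Hypothetical unemployment rates in percent (illustrative, skewed on purpose).
rates = [2.9, 3.8, 4.1, 4.6, 5.5, 6.2, 7.0, 9.1, 11.3, 17.2, 21.6]


def equal_interval_breaks(values, n_classes):
    """Class edges that cut the value range into bins of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_classes
    return [lo + i * step for i in range(n_classes + 1)]


def quantile_breaks(values, n_classes):
    """Class edges that put roughly the same number of values in each bin."""
    inner = statistics.quantiles(values, n=n_classes, method="inclusive")
    return [min(values)] + inner + [max(values)]


equal_breaks = equal_interval_breaks(rates, 5)
quant_breaks = quantile_breaks(rates, 5)
print("Equal interval:", [round(b, 2) for b in equal_breaks])
print("Quantile:      ", [round(b, 2) for b in quant_breaks])
```

With skewed data like this, equal intervals leave most countries in the lowest classes, while quantiles spread them evenly at the cost of very unequal class widths. Either list of edges could be tried in place of `bins`.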

In [None]:
m_eu = folium.Map([46, 6], tiles='Mapbox Bright', zoom_start=4)

unemployed_series = urgan['Unemployement (%)']
swiss_ur= urgan.get_value('CH','Unemployement (%)')

def color(value, feature):
    country_id = feature['id'][-5:]
    rate = value.get(country_id, None)
    if rate is None:
        # missing data: dashed black outline
        return {
            'color': 'black',
            'weight': 1,
            'dashArray': '5, 5'
        }
    elif rate > swiss_ur:
        # higher unemployment than Switzerland
        return {
            'fillOpacity': 0.5,
            'weight': 0.5,
            'fillColor': 'red',
            'color': 'red'
        }
    elif rate == swiss_ur and country_id != 'CH':
        # same rate as Switzerland
        return {
            'fillOpacity': 0.5,
            'weight': 0.5,
            'lineOpacity': 1,
            'fillColor': None
        }
    elif rate == swiss_ur and country_id == 'CH':
        # Switzerland itself: thick outline, no fill
        return {
            'lineOpacity': 1,
            'fillOpacity': 0,
            'fillColor': None,
            'weight': 3
        }
    else:
        # lower unemployment than Switzerland
        return {
            'fillOpacity': 0.5,
            'weight': 0.5,
            'fillColor': 'green',
            'color': 'green'
        }
    
colors_colormap = ['red' ,'green', None,'black' ]

#new_colormap = ColorMap(colors_colormap,
                            #caption='Compare with Switzerland')
#new_colormap.render(tick_labels = ['-','+','=','missing data'])
        
def style_function(feature):
    return color(unemployed_series, feature)

folium.TopoJson(open('./topojson/europe.topojson.json'),'objects.europe',style_function=style_function).add_to(m_eu)
#m_eu.add_child(new_colormap)

m_eu


We first load a dataset from [amstat](https://www.amstat.ch) with the data for September 2017. We took into account the unemployment rate, the number of registered unemployed people, the number of long-term unemployed people (who have not had a job for more than two years), the number of job seekers, and the number of employed job seekers. The data are available for each canton.

In [None]:
#load data
swiss_unemployement = pd.read_csv('Data/swiss_unemployement.csv',header=1,quotechar='"',encoding='utf-16le')

#clean the data
swiss_unemployement.drop(['Mois','Septembre 2017','Septembre 2017.1','Septembre 2017.2','Septembre 2017.3','Septembre 2017.4'],1,inplace=True)
swiss_unemployement.columns = [swiss_unemployement.columns[0],
                               swiss_unemployement['Total'][0],
                               swiss_unemployement['Total.1'][0],
                               swiss_unemployement['Total.2'][0],
                               swiss_unemployement['Total.3'][0],
                               swiss_unemployement['Total.4'][0]]

swiss_unemployement.drop(0,0,inplace=True)
swiss_unemployement.reset_index(drop=True,inplace=True)

swiss_unemployement['Taux de chômage']=swiss_unemployement['Taux de chômage'].astype(float)
swiss_unemployement['Chômeurs inscrits'] = swiss_unemployement['Chômeurs inscrits'].str.replace('\'','').astype('int')
swiss_unemployement['Chômeurs de longue durée'] = swiss_unemployement['Chômeurs de longue durée'].str.replace('\'','').astype('int')
swiss_unemployement['Demandeurs d\'emploi'] = swiss_unemployement['Demandeurs d\'emploi'].str.replace('\'','').astype('int')
swiss_unemployement['Demandeurs d\'emploi non chômeurs'] = swiss_unemployement['Demandeurs d\'emploi non chômeurs'].str.replace('\'','').astype('int')

To compute meaningful statistics, we need the active population of each canton. Since it is not given on the website, we estimate it by dividing the number of job seekers by the unemployment rate (as explained in the assignment description). We also compute the percentage of long-term unemployed people among all unemployed people, the rate of registered unemployed people (which is lower than the unemployment rate), and the percentage of unemployed people among all job seekers.
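
As a worked example of this estimate, the sketch below inverts the rate definition for a single canton. The three input figures are hypothetical and chosen to give round results; only the formulas mirror what we compute on the real columns.

```python
# Hypothetical figures for one canton (not taken from amstat).
job_seekers = 9_000             # "Demandeurs d'emploi"
registered_unemployed = 6_000   # "Chômeurs inscrits"
published_rate = 4.5            # "Taux de chômage", in percent

# rate = 100 * job_seekers / active_population, solved for active_population
active_population = 100 * job_seekers / published_rate

# Rate that counts only registered unemployed people.
registered_rate = 100 * registered_unemployed / active_population
# Share of job seekers who are actually unemployed.
share_unemployed = 100 * registered_unemployed / job_seekers

print(f"Estimated active population: {active_population:.0f}")
print(f"Registered unemployed rate:  {registered_rate:.1f}%")
print(f"Unemployed among job seekers: {share_unemployed:.1f}%")
```

Because the published rate is rounded to one decimal, the active population recovered this way is only an approximation, and the error is largest for cantons with low rates.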

In [None]:
#adding new columns

swiss_unemployement['Population active (estimation)'] = (100 * swiss_unemployement['Demandeurs d\'emploi'].astype(float)
                                                         /swiss_unemployement['Taux de chômage']).astype(int)
swiss_unemployement['Taux de chômeurs longue durée'] = 100*(swiss_unemployement['Chômeurs de longue durée'].astype(float)
                                                         /(swiss_unemployement['Chômeurs de longue durée'] + swiss_unemployement['Chômeurs inscrits']))
swiss_unemployement['Taux de chômeurs inscrits'] = 100*(swiss_unemployement['Chômeurs inscrits'].astype(float)
                                                    /swiss_unemployement['Population active (estimation)'])
swiss_unemployement['Taux de chômeurs dans les demandeurs d\'emploi'] = 100*(swiss_unemployement['Chômeurs inscrits'].astype(float)
                                                                         /swiss_unemployement['Demandeurs d\'emploi'] )


swiss_unemployement

We can then load the map. To synchronize the map with the dataset, both need to share the same labels. The canton names are in different languages in the map and in the dataset, but fortunately they appear in the same order (the traditional order of the Swiss cantons). We can therefore simply add a column to the dataset holding the canton IDs given in the JSON map.
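
Matching by position is fragile, so it can help to check the pairing before relying on it. The sketch below uses a small hand-written dictionary that only mimics the `objects` / `cantons` / `geometries` nesting we read from the file. The three entries, the `properties` field, and the dataset names are assumptions for illustration.

```python
# Hypothetical excerpt mimicking the nesting of the TopoJSON file.
topo = {"objects": {"cantons": {"geometries": [
    {"id": "ZH", "properties": {"name": "Zürich"}},
    {"id": "BE", "properties": {"name": "Bern/Berne"}},
    {"id": "LU", "properties": {"name": "Luzern"}},
]}}}

# Canton names as they might appear in the French-language dataset, same order.
dataset_cantons = ["Zurich", "Berne", "Lucerne"]

geometries = topo["objects"]["cantons"]["geometries"]

# Positional matching only makes sense if both sides have the same length.
if len(geometries) != len(dataset_cantons):
    raise ValueError("dataset and map do not contain the same number of cantons")

# Pair each dataset name with the map id found at the same position.
canton_ids = {name: geom["id"] for name, geom in zip(dataset_cantons, geometries)}

# Print the pairs side by side so a mismatch is easy to spot by eye.
for name, geom in zip(dataset_cantons, geometries):
    print(f"{name:10s} -> {geom['id']}  ({geom['properties']['name']})")
```

Printing the pairs once is a cheap safeguard: if a row is missing or reordered in either source, every canton after it would otherwise be colored with a neighbour's value.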

In [None]:
#load map

canton_topo_path = os.path.join('topojson','ch-cantons.topojson.json')
topo_json_data = json.load(open(canton_topo_path))

In [None]:
#add id to synchronize the dataset and the map

N_canton = len(topo_json_data['objects']['cantons']['geometries'])
swiss_unemployement['id']=swiss_unemployement['Canton']

for i in range(N_canton):
    # assign through .loc on the frame itself to avoid chained assignment
    swiss_unemployement.loc[i, 'id'] = topo_json_data['objects']['cantons']['geometries'][i]['id']

In [None]:
#define the map

center_coord = [46.8011111,8.2266667]

swiss_map = folium.Map(location=center_coord,
            tiles='cartodbpositron',           
            zoom_start=7.5)


def add_layer_to_map(dataset,input_map,path,column,name,scale_min,scale_max):
    scale = jenks(list(dataset[column].values), 5)
    serie = dataset.set_index('id')[column]
    input_map.choropleth(geo_data=open(path), topojson='objects.cantons', data=serie,
        columns=['id', column],
        key_on='id',
        threshold_scale=scale,
        fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.2,
        name=name,
        legend_name=name,
        highlight = True)

    
add_layer_to_map(swiss_unemployement,swiss_map,canton_topo_path,
                 'Taux de chômage','Unemployment rate',0.,6.)
add_layer_to_map(swiss_unemployement,swiss_map,canton_topo_path,
                 'Taux de chômeurs longue durée','Percentage of long term unemployed people',5.,22.)
add_layer_to_map(swiss_unemployement,swiss_map,canton_topo_path,
                 'Taux de chômeurs inscrits','Percentage of registered unemployed people',0.,5.)
add_layer_to_map(swiss_unemployement,swiss_map,canton_topo_path,
                 'Taux de chômeurs dans les demandeurs d\'emploi','Percentage of unemployed people in job seekers',40.,80.)

folium.LayerControl().add_to(swiss_map)

In [None]:
swiss_map

First, this map shows the distribution of the unemployment rate across the Swiss cantons. Two major trends stand out. The Latin cantons (French- or Italian-speaking) seem to have a higher overall unemployment rate: the cantons with the highest rates (Geneva, Neuchâtel) are French-speaking, and the Latin cantons have on average a higher unemployment rate than the neighbouring German-speaking cantons. The second trend is that urban cantons (like Zürich or Basel-Stadt) generally have a higher unemployment rate, whereas cantons with a smaller share of urban population (like Uri or Schwyz) have a lower one.

In [None]:
data_path = os.path.join(os.path.dirname(os.getcwd()),'Homework3','Data','data_3.csv') 
map_path = os.path.join(os.path.dirname(os.getcwd()),'Homework3','topojson','ch-cantons.topojson.json') 

Load and clean data

In [None]:
df = pd.read_csv(data_path, skiprows=[0], encoding='utf-16le')
df = df.drop(['Mois'], axis=1)

new_column_names = df.values[0]
old_column_names = df.columns.values

for i in range(2,5):
    df = df.rename(columns={old_column_names[i]: new_column_names[i]})

df = df.drop([0], axis=0)
df = df.drop(df.columns[[3,5]], axis=1)
df = df.loc[:,~df.columns.duplicated()]

In [None]:
topo_json_data = json.load(open(map_path))
N_canton = len(topo_json_data['objects']['cantons']['geometries'])

df['id'] = df['Canton']

for i in range(N_canton):
    for j in range(2):
        df.set_value(2*i+j+1,'id',topo_json_data['objects']['cantons']['geometries'][i]['id'])


In [None]:
suisses = df[df['Nationalité'] == 'Suisses']
etrangers = df[df['Nationalité'] == 'Etrangers']

In [None]:

m = folium.Map([46.8011111,8.2266667], tiles='cartodbpositron', zoom_start=8)
serie_suisses = suisses.set_index('id')['Taux de chômage'].astype(float)
serie_etrangers = etrangers.set_index('id')['Taux de chômage'].astype(float)

scale_suisses = list(np.linspace(0.,4.8,6))
scale_etrangers = list(np.linspace(0.,9.,6))

m.choropleth(geo_data=open(map_path), topojson='objects.cantons', data=serie_suisses,
    columns=['id', 'Taux de chômage'],
    key_on='id',
    threshold_scale=scale_suisses,
    fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
    name='Swiss',
    legend_name='Unemployment Swiss',
    highlight = True)

m.choropleth(geo_data=open(map_path), topojson='objects.cantons', data=serie_etrangers,
    columns=['id', 'Taux de chômage'],
    key_on='id',
    threshold_scale=scale_etrangers,
    fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
    name='Foreign',
    legend_name='Unemployment Foreigners',
    highlight = True)

folium.LayerControl().add_to(m)

m
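
The two layers above have to be compared by eye. One alternative, sketched here with hypothetical rates for three cantons, is to derive a single value per canton (a gap in percentage points, or a ratio) that could then feed one choropleth layer. The numbers and dictionary names are assumptions for illustration.

```python
# Hypothetical unemployment rates in percent, keyed by canton id.
swiss_rates = {"GE": 4.0, "ZH": 2.5, "UR": 0.5}
foreign_rates = {"GE": 8.0, "ZH": 5.5, "UR": 2.0}

# Absolute gap in percentage points between foreign and Swiss workers.
gap = {c: round(foreign_rates[c] - swiss_rates[c], 1) for c in swiss_rates}
# Relative gap: how many times higher the foreign rate is.
ratio = {c: round(foreign_rates[c] / swiss_rates[c], 1) for c in swiss_rates}

for canton in swiss_rates:
    print(f"{canton}: gap = {gap[canton]} points, ratio = {ratio[canton]}x")

print("Largest gap:  ", max(gap, key=gap.get))
print("Largest ratio:", max(ratio, key=ratio.get))
```

The two measures can rank cantons differently: in this made-up example the gap is largest where both rates are high, while the ratio is largest where the Swiss rate is very low. Which one to map depends on the question being asked.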

Part with age differences

In [None]:
data_path = os.path.join(os.path.dirname(os.getcwd()),'Homework3','Data','data_4.csv') 
df = pd.read_csv(data_path, skiprows=[0], encoding='utf-16le')
df = df.drop(['Mois'], axis=1)
new_column_names = df.values[0]
old_column_names = df.columns.values

for i in range(2,5):
    df = df.rename(columns={old_column_names[i]: new_column_names[i]})

df = df.drop([0], axis=0)
df = df.drop(df.columns[[1,4,5,6]], axis=1)

df