In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import plotly.express as px
import os

import matplotlib.pyplot as plt
%matplotlib inline

# Exploration

The purpose of this notebook is to get familiar with the relevant data sets that are used during the course. You will see examples of how to work with numpy, pandas and plotting libraries. 

## Detailed population data

INSEE provides detailed population information per municipality as open data:

- Go to https://www.insee.fr/fr/statistiques/6544333
- Download "Individus localisés au canton-ou-ville - Zone A" in CSV format
- Note that "Zone A" includes only the Île-de-France region. Data for all of France, or for other regions and departments, is available in the other zones.
- Put the downloaded zip file into the `data` folder next to this notebook
- Unpack the zip file so that the CSV file is located inside the `data` folder
- Alternatively, run the following cell if you are using Linux

In [None]:
if not os.path.exists("data/FD_INDCVIZA_2019.csv"):
    !cd data && wget https://www.insee.fr/fr/statistiques/fichier/6544333/RP2019_INDCVIZA_csv.zip
    !cd data && unzip RP2019_INDCVIZA_csv.zip
    !cd data && rm RP2019_INDCVIZA_csv.zip

Next, load a chunk of the data to see what is contained in the file:

In [None]:
df_census = pd.read_csv("data/FD_INDCVIZA_2019.csv", sep = ";", nrows = 10)
df_census.head()

You will find information on the variables in the dataset by clicking on "Dictionnaire des variables" in the link above and downloading the corresponding PDF.

**Task**: For our first analysis, load the following columns. For performance reasons, it makes sense to define a data type for each column:
- Detailed age (by year) as `int`
- Socioprofessional category (Catégorie socioprofessionnelle en 8 postes) as `int`

Additionally, load the following columns:
- `IRIS` is an identifier for the location of the observation in France, load it as `str`
- `IPONDI` is a weight of each observation, load it as `float`

In [None]:
columns = {
    "IRIS": str,
    "IPONDI": float,

    "AGED": int,
    "CS1": int

}

df_census = pd.read_csv("data/FD_INDCVIZA_2019.csv", sep = ";", dtype = columns, usecols = columns.keys())
df_census.head()

**Task**: It is always better to work with a cleaned data set, so let's clean up the column names by renaming:
- The age column to `age`
- The socioprofessional category column to `csp`
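
As a minimal sketch of how column renaming works in pandas, the following uses a small made-up data frame (`df_toy` and its values are assumptions for illustration, not census data). It relies on `DataFrame.rename`, which returns a new data frame and leaves unlisted columns untouched:

```python
import pandas as pd

# Toy stand-in for the census data; all values are made up.
df_toy = pd.DataFrame({
    "IRIS": ["751140001", "751140002"],
    "IPONDI": [2.5, 3.0],
    "AGED": [34, 71],
    "CS1": [3, 7],
})

# rename returns a new data frame; columns that are not listed keep their names.
df_toy = df_toy.rename(columns={"AGED": "age", "CS1": "csp"})

print(list(df_toy.columns))
```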

In [None]:
# Insert your code here
# df_census = 

df_census.head()

The data set contains the official open census data from the French statistical office INSEE. Let's aggregate the data to obtain a data frame that gives us the number of observations at a certain age:

In [None]:
df_age = df_census.groupby("age").size().reset_index(name = "count")
df_age.head()

And plot this information using plotly:

In [None]:
px.bar(df_age, x = "age", y = "count")

Is this information correct? Write the code to calculate the total number of observations in the data set:

In [None]:
# Insert your code here


Compare this value with information from other sources such as Wikipedia. Do you see a difference? Why?

**Task**: Write the code to calculate the correct number of Île-de-France inhabitants:
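
The difference between counting rows and counting persons can be illustrated on toy data. This is only a sketch, assuming that the weight column states how many persons each observation stands for; `df_toy` and its values are made up:

```python
import pandas as pd

# Three toy observations, each representing a different number of persons.
df_toy = pd.DataFrame({
    "age": [30, 30, 45],
    "weight": [2.0, 3.5, 4.5],
})

# Number of rows in the sample
n_observations = len(df_toy)

# Number of persons represented by the sample (sum of the weights)
n_persons = df_toy["weight"].sum()

print(n_observations, n_persons)
```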

In [None]:
# Insert your code here


**Task:** Show a bar plot of both the count of *observations* at a specific age and the number of *persons*.

Hints: 
- You will need an aggregation function other than `size` (used before) in your `groupby` statement
- You will need to `merge` the existing `df_age` data frame and a new one that you create
- For the y-axis, you may pass a list of columns to plotly
- Try to use the `barmode = "group"` argument for plotly
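
The first two hints can be sketched on toy data as follows. The column names `weight`, `observations` and `persons` are assumptions chosen for this illustration:

```python
import pandas as pd

# Toy data: five observations at two different ages (values are made up).
df_toy = pd.DataFrame({
    "age": [30, 30, 45, 45, 45],
    "weight": [2.0, 3.0, 1.5, 2.5, 4.0],
})

# Count rows per age (as before) ...
df_count = df_toy.groupby("age").size().reset_index(name="observations")

# ... and sum the weights per age with a different aggregator.
df_persons = df_toy.groupby("age")["weight"].sum().reset_index(name="persons")

# Bring both measures side by side, one row per age.
df_both = pd.merge(df_count, df_persons, on="age")

print(df_both)
```

A data frame in this shape has one column per measure, which is what is needed when a list of columns is passed for the y-axis.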

In [None]:
# Insert your code here


**Task:** On average, how many persons are represented by one observation in the census data?

In [None]:
# Insert your code here


Let's explore the data a bit further. 

**Task:** Show the number of persons for each socioprofessional category in a plot.

Bonus: Instead of showing only CSP identifiers, can you show the name of the CSPs?

Remember, the socioprofessional category is a classification of persons in France according to their job status:
https://www.insee.fr/fr/metadonnees/pcs2003/categorieSocioprofessionnelleAgregee/1?champRecherche=true
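
For the bonus, one possible approach is to translate identifiers into labels with `Series.map` and a dictionary. The sketch below uses the French labels of the eight aggregated categories (please check them against the INSEE page above); `df_toy` and its values are made up:

```python
import pandas as pd

# Labels of the eight aggregated categories (verify against the INSEE page).
csp_names = {
    1: "Agriculteurs exploitants",
    2: "Artisans, commerçants et chefs d'entreprise",
    3: "Cadres et professions intellectuelles supérieures",
    4: "Professions intermédiaires",
    5: "Employés",
    6: "Ouvriers",
    7: "Retraités",
    8: "Autres personnes sans activité professionnelle",
}

# Toy observations with made-up weights.
df_toy = pd.DataFrame({"csp": [3, 7, 3], "weight": [2.0, 1.0, 4.0]})

# Look up the label for each identifier.
df_toy["csp_name"] = df_toy["csp"].map(csp_names)

# Sum the weights per label.
df_csp = df_toy.groupby("csp_name")["weight"].sum().reset_index(name="persons")

print(df_csp)
```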

In [None]:
# Insert your code here


**Task:** Show a line plot with one age distribution per CSP in different colors and use it to compare the age distributions of at least three CSPs. Aggregate the ages into groups of 10 years.

Hint:
- You will need to aggregate over two columns this time.
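
A minimal sketch of both ideas on toy data: integer division turns an age into the lower bound of its 10-year group, and `groupby` accepts a list of columns. The names `age_group`, `weight` and `persons` are assumptions for this illustration:

```python
import pandas as pd

# Toy observations (values are made up).
df_toy = pd.DataFrame({
    "age": [23, 27, 31, 38, 64],
    "csp": [5, 5, 3, 3, 7],
    "weight": [1.0, 2.0, 1.5, 2.5, 3.0],
})

# 23 -> 20, 38 -> 30, 64 -> 60
df_toy["age_group"] = df_toy["age"] // 10 * 10

# Aggregate over two columns at once.
df_groups = (
    df_toy.groupby(["age_group", "csp"])["weight"]
    .sum()
    .reset_index(name="persons")
)

print(df_groups)
```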

In [None]:
# Insert your code here


The previous analyses were performed in absolute terms. Let's move on to a relative analysis. We want to know which share of people belongs to each CSP at every age. The ages are represented by bars with the CSPs stacked on top of each other. Each bar has a height of `1.0` or 100%.

**Task:** Set up a stacked bar plot where all CSPs are shown per age.

Hints:
- Proceed as in the previous task, but perform a second aggregation by age.
- Via `merge`, append another column to the two-variable data set that describes this total
- Then, divide the absolute value by the group total
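
The three hints can be sketched on toy data like this. The frame `df_groups` stands in for the result of a two-column aggregation; all names and values are assumptions for illustration:

```python
import pandas as pd

# Toy result of a two-column aggregation (age x CSP).
df_groups = pd.DataFrame({
    "age": [30, 30, 45, 45],
    "csp": [3, 5, 3, 5],
    "persons": [6.0, 2.0, 1.0, 3.0],
})

# Second aggregation: total persons per age.
df_total = df_groups.groupby("age")["persons"].sum().reset_index(name="total")

# Append the total to every (age, csp) row.
df_share = pd.merge(df_groups, df_total, on="age")

# Share of each CSP within its age.
df_share["share"] = df_share["persons"] / df_share["total"]

print(df_share)
```

Within each age the shares add up to `1.0`, which is what gives every stacked bar the same height.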

In [None]:
# Insert your code here


## Spatial data

So far, we have only performed analyses over the whole IDF population. The data set contains a column called `IRIS`. This is a statistical zoning system that covers France. Each zone in that system has a unique identifier. It is constructed as follows:

- `[2]` digits are the department identifier
- `[3]` following digits describe the municipality
- `[4]` following digits describe the IRIS (sub-municipality zoning)

For instance, the 14e arrondissement in Paris has `75` as the department identifier, followed by `114` indicating the arrondissement. After that, there are four digits that describe smaller zones within the arrondissement, for instance:

`[75][114][0001]`

**Task:** For convenience, let's create additional columns that indicate the department and the municipality of an observation:
- `department_id`: The first two characters of the IRIS identifier (column `IRIS` in the census data)
- `municipality_id`: The first five characters of the IRIS identifier

Hint: IRIS are strings although they may appear as numbers. The reason is that the department codes for Corsica are 2A and 2B.
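
Because the identifiers are strings, prefixes can be extracted with the pandas `.str` accessor. A minimal sketch on made-up identifiers (the column name `iris_id` is an assumption here):

```python
import pandas as pd

# Toy identifiers, including one Corsican code that is not purely numeric.
df_toy = pd.DataFrame({"iris_id": ["751140001", "2A0040101", "913770102"]})

# Slice the string column: first two and first five characters.
df_toy["department_id"] = df_toy["iris_id"].str[:2]
df_toy["municipality_id"] = df_toy["iris_id"].str[:5]

print(df_toy)
```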

In [None]:
# Insert your code here


**Task:** Find out which department has the highest number of inhabitants. Which one is the least populated? Show an ordered list.

In [None]:
# Insert your code here


**Task:** Let's repeat the exercise by identifying the top 10 and bottom 10 municipalities:
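
One way to obtain such rankings is to sort once and take both ends of the result. The sketch below uses hypothetical identifiers and made-up numbers, and keeps two rows instead of ten for brevity:

```python
import pandas as pd

# Hypothetical zones with made-up inhabitant totals.
df_toy = pd.DataFrame({
    "municipality_id": ["00001", "00002", "00003", "00004"],
    "inhabitants": [136000.0, 36000.0, 40000.0, 120000.0],
})

# Largest values first.
df_sorted = df_toy.sort_values("inhabitants", ascending=False)

top = df_sorted.head(2)     # most populated zones
bottom = df_sorted.tail(2)  # least populated zones

print(top)
print(bottom)
```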

In [None]:
# Insert your code here


What are the names of those municipalities? You can look them up on Wikipedia, for instance, by searching for their INSEE codes.

## Mapping

Looking at spatial data works best when using maps. The IRIS system is not only a system of identifiers, but there is also geographic shape data attached to it. The data is provided by IGN (Institut Géographique National).

- Download the data from https://geoservices.ign.fr/contoursiris
- Make sure to download the 2021 edition which is compatible with our 2019 census data
- Unpack the 7z file. The relevant files for us are located in `CONTOURS*/1_DONNEES*/*LAMB93*/` (make sure that the last folder name contains `LAMB93`).
- Copy the files named `CONTOURS-IRIS.*` to the `data` folder next to this notebook

Linux users may execute the next cell:


In [None]:
if not os.path.exists("data/CONTOURS-IRIS.shp"):
    !cd data && wget https://data.geopf.fr/telechargement/download/CONTOURS-IRIS/CONTOURS-IRIS_2-1__SHP__FRA_2021-01-01/CONTOURS-IRIS_2-1__SHP__FRA_2021-01-01.7z
    !cd data && 7z x CONTOURS-IRIS_2-1__SHP__FRA_2021-01-01.7z -y
    !cd data && cp CONTOURS-IRIS_2-1__SHP__FRA_2021-01-01/CONTOURS-IRIS/1_DONNEES_LIVRAISON_2021-06-00217/CONTOURS-IRIS_2-1_SHP_LAMB93_FXX-2021/CONTOURS-IRIS.* .
    !cd data && rm CONTOURS-IRIS_2-1__SHP__FRA_2021-01-01.7z

Let's load the data using `geopandas`:

In [None]:
df_iris = gpd.read_file("data/CONTOURS-IRIS.shp")
df_iris.head()

As before, let's clean up the data set. We will need the following columns with the following readable names:
- `INSEE_COM`: `municipality_id`
- `CODE_IRIS`: `iris_id`
- `geometry`

**Task:** Set up the data set accordingly.

In [None]:
# Insert your code here


**Task**: Calculate how many IRIS and how many municipalities there are in France:

In [None]:
# Insert your code here


You can try plotting all IRIS or all municipalities, but this will usually take a while with the standard Python tools. Let's plot only Paris:

In [None]:
df_iris[
    df_iris["municipality_id"].str.startswith("75")
].plot()

**Task**: The spatial shapes alone are not very useful. We should attach some data to them. To simplify our life, let's create a data frame `df_municipalities`, based on `df_iris`, that only contains the municipality shapes.

Hint: Check the `dissolve` method in `geopandas`.

In [None]:
# Insert your code here
# df_municipalities = ...


**Task**: Plot all municipalities in the Essonne department (91).

In [None]:
# Insert your code here


**Task**: Show all municipalities in the "petite couronne" including Paris.

In [None]:
# Insert your code here
# filter_departments = [ ... ]


**Task**: Now we are ready to cross some information with the spatial data set:
- Equip your `df_census` data frame with a `municipality_id` column
- Prepare a data set that contains the number of inhabitants per municipality (`municipality_id`, `inhabitants`)
- Perform a merge between your municipality data frame and the inhabitant data frame
- Provide the inhabitants column in the `plot` method
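
The merge step can be sketched with plain pandas on toy data. Here `df_shapes` is a hypothetical stand-in for the municipality data frame (without geometries), and the identifiers and values are made up. A left merge keeps every zone and leaves `NaN` where no population value exists, which is worth checking before plotting:

```python
import pandas as pd

# Stand-in for the municipality shapes: three hypothetical zones.
df_shapes = pd.DataFrame({"municipality_id": ["A", "B", "C"]})

# Inhabitants are only known for two of the three zones.
df_inhabitants = pd.DataFrame({
    "municipality_id": ["A", "C"],
    "inhabitants": [1200.0, 800.0],
})

# Keep all shapes, even those without a matching population value.
df_map = pd.merge(df_shapes, df_inhabitants, on="municipality_id", how="left")

# Count the zones that have no data attached.
n_missing = int(df_map["inhabitants"].isna().sum())

print(df_map)
print(n_missing)
```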

In [None]:
# Insert your code here


Do you observe anything specific?

**Task**: Plot a population map of the "petite couronne" with a legend (using `legend = True`).

In [None]:
# Insert your code here


## Aggregated population data

To solve this issue, INSEE provides aggregated census data sets with fewer attributes but better spatial coverage. We will make use of a data set that indicates the total population and the population aged 15 and over per CSP for every municipality in France:

- The data is available at https://www.insee.fr/fr/statistiques/6543200
- Download "Population en 2019 - IRIS - France hors Mayotte" in CSV format
- Information on the variables is available in "Dictionnaire des variables"

Linux users may execute the following cell:

In [None]:
if not os.path.exists("data/base-ic-evol-struct-pop-2019.CSV"):
    !cd data && wget https://www.insee.fr/fr/statistiques/fichier/6543200/base-ic-evol-struct-pop-2019_csv.zip
    !cd data && unzip base-ic-evol-struct-pop-2019_csv.zip
    !cd data && rm base-ic-evol-struct-pop-2019_csv.zip

**Task**: Load the data set and have a first look:
- Only load a couple of lines (`nrows=20`) to be sure that you don't exceed your memory
- Look at the first few lines and check the explanation of the variables online
- How can you obtain the population total per municipality from this data set?
- How can you obtain the number of persons per CSP from this data set?

In [None]:
# Insert your code here


**Task**: Transform the data set such that you have each municipality together with the population total and the total of each CSP:

In [None]:
pd.DataFrame({ "municipality_id": [], "population": [], "csp_1": [], "csp_2": [], "csp_3": [], "csp_...": [] })

Hint: The data set is given per IRIS.

In [None]:
# Insert your code here
# df_population = ... 


**Task**: Repeat the task from above and create a map of the population in Île-de-France, but with the new data set. Note: The two data sets (geography and population totals) now cover all of France, so you can also plot the whole country.

In [None]:
# Insert your code here


Save the cleaned population data, because we will need it again in a later exercise:

In [None]:
df_population.to_parquet("data/population.parquet")

Let's do the same with the municipalities spatial data set:

In [None]:
df_municipalities.to_parquet("data/municipalities.parquet")

For mapping, Python can be useful to make a first draft, but there are more elaborate tools available. 

**Task**: Create a data frame in which the municipality data has been merged with the population data set, i.e., we want all columns from the population data set and additionally the `geometry` column. Filter for all municipalities in Île-de-France. Save this data frame in GeoPackage format:

In [None]:
# Insert your code here
# df_export =


In [None]:
df_export.to_file("export.gpkg")

**Exercise**: Explore the exported data using **QGIS**

![](material/qgis.png)

## Employment

In a later exercise, we will also need information on employment. Employment data per municipality is available as open data from Urssaf.

- Download the data from https://open.urssaf.fr/explore/dataset/etablissements-et-effectifs-salaries-au-niveau-commune-x-ape-last/information/
- Go to "Export" and export the data as CSV

Linux users may execute the following cell:

In [None]:
if not os.path.exists("data/etablissements-et-effectifs-salaries-au-niveau-commune-x-ape-last.csv"):
    !cd data && wget "https://open.urssaf.fr/api/explore/v2.1/catalog/datasets/etablissements-et-effectifs-salaries-au-niveau-commune-x-ape-last/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B" -O etablissements-et-effectifs-salaries-au-niveau-commune-x-ape-last.csv

**Task**: As before, explore the data by first loading a few rows and understanding the content.

Hint: To get a better overview of the available columns, try `df.columns`

In [None]:
# Insert your code here


**Task**: Clean the data set such that you have a column indicating the municipality identifier and the number of employees in 2019 in that zone.

Hint: The data frame is disaggregated over various economic sectors (NAF code), but we want the total! Also, make sure to read the municipality codes as a string.
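
Both points of the hint can be illustrated on a tiny in-memory CSV. The column names `code`, `sector` and `employees_2019` are hypothetical and do not correspond to the Urssaf file. Reading the code as `str` keeps leading zeros, and summing over the sectors gives one total per municipality:

```python
import io
import pandas as pd

# Hypothetical CSV content: one row per municipality and sector.
csv_text = "code;sector;employees_2019\n01004;A;10\n01004;B;5\n75114;A;20\n"

# Without dtype=str, "01004" would be parsed as the integer 1004.
df_raw = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"code": str})

# Sum over all sectors to obtain the total per municipality.
df_total = (
    df_raw.groupby("code")["employees_2019"]
    .sum()
    .reset_index(name="employment")
    .rename(columns={"code": "municipality_id"})
)

print(df_total)
```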

In [None]:
pd.DataFrame({ "municipality_id": [], "employment": [] })

In [None]:
# Insert your code here
# df_employment =


**Task**: Plot a map of the number of employees in a department of your choice

In [None]:
# Insert your code here


**Task**: Plot a map of the employment *density* in a department of your choice.

In [None]:
# Insert your code here


Let's save this data for later:

In [None]:
df_employment.to_parquet("employment.parquet")

## Commuting data

Finally, we will have a look at a more complex data set: commuting data. This data set is also available from INSEE and describes how many people living in a specific municipality in France travel to any other municipality for work. This data set is known as *MOBPRO*.

- Download the data from https://www.insee.fr/fr/statistiques/6456056
- Download the data in CSV format

Linux users may execute the following cell:

In [None]:
if not os.path.exists("data/FD_MOBPRO_2019.csv"):
    !cd data && wget https://www.insee.fr/fr/statistiques/fichier/6456056/RP2019_mobpro_csv.zip
    !cd data && unzip RP2019_mobpro_csv.zip
    !cd data && rm RP2019_mobpro_csv.zip

**Task**: Load the data set with the following columns:
- `COMMUNE` : `str`
- `ARM` : `str`
- `DCLT`: `str`
- `IPONDI`: `float`
- `TRANS`: `int`

In [None]:
# Insert your code here


The MOBPRO data set is a bit particular with respect to its spatial identifiers. `DCLT` describes the destination of a commuter as a municipality identifier. In principle, the same holds for `COMMUNE`, which describes the origin. However, Paris, for instance, is encoded as `75056` in `COMMUNE`, while the actual "municipality" (or arrondissement) is contained in `ARM`. Knowing this, there is an easy fix:

In [None]:
f = df_commutes["ARM"] != "ZZZZZ"
df_commutes.loc[f, "COMMUNE"] = df_commutes.loc[f, "ARM"]

**Task**: Reformat the data frame so that we have the following format.

In [None]:
pd.DataFrame({ "origin_id": [], "destination_id": [], "weight": [], "transport_mode": [] })

In [None]:
# Insert your code here


**Task**: Plot a map showing how many people commute from Melun (77288), south-east of Paris, to any other municipality in Île-de-France **by car**.

In [None]:
# Insert your code here


**Task**: Plot the same map but for commutes by **public transport**. What do you notice?

In [None]:
# Insert your code here


**Task:** Aggregate the commuting data set further by removing the `transport_mode` column such that we only have the bare commuting flows as a weight between two municipalities. Then, save the data set as `commutes.parquet`. We will need it later on!
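
Removing a dimension by aggregation can be sketched on toy flows as follows. The identifiers, mode codes and weights are made up; grouping by origin and destination only, and summing the weights, collapses the transport modes:

```python
import pandas as pd

# Toy flows: the pair (A, B) appears once per transport mode.
df_toy = pd.DataFrame({
    "origin_id": ["A", "A", "A", "B"],
    "destination_id": ["B", "B", "C", "A"],
    "transport_mode": [1, 2, 1, 1],
    "weight": [10.0, 5.0, 2.0, 7.0],
})

# Sum the weights per origin-destination pair; the mode column disappears.
df_flows = (
    df_toy.groupby(["origin_id", "destination_id"])["weight"]
    .sum()
    .reset_index()
)

print(df_flows)
```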

In [None]:
# Insert your code here


In [None]:
df_commutes.to_parquet("commutes.parquet")

**Congratulations!** You can now solve Exercise 1 of the course project.