# Project 3 - Population estimates from the census

The aim of this project is to perform a quick statistical analysis of a
dataset whose format is not directly suited to analysis in Python. We
will use the `pandas` library exclusively for data analysis. To best
reproduce a situation you may face in practice, we strongly recommend
consulting the library documentation
([docs](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)).

We will focus on the population estimate as of January 1 of each year,
an estimate derived from censuses and population change models. The data
are available on the INSEE website at the following address:
<https://www.insee.fr/fr/statistiques/1893198>. The file that we
will use can be downloaded directly from this URL:
<https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls>.

In [None]:
import copy
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns

import solutions

Before downloading the data with Python, it is necessary to know the
format of our data. In our case, it is the Excel format (`.xls`). It is
also useful to look at what the data you want to import looks like,
especially when its format is not standard. So, before you begin, take
the time to look at the data.

### Question 0

Download the data by clicking on this
[link](https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls)
and open it with your favorite spreadsheet software. Analyze the
structure of the data.

### Question 1

Define the function `load_data()`, which takes no parameters and returns
a `dict` whose **keys** are the names of the tabs in our file and whose
**values** are the data of the corresponding sheets. To do this, use a
function from the `pandas` library with the correct parameters.
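
As a rough sketch of the idea (not the reference solution),
`pandas.read_excel` returns a dictionary of DataFrames keyed by sheet
name when `sheet_name=None` is passed. The helper name `load_all_sheets`
and its `path` argument are hypothetical. Any header or row-skipping
options needed for the INSEE file are left for you to work out, and
reading a `.xls` file requires an extra engine such as `xlrd` to be
installed.

```python
import pandas as pd


def load_all_sheets(path):
    """Read every sheet of an Excel workbook into a dict of DataFrames.

    `path` may be a local file path or a URL (hypothetical helper).
    """
    # sheet_name=None asks pandas for all sheets at once;
    # the result maps each sheet name to its DataFrame.
    return pd.read_excel(path, sheet_name=None)
```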

#### Expected result

In [None]:
data = solutions.load_data()
data["2022"]

#### Your turn!

In [None]:
def load_data():
    # Your code here
    return data

### Question 2

Now that the data is imported, we will put it in the form of a single
`DataFrame` with the following columns:

- `gender`;

- `age`;

- `population`;

- `dep_code`;

- `dep`;

- `year`.

2.1 - To do this, create a function `reshape_table_by_year(df, year)`
that takes as arguments a DataFrame from your dict `data` and a
given year.
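
One common way to go from a wide table to the long format listed above
is `DataFrame.melt`. The sketch below uses a made-up two-department
table whose column names (`Female_0-19`, `Male_0-19`) are purely
illustrative. The real sheets have a different header layout that you
will need to handle first.

```python
import pandas as pd

# Toy wide table (hypothetical layout): one row per department,
# one column per gender/age-group pair.
wide = pd.DataFrame(
    {
        "dep_code": ["01", "02"],
        "dep": ["Ain", "Aisne"],
        "Female_0-19": [10, 20],
        "Male_0-19": [11, 21],
    }
)

# Wide -> long: each gender/age column is turned into rows.
tidy = wide.melt(
    id_vars=["dep_code", "dep"],
    var_name="gender_age",
    value_name="population",
)

# Split the combined label into two separate columns.
tidy[["gender", "age"]] = tidy["gender_age"].str.split("_", n=1, expand=True)

# Attach the year and keep the target column order.
tidy["year"] = 2022
tidy = tidy[["gender", "age", "population", "dep_code", "dep", "year"]]
print(tidy)
```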

#### Expected result

In [None]:
annee = 2022
df = solutions.reshape_table_by_year(data[f"{annee}"], annee)
df

#### Your turn!

In [None]:
def reshape_table_by_year(df, year):
    # Your code here
    return df

2.2 - Create a function `reshape_data(data)` that produces a `DataFrame`
with data for all years between 1975 and 2022.
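
Stacking per-year tables is typically done with `pandas.concat`. The
sketch below assumes a toy dictionary keyed by year strings, with
made-up values. In the exercise, each table would first go through your
own per-year reshaping function.

```python
import pandas as pd

# Toy per-year tables keyed by sheet name (made-up values).
tables = {
    "2021": pd.DataFrame({"dep_code": ["01"], "population": [100]}),
    "2022": pd.DataFrame({"dep_code": ["01"], "population": [105]}),
}

# Tag each table with its year, then stack them vertically.
frames = [table.assign(year=int(name)) for name, table in tables.items()]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Passing `ignore_index=True` gives the result a fresh 0..n-1 index
instead of repeating each table's own index.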

#### Expected result

In [None]:
df = solutions.reshape_data(data)
df

#### Your turn!

In [None]:
def reshape_data(data):
    # Your code here
    return df

## Part 2: Data Visualization

We now have a dataset ready to be analyzed! Let's start by visualizing
the change in population over time for different departments.

### Question 3

Write a function
`plot_population_by_gender_per_department(df, department_code)` which
returns a graph showing the change in population over time, by gender,
in a given department. Use the `matplotlib` library. You can look at the
data for Haute-Garonne (31), Loir-et-Cher (41) and La Réunion (974) to
see how the trends differ.
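
A minimal sketch of drawing one line per gender with `matplotlib`, on
made-up numbers. The `gender` labels (`Female`, `Male`) and the values
are assumptions for illustration. Filtering on a department code is left
out.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-format data for a single department (made-up values).
toy = pd.DataFrame(
    {
        "year": [2020, 2021, 2022] * 2,
        "gender": ["Female"] * 3 + ["Male"] * 3,
        "population": [510, 515, 521, 490, 494, 499],
    }
)

fig, ax = plt.subplots(figsize=(8, 4))

# One line per gender; groupby yields the groups in sorted key order.
for gender, group in toy.groupby("gender"):
    ax.plot(group["year"], group["population"], label=gender)

ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.legend()
```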

#### Expected result

In [None]:
solutions.plot_population_by_gender_per_department(df, "31")

#### Your turn!

In [None]:
def plot_population_by_gender_per_department(df, department_code):
    # Your code here
    pass

### Question 4

In order to compare two graphs, it is sometimes useful to display them
side by side. Thanks to the `subplots()` function of `matplotlib`, this
is very easy to do in Python. To see this, we will plot the age pyramid
of France in 1975 and in 2022.
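
The side-by-side pattern can be sketched as follows: a plotting helper
accepts an optional `ax` argument and draws on it, and the caller
creates the two axes with `plt.subplots(1, 2)`. The helper name
`plot_series` and the data are hypothetical.

```python
import matplotlib.pyplot as plt


def plot_series(values, title, ax=None):
    """Draw `values` as a line on `ax` (or on the current axes)."""
    if ax is None:
        ax = plt.gca()
    ax.plot(range(len(values)), values)
    ax.set_title(title)
    return ax


# One row, two columns: ax1 is the left panel, ax2 the right one.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_series([1, 2, 3], "Left", ax=ax1)
plot_series([3, 2, 1], "Right", ax=ax2)
```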

4.1- Define the function `get_age_pyramid_data(df, year)` which, from
the DataFrame generated by the `reshape_data()` function, returns a
DataFrame with columns `age`, `Female`, `Male`. The `age` column
must contain all age groups present in the dataset, and the
`Female`/`Male` columns contain the female/male population for a given
age group. For aesthetic reasons, the `Male` column will first be
multiplied by -1.
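
One possible way to obtain one column per gender is `pivot_table`. The
sketch below runs on toy data and assumes the `gender` column holds the
labels `Female` and `Male`. Filtering on the year is left out.

```python
import pandas as pd

# Toy long-format data for a single year (made-up values).
toy = pd.DataFrame(
    {
        "age": ["0-19", "0-19", "20-39", "20-39"],
        "gender": ["Female", "Male", "Female", "Male"],
        "population": [100, 110, 120, 115],
    }
)

# One row per age group, one column per gender, summing populations.
pyramid = toy.pivot_table(
    index="age", columns="gender", values="population", aggfunc="sum"
).reset_index()
pyramid.columns.name = None  # drop the leftover "gender" axis label

# Negate one side so the two sets of bars extend in opposite directions.
pyramid["Male"] = -pyramid["Male"]
print(pyramid)
```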

#### Expected result

In [None]:
pyramide_data = solutions.get_age_pyramid_data(df, 2022)
pyramide_data

#### Your turn!

In [None]:
def get_age_pyramid_data(df, year):
    # Your code here
    return pyramide_data

4.2- Define the function `plot_age_pyramid(df, year, ax=None)` which
plots the age pyramid of France for a given year. You can take
inspiration from what is done in this
[blog post](https://maciejtarsa.medium.com/plotting-a-population-pyramid-in-python-52be034968b0).

#### Expected result

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,6))

solutions.plot_age_pyramid(df, 1975, ax=ax1)
solutions.plot_age_pyramid(df, 2022, ax=ax2)

#### Your turn!

In [None]:
def plot_age_pyramid(df, year, ax=None):
    if ax is None:
        ax = plt.gca()
    # Your code here
    return df

## Part 3: An Introduction to Geographic Data

Geographic data is very useful because it allows us to visualize and
analyze information tied to specific locations on Earth. It can be used
to create maps, 3D visualizations and spatial analyses to understand
trends, patterns and relationships in data. Using Python libraries such
as `Geopandas` or `Folium`, you can easily manipulate and visualize
geographic data to meet your analytical needs.

In order to represent geographic data graphically, it is necessary to
obtain the boundary data (*shapefile*) of the areas that we want to
represent. The goal of this part is to create a choropleth map of the
regions based on their respective populations.

The data we currently have is broken down by department and not by
region. First of all, each department must be assigned its corresponding
region. To do this, you can use the `.json` file available at the
following address:
<https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json>.

### Question 5

Create a DataFrame from the `.json` file of French departments and
regions mentioned above. Make sure that the columns are in the correct
format.
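
"Correct format" usually means, among other things, that department
codes are strings with their leading zero (`"01"`, not `1`), so that
they can later be matched against other tables. The sketch below parses
a small inline JSON string. The field names `num_dep`, `dep_name` and
`region_name` are assumptions about the file's layout and should be
checked against the actual data.

```python
import json

import pandas as pd

# Inline stand-in for the downloaded file (field names are assumed).
raw = """[
  {"num_dep": 1, "dep_name": "Ain", "region_name": "Auvergne-Rhône-Alpes"},
  {"num_dep": "2A", "dep_name": "Corse-du-Sud", "region_name": "Corse"}
]"""

df_codes = pd.DataFrame(json.loads(raw))

# Force codes to text, then left-pad with zeros to two characters.
df_codes["num_dep"] = df_codes["num_dep"].astype(str).str.zfill(2)
print(df_codes)
```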

#### Expected result

In [None]:
df_matching = solutions.load_departements_regions("https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json")
df_matching

#### Your turn!

In [None]:
def load_departements_regions(url):
    # Your code here
    return df_matching

### Question 6

Merge the DataFrame containing the population data by department with
the DataFrame of French regions.
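
A join of this kind can be sketched with `DataFrame.merge`. The tables
below are toy data, and the lookup column names (`num_dep`,
`region_name`) are assumptions. Passing `validate="many_to_one"` makes
pandas raise an error if a department code appears more than once in the
lookup table.

```python
import pandas as pd

# Toy population rows (several rows may share a department code).
population = pd.DataFrame(
    {"dep_code": ["01", "31", "31"], "population": [100, 200, 210]}
)

# Toy department -> region lookup (column names are assumed).
lookup = pd.DataFrame(
    {
        "num_dep": ["01", "31"],
        "region_name": ["Auvergne-Rhône-Alpes", "Occitanie"],
    }
)

# Left join keeps every population row, matched or not.
merged = population.merge(
    lookup,
    left_on="dep_code",
    right_on="num_dep",
    how="left",
    validate="many_to_one",
)
print(merged)
```

With `how="left"`, unmatched department codes show up as missing values
in `region_name`, which makes them easy to spot.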

#### Expected result

In [None]:
df_regions = solutions.match_department_regions(df, df_matching)
df_regions

#### Your turn!

In [None]:
def match_department_regions(df, df_matching):
    # Your code here
    return df_regions

### Question 7

Download the geographic boundary data of the regions using the
`cartiflette` package and the `geopandas` library. The
data is accessible at [this
address](https://minio.lab.sspcloud.fr/projet-cartiflette/diffusion/shapefiles-test1/year=2022/administrative_level=REGION/crs=4326/FRANCE_ENTIERE=metropole/vectorfile_format='geojson'/provider='IGN'/source='EXPRESS-COG-CARTO-TERRITOIRE'/raw.geojson).

#### Expected result

In [None]:
geo = solutions.load_geo_data("https://minio.lab.sspcloud.fr/projet-cartiflette/diffusion/shapefiles-test1/year=2022/administrative_level=REGION/crs=4326/FRANCE_ENTIERE=metropole/vectorfile_format='geojson'/provider='IGN'/source='EXPRESS-COG-CARTO-TERRITOIRE'/raw.geojson")
geo

#### Your turn!

In [None]:
def load_geo_data(url):
    # Your code here
    return geo

### Question 8

Produce a choropleth map of the 2022 population of the regions of
France. You can consult the `geopandas` documentation
[here](https://geopandas.org/en/stable/docs/user_guide/mapping.html).

#### Expected result

In [None]:
solutions.plot_population_by_regions(df_regions, geo, 2022)

#### Your turn!

In [None]:
def plot_population_by_regions(df, geo, year):
    # Your code here
    pass

### Question 9

The total population of a region is not sufficient to analyze its
demographics. It may be interesting to look at population growth as
well.

9.1- Write a function `compute_population_growth_per_region(df)` which
computes the year-over-year population growth, in percent, for each
region.
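
Year-over-year growth within each group can be sketched with `groupby`
followed by `pct_change`. The table below is toy data with one total per
region and year. In the exercise, populations would first need to be
summed per region and year.

```python
import pandas as pd

# Toy regional totals (made-up values).
pop = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "year": [2020, 2021, 2022] * 2,
        "population": [100.0, 110.0, 121.0, 200.0, 190.0, 190.0],
    }
)

# pct_change compares each row with the previous one in its group,
# so the rows must be in chronological order within each region.
pop = pop.sort_values(["region", "year"])
pop["growth_pct"] = pop.groupby("region")["population"].pct_change() * 100
print(pop)
```

The first year of each region has no previous value to compare with, so
its growth is missing (`NaN`).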

#### Expected result

In [None]:
df_croissance = solutions.compute_population_growth_per_region(df_regions)
df_croissance

#### Your turn!

In [None]:
def compute_population_growth_per_region(df_regions):
    # Your code here
    return df_croissance

9.2- Write a function
`compute_mean_population_growth_per_region(df, min_year, max_year)` which
computes the average population growth between two given years.
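
One reading of "average growth" is the arithmetic mean of the yearly
growth rates inside the window. The sketch below takes that reading, on
toy data with a precomputed, hypothetical `growth_pct` column. Other
definitions, such as a compound annual rate, are also possible.

```python
import pandas as pd

# Toy yearly growth rates per region (made-up values, in percent).
growth = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "year": [2020, 2021, 2022] * 2,
        "growth_pct": [1.0, 2.0, 4.0, 0.5, -1.0, 1.0],
    }
)

min_year, max_year = 2021, 2022

# between() is inclusive on both ends by default.
in_window = growth["year"].between(min_year, max_year)
mean_growth = (
    growth.loc[in_window]
    .groupby("region", as_index=False)["growth_pct"]
    .mean()
)
print(mean_growth)
```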

#### Expected result

In [None]:
df_croissance = solutions.compute_mean_population_growth_per_region(df_regions, 2015, 2022)
df_croissance

#### Your turn!

In [None]:
def compute_mean_population_growth_per_region(df, min_year, max_year):
    # Your code here
    return df_croissance

9.3- Write a function
`plot_growth_population_by_regions(df, geo, min_year, max_year)` which
shows the average population growth between two given years for all
regions of France on a choropleth map.

#### Expected result

In [None]:
solutions.plot_growth_population_by_regions(df_regions, geo, 2015, 2022)

#### Your turn!

In [None]:
def plot_growth_population_by_regions(df, geo, min_year, max_year):
    # Your code here
    pass