In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import shapely.geometry as sgeo
%matplotlib inline

# Disaggregation

The goal of this exercise is to learn how to switch from a zone-based, flow-based simulation to a trip-based setting. The idea is to take the flow matrix generated in the previous exercise and to generate individual trips from it.

## Disaggregating the flows

Instead of generating the true number of trips, we will only generate a small sample of them to keep the calculations feasible.

First, we read the generated flow data and the municipality shapes from the previous exercises:

In [None]:
df_flow = pd.read_parquet("data/flow.parquet")
df_municipalities = gpd.read_parquet("data/municipalities.parquet")

**Task**: Filter both the municipalities data frame and the flow data frame for the area of Paris (department 75). For the flow data set, keep only flows that both start and end within Paris.
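As a minimal sketch of this kind of filtering, the following toy example keeps only zones of one department and only flows whose two ends lie in those zones. The column names `municipality_id` and `department_id` are assumptions for illustration and may differ in your data; `origin_id`, `destination_id` and `weight` follow the flow data frame used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real data. "municipality_id" and "department_id"
# are assumed column names and may differ in your files.
df_toy_zones = pd.DataFrame({
    "municipality_id": ["75101", "75102", "92012"],
    "department_id": ["75", "75", "92"],
})
df_toy_flow = pd.DataFrame({
    "origin_id": ["75101", "75101", "92012", "75102"],
    "destination_id": ["75102", "92012", "75101", "75101"],
    "weight": [10.0, 5.0, 3.0, 2.0],
})

# Keep only the zones of department 75
df_toy_zones_paris = df_toy_zones[df_toy_zones["department_id"] == "75"]
paris_ids = set(df_toy_zones_paris["municipality_id"])

# Keep only flows whose origin AND destination both lie inside the area
is_inside = (
    df_toy_flow["origin_id"].isin(paris_ids)
    & df_toy_flow["destination_id"].isin(paris_ids)
)
df_toy_flow_paris = df_toy_flow[is_inside]
```

Combining both conditions with `&` is what removes flows that merely touch the area at one end.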

In [None]:
### Insert your code here


**Task**: Now convert the `weight` column (reference flows) or the `model` column (your choice) into a probability by dividing by the total:
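For illustration, here is the normalization on a small toy data frame (the values are made up and only the `weight` column name is taken from the flow data):

```python
import pandas as pd

# Toy flows with made-up weights
df_toy = pd.DataFrame({
    "origin_id": ["A", "A", "B"],
    "destination_id": ["A", "B", "A"],
    "weight": [10.0, 30.0, 60.0],
})

# Dividing by the total turns the weights into shares that sum to one
df_toy["probability"] = df_toy["weight"] / df_toy["weight"].sum()
```

After this step the column can be read as the probability that a randomly chosen trip belongs to the respective relation.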

In [None]:
### Insert your code here
# df_flow["probability"] = 


**Task**: Next, sample a trip table from the flow data frame using the data frame's `sample` method from *pandas*, weighting each relation by its `probability` (see the `weights` argument). Pass `replace = True` so that the same relation can be drawn more than once. Sample *1,000* trips.
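A minimal sketch of weighted sampling with replacement on toy data may help here. The data frame and the sample size of 10 are made up for illustration, and `random_state` is only set to make the example reproducible:

```python
import pandas as pd

# Toy relations with probabilities that sum to one
df_toy = pd.DataFrame({
    "origin_id": ["A", "A", "B"],
    "destination_id": ["A", "B", "A"],
    "probability": [0.1, 0.3, 0.6],
})

# Draw 10 rows; each row is chosen in proportion to its probability,
# and replace=True allows the same relation to appear several times
df_toy_sample = df_toy.sample(
    n=10, replace=True, weights="probability", random_state=0
)

# The sampled rows keep their original index, so reset it
df_toy_sample = df_toy_sample.reset_index(drop=True)
```

Without `replace = True`, *pandas* could not draw more rows than the data frame contains.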

In [None]:
### Insert your code here
# df_trips = 


Next, we clean up the trips data frame:

In [None]:
df_trips = df_trips[["origin_id", "destination_id"]]
df_trips.head()

## Sampling a departure time

Our next goal is to generate a departure time for each trip. We will only generate trips for the morning peak.

**Task**: Add a departure time to each trip by sampling from the following normal distribution:

$$
t \sim \mathcal{N}(\mu = 8.5, \sigma = 1)
$$

The distribution is given in hours, i.e. a mean of 08:30 with a standard deviation of one hour. Save the departure time in seconds after midnight. Set any negative values to zero to avoid computational issues:
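The conversion can be sketched as follows, assuming that the distribution is expressed in hours. The variable names and the fixed seed are choices made for this illustration only:

```python
import numpy as np

# A seeded generator makes the example reproducible
rng = np.random.default_rng(0)

# Draw departure times in hours around 08:30 with a spread of one hour
toy_hours = rng.normal(loc=8.5, scale=1.0, size=1000)

# Convert hours to seconds after midnight and set negative values to zero
toy_seconds = np.maximum(toy_hours * 3600.0, 0.0)
```

With these parameters negative values are extremely unlikely, but the clipping keeps the code safe if the mean or standard deviation is changed later.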

In [None]:
### Insert your code here
# df_trips["departure_time"] = 


**Task**: To check the result, plot a histogram of your generated departure times.

In [None]:
### Insert your code here


## Sampling origins and destinations

Next, we want to generate origin and destination points for the trips and show them on a map.

**Task**: We will follow a process that is not the most efficient, but straightforward to follow. First, merge the municipality data frame onto the trips data frame such that a new column `origin_geometry` is created:
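The rename-then-merge pattern can be illustrated on toy data. Here `zone_id` is a hypothetical identifier column and plain strings stand in for the polygon geometries:

```python
import pandas as pd

# Toy trips and toy zones; "zone_id" is a hypothetical column name and
# the strings are placeholders for real polygon geometries
df_toy_trips = pd.DataFrame({
    "origin_id": ["A", "B", "A"],
    "destination_id": ["B", "A", "A"],
})
df_toy_zones = pd.DataFrame({
    "zone_id": ["A", "B"],
    "geometry": ["polygon A", "polygon B"],
})

# Rename the zone columns so that they match the origin side of the trips
df_toy_origin = df_toy_zones.rename(columns={
    "zone_id": "origin_id",
    "geometry": "origin_geometry",
})

# Every trip receives the geometry of its origin zone
df_toy_merged = pd.merge(df_toy_trips, df_toy_origin, on="origin_id")
```

Renaming before merging avoids a clash when the same zone table is attached a second time for the destination.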

In [None]:
pd.DataFrame({ "origin_id": [], "destination_id": [], "departure_time": [], "origin_geometry": [] })

In [None]:
### Insert your code here
# df_trips = pd.merge(df_trips, df_municipalities.rename( # complete # ), on = "origin_id")


**Task**: Now repeat the same to generate a new column `destination_geometry` for each trip:

In [None]:
### Insert your code here


Let's clean up the data set. We should have the following columns:

In [None]:
df_trips = df_trips[["origin_id", "destination_id", "departure_time", "origin_geometry", "destination_geometry"]]
assert len(df_trips) == 1000

*Geopandas* provides a useful method called `sample_points`, but it only acts on the active *geometry* column of a `GeoDataFrame`. First, we need to convert `df_trips` into a `GeoDataFrame` with `origin_geometry` as the active geometry column:

In [None]:
df_trips = gpd.GeoDataFrame(df_trips, geometry = "origin_geometry", crs = df_municipalities.crs)

Try the following code. It takes the polygon geometry of every trip's origin zone and samples a point from within that zone:

In [None]:
df_trips.sample_points(1)

**Task**: Now overwrite the (polygon) `origin_geometry` column in your trip table with a sampled point from the respective zone:

In [None]:
### Insert your code here
# df_trips["origin_geometry"] = # ...


**Task**: Now, do the same with the destination. First, set the active geometry column of the data frame to `destination_geometry` (see `GeoDataFrame.set_geometry`) and then overwrite this column with a randomly sampled point:
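To see how the active geometry column controls what `sample_points` acts on, consider this small sketch with two toy squares. The data frame `gdf_toy` and its contents are made up for illustration:

```python
import geopandas as gpd
import shapely.geometry as sgeo

# Two toy zones that do not overlap
left = sgeo.box(0, 0, 1, 1)
right = sgeo.box(10, 0, 11, 1)

# A toy trip table with one polygon column per trip end;
# origin_geometry is the active geometry column at first
gdf_toy = gpd.GeoDataFrame(
    {"origin_geometry": [left, right], "destination_geometry": [right, left]},
    geometry="origin_geometry",
)

# Samples one point per row from the active column (the origin polygons)
toy_origin_points = gdf_toy.sample_points(1)

# After switching the active column, the same call samples from the
# destination polygons instead
gdf_toy = gdf_toy.set_geometry("destination_geometry")
toy_destination_points = gdf_toy.sample_points(1)
```

The call to `sample_points` is identical in both cases. Only the active geometry column decides which polygons are used.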

In [None]:
### Insert your code here
# df_trips["destination_geometry"] = ...


Have a look at your data frame. You should see that both `origin_geometry` and `destination_geometry` are of type `POINT`. If the origin and destination points of a trip are *exactly* the same, something went wrong in the previous cells:

In [None]:
df_trips.head()

We will now create a new geometry column which, instead of a `POINT`, contains a `LINESTRING`, i.e. a connected line between `N` (in our case 2) points. For that, we make use of the `shapely.geometry` package, which we have imported as `sgeo` (see first cell).
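As a small illustration with made-up coordinates, a `LineString` can be built directly from a pair of points:

```python
import shapely.geometry as sgeo

# Two toy points standing in for the origin and destination of one trip
toy_origin = sgeo.Point(0.0, 0.0)
toy_destination = sgeo.Point(3.0, 4.0)

# A LineString accepts a sequence of points and connects them in order
toy_line = sgeo.LineString((toy_origin, toy_destination))
```

The resulting line runs from the first point to the second, so its length is the straight-line distance between them in the units of the coordinate system.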

In [None]:
df_trips["geometry"] = [
    sgeo.LineString(od) 
    for od in zip(df_trips["origin_geometry"], df_trips["destination_geometry"])
]

Try to understand the previous cell. What does it do?

**Task**: Set the *active geometry column* of the data frame to `geometry` and plot the data frame. What do you see?

In [None]:
### Insert your code here


**Task**: In this notebook or using QGIS, plot the arrondissements of Paris together with your generated flows.

In [None]:
### Insert your code here or use QGIS

Let's save the generated trips for the next exercise:

In [None]:
df_trips.to_parquet("trips.parquet")

**Congratulations!** You should now be able to disaggregate a flow matrix for your course project (Exercise 3.1).