# Flow Network

In this notebook we build a network of flows between places, where each flow is the number of _trips_ between two areas.

## Preamble

In [22]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(context='notebook', font='Lucida Sans Unicode', style='white', palette='plasma')

In [23]:
import pandas as pd
import geopandas as gpd

In [24]:
checkins = pd.read_csv('output/santiago_relevant_check_ins.csv.gz')
checkins.head()

Unnamed: 0.1,Unnamed: 0,user_id,venue_id,datetime,utc_offset
0,56,44093,4dc0367c6a23e5a549e68d9b,Tue Apr 03 18:00:22 +0000 2012,-180
1,82,980044,4bb8ddd31261d13ab608e998,Tue Apr 03 18:00:27 +0000 2012,-180
2,83,640995,4f7200c9e4b0e30fde677e23,Tue Apr 03 18:00:28 +0000 2012,-180
3,179,969261,4d386aa93ffba1433b405c56,Tue Apr 03 18:00:55 +0000 2012,-180
4,180,1424666,4b6f2cf8f964a52013e22ce3,Tue Apr 03 18:00:55 +0000 2012,-180


In [25]:
venues = gpd.read_file('output/santiago_relevant_pois.json')
venues.head()

Unnamed: 0,venue_id,category,index_right,geometry
0,4af8ffeaf964a520cd1022e3,Other Great Outdoors,1135,POINT (-70.66046 -33.44616)
1,4b4463faf964a52078f425e3,Sushi Restaurant,3752,POINT (-70.51793 -33.3738)
2,4b4463faf964a52078f425e3,Sushi Restaurant,3753,POINT (-70.51793 -33.3738)
3,4b44abe7f964a5201ef925e3,Sushi Restaurant,2411,POINT (-70.57765 -33.41222)
4,4b44ad89f964a52040f925e3,Office,2939,POINT (-70.57276 -33.40448)


## Flow Estimation

Check-ins are stored as one row per check-in in the dataframe. However, to identify a trip we need to consider _pairs_ of check-ins by the same user.

First, we sort the check-ins dataframe by `user_id`, which ensures that two consecutive rows belong to the same user (except at the boundary between two users, which we deal with later). Note that the sort uses `user_id` only, so the rows of a given user are not necessarily in chronological order, as the output below shows.

In [26]:
checkins_grid = checkins.merge(venues[['venue_id', 'index_right']]).sort_values('user_id').drop('utc_offset', axis=1)
checkins_grid.head()

Unnamed: 0.1,Unnamed: 0,user_id,venue_id,datetime,index_right
1431897,60701213,230,4b59c564f964a5202c9728e3,Mon Apr 08 19:57:47 +0000 2013,1605
1268593,50917219,230,50e9b11ee4b06654a0a2dc80,Sat Feb 16 06:30:23 +0000 2013,1761
1268592,50917219,230,50e9b11ee4b06654a0a2dc80,Sat Feb 16 06:30:23 +0000 2013,1760
1338203,55382144,230,4d2f5351a6df6dcbd709e27a,Fri Mar 15 04:57:18 +0000 2013,4487
1338204,55382144,230,4d2f5351a6df6dcbd709e27a,Fri Mar 15 04:57:18 +0000 2013,4488


Then, we define a utility function `shift` that takes a dataframe as input and builds a new dataframe in which two consecutive rows of the original dataframe are put together, as if they formed a trip.

Since the dataframe is sorted by `user_id`, most of these pairs belong to a single user. Pairs that span two users are not real trips, and we filter them out of the dataframe.

In [27]:
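def shift(df):
    """Count, per user, how often two venues appear in consecutive rows of df."""
    # Venue and user of every row.
    origin = df.rename({'venue_id': 'origin'}, axis=1)[['origin', 'user_id']]
    # shift() moves the rows down by one position, so each row is paired with the row before it.
    destination = df.rename({'venue_id': 'destination', 'user_id': 'user_id_d'}, axis=1)[['destination', 'user_id_d']].shift()
    trips = (origin.join(destination)
             .dropna()  # e.g. the first row, which has no preceding row
             .pipe(lambda x: x[x.user_id == x.user_id_d])  # drop pairs that span two users
             .groupby(['user_id', 'origin', 'destination'])
             .size()  # number of times each user produced each pair
            )
    trips.name = 'n_trips'
    return trips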

#shift(checkins_grid.head(100))
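
As a complement, here is a minimal sketch of a chronological variant of this pairing. It is not the method used in this notebook. It assumes that the `datetime` strings follow the format shown in the check-ins table, parses them, sorts by user and time, and shifts within each user so that rows of different users are never paired. The toy data and the names `toy`, `consecutive_pairs` and `toy_pairs` are hypothetical.

```python
import pandas as pd

# Hypothetical toy check-ins; column names follow the notebook.
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2],
    'venue_id': ['a', 'b', 'a', 'c', 'd'],
    'datetime': ['Tue Apr 03 18:00:22 +0000 2012',
                 'Tue Apr 03 19:10:00 +0000 2012',
                 'Wed Apr 04 08:30:00 +0000 2012',
                 'Tue Apr 03 18:05:00 +0000 2012',
                 'Tue Apr 03 18:45:00 +0000 2012'],
})

def consecutive_pairs(df):
    """Pair each check-in with the same user's next check-in in time."""
    # Assumption: timestamps look like 'Tue Apr 03 18:00:22 +0000 2012'.
    ts = pd.to_datetime(df['datetime'], format='%a %b %d %H:%M:%S %z %Y')
    ordered = df.assign(ts=ts).sort_values(['user_id', 'ts'])
    # Shifting inside each user group never pairs rows of different users.
    ordered['destination'] = ordered.groupby('user_id')['venue_id'].shift(-1)
    return (ordered
            .dropna(subset=['destination'])  # last check-in of each user has no successor
            .rename(columns={'venue_id': 'origin'})
            [['user_id', 'origin', 'destination']]
            .reset_index(drop=True))

toy_pairs = consecutive_pairs(toy)
print(toy_pairs)
```

With this approach the filter on matching user ids is not needed, because the grouped shift leaves the last check-in of each user without a destination.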

In [28]:
user_trip_counts = shift(checkins_grid).reset_index()
user_trip_counts.head()

Unnamed: 0,user_id,origin,destination,n_trips
0,230,4b52419af964a520177327e3,4b6322e2f964a520a6652ae3,1
1,230,4b549b4ef964a52060c227e3,4e7f48f70aaf89ce53278fb3,1
2,230,4b562eedf964a520180428e3,4b59c564f964a5202c9728e3,1
3,230,4b562eedf964a520180428e3,4b65adfef964a52047f92ae3,1
4,230,4b562eedf964a520180428e3,4d00e897ffcea1434f243091,1


In [29]:
user_trip_counts.shape

(1345629, 4)

Then, we proceed to assign an origin / destination _cell id_ to each trip.

In [30]:
user_trips_grid = (user_trip_counts
                   .join(venues[['venue_id', 'index_right']].set_index('venue_id'), on='origin')
                   .rename({'index_right': 'origin_cell_id'}, axis=1)
                   .join(venues[['venue_id', 'index_right']].set_index('venue_id'), on='destination')
                   .rename({'index_right': 'destination_cell_id'}, axis=1))
user_trips_grid.head()

Unnamed: 0,user_id,origin,destination,n_trips,origin_cell_id,destination_cell_id
0,230,4b52419af964a520177327e3,4b6322e2f964a520a6652ae3,1,3643,1605
1,230,4b549b4ef964a52060c227e3,4e7f48f70aaf89ce53278fb3,1,2577,3635
2,230,4b562eedf964a520180428e3,4b59c564f964a5202c9728e3,1,3578,1605
3,230,4b562eedf964a520180428e3,4b65adfef964a52047f92ae3,1,3578,1795
3,230,4b562eedf964a520180428e3,4b65adfef964a52047f92ae3,1,3578,1796
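
The repeated index `3` in the output above appears to come from the join itself: the venues table lists some `venue_id` values under more than one `index_right` (see rows 1 and 2 of the venues table), and a join against a lookup with repeated keys emits one row per match. The sketch below reproduces this on hypothetical toy data. Keeping the first cell per venue with `drop_duplicates` is only one possible choice, assumed here for illustration; it is not what the notebook does.

```python
import pandas as pd

# Hypothetical toy data: venue 'v2' is assigned to two cells.
trips = pd.DataFrame({'origin': ['v1', 'v2'], 'n_trips': [1, 1]})
venue_cells = pd.DataFrame({'venue_id': ['v1', 'v2', 'v2'],
                            'index_right': [10, 20, 21]})

# Joining against repeated keys yields one output row per matching cell.
joined = trips.join(venue_cells.set_index('venue_id'), on='origin')

# Assumption for illustration: keep a single (the first) cell per venue.
one_cell_per_venue = venue_cells.drop_duplicates('venue_id')
joined_unique = trips.join(one_cell_per_venue.set_index('venue_id'), on='origin')

print(len(trips), len(joined), len(joined_unique))
```

Whether the duplicated trips should be kept, dropped, or split between cells depends on how venues that touch several cells are meant to be counted.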


Finally, we can compute the flow of people between two cells by aggregating the trips dataframe. We discard trips that start and end in the same cell, and keep only flows with more than 5 trips:

In [31]:
flows = (user_trips_grid
         [user_trips_grid.origin_cell_id != user_trips_grid.destination_cell_id]
         .groupby(['origin_cell_id', 'destination_cell_id'])
         ['n_trips']
         .sum()
         .reset_index()
         .pipe(lambda x: x[x.n_trips > 5])
         .rename({'origin_cell_id': 'origin', 'destination_cell_id': 'dest', 'n_trips': 'count'}, axis=1)
        )
flows.head()

Unnamed: 0,origin,dest,count
4,0,365,7
32,0,1395,9
38,0,1529,6
39,0,1530,6
40,0,1531,6
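
To make the aggregation easier to follow, here is a small sketch of the same steps wrapped in a function and applied to hypothetical toy data. The function name `aggregate_flows` and its `min_trips` parameter are assumptions introduced for illustration; the column names are the ones used in this notebook.

```python
import pandas as pd

def aggregate_flows(trips, min_trips=5):
    """Sum trips per (origin cell, destination cell).

    Trips within a single cell are dropped, and only flows with
    more than `min_trips` trips are kept.
    """
    between_cells = trips[trips['origin_cell_id'] != trips['destination_cell_id']]
    totals = (between_cells
              .groupby(['origin_cell_id', 'destination_cell_id'])['n_trips']
              .sum()
              .reset_index())
    return (totals[totals['n_trips'] > min_trips]
            .rename(columns={'origin_cell_id': 'origin',
                             'destination_cell_id': 'dest',
                             'n_trips': 'count'})
            .reset_index(drop=True))

# Hypothetical toy trips: two users travel 10 -> 20, one stays inside cell 10.
toy_trips = pd.DataFrame({
    'user_id':             [1, 2, 3, 1],
    'origin_cell_id':      [10, 10, 10, 20],
    'destination_cell_id': [20, 20, 10, 10],
    'n_trips':             [4, 3, 50, 2],
})
toy_flows = aggregate_flows(toy_trips)
print(toy_flows)
```

Only the flow from cell 10 to cell 20 survives: the 50 trips inside cell 10 are not a flow between cells, and the flow from 20 to 10 falls below the threshold.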


In [32]:
flows['count'].describe()

count    97085.000000
mean        43.466509
std        424.968046
min          6.000000
25%          8.000000
50%         13.000000
75%         26.000000
max      42767.000000
Name: count, dtype: float64

## Data Export for Visualization

We export the data in the format required by the [flowmap.blue](http://flowmap.blue) platform, which needs two files: a flow magnitude file and a flow location file.

The flow magnitude was estimated above.

In [33]:
flows.to_csv('output/santiago_foursquare_flow_magnitudes.csv', index=False)

The flow locations are the centroids of each cell. 

In [34]:
grid = gpd.read_file('output/santiago_relevant_grid.geo.json')
grid.head()

Unnamed: 0,h3_cellid,index_right,geometry
0,88b2c5cdebfffff,5,"POLYGON ((-70.70157 -33.32457, -70.70599 -33.3..."
1,88b2c51281fffff,7,"POLYGON ((-70.2935 -33.26894, -70.29789 -33.27..."
2,88b2c55481fffff,26,"POLYGON ((-70.6372 -33.4642, -70.64161 -33.467..."
3,88b2c510a3fffff,7,"POLYGON ((-70.33679 -33.33516, -70.34119 -33.3..."
4,88b2c552a3fffff,5,"POLYGON ((-70.77352 -33.36271, -70.77794 -33.3..."


Note that we don't need all cells from the grid, as some of them do not generate or attract trips.

In [35]:
flow_grid = grid[grid.index.isin(flows.origin) | grid.index.isin(flows.dest)]
flow_grid.shape, grid.shape

((1484, 3), (4620, 3))

In [36]:
centroids = pd.DataFrame({'x': flow_grid.geometry.centroid.x, 'y': flow_grid.geometry.centroid.y})



In [37]:
centroids.index.name = 'id'

In [38]:
centroids['name'] = centroids.index.values

In [39]:
centroids[['name', 'y', 'x']].rename({'x': 'lon', 'y': 'lat'}, axis=1).to_csv('output/santiago_foursquare_flow_locations.csv')
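
Before uploading, it can help to check that every cell referenced in the magnitudes file also appears in the locations file. The following is a minimal sketch of such a check on hypothetical toy data; the helper name `missing_location_ids` and the toy coordinates are assumptions, while the column names (`origin`, `dest`, `count`, `id`, `name`, `lat`, `lon`) are the ones written by the export cells above.

```python
import pandas as pd

def missing_location_ids(flows, locations):
    """Return the flow endpoints that have no row in the locations table."""
    endpoints = set(flows['origin']) | set(flows['dest'])
    return sorted(endpoints - set(locations['id']))

# Hypothetical toy tables with the exported column names.
flows_toy = pd.DataFrame({'origin': [0, 0, 3],
                          'dest':   [3, 7, 7],
                          'count':  [7, 9, 6]})
locations_toy = pd.DataFrame({'id':   [0, 3, 7],
                              'name': [0, 3, 7],
                              'lat':  [-33.45, -33.41, -33.37],
                              'lon':  [-70.66, -70.58, -70.52]})

print(missing_location_ids(flows_toy, locations_toy))
print(missing_location_ids(flows_toy, locations_toy[locations_toy['id'] != 7]))
```

An empty list means that both tables are consistent. The same function could be applied to the two exported CSV files after reading them back with `pd.read_csv`.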

With these files, we can use flowmap.blue. [See the resulting visualization here!](https://flowmap.blue/1VdL8bOI42S_M597lO-uVZG2Hx4ujwudiSrgPFps6oeo)

## Next Steps

This exploration only applies the work of Eduardo Graells to the case of Santiago, using Foursquare check-in data. The next steps are:

- Inspect EOD data
- Inspect Adatrap data
- Propose a unified format, considering the formats of these datasets
- Write the code that converts the datasets to the unified format
- Write the code that converts the unified format into input for flowmap.blue