# Flow Network

In this notebook we build a network of flows between places, where each flow is the number of _trips_ between two areas.

## Preamble

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(context='notebook', font='Fira Sans', style='white', palette='plasma')

In [2]:
import pandas as pd
import geopandas as gpd

In [3]:
checkins = pd.read_csv('output/relevant_check_ins.csv.gz')
checkins.head()

,Unnamed: 0,user_id,venue_id,datetime,utc_offset
0,130,1597907,4d9b7366b2aaa093259f7082,Tue Apr 03 18:00:43 +0000 2012,120
1,278,106998,4adcda60f964a520934421e3,Tue Apr 03 18:01:23 +0000 2012,120
2,377,1101077,4bac7462f964a520d4f53ae3,Tue Apr 03 18:01:47 +0000 2012,120
3,382,1218854,4adcda60f964a520934421e3,Tue Apr 03 18:01:49 +0000 2012,120
4,433,1648478,4dfc7683227185f38b94d41d,Tue Apr 03 18:02:02 +0000 2012,120


In [4]:
venues = gpd.read_file('output/relevant_pois.json')
venues.head()

,venue_id,category,index_right,geometry
0,4ad0e340f964a520c7da20e3,Southern / Soul Food Restaurant,2192,POINT (2.15161 41.39223)
1,4adcda4bf964a520873f21e3,Hotel,2192,POINT (2.15006 41.39436)
2,4adcda4cf964a520903f21e3,Hotel,2192,POINT (2.15351 41.39404)
3,4adcda4df964a520084021e3,Cocktail Bar,2192,POINT (2.15427 41.39277)
4,4adcda4df964a520164021e3,Beer Garden,2192,POINT (2.15316 41.39503)


## Flow Estimation

The dataframe stores one row per check-in. To observe a trip, however, we need to consider _pairs_ of check-ins by the same user.

First, we sort the check-ins dataframe by `user_id`, so that consecutive rows mostly belong to the same user. The exception is the row at the boundary between two users, which we deal with later. Note that this sort does not order each user's check-ins by time, as the rows for user 19 below show.

In [5]:
checkins_grid = (checkins
                 .merge(venues[['venue_id', 'index_right']], on='venue_id')
                 .sort_values('user_id')
                 .drop('utc_offset', axis=1))
checkins_grid.head()

,Unnamed: 0,user_id,venue_id,datetime,index_right
169950,863379,11,4bf24abb99d02d7feb0eca48,Wed Feb 27 19:04:51 +0000 2013,2177
169975,879777,19,512bb246e4b0447bba28bce6,Wed Feb 27 20:49:55 +0000 2013,2257
80184,68775,19,4adcda60f964a520994421e3,Thu Feb 28 09:49:55 +0000 2013,2176
14208,87234,19,4b7980a2f964a5202bfd2ee3,Thu Feb 28 10:45:55 +0000 2013,2910
124812,416624,19,4bcc52eb3740b713efc46365,Mon Feb 25 08:10:09 +0000 2013,2699
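
A time-aware ordering would sort by user and then by parsed timestamp. The following is a minimal sketch, using only the standard library and toy rows. `TIMESTAMP_FORMAT`, `parse_timestamp`, and the venue names are illustrative and not part of the notebook, and the format string assumes every timestamp follows the layout shown in the check-in table.

```python
from datetime import datetime

# Assumption: timestamps look like 'Tue Apr 03 18:00:43 +0000 2012'.
TIMESTAMP_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_timestamp(text):
    """Parse a check-in timestamp string into a timezone-aware datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


# Toy rows (user_id, venue_id, datetime); the venue names are made up.
toy_checkins = [
    (19, 'venue_c', 'Thu Feb 28 09:49:55 +0000 2013'),
    (11, 'venue_a', 'Wed Feb 27 19:04:51 +0000 2013'),
    (19, 'venue_b', 'Mon Feb 25 08:10:09 +0000 2013'),
]

# Sort by user first, then by parsed time within each user.
ordered = sorted(toy_checkins, key=lambda row: (row[0], parse_timestamp(row[2])))

for row in ordered:
    print(row)
```

With rows ordered this way, each pair of consecutive rows of the same user is a pair of check-ins that followed one another in time.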


Then, we define a utility function `shift` that takes a dataframe as input and builds a new dataframe in which every two consecutive rows of the original are put together, as if they formed a trip. It then counts how many times each user made each origin/destination pair.

Since the dataframe is sorted by `user_id`, most of the shifted pairs belong to a single user. Those that do not are not real trips, so we filter them out.

In [6]:
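def shift(df):
    # Each row's venue is treated as the trip origin.
    origin = df.rename({'venue_id': 'origin'}, axis=1)[['origin', 'user_id']]
    # shift() moves rows down by one, so each row is paired with the previous row's venue and user.
    destination = df.rename({'venue_id': 'destination', 'user_id': 'user_id_d'}, axis=1)[['destination', 'user_id_d']].shift()
    trips = (origin.join(destination)
             .dropna()
             # Keep only pairs in which both check-ins belong to the same user.
             .pipe(lambda x: x[x.user_id == x.user_id_d])
             # Count how often each user made each origin/destination pair.
             .groupby(['user_id', 'origin', 'destination'])
             .size()
            )
    trips.name = 'n_trips'
    return trips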

#shift(checkins_grid.head(100))
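
To see the pairing step in isolation, here is a minimal sketch on toy data. The frame `toy`, the column names `prev_user` and `prev_venue`, and the venue names are made up for illustration. It only shows how `DataFrame.shift` lines up each row with the previous one, and how pairs that span two users are dropped.

```python
import pandas as pd

# Toy check-ins, already sorted by user_id; the venue names are made up.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'venue_id': ['a', 'b', 'c', 'd', 'e'],
})

# shift() moves values down one row, so each row sees the previous row's values.
pairs = toy.assign(
    prev_user=toy['user_id'].shift(),
    prev_venue=toy['venue_id'].shift(),
)

# Keep only pairs whose two check-ins belong to the same user.
# The first row (no previous row) and the first row of user 2 are dropped.
same_user = pairs[pairs['user_id'] == pairs['prev_user']]
print(same_user[['user_id', 'venue_id', 'prev_venue']])
```

Five check-ins by two users leave three valid pairs: two for user 1 and one for user 2.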

In [7]:
user_trip_counts = shift(checkins_grid).reset_index()
user_trip_counts.head()

,user_id,origin,destination,n_trips
0,19,4adcda60f964a520994421e3,512bb246e4b0447bba28bce6,1
1,19,4b241c53f964a5204e6124e3,4b76f383f964a520ce6d2ee3,1
2,19,4b76f383f964a520ce6d2ee3,4bcc52eb3740b713efc46365,1
3,19,4b7980a2f964a5202bfd2ee3,4adcda60f964a520994421e3,1
4,19,4bcc52eb3740b713efc46365,4b7980a2f964a5202bfd2ee3,1


In [8]:
user_trip_counts.shape

(129855, 4)

Then, we proceed to assign an origin / destination _cell id_ to each trip.

In [9]:
user_trips_grid = (user_trip_counts
                   .join(venues[['venue_id', 'index_right']].set_index('venue_id'), on='origin')
                   .rename({'index_right': 'origin_cell_id'}, axis=1)
                   .join(venues[['venue_id', 'index_right']].set_index('venue_id'), on='destination')
                   .rename({'index_right': 'destination_cell_id'}, axis=1))
user_trips_grid.head()

,user_id,origin,destination,n_trips,origin_cell_id,destination_cell_id
0,19,4adcda60f964a520994421e3,512bb246e4b0447bba28bce6,1,2176,2257
1,19,4b241c53f964a5204e6124e3,4b76f383f964a520ce6d2ee3,1,2236,2236
2,19,4b76f383f964a520ce6d2ee3,4bcc52eb3740b713efc46365,1,2236,2699
3,19,4b7980a2f964a5202bfd2ee3,4adcda60f964a520994421e3,1,2910,2176
4,19,4bcc52eb3740b713efc46365,4b7980a2f964a5202bfd2ee3,1,2699,2910
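
The same venue-to-cell lookup can also be written with `Series.map`, which some readers find easier to follow than two joins. This is a sketch on toy data, not the notebook's method. The frames `toy_venues` and `toy_trips` and the name `cell_of_venue` are hypothetical, and the approach assumes each `venue_id` appears only once in the lookup table.

```python
import pandas as pd

# Toy lookup table: one row per venue with the id of the cell that contains it.
toy_venues = pd.DataFrame({
    'venue_id': ['a', 'b', 'c'],
    'index_right': [10, 10, 20],
})
toy_trips = pd.DataFrame({
    'origin': ['a', 'b'],
    'destination': ['c', 'a'],
    'n_trips': [2, 1],
})

# Series indexed by venue_id; assumes each venue_id appears only once.
cell_of_venue = toy_venues.set_index('venue_id')['index_right']

# Look up the cell of each trip end by venue id.
toy_trips_grid = toy_trips.assign(
    origin_cell_id=toy_trips['origin'].map(cell_of_venue),
    destination_cell_id=toy_trips['destination'].map(cell_of_venue),
)
print(toy_trips_grid)
```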


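Finally, we compute the flow of people between two cells by aggregating the trips dataframe. We discard trips that start and end in the same cell, and keep only cell pairs with more than five trips: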

In [10]:
flows = (user_trips_grid
         [user_trips_grid.origin_cell_id != user_trips_grid.destination_cell_id]
         .groupby(['origin_cell_id', 'destination_cell_id'])
         ['n_trips']
         .sum()
         .reset_index()
         .pipe(lambda x: x[x.n_trips > 5])
         .rename({'origin_cell_id': 'origin', 'destination_cell_id': 'dest', 'n_trips': 'count'}, axis=1)
        )
flows.head()

,origin,dest,count
379,42,509,6
528,45,2236,6
798,68,476,10
819,70,487,22
832,70,2177,6
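
The effect of each step in this aggregation is easier to check on a handful of rows. This is a minimal sketch with made-up numbers. `min_trips` is a hypothetical name for the threshold that the cell above hard-codes as 5.

```python
import pandas as pd

# Toy per-user trip counts between cells; all values are made up.
toy_trips_grid = pd.DataFrame({
    'origin_cell_id':      [1, 1, 1, 2, 3],
    'destination_cell_id': [2, 2, 1, 1, 3],
    'n_trips':             [4, 3, 9, 2, 8],
})

min_trips = 5  # hypothetical name; the notebook keeps pairs with n_trips > 5

# Drop trips that start and end in the same cell.
between_cells = toy_trips_grid[
    toy_trips_grid['origin_cell_id'] != toy_trips_grid['destination_cell_id']
]

# Sum trips over users for each (origin, destination) pair, then apply the threshold.
toy_flows = (between_cells
             .groupby(['origin_cell_id', 'destination_cell_id'])['n_trips']
             .sum()
             .reset_index()
             .loc[lambda x: x['n_trips'] > min_trips])
print(toy_flows)
```

Only the pair (1, 2) survives. Its two rows add up to 7 trips, the pair (2, 1) has just 2 trips, and the rows for (1, 1) and (3, 3) are same-cell trips.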


In [11]:
flows['count'].describe()

count    4753.000000
mean       17.483905
std        44.596368
min         6.000000
25%         7.000000
50%         9.000000
75%        15.000000
max      1687.000000
Name: count, dtype: float64

## Data Export for Visualization

We export the data in the format required by the [flowmap.blue](http://flowmap.blue) platform, which needs two files: one with flow magnitudes and one with flow locations.

The flow magnitudes were estimated above.

In [12]:
flows.to_csv('output/flow_magnitudes.csv', index=False)

The flow locations are the centroids of the cells.

In [13]:
grid = gpd.read_file('output/relevant_grid.geo.json')
grid.head()

,s2_cellid,index_right,geometry
0,1343054926502166528,9,"POLYGON ((1.97837 41.25890, 1.98326 41.25882, ..."
1,1343054935092101120,9,"POLYGON ((1.97347 41.25899, 1.97837 41.25890, ..."
2,1343054960861904896,9,"POLYGON ((1.96858 41.25907, 1.97347 41.25899, ..."
3,1343054969451839488,9,"POLYGON ((1.96369 41.25915, 1.96858 41.25907, ..."
4,1343054978041774080,9,"POLYGON ((1.95880 41.25923, 1.96369 41.25915, ..."


Note that we don't need all cells from the grid, as some of them do not generate or attract trips.

In [14]:
flow_grid = grid[grid.index.isin(flows.origin) | grid.index.isin(flows.dest)]
flow_grid.shape, grid.shape

((595, 3), (3846, 3))

In [15]:
centroids = pd.DataFrame({'x': flow_grid.geometry.centroid.x, 'y': flow_grid.geometry.centroid.y})

In [16]:
centroids.index.name = 'id'

In [17]:
centroids['name'] = centroids.index.values

In [18]:
centroids[['name', 'y', 'x']].rename({'x': 'lon', 'y': 'lat'}, axis=1).to_csv('output/flow_locations.csv')
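
Before uploading, it can help to check that every cell id used in the flow file also appears in the location file. This is a minimal sketch with the standard library and made-up rows. The column names (`origin`, `dest`, `count`, `id`, `name`, `lat`, `lon`) are the ones written by the export cells above. The coordinates, and the id that is deliberately left out of the locations, are invented for illustration.

```python
import csv
import io

# Toy file contents standing in for output/flow_magnitudes.csv
# and output/flow_locations.csv; all values are made up.
flows_csv = "origin,dest,count\n70,487,22\n68,476,10\n"
locations_csv = (
    "id,name,lat,lon\n"
    "70,70,41.39,2.15\n"
    "487,487,41.40,2.16\n"
    "68,68,41.38,2.14\n"
)

flow_rows = list(csv.DictReader(io.StringIO(flows_csv)))
location_rows = list(csv.DictReader(io.StringIO(locations_csv)))

known_ids = {row['id'] for row in location_rows}
used_ids = {row['origin'] for row in flow_rows} | {row['dest'] for row in flow_rows}

# Cell ids referenced by a flow but missing from the locations file.
missing_ids = sorted(used_ids - known_ids)
print(missing_ids)
```

An empty list means the two files are consistent. In this toy example, cell 476 is reported because it has a flow but no location row.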

With these files, we can use flowmap.blue. [See the resulting visualization here!](https://flowmap.blue/1VdL8bOI42S_M597lO-uVZG2Hx4ujwudiSrgPFps6oeo)

## Next Steps

This is just a preliminary exploration of the data. We still need to tackle other challenges, such as:

  * Filtering trips according to temporal distance between check-ins.
  * Building a mobility model from the data.
  * Estimating the model for several cities.
  * Visualizing grid popularity, and including same-cell trips in the visualization.
  * Disaggregating tourist and local trips.

Do you have other interesting questions to ask of the data? Let's collaborate!