# Create a TrajDataFrame

In scikit-mobility, a set of trajectories is described by a `TrajDataFrame`, an extension of the pandas `DataFrame` that has specific column names and data types. A row in the `TrajDataFrame` represents a point of the trajectory, described by three mandatory fields (i.e., columns):
- `latitude` (type: float);
- `longitude` (type: float);
- `datetime` (type: date-time). 

Additionally, two optional columns can be specified: 
- `uid` (type: string) identifies the object associated with the point of the trajectory. If `uid` is not present, scikit-mobility assumes that the `TrajDataFrame` contains trajectories associated with a single moving object;
- `tid` specifies the identifier of the trajectory to which the point belongs. Similar to `uid`, if `tid` is not present, scikit-mobility assumes that the `TrajDataFrame` contains a single trajectory.

Note that, besides the mandatory columns, the user can add to a `TrajDataFrame` as many columns as they want since the data structures in scikit-mobility inherit all the pandas `DataFrame` functionalities.
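
To make the schema above concrete, here is a minimal pure-Python sketch that maps raw rows to the three mandatory fields plus the optional `uid`. The helper `normalize_point` and its arguments are hypothetical and are not part of scikit-mobility; the sketch only assumes the canonical column names (`lat`, `lng`, `datetime`, `uid`) that appear in the printed outputs of this tutorial.

```python
from datetime import datetime

# Canonical column names, as shown in the printed TrajDataFrames of this tutorial.
MANDATORY = ("lat", "lng", "datetime")

def normalize_point(row, latitude, longitude, when, user_id=None):
    """Map one raw row (a list) to a dict with the canonical trajectory fields.

    `latitude`, `longitude`, `when` and `user_id` are positions in `row`.
    Hypothetical helper for illustration, not a scikit-mobility function.
    """
    point = {
        "lat": float(row[latitude]),
        "lng": float(row[longitude]),
        "datetime": datetime.strptime(row[when], "%Y-%m-%d %H:%M:%S"),
    }
    if user_id is not None:
        point["uid"] = row[user_id]
    return point

raw = [
    [1, 39.984094, 116.319236, "2008-10-23 13:53:05"],
    [1, 39.984198, 116.319322, "2008-10-23 13:53:06"],
]
points = [normalize_point(r, latitude=1, longitude=2, when=3, user_id=0) for r in raw]

# Every point carries the three mandatory fields.
for p in points:
    assert all(name in p for name in MANDATORY)
print(points[0])
```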

Create a `TrajDataFrame` from a list:

In [1]:
import skmob
import numpy as np
# fix the random seed for reproducibility
np.random.seed(0)

In [2]:
# create a TrajDataFrame from a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'], [1, 39.984198, 116.319322, '2008-10-23 13:53:06'], [1, 39.984224, 116.319402, '2008-10-23 13:53:11'], [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
tdf = skmob.TrajDataFrame(data_list, latitude=1, longitude=2, datetime=3)
print(tdf.head())

   0        lat         lng            datetime
0  1  39.984094  116.319236 2008-10-23 13:53:05
1  1  39.984198  116.319322 2008-10-23 13:53:06
2  1  39.984224  116.319402 2008-10-23 13:53:11
3  1  39.984211  116.319389 2008-10-23 13:53:16


In [3]:
print(type(tdf))

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


Create a `TrajDataFrame` from a [pandas](https://pandas.pydata.org/) `DataFrame`:

In [4]:
# create a TrajDataFrame from a pandas DataFrame
import pandas as pd
# create a DataFrame from the previous list
data_df = pd.DataFrame(data_list, columns=['user', 'latitude', 'lng', 'hour'])
print(type(data_df))

<class 'pandas.core.frame.DataFrame'>


In [5]:
# now create a TrajDataFrame from the pandas DataFrame
tdf = skmob.TrajDataFrame(data_df, latitude='latitude', datetime='hour', user_id='user')
print(type(tdf))

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


In [6]:
print(tdf.head())

   uid        lat         lng            datetime
0    1  39.984094  116.319236 2008-10-23 13:53:05
1    1  39.984198  116.319322 2008-10-23 13:53:06
2    1  39.984224  116.319402 2008-10-23 13:53:11
3    1  39.984211  116.319389 2008-10-23 13:53:16


Create a `TrajDataFrame` from a file:

In [7]:
# read the trajectory data (GeoLife, Beijing, China)
tdf = skmob.TrajDataFrame.from_file('geolife_sample.txt.gz', latitude='lat', longitude='lon', user_id='user', datetime='datetime')
print(tdf.head())

         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1


A `TrajDataFrame` can be plotted on a [folium](https://python-visualization.github.io/folium/) interactive map using the `plot_trajectory` function.

In [8]:
tdf.plot_trajectory()

# Create a FlowDataFrame

In scikit-mobility, an origin-destination matrix is described by the `FlowDataFrame` structure, an extension of the pandas `DataFrame` that has specific column names and data types. A row in a `FlowDataFrame` represents a flow of objects between two locations, described by three mandatory columns:
- `origin` (type: string); 
- `destination` (type: string);
- `flow` (type: integer). 

Again, the user can add to a `FlowDataFrame` as many columns as they want. Each `FlowDataFrame` is associated with a spatial tessellation, a [geopandas](http://geopandas.org/) `GeoDataFrame` that contains two mandatory columns:
- `tile_ID` (type: integer) indicates the identifier of a location;
- `geometry` indicates the polygon (or point) that describes the geometric shape of the location on a territory (e.g., a square, a Voronoi cell, the shape of a neighborhood).

Note that each location identifier in the `origin` and `destination` columns of a `FlowDataFrame` must be present in the associated spatial tessellation.
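
The consistency requirement above can be checked before building a `FlowDataFrame`. The following is a minimal sketch using plain Python sets; `missing_locations` is a hypothetical helper, not a scikit-mobility function, and the sample identifiers are only illustrative.

```python
def missing_locations(flows, tile_ids):
    """Return the location identifiers used in `flows` that are absent from `tile_ids`.

    `flows` is an iterable of (origin, destination, flow) tuples and
    `tile_ids` an iterable of tessellation identifiers.
    """
    known = set(tile_ids)
    used = set()
    for origin, destination, _flow in flows:
        used.add(origin)
        used.add(destination)
    return sorted(used - known)

tile_ids = ["36001", "36005", "36007"]
flows = [("36001", "36001", 121606), ("36001", "36005", 5), ("36001", "36017", 11)]
print(missing_locations(flows, tile_ids))  # ['36017']
```

An empty result means every origin and destination is covered by the tessellation.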

Create a spatial tessellation from a file:

In [9]:
import skmob
import geopandas as gpd
# load a spatial tessellation
url_tess = 'https://raw.githubusercontent.com/scikit-mobility/scikit-mobility/master/examples/NY_counties_2011.geojson'
tessellation = gpd.read_file(url_tess).rename(columns={'tile_id': 'tile_ID'})
print(tessellation.head())

  tile_ID  population                                           geometry
0   36019       81716  POLYGON ((-74.006668 44.886017, -74.027389 44....
1   36101       99145  POLYGON ((-77.099754 42.274215, -77.0996569999...
2   36107       50872  POLYGON ((-76.25014899999999 42.296676, -76.24...
3   36059     1346176  POLYGON ((-73.707662 40.727831, -73.700272 40....
4   36011       79693  POLYGON ((-76.279067 42.785866, -76.2753479999...


Create a `FlowDataFrame` from a spatial tessellation and a file of flows:


In [10]:
# load real flows into a FlowDataFrame
# download the file with the real fluxes from: https://raw.githubusercontent.com/scikit-mobility/scikit-mobility/master/tutorial/data/NY_commuting_flows_2011.csv
fdf = skmob.FlowDataFrame.from_file("NY_commuting_flows_2011.csv",
                                        tessellation=tessellation,
                                        tile_id='tile_ID',
                                        sep=",")
print(fdf.head())

     flow origin destination
0  121606  36001       36001
1       5  36001       36005
2      29  36001       36007
3      11  36001       36017
4      30  36001       36019


A `FlowDataFrame` can be visualized on a [folium](https://python-visualization.github.io/folium/) interactive map using the `plot_flows` function, which plots the flows on a geographic map as lines between the centroids of the tiles in the `FlowDataFrame`'s spatial tessellation:

In [11]:
fdf.plot_flows(flow_color='red')

Similarly, the spatial tessellation of a `FlowDataFrame` can be visualized using the `plot_tessellation` function. The argument `popup_features` (type: list, default: [`constants.TILE_ID`]) makes the plot more interactive: when the user clicks on a tile, a popup window shows the information contained in the columns of the tessellation's `GeoDataFrame` listed in the argument:

In [12]:
fdf.plot_tessellation(popup_features=['tile_ID', 'population'])

The spatial tessellation and the flows can be visualized together using the `map_f` argument, which specifies the folium object on which to plot:

In [13]:
m = fdf.plot_tessellation()
fdf.plot_flows(flow_color='red', map_f=m)

# Trajectory preprocessing

Like any analytical process, mobility data analysis requires data cleaning and preprocessing steps. The `preprocessing` module allows the user to perform four main preprocessing steps:
- noise filtering;
- stop detection;
- stop clustering;
- trajectory compression.

Note that, if a `TrajDataFrame` contains multiple trajectories from multiple users, the preprocessing methods are automatically applied to each trajectory separately and, when necessary, to each object separately.

## Noise filtering

In scikit-mobility, the standard method `filter` filters out a point if the speed from the previous point is higher than the parameter `max_speed_kmh`, which is set to 500 km/h by default.

In [14]:
from skmob.preprocessing import filtering
# filter out all points with a speed (in km/h) from the previous point higher than 500 km/h
ftdf = filtering.filter(tdf, max_speed_kmh=500.)
print(ftdf.parameters)

{'from_file': 'geolife_sample.txt.gz', 'filter': {'function': 'filter', 'max_speed_kmh': 500.0, 'include_loops': False, 'speed_kmh': 5.0, 'max_loop': 6, 'ratio_max': 0.25}}


In [15]:
n_deleted_points = len(tdf) - len(ftdf) # number of deleted points
print(n_deleted_points)

54


Note that the `TrajDataFrame` structure has the `parameters` attribute, which records the operations that have been applied to the `TrajDataFrame`. This attribute is a dictionary whose keys are the signatures of the applied functions.
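
The speed-based rule described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not scikit-mobility's implementation: distances are great-circle (haversine) distances, and each point is compared with the last point that was kept. The names `haversine_km` and `filter_by_speed` are hypothetical.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def filter_by_speed(points, max_speed_kmh=500.0):
    """Drop each point reached from the last kept point faster than `max_speed_kmh`.

    `points` is a time-ordered list of (lat, lng, datetime) tuples for one object.
    """
    kept = points[:1]
    for lat, lng, t in points[1:]:
        plat, plng, pt = kept[-1]
        hours = (t - pt).total_seconds() / 3600.0
        if hours <= 0:
            continue  # skip duplicated or out-of-order timestamps
        if haversine_km(plat, plng, lat, lng) / hours <= max_speed_kmh:
            kept.append((lat, lng, t))
    return kept

track = [
    (39.984094, 116.319236, datetime(2008, 10, 23, 13, 53, 5)),
    (39.984198, 116.319322, datetime(2008, 10, 23, 13, 53, 6)),
    (41.000000, 118.000000, datetime(2008, 10, 23, 13, 53, 11)),  # implausible jump
    (39.984211, 116.319389, datetime(2008, 10, 23, 13, 53, 16)),
]
clean = filter_by_speed(track)
print(len(track) - len(clean))  # number of deleted points
```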

## Stop detection

Some points in a trajectory can represent Points of Interest (POIs) such as schools, restaurants, and bars, or they can represent user-specific places such as home and work locations. These points are usually called Stay Points or Stops, and they can be detected in different ways. A common approach is to apply spatial clustering algorithms to cluster trajectory points by looking at their spatial proximity. In scikit-mobility, the `stops` function, contained in the `detection` module, finds the stay points visited by an object. For instance, to identify the stops where the object spent at least `minutes_for_a_stop` minutes within a distance `spatial_radius_km * stop_radius_factor` from a given point, we can use the following code:

In [16]:
from skmob.preprocessing import detection
stdf = detection.stops(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True)
print(stdf.head())

         lat         lng            datetime  uid    leaving_datetime
0  39.978253  116.327275 2008-10-23 06:01:05    1 2008-10-23 10:32:53
1  40.013819  116.306532 2008-10-23 11:10:09    1 2008-10-23 23:46:02
2  39.978950  116.326439 2008-10-24 00:12:30    1 2008-10-24 01:48:57
3  39.981316  116.310181 2008-10-24 01:56:47    1 2008-10-24 02:28:19
4  39.981451  116.309505 2008-10-24 02:28:19    1 2008-10-24 03:18:23


In [17]:
print('Points of the original trajectory:\t%s'%len(tdf))
print('Points of stops:\t\t\t%s'%len(stdf))

Points of the original trajectory:	217653
Points of stops:			413


A new column `leaving_datetime` is added to the `TrajDataFrame` in order to indicate the time when the user left the stop location. We can then visualize the detected stops using the `plot_stops` function:

In [18]:
m = stdf.plot_trajectory(max_users=1, start_end_markers=False)
stdf.plot_stops(max_users=1, map_f=m)
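
To illustrate the idea behind stop detection, here is a minimal stay-point sketch in plain Python. It is written under assumptions and is not scikit-mobility's algorithm: a stop is a run of consecutive points that stay within `spatial_radius_km * stop_radius_factor` of the first point of the run for at least `minutes_for_a_stop` minutes, the stop is placed at the mean coordinates of the run, and the leaving time is the timestamp of the first point outside the radius. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def detect_stops(points, minutes_for_a_stop=20.0, spatial_radius_km=0.2,
                 stop_radius_factor=0.5):
    """Return (lat, lng, arrival, leaving) tuples for one time-ordered trajectory."""
    radius = spatial_radius_km * stop_radius_factor
    stops, i, n = [], 0, len(points)
    while i < n:
        lat0, lng0, t0 = points[i]
        # advance j while the points stay within `radius` of the anchor point i
        j = i + 1
        while j < n and haversine_km(lat0, lng0, points[j][0], points[j][1]) <= radius:
            j += 1
        # the object leaves when point j is reached (or at the last observed point)
        leaving = points[j][2] if j < n else points[-1][2]
        if (leaving - t0).total_seconds() / 60.0 >= minutes_for_a_stop:
            group = points[i:j]
            stops.append((sum(p[0] for p in group) / len(group),
                          sum(p[1] for p in group) / len(group), t0, leaving))
            i = j
        else:
            i += 1
    return stops

start = datetime(2008, 10, 23, 6, 0, 0)
# four samples at the same place over 30 minutes, then a move about 4 km away
track = [(39.9783, 116.3273, start + timedelta(minutes=10 * k)) for k in range(4)]
track += [(40.0138, 116.3065, start + timedelta(minutes=40)),
          (40.0138, 116.3065, start + timedelta(minutes=45))]
stops = detect_stops(track)
print(stops)
```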

## Trajectory compression

The goal of trajectory compression is to reduce the number of trajectory points while preserving the structure of the trajectory. This step can reduce the number of trajectory points significantly. In scikit-mobility, we can use one of the methods in the `compression` module under the `preprocessing` module. For instance, to merge all the points that are closer than 0.2 km from each other, we can use the following code:

In [19]:
from skmob.preprocessing import compression
# compress the trajectory using a spatial radius of 0.2 km
ctdf = compression.compress(tdf, spatial_radius_km=0.2)
print('Points of the original trajectory:\t%s'%len(tdf))
print('Points of the compressed trajectory:\t%s'%len(ctdf))

Points of the original trajectory:	217653
Points of the compressed trajectory:	6281
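
The following plain-Python sketch shows one way such a compression can work. It rests on assumptions that the text above does not state: consecutive points within `spatial_radius_km` of the first point of a group are merged into a single point that takes the median coordinates of the group and the timestamp of its first point. The function names are hypothetical and the code is not scikit-mobility's implementation.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def merge_group(group):
    """Collapse a group of points into (median lat, median lng, first timestamp)."""
    return (median(p[0] for p in group), median(p[1] for p in group), group[0][2])

def compress_points(points, spatial_radius_km=0.2):
    """Merge runs of time-ordered (lat, lng, t) points that stay near the run's first point."""
    compressed, group = [], []
    for p in points:
        # close the current group when the new point is too far from its first point
        if group and haversine_km(group[0][0], group[0][1], p[0], p[1]) > spatial_radius_km:
            compressed.append(merge_group(group))
            group = []
        group.append(p)
    if group:
        compressed.append(merge_group(group))
    return compressed

# timestamps are plain seconds here to keep the example short
track = [
    (39.984094, 116.319236, 0),
    (39.984198, 116.319322, 1),
    (39.984224, 116.319402, 6),
    (40.013819, 116.306532, 600),
    (40.013900, 116.306600, 605),
]
compressed = compress_points(track)
print(len(track), len(compressed))
```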


# Measures


Several measures have been proposed in the literature to capture the patterns of human mobility, both at the individual and collective levels. Individual measures summarize the mobility patterns of a single moving object, while collective measures summarize the mobility patterns of a population as a whole. scikit-mobility provides a wide set of mobility measures, each implemented as a function that takes as input a `TrajDataFrame` and outputs a pandas `DataFrame`. Individual and collective measures are implemented in the `skmob.measures.individual` and the `skmob.measures.collective` modules, respectively.

For example, the following code computes the *radius of gyration*, the *jump lengths* and the *home locations* of a `TrajDataFrame`:

In [20]:
from skmob.measures.individual import jump_lengths, radius_of_gyration, home_location

In [21]:
url = "https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz"
df = pd.read_csv(url, sep='\t', header=0, nrows=100000,
             names=['user', 'check-in_time', 'latitude', 'longitude', 'location id'])
tdf = skmob.TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='check-in_time', user_id='user')
rg_df = radius_of_gyration(tdf)
print(rg_df.head())

100%|██████████| 162/162 [00:01<00:00, 138.66it/s]

   uid  radius_of_gyration
0    0         1564.436792
1    1         2467.773523
2    2         1439.649774
3    3         1752.604191
4    4         5380.503250
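
As a reference for what the radius of gyration measures, here is a minimal plain-Python sketch of the standard definition: the root mean square distance of an individual's points from their center of mass. Taking the center of mass as the mean latitude and longitude, and using haversine distances in kilometers, are assumptions of this sketch; the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def radius_of_gyration_km(points):
    """Root mean square distance (km) of (lat, lng) points from their center of mass."""
    n = len(points)
    clat = sum(p[0] for p in points) / n
    clng = sum(p[1] for p in points) / n
    return sqrt(sum(haversine_km(lat, lng, clat, clng) ** 2 for lat, lng in points) / n)

# two points on the equator, one degree of longitude on each side of the center
rg = radius_of_gyration_km([(0.0, -1.0), (0.0, 1.0)])
print(round(rg, 2))
```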





In [22]:
jl_df = jump_lengths(tdf.sort_values(by='datetime'))
print(jl_df.head())

100%|██████████| 162/162 [00:01<00:00, 83.52it/s] 

   uid                                       jump_lengths
0    0  [19.640467328877936, 0.0, 0.0, 1.7434311010381...
1    1  [6.505330424378251, 46.75436600375988, 53.9284...
2    2  [0.0, 0.0, 0.0, 0.0, 3.6410097195943507, 0.0, ...
3    3  [3861.2706300798827, 4.061631313492122, 5.9163...
4    4  [15511.92758595804, 0.0, 15511.92758595804, 1....





Note that for some measures, such as `jump_lengths`, the `TrajDataFrame` must be sorted in increasing order by the column `datetime` (see the documentation for the measures that require this condition: https://scikit-mobility.github.io/scikit-mobility/reference/measures.html).
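
The effect of ordering is easy to see in a small plain-Python sketch, where a jump length is taken to be the haversine distance between consecutive points of one individual. The function names are hypothetical and the sketch is only an illustration of why sorting matters.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def jump_lengths_km(points):
    """Distances (km) between consecutive (lat, lng, t) points after sorting by t."""
    ordered = sorted(points, key=lambda p: p[2])  # sort by timestamp first
    return [haversine_km(a[0], a[1], b[0], b[1]) for a, b in zip(ordered, ordered[1:])]

# the same three visits, stored out of chronological order
visits = [(0.0, 2.0, 3), (0.0, 0.0, 1), (0.0, 1.0, 2)]
jumps = jump_lengths_km(visits)
print([round(j, 2) for j in jumps])

# without sorting, the first "jump" would span two degrees instead of one
unsorted_first = haversine_km(visits[0][0], visits[0][1], visits[1][0], visits[1][1])
```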

In [23]:
# compute the home location for each individual
hl_df = home_location(tdf)
print(hl_df.head())

100%|██████████| 162/162 [00:01<00:00, 100.06it/s]

   uid        lat         lng
0    0  39.891077 -105.068532
1    1  37.630490 -122.411084
2    2  39.739154 -104.984703
3    3  37.748170 -122.459192
4    4  60.180171   24.949728
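
One common way to estimate a home location, sketched below in plain Python, is to take the location an individual visits most often at night. The night window (22:00 to 07:00), the fallback to all points when no night-time record exists, and the helper name `night_home` are assumptions of this sketch rather than facts stated in this tutorial.

```python
from collections import Counter
from datetime import datetime, time

def night_home(points, start_night=time(22, 0), end_night=time(7, 0)):
    """Most frequent (lat, lng) among night-time points; falls back to all points."""
    night = [(lat, lng) for lat, lng, t in points
             if t.time() >= start_night or t.time() <= end_night]
    candidates = night or [(lat, lng) for lat, lng, _ in points]
    return Counter(candidates).most_common(1)[0][0]

place_a = (39.89, -105.07)  # visited at night
place_b = (39.74, -104.98)  # visited during the day
checkins = [
    (*place_a, datetime(2008, 4, 1, 23, 10)),
    (*place_a, datetime(2008, 4, 2, 2, 30)),
    (*place_b, datetime(2008, 4, 2, 9, 0)),
    (*place_b, datetime(2008, 4, 2, 12, 0)),
    (*place_b, datetime(2008, 4, 2, 15, 0)),
]
print(night_home(checkins))
```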





In [24]:
import folium
from folium.plugins import HeatMap
m = folium.Map(tiles = 'openstreetmap', zoom_start=12, control_scale=True)
HeatMap(hl_df[['lat', 'lng']].values).add_to(m)
m

# Collective models

Collective generative algorithms estimate spatial flows between a set of discrete locations. Examples of spatial flows estimated with collective generative algorithms include commuting trips between neighborhoods, migration flows between municipalities, freight shipments between states, and phone calls between regions.

In scikit-mobility, a collective generative algorithm takes as input a spatial tessellation, i.e., a geopandas `GeoDataFrame`. To be a valid input for a collective algorithm, the spatial tessellation should contain two columns, `geometry` and `relevance`, which are necessary to compute the two variables used by collective algorithms: the distance between tiles and the importance (aka "attractiveness") of each tile. A collective algorithm produces a `FlowDataFrame` that contains the generated flows and the spatial tessellation. scikit-mobility implements the most common collective generative algorithms:
- the `Gravity` model; 
- the `Radiation` model. 

## Gravity model

The class `Gravity`, implementing the Gravity model, has two main methods: 
- `fit`, which calibrates the model's parameters using a `FlowDataFrame`; 
- `generate`, which generates the flows on a given spatial tessellation. 

Load the spatial tessellation and a data set of real flows in a `FlowDataFrame`:

In [25]:
from skmob.utils import utils, constants
import geopandas as gpd
from skmob.models import Gravity
import numpy as np

In [26]:
# load a spatial tessellation
url_tess = 'https://raw.githubusercontent.com/scikit-mobility/scikit-mobility/master/examples/NY_counties_2011.geojson'
tessellation = gpd.read_file(url_tess).rename(columns={'tile_id': 'tile_ID'})
# download the file with the real fluxes from: https://raw.githubusercontent.com/scikit-mobility/scikit-mobility/master/tutorial/data/NY_commuting_flows_2011.csv
fdf = skmob.FlowDataFrame.from_file("NY_commuting_flows_2011.csv",
                                        tessellation=tessellation,
                                        tile_id='tile_ID',
                                        sep=",")
# compute the total outflows from each location of the tessellation (excluding self loops)
tot_outflows = fdf[fdf['origin'] != fdf['destination']].groupby(by='origin')[['flow']].sum().fillna(0)
tessellation = tessellation.merge(tot_outflows, left_on='tile_ID', right_on='origin').rename(columns={'flow': constants.TOT_OUTFLOW})

Instantiate a Gravity model object and generate synthetic flows:

In [27]:
# instantiate a singly constrained Gravity model
gravity_singly = Gravity(gravity_type='singly constrained')
print(gravity_singly)

Gravity(name="Gravity model", deterrence_func_type="power_law", deterrence_func_args=[-2.0], origin_exp=1.0, destination_exp=1.0, gravity_type="singly constrained")


In [28]:
# start the simulation
np.random.seed(0)
synth_fdf = gravity_singly.generate(tessellation,
                                   tile_id_column='tile_ID',
                                   tot_outflows_column='tot_outflow',
                                   relevance_column= 'population',
                                   out_format='flows')
# print a portion of the synthetic flows
print(synth_fdf.head())

100%|██████████| 62/62 [00:00<00:00, 2018.43it/s]

  origin destination  flow
0  36019       36101   109
1  36019       36107    52
2  36019       36059  1105
3  36019       36011   152
4  36019       36123    34





Fit the parameters of the Gravity model from the `FlowDataFrame` and generate the synthetic flows:

In [29]:
# instantiate a Gravity object (with default parameters)
gravity_singly_fitted = Gravity(gravity_type='singly constrained')
print(gravity_singly_fitted)

Gravity(name="Gravity model", deterrence_func_type="power_law", deterrence_func_args=[-2.0], origin_exp=1.0, destination_exp=1.0, gravity_type="singly constrained")


In [30]:
# fit the parameters of the Gravity model from real fluxes
gravity_singly_fitted.fit(fdf, relevance_column='population')
print(gravity_singly_fitted)

Gravity(name="Gravity model", deterrence_func_type="power_law", deterrence_func_args=[-1.9947152031914142], origin_exp=1.0, destination_exp=0.6471759552223146, gravity_type="singly constrained")


In [31]:
np.random.seed(0)
synth_fdf_fitted = gravity_singly_fitted.generate(tessellation,
                                                        tile_id_column='tile_ID',
                                                        tot_outflows_column='tot_outflow',
                                                        relevance_column= 'population',
                                                        out_format='flows')
# print a portion of the synthetic flows
print(synth_fdf_fitted.head())

100%|██████████| 62/62 [00:00<00:00, 1999.68it/s]

  origin destination  flow
0  36019       36101   142
1  36019       36107   101
2  36019       36059   578
3  36019       36011   213
4  36019       36123    97





Plot the real flows and the synthetic flows:

In [32]:
m = fdf.plot_flows(min_flow=100, flow_exp=0.01, flow_color='blue')
synth_fdf_fitted.plot_flows(min_flow=1000, flow_exp=0.01, map_f=m)
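
The idea behind a singly constrained Gravity model with a power-law deterrence function can be sketched in plain Python. This is an illustration under assumptions, not scikit-mobility's code: the expected flow from tile `i` to tile `j` is the total outflow of `i`, split in proportion to `relevance[j] ** destination_exp * distance[i][j] ** deterrence_exp`, with self-flows excluded. The function name and the sample numbers are hypothetical.

```python
def gravity_singly_constrained(outflows, relevance, distance_km,
                               destination_exp=1.0, deterrence_exp=-2.0):
    """Expected flows T[i][j]; each row i sums to outflows[i]."""
    n = len(outflows)
    flows = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # attractiveness of each destination j as seen from origin i
        weights = [0.0 if i == j else
                   relevance[j] ** destination_exp * distance_km[i][j] ** deterrence_exp
                   for j in range(n)]
        total = sum(weights)
        for j in range(n):
            flows[i][j] = outflows[i] * weights[j] / total
    return flows

outflows = [100, 50, 10]        # total outflow of each tile
relevance = [1000, 2000, 500]   # e.g. population of each tile
distance_km = [[0, 10, 20],
               [10, 0, 10],
               [20, 10, 0]]
flows = gravity_singly_constrained(outflows, relevance, distance_km)
print([round(f, 2) for f in flows[0]])
```

Because the weights are normalized per origin, the outflow of every tile is preserved, which is what "singly constrained" refers to.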

## Radiation model

The Radiation model is parameter-free and has only one method: `generate`. Given a spatial tessellation, the synthetic flows can be generated using the `Radiation` class as follows:

In [33]:
from skmob.models import Radiation
# instantiate a Radiation object
radiation = Radiation()
# start the simulation
np.random.seed(0)
rad_flows = radiation.generate(tessellation, tile_id_column='tile_ID',  tot_outflows_column='tot_outflow', relevance_column='population', out_format='flows_sample')
# print a portion of the synthetic flows
print(rad_flows.head())

100%|██████████| 62/62 [00:00<00:00, 444.23it/s]

  origin destination   flow
0  36019       36033  11648
1  36019       36031   4232
2  36019       36089   5598
3  36019       36113   1596
4  36019       36041    117
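
For intuition, the Radiation model's flow formula can be sketched in plain Python. The sketch assumes the standard formulation, in which the flow from `i` to `j` depends on the relevance of the origin, the relevance of the destination, and the total relevance `s` of the tiles closer to `i` than `j` is. It omits any finite-size normalization and is not scikit-mobility's implementation; the function name and sample numbers are hypothetical.

```python
def radiation_flows(outflows, relevance, distance_km):
    """Expected flows T[i][j] = O_i * m_i * m_j / ((m_i + s) * (m_i + m_j + s))."""
    n = len(outflows)
    flows = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # s: relevance of tiles at least as close to i as j is (i and j excluded)
            s = sum(relevance[k] for k in range(n)
                    if k not in (i, j) and distance_km[i][k] <= distance_km[i][j])
            m_i, m_j = relevance[i], relevance[j]
            flows[i][j] = outflows[i] * m_i * m_j / ((m_i + s) * (m_i + m_j + s))
    return flows

outflows = [100, 50, 10]
relevance = [1000, 2000, 500]
distance_km = [[0, 10, 25],
               [10, 0, 12],
               [25, 12, 0]]
flows = radiation_flows(outflows, relevance, distance_km)
print([round(f, 2) for f in flows[0]])
```

No parameter has to be fitted here: the flows follow from the relevance values and the ranking of distances alone.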





# Individual generative models

The goal of individual generative algorithms of human mobility is to create a population of agents whose mobility patterns are statistically indistinguishable from those of real individuals. A generative algorithm typically generates a synthetic trajectory corresponding to a single moving object, assuming that an object is independent of the others. 

scikit-mobility implements the most common individual generative algorithms, such as the [Exploration and Preferential Return](https://www.nature.com/articles/nphys1760) model and its variants, and [DITRAS](https://link.springer.com/article/10.1007/s10618-017-0548-4). Each generative model is a Python class with a public method `generate`, which starts the generation of synthetic trajectories.

The following code generates synthetic trajectories using the `DensityEPR` model:

In [35]:
from skmob.models.epr import DensityEPR
# load a spatial tessellation on which to perform the simulation
url = 'https://raw.githubusercontent.com/scikit-mobility/scikit-mobility/master/examples/NY_counties_2011.geojson'
tessellation = gpd.read_file(url)
# starting and end times of the simulation
start_time = pd.to_datetime('2019/01/01 08:00:00')
end_time = pd.to_datetime('2019/01/14 08:00:00')
# instantiate a DensityEPR object
depr = DensityEPR()
# start the simulation
synth_tdf = depr.generate(start_time, end_time, tessellation, relevance_column='population', n_agents=100, show_progress=True)
print(synth_tdf.head())

100%|██████████| 100/100 [00:11<00:00,  5.16it/s]


   uid                   datetime        lat        lng
0    1 2019-01-01 08:00:00.000000  42.452018 -76.473618
1    1 2019-01-01 08:32:30.108708  42.170344 -76.306260
2    1 2019-01-01 09:09:11.760703  42.452018 -76.473618
3    1 2019-01-01 10:00:22.832309  43.999013 -76.051987
4    1 2019-01-01 14:00:25.923314  42.452018 -76.473618


In [36]:
print(synth_tdf.parameters)

{'model': {'class': <function DensityEPR.__init__ at 0x118218c80>, 'generate': {'start_date': Timestamp('2019-01-01 08:00:00'), 'end_date': Timestamp('2019-01-14 08:00:00'), 'gravity_singly': {}, 'n_agents': 100, 'relevance_column': 'population', 'random_state': None, 'show_progress': True}}}
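
The exploration and preferential return mechanism can be sketched in a few lines of plain Python. This is a simplified illustration under assumptions, not the `DensityEPR` implementation: at each step the agent explores a new location with probability `rho * S ** (-gamma)`, where `S` is the number of distinct locations visited so far, and otherwise returns to a known location with probability proportional to its past visits. New locations are drawn uniformly at random here, waiting times are ignored, and the function name and default values are chosen for illustration only.

```python
import random
from collections import Counter

def epr_visits(locations, n_steps, rho=0.6, gamma=0.21, seed=0):
    """Generate a sequence of visited locations for a single agent."""
    rng = random.Random(seed)
    visits = Counter({rng.choice(locations): 1})  # visit counts per location
    sequence = list(visits)                       # starts with the first location
    for _ in range(n_steps):
        unvisited = [loc for loc in locations if loc not in visits]
        p_explore = rho * len(visits) ** (-gamma)
        if unvisited and rng.random() < p_explore:
            nxt = rng.choice(unvisited)  # exploration of a new location
        else:
            places = list(visits)
            # preferential return: weight each known place by its visit count
            nxt = rng.choices(places, weights=[visits[p] for p in places])[0]
        visits[nxt] += 1
        sequence.append(nxt)
    return sequence

seq = epr_visits(list("ABCDE"), n_steps=50)
print(Counter(seq).most_common())
```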
