# 0 - Introduction to scikit-mobility

### This is the alpha version; any feedback is welcome :)

#### Contacts

<ul>
  <li>Luca Pappalardo - lucapappalardo1984@gmail.com</li>
  <li>Gianni Barlacchi - gianni.barlacchi@gmail.com</li>
  <li>Filippo Simini - flppsmn@gmail.com</li>
</ul>

### What's scikit-mobility?

<b>scikit-mobility</b> is a Python package for working with mobility data. The library extends the data structures of Pandas to handle trajectories and fluxes of mobility.

scikit-mobility is well suited for working with:

<ul>
    <li><b>trajectories</b> composed of latitude/longitude points (e.g. GPS data)</li>
    <li><b>fluxes</b> of movements between places (e.g. OD matrix)</li>
</ul>

We extended Pandas to provide user-friendly, powerful and stable basic data structures.

The two primary data structures of scikit-mobility are:

<ul>
    <li><b>TrajDataFrame</b> - a dataframe designed to deal with mobility trajectories.</li>
    <li><b>FlowDataFrame</b> - a dataframe designed to deal with fluxes of movements mapped onto a tessellation.</li>
</ul>


### What to do with scikit-mobility?

Here are just a few of the things that scikit-mobility does well:

<ul>
    <li><b>Preprocessing</b> of mobility data such as clustering, compressing, detecting and filtering trajectories. </li>
    <li><b>Measuring</b> individual and collective mobility behaviours.</li>
</ul>
    

# TrajDataFrame


A `TrajDataFrame` is a Pandas DataFrame in which each row describes one point of a trajectory. It can contain any number of columns, but three of them must always be present:

- `lat` - latitude of the point
- `lng` - longitude of the point
- `datetime` - date and time at which the point was visited

Additionally, for multi-user and multi-trajectory datasets, there are two optional columns:

- `uid` - identifier of the user to whom the trajectory belongs
- `tid` - identifier for the trajectory
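
The column requirements above can be checked on raw records before building a `TrajDataFrame`. The following is a minimal plain-Python sketch, not scikit-mobility's own validation; the helper name `missing_columns` and the sample record are illustrative assumptions.

```python
from datetime import datetime

# Column names taken from the two lists above.
MANDATORY = ("lat", "lng", "datetime")
OPTIONAL = ("uid", "tid")


def missing_columns(record):
    """Return the mandatory column names absent from a record (hypothetical helper)."""
    return [name for name in MANDATORY if name not in record]


# One trajectory point with the three mandatory columns plus the optional `uid`.
point = {
    "uid": 1,
    "lat": 39.984094,
    "lng": 116.319236,
    "datetime": datetime(2008, 10, 23, 13, 53, 5),
}

print(missing_columns(point))                          # nothing is missing
print(missing_columns({"lat": 39.98, "lng": 116.32}))  # `datetime` is missing
```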

In [3]:
# Import the library

import skmob

## Construction of a `TrajDataFrame`

The `TrajDataFrame` can be created from:

- a python list or a numpy array
- a pandas `DataFrame`
- a python dictionary

### From `list`

In [15]:
# From a list

data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
             [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
             [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list

[[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
 [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
 [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
 [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]

The constructor needs to know the indexes of the mandatory columns. We can set them with the constructor arguments `latitude`, `longitude` and `datetime`.

In [16]:
tdf = skmob.TrajDataFrame(data_list, latitude=1, longitude=2, datetime=3)
tdf

Unnamed: 0,0,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


If present, the indexes of the `uid` and `tid` columns can also be specified with the constructor arguments `user_id` and `trajectory_id`.

In [18]:
tdf = skmob.TrajDataFrame(data_list, latitude=1, longitude=2, datetime=3, user_id=0)
tdf.head(5)

Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


### From `DataFrame`

In [31]:
import pandas as pd

# Let's build a dataframe from the 2D list
data_df = pd.DataFrame(data_list, columns=['user', 'latitude', 'lng', 'hour'])

print(type(data_df))
data_df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,user,latitude,lng,hour
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


The column names in the dataframe don't match the names required by the `TrajDataFrame`.
We must specify the names of the mandatory columns in the original `DataFrame` using the constructor arguments `latitude`, `longitude` and `datetime`. If present, the columns holding the user ID and the trajectory ID must also be specified with `user_id` and `trajectory_id`. In the call below, `longitude` is left out because that column is already named `lng`.

In [32]:
tdf = skmob.TrajDataFrame(data_df, latitude='latitude', datetime='hour', user_id='user')

print(type(tdf))
tdf

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16
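
Judging from the output above, the keyword arguments act as a mapping from the source column names to the standard names `uid`, `lat`, `lng` and `datetime`. A rough plain-Python sketch of that renaming step follows. The helper `standardize` and its default values are assumptions made for illustration; it does not reproduce the library's internals (for instance, it does not parse the timestamps).

```python
def standardize(columns, latitude="lat", longitude="lng",
                datetime="datetime", user_id="uid"):
    """Rename source columns to the standard names (hypothetical helper)."""
    mapping = {latitude: "lat", longitude: "lng",
               datetime: "datetime", user_id: "uid"}
    # Columns that are not in the mapping keep their original name.
    return {mapping.get(name, name): values for name, values in columns.items()}


raw = {
    "user": [1, 1],
    "latitude": [39.984094, 39.984198],
    "lng": [116.319236, 116.319322],
    "hour": ["2008-10-23 13:53:05", "2008-10-23 13:53:06"],
}

renamed = standardize(raw, latitude="latitude", datetime="hour", user_id="user")
print(list(renamed))
```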


### From `dictionary`

In [33]:
# Let's build a dictionary from the DataFrame

data_dict = data_df.to_dict(orient='list')
data_dict

{'user': [1, 1, 1, 1],
 'latitude': [39.984094, 39.984198, 39.984224, 39.984211],
 'lng': [116.319236, 116.319322, 116.319402, 116.319389],
 'hour': ['2008-10-23 13:53:05',
  '2008-10-23 13:53:06',
  '2008-10-23 13:53:11',
  '2008-10-23 13:53:16']}

In [34]:
tdf = skmob.TrajDataFrame(data_dict, latitude='latitude', datetime='hour', user_id='user')
tdf

Unnamed: 0,uid,lat,lng,datetime
0,1,39.984094,116.319236,2008-10-23 13:53:05
1,1,39.984198,116.319322,2008-10-23 13:53:06
2,1,39.984224,116.319402,2008-10-23 13:53:11
3,1,39.984211,116.319389,2008-10-23 13:53:16


### From `file`

Most of the time, mobility data are stored in CSV files. `TrajDataFrame` has its own method, `from_file`, to construct the object from an input file.

Let's try it with a subsample of the GeoLife trajectories. The whole dataset can be found [here](https://www.microsoft.com/en-us/download/details.aspx?id=52367).

In [35]:
# Read a dataset containing trajectories (a subsample of the GeoLife data)

tdf = skmob.TrajDataFrame.from_file('./data/geolife_sample.txt.gz', sep=',')

In [36]:
# Let's explore the dataframe as we would do with Pandas

tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [37]:
filename = './data/geolife_sample.txt.gz'
!gzcat ./data/geolife_sample.txt.gz | head

lat,lng,datetime,uid
39.984094,116.319236,2008-10-23 05:53:05,001
39.984198,116.319322,2008-10-23 05:53:06,001
39.984224,116.319402,2008-10-23 05:53:11,001
39.984211,116.319389,2008-10-23 05:53:16,001
39.984217,116.319422,2008-10-23 05:53:21,001
39.98471,116.319865,2008-10-23 05:53:23,001
39.984674,116.31981,2008-10-23 05:53:28,001
39.984623,116.319773,2008-10-23 05:53:33,001
39.984606,116.319732,2008-10-23 05:53:38,001
gzcat: error writing to output: Broken pipe
gzcat: ./data/geolife_sample.txt.gz: uncompress failed


In [38]:
tdf = skmob.TrajDataFrame.from_file(filename, sep=',')
tdf[:5]

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


## Properties

A `TrajDataFrame` carries two additional properties:

- `crs`: the coordinate reference system associated with the trajectories. The default is epsg:4326 (lat/long).
- `parameters`: a dictionary in which any number of additional properties can be stored.

In [39]:
tdf.crs

{'init': 'epsg:4326'}

In [40]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz'}

In [41]:
tdf.parameters['something'] = 5
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

## Column types

The mandatory (and optional) columns must have specific types.

In [42]:
# In the DataFrame
type(data_df)

pandas.core.frame.DataFrame

In [43]:
data_df.dtypes

user          int64
latitude    float64
lng         float64
hour         object
dtype: object

In [44]:
# In the TrajDataFrame
type(tdf)

skmob.core.trajectorydataframe.TrajDataFrame

In [133]:
tdf.dtypes

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [47]:
# We can access the columns as we would do with pandas

tdf.lat.head()

0    39.984094
1    39.984198
2    39.984224
3    39.984211
4    39.984217
Name: lat, dtype: float64

## Write and read a `TrajDataFrame`

scikit-mobility provides dedicated functions to write a `TrajDataFrame` to a file, and to read it back, together with all the metadata attached to it.

### Write a `TrajDataFrame` to file

- includes `tdf.parameters` and `tdf.crs`
- automatically preserves the dtype of columns with timestamps (time zone information is lost, though)

**Caveat**: dtypes other than `int`, `float` and `datetime` may not be identical to the original dtype after loading from a json file. 

Check with `tdf.dtypes` and manually convert each column to the proper dtype, if needed. 
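
The caveat is easier to see with JSON itself, which offers only a handful of value types, so anything richer comes back as the nearest JSON type. The sketch below uses only the standard library, not `skmob.write` or `skmob.read` (whose file layout is not shown here), and the `cell` field is made up for illustration. It shows a value changing type on the way through JSON and being converted back by hand, which is analogous to fixing a column's dtype after loading.

```python
import json

# A record with a value whose type JSON cannot represent exactly: a tuple.
record = {"uid": 1, "lat": 39.984094, "cell": (3, 7)}

decoded = json.loads(json.dumps(record))
print(type(decoded["cell"]).__name__)  # the tuple came back as a list

# Manual conversion back to the intended type.
decoded["cell"] = tuple(decoded["cell"])
print(decoded == record)
```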

In [50]:
skmob.write(tdf, './tdf.json')

In [51]:
tdf.dtypes

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [52]:
tdf.parameters

{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

### Load a `TrajDataFrame` from file

- loads the trajectory data
- automatically restores the `tdf.parameters` and `tdf.crs` stored by the write function


In [56]:
tdf2 = skmob.read('./tdf.json')
tdf2[:4]

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1


In [57]:
tdf2.dtypes

lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object

In [58]:
tdf2.parameters

{'from_file': './data/geolife_sample.txt.gz', 'something': 5}

## Plotting

A very important aspect of working with mobility data is visualization. The `TrajDataFrame` provides the `plot_trajectory` method for this purpose.

In [150]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, tiles='Stamen Toner')

# FlowDataFrame

The `FlowDataFrame` is a structure for flow data that extends Pandas' `DataFrame`.

A `FlowDataFrame` must have the following **columns**:

- `origin`: ID of the origin tile
- `destination`: ID of the destination tile
- `flow`: number of people travelling from the origin to the destination

Other columns may be present but are not mandatory. 

The **rows** of the `FlowDataFrame` correspond to origin-destination pairs.
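
To make the column semantics concrete, the sketch below counts trips per origin-destination pair using only the standard library. The tile IDs and the trip list are made-up illustrative data, and the result is a plain list of rows, not a `FlowDataFrame`.

```python
from collections import Counter

# Hypothetical trips, each given as (origin tile ID, destination tile ID).
trips = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("A", "B")]

# One row per origin-destination pair; `flow` is the number of trips on that pair.
flows = [
    {"origin": origin, "destination": destination, "flow": count}
    for (origin, destination), count in sorted(Counter(trips).items())
]

for row in flows:
    print(row)
```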


### Tessellation

A tessellation is a geopandas `GeoDataFrame` that defines the tiles or locations in the `FlowDataFrame`. 

The `FlowDataFrame` constructor requires a tessellation containing the geometry of all the tiles, origins and destinations, that appear in the rows of the `FlowDataFrame`.
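
In other words, every ID in the `origin` and `destination` columns should match a tile in the tessellation. A minimal consistency check in plain Python might look as follows. The tile IDs and the helper `unknown_tiles` are illustrative assumptions; a real tessellation would carry the tile geometries in a `GeoDataFrame`.

```python
def unknown_tiles(flows, tile_ids):
    """Return the sorted tile IDs used in the flows but absent from the tessellation."""
    used = {row["origin"] for row in flows} | {row["destination"] for row in flows}
    return sorted(used - set(tile_ids))


# Hypothetical tessellation IDs and flows.
tile_ids = ["A", "B", "C"]
flows = [
    {"origin": "A", "destination": "B", "flow": 3},
    {"origin": "B", "destination": "D", "flow": 1},  # "D" is not a known tile
]

print(unknown_tiles(flows, tile_ids))
```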

## Construction of a `FlowDataFrame`

The `FlowDataFrame` can be created from:

- a python list or a numpy array
- a pandas `DataFrame`
- a python dictionary