<center><img src="logo_skmob.png" width=450 align="left" /></center>

# Preprocessing mobility data

- Repo: [http://bit.ly/skmob_repo](http://bit.ly/skmob_repo)
- Docs: [http://bit.ly/skmob_doc](http://bit.ly/skmob_doc)
- Paper: [http://bit.ly/skmob_paper](http://bit.ly/skmob_paper)



### GPS: the [GeoLife dataset](https://www.microsoft.com/en-us/download/details.aspx?id=52367)

Collected by 182 users in the Microsoft Research Asia **GeoLife** project between April 2007 and August 2012:

- 17,621 trajectories
- total distance of about 1.2 million kilometers 
- total duration of 48,000+ hours.

In [None]:
# import the skmob and pandas libraries
import skmob
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
tdf = skmob.TrajDataFrame.from_file('data/geolife_sample.txt.gz').sort_values(by='datetime')
print(type(tdf))
print(tdf.crs, tdf.parameters)
tdf.head()

In [None]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, tiles='Stamen Toner',
                    start_end_markers=False)

- How many users in the data set?
- How many points?
- What's the time window?

In [None]:
print('# users: %s' %len(tdf.uid.unique()))
print('# points: %s' %len(tdf))
print('time window: %s' 
      %(tdf.iloc[-1].datetime - tdf.iloc[0].datetime))

## Let's focus on a single user
using the *select* operation as we do in **pandas**

In [None]:
user1_tdf = tdf[tdf.uid == 1]
user1_tdf.head()

In [None]:
user1_map = user1_tdf.plot_trajectory(zoom=11, weight=3, hex_color='black',
                                      tiles='Open Street Map')
user1_map

## Mobility data preprocessing

There are four common steps we can apply to clean our data:

1. Filtering (`filtering.filter`)
2. Compression (`compression.compress`)
3. Stop detection (`detection.stops`)
4. Stop clustering (`clustering.cluster`)


## Filtering trajectories

Filter out points whose speed from the previous point is higher than `max_speed_kmh` km/h.
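
To make the idea concrete, here is a minimal pure-Python sketch of a speed filter. It is an illustration under simplifying assumptions, not skmob's implementation: `haversine_km` and `speed_filter` are hypothetical helpers, and each point is compared with the last point that was kept.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def speed_filter(points, max_speed_kmh=500.0):
    """Keep a point only if the speed from the last kept point is <= max_speed_kmh.

    `points` is a time-sorted list of (lat, lng, datetime) tuples (hypothetical layout).
    """
    if not points:
        return []
    kept = [points[0]]
    for lat, lng, dt in points[1:]:
        lat0, lng0, dt0 = kept[-1]
        hours = (dt - dt0).total_seconds() / 3600
        if hours <= 0:
            continue  # duplicate or out-of-order timestamp: speed is undefined, drop it
        if haversine_km((lat0, lng0), (lat, lng)) / hours <= max_speed_kmh:
            kept.append((lat, lng, dt))
    return kept


# made-up track with one GPS jump of more than 100 km in 5 seconds
track = [
    (39.9840, 116.3190, datetime(2008, 10, 23, 5, 53, 0)),
    (39.9841, 116.3192, datetime(2008, 10, 23, 5, 53, 5)),
    (40.9000, 117.5000, datetime(2008, 10, 23, 5, 53, 10)),
    (39.9842, 116.3194, datetime(2008, 10, 23, 5, 53, 15)),
]
cleaned = speed_filter(track)
print(len(track) - len(cleaned), "point(s) filtered")
```

Comparing against the last kept point, rather than the last raw point, means a single outlier does not also cause the valid point after it to be dropped.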

In [None]:
from skmob.preprocessing import filtering

In [None]:
max_speed_kmh = 500.
user1_f_tdf = filtering.filter(user1_tdf, max_speed_kmh=max_speed_kmh)

In [None]:
user1_f_tdf.parameters

Very few points have been filtered.

In [None]:
print('Filtered points:\t%s'%(len(user1_tdf) - len(user1_f_tdf)))

In [None]:
# indicator adds column _merge
merged = user1_tdf.merge(user1_f_tdf, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'left_only']
diff_df

Extract the filtered points between indexes `25372` and `25377`.

In [None]:
min_index, max_index = 25373, 25376
dt_start = user1_tdf.iloc[min_index - 1]['datetime']
dt_end = user1_tdf.iloc[max_index + 1]['datetime']

filtered_tdf = user1_f_tdf[(user1_f_tdf['datetime'] >= dt_start) \
                 & (user1_f_tdf['datetime'] <= dt_end)]

unfiltered_tdf = user1_tdf[(user1_tdf['datetime'] >= dt_start) \
                  & (user1_tdf['datetime'] <= dt_end)]
filtered_tdf

Compute the speeds between consecutive points on the unfiltered trajectory

In [None]:
lat_lng_dt = unfiltered_tdf[['lat', 'lng', 'datetime']].values

In [None]:
# average speed (km/h) between the last unfiltered point and each following point
from skmob.utils.gislib import getDistance

lat0, lng0, dt0 = lat_lng_dt[0]
rows = []
for lat, lng, dt in lat_lng_dt[1:]:
    # total_seconds() also counts whole days, unlike the .seconds attribute
    hours = (dt - dt0).total_seconds() / 3600
    speed_kmh = getDistance((lat, lng), (lat0, lng0)) / hours
    rows.append([dt0, dt, speed_kmh, speed_kmh > max_speed_kmh])
pd.DataFrame(rows, columns=['time 0', 'time 1', 'speed (km/h)', 'to_filter'])

In [None]:
# Cut a buffer of 10 points around the filtered part
dt_start = user1_tdf.iloc[min_index - 10]['datetime']
dt_end = user1_tdf.iloc[max_index + 10]['datetime']

filtered_tdf = user1_f_tdf[(user1_f_tdf['datetime'] >= dt_start) \
                 & (user1_f_tdf['datetime'] <= dt_end)]

unfiltered_tdf = user1_tdf[(user1_tdf['datetime'] >= dt_start) \
                  & (user1_tdf['datetime'] <= dt_end)]
filtered_tdf.head()

In [None]:
map_f = unfiltered_tdf.plot_trajectory(zoom=14, weight=10, opacity=0.5, hex_color='black') #, tiles='Stamen Toner')
filtered_tdf.plot_trajectory(map_f=map_f, hex_color='red')

## Compressing trajectories

Reduce the number of points of the trajectory, preserving the structure.

Merge all points that lie within `spatial_radius_km` kilometers of each other; this example uses `spatial_radius_km=0.5`.
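
The following pure-Python sketch shows one simple way such a compression could work. It is an assumption-laden illustration, not skmob's code: the hypothetical `compress` helper collapses every run of consecutive points that stay within the radius of the run's first point into a single point with the median coordinates and the time of the first point.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def collapse(run):
    """Median coordinates of the run, timestamp of its first point."""
    return (median(p[0] for p in run), median(p[1] for p in run), run[0][2])


def compress(points, spatial_radius_km=0.5):
    """Collapse runs of time-sorted (lat, lng, t) points close to the run's first point."""
    compressed, run = [], []
    for point in points:
        # a point farther than the radius from the run's first point starts a new run
        if run and haversine_km(run[0][:2], point[:2]) > spatial_radius_km:
            compressed.append(collapse(run))
            run = []
        run.append(point)
    if run:
        compressed.append(collapse(run))
    return compressed


# made-up points: three around one place, two around another about 2 km away
points = [
    (39.9840, 116.3190, 0),
    (39.9842, 116.3191, 1),
    (39.9841, 116.3193, 2),
    (40.0040, 116.3190, 3),
    (40.0041, 116.3192, 4),
]
compressed = compress(points)
print(len(points), "->", len(compressed), "points")
```

Using the median rather than the mean keeps the representative point robust to a single noisy reading inside the run.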

In [None]:
from skmob.preprocessing import compression

In [None]:
user1_cf_tdf = compression.compress(user1_f_tdf, spatial_radius_km=0.5)
user1_cf_tdf.head()

In [None]:
user1_cf_tdf.parameters

The compressed trajectory has only a small fraction of the points of the filtered trajectory.

In [None]:
print('Points of the filtered trajectory:\t%s'%len(user1_f_tdf))
print('Points of the compressed trajectory:\t%s'%len(user1_cf_tdf))
print('Compressed points:\t\t\t%s'%(len(user1_f_tdf)-len(user1_cf_tdf)))

In [None]:
end_time = user1_f_tdf.iloc[10000]['datetime']
map_f = user1_f_tdf[user1_f_tdf['datetime'] < end_time].plot_trajectory(weight=5, hex_color='black',
                                                                      opacity=0.5, start_end_markers=False)
user1_cf_tdf[user1_cf_tdf['datetime'] < end_time].plot_trajectory(map_f=map_f, \
                                                  start_end_markers=False, hex_color='red')

## Stop detection

Identify locations where the user spent at least `minutes_for_a_stop` minutes within a distance of `spatial_radius_km` $\times$ `stop_radius_factor` kilometers from a given point.

A new column `leaving_datetime` is added, indicating the time when the user departs from the stop.
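
As a rough illustration of this rule, the sketch below detects stops on a plain list of points. It is not skmob's implementation. `detect_stops` is a hypothetical helper, and it assumes that the leaving time is the timestamp of the first point that falls outside the stop radius.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def detect_stops(points, minutes_for_a_stop=20.0, spatial_radius_km=0.2,
                 stop_radius_factor=0.5):
    """Return (lat, lng, arrival, leaving) tuples from time-sorted (lat, lng, datetime) points."""
    radius_km = spatial_radius_km * stop_radius_factor
    stops, i, n = [], 0, len(points)
    while i < n:
        j = i + 1
        # extend the run while points stay within the radius of the run's first point
        while j < n and haversine_km(points[i][:2], points[j][:2]) <= radius_km:
            j += 1
        # assumption: the user leaves at the first point outside the radius
        leaving = points[j][2] if j < n else points[j - 1][2]
        if (leaving - points[i][2]).total_seconds() / 60 >= minutes_for_a_stop:
            run = points[i:j]
            stops.append((median(p[0] for p in run), median(p[1] for p in run),
                          points[i][2], leaving))
            i = j  # continue after the stop
        else:
            i += 1
    return stops


# made-up track: half an hour in one place, then continuous movement
day = (2008, 10, 23)
points = [
    (39.9840, 116.3190, datetime(*day, 8, 0)),
    (39.9841, 116.3190, datetime(*day, 8, 10)),
    (39.9840, 116.3191, datetime(*day, 8, 25)),
    (39.9900, 116.3300, datetime(*day, 8, 30)),
    (40.0000, 116.3500, datetime(*day, 8, 35)),
    (40.0100, 116.3700, datetime(*day, 8, 40)),
]
stops = detect_stops(points)
print(len(stops), "stop(s) detected")
```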

In [None]:
from skmob.preprocessing import detection

In [None]:
user1_scf_tdf = detection.stops(user1_cf_tdf, stop_radius_factor=0.5,
                                minutes_for_a_stop=20.0, spatial_radius_km=0.2,
                                leaving_time=True)
user1_scf_tdf.head()

In [None]:
user1_scf_tdf.parameters

### Visualise the compressed trajectory and the stops

Click on the stop markers to see a pop-up with:
- User ID
- Coordinates of the stop (click to see the location on Google maps)
- Arrival time
- Departure time

In [None]:
map_f = user1_scf_tdf.plot_trajectory(max_points=1000, hex_color=-1, start_end_markers=False)
user1_scf_tdf.plot_stops(map_f=map_f, hex_color=-1)

## Stops define <font color="blue">trips</font>
Let's take the first trip of the individual using the stops

In [None]:
user1_scf_tdf.head()

In [None]:
dt1 = user1_scf_tdf.iloc[0].leaving_datetime
dt2 = user1_scf_tdf.iloc[1].leaving_datetime
dt1, dt2

In [None]:
# select all points between the first two stops
user1_tid1_tdf = user1_tdf[(user1_tdf.datetime >= dt1) 
                           & (user1_tdf.datetime <= dt2)]
user1_tid1_tdf.head()

In [None]:
# plot the trip
user1_tid1_map = user1_tid1_tdf.plot_trajectory(zoom=12, weight=5, opacity=0.9, tiles='Stamen Toner')
user1_tid1_map

Compute the length of the trip and the distance between origin and destination
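
The two quantities differ whenever the trip is not a straight line. The small sketch below illustrates this on made-up coordinates, assuming that trip length means the sum of the great-circle distances between consecutive points. `haversine_km` and `path_length_km` are hypothetical helpers, not skmob functions.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def path_length_km(points):
    """Sum of the distances between consecutive (lat, lng) points."""
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))


# an L-shaped trip: first east, then north
trip = [(39.98, 116.30), (39.98, 116.32), (40.00, 116.32)]
length = path_length_km(trip)
od_distance = haversine_km(trip[0], trip[-1])
print("trip length (km):", round(length, 2))
print("origin-destination distance (km):", round(od_distance, 2))
```

The trip length is never smaller than the origin-destination distance, and the ratio between the two gives a rough measure of how direct the trip was.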

In [None]:
from skmob.utils.gislib import getDistanceByHaversine
from skmob.measures.individual import distance_straight_line
# take origin and destination of the trip
start_loc = user1_tid1_tdf.iloc[0][['lat', 'lng']]
end_loc = user1_tid1_tdf.iloc[-1][['lat', 'lng']]
# compute distance between origin and destination
print("distance:", getDistanceByHaversine(end_loc, start_loc))

In [None]:
distance_straight_line(user1_tid1_tdf)

## Compute some features based on trips

In [None]:
def number_of_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2):
    """
    Compute the number of trips for each object.
    """
    # detect the stops for each individual
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                             minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: len(user_stdf)).reset_index().rename(columns={0: 'n_trips'})

In [None]:
number_of_trips(tdf)

In [None]:
def number_of_evening_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, 
                                   spatial_radius_km=0.2):
    """
    Number of stops that each user leaves in the evening, i.e. trips that start in the evening.
    """
    def get_evening_trips(user_stdf, evening_time=('16:00', '20:00')):
        start_evening, end_evening = evening_time
        return len(user_stdf.set_index('leaving_datetime').between_time(start_evening,
                                                                        end_evening))
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                             minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                             leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: get_evening_trips(user_stdf)).reset_index().rename(columns={0: 'evening_trips'})

In [None]:
number_of_evening_trips(tdf)

In [None]:
def average_stops_per_day(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0,
                          spatial_radius_km=0.2):
    """
    Average number of stops per day, over the days with at least one stop.
    """
    def get_stops_per_day(user_stdf):
        # count the stops on each calendar day, then average the daily counts
        return user_stdf.groupby(user_stdf.leaving_datetime.dt.floor('d')).size().mean()

    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor,
                           minutes_for_a_stop=minutes_for_a_stop,
                           spatial_radius_km=spatial_radius_km,
                           leaving_time=True)
    return stdf.groupby('uid').apply(get_stops_per_day).reset_index().rename(columns={0: 'avg_stops_per_day'})

In [None]:
average_stops_per_day(tdf)

## Find clusters of stops
- stops are clustered by spatial proximity using DBSCAN

- a new column `cluster` is added with cluster ID (`int`)

- 0 is the most visited, 1 the second most visited,  etc.
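
To illustrate the frequency-based numbering, the sketch below relabels arbitrary cluster labels (such as those a DBSCAN run might return) so that 0 is the most frequent cluster. It only illustrates the ranking idea; `relabel_by_frequency` is a hypothetical helper and makes no claim about how skmob computes its labels.

```python
from collections import Counter


def relabel_by_frequency(labels):
    """Map each label to its rank by number of occurrences (0 = most frequent)."""
    counts = Counter(labels)
    # most_common() sorts by count, descending; ties keep first-seen order
    rank = {old: new for new, (old, _) in enumerate(counts.most_common())}
    return [rank[label] for label in labels]


# made-up raw labels, one per stop
raw_labels = [2, 0, 2, 1, 2, 0]
print(relabel_by_frequency(raw_labels))  # [0, 1, 0, 2, 0, 1]
```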

In [None]:
from skmob.preprocessing import clustering
user1_clscf_tdf = clustering.cluster(user1_scf_tdf)
user1_clscf_tdf.head()

In [None]:
user1_clscf_tdf.parameters

## Visualise clustered stops
- stops in the same cluster have the same color.

In [None]:
map_f = user1_clscf_tdf.plot_trajectory(start_end_markers=False, hex_color='black')
user1_clscf_tdf.plot_stops(map_f=map_f)

## Social Media: the <font color="blue">Brightkite</font> data set
[Brightkite](https://snap.stanford.edu/data/loc-brightkite.html) was a location-based social networking service where users shared their locations by checking in. The data set covers the period Apr 2008 - Oct 2010:
- 58,228 users
- 4,491,143 check-ins

In [None]:
# load the pandas DataFrame
url = "https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz"
# the file has no header row, so header=None keeps the first check-in as data
df = pd.read_csv(url, sep='\t', header=None, nrows=10000, names=['user', 'check-in_time', 'latitude', 'longitude', 'location id'])

# convert it to a TrajDataFrame
btdf = skmob.TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='check-in_time', user_id='user')

print(btdf.shape, len(btdf['uid'].unique()))
btdf.head()

In [None]:
btdf['leaving_datetime'] = btdf.datetime
# take the points of a single user
user0_btdf = btdf[btdf.uid == btdf.uid.unique()[0]]
# take a sample of 200 random points
user0_btdf_sample = user0_btdf.sample(200)
# plot the stops of the user
user0_map = user0_btdf_sample.plot_stops(zoom=3)
# plot the trajectory of the user
user0_btdf_sample.plot_trajectory(map_f=user0_map)

## Filtering
Filter out points whose speed from the previous point is higher than `max_speed_kmh` km/h.

In [None]:
f_btdf = filtering.filter(btdf.drop('leaving_datetime', axis=1), max_speed_kmh=500.)
f_btdf.head(3)

In [None]:
print('Points of the raw trajectory: %s.'%len(btdf))
print('Points of the filtered trajectory: %s.'%len(f_btdf))

## Compression
Reduce the number of points of the trajectory, preserving its structure.

In [None]:
cf_btdf = compression.compress(f_btdf, spatial_radius_km=0.5)
cf_btdf.head()

In [None]:
print('Points of the filtered trajectory: %s.'%len(f_btdf))
print('Points of the compressed trajectory: %s.'%len(cf_btdf))

### Visualise the filtered and compressed trajectories
Show the filtered and compressed points of a single user.

In [None]:
user = 1
dt_start = f_btdf[f_btdf['uid'] == user]['datetime'].min()
dt_end = f_btdf[f_btdf['uid'] == user]['datetime'].max()

filtered_tdf = f_btdf[(f_btdf['datetime'] >= dt_start) \
                 & (f_btdf['datetime'] <= dt_end) \
                 & (f_btdf['uid'] == user)]

compressed_tdf = cf_btdf[(cf_btdf['datetime'] >= dt_start) \
                  & (cf_btdf['datetime'] <= dt_end) \
                  & (cf_btdf['uid'] == user)]

In [None]:
print(len(filtered_tdf), len(compressed_tdf))
filtered_tdf.head()

In [None]:
map_f = filtered_tdf.plot_trajectory(zoom=9, max_points=None, weight=5, hex_color='black', opacity=0.5, start_end_markers=False)
compressed_tdf.plot_trajectory(map_f=map_f, max_points=None, hex_color='red', start_end_markers=False)

In [None]:
from skmob.tessellation import tilers
from skmob.utils import plot
sm_tess = tilers.tiler.get('squared', base_shape='San Mateo, USA', meters=5000)

In [None]:
map_filtered_tdf = filtered_tdf.mapping(sm_tess, remove_na=True)
map_compressed_tdf = compressed_tdf.mapping(sm_tess, remove_na=True)
map_compressed_tdf.head()

In [None]:
map_f = plot.plot_gdf(sm_tess, zoom=9, style_func_args={'color':'gray', 'fillColor':'gray', 'opacity':0.2})
map_f = map_filtered_tdf.plot_trajectory(map_f=map_f, max_points=None, weight=5, hex_color='black', opacity=0.5)
map_compressed_tdf.plot_trajectory(map_f=map_f, max_points=None, hex_color='red')

## Split trajectory by day
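
A day does not have to start at midnight: with an offset of 3 hours, a point recorded at 01:30 can still be assigned to the previous day. The pure-Python sketch below illustrates this idea under that assumption. `split_by_day` is a hypothetical helper, and the exact offset semantics of `group_df_by_time` may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_by_day(points, offset_hours=3):
    """Group (lat, lng, datetime) points into days that start at `offset_hours` o'clock."""
    groups = defaultdict(list)
    for point in points:
        # shifting the timestamp back moves early-morning points to the previous day
        day = (point[2] - timedelta(hours=offset_hours)).date()
        groups[day].append(point)
    return [groups[day] for day in sorted(groups)]


# made-up points: an evening, a late-night and a morning observation
points = [
    (37.56, -122.32, datetime(2008, 10, 23, 22, 0)),
    (37.57, -122.33, datetime(2008, 10, 24, 1, 30)),
    (37.58, -122.31, datetime(2008, 10, 24, 9, 0)),
]
groups = split_by_day(points)
print([len(group) for group in groups])  # [2, 1]
```

Shifting the day boundary to the early morning avoids splitting late-night activity across two groups.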

In [None]:
from skmob.utils import utils
groups = utils.group_df_by_time(map_compressed_tdf, 
                        offset_value=3, offset_unit='hours', add_starting_location=True)

In [None]:
map_f = groups[0].plot_trajectory(start_end_markers=False, hex_color='red', weight=3)
map_f = groups[1].plot_trajectory(map_f=map_f, start_end_markers=False, hex_color='blue', weight=3)
map_f = groups[5].plot_trajectory(map_f=map_f, start_end_markers=False, hex_color='green', weight=3)
map_f