<center><img src="logo_skmob.png" width=450 align="left" /></center>

# Real-World Mobility data

- Repo: [http://bit.ly/skmob_repo](http://bit.ly/skmob_repo)
- Docs: [http://bit.ly/skmob_doc](http://bit.ly/skmob_doc)
- Paper: [http://bit.ly/skmob_paper](http://bit.ly/skmob_paper)



## Social Media: the Brightkite data set
[Brightkite](https://snap.stanford.edu/data/loc-brightkite.html) was a location-based social networking service provider where users shared their locations by checking-in in the period Apr 2008 - Oct 2010: 
- 58,228 users
- 4,491,143 checkins

In [12]:
import pandas as pd
url = "https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz"
df = pd.read_csv(url, sep='\t', header=0, nrows=100000, names=['user', 'check-in_time', 'latitude', 'longitude', 'location id'])
bdf = skmob.TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='check-in_time', user_id='user')
bdf.head()

Unnamed: 0,uid,datetime,lat,lng,location id
0,0,2010-10-16 06:02:04+00:00,39.891383,-105.070814,7a0f88982aa015062b95e3b4843f9ca2
1,0,2010-10-16 03:48:54+00:00,39.891077,-105.068532,dd7cd3d264c2d063832db506fba8bf79
2,0,2010-10-14 18:25:51+00:00,39.750469,-104.999073,9848afcc62e500a01cf6fbf24b797732f8963683
3,0,2010-10-14 00:21:47+00:00,39.752713,-104.996337,2ef143e12038c870038df53e0478cefc
4,0,2010-10-13 23:31:51+00:00,39.752508,-104.996637,424eb3dd143292f9e013efa00486c907


In [13]:
bdf['leaving_datetime'] = bdf.datetime
user0_bdf = bdf[bdf.uid == bdf.uid.unique()[0]]
user0_bdf_sample = user0_bdf.sample(200)
user0_map = user0_bdf_sample.plot_stops(zoom=3)
user0_bdf_sample.plot_trajectory(map_f=user0_map)

### GPS: the GeoLife dataset

collected in (Microsoft Research Asia) **[GeoLife](https://www.microsoft.com/en-us/download/details.aspx?id=52367)** project by 182 users in the period Apr 2007 - Aug 2012.

- 17,621 trajectories
- total distance of about 1.2 million kilometers 
- total duration of 48,000+ hours.

In [1]:
# Import the library
import skmob

In [2]:
tdf = skmob.TrajDataFrame.from_file('data/geolife_sample.txt.gz').sort_values(by='datetime')
print(type(tdf))
print(tdf.crs)
print(tdf.parameters)

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
{'init': 'epsg:4326'}
{'from_file': 'data/geolife_sample.txt.gz'}


In [3]:
# Let's explore the TrajDataFrame as we would do with pandas
tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [4]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, tiles='Stamen Toner')

- How many users in the data set?
- How many points?
- What's the time window?

In [5]:
print('# users: %s' %len(tdf.uid.unique()))
print('# points: %s' %len(tdf))
print('time window: %s' 
      %(tdf.iloc[-1].datetime - tdf.iloc[0].datetime))

# users: 2
# points: 217653
time window: 146 days 23:53:32


## Let's focus on a single user
using the *select* operation as we do in **pandas**

In [99]:
user1_tdf = tdf[tdf.uid == 1]
user1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [100]:
user1_map = user1_tdf.plot_trajectory(zoom=11, weight=3, tiles='Open Street Map')
user1_map

## Mobility data preprocessing

There 3 common steps we can apply to clean our data:

- Filtering
- Compression
- Stop detection


## Filtering

Filter out points with speed higher than `max_speed` km/h from the previous point.

In [101]:
from skmob.preprocessing import filtering

In [102]:
# Let's filter points with speed higher than 500km/h
user1_ftdf = filtering.filter(user1_tdf, max_speed_kmh=500.)

In [103]:
user1_ftdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25}}

Very few points have been filtered.

In [104]:
print('Points of the raw trajectory:\t\t%s'%len(user1_tdf))
print('Points of the filtered trajectory:\t%s'%len(user1_ftdf))

Points of the raw trajectory:		108607
Points of the filtered trajectory:	108589


## Compression

Reduce the number of points of the trajectory, preserving the structure.

Merge together all points that are closer than `spatial_radius_km=0.2` kilometers from each other.

In [105]:
from skmob.preprocessing import compression

In [106]:
user1_ctdf = compression.compress(user1_ftdf, spatial_radius_km=0.1)
user1_ctdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984578,116.319749,2008-10-23 05:53:05,1
1,39.984533,116.320287,2008-10-23 05:54:03,1
2,39.984235,116.320923,2008-10-23 05:54:38,1
3,39.982974,116.321144,2008-10-23 05:55:54,1
4,39.982069,116.321219,2008-10-23 05:56:22,1


In [107]:
user1_ctdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.1}}

The compressed trajectory has only a small fraction of the points of the filtered trajectory.

In [108]:
print('Points of the filtered trajectory: %s'%len(user1_ftdf))
print('Points of the compressed trajectory: %s'%len(user1_ctdf))

Points of the filtered trajectory: 108589
Points of the compressed trajectory: 7098


## Stop detection

Identify locations where the user spent at least `minutes_for_a_stop` minutes within a distance `spatial_radius_km` $\times$ `stop_radius_factor`, from a given point. 

A new column `leaving_datetime` is added, indicating the time when the user departs from the stop.

In [109]:
from skmob.preprocessing import detection

In [110]:
user1_stdf = detection.stops(user1_ctdf, stop_radius_factor=0.5, \
            minutes_for_a_stop=20.0, spatial_radius_km=0.2, 
                       leaving_time=True)
stdf[:4]

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.97803,116.327481,2008-10-23 06:01:37,1,2008-10-23 10:32:53
1,40.013987,116.306378,2008-10-23 11:09:46,1,2008-10-23 23:45:39
2,39.978419,116.32687,2008-10-24 00:21:52,1,2008-10-24 01:47:30
3,39.981112,116.308757,2008-10-24 02:00:22,1,2008-10-24 02:30:29


In [111]:
user1_stdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.1},
 'detect': {'function': 'stops',
  'stop_radius_factor': 0.5,
  'minutes_for_a_stop': 20.0,
  'spatial_radius_km': 0.2,
  'leaving_time': True,
  'no_data_for_minutes': 1000000000000.0,
  'min_speed_kmh': None}}

#### Visualise the compressed trajectory and the stops

Click on the stop markers to see a pop up with: 
- User ID
- Coordinates of the stop (click to see the location on Google maps)
- Arrival time
- Departure time

In [112]:
map_f = user1_stdf.plot_trajectory(max_points=1000, hex_color=-1, start_end_markers=False)
user1_stdf.plot_stops(map_f=map_f, hex_color=-1)

In [10]:
from skmob.preprocessing import detection
user1_stdf = detection.stops(user1_tdf, stop_radius_factor=0.5, 
                             minutes_for_a_stop=20.0, spatial_radius_km=0.2, 
                             leaving_time=True)
user1_stdf.head(4)

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.97803,116.327481,2008-10-23 06:01:37,1,2008-10-23 10:32:53
1,40.01382,116.306532,2008-10-23 11:10:19,1,2008-10-23 23:45:27
2,39.978419,116.32687,2008-10-24 00:21:52,1,2008-10-24 01:47:30
3,39.981166,116.308475,2008-10-24 02:02:31,1,2008-10-24 02:30:29


In [11]:
user1_stdf.plot_stops(map_f=user1_map, hex_color=-1)

## Stops define sub-trajectories
Let's take the first sub-trajectories using the stops

In [113]:
user1_stdf.head(4)

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.97803,116.327481,2008-10-23 06:01:37,1,2008-10-23 10:32:53
1,40.01382,116.306532,2008-10-23 11:10:19,1,2008-10-23 23:45:27
2,39.978419,116.32687,2008-10-24 00:21:52,1,2008-10-24 01:47:30
3,39.981133,116.308758,2008-10-24 02:00:22,1,2008-10-24 02:30:29


In [114]:
dt1 = user1_stdf.iloc[0].leaving_datetime
dt2 = user1_stdf.iloc[1].leaving_datetime
dt1, dt2

(Timestamp('2008-10-23 10:32:53'), Timestamp('2008-10-23 23:45:27'))

In [117]:
user1_tid1_tdf = user1_tdf[(user1_tdf.datetime >= dt1) 
                           & (user1_tdf.datetime <= dt2)]
user1_tid1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
148,39.970511,116.341455,2008-10-23 10:32:53,1
149,39.977648,116.326925,2008-10-23 10:33:00,1
150,39.977586,116.326918,2008-10-23 10:33:05,1
151,39.977596,116.326894,2008-10-23 10:33:10,1
152,39.977661,116.326947,2008-10-23 10:33:14,1


In [118]:
user1_tid1_map = user1_tid1_tdf.plot_trajectory(zoom=13, weight=5, opacity=0.9, tiles='Stamen Toner', )
user1_tid1_map

In [15]:
from skmob.utils.gislib import getDistanceByHaversine
start_loc = user1_tid1_tdf.iloc[0][['lat', 'lng']]
end_loc = user1_tid1_tdf.iloc[-1][['lat', 'lng']]
distance = getDistanceByHaversine(end_loc, start_loc)
distance

4.369295922582342

## Compute some features based on sub-trajectories

In [89]:
def number_of_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2):
    """
    Compute the number of trips for each object.
    """
    # detect the stops for each individual
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                             minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                             leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: len(user_stdf)).reset_index().rename(columns={0: 'n_trips'})

In [90]:
number_of_trips(tdf)

Unnamed: 0,uid,n_trips
0,1,145
1,5,246


In [91]:
def number_of_evening_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, 
                                   spatial_radius_km=0.2):
    """
    Number of subtrajectories that end in the evening.
    """
    def get_evening_trips(user_stdf, evening_time=['16:00', '20:00']):
        start_evening, end_evening = evening_time
        return len(user_stdf.set_index('leaving_datetime').between_time(start_evening, 
                                                                end_evening))
    
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                             minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                             leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: get_evening_trips(user_stdf)).reset_index().rename(columns={0: 'evening_trips'})

In [92]:
number_of_evening_trajectories(tdf)

In [93]:
def average_stops_per_day(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, 
                                   spatial_radius_km=0.2):
    """
    Average number of stops per day
    """
    def get_stops_per_day(user_stdf):
        return user_stdf.groupby(user_stdf.leaving_datetime.dt.floor('d')).size().reset_index(name='count').mean()

    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                             minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                             leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: get_stops_per_day(user_stdf)).reset_index().rename(columns={'count': 'avg_stops_per_day'})             

In [94]:
average_stops_per_day(tdf)

Unnamed: 0,uid,avg_stops_per_day
0,1,3.372093
1,5,4.032787


In [None]:
### MEASURES

Individual: radius of gyration, jump lengths, maximum distance, individual mobility network
Collective: visits per time unit, OD matrix

In [None]:
Social Media: load the dataset, explore with some plots
GPS data: load the dataset, explore with some plots, show 3 steps for cleaning