<center><img src="logo_skmob.png" width=450 align="left" /></center>

# Real-World Mobility data

- Repo: [http://bit.ly/skmob_repo](http://bit.ly/skmob_repo)
- Docs: [http://bit.ly/skmob_doc](http://bit.ly/skmob_doc)
- Paper: [http://bit.ly/skmob_paper](http://bit.ly/skmob_paper)



## Social Media: the <font color="blue">Brightkite</font> data set
[Brightkite](https://snap.stanford.edu/data/loc-brightkite.html) was a location-based social networking service whose users shared their locations by checking in. The data set covers the period Apr 2008 - Oct 2010 and contains:
- 58,228 users
- 4,491,143 check-ins

In [2]:
# import the skmob and pandas libraries
import skmob
import pandas as pd

# load the first 100,000 check-ins into a pandas DataFrame (the file has no header row)
url = "https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz"
df = pd.read_csv(url, sep='\t', header=None, nrows=100000,
                 names=['user', 'check-in_time', 'latitude', 'longitude', 'location id'])

# convert it to a TrajDataFrame
bdf = skmob.TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='check-in_time', user_id='user')
bdf.head()

In [None]:
bdf['leaving_datetime'] = bdf.datetime
# take the points of a single user
user0_bdf = bdf[bdf.uid == bdf.uid.unique()[0]]
# take a sample of 200 random points
user0_bdf_sample = user0_bdf.sample(200)
# plot the stops of the user
user0_map = user0_bdf_sample.plot_stops(zoom=3)
# plot the trajectory of the user
user0_bdf_sample.plot_trajectory(map_f=user0_map)

## GPS: the <font color="blue">GeoLife</font> data set

GPS trajectories collected by 182 users in the Microsoft Research Asia **[GeoLife](https://www.microsoft.com/en-us/download/details.aspx?id=52367)** project in the period Apr 2007 - Aug 2012:

- 17,621 trajectories
- total distance of about 1.2 million kilometers
- total duration of 48,000+ hours

In [None]:
tdf = skmob.TrajDataFrame.from_file('data/geolife_sample.txt.gz').sort_values(by='datetime')
print(type(tdf))
print(tdf.crs)
print(tdf.parameters)
tdf.head()

In [None]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, tiles='Stamen Toner')

- How many users are in the data set?
- How many points does it contain?
- What time window does it cover?

In [None]:
print('# users: %s' %len(tdf.uid.unique()))
print('# points: %s' %len(tdf))
print('time window: %s' 
      %(tdf.iloc[-1].datetime - tdf.iloc[0].datetime))

## Let's focus on a single user
We select the user's points with a boolean mask, as we would in **pandas**.

In [None]:
user1_tdf = tdf[tdf.uid == 1]
user1_tdf.head()

In [None]:
user1_map = user1_tdf.plot_trajectory(zoom=11, weight=3, tiles='Open Street Map')
user1_map

## Mobility data preprocessing

There are 3 common steps we can apply to clean our data:

- Filtering
- Compression
- Stop detection


## Filtering trajectories

Filter out every point whose speed from the previous point is higher than `max_speed_kmh` km/h.
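
To make the idea concrete, here is a minimal pure-Python sketch of a speed filter. It is not scikit-mobility's implementation: the names `haversine_km` and `speed_filter`, the toy points, and the choice to measure speed from the last *kept* point are assumptions made for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = math.radians(p[0]), math.radians(p[1])
    lat2, lng2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def speed_filter(points, max_speed_kmh=500.0):
    """Keep a point only if the speed from the last kept point is <= max_speed_kmh.

    points: list of (lat, lng, datetime) tuples sorted by time.
    """
    kept = points[:1]
    for lat, lng, t in points[1:]:
        prev_lat, prev_lng, prev_t = kept[-1]
        hours = (t - prev_t).total_seconds() / 3600.0
        if hours <= 0:
            continue  # assumption: drop duplicated or out-of-order timestamps
        speed_kmh = haversine_km((prev_lat, prev_lng), (lat, lng)) / hours
        if speed_kmh <= max_speed_kmh:
            kept.append((lat, lng, t))
    return kept

# hypothetical toy trajectory: the third point is a GPS glitch far from the others
toy = [
    (39.9840, 116.3180, datetime(2008, 10, 23, 5, 53, 0)),
    (39.9842, 116.3182, datetime(2008, 10, 23, 5, 53, 30)),
    (41.0000, 118.0000, datetime(2008, 10, 23, 5, 54, 0)),
    (39.9845, 116.3185, datetime(2008, 10, 23, 5, 54, 30)),
]
filtered = speed_filter(toy)
print(len(toy), len(filtered))  # 4 3
```

The glitch point would require a speed of thousands of km/h, so it is dropped, while the slow walking-pace points around it are kept.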

In [None]:
from skmob.preprocessing import filtering

In [None]:
# filter points with speed higher than 500km/h
user1_ftdf = filtering.filter(user1_tdf, max_speed_kmh=500.)

In [None]:
user1_ftdf.parameters

Very few points have been filtered out.

In [None]:
print('Points of the raw trajectory:\t\t%s'%len(user1_tdf))
print('Points of the filtered trajectory:\t%s'%len(user1_ftdf))
print('Filtered points:\t\t\t%s'%(len(user1_tdf)-len(user1_ftdf)))

## Compressing trajectories

Reduce the number of points in the trajectory while preserving its overall structure.

Merge all points that are closer than `spatial_radius_km` kilometers to each other (0.1 km in the cell below).
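
The following pure-Python sketch illustrates one way such a compression could work. It is an illustration, not the library's code: consecutive points within the radius of the first point of a group are merged into one point carrying the median coordinates and the time of the first point. The function names and the toy data are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = math.radians(p[0]), math.radians(p[1])
    lat2, lng2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def merge_group(group):
    """Collapse a group into (median lat, median lng, time of the first point)."""
    return (median(p[0] for p in group), median(p[1] for p in group), group[0][2])

def compress(points, spatial_radius_km=0.2):
    """points: list of (lat, lng, t) tuples sorted by time."""
    compressed, group = [], []
    for point in points:
        # start a new group when the point leaves the radius around the group's first point
        if group and haversine_km(group[0][:2], point[:2]) > spatial_radius_km:
            compressed.append(merge_group(group))
            group = []
        group.append(point)
    if group:
        compressed.append(merge_group(group))
    return compressed

# hypothetical toy data: times are seconds since the start of the trajectory
toy = [
    (39.9840, 116.3180, 0),
    (39.9841, 116.3181, 60),
    (39.9842, 116.3182, 120),
    (39.9900, 116.3300, 600),
    (39.9901, 116.3301, 660),
]
compressed = compress(toy, spatial_radius_km=0.1)
print(len(toy), len(compressed))  # 5 2
```

A larger `spatial_radius_km` merges more points into each group, so the compressed trajectory becomes shorter but coarser.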

In [None]:
from skmob.preprocessing import compression

In [None]:
user1_ctdf = compression.compress(user1_ftdf, spatial_radius_km=0.1)
user1_ctdf.head()

In [None]:
user1_ctdf.parameters

The compressed trajectory has only a small fraction of the points of the filtered trajectory.

In [None]:
print('Points of the filtered trajectory:\t%s'%len(user1_ftdf))
print('Points of the compressed trajectory:\t%s'%len(user1_ctdf))
print('Compressed points:\t\t\t%s'%(len(user1_ftdf)-len(user1_ctdf)))

## Stop detection

Identify locations where the user spent at least `minutes_for_a_stop` minutes within a distance of `spatial_radius_km` $\times$ `stop_radius_factor` kilometers from a given point.

A new column `leaving_datetime` is added, indicating the time when the user departs from the stop.
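
As a rough illustration of this rule, the sketch below detects stops in plain Python. It is a simplified reading of the description above, not the library's algorithm: the function name `detect_stops`, the use of median coordinates for the stop, and the toy trajectory are assumptions.

```python
import math
from datetime import datetime, timedelta
from statistics import median

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = math.radians(p[0]), math.radians(p[1])
    lat2, lng2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def detect_stops(points, minutes_for_a_stop=20.0, spatial_radius_km=0.2,
                 stop_radius_factor=0.5):
    """points: list of (lat, lng, datetime) sorted by time.

    Returns a list of (lat, lng, arrival_datetime, leaving_datetime) tuples.
    """
    radius_km = spatial_radius_km * stop_radius_factor
    min_stay = timedelta(minutes=minutes_for_a_stop)
    stops, i = [], 0
    while i < len(points):
        j = i + 1
        # grow the window while the points stay within radius_km of point i
        while j < len(points) and haversine_km(points[i][:2], points[j][:2]) <= radius_km:
            j += 1
        # the user leaves at the first point outside the radius (or at the last record)
        leaving = points[j][2] if j < len(points) else points[-1][2]
        if leaving - points[i][2] >= min_stay:
            window = points[i:j]
            stops.append((median(p[0] for p in window), median(p[1] for p in window),
                          points[i][2], leaving))
            i = j
        else:
            i += 1
    return stops

# hypothetical toy trajectory: 24 minutes near one place, then the user moves away
t0 = datetime(2008, 10, 23, 8, 0)
toy = [
    (39.9840, 116.3180, t0),
    (39.9841, 116.3180, t0 + timedelta(minutes=6)),
    (39.9840, 116.3181, t0 + timedelta(minutes=12)),
    (39.9841, 116.3181, t0 + timedelta(minutes=18)),
    (39.9840, 116.3180, t0 + timedelta(minutes=24)),
    (39.9900, 116.3300, t0 + timedelta(minutes=30)),
    (40.0000, 116.3500, t0 + timedelta(minutes=34)),
    (40.0100, 116.3700, t0 + timedelta(minutes=38)),
]
stops = detect_stops(toy)
print(len(stops))  # 1
```

In this sketch the leaving time is the timestamp of the first point recorded outside the stop radius, which is why the single detected stop lasts 30 minutes rather than 24.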

In [None]:
from skmob.preprocessing import detection

In [None]:
user1_stdf = detection.stops(user1_ctdf, stop_radius_factor=0.5, 
                             minutes_for_a_stop=20.0, spatial_radius_km=0.2, 
                             leaving_time=True)
user1_stdf.head()

In [None]:
user1_stdf.parameters

#### Visualise the compressed trajectory and the stops

Click on the stop markers to see a pop up with: 
- User ID
- Coordinates of the stop (click to see the location on Google Maps)
- Arrival time
- Departure time

In [None]:
map_f = user1_stdf.plot_trajectory(max_points=1000, hex_color=-1, start_end_markers=False)
user1_stdf.plot_stops(map_f=map_f, hex_color=-1)

In [None]:
# detect the stops directly on the raw (uncompressed) trajectory of the user
user1_stdf = detection.stops(user1_tdf, stop_radius_factor=0.5, 
                             minutes_for_a_stop=20.0, spatial_radius_km=0.2, 
                             leaving_time=True)
user1_stdf.head(4)

In [None]:
user1_stdf.plot_stops(map_f=user1_map, hex_color=-1)

## Stops define <font color="blue">trips</font>
Let's use the stops to extract the individual's first trip.
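
The same idea extends to all trips of a user. The sketch below is a minimal illustration under stated assumptions: the helper `split_into_trips` is hypothetical, and times are plain integers (minutes) rather than datetimes. It uses binary search to find the slice of points that lies between the leaving times of each pair of consecutive stops.

```python
from bisect import bisect_left, bisect_right

def split_into_trips(times, leaving_times):
    """Return one (start, end) index slice per pair of consecutive stops.

    times: sorted timestamps of the trajectory points.
    leaving_times: sorted leaving times of the detected stops.
    """
    trips = []
    for dt1, dt2 in zip(leaving_times, leaving_times[1:]):
        start = bisect_left(times, dt1)   # first point with time >= dt1
        end = bisect_right(times, dt2)    # one past the last point with time <= dt2
        trips.append((start, end))
    return trips

# hypothetical toy data, in minutes since the start of the trajectory
times = [0, 5, 10, 15, 20, 25, 30, 35, 40]
leaving = [10, 25, 40]
trips = split_into_trips(times, leaving)
print(trips)  # [(2, 6), (5, 9)]
```

Because each slice runs up to the leaving time of the second stop, it also contains the points recorded while the user stayed at that stop; using the stop's arrival time as the upper bound would exclude them.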

In [None]:
user1_stdf.head(4)

In [None]:
dt1 = user1_stdf.iloc[0].leaving_datetime
dt2 = user1_stdf.iloc[1].leaving_datetime
dt1, dt2

In [None]:
# select all points between the first two stops
user1_tid1_tdf = user1_tdf[(user1_tdf.datetime >= dt1) 
                           & (user1_tdf.datetime <= dt2)]
user1_tid1_tdf.head()

In [None]:
# plot the trip
user1_tid1_map = user1_tid1_tdf.plot_trajectory(zoom=13, weight=5, opacity=0.9, tiles='Stamen Toner')
user1_tid1_map

Compute the distance between origin and destination (`getDistanceByHaversine`) and the length of the trip (`distance_straight_line`).
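
The difference between the two quantities can be sketched without the library. The haversine helper and the toy coordinates below are illustrative assumptions: the trip length sums the distances between consecutive points, while the origin-destination distance only looks at the first and last point.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = math.radians(p[0]), math.radians(p[1])
    lat2, lng2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def trip_length_km(coords):
    """Sum of the distances between consecutive (lat, lng) points."""
    return sum(haversine_km(a, b) for a, b in zip(coords, coords[1:]))

# hypothetical L-shaped trip: first north, then east
coords = [(39.9840, 116.3180), (39.9900, 116.3180), (39.9900, 116.3300)]
length = trip_length_km(coords)
displacement = haversine_km(coords[0], coords[-1])
print('trip length (km):', round(length, 3))
print('origin-destination distance (km):', round(displacement, 3))
```

The trip length is never smaller than the origin-destination distance, and the gap between the two grows with how much the path winds.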

In [None]:
from skmob.utils.gislib import getDistanceByHaversine
from skmob.measures.individual import distance_straight_line
# take origin and destination of the trip
start_loc = user1_tid1_tdf.iloc[0][['lat', 'lng']]
end_loc = user1_tid1_tdf.iloc[-1][['lat', 'lng']]
# compute distance between origin and destination
print("distance:", getDistanceByHaversine(end_loc, start_loc))

In [None]:
distance_straight_line(user1_tid1_tdf)

## Compute some features based on trips

In [None]:
def number_of_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2):
    """
    Count the trips of each user, approximated by the number of detected stops.
    """
    # detect the stops for each individual
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                           minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, leaving_time=True)
    # one row per user with the number of stops
    return stdf.groupby('uid').size().reset_index(name='n_trips')

In [None]:
number_of_trips(tdf)

In [None]:
def number_of_evening_trips(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, 
                            spatial_radius_km=0.2):
    """
    Number of stops each user leaves in the evening (16:00-20:00 by default),
    i.e. the number of trips that start in the evening.
    """
    def get_evening_trips(user_stdf, evening_time=('16:00', '20:00')):
        start_evening, end_evening = evening_time
        # keep only the stops whose leaving time falls in the evening window
        return len(user_stdf.set_index('leaving_datetime').between_time(start_evening, 
                                                                        end_evening))
    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                           minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                           leaving_time=True)
    return stdf.groupby('uid').apply(lambda user_stdf: get_evening_trips(user_stdf)).reset_index().rename(columns={0: 'evening_trips'})

In [None]:
number_of_evening_trips(tdf)

In [None]:
def average_stops_per_day(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, 
                          spatial_radius_km=0.2):
    """
    Average number of stops per day for each user
    (days without any stop are not counted).
    """
    def get_stops_per_day(user_stdf):
        # count the stops of each day, then average the daily counts
        return user_stdf.groupby(user_stdf.leaving_datetime.dt.floor('d')).size().mean()

    stdf = detection.stops(tdf, stop_radius_factor=stop_radius_factor, 
                           minutes_for_a_stop=minutes_for_a_stop, 
                           spatial_radius_km=spatial_radius_km, 
                           leaving_time=True)
    return stdf.groupby('uid').apply(get_stops_per_day).reset_index(name='avg_stops_per_day')

In [None]:
average_stops_per_day(tdf)