# Trajectory Data Cleaning <a id="top"></a>

Remove datapoints that are physically implausible, i.e. points that could only have been reached by travelling faster than a maximum speed.

Let `speed_thres` be the maximum physically plausible speed, in metres per second, e.g. walking speed: 1.5 m/s; driving speed: 80 km/h ≈ 22.2 m/s.
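
As a small illustration of the unit conversion above, the sketch below derives the driving threshold from a speed in km/h. The helper name `kmh_to_ms` and the two constant names are hypothetical and not part of the original notebook.

```python
def kmh_to_ms(speed_kmh):
    """Convert a speed from kilometres per hour to metres per second."""
    return speed_kmh * 1000 / 3600

# Example thresholds taken from the text above (names are illustrative only)
WALKING_THRES = 1.5            # m/s
DRIVING_THRES = kmh_to_ms(80)  # about 22.2 m/s

print(round(DRIVING_THRES, 1))  # prints 22.2
```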

- [Testcase 1](#cleaning-testcase-1): Stationary all the way
- [Testcase 2](#cleaning-testcase-2): Only the start point is outlying
- [Testcase 3](#cleaning-testcase-3): Only the end point is outlying
- [Testcase 4](#cleaning-testcase-4): Only the start and end points are outlying
- [Testcase 5](#cleaning-testcase-5)
- [Testcase 6](#cleaning-testcase-6)

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt
import folium
from folium import plugins

In [11]:
pd.options.display.max_rows = 10

# arbitrary start time
STARTDATE = datetime(2020, 1, 1)

# typical maximum driving speed, in metres per second (80 km/h)
CAR_SPEED = 22.2

In [3]:
def plot(traj, m=None, zoom_start=10):
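    '''
    Plot a trajectory on a Folium map: the first point in green, the last point in red and all other points in black.

    :param traj: a dataframe with variables `time`, `lat`, `lon` and a default integer index (0, 1, 2, ...)
    :param m: an existing Folium map to draw on; if None, a new map centred on the first point is created
    :param zoom_start: integer between 0 and 18 giving the initial zoom level of the map; 18 is the maximum zoom
    :return m: the Folium map with the trajectory points added
    '''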
    if m is None:
        m = folium.Map(location=[traj.lat[0], traj.lon[0]], zoom_start=zoom_start)
    
    folium.CircleMarker(
        location=[traj.lat[0], traj.lon[0]],
        radius=5,
        color='green'
    ).add_to(m)
    
    for i in range(1, len(traj)-1):
        folium.CircleMarker(
            location=[traj.lat[i], traj.lon[i]],
            radius=5,
            color='black'
        ).add_to(m)
    
    folium.CircleMarker(
        location=[traj.lat[len(traj)-1], traj.lon[len(traj)-1]],
        radius=5,
        color='red'
    ).add_to(m)
    
    # adds a fullscreen button at the 'topright' corner of the plot
    plugins.Fullscreen(position='topright').add_to(m)
    
    return m

In [4]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance (in metres) between two points on the earth (specified in decimal degrees)
    
    :param lon1: longitude of point 1
    :param lat1: latitude of point 1
    :param lon2: longitude of point 2
    :param lat2: latitude of point 2
    :return: the distance between (lon1, lat1) and (lon2, lat2), in metres
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6378.1  # equatorial radius of the earth in kilometres (the mean radius is about 6371 km)
    
    return c * r * 1000

def data_cleaning(df, speed_thres):
    '''
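    Removes outlying datapoints from the dataframe `df`, i.e. points that could only be
    reached or left at a speed above `speed_thres`
    
    :param df: a dataframe with variables `lat`, `lon`, `time`, sorted by increasing `time`
    :param speed_thres: the maximum plausible speed between two consecutive geoloc datapoints, in metres per second
    :return: a cleaned version of the dataframe `df`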
    '''
    
    def speed(point1, point2):
        '''
        Calculate speed between 2 points, in metres per second
        
        :param point1: a geoloc datapoint with variables `lat`, `lon`, `time`
        :param point2: a geoloc datapoint with variables `lat`, `lon`, `time`
        :return: speed between `point1` and `point2`, in metres per second
        '''
        
        if point1.time == point2.time:
            return 0.0
        else:
            return haversine(point1.lon, point1.lat, point2.lon, point2.lat) / np.abs((point1.time - point2.time).total_seconds())
    
    if len(df) <= 2:
        return df
    
    # initialise variables
    row_num_to_keep = []
    i = 0
    size = len(df)
    is_start = True
    
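    while i <= size-1:
        if is_start:
            # leading outlier: the jump to the next point is too fast, but the jump after that is plausible
            # (the `i+2` bounds check avoids indexing past the last row)
            if i+2 <= size-1 and speed(df.iloc[i,:], df.iloc[i+1,:]) > speed_thres and speed(df.iloc[i+1,:], df.iloc[i+2,:]) <= speed_thres:
                pass
            else:
                is_start = False
                row_num_to_keep += [i]
                
        else:
            # not the first nor last point
            if i < size-1:
                # outlier: both the jump into and the jump out of this point are too fast
                if speed(df.iloc[i-1,:], df.iloc[i,:]) > speed_thres and speed(df.iloc[i,:], df.iloc[i+1,:]) > speed_thres:
                    # skip this point; `pass` rather than `continue`, so that `i` is still incremented below
                    pass
                else:
                    row_num_to_keep += [i]
                
            # last point
            if i == size-1:
                # outlier: the jump into the last point is too fast
                if speed(df.iloc[i-1,:], df.iloc[i,:]) > speed_thres:
                    pass
                else:
                    row_num_to_keep += [i]
                    
        i += 1            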
    
    cleaned_df = df.iloc[row_num_to_keep, :]
    
    return cleaned_df

Start with an empty sample, `sample = []`.

1. Determine if the first point is an outlier
    - If the first point is an outlier, move to the next point
    - Repeat until we reach a point that is not an outlier
    - Add this non-outlying point to the sample
2. For each remaining point, check if it is an outlier
    - If the point is an outlier, skip it and move to the next point
    - If not, add it to the sample
    - Repeat until there are no more points
    
    
**Definition of outlier:**

Let the number of points be `n`.
- If `i == 0`, then `point[i]` is an outlier if `speed(point[i], point[i+1]) > speed_thres` AND `speed(point[i+1], point[i+2]) <= speed_thres`
- Else if `i > 0` AND `i < n-1`, then `point[i]` is an outlier if `speed(point[i-1], point[i]) > speed_thres` AND `speed(point[i], point[i+1]) > speed_thres`
- Else if `i == n-1`, then `point[i]` is an outlier if `speed(point[i-1], point[i]) > speed_thres`
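
The three rules above can be sketched as a standalone predicate. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function name `is_outlier` and the precomputed list `speeds` are hypothetical, the speeds are made-up values, and the rules are applied literally by index (the first-point rule is used only for `i == 0`).

```python
def is_outlier(speeds, i, speed_thres):
    """
    Apply the outlier definition to point `i`.

    `speeds[k]` is the speed between point k and point k+1, so a trajectory
    of n points has n-1 entries. Assumes n >= 3.
    """
    n = len(speeds) + 1
    if i == 0:
        # first point: fast jump out, followed by a plausible jump
        return speeds[0] > speed_thres and speeds[1] <= speed_thres
    if i < n - 1:
        # interior point: fast jump in AND fast jump out
        return speeds[i - 1] > speed_thres and speeds[i] > speed_thres
    # last point: fast jump in
    return speeds[i - 1] > speed_thres

# Made-up speeds (m/s) for 5 points where only point 2 jumps away and back
speeds = [5.0, 300.0, 310.0, 4.0]
flags = [is_outlier(speeds, i, speed_thres=22.2) for i in range(5)]
print(flags)  # prints [False, False, True, False, False]
```

Requiring both neighbouring speeds to exceed the threshold for interior points means a single bad point is flagged, while the valid points on either side of it are kept.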

In [7]:
traj.head()

Unnamed: 0,time,lat,lon
0,2020-01-01 00:00:00,1.3521,103.82
1,2020-01-01 00:01:00,1.4927,103.741
2,2020-01-01 00:02:00,1.4927,103.741
3,2020-01-01 00:03:00,1.4927,103.741
4,2020-01-01 00:04:00,1.4927,103.741


## Testcase 1 <a id="cleaning-testcase-1"></a>

All stationary

[Back to top](#top)

In [13]:
traj = pd.DataFrame(columns=["time", "lat", "lon"])

traj.loc[0, :] = [STARTDATE, 1.3521, 103.8198]

for i in range(1, 60):
    # stationary: every point repeats the start coordinates, one minute apart
    traj.loc[i, :] = [traj.time[i-1] + timedelta(seconds=60), 1.3521, 103.8198]
    
traj

Unnamed: 0,time,lat,lon
0,2020-01-01 00:00:00,1.3521,103.82
1,2020-01-01 00:01:00,1.3521,103.82
2,2020-01-01 00:02:00,1.3521,103.82
3,2020-01-01 00:03:00,1.3521,103.82
4,2020-01-01 00:04:00,1.3521,103.82
...,...,...,...
55,2020-01-01 00:55:00,1.3521,103.82
56,2020-01-01 00:56:00,1.3521,103.82
57,2020-01-01 00:57:00,1.3521,103.82
58,2020-01-01 00:58:00,1.3521,103.82


In [14]:
plot(traj)

In [15]:
cleaned_df1 = data_cleaning(traj, speed_thres=22)
cleaned_df1

Unnamed: 0,time,lat,lon
0,2020-01-01 00:00:00,1.3521,103.82
1,2020-01-01 00:01:00,1.3521,103.82
2,2020-01-01 00:02:00,1.3521,103.82
3,2020-01-01 00:03:00,1.3521,103.82
4,2020-01-01 00:04:00,1.3521,103.82
...,...,...,...
55,2020-01-01 00:55:00,1.3521,103.82
56,2020-01-01 00:56:00,1.3521,103.82
57,2020-01-01 00:57:00,1.3521,103.82
58,2020-01-01 00:58:00,1.3521,103.82


## Testcase 4 <a id="cleaning-testcase-4"></a>

Only the start and end points are outlying

[Back to top](#top)
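
To see why the start and end points of this testcase count as outliers, the speeds implied by the two jumps can be checked directly. The sketch below is self-contained and uses the coordinates and the 60-second sampling interval from the cell that follows; the helper name `haversine_m` is hypothetical, and it assumes the same earth radius (6378.1 km) as the notebook's `haversine`.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2, r_km=6378.1):
    """Great circle distance in metres between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * r_km * 1000

CAR_SPEED = 22.2  # m/s, as defined earlier in the notebook
INTERVAL = 60     # seconds between consecutive points

# jump from the outlying start point to the stationary cluster
start_speed = haversine_m(103.8198, 1.3521, 103.7414, 1.4927) / INTERVAL
# jump from the stationary cluster to the outlying end point
end_speed = haversine_m(103.7414, 1.4927, 104, 1.5021) / INTERVAL

print(start_speed > CAR_SPEED, end_speed > CAR_SPEED)  # prints True True
```

Both jumps imply speeds of several hundred metres per second, far above the driving threshold, while the points in between do not move at all.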

In [1]:
traj = pd.DataFrame(columns=["time", "lat", "lon"])

traj.loc[0, :] = [STARTDATE, 1.3521, 103.8198]

for i in range(1, 60):
    randSmallNums = np.random.uniform(low=0, high=0, size=2)
    traj.loc[i, :] = [traj.time[i-1] + timedelta(seconds=60), 1.4927+randSmallNums[0], 103.7414+randSmallNums[1]]
    
traj.loc[60, :]  = [traj.time[60-1] + timedelta(seconds=60), 1.5021, 104]

traj


In [None]:
plot(traj)

In [None]:
cleaned_df4 = data_cleaning(traj, speed_thres=22)
cleaned_df4