### This notebook processes the initial dataset in order to extract useful information from it.

This dataset contains trajectories of taxis in San Francisco. Every trajectory provides <b>Latitude</b> and <b>Longitude</b> information, as well as a <b>Timestamp</b>. All the trajectories were recorded in May 2008.

In [1]:
# measure execution time
%load_ext autotime

time: 0 ns (started: 2023-05-16 11:39:09 +03:00)


In [2]:
# import libraries
import os
import json
import requests
import pandas as pd
import seaborn as sns
import plotly_express as px
import matplotlib.pyplot as plt
import tqdm
from tqdm.notebook import tqdm_notebook
from datetime import datetime, timedelta

%matplotlib inline

time: 11.9 s (started: 2023-05-16 11:39:09 +03:00)


### Phase 1: Preprocess the dataset
In this step, the following commands are executed:
- Add Taxi ID on the data
- Gather all the data in one txt file
- Convert time information to timestamp
- Split trajectories, based on the time field
- Delete unnecessary columns
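
As a rough illustration of the first three steps, the sketch below tags each per-taxi file with an ID, gathers everything in one dataframe, and converts the Unix seconds to timestamps. The two in-memory files and the helper name `load_taxi_files` are hypothetical stand-ins for the real data folder; the column names follow the ones used in this notebook.

```python
import io
import pandas as pd

# two tiny in-memory "files" (hypothetical sample rows: lat lon occupied unix_time)
sample_files = [
    io.StringIO("37.75134 -122.39488 0 1210982400\n37.75136 -122.39527 0 1210982460\n"),
    io.StringIO("37.61549 -122.38830 1 1210982430\n"),
]

def load_taxi_files(files):
    """Read every file, tag its rows with a Taxi ID and return one dataframe."""
    frames = []
    for taxi_id, f in enumerate(files):  # Taxi ID starts from 0
        temp = pd.read_csv(f, names=['Latitude', 'Longitude', 'Occupied', 'Date Time'], sep=' ')
        temp.insert(0, 'Taxi ID', taxi_id)
        frames.append(temp)

    # a single concat at the end avoids re-copying the data on every iteration
    data = pd.concat(frames, ignore_index=True)

    # the time field holds seconds since the Unix epoch
    data['Date Time'] = pd.to_datetime(data['Date Time'], unit='s')
    return data

sample_data = load_taxi_files(sample_files)
print(sample_data)
```

Collecting the frames in a list and concatenating once is only a design choice for speed; the result has the same columns as the dataframe built in the cell below.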

In [None]:
# define the path in which data are stored
path = 'C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Data'

counter = 0 # Taxi ID starts from 0

# create an empty dataframe, in which all the data will be saved
all_data = pd.DataFrame(columns=['Taxi ID','Latitude','Longitude','Occupied','Date Time'])

for filename in os.listdir(path):
    
    # read each file in the Data folder
    temp = pd.read_csv(path+'/'+filename,names=['Latitude','Longitude','Occupied','Date Time'],sep=' ')
    
    # assign Taxi ID number to each file
    temp.insert(1,'Taxi ID',counter)
    
    # add the data in this file in the 'all_data' dataframe
    all_data = pd.concat([all_data, temp],ignore_index = True)
    
    counter += 1 # Increase Taxi ID number by 1    

In [None]:
# save data to txt file
all_data.to_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Files/all_data.txt',index=False)

In [None]:
visited_segments = pd.read_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Files/visited_segments.txt',sep=',')

#### Change datetime field to timestamp

In [None]:
all_data['Date Time'] = pd.to_datetime(all_data['Date Time'],origin='unix',unit='s')

In [None]:
# choose the data of one week
all_data = all_data[(all_data['Date Time'] >= "2008-05-18 00:00:00") & (all_data['Date Time'] < "2008-05-25 00:00:00")]

#### Sort the data based on Taxi ID and timestamp information

In [None]:
all_data = all_data.sort_values(['Taxi ID','Date Time'])
all_data = all_data.reset_index(drop=True)

#### Delete the 'Occupied' column
This column denotes whether or not the taxi was occupied by a passenger at the time of the GPS recording. This information is not useful for our research.

In [None]:
all_data.drop('Occupied',axis=1,inplace=True)

#### Split the trajectories based on the time field and Taxi ID

Split the GPS records of each Taxi ID into separate trajectories, based on the timestamp field.

Here, the <b>n_sec</b> variable denotes the maximum time gap, in seconds, allowed between consecutive GPS records of the same trajectory.
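
The splitting rule can also be sketched in a vectorized form. This is a minimal sketch and not the loop used in this notebook: the helper name `assign_traj_ids` and the sample rows are hypothetical, and it assumes the same column names and the same convention that Traj ID restarts from 0 for every Taxi ID.

```python
import pandas as pd

def assign_traj_ids(df, n_sec=90):
    """Start a new trajectory whenever the gap to the previous point of the same taxi exceeds n_sec."""
    df = df.sort_values(['Taxi ID', 'Date Time']).reset_index(drop=True)

    # time gap (in seconds) to the previous record of the same taxi; NaN for the first record
    gap = df.groupby('Taxi ID')['Date Time'].diff().dt.total_seconds()

    # True where a new trajectory starts (NaN > n_sec evaluates to False)
    new_traj = (gap > n_sec).astype(int)

    # the running count of "new trajectory" flags per taxi is the Traj ID
    df['Traj ID'] = new_traj.groupby(df['Taxi ID']).cumsum()
    return df

# hypothetical sample: seconds offsets from an arbitrary start time
base = pd.Timestamp('2008-05-18 00:00:00')
sample = pd.DataFrame({
    'Taxi ID':   [0, 0, 0, 0, 1, 1],
    'Date Time': [base + pd.Timedelta(seconds=s) for s in (0, 30, 200, 250, 0, 100)],
})
split = assign_traj_ids(sample)
print(split)
```

In the sample, the 170-second gap for taxi 0 and the 100-second gap for taxi 1 each open a new trajectory, while the 30-second and 50-second gaps do not.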

In [None]:
all_data.insert(1,'Traj ID',-1)

In [None]:
''' 
Initially, all the GPS points of a Taxi ID form one trajectory

If the time gap between two consecutive GPS points is at most n_sec seconds, (condition 1)
and these GPS points belong to the same Taxi ID  (condition 2)
then assign the same Traj ID number to both. (result)

If the time gap between two consecutive GPS points is higher than n_sec seconds, (condition 1)
and these GPS points belong to the same Taxi ID  (condition 2)
then assign a different Traj ID number to each of these GPS points. (result)

If the GPS points do not belong to the same Taxi ID,
then start a new trajectory and restart the Traj ID numbering from 0. (result)

'''

# max number of seconds between consecutive GPS records of each trajectory
n_sec = 90
traj_id = 0

for i in range(all_data.shape[0] -1):
    
    if (all_data['Taxi ID'][i+1] == all_data['Taxi ID'][i]): # both points belong to the same Taxi ID
        
        if (((all_data['Date Time'][i+1])-(all_data['Date Time'][i])).total_seconds() <= n_sec): # time interval less-equal than n_sec
            all_data.at[i,'Traj ID'] = traj_id
            all_data.at[i+1,'Traj ID'] = traj_id
            
        else: # time interval higher than n_sec
            all_data.at[i,'Traj ID'] = traj_id
            traj_id +=1
            all_data.at[i+1,'Traj ID'] = traj_id
    
    else: # the points do not belong to the same Taxi ID
        all_data.at[i,'Traj ID'] = traj_id
        traj_id  = 0
        all_data.at[i+1,'Traj ID'] = traj_id

#### Delete trajectories that contain only one GPS point

In [None]:
uniques = all_data.loc[:, ['Taxi ID', 'Traj ID']].drop_duplicates(keep=False).index
all_data.drop(uniques,axis=0,inplace=True)
all_data.reset_index(drop=True,inplace=True)

#### Find min and max date in this dataset

In [None]:
print("Min date is: "+str(all_data['Date Time'].min()))
print("Max date is: "+str(all_data['Date Time'].max()))

In [None]:
# save data to txt file (with information about the split trajectories)
all_data.to_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Files/splitted_trajectories90.txt',index=False)

#### Begin Map Matching

Map matching is done using the Valhalla Meili API. Each trajectory is given to the API as input, and the response describes the exact path that the trajectory followed. The paths are expressed as sequences of OSM Way IDs.

Sources:

-  <b>Installation using Docker: </b>https://ikespand.github.io/posts/meili/
-  <b>Paper about Valhalla: </b>https://link.springer.com/article/10.1007/s42979-022-01340-5#Tab5
-  <b>APIs documentation: </b>https://valhalla.github.io/valhalla/api/map-matching/api-reference/#matched-point-items  
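
As an illustration of what is sent to the map matching service, the sketch below assembles a request body from a Python dictionary instead of concatenating strings. The option values are copied from the request cell of this notebook; the helper name `build_trace_request` and the two sample points are hypothetical, and no server is contacted here.

```python
import json

def build_trace_request(points):
    """Return the JSON request body for a list of {'lat': ..., 'lon': ...} dictionaries."""
    body = {
        "shape": points,
        # the options below mirror the ones used in this notebook's request
        "search_radius": 250,
        "sigma_z": 10,
        "beta": 10,
        "shape_match": "map_snap",
        "costing": "auto",
        "filters": {"attributes": ["edge.way_id"], "action": "include"},
        "format": "osrm",
    }
    return json.dumps(body)

# hypothetical sample points of one trajectory
example_points = [
    {"lat": 37.75134, "lon": -122.39488},
    {"lat": 37.75136, "lon": -122.39527},
]
request_body = build_trace_request(example_points)
print(request_body)
```

Letting `json.dumps` serialize the whole dictionary guarantees a syntactically valid body, which is harder to ensure when head and tail strings are glued together by hand.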

In [None]:
# pass lat and lon pairs to Valhalla API
df_for_meili = all_data[['Latitude','Longitude']]
df_for_meili = df_for_meili.rename(columns={"Latitude": "lat", "Longitude": "lon"})

Create a new dataframe named "visited_segments", which contains information about each trajectory. The columns of this new dataframe are:
-  <b>Taxi ID: </b>The ID of the taxi (source file) that this trajectory belongs to.
-  <b>Traj ID: </b>The ID of the trajectory within this Taxi ID.
-  <b>OSM Way ID: </b>The way ID number of the edge that the trajectory visited.
-  <b>Start Time: </b>Expected time at which the trajectory entered the specific edge.
-  <b>End Time: </b>Expected time at which the trajectory left the specific edge.
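
The expected Start Time and End Time of each edge come from dividing the total duration of the trajectory equally among the matched edges. A minimal sketch of that arithmetic follows; the helper name `edge_time_windows` and the sample timestamps are hypothetical.

```python
from datetime import timedelta
import pandas as pd

def edge_time_windows(first_time, last_time, n_edges):
    """Split [first_time, last_time] into n_edges consecutive windows of equal duration."""
    duration = (last_time - first_time).total_seconds() / n_edges
    windows = []
    start = first_time
    for _ in range(n_edges):
        end = start + timedelta(seconds=duration)
        windows.append((start, end))
        start = end  # the next edge is entered when the previous one is left
    return windows

# hypothetical trajectory: one minute long, matched to three edges
windows = edge_time_windows(pd.Timestamp('2008-05-18 00:00:00'),
                            pd.Timestamp('2008-05-18 00:01:00'), 3)
for w in windows:
    print(w)
```

Each of the three edges receives a 20-second window. The equal split is an approximation, since the real time spent on an edge depends on its length and on traffic.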

In [None]:
visited_segments = pd.DataFrame(columns=['Taxi ID','Traj ID','OSM Way ID','Start Time','End Time'])

for taxi_id in all_data['Taxi ID'].unique():
    for traj_id in all_data[all_data['Taxi ID'] == taxi_id]['Traj ID'].unique():

            # get the batch of data that we send to the request
            indexes = all_data[(all_data['Taxi ID']==taxi_id) & (all_data['Traj ID'] == traj_id)].index
            
            # input to API
            passed_data = df_for_meili.iloc[indexes]

            # Preparing the request to Valhalla's Meili
            meili_coordinates = passed_data.to_json(orient='records')
            meili_head = '{"shape":'
            meili_tail = ""","search_radius": 250, "sigma_z": 10, "beta": 10,"shape_match":"map_snap", "costing":"auto",
                            "filters":{"attributes":["edge.way_id"],"action":"include"},
                            "format":"osrm"}"""

            # this is the request
            meili_request_body = meili_head + meili_coordinates + meili_tail

            # the URL of the local valhalla server
            url = "http://localhost:8002/trace_attributes"

            # providing headers to the request
            headers = {'Content-type': 'application/json'}

            # we need to send the JSON as a string
            data = str(meili_request_body)

            # sending a request
            r = requests.post(url, data=data, headers=headers)

            if r.status_code == 200: # response from Valhalla API was successful

                # Parsing the JSON response
                response_text = json.loads(r.text)

                # find the time interval (in sec) that the trajectory needs to be completed [last timestamp - first timestamp]
                interval = (all_data.iloc[indexes].iloc[-1]['Date Time'] - all_data.iloc[indexes].iloc[0]['Date Time']).total_seconds()

                # compute the expected duration that the moving object is in each edge (duration is equal for each edge that the trajectory visits)
                duration  = interval/len(response_text['edges'])

                # make a temporary dataframe
                temp = pd.DataFrame(columns=['Taxi ID','Traj ID','OSM Way ID','Start Time','End Time'])

                # make the final dataframe with the help of a temporary dataframe
                for i in range(len(response_text['edges'])):

                    # complete the fields of temp dataframe
                    temp.at[i,'Taxi ID'] = taxi_id
                    temp.at[i,'Traj ID'] = traj_id
                    temp.at[i,'OSM Way ID'] = response_text['edges'][i]['way_id']

                    if i == 0:
                        temp.at[i,'Start Time'] = all_data.iloc[indexes].iloc[0]['Date Time']
                    else:
                        temp.at[i,'Start Time'] = temp.at[i-1,'End Time']

                    temp.at[i,'End Time'] = temp.at[i,'Start Time'] + timedelta(seconds=duration)

                # concatenate the two dataframes
                visited_segments = pd.concat([visited_segments,temp],ignore_index=True)

#### Delete trajectories that contain only one OSM Way ID

In [None]:
uniques = visited_segments.loc[:, ['Taxi ID', 'Traj ID']].drop_duplicates(keep=False).index
visited_segments.drop(uniques,axis=0,inplace=True)
visited_segments.reset_index(drop=True,inplace=True)

In [None]:
# save the new dataframe to separate txt file
visited_segments.to_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Files/visited_segments.txt',index=False)

### Phase 2: Make the time series dataset

In [3]:
# read and sort the data. Also, convert timestamps to datetime data type
visited_segments = pd.read_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Sam-Francisco-Yellow-Cabs/Files/visited_segments.txt')
visited_segments['Start Time'] = pd.to_datetime(visited_segments['Start Time'],format='%Y-%m-%d %H:%M:%S.%f')
visited_segments['End Time'] = pd.to_datetime(visited_segments['End Time'],format='%Y-%m-%d %H:%M:%S.%f')
visited_segments = visited_segments.sort_values(['Taxi ID','Traj ID','Start Time']).reset_index(drop=True)

time: 3min 19s (started: 2023-05-16 11:39:29 +03:00)


#### Step 1: Create the SPQ function

This is the main function that will be used for the construction of the time series dataset.
The SPQ (Strict Path Query) function returns the number of trajectories [each trajectory is a unique (Taxi_ID,Traj_ID) pair] that pass through the given path of edges within a given time interval [time_enter,time_leave].

Parameters:
- <b>path: </b> The path that the trajectories should EXACTLY follow (edge by edge). This path can be of any length greater than or equal to 2 edges.

- <b>time_enter: </b>The earliest time at which the trajectory may enter the first edge of the path given as input.
- <b>time_leave: </b>The latest time at which the trajectory may leave the last edge of the path given as input.
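
To make the "strict" condition concrete, the sketch below counts the trajectories in a small hand-made table whose edge sequence contains the given path as a contiguous run. It is an alternative formulation for illustration only, not the function defined in the next cell: the helper name `count_strict_path_matches` and the toy rows are hypothetical, and the table is assumed to be already restricted to the requested time interval and sorted by Start Time.

```python
import pandas as pd

def count_strict_path_matches(segments, path):
    """Count (Taxi ID, Traj ID) pairs whose edge sequence contains `path` edge by edge."""
    path = list(path)
    n = len(path)
    matches = 0
    for _, group in segments.groupby(['Taxi ID', 'Traj ID'], sort=False):
        ways = group['OSM Way ID'].tolist()
        # slide a window of length n over the trajectory and compare it with the path
        if any(ways[k:k + n] == path for k in range(len(ways) - n + 1)):
            matches += 1
    return matches

# toy data: three trajectories over the edges 10, 20 and 30
toy_segments = pd.DataFrame({
    'Taxi ID':    [0, 0, 0, 0, 0, 1, 1, 1],
    'Traj ID':    [0, 0, 0, 1, 1, 0, 0, 0],
    'OSM Way ID': [10, 20, 30, 20, 30, 10, 30, 20],
})

print(count_strict_path_matches(toy_segments, [20, 30]))
print(count_strict_path_matches(toy_segments, [10, 20, 30]))
```

The path [20, 30] is followed by two trajectories, while the third one visits the same edges in the opposite order and therefore does not count.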

In [4]:
def SPQ(path,time_enter,time_leave):
    
    # length of the path given to the function
    path_length = len(path)
    
    # this list will save temporarily the trajectories that match the SPQ condition
    trajectories = []
    
    # the time filtering below has been moved to the calling loop (Step 4), which defines examined_data,
    # but it is still logically part of this function
    # extract only the data that match the time interval given as input
    # examined_data = visited_segments[(visited_segments['Start Time'] >= time_enter) &
    #                                  (visited_segments['End Time'] <= time_leave)].reset_index(drop=True)

    # find all the indexes, in which the first edge in the path is located
    needed_indexes = examined_data[examined_data['OSM Way ID'] == path[0]].index

    # iterate through all indexes (note the Taxi_ID and Traj_ID numbers)
    for index in needed_indexes:

        traj_id = examined_data.at[index,'Traj ID']
        taxi_id = examined_data.at[index,'Taxi ID']
        inter = 1
        
        # decide if the row in the next index matches the criteria (same Taxi_ID, same Traj_ID, the path required)
        for i in range(1,path_length):
            try:
                if (not ((examined_data['OSM Way ID'].iloc[index+i] == path[i]) 
                         and (examined_data['Traj ID'].iloc[index+i] == traj_id) 
                         and (examined_data['Taxi ID'].iloc[index+i] == taxi_id))):

                    break

                inter += 1 # if the criteria match, then increase inter counter by one
            
            # the path would run past the last row of examined_data, so it cannot match
            except IndexError:
                break
            
        # if the criteria match as many times as the length of the path, then we found one trajectory
        # add this trajectory to the trajectories list
        if (path_length == inter):
            trajectories.append((taxi_id,traj_id))

    # return the number of unique trajectories that match the criteria
    return len(set(trajectories))

time: 31 ms (started: 2023-05-16 11:42:48 +03:00)


#### Step 2: Create the time information of the dataset

In [5]:
# Find max and min timestamp in the dataset
min_timestamp = visited_segments['Start Time'].min()
print("Min timestamp value in the dataframe is: ",min_timestamp)

max_timestamp = visited_segments['End Time'].max()
print("Max timestamp value in the dataframe is: ",max_timestamp)

# Calculate total seconds between those max and min values
total_sec = (max_timestamp-min_timestamp).total_seconds()
print("\nTotal duration in sec in this dataframe is: ",total_sec)

Min timestamp value in the dataframe is:  2008-05-18 00:00:00
Max timestamp value in the dataframe is:  2008-05-24 23:59:59.000130

Total duration in sec in this dataframe is:  604799.00013
time: 282 ms (started: 2023-05-16 11:42:48 +03:00)


###### Since we have data of one week, we will create time intervals  of one hour
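
The one-hour intervals can also be generated directly with `pd.date_range`. This is a minimal sketch under the assumption that the last interval may extend past the maximum timestamp, as in the loop of the next cell; the helper name `hourly_intervals` is hypothetical and the two timestamps are the ones printed above.

```python
import math
from datetime import timedelta
import pandas as pd

def hourly_intervals(min_timestamp, max_timestamp):
    """Return consecutive (start, end) pairs of one hour that cover [min_timestamp, max_timestamp]."""
    step = timedelta(hours=1)
    # number of interval boundaries needed so that the last one is >= max_timestamp
    n_edges = math.ceil((max_timestamp - min_timestamp) / step) + 1
    edges = pd.date_range(start=min_timestamp, periods=n_edges, freq=step)
    return list(zip(edges[:-1], edges[1:]))

intervals = hourly_intervals(pd.Timestamp('2008-05-18 00:00:00'),
                             pd.Timestamp('2008-05-24 23:59:59.000130'))
print(len(intervals))
print(intervals[0])
print(intervals[-1])
```

For one week of data this yields 168 intervals, one per hour.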

In [6]:
# This list contains the time information of our time-series data
time_info = []

i =0
while(True):
    if i == 0:
        time_info.append(min_timestamp)
    else:
        time_info.append(time_info[i-1] + timedelta(seconds=3600))
    
    if (time_info[i]>=max_timestamp):
        break
    
    i+=1 

# create pairs of consecutive values of the list time_info
time_info = list(zip(time_info, time_info[1:])) 

time: 735 ms (started: 2023-05-16 11:42:49 +03:00)


#### Step 3: Generate random paths of length 2 and 3
These paths can be of any length. The number of consecutive edges contained in a path defines its length.
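
As an illustration of what a path of a given length is, the sketch below collects every run of consecutive edges from a small hand-made table. It is a simplified variant and not the cell that follows: the helper name `consecutive_paths` and the toy rows are hypothetical, and, as an additional assumption, a path is never allowed to cross from one trajectory into the next.

```python
import pandas as pd

def consecutive_paths(segments, length):
    """Return the unique runs of `length` consecutive edges, in order of first appearance."""
    seen = {}  # a dict keeps insertion order and removes duplicates
    for _, group in segments.groupby(['Taxi ID', 'Traj ID'], sort=False):
        ways = group['OSM Way ID'].tolist()
        for k in range(len(ways) - length + 1):
            seen.setdefault(tuple(ways[k:k + length]), None)
    return list(seen)

# toy data: two trajectories that share the edges 2 and 3
toy_segments = pd.DataFrame({
    'Taxi ID':    [0, 0, 0, 1, 1, 1],
    'Traj ID':    [0, 0, 0, 0, 0, 0],
    'OSM Way ID': [1, 2, 3, 2, 3, 4],
})

print(consecutive_paths(toy_segments, 2))
print(consecutive_paths(toy_segments, 3))
```

The pair (2, 3) occurs in both trajectories but is reported once, and no path such as (3, 2) is produced from the boundary between the two trajectories.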

In [None]:
# select the 100 most frequently occurring edges (OSM Way IDs) in the visited_segments dataframe
most_common_edges = pd.DataFrame(visited_segments['OSM Way ID'].value_counts()[0:100].index,columns=['OSM Way ID'])

double_paths = [] # list of paths with length 2
triple_paths = [] # list of paths with length 3

# fill these lists with consecutive paths of length 2 and length 3
for i in range (most_common_edges.shape[0]):

    index = visited_segments[visited_segments['OSM Way ID']==most_common_edges.at[i,'OSM Way ID']].index
    counter = 0
    
    for item in index:
        
        counter += 1
        
        # skip rows that are too close to the end of the dataframe to have two successors
        if item + 2 < visited_segments.shape[0]:
            
            item1 = visited_segments.at[item,'OSM Way ID']
            item2 = visited_segments.at[item+1,'OSM Way ID']
            item3 = visited_segments.at[item+2,'OSM Way ID']
            
            double_paths.append([item1, item2])
            triple_paths.append([item1, item2, item3])
        
        if counter == 100: # stop the execution
            break

# create a new dataframe, which contains all the unique paths of length 2
# (paths are compared as tuples, because lists are not hashable)
unique_double = [list(p) for p in dict.fromkeys(tuple(p) for p in double_paths)]
double_paths = pd.DataFrame({'Path': unique_double})

# do the same for the paths of length 3
unique_triple = [list(p) for p in dict.fromkeys(tuple(p) for p in triple_paths)]
triple_paths = pd.DataFrame({'Path': unique_triple})

# add all paths in one dataframe
paths = pd.concat([double_paths,triple_paths],ignore_index=True)

#### Step 4: Fill the time series dataframe

In [None]:
# create an empty dataframe
time_series = pd.DataFrame(columns=time_info)
time_series.insert(0,'Path',0)

In [None]:
# fill the dataframe column by column
for time in time_info:
    i = 0
    # extract only the data that match the time interval given as input
    examined_data = visited_segments[(visited_segments['Start Time'] >= time[0]) & (visited_segments['End Time'] <= time[1])].reset_index(drop=True)
    
    for path in paths['Path'].to_list():
        time_series.at[i,'Path'] = str(path)
        time_series.at[i,time] = SPQ(path,time[0],time[1])
        i += 1

In [None]:
# print dataframe
time_series