<a href="https://colab.research.google.com/github/stratoskar/Path-Based-Traffic-Flow-Prediction/blob/main/Python_Code/DataCollection_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Collection

The data used in this research comes from the cabspotting project. The dataset contains about 11 million GPS records from taxis (Yellow Cab vehicles) in the San Francisco, California area. The data was sampled in May and June 2008.

<b>You can read more about the dataset in the following link: </b>https://stamen.com/work/cabspotting/

In [1]:
!pip install ipython-autotime

# Measure execution time of each cell
%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.2-py2.py3-none-any.whl (7.0 kB)
Collecting jedi>=0.16 (from ipython->ipython-autotime)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipython-autotime
Successfully installed ipython-autotime-0.3.2 jedi-0.19.1
time: 443 µs (started: 2023-11-19 20:11:41 +00:00)


In [2]:
# Import basic libraries
import os

import numpy as np
import pandas as pd

from datetime import datetime, timedelta

time: 421 ms (started: 2023-11-19 20:11:41 +00:00)


##### Combine all the files of the dataset into one dataframe

In [3]:
from google.colab import drive
drive.mount('/content/drive')

# Define the path with the data
PATH = '/content/drive/MyDrive/Paper/Data/'

Mounted at /content/drive
time: 19.6 s (started: 2023-11-19 20:11:42 +00:00)


In [4]:
counter = 0 # Taxi ID starts from 0

# Create an empty dataframe, in which all the data will be saved
all_data = pd.DataFrame(columns=['Taxi ID','Latitude','Longitude','Occupied','Date Time'])

for filename in os.listdir(PATH):
    try:
        # Read each file in the Data folder (space-separated, no header row)
        temp = pd.read_csv(os.path.join(PATH, filename),names=['Latitude','Longitude','Occupied','Date Time'],sep=' ')

        # Assign the Taxi ID number to every row of this file
        temp.insert(1,'Taxi ID',counter)

        # Append the data of this file to the 'all_data' dataframe
        all_data = pd.concat([all_data, temp],ignore_index = True)

        counter += 1 # Increase Taxi ID number by 1
    except Exception:
        # Skip files that cannot be read as GPS records
        continue

time: 2min 46s (started: 2023-11-19 20:12:01 +00:00)
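
Concatenating inside the loop copies the growing dataframe on every iteration. A minimal alternative sketch, not the code that produced the outputs in this notebook, collects the per-file dataframes in a list and concatenates once. The helper name `load_taxi_files` and the use of `sorted()` are assumptions made for this example. It also omits the error handling of the cell above, and the resulting `Taxi ID` column would be an integer column rather than `object`.

```python
import os
import pandas as pd

def load_taxi_files(path):
    """Read every file in `path` into one dataframe (hypothetical helper)."""
    frames = []
    # sorted() makes the Taxi ID assignment reproducible across runs;
    # this is an assumption, the notebook uses the plain os.listdir order
    for taxi_id, filename in enumerate(sorted(os.listdir(path))):
        temp = pd.read_csv(os.path.join(path, filename),
                           names=['Latitude', 'Longitude', 'Occupied', 'Date Time'],
                           sep=' ')
        # Tag every row of this file with its Taxi ID
        temp.insert(0, 'Taxi ID', taxi_id)
        frames.append(temp)
    # A single concat avoids re-copying the accumulated data for each file
    return pd.concat(frames, ignore_index=True)
```

Calling `load_taxi_files(PATH)` would then stand in for the loop above, provided the folder contains only readable GPS files.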


##### Change datetime field to timestamp

In [5]:
all_data['Date Time'] = pd.to_datetime(all_data['Date Time'],origin='unix',unit='s')

time: 4.92 s (started: 2023-11-19 20:14:48 +00:00)
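
As a quick sanity check of this conversion, the following minimal sketch converts a single Unix timestamp. The value `1211018429` is an illustrative input chosen for this example, not a value read from the dataset files.

```python
import pandas as pd

# Illustrative epoch value in seconds (an assumption for this sketch)
epoch_seconds = pd.Series([1211018429])

# unit='s' tells pandas that the integers count seconds since 1970-01-01 00:00:00 UTC
converted = pd.to_datetime(epoch_seconds, unit='s')
print(converted.iloc[0])
```

The result is a timezone-naive timestamp expressed in UTC, so no local-time offset for San Francisco is applied.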


##### Sort the data based on Taxi ID and timestamp information

In [6]:
all_data = all_data.sort_values(['Taxi ID','Date Time'])
all_data = all_data.reset_index(drop=True)

time: 7.39 s (started: 2023-11-19 20:14:53 +00:00)


##### Delete the 'Occupied' column
This column denotes whether the taxi was occupied by a passenger when the GPS point was recorded. This information is not needed for our research.

In [7]:
all_data.drop('Occupied',axis=1,inplace=True)

time: 430 ms (started: 2023-11-19 20:15:00 +00:00)


##### Present information about the dataset

In [8]:
# Show the shape of the dataframe
all_data.shape

(11224281, 4)

time: 6.69 ms (started: 2023-11-19 20:15:00 +00:00)


In [9]:
# Print data types of every column present in the dataframe
all_data.dtypes

Taxi ID              object
Latitude            float64
Longitude           float64
Date Time    datetime64[ns]
dtype: object

time: 8.81 ms (started: 2023-11-19 20:15:00 +00:00)


In [10]:
# Show schema information of the dataframe
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11224281 entries, 0 to 11224280
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Taxi ID    object        
 1   Latitude   float64       
 2   Longitude  float64       
 3   Date Time  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 342.5+ MB
time: 14.1 ms (started: 2023-11-19 20:15:00 +00:00)


In [11]:
# Present statistical insights about this dataframe
all_data.describe()

| | Latitude | Longitude |
|---|---|---|
| count | 11224280.0 | 11224280.0 |
| mean | 37.7636 | -122.4124 |
| std | 0.05386684 | 0.03578742 |
| min | 32.8697 | -127.0814 |
| 25% | 37.75513 | -122.4253 |
| 50% | 37.78106 | -122.4111 |
| 75% | 37.79045 | -122.4003 |
| max | 50.30546 | -115.5622 |


time: 1.16 s (started: 2023-11-19 20:15:00 +00:00)


#### Split the trajectories based on the time field and Taxi ID

Split the GPS points of each Taxi ID into sub-trajectories based on the timestamp field.

Here, the <b>n_sec</b> variable denotes the maximum time gap, in seconds, allowed between consecutive GPS points in the same sub-trajectory.

In [12]:
# Insert a new column
all_data.insert(1,'Traj ID',-1)

time: 30 ms (started: 2023-11-19 20:15:02 +00:00)


In [13]:
'''
Each Taxi ID initially contains the GPS data of one trajectory

If the time gap between two consecutive GPS points is at most n_sec seconds, (condition 1)
and these GPS points belong to the same Taxi ID  (condition 2)
then assign the same Traj ID number to both. (result)

If the time gap between two consecutive GPS points is higher than n_sec seconds, (condition 1)
and these GPS points belong to the same Taxi ID  (condition 2)
then assign a different Traj ID number to each of these GPS points. (result)

If two consecutive GPS points belong to different Taxi IDs  (condition)
then restart the Traj ID numbering from 0 for the new Taxi ID. (result)

'''

# Max number of seconds between consecutive GPS records of each trajectory
n_sec = 90
traj_id = 0

for i in range(all_data.shape[0] -1):

    if (all_data['Taxi ID'][i+1] == all_data['Taxi ID'][i]): # Belong to the same Taxi ID

        if (((all_data['Date Time'][i+1])-(all_data['Date Time'][i])).total_seconds() <= n_sec): # Time interval less than or equal to n_sec
            all_data.at[i,'Traj ID'] = traj_id
            all_data.at[i+1,'Traj ID'] = traj_id

        else: # Time interval higher than n_sec
            all_data.at[i,'Traj ID'] = traj_id
            traj_id +=1
            all_data.at[i+1,'Traj ID'] = traj_id

    else: # Do not belong to the same Taxi ID: restart the numbering
        all_data.at[i,'Traj ID'] = traj_id
        traj_id  = 0
        all_data.at[i+1,'Traj ID'] = traj_id

time: 30min 51s (started: 2023-11-19 20:15:02 +00:00)
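
The row-by-row loop above took about 30 minutes on the full dataset. The same rule can be sketched with grouped, vectorized pandas operations. This is an illustrative alternative under the assumption that the dataframe is already sorted by `Taxi ID` and `Date Time`. The function name `assign_traj_ids` and the small `demo` dataframe are hypothetical and not part of the original notebook.

```python
import pandas as pd

def assign_traj_ids(df, n_sec=90):
    """Return one Traj ID per row; assumes df is sorted by 'Taxi ID' and 'Date Time'."""
    # Seconds elapsed since the previous GPS point of the same taxi
    # (NaN for the first point of each taxi)
    gap = df.groupby('Taxi ID')['Date Time'].diff().dt.total_seconds()
    # A new sub-trajectory starts wherever the gap exceeds n_sec;
    # the comparison NaN > n_sec evaluates to False
    starts_new = (gap > n_sec).astype(int)
    # Counting the breaks cumulatively inside each taxi restarts the numbering at 0 per taxi
    return starts_new.groupby(df['Taxi ID']).cumsum()

# Made-up example: two taxis, timestamps given as seconds after an arbitrary start
demo = pd.DataFrame({
    'Taxi ID': [0, 0, 0, 0, 0, 1, 1, 1],
    'Date Time': pd.to_datetime([0, 30, 200, 250, 500, 10, 100, 101], unit='s'),
})
demo['Traj ID'] = assign_traj_ids(demo, n_sec=90)
print(demo)
```

In this toy input the first taxi has two gaps longer than 90 seconds, so it is split into three sub-trajectories, while the second taxi stays in a single one because a gap of exactly 90 seconds does not start a new sub-trajectory.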


In [14]:
# Print dataframe
all_data

| | Taxi ID | Traj ID | Latitude | Longitude | Date Time |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 37.73515 | -122.40484 | 2008-05-17 10:00:29 |
| 1 | 0 | 0 | 37.72245 | -122.40081 | 2008-05-17 10:01:14 |
| 2 | 0 | 0 | 37.70973 | -122.39541 | 2008-05-17 10:01:56 |
| 3 | 0 | 0 | 37.69660 | -122.39249 | 2008-05-17 10:02:33 |
| 4 | 0 | 0 | 37.68318 | -122.38942 | 2008-05-17 10:03:11 |
| ... | ... | ... | ... | ... | ... |
| 11224276 | 536 | 533 | 37.74332 | -122.39649 | 2008-06-10 05:27:46 |
| 11224277 | 536 | 533 | 37.74996 | -122.39307 | 2008-06-10 05:28:46 |
| 11224278 | 536 | 533 | 37.75201 | -122.39383 | 2008-06-10 05:29:47 |
| 11224279 | 536 | 533 | 37.75161 | -122.39392 | 2008-06-10 05:30:47 |


time: 23.1 ms (started: 2023-11-19 20:45:53 +00:00)


In [15]:
# Save results
SAVE_PATH = '/content/drive/MyDrive/Paper/Files/splitted_data.csv'
all_data.to_csv(SAVE_PATH,index=False)

time: 1min 21s (started: 2023-11-19 20:45:53 +00:00)
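
A CSV file stores the `Date Time` column as plain text, so the datetime dtype has to be restored when the file is read again. The sketch below shows one way to do this with the `parse_dates` parameter of `pd.read_csv`. It uses a small made-up dataframe and an in-memory buffer instead of the file on Google Drive; both are assumptions made to keep the example self-contained.

```python
import io
import pandas as pd

# Small made-up frame standing in for all_data
frame = pd.DataFrame({
    'Taxi ID': [0, 0],
    'Traj ID': [0, 0],
    'Date Time': pd.to_datetime(['2008-05-17 10:00:29', '2008-05-17 10:01:14']),
})

# Write to an in-memory buffer instead of Google Drive for this sketch
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype, which plain CSV text does not preserve
restored = pd.read_csv(buffer, parse_dates=['Date Time'])
print(restored.dtypes)
```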
