### This notebook processes the initial dataset in order to extract useful information from it.

The dataset contains trajectories from 181 users. Only 69 of them have labeled their trajectories as Taxi, Car, Bus, Train, Subway or Walk. We are interested only in vehicle trajectories labeled as <b>Taxi</b>, <b>Car</b> or <b>Bus</b>.

In [2]:
# measure execution time
%load_ext autotime

time: 0 ns (started: 2023-05-03 20:57:46 +03:00)


### Delete all folders that contain unlabeled trajectories. We keep only the labeled trajectories.

Preferred labels are: Car, Bus and Taxi.

In [2]:
import os
import shutil

# path to the folder in which the data are stored
path = 'C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Geolife Trajectories 1.3/Data'

# count the folders that remain after the deletion
counter = 0 

for filename in os.listdir(path):
    if not os.path.isfile(path+'/'+filename+'/labels.txt'):
        shutil.rmtree(path+'/'+filename) # no labels.txt: remove the whole folder
    else:
        counter += 1 # labels.txt exists: keep the folder

print("Number of remaining folders is: ",counter)


Number of remaining folders is:  69
time: 3.45 s (started: 2023-05-02 02:12:30 +03:00)
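
Because `shutil.rmtree` deletes folders permanently, it can be useful to check first which folders would be kept and which removed. The following is a minimal sketch of such a dry run, assuming the same layout as above (one folder per user, with an optional `labels.txt` inside). The helper name `split_labeled_folders` is hypothetical and not part of the original notebook.

```python
import os

def split_labeled_folders(data_path):
    """Return (labeled, unlabeled) lists of folder names found under data_path.

    Nothing is deleted; this only reports what the deletion step would do.
    """
    labeled, unlabeled = [], []
    for name in sorted(os.listdir(data_path)):
        folder = os.path.join(data_path, name)
        if not os.path.isdir(folder):
            continue  # ignore stray files next to the user folders
        if os.path.isfile(os.path.join(folder, 'labels.txt')):
            labeled.append(name)
        else:
            unlabeled.append(name)
    return labeled, unlabeled
```

Calling `split_labeled_folders(path)` before the deletion cell would show the folders that are about to be removed.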


In [3]:
import pandas as pd

for filename in os.listdir(path):
    # read the labels.txt file of each folder
    label = pd.read_csv(path+'/'+filename+'/labels.txt',sep='\t',names=['Start Time','End Time','Transportation Mode'])
    
    # keep only the necessary labels (this also drops the original header row, which is read as data)
    label = label[label['Transportation Mode'].isin(['car','bus','taxi'])]
    
    # overwrite labels.txt in the same folder, keeping only the necessary rows
    path_to_be_saved = path+'/'+filename+'/labels.txt'
    label.to_csv(path_to_be_saved,index=False,sep='\t')

time: 2.8 s (started: 2023-05-02 02:12:34 +03:00)


### Make the trajectory dataset
Add to a new dataframe only the information from the trajectories that we are interested in. This information consists of:

-  <b>File ID:</b> The ID number of the trajectory file from which the GPS point was taken.
-  <b>Latitude:</b> The latitude of the GPS point.
-  <b>Longitude:</b> The longitude of the GPS point.
-  <b>Date Time:</b> The timestamp at which the GPS point was recorded.
-  <b>Label:</b> The transportation mode, with one of the following values: taxi, bus or car.

In [5]:
# create an empty dataframe in which all the data used for our research will be saved
all_data = pd.DataFrame(columns = ['File ID','Latitude','Longitude','Date Time','Label']) 

file_id = 0 # File ID, incremented once per trajectory file (avoids shadowing the built-in id)

for directory in os.listdir(path):

    # read the labels.txt file of the folder (row 0 is the header written in the previous step)
    labels = pd.read_csv(path+'/'+directory+'/'+'labels.txt',sep='\t',names=['Start Time','End Time','Transportation Mode'],skiprows=[0])
    
    if (labels.shape[0] != 0):
        
        # temporary dataframe for the current folder
        directory_data = pd.DataFrame(columns = ['File ID','Latitude','Longitude','Date Time','Label'])

        # convert the timestamps of labels.txt to datetime
        labels['Start Time'] = pd.to_datetime(labels['Start Time'],format='%Y-%m-%d %H:%M:%S.%f')
        labels['End Time'] = pd.to_datetime(labels['End Time'],format='%Y-%m-%d %H:%M:%S.%f')

        for filename in os.listdir(path+'/'+directory+'/'+'Trajectory'): # for each trajectory file

            # read the data file (the first 6 lines are skipped)
            data = pd.read_csv(path+'/'+directory+'/'+'Trajectory'+'/'+filename,skiprows=[0,1,2,3,4,5],sep=',',names=['Latitude','Longitude','Field 0','Altitude','Days Passed','Date','Time'])

            # drop unnecessary columns from the data file
            data.drop(['Days Passed','Altitude','Field 0'],axis=1,inplace=True)

            # join the date and time information into one column
            data['Date Time'] = data['Date']+' '+data['Time']
            data['Date Time'] = pd.to_datetime(data['Date Time'],format='%Y-%m-%d %H:%M:%S.%f')
            data.drop(['Date','Time'],axis=1,inplace=True)

            # add the File ID of the trajectory file
            data.insert(0,'File ID',file_id)

            directory_data = pd.concat([directory_data, data],ignore_index = True)
            
            file_id += 1
        
        # assign each label to the GPS points recorded within its time interval
        for y in range(labels.shape[0]):
            directory_data.loc[(directory_data['Date Time'] >= labels['Start Time'][y]) & (directory_data['Date Time'] <= labels['End Time'][y]),'Label'] = labels['Transportation Mode'][y] 
        
        # drop the GPS points that received no label
        directory_data.dropna(axis=0,inplace=True)
        
        all_data = pd.concat([all_data, directory_data],ignore_index = True)
        
        print("Files in folder "+str(directory)+" have been processed!")

Files in folder 010 have been processed!
Files in folder 020 have been processed!
Files in folder 021 have been processed!
Files in folder 052 have been processed!
Files in folder 053 have been processed!
Files in folder 056 have been processed!
Files in folder 058 have been processed!
Files in folder 062 have been processed!
Files in folder 064 have been processed!
Files in folder 065 have been processed!
Files in folder 067 have been processed!
Files in folder 068 have been processed!
Files in folder 069 have been processed!
Files in folder 073 have been processed!
Files in folder 075 have been processed!
Files in folder 076 have been processed!
Files in folder 078 have been processed!
Files in folder 080 have been processed!
Files in folder 081 have been processed!
Files in folder 082 have been processed!
Files in folder 084 have been processed!
Files in folder 085 have been processed!
Files in folder 086 have been processed!
Files in folder 088 have been proc
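
The labeling step in the cell above can be hard to follow inside the nested loops, so here is a small sketch of the same idea on toy data: a GPS point receives a label when its timestamp falls inside a labeled `[Start Time, End Time]` interval, and points outside every interval are dropped. All timestamps and values below are hypothetical and serve only as an illustration.

```python
import pandas as pd

# toy GPS points (hypothetical timestamps)
points = pd.DataFrame({
    'Date Time': pd.to_datetime([
        '2008-01-01 10:00:00', '2008-01-01 10:05:00',
        '2008-01-01 11:00:00', '2008-01-01 12:30:00',
    ]),
})

# toy label intervals, with the same columns as labels.txt
labels = pd.DataFrame({
    'Start Time': pd.to_datetime(['2008-01-01 09:55:00', '2008-01-01 12:00:00']),
    'End Time': pd.to_datetime(['2008-01-01 10:10:00', '2008-01-01 13:00:00']),
    'Transportation Mode': ['bus', 'car'],
})

points['Label'] = None
for start, end, mode in zip(labels['Start Time'], labels['End Time'], labels['Transportation Mode']):
    # between() is inclusive on both ends, like the >= / <= test in the notebook
    in_interval = points['Date Time'].between(start, end)
    points.loc[in_interval, 'Label'] = mode

# points that fall outside every labeled interval keep no label and are dropped
labeled_points = points.dropna(subset=['Label']).reset_index(drop=True)
```

In this toy example the point recorded at 11:00 lies outside both intervals, so only three labeled points remain.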

### Process the dataset

The dataset contains GPS records from many cities in China. The majority of these records are located in the city of Beijing, and we focus only on them.

In [8]:
# keep only GPS points with latitude >= 39.8 and longitude >= 116.1 (lower bounds used to select the Beijing area)
all_data = all_data[(all_data['Latitude'] >= 39.8) & (all_data['Longitude'] >= 116.1)]

# add a new column that will hold the ID of each trajectory (-1 means not assigned yet)
all_data.insert(1,'Traj ID',-1)

time: 172 ms (started: 2023-05-02 10:27:12 +03:00)
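
Note that the filter above applies only lower bounds, so points far to the north or east of Beijing also pass it. If a closed bounding box is preferred, it could be sketched as follows. The lower bounds are the ones used in the notebook; the upper bounds `LAT_MAX` and `LON_MAX` are illustrative assumptions, not values from the original analysis, and should be adjusted to the area of interest.

```python
import pandas as pd

# lower bounds taken from the notebook; upper bounds are hypothetical placeholders
LAT_MIN, LAT_MAX = 39.8, 40.1
LON_MIN, LON_MAX = 116.1, 116.7

def within_bbox(df):
    """Keep only the rows whose coordinates fall inside the bounding box."""
    mask = (df['Latitude'].between(LAT_MIN, LAT_MAX)
            & df['Longitude'].between(LON_MIN, LON_MAX))
    return df[mask].reset_index(drop=True)

# toy coordinates (hypothetical): only the first point lies inside the box
sample = pd.DataFrame({
    'Latitude': [39.9, 39.5, 39.9, 41.0],
    'Longitude': [116.4, 116.4, 121.5, 116.4],
})
inside = within_bbox(sample)
```

Resetting the index after filtering also keeps later label-based lookups such as `all_data['File ID'][i]` valid.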


In [None]:
# save the data to file
all_data.to_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Geolife Trajectories 1.3/all_original.txt',index=False)

### Split the trajectories based on the time field and File ID

Split the trajectory of each File ID into sub-trajectories wherever the time gap between two consecutive GPS points is too large.

In [3]:
# convert timestamp field to datetime
all_data['Date Time'] = pd.to_datetime(all_data['Date Time'],format='%Y-%m-%d %H:%M:%S.%f')

In [1]:
''' 
Each File ID contains the GPS data of one trajectory file.

If two consecutive GPS points belong to the same File ID (condition 1)
and the time gap between them is at most 15 seconds, (condition 2)
then assign the same Traj ID number to both. (result)

If two consecutive GPS points belong to the same File ID (condition 1)
and the time gap between them is higher than 15 seconds, (condition 2)
then assign a new Traj ID number to the second GPS point. (result)

If two consecutive GPS points belong to different File IDs, (condition)
then restart the Traj ID numbering from 0 at the second GPS point. (result)

'''

# the loop below uses label-based lookups, so it relies on a default 0..n-1 index
all_data = all_data.reset_index(drop=True)

traj_id = 0

for i in range(all_data.shape[0] -1):
    
    if (all_data['File ID'][i+1] == all_data['File ID'][i]): # belong to the same File ID
        
        if (((all_data['Date Time'][i+1])-(all_data['Date Time'][i])).total_seconds() <= 15): # time gap of at most 15 sec
            all_data.at[i,'Traj ID'] = traj_id
            all_data.at[i+1,'Traj ID'] = traj_id
            
        else: # time gap higher than 15 sec
            all_data.at[i,'Traj ID'] = traj_id
            traj_id +=1
            all_data.at[i+1,'Traj ID'] = traj_id
    
    else: # do not belong to the same File ID
        all_data.at[i,'Traj ID'] = traj_id
        traj_id  = 0
        all_data.at[i+1,'Traj ID'] = traj_id

NameError: name 'all_data' is not defined
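
The row-by-row loop above can be slow on a large dataframe. As an alternative, here is a minimal vectorized sketch of the same splitting rule, assuming (as the loop does) that the rows of each File ID are contiguous and in chronological order. The function name `assign_traj_ids` and the parameter `max_gap_seconds` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def assign_traj_ids(df, max_gap_seconds=15):
    """Return a copy of df with a 'Traj ID' column that restarts at 0 for each File ID."""
    out = df.copy()
    # time gap to the previous point of the same File ID (NaT for the first point)
    gaps = out.groupby('File ID')['Date Time'].diff()
    # a new sub-trajectory starts wherever the gap exceeds the threshold
    new_segment = (gaps > pd.Timedelta(seconds=max_gap_seconds)).astype(int)
    # the running count of breaks within each File ID is the Traj ID
    out['Traj ID'] = new_segment.groupby(out['File ID']).cumsum()
    return out

# toy data (hypothetical values): File ID 0 has a 25-second gap after its second point
toy = pd.DataFrame({
    'File ID': [0, 0, 0, 0, 1, 1],
    'Date Time': pd.to_datetime([
        '2008-01-01 10:00:00', '2008-01-01 10:00:05',
        '2008-01-01 10:00:30', '2008-01-01 10:00:35',
        '2008-01-01 11:00:00', '2008-01-01 11:00:10',
    ]),
})
result = assign_traj_ids(toy)
```

On the toy data, the 25-second gap splits File ID 0 into Traj IDs 0 and 1, while File ID 1 starts again at Traj ID 0.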

In [5]:
# save the data (including the Traj ID of the split trajectories)
all_data.to_csv('C:/Users/SK/Desktop/Πτυχιακή/Σύνολα Δεδομένων/Geolife Trajectories 1.3/all_broken_trajectories.txt',index=False)