# Maryland dataset

The Maryland trip file contains three main spreadsheets, namely TripRecordsReport.csv, TripRecordsReportWaypoints.csv and TripsRecordsReportProviderDetails.csv. 


## TripRecordsReport File - Column Definitions 

1. TripID - A trip's unique identifier
2. DeviceID - A device's unique identifier
3. ProviderID - A provider's unique identifier
4. Mode - Mode of transport (0 = walk; 1 = vehicle; 2 = rail)
5. StartDate - The trip's start date and time in UTC, ISO-8601 format
6. StartWDay - The weekday of the trip's start (1 = Sunday, 2 = Monday, etc.)
7. EndDate - The trip's end date and time in UTC, ISO-8601 format
8. EndWDay - The weekday of the trip's end (1 = Sunday, 2 = Monday, etc.)
9. StartLocLat - The decimal degree latitude coordinates of the trip's start point
10. StartLocLon - The decimal degree longitude coordinates of the trip's start point
11. EndLocLat - The decimal degree latitude coordinates of the trip's end point
12. EndLocLon - The decimal degree longitude coordinates of the trip's end point
13. IsStartHome - Is the start location the home of the device?
14. IsEndHome - Is the end location the home of the device?
15. GeospatialType - Type of trip (II - Internal to Internal; IE - Internal to External; EI - External to Internal; EE - External to External)
16. ProviderType - Consumer, Fleet, Mobile
17. ProviderDrivingProfile - The driving class represented by the provider
18. VehicleWeightClass - Vehicle weight class

## TripRecordsReportWaypoints - Column Definitions

1. TripID - A trip's unique identifier
2. WaypointSequence - The order of the waypoint within the trip starting with "1" and incrementing by one
3. CaptureDate - The capture date and time of the waypoint in UTC, ISO-8601 format
4. Latitude - The decimal degree latitude coordinates of the waypoint
5. Longitude - The decimal degree longitude coordinates of the waypoint

The detailed route of each trip is recorded in TripRecordsReportWaypoints.csv.
The file consists of waypoints, each located by latitude and longitude and stamped with a capture time. Waypoints in the same trip share the same TripID, so we can recover the route of a particular trip. With the timestamps, we can further infer the travel time on different road segments.
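
The route and travel-time recovery described above can be sketched with pandas as follows. The sample rows are made up for illustration, and the column names are taken from the definitions above on the assumption that the CSV carries them as headers.

```python
import io
import pandas as pd

# Made-up sample rows using the column names defined above (illustrative only).
sample = io.StringIO(
    "TripID,WaypointSequence,CaptureDate,Latitude,Longitude\n"
    "t1,2,2015-06-01T08:00:30Z,39.2910,-76.6130\n"
    "t1,1,2015-06-01T08:00:00Z,39.2904,-76.6122\n"
    "t1,3,2015-06-01T08:01:15Z,39.2920,-76.6141\n"
    "t2,1,2015-06-01T09:00:00Z,38.9784,-76.4922\n"
    "t2,2,2015-06-01T09:00:20Z,38.9790,-76.4930\n"
)
waypoints = pd.read_csv(sample, parse_dates=["CaptureDate"])

# Order the waypoints of each trip to recover its route.
waypoints = waypoints.sort_values(["TripID", "WaypointSequence"]).reset_index(drop=True)

# Seconds elapsed since the previous waypoint of the same trip (NaN for the first one).
waypoints["SecondsFromPrev"] = (
    waypoints.groupby("TripID")["CaptureDate"].diff().dt.total_seconds()
)

# The route of one trip as an ordered list of [latitude, longitude] pairs.
route_t1 = waypoints.loc[
    waypoints["TripID"] == "t1", ["Latitude", "Longitude"]
].values.tolist()

print(route_t1)
print(waypoints[["TripID", "WaypointSequence", "SecondsFromPrev"]])
```

Mapping these time gaps onto actual road segments would additionally require matching the waypoints to a road network, which this sketch does not attempt.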


## TripsRecordsReportProviderDetails - Column Definitions

1. ProviderId - A provider's unique identifier
2. ProviderType - Consumer, Fleet, Mobile
3. ProviderDrivingProfile - The driving class represented by the provider
4. VehicleWeightClass - Vehicle weight class
5. ProbeSourceType - Source of the probe data (EmbeddedGPS, MobileDevice, Unknown)

The last spreadsheet, TripsRecordsReportProviderDetails.csv, holds detailed information about each data provider.
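
Provider details can be attached to trips by joining on the provider identifier. The sketch below uses made-up rows and assumes the header spellings match the column definitions above, where the key appears as `ProviderID` in the trip file and `ProviderId` in the provider file.

```python
import pandas as pd

# Made-up rows; column names follow the definitions above (illustrative only).
trips = pd.DataFrame({
    "TripID": ["t1", "t2", "t3"],
    "ProviderID": ["p1", "p2", "p1"],
})
providers = pd.DataFrame({
    "ProviderId": ["p1", "p2"],
    "ProbeSourceType": ["EmbeddedGPS", "MobileDevice"],
})

# The key is spelled differently in the two files, so name both sides explicitly.
# A left join keeps every trip even if its provider is missing from the details file.
trips_with_source = trips.merge(
    providers, how="left", left_on="ProviderID", right_on="ProviderId"
).drop(columns="ProviderId")

print(trips_with_source)
```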

### Here we collect some basic statistics on the data

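In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# Open the trip file for chunked reading; header=None means columns are addressed by position.
f = open('D:\\Maryland\\TripRecordsReportTrips.csv', 'r')
reader = pd.read_csv(f, sep=',', iterator=True,header=None)
chunksize = 500
# Read the first 500 rows; the counting loop below continues from the next chunk.
chunk = reader.get_chunk(chunksize)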


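In [6]:
loop=True
device_list = []
provider_list = []
num = 0  # number of chunks processed by the counting loop
# One counter list per attribute, indexed by category position or integer code.
geospatial = [0,0,0,0,0]
pro_type = [0,0,0,0,0,0]
pro_dri_file = [0,0,0,0,0,0]
vehicle_weight = [0,0,0,0,0,0]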


In [8]:
# Tally the four categorical columns chunk by chunk; columns are addressed by
# position because the file was read with header=None.
while loop:
    try:
        chunk = reader.get_chunk(chunksize)
        for i in range(0,chunk.shape[0]):
            # Collecting unique device and provider IDs (disabled).
#             if chunk.iloc[i,1] not in device_list:
#                 device_list.append(chunk.iloc[i,1])
#             if chunk.iloc[i,2] not in provider_list:
#                 provider_list.append(chunk.iloc[i,2])
            # Column 14: GeospatialType.
            if chunk.iloc[i,14]=='II':
                geospatial[0] += 1
            if chunk.iloc[i,14]=='IE':
                geospatial[1] += 1
            if chunk.iloc[i,14]=='EI':
                geospatial[2] += 1
            if chunk.iloc[i,14]=='EE':
                geospatial[3] += 1
            # Columns 15-17 hold integer codes, used directly as list indices.
            pro_type[chunk.iloc[i,15]] += 1
            pro_dri_file[chunk.iloc[i,16]] += 1
            vehicle_weight[chunk.iloc[i,17]] += 1
        num = num+1
    except StopIteration:
        # The reader is exhausted; chunk still holds the last chunk that was read.
        loop = False
        print("Iteration is stopped.")
        print('total trip number', num*chunksize+chunk.shape[0])
        print(chunk.shape[0])


Iteration is stopped.
total trip number 4868083
83


The total number of trip records is 4868083. Four categorical attributes are also tallied in each iteration.
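
A more compact way to collect this kind of tally is to let pandas count each chunk as a whole and accumulate the results in a `Counter`. The sketch below is an alternative under stated assumptions, not the code that produced the numbers in this notebook: the two-column sample is made up, and in the real file GeospatialType sits at position 14 rather than 0.

```python
import io
from collections import Counter

import pandas as pd

# Made-up two-column stand-in for the trip file (illustrative only).
sample = io.StringIO("II,1\nIE,2\nII,2\n,1\nEE,2\n")

counts = Counter()
total_rows = 0
for chunk in pd.read_csv(sample, header=None, chunksize=2):
    total_rows += len(chunk)
    # Label empty cells explicitly so they are counted instead of silently dropped,
    # then tally the whole chunk in one call.
    counts.update(chunk[0].fillna("(empty)").value_counts().to_dict())

print(total_rows)
print(dict(counts))
```

Iterating with `chunksize` processes every chunk, including the first and the final partial one, so the row total needs no separate arithmetic.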

In [9]:
geospatial

[3762062, 485664, 484036, 136258, 0]

## GeospatialType attribute: 

The number of II is 3762062

The number of IE is 485664

The number of EI is 484036

The number of EE is 136258

The rest are empty values.

In [14]:
pro_type

[0, 1428952, 3439068, 0, 0, 0]

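## ProviderType attribute: 

The counts below assume that the codes follow the order in the column definition (1 = Consumer, 2 = Fleet, 3 = Mobile), which is consistent with the fleet count.

The number of consumer is 1428952

The number of fleet is 3439068

The number of mobile is 0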

In [17]:
pro_dri_file

[0, 966578, 41353, 2455800, 1404289, 0]

## ProviderDrivingProfile attribute: 

The number of 'Light Duty Truck/Passenger Vehicle: Ranges from 0 to 14000 lb.' is 966578

The number of 'unknown' is 41353

The number of 'Medium Duty Trucks / Vans: ranges from 14001-26000 lb.' is 2455800

The number of 'Heavy Duty Trucks: > 26000 lb.' is 1404289

In [18]:
vehicle_weight

[0, 1007932, 2454584, 1405504, 0, 0]

## VehicleWeightClass attribute: 

The number of 1 is 1007932

The number of 2 is 2454584

The number of 3 is 1405504

# Research ideas for this dataset

## Traffic prediction based on the travel time information
The whole traffic network can be abstracted as a graph in which each vertex represents a road segment. We can develop an algorithm that predicts the travel time on every road segment from the travel times observed at the last few timestamps. This would be a follow-up to the work of Professor Duffield in ECE, who has developed an algorithm based on clustering and a Graph Signal Processing approach.

## The routing behavior based on trip information
Drivers make different route decisions depending on the trip information. We can build a logistic regression model to simulate these human decisions. Xiaoqiang Kong has already done research on this topic. However, that model is not perfect; we can add more variables and apply a decision tree algorithm, which we expect to perform better.


## The detouring behavior based on the trip information
Since travel time can indicate whether there is congestion, we can study drivers' detour behavior as well. As with routing behavior, we can build a decision tree to infer the human decision.


## The speeding behavior based on the trip information
The travel time information can also be used to determine whether a driver is speeding on a given road segment, so we can study the impact of several factors on speeding behavior.
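
As a minimal sketch of how speeding could be flagged from waypoints, the following computes the average speed between consecutive waypoints using the haversine distance. The sample waypoints and the `speed_limit_kmh` threshold are hypothetical. None of the columns listed above holds a speed limit, so it would have to come from another source, and the straight-line distance between waypoints underestimates the distance driven on a curved road.

```python
import math
from datetime import datetime


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two decimal-degree points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def segment_speeds_kmh(waypoints):
    """Average speed (km/h) between consecutive (iso_time, lat, lon) waypoints."""
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(waypoints, waypoints[1:]):
        seconds = (datetime.fromisoformat(t1) - datetime.fromisoformat(t0)).total_seconds()
        if seconds <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speeds.append(haversine_m(lat0, lon0, lat1, lon1) / seconds * 3.6)
    return speeds


# Made-up waypoints of a single trip, already ordered by WaypointSequence.
trip = [
    ("2015-06-01T08:00:00+00:00", 39.000, -76.000),
    ("2015-06-01T08:00:10+00:00", 39.001, -76.000),
    ("2015-06-01T08:00:20+00:00", 39.004, -76.000),
]
speed_limit_kmh = 105.0  # hypothetical limit, roughly 65 mph

speeds = segment_speeds_kmh(trip)
speeding = [s > speed_limit_kmh for s in speeds]
print([round(s, 1) for s in speeds], speeding)
```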


## The relationship between the familiar route and the driving behavior
Since the dataset tracks each driver's routes over several months, we can distinguish high-frequency trips from the others. We can then examine whether driving behavior, such as speed, differs substantially on familiar routes.
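
One simple way to separate high-frequency trips from the rest is to count how often each device repeats the same origin-destination pair. The sketch below is only one possible definition of a "familiar" trip: the sample trips, the rounding precision, and the `min_repeats` threshold are all assumptions made for illustration.

```python
from collections import Counter


def od_key(trip, digits=2):
    """Device plus start/end coordinates rounded to `digits` decimals (roughly 1 km at 2)."""
    return (
        trip["DeviceID"],
        round(trip["StartLocLat"], digits), round(trip["StartLocLon"], digits),
        round(trip["EndLocLat"], digits), round(trip["EndLocLon"], digits),
    )


# Made-up trips using the column names from TripRecordsReport (illustrative only).
trips = [
    {"DeviceID": "d1", "StartLocLat": 39.2904, "StartLocLon": -76.6122,
     "EndLocLat": 38.9784, "EndLocLon": -76.4922},
    {"DeviceID": "d1", "StartLocLat": 39.2911, "StartLocLon": -76.6108,
     "EndLocLat": 38.9791, "EndLocLon": -76.4915},
    {"DeviceID": "d1", "StartLocLat": 39.2904, "StartLocLon": -76.6122,
     "EndLocLat": 39.4143, "EndLocLon": -77.4105},
    {"DeviceID": "d2", "StartLocLat": 39.2904, "StartLocLon": -76.6122,
     "EndLocLat": 38.9784, "EndLocLon": -76.4922},
]

min_repeats = 2  # hypothetical threshold for calling a trip "familiar"

counts = Counter(od_key(t) for t in trips)
familiar = {key for key, n in counts.items() if n >= min_repeats}
labels = [od_key(t) in familiar for t in trips]
print(labels)
```

Rounding to a grid can split nearby points that fall on opposite sides of a cell boundary, so a spatial clustering step may be a better fit for real data.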