Capstone project for NYC Department of Transportation.
-
Documentation of data processing and Spark
-
Python Data Processing
-
Demonstration: IPython notebooks that demonstrate all the processes
-
Core Modules
- Siri Tools: Modules for Bus Time data retrieval and cleaning
- Time Tools: Custom timedelta converter
- GTFS: Extract the schedule data from GTFS schedules (originally in ZIP format)
- Arrival Time: Estimate the arrival time at each stop using a SciPy KD-tree and interpolation
- Performance Metrics: Calculate common performance measurements for each route, stop, and date.
-
-
- For details, check the Spark folder.
-
Darker means poorer on-time performance for the buses
Open Interactive Map in Carto Map
-
Bus Time data (use siri_tools)
-
Scrape: Query the Bus Time API every 60 seconds and write each JSON response to a local file. For maximum data density, run two independent scrape processes offset by 30 seconds. This also limits the gaps caused by responses that take longer than 30 seconds.
Requirements:
* `MYKEY` file located in the OS working directory, containing a single text string. See the Bus Time documentation for instructions on getting a key.
* `jsons/` directory exists in the OS working directory
-
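The scrape step could be sketched roughly as follows. This is not the siri_tools interface: the endpoint URL, the output filename pattern, and the helper names (`read_key`, `fetch_bustime`, `scrape_once`, `run_scraper`) are assumptions for illustration, so check the Bus Time documentation for the current endpoint.

```python
import os
import time
import urllib.request

# Assumed endpoint for the Bus Time SIRI vehicle-monitoring feed; verify it
# against the Bus Time documentation before relying on it.
BUSTIME_URL = "http://bustime.mta.info/api/siri/vehicle-monitoring.json?key={key}"


def read_key(path="MYKEY"):
    """Read the API key stored as a single text string in the MYKEY file."""
    with open(path) as f:
        return f.read().strip()


def fetch_bustime(key, timeout=30):
    """Request one snapshot of all vehicles and return the raw JSON bytes."""
    with urllib.request.urlopen(BUSTIME_URL.format(key=key), timeout=timeout) as resp:
        return resp.read()


def scrape_once(fetch, out_dir="jsons", now=None):
    """Call fetch() once and write the response to out_dir/<unix time>.json."""
    stamp = int(time.time() if now is None else now)
    path = os.path.join(out_dir, "{}.json".format(stamp))
    with open(path, "wb") as f:
        f.write(fetch())
    return path


def run_scraper(key, interval=60, out_dir="jsons"):
    """Query every `interval` seconds until interrupted."""
    while True:
        started = time.time()
        try:
            scrape_once(lambda: fetch_bustime(key), out_dir, now=started)
        except OSError as err:  # URLError and socket timeouts are OSError subclasses
            print("request failed:", err)
        # Sleep for whatever remains of the interval, never a negative amount.
        time.sleep(max(0.0, interval - (time.time() - started)))
```

Calling `run_scraper(read_key())` in one process and starting a second one 30 seconds later would give the offset described above.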
Parse: Extract useful data elements from each vehicle record in each JSON response file. Parsing takes roughly one second per JSON file, so an entire day's worth of data may take up to 15 minutes. The Spark code is significantly faster.
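As an illustration of the parse step, here is a minimal sketch that flattens one response into a list of records. The JSON key names follow the SIRI VehicleMonitoring layout as we understand it, and both the choice of fields and the function name `parse_vehicles` are assumptions, not the siri_tools implementation.

```python
import json


def parse_vehicles(raw):
    """Return one flat dict per vehicle record in a SIRI VehicleMonitoring response."""
    doc = json.loads(raw)
    deliveries = doc["Siri"]["ServiceDelivery"]["VehicleMonitoringDelivery"]
    rows = []
    for delivery in deliveries:
        for activity in delivery.get("VehicleActivity", []):
            journey = activity["MonitoredVehicleJourney"]
            location = journey.get("VehicleLocation", {})
            call = journey.get("MonitoredCall", {})
            trip = journey.get("FramedVehicleJourneyRef", {})
            rows.append({
                "timestamp": activity.get("RecordedAtTime"),
                "vehicle_id": journey.get("VehicleRef"),
                "route": journey.get("LineRef"),
                "trip_id": trip.get("DatedVehicleJourneyRef"),
                "lat": location.get("Latitude"),
                "lon": location.get("Longitude"),
                "next_stop_id": call.get("StopPointRef"),
            })
    return rows
```

Missing optional fields come back as `None` rather than raising, so incomplete vehicle records can be filtered out in the cleaning step.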
-
Clean: Using schedule data as the "truth" source, filter extracted and parsed Bus Time data to exclude any records where the reported "next stop" is invalid for the reported `trip_id`.
-
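The cleaning rule can be sketched as a lookup against the GTFS `stop_times.txt` file. The function names and the record keys (`trip_id`, `next_stop_id`) are hypothetical, and the sketch assumes the Bus Time IDs are already in the same form as the GTFS values.

```python
import csv


def load_valid_stops(stop_times_path):
    """Map each trip_id to the set of stop_ids it serves, from GTFS stop_times.txt."""
    valid = {}
    with open(stop_times_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            valid.setdefault(row["trip_id"], set()).add(row["stop_id"])
    return valid


def clean_records(records, valid_stops):
    """Keep only records whose next stop appears in the schedule for their trip."""
    return [
        r for r in records
        if r["next_stop_id"] in valid_stops.get(r["trip_id"], ())
    ]
```

Records with an unknown `trip_id` are dropped as well, since no stop can be valid for a trip that the schedule does not contain.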
Schedule data
-
Download: Static feeds of the current schedule data for each borough (plus the MTA Bus Company) are available directly from the MTA. Historical feeds are available through a third-party open-source project. A shell script that downloads all previous feeds in one batch can be found in the [Bus Viz GitHub repository](https://github.com/efranco63/NYU_USI_BusViz/blob/master/TransitFeeds/fetch.sh).
-
Generate metadata (list of date ranges): Use method `gtfs.build_metadata(dpath)` to generate a small text file within each subdirectory of `dpath` that lists the valid date ranges of each included feed. This is necessary since schedule data changes periodically, so any schedule-comparison analysis must use only data extracted from the corresponding concurrent feed.
Requirements:
* All downloaded transit feed files must be in their original standard format (zip)
* Each feed gets its own subdirectory, containing current and prior feeds
Example directory structure for GTFS data:

    gtfs/
        80_brooklyn/
            metadata.txt
            gtfs_brooklyn_1383136207.zip
            gtfs_brooklyn_1419914436.zip
            gtfs_brooklyn_1386879331.zip
            gtfs_brooklyn_20150402.zip
        82_manhattan/
        84_staten_island/
        81_bronx/
        83_queens/
        85_bus_company/
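
To illustrate where a feed's valid date range can come from, the sketch below reads `calendar.txt` directly from a feed zip. It is not the implementation of `gtfs.build_metadata`; the function name `feed_date_range` is hypothetical, and the sketch assumes each feed defines its service periods in `calendar.txt` (some GTFS feeds rely on `calendar_dates.txt` instead).

```python
import csv
import io
import zipfile


def feed_date_range(zip_path):
    """Return (earliest start_date, latest end_date) from calendar.txt as YYYYMMDD strings."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("calendar.txt") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            rows = list(csv.DictReader(text))
    if not rows:
        raise ValueError("calendar.txt has no service rows: {}".format(zip_path))
    starts = [row["start_date"] for row in rows]
    ends = [row["end_date"] for row in rows]
    # YYYYMMDD strings sort chronologically, so min/max work without parsing.
    return min(starts), max(ends)
```

Running this over every zip in a feed subdirectory would yield the list of date ranges that a metadata file needs.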
-
Stop time estimation
-
Recommended: Linear interpolation (see demonstration notebook)
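
The idea behind linear interpolation can be sketched in a few lines of plain Python. This is a simplified stand-in for the notebook's approach, assuming each bus observation has already been converted to a distance along the route; the function name and input layout are hypothetical.

```python
from bisect import bisect_left


def interpolate_arrival(pings, stop_dist):
    """Estimate when a bus reached stop_dist.

    pings: list of (time_seconds, distance_along_route) tuples sorted by
    time, with non-decreasing distance. Returns None if the stop lies
    outside the observed distance range.
    """
    dists = [d for _, d in pings]
    if not pings or stop_dist < dists[0] or stop_dist > dists[-1]:
        return None
    # First observation at or beyond the stop.
    i = bisect_left(dists, stop_dist)
    t1, d1 = pings[i]
    if d1 == stop_dist:
        return float(t1)
    # Previous observation is strictly before the stop, so d1 > d0.
    t0, d0 = pings[i - 1]
    frac = (stop_dist - d0) / (d1 - d0)
    return t0 + frac * (t1 - t0)
```

For example, with observations at 300 m (t = 60 s) and 900 m (t = 120 s), a stop at 600 m lies halfway between them, so the estimate is 90 s.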
-
Alternative: Spatial search (see demonstration script)
-
Performance metrics
-
Recommended: Generate a single measurement for each route at the stop with the most data (see demonstration notebook)
-
Alternative: Batch process metrics for all stops, routes and dates before filtering and analyzing (use metrics.py)
Example: `python metrics.py dec2015_interpolated.csv gtfs/ dec2015_metrics.csv`
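
As one example of a metric grouped by route, stop, and date, the sketch below computes on-time performance. It does not reproduce `metrics.py`: the row keys are hypothetical, and the on-time window (up to 1 minute early, up to 5 minutes late) is an assumed convention that should be replaced with whatever definition the analysis requires.

```python
from collections import defaultdict


def on_time_performance(rows, early=60, late=300):
    """Fraction of arrivals within [-early, +late] seconds of schedule, per group.

    rows: dicts with "route", "stop_id", "date", and "scheduled"/"actual"
    arrival times in seconds. The default window is an assumption.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [on_time, total]
    for row in rows:
        key = (row["route"], row["stop_id"], row["date"])
        delay = row["actual"] - row["scheduled"]
        counts[key][0] += -early <= delay <= late
        counts[key][1] += 1
    return {key: on_time / total for key, (on_time, total) in counts.items()}
```

Each key in the result is a `(route, stop_id, date)` tuple, which makes it straightforward to filter to a single stop per route afterwards.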