# Discovering and Analysing GTFS

## Lab Goals
In this lab you will explore one of the most common data formats in public transit: General Transit Feed Specification (GTFS). Specifcally, we will be looking at a static GTFS feed from Calgary, Canada to calculate some basic attributes about the system and learn how this data source is structured. By the end of this lab you should be able to:

* Read in multiple tables from a provided GTFS source and join them together where necessary.
* Convert the "time" portions of a GTFS feed into something useful for temporal analysis.
* Answer some questions about basic system characteristics based on your calculations.

You can find a helpful reference for GTFS-static at [GTFS.org](https://gtfs.org/reference/static).

## Reading and Exploring Files

Let's start by reading in the folders and learning a little bit about how our data is structured.


In [2]:
import os
import pandas as pd
import datetime as dt

data_dir = "../data/calgary_gtfs_2022-08-17"

# Reading in a bunch of tables so we can explore them.
agency = pd.read_csv(os.path.join(data_dir, "agency.txt"))
calendar = pd.read_csv(os.path.join(data_dir, "calendar.txt"))
calendar_dates = pd.read_csv(os.path.join(data_dir, "calendar_dates.txt"))
stops = pd.read_csv(os.path.join(data_dir, "stops.txt"))
routes = pd.read_csv(os.path.join(data_dir, "routes.txt"))
trips = pd.read_csv(os.path.join(data_dir, "trips.txt"))
stop_times = pd.read_csv(os.path.join(data_dir, "stop_times.txt"))

In [25]:
# Explore the data using this cell

### Exercise: How many trips were run on August 15, 2022?

In [7]:
# Let's find the service_ids which fall within the start/end dates and run on a monday.
service_ids = calendar[(calendar.start_date >= 20220815) & (calendar.end_date >= 20220815) & (calendar.monday == 1)]
service_ids
# Now we simply filter the trips that fall into that particular service ID
trips[trips.service_id.isin(service_ids.service_id)].shape[0]

19640

## Working with times in GTFS

There are two particularities with the GTFS when it comes to the way times are managed:
* Many operators include trips that run after midnight as part of the previous day's service, and so the GTFS allows for times greater that 24 hours to account for this.
* GTFS schedules are designed to be valid for long periods of time without specifying each day. It's important to first filter trips to appropriate service days and to account for any service adjustments using both `calendar.txt` and `calendar_dates.txt`.

As an example, let's answer the question: **What was the total span of service on August 24, 2022?**

In [25]:
base_date = dt.datetime(2022, 8, 24)

# Let's convert the datetimes into a specific date
stop_times['arrival_dt'] = base_date + pd.to_timedelta(stop_times['arrival_time'])

# Now let's get the trips we need for that particular date and filter down.
service_ids = calendar[(calendar.start_date >= 20220824) & (calendar.end_date >= 20220824) & (calendar.wednesday == 1)]
trip_ids = trips[trips.service_id.isin(service_ids.service_id)]

first_trip = stop_times[stop_times.trip_id.isin(trip_ids.trip_id)].arrival_dt.min()
last_trip = stop_times[stop_times.trip_id.isin(trip_ids.trip_id)].arrival_dt.max()

print("First trip:", first_trip.strftime("%b %d at %H:%M:%S"))
print("Last trip:", last_trip.strftime("%b %d at %H:%M:%S"))
print("Total Service Span:", last_trip - first_trip)

First trip: Aug 24 at 03:42:00
Last trip: Aug 25 at 02:21:00
Total Service Span: 0 days 22:39:00


## Exercises
* How many trips are scheduled to run on November 11, 2022? 
* What is the longest length route that Calgary Transit runs?
* What is the scheduled average weekday morning peak period headway on routes 1-10 (inclusive)?
* What is the service span for routes numbered in the 300s?