# Melbourne Transit System Data PTV

15/7/2022

This project analyzes the State of Victoria's public transport system, Public Transport Victoria (PTV).

## Data source

- Source: https://discover.data.vic.gov.au/dataset/timetable-and-geographic-information-gtfs
- Download link: http://data.ptv.vic.gov.au/downloads/gtfs.zip


In [4]:
import urllib.request 
import os
import datetime
import zipfile

In [2]:
# download the gtfs.zip file

# The data can be downloaded from this source: https://discover.data.vic.gov.au/dataset/timetable-and-geographic-information-gtfs

# download link
url = 'http://data.ptv.vic.gov.au/downloads/gtfs.zip'

# current time
# The gtfs.zip file is updated regularly, so each download may be a different version.
# Use the download time as the folder name to distinguish the versions of the downloaded files.
current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# the name of the downloaded zip file
download_to_filename = 'gtfs.zip'

# the folder to store the downloaded file
download_to_folder = os.path.join(os.getcwd(), 'downloads', current_time)

# create the folder if it does not exist yet
os.makedirs(download_to_folder, exist_ok=True)

download_to_filepath = os.path.join(download_to_folder, download_to_filename)

# download the file
urllib.request.urlretrieve(url, download_to_filepath)


('d:\\Workspace\\Programming\\Melbourne\\PTV\\downloads\\20230805_030129\\gtfs.zip',
 <http.client.HTTPMessage at 0x2530de4cf90>)

In [None]:
# # extract the gtfs.zip file and all zip files inside the gtfs.zip file

# # First, extract the gtfs.zip file

# # set the gtfs.zip file that you want to extract
# # you can omit this if `download_to_filepath` has been defined
# download_to_filepath = 'D:/Workspace/Melbourne/PTV/downloads/20220403_025040/gtfs.zip'

# # set the folder that you want to extract the files to
# download_to_folder = 'downloads/20220403_025040' # you can omit this if `download_to_folder` has already been defined


In [5]:

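# mirror the download folder under 'data', keeping the same timestamp as the folder name
# (a string replace on the full path could hit an earlier 'downloads' component of the path)
extract_to_folder = os.path.join('data', os.path.basename(os.path.normpath(download_to_folder)))
extract_to_folderpath = os.path.join(os.getcwd(), extract_to_folder)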


# extract the file
with zipfile.ZipFile(download_to_filepath, 'r') as zip_ref:
    zip_ref.extractall(path = extract_to_folderpath)

# Next, extract all zip files inside the gtfs.zip file

# this flag records whether any zip files are left to be extracted
zip_files_still_exist = True

# this list records every zip file that has already been extracted
zip_files = []

while(zip_files_still_exist):

    zip_files_still_exist = False

    for dirpath, dirnames, filenames in os.walk(extract_to_folderpath):

        for f in filenames:

            fpath = os.path.join(dirpath, f)
                
            if(f.endswith(".zip") and fpath not in zip_files):

                zip_files.append(fpath)

                zip_files_still_exist = True

                # the folder path to extract the zip file to
                # str.rstrip(".zip") strips trailing characters, not the suffix, so use splitext instead
                folder_to_extract_to = os.path.splitext(fpath)[0]

                with zipfile.ZipFile(fpath, 'r') as zip_ref:
                    zip_ref.extractall(path=folder_to_extract_to)
                
                # we can delete the zip file once the zip file is extracted, in order to reduce file redundancy 
                os.remove(fpath)
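
As a quick check after extraction, the sketch below lists which `.txt` tables ended up under each top-level subfolder of the extraction folder. The helper name `list_extracted_tables` is hypothetical, and the sketch only assumes that the GTFS tables are `.txt` files somewhere beneath those subfolders; it does not assume any particular folder names.

```python
import os

def list_extracted_tables(root):
    """Map each top-level subfolder of `root` to the sorted .txt files found beneath it."""
    tables = {}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # skip files sitting directly in the root folder
        top = rel.split(os.sep)[0]  # name of the top-level subfolder
        for name in filenames:
            if name.endswith(".txt"):
                tables.setdefault(top, []).append(name)
    return {top: sorted(names) for top, names in tables.items()}
```

Calling `list_extracted_tables(extract_to_folderpath)` after the extraction cell should give one entry per extracted subfolder, which makes a missing or empty archive easy to spot.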

## Data explanation

The meaning of each table and field is documented in the GTFS reference: https://developers.google.com/transit/gtfs/reference

### Services

| Number | Service Type |
| - | - |
| `1` | Regional Train |
| `2` | Train |
| `3` | Tram |
| `4` | Bus |
| `5` | Regional Coach |
| `6` | Regional Bus |
| `7` | |
| `8` | |
| `10` | Airport SkyTrain |
| `11` | Airport SkyBus |

### Tables

#### Primary keys

The primary keys listed below are not official; I inferred them from my own analysis of the tables. A column (or combination of columns) is treated as a primary key if its values are unique within the table.

| Table | Keys |
| - | - |
| `agency` | `agency_id` |
| `calendar`| `service_id` |
| `calendar_dates`| `service_id`, `date` |
| `routes`| `route_id` |
| `shapes`| `shape_id`, `shape_pt_sequence` |
| `stops`| `stop_id` |
| `stop_times`| `trip_id` , `stop_sequence` |
| `trips` | `trip_id` |
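
The uniqueness check described in this section can be sketched with the standard library alone. This is not necessarily how the keys above were derived; the function name `is_candidate_key` is hypothetical, and the sketch assumes each table is a comma-separated text file with a header row.

```python
import csv

def is_candidate_key(csv_path, key_columns):
    """Return True if no two rows share the same values in `key_columns`."""
    seen = set()
    # utf-8-sig reads plain UTF-8 and also tolerates an optional byte-order mark
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            key = tuple(row[col] for col in key_columns)
            if key in seen:
                return False  # duplicate found, so the columns are not unique
            seen.add(key)
    return True
```

For example, passing the path of an extracted `stop_times.txt` with `["trip_id", "stop_sequence"]` tests the composite key from the table. Uniqueness in one download only suggests a key; it does not guarantee that the same columns stay unique in later versions of the feed.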