# "Wiener Linien" Minute

Anyone who has used public transport in Vienna has almost certainly travelled with the Wiener Linien.

And if they paid close attention to the departure monitors, they might have noticed inconsistencies between the displayed arrival times and the actual arrival times. Sometimes the monitor says the next train will "arrive in 4 minutes" for over two minutes, and other times the "4 minute" arrival time is displayed for less than half a minute. This is because the departure monitors don't display arrival times according to the schedule, but instead estimate them from the current position of the train, tram, bus, etc.

So the question any reasonable passenger is asking is: How accurate are these predictions, and are there any patterns one can observe?

## Data acquisition

Fortunately, there is no need to sit at a subway station and take timings by hand: the Wiener Linien provide the departure monitors' data publicly and for free [via their open data API](https://www.data.gv.at/katalog/dataset/wiener-linien-echtzeitdaten-via-datendrehscheibe-wien/resource/ce15bc47-696f-4ff8-81f3-4ee0737d95de).

For this project, I implemented a data-scraper script that periodically pulls the data for a certain station from the API. It can be run with the `data_scraper_deploy.py` script from the command line, for example:

`python3 data_scraper_deploy.py >> scrape.log` to get logging output to scrape.log

(Here are the contents of the `data_scraper_deploy.py` file for context):

```{python}
import wienerLinien.datascraper as ds

scraper = ds.Scraper(outFile="scrape.csv", station="Längenfeldgasse", mode="a")
scraper.run(None) # run indefinitely
```

For this project the data scraper was run for multiple weeks with some interruptions, but more on the dataset later. Requests were sent every 10 to 15 seconds (started with 10 s and switched to 15 s after encountering rate limitations).
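
The polling itself happens inside the `Scraper` class and is not shown here. As a rough sketch of the idea only, a loop that requests data at a fixed interval and waits longer after a rate-limited request could look like the following. The names `poll`, `RateLimited`, `fetch` and `handle` are hypothetical and not part of the wienerLinien package; `fetch` stands in for whatever function actually calls the API. Note that the scraper used for this project simply switched from 10 s to 15 s, whereas this sketch doubles the wait once after each rate-limited request.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical signal that the API refused a request due to rate limiting."""


def poll(fetch: Callable[[], object],
         handle: Callable[[object], None],
         interval: float = 15.0,
         max_requests: Optional[int] = None,
         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call fetch() repeatedly and pass each result to handle().

    Waits `interval` seconds after a successful request and twice as long
    after a rate-limited one. max_requests=None polls indefinitely.
    Returns the number of successful requests.
    """
    successful = 0
    attempts = 0
    while max_requests is None or attempts < max_requests:
        attempts += 1
        try:
            handle(fetch())
            successful += 1
            wait = interval
        except RateLimited:
            wait = 2 * interval  # back off before trying again
        sleep(wait)
    return successful
```

Passing `sleep` as a parameter makes the loop easy to try out without actually waiting, for example by recording the requested waits in a list.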

## Data preparation

Data preparation was the most challenging part of the task. The initial dataframe looks something like this:

In [1]:
import pandas as pd

dataset = pd.read_csv('./data/scrape_combined_2021-12-29.csv.gz')
dataset

| | station | line | towards | countdown | time |
|---|---|---|---|---|---|
| 0 | Längenfeldgasse U | 12A | Eichenstraße | [6, 13, 24, 34, 44, 54, 64] | 1.637752e+09 |
| 1 | Längenfeldgasse U | 12A | Schmelz, Gablenzgasse | [1, 12, 21, 30, 40, 50, 60] | 1.637752e+09 |
| 2 | Flurschützstraße, Längenfeldgasse | 62 | Lainz, Wolkersbergenstraße | [3, 11, 17, 25, 32, 40, 47, 55, 62] | 1.637752e+09 |
| 3 | Flurschützstraße, Längenfeldgasse | 62 | Oper, Karlsplatz U | [0, 8, 14, 25, 30, 38, 45, 53, 60, 68] | 1.637752e+09 |
| 4 | Flurschützstraße / Längenfeldgasse | 63A | Am Rosenhügel | [0, 12, 22, 32, 42, 52, 59, 67] | 1.637752e+09 |
| ... | ... | ... | ... | ... | ... |
| 2250410 | Längenfeldgasse | U4 | HÜTTELDORF | [0, 5, 11, 19, 26, 34, 41, 49, 56, 64] | 1.640814e+09 |
| 2250411 | Längenfeldgasse | U6 | FLORIDSDORF | [7, 16, 21, 29, 36, 44, 51, 59, 66] | 1.640814e+09 |
| 2250412 | Längenfeldgasse | U6 | SIEBENHIRTEN | [4, 12, 19, 26, 34, 41, 49, 56, 64] | 1.640814e+09 |
| 2250413 | Flurschützstraße, Längenfeldgasse | WLB | Wien Oper | [9, 24, 39, 54, 69] | 1.640814e+09 |


As seen above, the dataset consists of a table containing the station, line, direction, countdowns for the next few arrivals and the time when the data was pulled.

We can also see that the dataset consists of over 2 million entries, beginning on November 24th, 2021, with the last entry on December 29th, 2021.
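
To make the column formats concrete, here is a small sketch that parses two rows copied from the table above using only the standard library. It assumes that the `countdown` column is stored as the text form of a Python list and that `time` is a Unix timestamp in seconds, which is what the output suggests. The helper name `parse_row` is made up for this example, and the timestamps are the rounded values as displayed, not the exact ones in the file.

```python
import ast
import csv
import io
from datetime import datetime, timezone

# Two rows copied from the table above (time values rounded as displayed).
sample = '''station,line,towards,countdown,time
Längenfeldgasse U,12A,Eichenstraße,"[6, 13, 24, 34, 44, 54, 64]",1.637752e+09
Längenfeldgasse,U4,HÜTTELDORF,"[0, 5, 11, 19, 26, 34, 41, 49, 56, 64]",1.640814e+09
'''


def parse_row(row):
    """Turn the raw CSV strings into a list of ints and a UTC datetime."""
    return {
        "station": row["station"],
        "line": row["line"],
        "towards": row["towards"],
        # "[6, 13, 24]" -> [6, 13, 24]; literal_eval only accepts literals
        "countdown": ast.literal_eval(row["countdown"]),
        # seconds since the Unix epoch -> timezone-aware datetime
        "time": datetime.fromtimestamp(float(row["time"]), tz=timezone.utc),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(sample))]
for r in rows:
    print(r["time"].date(), r["line"], r["towards"], r["countdown"][:2])
```

The two parsed timestamps fall on 2021-11-24 and 2021-12-29 (UTC), which matches the date range stated above.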

To get any meaningful data out of this, the data first had to be separated into station/line/towards combinations. Then, for each distinct combination of interest, the vehicles (trains, trams, buses, etc.) had to be tracked using the countdown data. This is especially tricky because a vehicle's position within the countdown array changes over time, and sometimes trains are simply left out (this predominantly happens at the end of the countdown list, while the first few positions are almost never affected). In addition, the API sometimes rate-limited the requests, and there were a few occurrences where the dataset was interrupted for minutes, hours or days (for example because of an unhandled exception or an unscheduled server restart). Measures were taken to filter out these bad samples as well as possible.

The resulting logic was implemented in the "wienerLinien" package, more specifically in the "apiData" class which is part of the "analyse" package.
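
The package's actual tracking algorithm is not reproduced here. As a deliberately simplified illustration of the underlying idea, the sketch below only looks at the first countdown entry and measures how long each value stays on display before it changes. The function name `display_durations` and the sample values are invented for this example. It ignores everything that makes the real problem hard (dropped vehicles, gaps in the data, deeper list positions).

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, List[int]]  # (unix time in seconds, countdown list)


def display_durations(samples: Iterable[Sample]) -> List[Tuple[int, float, float, float]]:
    """Return (countdown, start, end, dt) for each run of an unchanged first entry.

    A run ends when the first countdown value changes. The run that is still
    open when the samples end is not reported, since its end is unknown.
    """
    runs = []
    current = None
    start = None
    for t, countdown in samples:
        if not countdown:
            continue  # nothing displayed in this sample
        lead = countdown[0]
        if lead != current:
            if current is not None:
                runs.append((current, start, t, t - start))
            current, start = lead, t
    return runs


# Invented samples taken every 15 seconds.
samples = [(0, [4, 9]), (15, [4, 9]), (30, [3, 8]),
           (45, [3, 8]), (60, [3, 8]), (75, [2, 7])]
print(display_durations(samples))
```

In this toy input the "4 minutes" display lasts 30 seconds and the "3 minutes" display lasts 45 seconds. Keep in mind that the first run may have started before the first sample, so its duration is only a lower bound.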

So let's jump straight into the dataset...

### Loading the dataset

The dataset is loaded by simply instantiating a new `apiData` object, passing the dataset file as the argument:

In [2]:
from wienerLinien.analyse import apiData

api_file = './data/scrape_combined_2021-12-29.csv.gz'
#api_file = './data/scrape_subset_last5k.csv.gz' # small subset for debugging, also change in next code cell if using this dataset
if __name__ == '__main__':
    data_class = apiData(api_file)

### View and select station/line/direction pairs

After loading the dataset, we need to first view the available station/line/direction combinations and then select the ones we are interested in.

This can be done with the `getAvailable()` method, which neatly lists all present combinations including the number of occurrences per combination. The function also performs some rudimentary filtering to exclude most special announcements made via the departure monitors.

In [3]:
data_class.getAvailable().head(10)

| | station | line | towards | count |
|---|---|---|---|---|
| 295 | Längenfeldgasse | U6 | SIEBENHIRTEN | 177221 |
| 225 | Längenfeldgasse | U6 | FLORIDSDORF | 175275 |
| 71 | Längenfeldgasse | U4 | HÜTTELDORF | 175096 |
| 1 | Flurschützstraße / Längenfeldgasse | 63A | Gesundheitszentrum Süd | 175067 |
| 0 | Flurschützstraße / Längenfeldgasse | 63A | Am Rosenhügel | 174809 |
| 30 | Längenfeldgasse | U4 | HEILIGENSTADT | 172561 |
| 14 | Flurschützstraße, Längenfeldgasse | 62 | Oper, Karlsplatz U | 171747 |
| 12 | Flurschützstraße, Längenfeldgasse | 62 | Lainz, Wolkersbergenstraße | 166386 |
| 304 | Längenfeldgasse U | 12A | Schmelz, Gablenzgasse | 166263 |
| 22 | Flurschützstraße, Längenfeldgasse | WLB | Wien Oper | 156843 |
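
Counting occurrences per combination is conceptually simple. The following sketch shows one way to do it with the standard library. It is not the implementation of `getAvailable()`, it leaves out the filtering of special announcements, and the name `count_combinations` as well as the three input rows are made up for illustration.

```python
from collections import Counter


def count_combinations(rows):
    """Count how often each (station, line, towards) triple occurs.

    Returns a list of ((station, line, towards), count) pairs,
    most frequent first.
    """
    counts = Counter((r["station"], r["line"], r["towards"]) for r in rows)
    return counts.most_common()


# Invented miniature input; the real dataset has over 2 million rows.
rows = [
    {"station": "Längenfeldgasse", "line": "U6", "towards": "SIEBENHIRTEN"},
    {"station": "Längenfeldgasse", "line": "U4", "towards": "HEILIGENSTADT"},
    {"station": "Längenfeldgasse", "line": "U6", "towards": "SIEBENHIRTEN"},
]
for combination, count in count_combinations(rows):
    print(count, combination)
```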


In this example we're interested in the U4 towards Heiligenstadt, as it leads directly to BOKU, and in tram 62 towards Lainz, because it leads to the Lainzer Tiergarten and it's quite nice there.

In [4]:
chosen = data_class.getAvailable().loc[[30, 12]][['station', 'line', 'towards']].values
#chosen = data_class.getAvailable().loc[[0, 1]][['station', 'line', 'towards']].values # small subset for debugging
chosen

array([['Längenfeldgasse', 'U4', 'HEILIGENSTADT'],
       ['Flurschützstraße, Längenfeldgasse', '62',
        'Lainz, Wolkersbergenstraße']], dtype=object)

### Track departures over time

The next step is to transform the data in our dataset into something useful. This is done with the `trackMany()` method. It takes a set of chosen station/line/towards combinations and tries to track the arriving vehicles over time using the countdown list and an algorithm I came up with.

The `depth` parameter is the number of leading entries in the countdown list to take into account. The `max_diff` parameter describes how many seconds may pass between consecutive entries in the dataset for them to still be considered cohesive. Parallel processing is also implemented: when more than one combination is included in the `which` list, multiple processes are spawned using the multiprocessing library. This does not work on Windows (at least not in my installation) and is thus limited to Linux (and possibly OS X devices, which I have not tested).

Because of the size of the dataset, this operation may take several minutes. Also, take note that in order to reduce clutter, I opted to not display progress bars for some finalizing calculations. It therefore may take some additional time (approx. 2-3 minutes) per track to finish the calculations after the progress bar is complete.
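
To illustrate what the `depth` and `max_diff` parameters described above mean in practice, here is a minimal sketch. It is an assumption about how such parameters can be applied, not the code inside `trackMany()`, and the helper names `leading_entries` and `split_cohesive` are hypothetical.

```python
from typing import List, Sequence


def leading_entries(countdown: Sequence[int], depth: int = 2) -> List[int]:
    """Keep only the first `depth` countdown entries, which are the most reliable."""
    return list(countdown[:depth])


def split_cohesive(times: Sequence[float], max_diff: float = 16.0) -> List[List[float]]:
    """Split sorted timestamps into segments without gaps longer than max_diff seconds."""
    segments: List[List[float]] = []
    for t in times:
        if segments and t - segments[-1][-1] <= max_diff:
            segments[-1].append(t)  # close enough to the previous sample
        else:
            segments.append([t])    # gap too large: start a new segment
    return segments


print(leading_entries([3, 11, 17, 25], depth=2))
print(split_cohesive([0, 15, 30, 100, 115, 400], max_diff=16))
```

With a request interval of 15 seconds, a `max_diff` of 16 tolerates normal jitter but starts a new segment as soon as a single request is missing.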

Because of the time it takes to perform these calculations, the resulting object is also pickled. You may choose if you want to use the pickled data or calculate the data yourself at the beginning of the next code cell:

In [5]:
import dill as pickle
import platform
import os
import gzip

processes = 4
lazyLoad = False # If True, try to retrieve an already calculated object from the storage_file
pickling = True # If True, save the result to the storage_file. This option is ignored when lazyLoad is enabled.

storage_file = 'datastore.gz'

if lazyLoad:
    # use a context manager so the file handle is closed after loading
    with gzip.open(storage_file, mode='rb') as s:
        data_class = pickle.load(s)

else:
    if pickling:
        if os.path.isfile(storage_file):
            confirm = input(f'File {storage_file} already exists. Do you really want to overwrite it? (y/n)')

            if confirm == 'y':
                pass
            else:
                print('Aborting...')
                raise KeyboardInterrupt

    if platform.system() == 'Windows':
        processes = 0

    data_class.trackMany(which=chosen, depth=2, max_diff=16, multithreaded=processes)

    if pickling:
        with gzip.open(storage_file, 'wb') as s:
            pickle.dump(data_class, s)

Starting to process ['Längenfeldgasse' 'U4' 'HEILIGENSTADT']


Processing: ['Längenfeldgasse' 'U4' 'HEILIGENSTADT']:   0%|          | 0/172561 [00:00<?, ?it/s]

Starting to process ['Flurschützstraße, Längenfeldgasse' '62' 'Lainz, Wolkersbergenstraße']


Processing: ['Flurschützstraße, Längenfeldgasse' '62' 'Lainz, Wolkersbergenstraße']:   0%|          | 0/166386…
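
The caching pattern used in the cell above can be reduced to a small, self-contained sketch. It uses the standard library `pickle` instead of `dill` (which offers the same `dump`/`load` interface), and the helper names `save`, `load` and `load_or_compute` are made up for this example. Unlike the cell above, it falls back to computing the result when no stored file exists.

```python
import gzip
import os
import pickle
import tempfile


def save(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load(path):
    """Load a pickled object from a gzip-compressed file."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


def load_or_compute(path, compute, lazy_load=True):
    """Return the stored result if allowed and present, else compute and store it."""
    if lazy_load and os.path.isfile(path):
        return load(path)
    result = compute()
    save(result, path)
    return result


# Demonstration in a temporary directory with a stand-in for the slow calculation.
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "datastore.gz")
    first = load_or_compute(store, lambda: {"tracked": [1, 2, 3]})
    second = load_or_compute(store, lambda: {"tracked": "not used"})
    print(first == second)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.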

## Data analysis


In [7]:
data_class.fetchResults(which=chosen[1])

| | countdown | start | end | complete | hour | warning | dt |
|---|---|---|---|---|---|---|---|
| 9 | 8.0 | 1.637752e+09 | 1.637752e+09 | True | 12.0 | False | 10.382459 |
| 10 | 7.0 | 1.637752e+09 | 1.637752e+09 | True | 12.0 | False | 62.411744 |
| 11 | 6.0 | 1.637752e+09 | 1.637752e+09 | True | 12.0 | False | 41.446254 |
| 12 | 5.0 | 1.637752e+09 | 1.637752e+09 | True | 12.0 | False | 31.249654 |
| 15 | 13.0 | 1.637752e+09 | 1.637752e+09 | True | 12.0 | False | 41.648847 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 83378 | 5.0 | 1.640629e+09 | 1.640629e+09 | True | 19.0 | False | 10.313045 |
| 83396 | 2.0 | 1.640629e+09 | 1.640630e+09 | True | 19.0 | False | 41.731123 |
| 87795 | 13.0 | 1.640767e+09 | 1.640767e+09 | True | 9.0 | False | 15.339216 |
| 87799 | 11.0 | 1.640767e+09 | 1.640767e+09 | True | 9.0 | False | 15.341712 |
