# VR train data analysis

Lauri Vuorenkoski

Data description: https://www.digitraffic.fi/en/railway-traffic/

Data for a single day: https://rata.digitraffic.fi/api/v1/trains/2017-11-09

VR itself appears to publish only current punctuality data:
https://www.vr.fi/en/train-traffic-at-the-moment

In [1]:
import urllib.request, gzip, json, pytz
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

In [2]:
datalink = 'https://rata.digitraffic.fi/api/v1/'

dateformat = '%Y-%m-%dT%H:%M:%S.%fZ'
utcTimezone = pytz.timezone('UTC')
localTimezone = pytz.timezone('Europe/Helsinki')

In [3]:
date = "2020-11-13"
#date = localTimezone.localize(datetime.now()-timedelta(days=1)).strftime('%Y-%m-%d') # yesterday

### Load station information from the API

In [4]:
req = urllib.request.Request(datalink+'metadata/stations')
req.add_header('Accept-encoding', 'gzip')
res = urllib.request.urlopen(req)
data=json.loads(gzip.decompress(res.read()))
stations={}
for station in data:
    stations[station['stationUICCode']]=station['stationName'].replace(' asema','')
print(stations[869], 'has a station code 869')

Airaksela has a station code 869


### Load train data from one day

The response contains every train for the given date.

In [5]:
req = urllib.request.Request(datalink+'trains/'+date)
req.add_header('Accept-encoding', 'gzip')
res = urllib.request.urlopen(req)
data = json.loads(gzip.decompress(res.read()))  # nested structure of lists and dicts
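
The station and train requests above share the same steps: build the request, ask for gzip, decompress, parse. A minimal sketch of a shared helper follows. The names `BASE_URL`, `decode_json_body` and `fetch_json` are hypothetical and not part of the original notebook. The helper also checks the gzip magic bytes before decompressing, on the assumption that a server may ignore the `Accept-encoding` header and return an uncompressed body.

```python
import gzip
import json
import urllib.request

BASE_URL = 'https://rata.digitraffic.fi/api/v1/'  # same value as `datalink` above


def decode_json_body(raw):
    """Parse a response body as JSON, gunzipping it first if it is gzip data."""
    if raw[:2] == b'\x1f\x8b':  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw.decode('utf-8'))


def fetch_json(path, base_url=BASE_URL):
    """Request base_url + path with gzip encoding and return the parsed JSON."""
    req = urllib.request.Request(base_url + path)
    req.add_header('Accept-encoding', 'gzip')
    with urllib.request.urlopen(req) as res:
        return decode_json_body(res.read())


# Offline check of the decoding step with a made-up payload.
sample = json.dumps([{'stationUICCode': 869, 'stationName': 'Airaksela'}]).encode('utf-8')
assert decode_json_body(sample) == decode_json_body(gzip.compress(sample))
```

With such a helper, the two loading cells would reduce to calls like `fetch_json('metadata/stations')` and `fetch_json('trains/' + date)`.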

### Create dataframe 

The first and last stations, with their scheduled departure and arrival times, are extracted from timeTableRows. The differenceInMinutes column is the delay at the last station; trains with no recorded value there are given 0.

Some columns are dropped, and only commuter and long-distance trains are kept. Non-regular trains (museum trains, for example) are excluded as well.


In [6]:
for train in data:
    train['firstStation']=stations[train['timeTableRows'][0]['stationUICCode']]
    train['scheduledStartTime']=utcTimezone.localize(datetime.strptime(train['timeTableRows'][0]['scheduledTime'], dateformat))
    train['lastStation']=stations[train['timeTableRows'][-1]['stationUICCode']]
    train['scheduledStopTime']=utcTimezone.localize(datetime.strptime(train['timeTableRows'][-1]['scheduledTime'], dateformat))
    if 'differenceInMinutes' in train['timeTableRows'][-1]:
        train['differenceInMinutes']=train['timeTableRows'][-1]['differenceInMinutes']
    else:
        train['differenceInMinutes']=0

df = pd.DataFrame.from_dict(data)  # one row per train
# Keep only regular commuter and long-distance trains
mask = df['trainCategory'].isin(['Long-distance', 'Commuter']) & (df['timetableType'] == 'REGULAR')
df = df.loc[mask, :]
# Drop columns that are not used in the analysis
df = df.drop(['departureDate','operatorUICCode','operatorShortCode','runningCurrently','version','timetableType','timetableAcceptanceDate'], axis=1)
df.head(10)

Unnamed: 0,trainNumber,trainType,trainCategory,commuterLineID,cancelled,timeTableRows,firstStation,scheduledStartTime,lastStation,scheduledStopTime,differenceInMinutes
0,1,IC,Long-distance,,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 04:57:00+00:00,Joensuu,2020-11-13 09:40:00+00:00,-5
1,2,S,Long-distance,,False,"[{'stationShortCode': 'JNS', 'stationUICCode':...",Joensuu,2020-11-13 03:12:00+00:00,Helsinki,2020-11-13 07:30:00+00:00,0
2,3,IC,Long-distance,,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 08:19:00+00:00,Joensuu,2020-11-13 12:49:00+00:00,-5
3,4,IC,Long-distance,,False,"[{'stationShortCode': 'JNS', 'stationUICCode':...",Joensuu,2020-11-13 04:10:00+00:00,Helsinki,2020-11-13 08:40:00+00:00,0
4,5,IC,Long-distance,,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 11:19:00+00:00,Joensuu,2020-11-13 15:50:00+00:00,-2
5,6,IC,Long-distance,,False,"[{'stationShortCode': 'JNS', 'stationUICCode':...",Joensuu,2020-11-13 07:06:00+00:00,Helsinki,2020-11-13 11:40:00+00:00,1
6,7,S,Long-distance,,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 13:19:00+00:00,Joensuu,2020-11-13 17:36:00+00:00,-7
7,8,IC,Long-distance,,False,"[{'stationShortCode': 'JNS', 'stationUICCode':...",Joensuu,2020-11-13 10:13:00+00:00,Helsinki,2020-11-13 14:40:00+00:00,3
8,9,IC,Long-distance,,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 14:19:00+00:00,Joensuu,2020-11-13 18:56:00+00:00,-1
9,10,IC,Long-distance,,False,"[{'stationShortCode': 'JNS', 'stationUICCode':...",Joensuu,2020-11-13 13:13:00+00:00,Helsinki,2020-11-13 17:45:00+00:00,0
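
The loop in cell 6 writes 0 when the last timetable row has no differenceInMinutes, so a train with no recorded arrival looks the same as a punctual one. The sketch below shows one possible alternative that keeps the missing value as `None`. It uses plain Python and made-up sample data: the function name `summarise_train`, the station codes 1 and 2, and the sample times are all assumptions for illustration.

```python
from datetime import datetime, timezone

DATEFORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'  # same pattern as `dateformat` above


def parse_utc(timestr):
    """Parse an API timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(timestr, DATEFORMAT).replace(tzinfo=timezone.utc)


def summarise_train(train, station_names):
    """Return first/last station, scheduled times and final delay for one train."""
    first, last = train['timeTableRows'][0], train['timeTableRows'][-1]
    return {
        'firstStation': station_names.get(first['stationUICCode'], first['stationUICCode']),
        'scheduledStartTime': parse_utc(first['scheduledTime']),
        'lastStation': station_names.get(last['stationUICCode'], last['stationUICCode']),
        'scheduledStopTime': parse_utc(last['scheduledTime']),
        # None (not 0) when the API reported no difference for the last row
        'differenceInMinutes': last.get('differenceInMinutes'),
    }


# Hypothetical station codes and a made-up two-row timetable.
names = {1: 'Station A', 2: 'Station B'}
sample_train = {'timeTableRows': [
    {'stationUICCode': 1, 'scheduledTime': '2020-11-13T04:57:00.000Z'},
    {'stationUICCode': 2, 'scheduledTime': '2020-11-13T09:40:00.000Z'},
]}
summary = summarise_train(sample_train, names)
print(summary['firstStation'], '->', summary['lastStation'], summary['differenceInMinutes'])
```

When such records are loaded into a dataframe, the `None` values should appear as missing values, and comparisons such as `> 10` would then leave those trains out instead of counting them as on time.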


### Some data on trains of one day

In [7]:
print('Date:', date)
print('Number of trains:', df.shape[0])
print('Cancelled Long-distance trains:', df[np.logical_and(df['cancelled'], df['trainCategory']=='Long-distance')].shape[0])
print('Cancelled commuter trains:', df[np.logical_and(df['cancelled'], df['trainCategory']=='Commuter')].shape[0])
print('Trains delayed more than 10 minutes:', df[df['differenceInMinutes']>10].shape[0])
print('Trains delayed more than 2 minutes:', df[df['differenceInMinutes']>2].shape[0])

Date: 2020-11-13
Number of trains: 1131
Cancelled Long-distance trains: 10
Cancelled commuter trains: 25
Trains delayed more than 10 minutes: 13
Trains delayed more than 2 minutes: 116
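
The counts above are absolute and include cancelled trains in the total. As a rough sketch, the same figures can be expressed as shares of the trains that actually ran. The function `delay_summary` and the sample records are hypothetical, and the code uses plain Python lists of dicts rather than the dataframe so that it runs on its own.

```python
def delay_summary(trains, thresholds=(2, 10)):
    """Count cancelled trains and the share of running trains late by more than each threshold."""
    running = [t for t in trains if not t['cancelled']]
    summary = {'trains': len(trains), 'cancelled': len(trains) - len(running)}
    for limit in thresholds:
        late = sum(1 for t in running if t['differenceInMinutes'] > limit)
        summary[f'late_over_{limit}'] = late
        # Share is relative to running trains only; 0.0 if nothing ran
        summary[f'share_over_{limit}'] = late / len(running) if running else 0.0
    return summary


# Made-up records, not real data from the API.
sample = [
    {'cancelled': False, 'differenceInMinutes': 0},
    {'cancelled': False, 'differenceInMinutes': 3},
    {'cancelled': False, 'differenceInMinutes': 12},
    {'cancelled': True, 'differenceInMinutes': 0},
]
result = delay_summary(sample)
print(result)
```

Excluding cancelled trains from the denominator is a design choice: a cancelled train may have no recorded arrival, in which case cell 6 would have assigned it a delay of 0.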


In [8]:
max_delay = df['differenceInMinutes'].max()
min_delay = df['differenceInMinutes'].min()
print('Train with longest positive difference:',max_delay,'minutes')
print('Train with longest negative difference:',min_delay,'minutes')

Train with longest positive difference: 39 minutes
Train with longest negative difference: -10 minutes


### More detailed data on the actual schedule of one train

In [9]:
train = df.loc[df['differenceInMinutes']==max_delay] # the train(s) with the longest delay
train

Unnamed: 0,trainNumber,trainType,trainCategory,commuterLineID,cancelled,timeTableRows,firstStation,scheduledStartTime,lastStation,scheduledStopTime,differenceInMinutes
805,8631,HL,Commuter,P,False,"[{'stationShortCode': 'HKI', 'stationUICCode':...",Helsinki,2020-11-13 06:58:00+00:00,Helsinki,2020-11-13 08:01:00+00:00,39


In [14]:
def parse_hm(timestr):
    # Format an API timestamp as HH:MM (UTC); empty strings pass through unchanged
    if timestr=='':
        return ''
    parsed = utcTimezone.localize(datetime.strptime(timestr, dateformat))
    return parsed.strftime('%H:%M')

schedule = pd.DataFrame.from_dict(train['timeTableRows'].iloc[0])
schedule=schedule.loc[schedule.trainStopping,:] # Select stations where train stops
schedule=schedule.drop(['commercialStop','commercialTrack','stationShortCode','countryCode','estimateSource','trainStopping','trainReady','liveEstimateTime'], axis=1, errors='ignore')
schedule['stationUICCode']=schedule['stationUICCode'].map(stations)
schedule['scheduledTime']=schedule['scheduledTime'].map(parse_hm)
schedule['actualTime']=schedule['actualTime'].fillna('').map(parse_hm)
schedule


Unnamed: 0,stationUICCode,type,cancelled,scheduledTime,actualTime,differenceInMinutes,causes
0,Helsinki,DEPARTURE,False,06:58,06:58,0,[]
1,Pasila,ARRIVAL,False,07:01,07:02,1,[]
2,Pasila,DEPARTURE,False,07:02,07:02,1,[]
3,Ilmala,ARRIVAL,False,07:04,07:04,0,[]
4,Ilmala,DEPARTURE,False,07:04,07:04,0,[]
7,Huopalahti,ARRIVAL,False,07:06,07:06,0,[]
8,Huopalahti,DEPARTURE,False,07:07,07:06,0,[]
9,Pohjois-Haaga,ARRIVAL,False,07:08,07:08,0,[]
10,Pohjois-Haaga,DEPARTURE,False,07:09,07:09,0,[]
11,Kannelmäki,ARRIVAL,False,07:10,07:10,0,[]
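
The times in the table above are UTC clock times, because parse_hm formats the UTC timestamp directly. A minimal sketch of a local-time variant follows. The name `parse_hm_local` is hypothetical, and the sketch uses the standard-library `zoneinfo` module (Python 3.9+, with time zone data available on the system) instead of pytz.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

DATEFORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'  # same pattern as `dateformat` above
HELSINKI = ZoneInfo('Europe/Helsinki')


def parse_hm_local(timestr, tz=HELSINKI):
    """Format an API timestamp as HH:MM in the given time zone."""
    if timestr == '':
        return ''
    utc_time = datetime.strptime(timestr, DATEFORMAT).replace(tzinfo=timezone.utc)
    return utc_time.astimezone(tz).strftime('%H:%M')


# 13 November is outside daylight saving time, so Helsinki is at UTC+2.
print(parse_hm_local('2020-11-13T06:58:00.000Z'))  # 08:58
```

In the notebook itself, the same effect should be achievable with the existing pytz objects by calling `.astimezone(localTimezone)` on the localized datetime before formatting.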


Some timetable rows may also list causes for the delay.

In [20]:
causes = []
for c in schedule['causes']:
    causes+=c
print(causes)

[{'categoryCode': 'R', 'detailedCategoryCode': 'R2', 'detailedCategoryCodeId': 161, 'categoryCodeId': 31}, {'categoryCode': 'L', 'detailedCategoryCode': 'L2', 'thirdCategoryCode': 'L204', 'detailedCategoryCodeId': 101, 'thirdCategoryCodeId': 93, 'categoryCodeId': 27}]


Let's load human-readable descriptions of the causes.

In [21]:
req = urllib.request.Request(datalink+'metadata/detailed-cause-category-codes')
req.add_header('Accept-encoding', 'gzip')
res = urllib.request.urlopen(req)
data=json.loads(gzip.decompress(res.read()))

In [22]:
category={}
for c in data:
    category[c['id']]=c
for c in causes:
    print(category[c['detailedCategoryCodeId']]['passengerTerm']['en'])

Track work
A congested line section
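
The lookup in cell 22 raises a KeyError if a cause refers to an id that is missing from the metadata or has no passengerTerm. A more defensive variant could look like the sketch below. The function `describe_causes` is hypothetical, the two category entries are cut down to the fields used here with the terms printed above, and the third cause (id 999, code X1) is made up to exercise the fallback.

```python
def describe_causes(causes, categories, lang='en'):
    """Return a readable label per cause, falling back to the raw code if no term exists."""
    labels = []
    for cause in causes:
        entry = categories.get(cause.get('detailedCategoryCodeId'), {})
        term = entry.get('passengerTerm') or {}
        labels.append(term.get(lang) or cause.get('detailedCategoryCode', 'unknown'))
    return labels


# Reduced metadata entries using the two terms shown in the output above.
categories = {
    161: {'id': 161, 'passengerTerm': {'en': 'Track work'}},
    101: {'id': 101, 'passengerTerm': {'en': 'A congested line section'}},
}
causes = [
    {'detailedCategoryCode': 'R2', 'detailedCategoryCodeId': 161},
    {'detailedCategoryCode': 'L2', 'detailedCategoryCodeId': 101},
    {'detailedCategoryCode': 'X1', 'detailedCategoryCodeId': 999},  # made-up, not in metadata
]
labels = describe_causes(causes, categories)
print(labels)
```

Falling back to the raw code keeps the output usable even when the metadata and the train data are out of sync.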
