# SEPTA On-Time Performance - If a train is late in the city, does anyone notice?

In which I look at something that has been bothering me about the SEPTA dataset.

In [None]:
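import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# sns.plt is no longer available in current seaborn; set the figure size via matplotlib
plt.rcParams['figure.figsize'] = (12, 10)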

In [None]:
from sqlalchemy import create_engine
con = create_engine('sqlite:///../input/database.sqlite')

In [None]:
df = pd.read_sql_table('otp', con)
df.head()

In [None]:
df.info()

In [None]:
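# "1440 min" is 24 hours, presumably a placeholder rather than a real delay; recode it as 999
df.loc[df.status=="1440 min", "status"] = "999 min"
# Turn "On Time" / "N min" strings into integer minutes late
df['status_n'] = df.status.str.replace("On Time", "0").str.replace(" min","").astype("int")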

Now let's look at just train 550, to cut down on the data.

In [None]:
t = df[df.train_id=="550"].sort_values(by='timeStamp')
t.head()

How does the lateness of this train line evolve over time?

In [None]:
# t already holds train 550 sorted by timeStamp, so reuse it
t.iloc[:100].plot(x='date', y='status_n')

Does that seem right to you? An initial 10-minute delay makes every subsequent station look very poor, even though some of them actually made up part of the lost time. What about the lateness each individual station contributes?

In [None]:
t['status_diff']= t.status_n.diff()
t.head()

In [None]:
t.plot(x='date', y='status_diff')

In [None]:
t.status_diff.hist(bins=50, log=True)

In [None]:
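# Average only the numeric lateness columns; recent pandas raises on .mean() over string columns
tg = t.groupby('next_station')[['status_n', 'status_diff']].mean().sort_values('status_diff')
tg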

In [None]:
tg.plot(kind="scatter", x='status_n', y='status_diff')

In [None]:
tg.corr()

In [None]:
df.sort_values(by=['train_id', 'timeStamp'], inplace=True)

In [None]:
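# Diff within each train so a train's first report is not compared with another train's last one
df['status_diff'] = df.groupby('train_id').status_n.diff()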

In [None]:
df.loc[df.next_station == "None", 'status_diff'] = np.nan
df.head()

In [None]:
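# Average only the numeric lateness columns; recent pandas raises on .mean() over string columns
diffs = df.dropna().groupby('next_station')[['status_n', 'status_diff']].mean().sort_values('status_diff')
diffs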

Looking at the data in terms of _incremental_ lateness makes a big difference in which stations seem to be problematic. I think this is a better indicator of problem areas in the transportation system: lateness picked up at any one station carries over to every subsequent station, so raw lateness blames downstream stations for delays they did not cause.
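To make the distinction concrete, here is a minimal sketch on made-up numbers. The station names and delays are invented for illustration and are not taken from the SEPTA data; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical single run of one train: lateness (minutes) reported at each stop.
run = pd.DataFrame({
    "next_station": ["A", "B", "C", "D", "E"],
    "status_n": [0, 10, 10, 8, 9],
})

# Incremental lateness: minutes lost (+) or recovered (-) since the previous report.
run["status_diff"] = run["status_n"].diff()

# The first report has no predecessor, so its increment is NaN. Seed it with the
# first reported status so the increments add back up to the reported lateness.
increments = run["status_diff"].fillna(run["status_n"].iloc[0])
rebuilt = increments.cumsum()

print(run)
```

In this toy run the whole delay is picked up on the way to station B, yet stations C, D and E all report 8 to 10 minutes late. Station D even recovers two minutes, which only the `status_diff` column shows.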

In [None]:
diffs.plot(kind='scatter', x='status_n', y='status_diff')

In [None]:
diffs.corr()

This way of looking at the data seems even more valuable when looking at more than one train line.
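One thing to watch with several trains in the same frame is that a plain `diff()` also compares the first report of one train with the last report of the previous one. A rough sketch of a per-train diff, using invented reports for two trains (the timestamps, stations and delays are all hypothetical):

```python
import pandas as pd

# Hypothetical reports for two trains, invented for illustration.
reports = pd.DataFrame({
    "train_id": ["550", "550", "550", "778", "778", "778"],
    "timeStamp": pd.to_datetime([
        "2016-03-23 08:00", "2016-03-23 08:05", "2016-03-23 08:10",
        "2016-03-23 08:00", "2016-03-23 08:06", "2016-03-23 08:12",
    ]),
    "next_station": ["A", "B", "C", "A", "B", "C"],
    "status_n": [0, 4, 3, 7, 7, 9],
})

reports = reports.sort_values(["train_id", "timeStamp"])

# Diff within each train, so the first report of train 778 is not
# compared with the last report of train 550.
reports["status_diff"] = reports.groupby("train_id")["status_n"].diff()

# Average raw and incremental lateness per station, ignoring each train's first report.
per_station = (
    reports.dropna(subset=["status_diff"])
    .groupby("next_station")[["status_n", "status_diff"]]
    .mean()
    .sort_values("status_diff")
)
print(per_station)
```

In this made-up example station C has the higher average lateness but the lower average incremental lateness, which is the kind of reversal the incremental view is meant to expose.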

I think I can do some useful things with this in the future.