This kernel was originally created by [KaanD][1] 

How is SEPTA doing? Its 91% on-time marker, where a train counts as on time if it is less than 6 minutes late, is something we should be able to check against the data. Then we can see what else we can learn.


  [1]: https://www.kaggle.com/kdivringi

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sqlalchemy import create_engine
import seaborn as sns
import matplotlib.pyplot as plt # sns.plt is not available in newer seaborn releases
plt.rcParams['figure.figsize'] = (12, 10)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
con = create_engine('sqlite:///../input/database.sqlite')

In [None]:
df = pd.read_sql_table('otp', con)
df.head()

In [None]:
df.describe()

In [None]:
df.status.unique()

In [None]:
df[df.status=="1440 min"]

In addition to the 999 code for suspended trains, at least one train is listed as 1440 minutes late (a full 24 hours!). It looks like [that train should have just been suspended](https://www.kaggle.com/forums/f/1300/septa-regional-rail/t/22861/what-s-the-deal-with-the-train-that-s-60-days-late). We'll set that train to the suspended code and then convert the strings to numbers.

In [None]:
df.loc[df.status=="1440 min", "status"] = "999 min"
df['status_n'] = df.status.str.replace("On Time", "0").str.replace(" min","").astype("int")
df.head()
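
As a small self-contained illustration of the conversion above, the sketch below applies the same replacements to a made-up Series. The sample values are hypothetical and not taken from the database; only the string formats ("On Time", "N min") follow what the notebook shows.

```python
import pandas as pd

# Hypothetical status strings in the same format as the otp table
status = pd.Series(["On Time", "3 min", "1440 min", "999 min", "12 min"])

# Treat the 1440-minute outlier as suspended (code 999)
status = status.replace("1440 min", "999 min")

# "On Time" becomes "0", the " min" suffix is stripped, then cast to integers
status_n = (
    status.str.replace("On Time", "0", regex=False)
          .str.replace(" min", "", regex=False)
          .astype(int)
)
print(status_n.tolist())
```

Passing `regex=False` makes it explicit that both patterns are literal strings, which avoids depending on the default of `str.replace` in any particular pandas version.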

Now let's take a look at the distribution, minus the suspended trains:

In [None]:
df[df.status_n!=999].status_n.hist(bins=100, log=True);

And then the number of suspended trains:

In [None]:
print("Number of suspended trains:", len(df[df.status_n==999]))

How many trains are "On time" by the definition of less than 6 minutes late? This should give us our first clue into how SEPTA is doing.

In [None]:
# On time trains:
ot = df[df.status_n < 6]
# Late trains:
lt = df[df.status_n >= 6]
print("On time trains:", len(ot), "Late trains:", len(lt), "Percentage on time:", len(ot)/len(df)*100)
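
Note that the calculation above counts suspended trains (code 999) as late, since 999 >= 6. The sketch below, on made-up numbers, shows how the percentage shifts depending on whether suspended trains are kept in the denominator. The helper `on_time_pct` and its parameters are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def on_time_pct(minutes_late, threshold=6, suspended_code=999,
                include_suspended=True):
    """Percentage of observations that are less than `threshold` minutes late."""
    s = pd.Series(minutes_late)
    if not include_suspended:
        # Drop suspended trains from both numerator and denominator
        s = s[s != suspended_code]
    return (s < threshold).mean() * 100

# Hypothetical sample: five on-time trains, two late, one suspended
sample = [0, 0, 2, 5, 6, 10, 999, 0]
pct_all = on_time_pct(sample)
pct_running = on_time_pct(sample, include_suspended=False)
print(pct_all, pct_running)
```

Which denominator is appropriate depends on how SEPTA itself treats suspended trains in its figure, which the data here does not tell us.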

It looks like SEPTA is at 81% for the spring, about 10 percentage points below the stated goal. Can we narrow down some of the problem areas? When do late trains generally occur? We can look at this as a schedule: day of week versus hour of day.

In [None]:
df['Day'] = df.timeStamp.dt.dayofweek
df['Hour'] = df.timeStamp.dt.hour
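# Sum minutes late per (Hour, Day) slot; select status_n so that non-numeric columns are not aggregated
gb = df[df.status_n != 999].groupby(["Hour", "Day"])["status_n"].sum().unstack()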
gb.head()

In [None]:
sns.heatmap(gb,xticklabels=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]);

We can see that much of the lateness accumulates during the weekday rush hours, an expected result.
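
The Hour-by-Day table behind the heat map can be reproduced on a few made-up rows, as sketched below. Because the table sums minutes, busy slots with many trains will look worse even when each train is only slightly late, so the sketch also computes the mean per slot for comparison. The timestamps and delays are hypothetical; only the column names `timeStamp`, `status_n`, `Day`, and `Hour` come from the notebook.

```python
import pandas as pd

# Hypothetical observations: a timestamp plus minutes late
toy = pd.DataFrame({
    "timeStamp": pd.to_datetime([
        "2016-03-28 08:05", "2016-03-28 08:40", "2016-03-28 13:10",
        "2016-03-29 08:15", "2016-03-29 13:30",
    ]),
    "status_n": [4, 8, 1, 10, 0],
})
toy["Day"] = toy.timeStamp.dt.dayofweek   # Monday=0 ... Sunday=6
toy["Hour"] = toy.timeStamp.dt.hour

grouped = toy.groupby(["Hour", "Day"])["status_n"]
total_late = grouped.sum().unstack()    # total minutes late per slot
mean_late = grouped.mean().unstack()    # average minutes late per train
print(total_late)
print(mean_late)
```

Either table can be passed to `sns.heatmap` in the same way as `gb`.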

That heat map includes the delay of every train, even those only a minute or two behind. What about the trains that are late by SEPTA's definition? What is the distribution for trains more than 5 minutes late?

In [None]:
lt = df[df.status_n >= 6]
# Filter with lt's own mask (not df's) and aggregate only the numeric status_n column
gb2 = lt[lt.status_n != 999].groupby(["Hour", "Day"])["status_n"].sum().unstack()
gb2.head()

In [None]:
sns.heatmap(gb2,xticklabels=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]);

Looks very similar. What is it about the Tuesday morning rush hour, anyway?
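
One way to probe a question like this is to count late trains per slot instead of summing their minutes, so that a single very late train does not dominate. The sketch below does this on made-up rows; the values are hypothetical and the approach is only one possible follow-up, not something the analysis above performs.

```python
import pandas as pd

# Hypothetical rows with precomputed Hour, Day (Monday=0), and minutes late
toy = pd.DataFrame({
    "Hour": [8, 8, 8, 17, 17],
    "Day": [1, 1, 0, 1, 4],
    "status_n": [7, 12, 3, 999, 6],
})

# Late by the 6-minute definition, excluding the 999 suspended code
late = toy[(toy.status_n >= 6) & (toy.status_n != 999)]

# Number of late trains per (Hour, Day) slot; empty slots become 0
late_counts = late.groupby(["Hour", "Day"]).size().unstack(fill_value=0)
print(late_counts)
```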