## Better identification of bus trips by ignoring Bus Time inferences
This notebook demonstrates that row-wise processing of time-sorted vehicle pings can be used to split the records into unique bus trips.  For a sample date, the split produces a total trip count that is much closer to the number expected from the schedule.

In [1]:
# -*- coding: utf-8 -*-
import os
import pandas as pd
import matplotlib.pyplot as plt
from scipy import interpolate
from itertools import compress
import time
%matplotlib inline  
import sys
sys.path.append('/gpfs2/projects/project-bus_capstone_2016/workspace/mu529/bus-Capstone')

# these modules are homemade
import gtfs
import arrivals
import ttools
os.chdir('/gpfs2/projects/project-bus_capstone_2016/workspace/share')

### Get raw data

In [4]:
# get the sample of parsed AVL data.  Beware, large files take more time.
bustime = pd.read_csv('20151203_parsed_with_destinations.csv')       
bustime.rename(columns={'vehicleID':'vehicle_id','Line':'route','Latitude':'lat','Longitude':'lon',
                        'Trip':'trip_id','TripDate':'trip_date','TripPattern':'shape_id',
                        'MonitoredCallRef':'next_stop_id','DistFromCall':'dist_from_stop',
                        'CallDistAlongRoute':'stop_dist_on_trip','RecordedAtTime':'timestamp'},inplace=True)

In [9]:
bustime.drop_duplicates(['vehicle_id','timestamp'],inplace=True)
bustime['trip_id'] = bustime['trip_id'].str.replace('MTA NYCT_','')
bustime['trip_id'] = bustime['trip_id'].str.replace('MTABC_','')
bustime['ts'] = bustime['timestamp'].apply(ttools.parseActualTime,tdate='2015-12-03')
print 'Loaded Bus Time data but did not set indexes'

Loaded Bus Time data but did not set indexes


In [10]:
# for demonstration, use a subset. Just get one day of data.
tripDateLookup = "2015-12-03"
bustime = bustime[bustime['trip_date']==tripDateLookup]
# note that the AVL dataframe must be sorted by timestamp, since iloc[]
# selection is used later in this script to find the earliest time


bustime['stop_dist_on_trip'] = bustime['stop_dist_on_trip'].convert_objects(convert_numeric=True)
bustime['dist_from_stop'] = bustime['dist_from_stop'].convert_objects(convert_numeric=True)
bustime['veh_dist_along_trip'] = bustime['stop_dist_on_trip'] - bustime['dist_from_stop']

print 'Finished loading BusTime data, converting distances, and slicing ONE DAY.'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Finished loading BusTime data, converting distances, and slicing ONE DAY.
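
The warnings in this cell's output are pandas' SettingWithCopyWarning, most likely raised because `bustime` was produced by a boolean filter and then had columns assigned to it. A minimal sketch of one way to avoid them, assuming a newer pandas release in which `pd.to_numeric` takes the place of `convert_objects`. The three-row frame and its values are invented for illustration and are not taken from the data set:

```python
import pandas as pd

# Toy stand-in for the AVL data; values are invented for illustration.
raw = pd.DataFrame({
    'trip_date': ['2015-12-03', '2015-12-03', '2015-12-04'],
    'stop_dist_on_trip': ['1200.5', 'bad', '300'],
    'dist_from_stop': ['200.5', '10', '50'],
})

# An explicit copy makes the filtered frame independent of `raw`,
# so later column assignments act on a frame of their own.
day = raw[raw['trip_date'] == '2015-12-03'].copy()

# Coerce strings to floats; entries that cannot be parsed become NaN.
for col in ['stop_dist_on_trip', 'dist_from_stop']:
    day[col] = pd.to_numeric(day[col], errors='coerce')

# Distance of the vehicle along the trip, as computed in the cell above.
day['veh_dist_along_trip'] = day['stop_dist_on_trip'] - day['dist_from_stop']
```

The `.copy()` costs some memory on a large file, so whether it is worthwhile here is a judgment call.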




In [12]:
bustime.set_index(['route','trip_date','vehicle_id'],inplace=True,drop=True)
bustime.set_index('timestamp',append=True,drop=True,inplace=True)
bustime.sort_index(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
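
The index built in this cell has four levels, and the later cells refer to them by position. A small sketch on invented pings (the route names, vehicle numbers and times are hypothetical) showing the level order and how grouping on the first three levels isolates each vehicle on a route-day:

```python
import pandas as pd

# Toy pings; column names follow the notebook, values are invented.
pings = pd.DataFrame({
    'route': ['B41', 'B41', 'B41', 'M15'],
    'trip_date': ['2015-12-03'] * 4,
    'vehicle_id': [101, 101, 102, 201],
    'timestamp': ['08:05', '08:00', '08:01', '09:00'],
    'lat': [40.1, 40.0, 40.2, 40.7],
})

# Build the same four-level index and sort it; this returns a new frame
# rather than modifying `pings` in place.
indexed = (pings
           .set_index(['route', 'trip_date', 'vehicle_id', 'timestamp'])
           .sort_index())

# Level positions: 0=route, 1=trip_date, 2=vehicle_id, 3=timestamp.
level_names = list(indexed.index.names)

# Grouping on the first three levels yields one group per vehicle per route-day.
pings_per_vehicle = indexed.groupby(level=[0, 1, 2]).size()
```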


In [14]:
import numpy as np

### Labeling algorithm
This algorithm iterates row-wise through a dataframe and increments the label if either of two conditions is met:
   1. The headsign (destination displayed to customers) changes
   2. More than 45 minutes elapse since the bus last reported observations on the same route
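
The two conditions can also be expressed without an explicit loop. The following is a sketch of a vectorized alternative, not the notebook's own implementation: the function name `label_trips`, the parameter `thresh_minutes` and the five sample pings are hypothetical, and `ts` is assumed to hold pandas datetimes.

```python
import pandas as pd

def label_trips(df, thresh_minutes=45):
    """Label the time-sorted pings of one vehicle with an increasing trip number."""
    # Condition 1: the headsign differs from the previous ping's headsign.
    headsign_changed = df['DestinationRef'] != df['DestinationRef'].shift()
    # Condition 2: the gap since the previous ping exceeds the threshold.
    long_gap = df['ts'].diff() > pd.Timedelta(minutes=thresh_minutes)
    new_trip = headsign_changed | long_gap
    # The first ping has no predecessor, so it simply opens label 0.
    new_trip.iloc[0] = False
    # A running count of trip starts gives the label for every ping.
    return new_trip.cumsum()

# Invented pings for a single vehicle, already sorted by time.
pings = pd.DataFrame({
    'ts': pd.to_datetime(['2015-12-03 08:00', '2015-12-03 08:10',
                          '2015-12-03 08:20', '2015-12-03 09:30',
                          '2015-12-03 09:40']),
    'DestinationRef': ['A', 'A', 'B', 'B', 'B'],
})
pings['new_trip_id'] = label_trips(pings)
```

In this toy input the third ping changes headsign and the fourth follows a 70-minute gap, so the labels step up at those two rows.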

In [59]:
def matthew4(df,thresh=45):
    # Label the pings of one (route, trip_date, vehicle_id) group with a trip counter.
    label = 0
    first_idx = df.index[0]
    time_limit = ttools.datetime.timedelta(minutes=thresh)
    # True where the gap since the previous ping exceeds the threshold
    limit_bools = df['ts'].diff()>time_limit
    for index, row in df.iterrows():
        if index==first_idx:
            labels = [label]
            hs = row.DestinationRef
        else:
            # start a new trip when the headsign changes or the gap is too long
            if (row.DestinationRef!=hs or limit_bools[index]==True):
                label += 1
                hs = row.DestinationRef
            labels.append(label)
    # index the labels by timestamp (level 3) so they align within the group
    return pd.Series(data=labels,index=df.index.get_level_values(3))

In [63]:
bustime['new_trip_id'] = bustime.groupby(level=(0,1,2)).apply(matthew4)

### Compare length of label lists, using various groupings
Count the groups that contain more than two pings, grouping by the combination of route, trip date, vehicle and the newly inferred trip label:

In [68]:
sum(bustime.set_index('new_trip_id',append=True).groupby(level=(0,1,2,4)).size()>2)

53575
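
The count above is the number of groups holding more than two pings. A small sketch of the same counting pattern on a flat, invented frame (the six rows and the variable names are hypothetical):

```python
import pandas as pd

# Invented labeled pings: vehicle 101 has trips 0 and 1, vehicle 102 has trip 0.
labeled = pd.DataFrame({
    'route': ['B41'] * 6,
    'vehicle_id': [101, 101, 101, 101, 102, 102],
    'new_trip_id': [0, 0, 0, 1, 0, 0],
})

# Number of pings in each (route, vehicle, trip) group.
pings_per_trip = labeled.groupby(['route', 'vehicle_id', 'new_trip_id']).size()

# Only groups with more than two pings are counted as trips.
trips_with_enough_pings = int((pings_per_trip > 2).sum())
```

Here the group sizes are 3, 1 and 2, so only one group passes the filter.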

Adding shape_id to the grouping shows how often more than one shape_id is reported for the same trip:

In [77]:
sum(bustime.set_index(['new_trip_id','shape_id'],append=True).groupby(level=(0,1,2,4,5)).size()>2)

54389
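
The count rises from 53575 to 54389 once shape_id joins the grouping, which suggests that some labeled trips carry more than one shape_id. One way to look at those trips directly is to count distinct shape_id values per trip. This is a sketch on invented rows, with hypothetical variable names:

```python
import pandas as pd

# Invented labeled pings; trip 0 of vehicle 101 reports two different shapes.
labeled = pd.DataFrame({
    'vehicle_id': [101, 101, 101, 101, 102, 102],
    'new_trip_id': [0, 0, 0, 1, 0, 0],
    'shape_id': ['S1', 'S1', 'S2', 'S1', 'S3', 'S3'],
})

# Distinct shape_id values reported within each (vehicle, trip) group.
shapes_per_trip = (labeled
                   .groupby(['vehicle_id', 'new_trip_id'])['shape_id']
                   .nunique())

# Trips for which more than one shape_id was reported.
multi_shape_trips = shapes_per_trip[shapes_per_trip > 1]
```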

Compare this with the old method, which groups by the combination of vehicle and reported trip_id. The second count below drops the vehicle from the grouping and uses the reported trip_id only.

In [69]:
bt_copy = bustime.set_index('trip_id',append=True)

In [71]:
sum(bt_copy.groupby(level=(0,1,2,4)).size()>2)

57232

In [72]:
sum(bt_copy.groupby(level=(0,1,4)).size()>2)

51102