## Project Benson

This notebook is a collaboration between Sasha Prokhorova, Nick Horton and Anterra Kennedy.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
# using one week's data as a sample to explore first 
df_sample = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_190907.txt")

In [9]:
df_sample.head(20)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,00:00:00,REGULAR,7183242,2433142
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,04:00:00,REGULAR,7183258,2433149
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,08:00:00,REGULAR,7183278,2433176
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,12:00:00,REGULAR,7183393,2433262
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,16:00:00,REGULAR,7183572,2433312
5,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,20:00:00,REGULAR,7183842,2433348
6,A002,R051,02-00-00,59 ST,NQR456W,BMT,09/01/2019,00:00:00,REGULAR,7184008,2433376
7,A002,R051,02-00-00,59 ST,NQR456W,BMT,09/01/2019,04:00:00,REGULAR,7184025,2433380
8,A002,R051,02-00-00,59 ST,NQR456W,BMT,09/01/2019,08:00:00,REGULAR,7184042,2433397
9,A002,R051,02-00-00,59 ST,NQR456W,BMT,09/01/2019,12:00:00,REGULAR,7184137,2433450


In [6]:
# Source: http://web.mta.info/developers/turnstile.html
def get_data(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)
        
# weeks: sept 7th to december 7th, 2019    
week_nums = [190907, 190914, 190921, 190928, 191005, 191012, 191019,
            191026, 191102, 191109, 191116, 191123, 191130, 191207]
df = get_data(week_nums)
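The week identifiers are `YYMMDD` dates spaced seven days apart, so the list could also be generated instead of typed by hand. A minimal sketch; the names `week_end_dates` and `generated_week_nums` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Dates seven days apart, from Sept 7 to Dec 7, 2019 (the range used above).
week_end_dates = pd.date_range("2019-09-07", "2019-12-07", freq="7D")

# Format each date as YYMMDD and convert to int to match the hand-written list.
generated_week_nums = [int(d.strftime("%y%m%d")) for d in week_end_dates]
print(generated_week_nums)
```

The result can be passed to `get_data` in place of the hand-written `week_nums`.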

In [8]:
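# Parsing TIME on its own fills in the date the cell was run (see the output),
# so DATE and TIME need to be combined before converting.
pd.to_datetime(df['TIME'])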

0        2020-07-01 00:00:00
1        2020-07-01 04:00:00
2        2020-07-01 08:00:00
3        2020-07-01 12:00:00
4        2020-07-01 16:00:00
                 ...        
205919   2020-07-01 04:00:00
205920   2020-07-01 08:00:00
205921   2020-07-01 12:00:00
205922   2020-07-01 16:00:00
205923   2020-07-01 20:00:00
Name: TIME, Length: 2882165, dtype: datetime64[ns]

# **Field Description** 
---
**C/A**     = Control Area (A002) <br>
**UNIT**     = Remote Unit for a station (R051) <br>
**SCP**      = Subunit Channel Position, which represents a specific address for a device (02-00-00)<br>
**STATION**  = Represents the station name the device is located at <br>
**LINENAME** = Represents all train lines that can be boarded at this station<br>
 - Normally lines are represented by one character. <br>
 - LINENAME 456NQR represents train service for the 4, 5, 6, N, Q, and R trains. <br>
---
**DIVISION** = Represents the line the station originally belonged to: BMT, IRT, or IND <br>  
**DATE**     = Represents the date (MM/DD/YYYY in this data, e.g. 08/31/2019) <br>
**TIME**     = Represents the time (hh:mm:ss) for a scheduled audit event<br>
**DESC**     = Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours) <br>
 - Audits may occur at intervals longer than 4 hours due to planning or troubleshooting activities. <br>
 - Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered. <br>
---
**ENTRIES**  = The cumulative entry register value for a device<br>
**EXITS**    = The cumulative exit register value for a device <br>


**"ENTRIES"** and **"EXITS"** are *cumulative* register values for a device, meaning each reading is the running total of entries (or exits) for all time. The number of entries within a 4-hour period is therefore the difference between a given value and the value that precedes it. This explains why so many values were identical from row to row: 0 people used that turnstile in that 4-hour period. <-- Anterra's discovery
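As a small worked example of that subtraction, here are the first four `ENTRIES` readings from the sample output above run through `Series.diff()`. This is only a sketch on a standalone series; the names `readings` and `per_period` are illustrative:

```python
import pandas as pd

# Cumulative ENTRIES readings copied from the first four rows of the sample output.
readings = pd.Series([7183242, 7183258, 7183278, 7183393])

# diff() subtracts the previous reading from each reading;
# the first value has no predecessor, so it comes out as NaN.
per_period = readings.diff()
print(per_period.tolist())
```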

In [11]:
df.sample(20)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
104825,N537,R258,00-03-01,4 AV-9 ST,DFGMNR,IND,11/16/2019,15:00:00,REGULAR,12351143,6577061
49562,N046,R281,00-06-01,72 ST,BC,IND,11/14/2019,03:00:00,REGULAR,304910,365687
137799,R142,R293,01-00-01,34 ST-PENN STA,123ACE,IRT,10/20/2019,10:00:00,REGULAR,13228705,8555398
163669,R246,R177,00-03-06,68ST-HUNTER CO,6,IRT,10/20/2019,05:00:00,REGULAR,1254041,1358219
100690,N513,R163,04-05-00,14 ST,FLM123,IND,11/18/2019,17:05:50,REGULAR,58,0
53884,N067,R012,00-03-05,34 ST-PENN STA,ACE,IND,11/13/2019,19:00:00,REGULAR,539657,253785
10771,A060,R001,00-00-00,WHITEHALL S-FRY,R1W,BMT,11/10/2019,08:00:00,REGULAR,4759067,2806318
193134,R550,R072,00-03-04,34 ST-HUDSON YD,7,IRT,11/30/2019,23:00:00,REGULAR,1976228,738372
95914,N420B,R317,00-00-00,CLINTON-WASH AV,G,IND,09/30/2019,09:00:00,REGULAR,887819,1921748
67388,N122,R439,00-00-02,ROCKAWAY AV,C,IND,09/28/2019,09:00:00,REGULAR,4375963,5090529


In [14]:
# The `columns` attribute lists all the column names in the DataFrame.
# It revealed multiple trailing spaces in the 'EXITS' column name.
df_sample.columns

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES',
       'EXITS                                                               '],
      dtype='object')

In [15]:
# Combining date and time into one column.
df_sample["DATETIME"] = pd.to_datetime(df_sample["DATE"] + " " + df_sample["TIME"])
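Passing an explicit `format` to `pd.to_datetime` removes any ambiguity between month-first and day-first dates and can be faster on large frames. A minimal sketch on two made-up rows laid out like the MTA file (the `demo` frame is illustrative only):

```python
import pandas as pd

# Two illustrative rows in the same DATE / TIME layout as the turnstile file.
demo = pd.DataFrame({
    "DATE": ["08/31/2019", "09/01/2019"],
    "TIME": ["00:00:00", "04:00:00"],
})

# Month/day/year followed by a 24-hour time.
demo["DATETIME"] = pd.to_datetime(
    demo["DATE"] + " " + demo["TIME"], format="%m/%d/%Y %H:%M:%S"
)
print(demo["DATETIME"])
```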

In [16]:
df_sample.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,00:00:00,REGULAR,7183242,2433142,2019-08-31 00:00:00
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,04:00:00,REGULAR,7183258,2433149,2019-08-31 04:00:00
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,08:00:00,REGULAR,7183278,2433176,2019-08-31 08:00:00
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,12:00:00,REGULAR,7183393,2433262,2019-08-31 12:00:00
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,16:00:00,REGULAR,7183572,2433312,2019-08-31 16:00:00


In [17]:
# cleaning up data column name "EXITS" which had a bunch of trailing spaces

df_sample.rename(columns={'EXITS                                                               ':"EXITS"}, inplace=True)
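An alternative that does not depend on the exact number of trailing spaces is to strip whitespace from every column name at once. A sketch on a toy frame (the `demo` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy frame whose last column name has trailing spaces, like the raw MTA header.
demo = pd.DataFrame({"ENTRIES": [1], "EXITS      ": [2]})

# Strip leading/trailing whitespace from all column names.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```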

In [18]:
df_sample.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,DATETIME
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,00:00:00,REGULAR,7183242,2433142,2019-08-31 00:00:00
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,04:00:00,REGULAR,7183258,2433149,2019-08-31 04:00:00
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,08:00:00,REGULAR,7183278,2433176,2019-08-31 08:00:00
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,12:00:00,REGULAR,7183393,2433262,2019-08-31 12:00:00
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/31/2019,16:00:00,REGULAR,7183572,2433312,2019-08-31 16:00:00


In [19]:
# dropping unneeded 'DATE' and 'TIME' columns now that we have 'DATETIME': 

df_sample.drop(columns=["DATE", "TIME"], inplace=True)

In [21]:
df_sample.sample(10)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME
28230,E004,R234,00-00-00,50 ST,D,BMT,REGULAR,5889379,5472459,2019-09-06 00:00:00
35328,H032,R295,00-05-01,WILSON AV,L,BMT,REGULAR,7,46,2019-09-04 08:00:00
53343,N063A,R011,00-00-09,42 ST-PORT AUTH,ACENQRS1237W,IND,REGULAR,520081,232054,2019-09-06 08:00:00
24717,D002,R390,00-00-00,8 AV,N,BMT,REGULAR,33806,104388,2019-08-31 17:00:00
164221,R249,R179,01-05-00,86 ST,456,IRT,REGULAR,17,0,2019-09-03 08:00:00
21165,C011,R231,01-00-01,UNION ST,R,BMT,REGULAR,2330415,5834016,2019-09-06 00:00:00
83903,N327,R254,00-05-01,GRAND-NEWTOWN,MR,IND,REGULAR,1197529,2602311,2019-09-01 16:00:00
161802,R244A,R050,01-00-04,59 ST,456NQRW,IRT,REGULAR,5605658,927437,2019-09-04 20:00:00
203450,S101,R070,00-03-01,ST. GEORGE,1,SRT,REGULAR,1619280,111,2019-09-06 12:00:00
29500,E015,R399,00-00-02,25 AV,D,BMT,REGULAR,4697750,1648298,2019-08-31 17:00:00


In [22]:
df_sample.shape

(204795, 10)

In [26]:
# finding actual entries during each time period: 

# diff() within each turnstile (C/A, UNIT, SCP); the first reading of each turnstile is NaN
real_entries = df_sample.groupby(["C/A", "UNIT", "SCP"])[["ENTRIES"]].diff()
real_entries.head(10)

Unnamed: 0,ENTRIES
0,
1,16.0
2,20.0
3,115.0
4,179.0
5,270.0
6,166.0
7,17.0
8,17.0
9,95.0


In [27]:
# finding actual exits during each time period: 

# diff() within each turnstile (C/A, UNIT, SCP); the first reading of each turnstile is NaN
real_exits = df_sample.groupby(["C/A", "UNIT", "SCP"])[["EXITS"]].diff()
real_exits.head(10)

Unnamed: 0,EXITS
0,
1,7.0
2,27.0
3,86.0
4,50.0
5,36.0
6,28.0
7,4.0
8,17.0
9,53.0


In [28]:
# adding columns of real entries and exits to the DataFrame 

df_sample["REAL_ENTRIES"] = real_entries["ENTRIES"]
df_sample["REAL_EXITS"] = real_exits["EXITS"]
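To see why the difference has to be taken per turnstile rather than over the whole column, here is a sketch with two made-up turnstiles stacked in one frame. The identifiers reuse values seen above, but the counter readings are invented for illustration:

```python
import pandas as pd

# Two turnstiles, three cumulative readings each (readings are made up).
demo = pd.DataFrame({
    "C/A": ["A002"] * 3 + ["H007"] * 3,
    "UNIT": ["R051"] * 3 + ["R248"] * 3,
    "SCP": ["02-00-00"] * 3 + ["00-00-00"] * 3,
    "ENTRIES": [100, 116, 136, 5000, 5000, 5007],
})

# Grouping first means the diff restarts at each turnstile boundary,
# so the jump from 136 to 5000 is never counted as riders.
demo["REAL_ENTRIES"] = demo.groupby(["C/A", "UNIT", "SCP"])["ENTRIES"].diff()
print(demo)
```

A plain `demo["ENTRIES"].diff()` would instead report 4864 entries on the first row of the second turnstile.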

In [29]:
df_sample.head(10)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME,REAL_ENTRIES,REAL_EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183242,2433142,2019-08-31 00:00:00,,
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183258,2433149,2019-08-31 04:00:00,16.0,7.0
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183278,2433176,2019-08-31 08:00:00,20.0,27.0
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183393,2433262,2019-08-31 12:00:00,115.0,86.0
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183572,2433312,2019-08-31 16:00:00,179.0,50.0
5,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7183842,2433348,2019-08-31 20:00:00,270.0,36.0
6,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7184008,2433376,2019-09-01 00:00:00,166.0,28.0
7,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7184025,2433380,2019-09-01 04:00:00,17.0,4.0
8,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7184042,2433397,2019-09-01 08:00:00,17.0,17.0
9,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,7184137,2433450,2019-09-01 12:00:00,95.0,53.0


In [39]:
# index the per-period counts by station, turnstile (C/A, UNIT, SCP) and timestamp
df_station_sample = df_sample.groupby(["STATION", "C/A", "UNIT", "SCP", "DATETIME"])[["REAL_ENTRIES", "REAL_EXITS"]].sum()

df_station_sample.head(50)

STATION,C/A,UNIT,SCP,DATETIME,REAL_ENTRIES,REAL_EXITS
1 AV,H007,R248,00-00-00,2019-08-31 00:00:00,0.0,0.0
1 AV,H007,R248,00-00-00,2019-08-31 04:00:00,0.0,5.0
1 AV,H007,R248,00-00-00,2019-08-31 08:00:00,0.0,16.0
1 AV,H007,R248,00-00-00,2019-08-31 12:00:00,0.0,5.0
1 AV,H007,R248,00-00-00,2019-08-31 16:00:00,0.0,11.0
1 AV,H007,R248,00-00-00,2019-08-31 20:00:00,0.0,10.0
1 AV,H007,R248,00-00-00,2019-09-01 00:00:00,0.0,5.0
1 AV,H007,R248,00-00-00,2019-09-01 04:00:00,0.0,13.0
1 AV,H007,R248,00-00-00,2019-09-01 08:00:00,0.0,19.0
1 AV,H007,R248,00-00-00,2019-09-01 12:00:00,0.0,16.0


In [40]:
df_station_sample.dropna(how="any", inplace=True)
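One thing worth checking at this step: a grouped `sum()` skips NaN by default and returns 0 for a group that contains only NaN, so there may be no NaN left for `dropna` to remove. The sketch below shows the difference on made-up values; `min_count=1` is a standard pandas option, while the `demo` frame and the `turnstile` column are illustrative:

```python
import numpy as np
import pandas as pd

# Turnstile "a" has one real value; turnstile "b" has only a missing value.
demo = pd.DataFrame({
    "turnstile": ["a", "a", "b"],
    "REAL_ENTRIES": [np.nan, 16.0, np.nan],
})

# Default: NaN is skipped, and an all-NaN group sums to 0.0.
default_sum = demo.groupby("turnstile")["REAL_ENTRIES"].sum()

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN.
strict_sum = demo.groupby("turnstile")["REAL_ENTRIES"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```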

In [42]:
# resetting index to flatten multiindex 

df_station_sample.reset_index(inplace=True)

In [54]:
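# Looking for rows where the audit was recovered.
# Caution: the mask is built from df_sample, whose row order and index differ from
# df_station_sample (grouped by station, then re-indexed), so the rows selected here
# are not guaranteed to be the non-REGULAR audits.

df_station_sample.loc[df_sample.DESC != "REGULAR"]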

Unnamed: 0,STATION,C/A,UNIT,SCP,DATETIME,REAL_ENTRIES,REAL_EXITS
111,1 AV,H007,R248,00-03-00,2019-09-04 12:00:00,284.0,398.0
10467,14 ST-UNION SQ,A035,R170,00-00-04,2019-08-31 17:00:00,501.0,132.0
10468,14 ST-UNION SQ,A035,R170,00-00-04,2019-08-31 21:00:00,424.0,115.0
10509,14 ST-UNION SQ,A037,R170,05-00-00,2019-08-31 17:00:00,840.0,383.0
10510,14 ST-UNION SQ,A037,R170,05-00-00,2019-08-31 21:00:00,988.0,242.0
...,...,...,...,...,...,...,...
197847,WALL ST,R111,R027,00-03-00,2019-09-03 16:00:00,135.0,104.0
197848,WALL ST,R111,R027,00-03-00,2019-09-03 20:00:00,574.0,133.0
197889,WALL ST,R111,R027,00-03-01,2019-09-03 16:00:00,208.0,42.0
197890,WALL ST,R111,R027,00-03-01,2019-09-03 20:00:00,965.0,41.0
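The recovered-audit check depends on the `DESC` column, which the grouped frame no longer carries. A minimal sketch of filtering on `DESC` in a frame that still has it; the rows are made up for illustration, with "RECOVR AUD" taken from the field description above:

```python
import pandas as pd

# Illustrative rows; only the DESC values matter for this filter.
demo = pd.DataFrame({
    "STATION": ["1 AV", "1 AV", "WALL ST"],
    "DESC": ["REGULAR", "RECOVR AUD", "REGULAR"],
    "ENTRIES": [100, 120, 300],
})

# Mask and frame share the same index, so the selection is row-for-row correct.
recovered = demo.loc[demo["DESC"] != "REGULAR"]
print(recovered)
```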


In [55]:
df_station_sample.head(20)

Unnamed: 0,STATION,C/A,UNIT,SCP,DATETIME,REAL_ENTRIES,REAL_EXITS
0,1 AV,H007,R248,00-00-00,2019-08-31 00:00:00,0.0,0.0
1,1 AV,H007,R248,00-00-00,2019-08-31 04:00:00,0.0,5.0
2,1 AV,H007,R248,00-00-00,2019-08-31 08:00:00,0.0,16.0
3,1 AV,H007,R248,00-00-00,2019-08-31 12:00:00,0.0,5.0
4,1 AV,H007,R248,00-00-00,2019-08-31 16:00:00,0.0,11.0
5,1 AV,H007,R248,00-00-00,2019-08-31 20:00:00,0.0,10.0
6,1 AV,H007,R248,00-00-00,2019-09-01 00:00:00,0.0,5.0
7,1 AV,H007,R248,00-00-00,2019-09-01 04:00:00,0.0,13.0
8,1 AV,H007,R248,00-00-00,2019-09-01 08:00:00,0.0,19.0
9,1 AV,H007,R248,00-00-00,2019-09-01 12:00:00,0.0,16.0


In [47]:
# How many stations are in this dataset total?
df_station_sample['STATION'].nunique()

378

In [51]:
# What is the foot traffic like at noon?

twelve_pm = df_station_sample[df_station_sample["DATETIME"] == "2019-09-05 12:00:00"]
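The filter above matches one exact timestamp. To look at the noon reading on every day in the data, one option is to filter on the hour component instead. A sketch on made-up rows (the `demo` frame is illustrative); note that devices auditing on other schedules, such as the 17:00 readings seen in the sample, would not match any fixed-hour filter:

```python
import pandas as pd

demo = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2019-09-04 12:00:00",
        "2019-09-05 08:00:00",
        "2019-09-05 12:00:00",
    ]),
    "REAL_ENTRIES": [10.0, 20.0, 30.0],
})

# Keep every row whose timestamp falls in the 12:00 hour, regardless of date.
noon_all_days = demo[demo["DATETIME"].dt.hour == 12]
print(noon_all_days)
```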

In [52]:
twelve_pm

Unnamed: 0,STATION,C/A,UNIT,SCP,DATETIME,REAL_ENTRIES,REAL_EXITS
33,1 AV,H007,R248,00-00-00,2019-09-05 12:00:00,1170.0,1333.0
75,1 AV,H007,R248,00-00-01,2019-09-05 12:00:00,1473.0,353.0
117,1 AV,H007,R248,00-03-00,2019-09-05 12:00:00,332.0,498.0
159,1 AV,H007,R248,00-03-01,2019-09-05 12:00:00,301.0,121.0
201,1 AV,H007,R248,00-03-02,2019-09-05 12:00:00,439.0,20.0
...,...,...,...,...,...,...,...
202770,WORLD TRADE CTR,N094,R029,01-05-01,2019-09-05 12:00:00,0.0,0.0
202812,WORLD TRADE CTR,N094,R029,01-06-00,2019-09-05 12:00:00,52.0,53.0
202854,WORLD TRADE CTR,N094,R029,01-06-01,2019-09-05 12:00:00,105.0,122.0
202896,WORLD TRADE CTR,N094,R029,01-06-02,2019-09-05 12:00:00,161.0,281.0


In [53]:
twelve_pm.dropna()

Unnamed: 0,STATION,C/A,UNIT,SCP,DATETIME,REAL_ENTRIES,REAL_EXITS
33,1 AV,H007,R248,00-00-00,2019-09-05 12:00:00,1170.0,1333.0
75,1 AV,H007,R248,00-00-01,2019-09-05 12:00:00,1473.0,353.0
117,1 AV,H007,R248,00-03-00,2019-09-05 12:00:00,332.0,498.0
159,1 AV,H007,R248,00-03-01,2019-09-05 12:00:00,301.0,121.0
201,1 AV,H007,R248,00-03-02,2019-09-05 12:00:00,439.0,20.0
...,...,...,...,...,...,...,...
202770,WORLD TRADE CTR,N094,R029,01-05-01,2019-09-05 12:00:00,0.0,0.0
202812,WORLD TRADE CTR,N094,R029,01-06-00,2019-09-05 12:00:00,52.0,53.0
202854,WORLD TRADE CTR,N094,R029,01-06-01,2019-09-05 12:00:00,105.0,122.0
202896,WORLD TRADE CTR,N094,R029,01-06-02,2019-09-05 12:00:00,161.0,281.0


In [59]:
noon = twelve_pm.groupby(['STATION'])[['REAL_ENTRIES', 'REAL_EXITS']].sum()
noon.reset_index()

Unnamed: 0,STATION,REAL_ENTRIES,REAL_EXITS
0,1 AV,4727.0,5219.0
1,103 ST-CORONA,5510.0,1417.0
2,104 ST,889.0,81.0
3,110 ST,3425.0,1776.0
4,111 ST,3566.0,982.0
...,...,...,...
208,WHITLOCK AV,494.0,335.0
209,WILSON AV,1564.0,344.0
210,WOODHAVEN BLVD,1486.0,466.0
211,WOODLAWN,1778.0,209.0


In [60]:
noon.sort_values(["REAL_ENTRIES", "REAL_EXITS"], ascending=[False,False])

STATION,REAL_ENTRIES,REAL_EXITS
34 ST-PENN STA,37528.0,25072.0
34 ST-HERALD SQ,20483.0,37757.0
FLUSHING-MAIN,18318.0,8098.0
86 ST,17872.0,14985.0
GRD CNTRL-42 ST,17696.0,20225.0
...,...,...
BEACH 36 ST,270.0,146.0
BEACH 44 ST,177.0,88.0
AQUEDUCT RACETR,126.0,27.0
ORCHARD BEACH,0.0,0.0
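When only the busiest stations are of interest, `DataFrame.nlargest` is a shorter alternative to a full sort. The sketch below rebuilds the first five rows of the noon totals shown earlier; the name `noon_demo` is illustrative:

```python
import pandas as pd

# First five stations from the noon totals above.
noon_demo = pd.DataFrame(
    {
        "REAL_ENTRIES": [4727.0, 5510.0, 889.0, 3425.0, 3566.0],
        "REAL_EXITS": [5219.0, 1417.0, 81.0, 1776.0, 982.0],
    },
    index=pd.Index(
        ["1 AV", "103 ST-CORONA", "104 ST", "110 ST", "111 ST"], name="STATION"
    ),
)

# Two stations with the most entries, highest first.
top = noon_demo.nlargest(2, "REAL_ENTRIES")
print(top)
```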


In [61]:
noon.head(10)

STATION,REAL_ENTRIES,REAL_EXITS
1 AV,4727.0,5219.0
103 ST-CORONA,5510.0,1417.0
104 ST,889.0,81.0
110 ST,3425.0,1776.0
111 ST,3566.0,982.0
121 ST,541.0,245.0
125 ST,9381.0,7774.0
135 ST,4160.0,831.0
138/GRAND CONC,885.0,622.0
14 ST,6715.0,10914.0


Look for outliers amongst entries and exits.

In [66]:
df_sample["REAL_ENTRIES"].sort_values(ascending = False).head(25)

73163     718560745.0
116104        92258.0
116895         7625.0
116877         7308.0
3095           2872.0
3089           2746.0
3101           2719.0
96607          2599.0
3173           2598.0
3185           2559.0
184356         2532.0
50270          2520.0
3107           2512.0
3179           2482.0
96613          2479.0
3137           2475.0
96601          2467.0
50276          2454.0
50264          2441.0
116408         2438.0
50318          2403.0
1373           2398.0
4873           2379.0
50312          2372.0
4915           2367.0
Name: REAL_ENTRIES, dtype: float64

In [68]:
pd.set_option('display.float_format', lambda x: '%.1f' % x) # suppresses scientific notation
df_sample["REAL_EXITS"].sort_values(ascending = False).head()

73163    1886405893.0
116104       110916.0
116895         6509.0
116877         6264.0
97422          5598.0
Name: REAL_EXITS, dtype: float64

In [70]:
df_sample.loc[73163]

C/A                            N205
UNIT                           R195
SCP                        02-00-00
STATION             161/YANKEE STAD
LINENAME                        BD4
DIVISION                        IND
DESC                        REGULAR
ENTRIES                   721441289
EXITS                    1895802233
DATETIME        2019-09-06 12:22:00
REAL_ENTRIES            718560745.0
REAL_EXITS             1886405893.0
Name: 73163, dtype: object

In [71]:
df_sample.loc[116104]

C/A                           PTH03
UNIT                           R552
SCP                        00-00-00
STATION              JOURNAL SQUARE
LINENAME                          1
DIVISION                        PTH
DESC                        REGULAR
ENTRIES                      126651
EXITS                        126869
DATETIME        2019-09-01 15:50:02
REAL_ENTRIES                92258.0
REAL_EXITS                 110916.0
Name: 116104, dtype: object

In [72]:
df_sample.loc[116895]

C/A                           PTH03
UNIT                           R552
SCP                        00-01-08
STATION              JOURNAL SQUARE
LINENAME                          1
DIVISION                        PTH
DESC                        REGULAR
ENTRIES                        7626
EXITS                          6509
DATETIME        2019-09-06 12:03:31
REAL_ENTRIES                 7625.0
REAL_EXITS                   6509.0
Name: 116895, dtype: object

In [73]:
df_sample.loc[116877]

C/A                           PTH03
UNIT                           R552
SCP                        00-01-08
STATION              JOURNAL SQUARE
LINENAME                          1
DIVISION                        PTH
DESC                        REGULAR
ENTRIES                        7309
EXITS                          6264
DATETIME        2019-09-03 12:39:31
REAL_ENTRIES                 7308.0
REAL_EXITS                   6264.0
Name: 116877, dtype: object

The Yankee Stadium and Journal Square stations display anomalies. These can probably be attributed to turnstile malfunctions.
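One possible way to handle such readings is to blank out counts that are negative or implausibly large before aggregating. This is only a sketch: the cap `MAX_PER_PERIOD` is a hypothetical threshold chosen for illustration, not a value from the MTA or from this analysis. The first four values come from the sorted `REAL_ENTRIES` output above, and the negative value is added to illustrate a counter running backwards:

```python
import pandas as pd

# Hypothetical cap on riders through one turnstile in one audit period.
MAX_PER_PERIOD = 10_000

counts = pd.Series([718560745.0, 92258.0, 7625.0, 2872.0, -318.0])

# Keep values within [0, MAX_PER_PERIOD]; everything else becomes NaN.
clean = counts.where(counts.between(0, MAX_PER_PERIOD))
print(clean.tolist())
```

A fixed cap is crude: the 7625 reading passes this filter even though it was flagged above, so a per-station or per-turnstile rule may be more appropriate.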

In [75]:
df_sample.loc[73163-1 : 73163+1]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME,REAL_ENTRIES,REAL_EXITS
73162,N205,R195,02-00-00,161/YANKEE STAD,BD4,IND,REGULAR,2880544,9396340,2019-09-06 08:22:00,101.0,164.0
73163,N205,R195,02-00-00,161/YANKEE STAD,BD4,IND,REGULAR,721441289,1895802233,2019-09-06 12:22:00,718560745.0,1886405893.0
73164,N205,R195,02-00-00,161/YANKEE STAD,BD4,IND,REGULAR,721441362,1895801915,2019-09-06 16:22:00,73.0,-318.0


In [76]:
df_sample.loc[116104-1 : 116104+1]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME,REAL_ENTRIES,REAL_EXITS
116103,PTH03,R552,00-00-00,JOURNAL SQUARE,1,PTH,REGULAR,34393,15953,2019-09-01 13:05:39,73.0,17.0
116104,PTH03,R552,00-00-00,JOURNAL SQUARE,1,PTH,REGULAR,126651,126869,2019-09-01 15:50:02,92258.0,110916.0
116105,PTH03,R552,00-00-00,JOURNAL SQUARE,1,PTH,REGULAR,34450,15975,2019-09-01 17:17:39,-92201.0,-110894.0


Above we see that the adjacent indices at the Yankee Stadium station don't look abnormal; only the outlier row itself does. While 