# Chattanooga: Occupancy Analysis APC data 

Given my current limitations in computer capabilities, I needed to subdivide the data. Previously using `R` I subdivided the `cartaapc_dashboard.csv` data using the variable `direction_id`:

- `carta1.csv`: `direction_id == 1`.
- `carta0.csv`: `direction_id == 0`.

This notebook describes the steps that I took to find the following data attributes:

- Find max, median and 75% and 90% occupancy by `trip_id` conditioned by `service_period` (`weekday` and `weekend`).
    * *The percentile value can be taken as input for a function.*
- Find the `stop_id`s and the `date`s when the `trip_id`s have an `occupancy` greater than the median and 90th percentile occupancies.
- Find the `date`s when the `trip_id`s have an `occupancy` greater than the 75th percentile.
- Extract the `occupancy` data samples for each `stop_id` and `trip_id` and find the days when the stops have an anomaly.
    * Use a Z-score to show when the value is higher than expected and when it is lower than expected.
    * Show the dates for the `stop_id` and `trip_id` when that happened.

# Complete Data Set

## Required packages

In [71]:
import numpy as np
import pandas as pd
import scipy.stats as ss

## Data Input

In [4]:
carta = pd.read_csv('cartaapc_dashboard.csv')

In [5]:
carta.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,occupancy,direction_desc,service_period,date,date_time,trip_start_time,day_of_week,trip_date,hour
0,139145,08:51:00,354.0,1.0,35.056167,-85.268713,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 08:51:00,08:51:00,4.0,2019-11-01,8.0
1,139145,08:54:59,505.0,2.0,35.056017,-85.28108,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 08:54:59,08:51:00,4.0,2019-11-01,8.0
2,139145,09:05:00,1713.0,3.0,35.042,-85.30867,16,0.0,1.0,1.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:00,08:51:00,4.0,2019-11-01,9.0
3,139145,09:05:21,1560.0,4.0,35.04288,-85.309102,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:21,08:51:00,4.0,2019-11-01,9.0
4,139145,09:05:39,163.0,5.0,35.043448,-85.309277,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:39,08:51:00,4.0,2019-11-01,9.0


Since we will analysis occuopancy based on the type of of `service_period`. We need to know all the possible options for this variable:

In [6]:
carta['service_period'].unique()

array(['Weekday', 'Saturday', 'Sunday', nan], dtype=object)

It seems that there are some rows with `nan` as `service_period`. Let's find out which rows they are:

- **Note:** The following process ('service_period' == 'nan' extraction) takes more time than expected. While identifying these rows is important, the following step is not relevant right now.

In [41]:
#service_period_nana = carta.loc[carta['service_period'] == 'nan']
#service_period_nana.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,occupancy,direction_desc,service_period,date,date_time,trip_start_time,day_of_week,trip_date,hour


The strategy is to create new `DataFrame`s with the summary statistics of `occupancy` conditioned to `trip_id`:

## Weekdays

In [7]:
Weekdays_ind = carta['service_period'] == 'Weekday'
Occupancy_Weekdays = carta[Weekdays_ind]
Occupancy_Weekdays.head(10)

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,occupancy,direction_desc,service_period,date,date_time,trip_start_time,day_of_week,trip_date,hour
0,139145,08:51:00,354.0,1.0,35.056167,-85.268713,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 08:51:00,08:51:00,4.0,2019-11-01,8.0
1,139145,08:54:59,505.0,2.0,35.056017,-85.28108,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 08:54:59,08:51:00,4.0,2019-11-01,8.0
2,139145,09:05:00,1713.0,3.0,35.042,-85.30867,16,0.0,1.0,1.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:00,08:51:00,4.0,2019-11-01,9.0
3,139145,09:05:21,1560.0,4.0,35.04288,-85.309102,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:21,08:51:00,4.0,2019-11-01,9.0
4,139145,09:05:39,163.0,5.0,35.043448,-85.309277,16,0.0,0.0,0.0,0.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:05:39,08:51:00,4.0,2019-11-01,9.0
5,139145,09:06:14,164.0,6.0,35.045045,-85.309308,16,0.0,1.0,0.0,1.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:06:14,08:51:00,4.0,2019-11-01,9.0
6,139145,09:06:34,165.0,7.0,35.046062,-85.30934,16,0.0,0.0,0.0,1.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:06:34,08:51:00,4.0,2019-11-01,9.0
7,139145,09:07:00,1361.0,8.0,35.047342,-85.309403,16,0.0,0.0,0.0,1.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:07:00,08:51:00,4.0,2019-11-01,9.0
8,139145,09:07:39,1589.0,9.0,35.049317,-85.309498,16,0.0,1.0,0.0,2.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:07:39,08:51:00,4.0,2019-11-01,9.0
9,139145,09:08:01,166.0,10.0,35.05072,-85.309452,16,0.0,0.0,0.0,2.0,OUTBOUND,Weekday,2019-11-01,2019-11-01 09:08:01,08:51:00,4.0,2019-11-01,9.0


The following function calculates the required percentiles:

In [8]:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

Then, we calculate the required summary statistics:

In [9]:
Occupancy_Weekdays_trips = Occupancy_Weekdays.groupby(['trip_id'], as_index = False).agg({'occupancy':['mean', 'median', percentile(75), percentile(90), 'max']})

In [10]:
Occupancy_Weekdays_trips.head(10)

Unnamed: 0_level_0,trip_id,occupancy,occupancy,occupancy,occupancy,occupancy
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,median,percentile_75,percentile_90,max
0,132994,0.991736,1.0,2.0,2.0,5.0
1,132995,0.202267,0.0,0.0,1.0,3.0
2,132996,4.343357,4.0,6.0,8.0,13.0
3,132997,4.492528,4.0,6.0,8.0,16.0
4,132998,0.958874,1.0,2.0,3.0,5.0
5,132999,7.052161,7.0,10.0,11.0,15.0
6,133000,2.646904,2.0,4.0,5.0,9.0
7,133001,1.649254,1.0,2.0,4.0,7.0
8,133002,1.330177,1.0,3.0,4.0,8.0
9,133003,3.932179,4.0,5.0,6.0,10.0


In [11]:
Occupancy_Weekdays_trips1 = pd.DataFrame(Occupancy_Weekdays_trips)

It is good to rename the variable names to avoid misunderstandings.


In [12]:
Occupancy_Weekdays_trips.columns = ['trip_id', 'Mean_trip', 'Median_trip', 'percentile_75_trip', 'percentile_90_trip', 'max_trip']

In [13]:
Occupancy_Weekdays_trips.head(10)

Unnamed: 0,trip_id,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
0,132994,0.991736,1.0,2.0,2.0,5.0
1,132995,0.202267,0.0,0.0,1.0,3.0
2,132996,4.343357,4.0,6.0,8.0,13.0
3,132997,4.492528,4.0,6.0,8.0,16.0
4,132998,0.958874,1.0,2.0,3.0,5.0
5,132999,7.052161,7.0,10.0,11.0,15.0
6,133000,2.646904,2.0,4.0,5.0,9.0
7,133001,1.649254,1.0,2.0,4.0,7.0
8,133002,1.330177,1.0,3.0,4.0,8.0
9,133003,3.932179,4.0,5.0,6.0,10.0


Then, we can compare these summary statistics with the data from the Occupancy_Weekdays DataFrame.


In [14]:
df =  pd.merge(Occupancy_Weekdays, Occupancy_Weekdays_trips1, on = 'trip_id')

In [15]:
df.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
0,139145,08:51:00,354.0,1.0,35.056167,-85.268713,16,0.0,0.0,0.0,...,2019-11-01 08:51:00,08:51:00,4.0,2019-11-01,8.0,2.289608,2.0,3.0,5.0,12.0
1,139145,08:54:59,505.0,2.0,35.056017,-85.28108,16,0.0,0.0,0.0,...,2019-11-01 08:54:59,08:51:00,4.0,2019-11-01,8.0,2.289608,2.0,3.0,5.0,12.0
2,139145,09:05:00,1713.0,3.0,35.042,-85.30867,16,0.0,1.0,1.0,...,2019-11-01 09:05:00,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
3,139145,09:05:21,1560.0,4.0,35.04288,-85.309102,16,0.0,0.0,0.0,...,2019-11-01 09:05:21,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
4,139145,09:05:39,163.0,5.0,35.043448,-85.309277,16,0.0,0.0,0.0,...,2019-11-01 09:05:39,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0


Also, we want to find the `stop_id`s and the `date`s when the `trip_id`s have an `occupancy` greater than the median, 75th and 90th percentiles occupancy:

In [18]:
Trips_Occup_higher_than_median = df['occupancy'] > df['Median_trip']
Occup_higher_than_75thpercentile = df['occupancy'] > df['percentile_75_trip']
Trips_Occup_higher_than_90thpercentile = df['occupancy'] > df['percentile_90_trip']

In [74]:
#df.to_csv('Weekdays_Occupancy_Analysis_by_trip_id.csv', index = False)

Finally, the `Weekdays_Occupancy_Analysis_by_trip_id.csv` has all the required data to run the requested statistical analysis.

- **Note:** *Due to time limitations I will try to get information using `R`.*

# Weekends

In [19]:
Weekends_ind1 = carta.loc[(carta['service_period'] != 'Weekday') & (carta['service_period'] != 'nan')]
Weekends_ind1.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,occupancy,direction_desc,service_period,date,date_time,trip_start_time,day_of_week,trip_date,hour
22095,138429,05:51:00,354.0,1.0,35.056167,-85.268713,1,0.0,0.0,0.0,0.0,OUTBOUND,Saturday,2019-11-02,2019-11-02 05:51:00,05:51:00,5.0,2019-11-02,5.0
22096,138429,05:54:59,505.0,2.0,35.056017,-85.28108,1,0.0,0.0,0.0,0.0,OUTBOUND,Saturday,2019-11-02,2019-11-02 05:54:59,05:51:00,5.0,2019-11-02,5.0
22097,138429,05:57:59,784.0,3.0,35.051562,-85.299422,1,0.0,0.0,0.0,0.0,OUTBOUND,Saturday,2019-11-02,2019-11-02 05:57:59,05:51:00,5.0,2019-11-02,5.0
22098,138429,05:58:15,283.0,4.0,35.051912,-85.301298,1,0.0,0.0,0.0,0.0,OUTBOUND,Saturday,2019-11-02,2019-11-02 05:58:15,05:51:00,5.0,2019-11-02,5.0
22099,138429,05:58:32,284.0,5.0,35.052515,-85.302427,1,0.0,0.0,0.0,0.0,OUTBOUND,Saturday,2019-11-02,2019-11-02 05:58:32,05:51:00,5.0,2019-11-02,5.0


Similarly, we calculate the required summary statistics:

In [20]:
Occupancy_Weekends_trips = Weekends_ind1.groupby(['trip_id'], as_index = False).agg({'occupancy':['mean', 'median', percentile(75), percentile(90), 'max']})
Occupancy_Weekends_trips.columns = ['trip_id', 'Mean_trip', 'Median_trip', 'percentile_75_trip', 'percentile_90_trip', 'max_trip']

In [21]:
Occupancy_Weekends_trips.head()

Unnamed: 0,trip_id,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
0,133084,3.386364,4.0,4.0,6.0,7.0
1,133085,0.384793,0.0,1.0,1.0,2.0
2,133086,0.430108,0.0,1.0,2.0,2.0
3,133087,0.260215,0.0,1.0,1.0,2.0
4,133088,1.920455,2.0,3.0,4.0,6.0


Now, let's merge these two `DataFrame`s:

In [22]:
df4 =  pd.merge(Weekends_ind1, Occupancy_Weekends_trips, on = 'trip_id')
df4.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
0,138429,05:51:00,354.0,1.0,35.056167,-85.268713,1,0.0,0.0,0.0,...,2019-11-02 05:51:00,05:51:00,5.0,2019-11-02,5.0,0.352357,0.0,0.0,2.0,4.0
1,138429,05:54:59,505.0,2.0,35.056017,-85.28108,1,0.0,0.0,0.0,...,2019-11-02 05:54:59,05:51:00,5.0,2019-11-02,5.0,0.352357,0.0,0.0,2.0,4.0
2,138429,05:57:59,784.0,3.0,35.051562,-85.299422,1,0.0,0.0,0.0,...,2019-11-02 05:57:59,05:51:00,5.0,2019-11-02,5.0,0.352357,0.0,0.0,2.0,4.0
3,138429,05:58:15,283.0,4.0,35.051912,-85.301298,1,0.0,0.0,0.0,...,2019-11-02 05:58:15,05:51:00,5.0,2019-11-02,5.0,0.352357,0.0,0.0,2.0,4.0
4,138429,05:58:32,284.0,5.0,35.052515,-85.302427,1,0.0,0.0,0.0,...,2019-11-02 05:58:32,05:51:00,5.0,2019-11-02,5.0,0.352357,0.0,0.0,2.0,4.0


In [23]:
Trips_Occup_higher_than_median_weekend = df4['occupancy'] > df4['Median_trip']
Occup_higher_than_75thpercentile_weekend = df4['occupancy'] > df4['percentile_75_trip']
Trips_Occup_higher_than_90thpercentile_weekend = df4['occupancy'] > df4['percentile_90_trip']

In [24]:
df4.shape

(1297829, 24)

In [50]:
df4.to_csv('Weekends_Occupancy_Analysis_by_trip_id.csv', index = False)

# Descriptive Statistics for Anomalies

There are multiple ways to define anomaly thresholds:

1. If we define a `direction_id` and a `service_period`, then, we can define it by `trip_id`, which will highlight the `stop_id`s and `date`s where a given trip has an unusual `occupancy`. In other words, this will contrast the `occupancy` of a given `trip_id` against its own threshold (`trip_id` vs `trip_id`).

2. Moreover, we could also do it `route_id` and *hour of the day*. this will contrast the `occupancy` of a given `trip_id` against all the trips of  its own `route_id` per *hour of the day* (`trip_id` vs (`route_id` & *hour of the day*)).


## Weekdays

### Occupancy higher than the median

In [25]:
df_anomalies_median = df.loc[Trips_Occup_higher_than_median]
df_anomalies_median.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
35,139145,09:21:27,531.0,36.0,35.09023,-85.286865,16,0.0,1.0,0.0,...,2019-11-01 09:21:27,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
36,139145,09:21:47,532.0,37.0,35.091413,-85.286468,16,0.0,0.0,0.0,...,2019-11-01 09:21:47,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
37,139145,09:22:35,533.0,38.0,35.094123,-85.285372,16,0.0,0.0,0.0,...,2019-11-01 09:22:35,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
38,139145,09:23:37,534.0,39.0,35.096167,-85.282462,16,0.0,0.0,0.0,...,2019-11-01 09:23:37,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
39,139145,09:25:11,536.0,40.0,35.10018,-85.277837,16,0.0,0.0,0.0,...,2019-11-01 09:25:11,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0


Extracting the `stop_id`s and `trip_date`s:

In [57]:
df_anomalies_median_stops_dates = df_anomalies_median[['trip_id', 'stop_id', 'trip_date', 'occupancy', 'Median_trip']]
df_anomalies_median_stops_dates.head()

Unnamed: 0,trip_id,stop_id,trip_date,occupancy,Median_trip
35,139145,531.0,2019-11-01,3.0,2.0
36,139145,532.0,2019-11-01,3.0,2.0
37,139145,533.0,2019-11-01,3.0,2.0
38,139145,534.0,2019-11-01,3.0,2.0
39,139145,536.0,2019-11-01,3.0,2.0


Since we only have one observation per `trip_id`, it is not possible to obtain summary statistics.

However, we can fix the `stop_id` values to analyze distribution of the anomalous occupancies for all the `trip_date`s:

In [77]:
Df_anomalies_median_stops_dates = df_anomalies_median_stops_dates.groupby(['trip_id','stop_id'])['occupancy'].describe()
Df_anomalies_median_stops_dates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,stop_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
132994,12.0,15.0,2.2,0.560612,2.0,2.0,2.0,2.0,4.0
132994,17.0,14.0,2.285714,0.61125,2.0,2.0,2.0,2.0,4.0
132994,18.0,14.0,2.285714,0.61125,2.0,2.0,2.0,2.0,4.0
132994,19.0,12.0,2.166667,0.57735,2.0,2.0,2.0,2.0,4.0
132994,21.0,11.0,2.181818,0.603023,2.0,2.0,2.0,2.0,4.0


In [65]:
z_median_weekdays =(df_anomalies_median_stops_dates.groupby(['trip_id','stop_id'])['occupancy'].mean())
z_median_weekdays.head()

trip_id  stop_id
132994   12.0       2.200000
         17.0       2.285714
         18.0       2.285714
         19.0       2.166667
         21.0       2.181818
Name: occupancy, dtype: float64

In [70]:
Z_score_median_weekdays = pd.DataFrame(ss.zscore(z_median_weekdays, ddof=1), z_median_weekdays.index)
Z_score_median_weekdays.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
132994,12.0,-1.53314
132994,17.0,-1.508987
132994,18.0,-1.508987
132994,19.0,-1.542532
132994,21.0,-1.538263


We can also fix the `trip_date` values to analyze distribution of the anomalous occupancies for all the `stop_id`s:

In [101]:
Df_anomalies_median_dates_stops = df_anomalies_median_stops_dates.groupby(['trip_id','trip_date'])['occupancy'].describe()
Df_anomalies_median_dates_stops.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,trip_date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
132994,2019-01-02,3.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
132994,2019-01-07,3.0,5.0,0.0,5.0,5.0,5.0,5.0,5.0
132994,2019-01-08,3.0,5.0,0.0,5.0,5.0,5.0,5.0,5.0
132994,2019-01-09,5.0,2.6,0.547723,2.0,2.0,3.0,3.0,3.0
132994,2019-01-14,3.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0


### Occupancy higher than the 75th percentile

In [28]:
df_anomalies_higher_75_percentile = df.loc[Occup_higher_than_75thpercentile]
df_anomalies_higher_75_percentile.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
35,139145,09:21:27,531.0,36.0,35.09023,-85.286865,16,0.0,1.0,0.0,...,2019-11-01 09:21:27,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
36,139145,09:21:47,532.0,37.0,35.091413,-85.286468,16,0.0,0.0,0.0,...,2019-11-01 09:21:47,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
37,139145,09:22:35,533.0,38.0,35.094123,-85.285372,16,0.0,0.0,0.0,...,2019-11-01 09:22:35,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
38,139145,09:23:37,534.0,39.0,35.096167,-85.282462,16,0.0,0.0,0.0,...,2019-11-01 09:23:37,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0
39,139145,09:25:11,536.0,40.0,35.10018,-85.277837,16,0.0,0.0,0.0,...,2019-11-01 09:25:11,08:51:00,4.0,2019-11-01,9.0,2.289608,2.0,3.0,5.0,12.0


In [29]:
df_anomalies_higher_75_percentile = df_anomalies_higher_75_percentile[['trip_id', 'stop_id', 'trip_date', 'occupancy', 'Median_trip']]
df_anomalies_higher_75_percentile.head()

Unnamed: 0,trip_id,stop_id,trip_date,occupancy,Median_trip
53,139145,551.0,2019-11-01,4.0,2.0
54,139145,552.0,2019-11-01,4.0,2.0
55,139145,553.0,2019-11-01,4.0,2.0
56,139145,554.0,2019-11-01,4.0,2.0
57,139145,1422.0,2019-11-01,4.0,2.0


In [33]:
Df_anomalies_75_percentile_stops_dates = df_anomalies_higher_75_percentile.groupby(['trip_id','stop_id'])['occupancy'].describe()
Df_anomalies_75_percentile_stops_dates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,stop_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
132994,12.0,2.0,3.5,0.707107,3.0,3.25,3.5,3.75,4.0
132994,17.0,3.0,3.333333,0.57735,3.0,3.0,3.0,3.5,4.0
132994,18.0,3.0,3.333333,0.57735,3.0,3.0,3.0,3.5,4.0
132994,19.0,1.0,4.0,,4.0,4.0,4.0,4.0,4.0
132994,21.0,1.0,4.0,,4.0,4.0,4.0,4.0,4.0


In [85]:
z_75_percentile_weekdays = df_anomalies_higher_75_percentile.groupby(['trip_id','stop_id'])['occupancy'].mean()
z_75_percentile_weekdays.head()

trip_id  stop_id
132994   12.0       3.500000
         17.0       3.333333
         18.0       3.333333
         19.0       4.000000
         21.0       4.000000
Name: occupancy, dtype: float64

In [76]:
Z_score_75_percentile_weekdays = pd.DataFrame(ss.zscore(z_75_percentile_weekdays, ddof=1), z_median_weekdays.index)
Z_score_75_percentile_weekdays.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
132994,12.0,-1.408788
132994,17.0,-1.448138
132994,18.0,-1.448138
132994,19.0,-1.290735
132994,21.0,-1.290735


### Occupancy higher than the 90th percentile

In [80]:
df_anomalies_higher_90_percentile = df.loc[Trips_Occup_higher_than_90thpercentile]
df_anomalies_higher_90_percentile.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
177,139145,09:30:19,546.0,50.0,35.113042,-85.26253,16,0.0,2.0,0.0,...,2019-11-07 09:30:19,08:51:00,3.0,2019-11-07,9.0,2.289608,2.0,3.0,5.0,12.0
178,139145,09:30:36,547.0,51.0,35.113337,-85.261402,16,0.0,0.0,0.0,...,2019-11-07 09:30:36,08:51:00,3.0,2019-11-07,9.0,2.289608,2.0,3.0,5.0,12.0
179,139145,09:31:33,549.0,52.0,35.114353,-85.257142,16,0.0,0.0,0.0,...,2019-11-07 09:31:33,08:51:00,3.0,2019-11-07,9.0,2.289608,2.0,3.0,5.0,12.0
180,139145,09:32:21,550.0,53.0,35.11479,-85.253693,16,0.0,0.0,0.0,...,2019-11-07 09:32:21,08:51:00,3.0,2019-11-07,9.0,2.289608,2.0,3.0,5.0,12.0
181,139145,09:32:44,551.0,54.0,35.115197,-85.251818,16,0.0,0.0,0.0,...,2019-11-07 09:32:44,08:51:00,3.0,2019-11-07,9.0,2.289608,2.0,3.0,5.0,12.0


In [None]:
df_anomalies_higher_90_percentile = df_anomalies_higher_90_percentile[['trip_id', 'stop_id', 'trip_date', 'occupancy', 'Median_trip']]
df_anomalies_higher_90_percentile.head()

In [86]:
z_90_percentile_weekdays = df_anomalies_higher_90_percentile.groupby(['trip_id','stop_id'])['occupancy'].mean()
z_90_percentile_weekdays.head()

trip_id  stop_id
132994   12.0       3.500000
         17.0       3.333333
         18.0       3.333333
         19.0       4.000000
         21.0       4.000000
Name: occupancy, dtype: float64

In [87]:
Z_score_90_percentile_weekdays = pd.DataFrame(ss.zscore(z_90_percentile_weekdays, ddof=1), z_90_percentile_weekdays.index)
Z_score_90_percentile_weekdays.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
132994,12.0,-1.582667
132994,17.0,-1.615815
132994,18.0,-1.615815
132994,19.0,-1.483223
132994,21.0,-1.483223


## Weekends

### Occupancy higher than the median

In [31]:
df_anomaly_median_weekends = df4[Trips_Occup_higher_than_median_weekend]
df_anomaly_median_weekends.shape

(548726, 24)

In [32]:
df_anomaly_median_weekends.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
107,138429,06:02:33,1353.0,15.0,35.045383,-85.309452,1,0.0,1.0,0.0,...,2019-10-05 06:02:33,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
108,138429,06:02:59,17.0,16.0,35.04425,-85.309435,1,0.0,0.0,0.0,...,2019-10-05 06:02:59,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
109,138429,06:03:15,742.0,17.0,35.043472,-85.309435,1,0.0,0.0,0.0,...,2019-10-05 06:03:15,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
110,138429,06:03:58,18.0,18.0,35.042347,-85.30907,1,0.0,0.0,0.0,...,2019-10-05 06:03:58,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
111,138429,06:04:40,19.0,19.0,35.040952,-85.308387,1,0.0,0.0,0.0,...,2019-10-05 06:04:40,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0


In [35]:
df_anomaly_median_weekends = df_anomaly_median_weekends[['trip_id', 'stop_id', 'date', 'Median_trip', 'occupancy']]
df_anomaly_median_weekends.head()

Unnamed: 0,trip_id,stop_id,date,Median_trip,occupancy
107,138429,1353.0,2019-10-05,0.0,1.0
108,138429,17.0,2019-10-05,0.0,1.0
109,138429,742.0,2019-10-05,0.0,1.0
110,138429,18.0,2019-10-05,0.0,1.0
111,138429,19.0,2019-10-05,0.0,1.0


In [88]:
df_anomaly_median_weekends_stops_dates = df_anomaly_median_weekends.groupby(['trip_id','stop_id'])['occupancy'].describe()
df_anomaly_median_weekends_stops_dates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,stop_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
133084,12.0,1.0,5.0,,5.0,5.0,5.0,5.0,5.0
133084,17.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,18.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,19.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,21.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0


In [90]:
z_median_weekends = df_anomaly_median_weekends.groupby(['trip_id','stop_id'])['occupancy'].mean()
z_median_weekends.head()

trip_id  stop_id
133084   12.0       5.0
         17.0       5.5
         18.0       5.5
         19.0       5.5
         21.0       5.5
Name: occupancy, dtype: float64

In [91]:
Z_score_median_weekends = pd.DataFrame(ss.zscore(z_median_weekends, ddof=1), z_median_weekends.index)
Z_score_median_weekends.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
133084,12.0,-0.860732
133084,17.0,-0.732548
133084,18.0,-0.732548
133084,19.0,-0.732548
133084,21.0,-0.732548


### Occupancy higher than the 75th percentile

In [34]:
df4_anomaly_higher_75_percentile_weekends = df4[Occup_higher_than_75thpercentile_weekend]
df4_anomaly_higher_75_percentile_weekends.shape

(258677, 24)

In [36]:
df4_anomaly_higher_75_percentile_weekends.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
107,138429,06:02:33,1353.0,15.0,35.045383,-85.309452,1,0.0,1.0,0.0,...,2019-10-05 06:02:33,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
108,138429,06:02:59,17.0,16.0,35.04425,-85.309435,1,0.0,0.0,0.0,...,2019-10-05 06:02:59,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
109,138429,06:03:15,742.0,17.0,35.043472,-85.309435,1,0.0,0.0,0.0,...,2019-10-05 06:03:15,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
110,138429,06:03:58,18.0,18.0,35.042347,-85.30907,1,0.0,0.0,0.0,...,2019-10-05 06:03:58,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0
111,138429,06:04:40,19.0,19.0,35.040952,-85.308387,1,0.0,0.0,0.0,...,2019-10-05 06:04:40,05:51:00,5.0,2019-10-05,6.0,0.352357,0.0,0.0,2.0,4.0


In [37]:
df4_anomaly_higher_75_percentile_weekends = df4_anomaly_higher_75_percentile_weekends[['trip_id', 'stop_id', 'date', 'Median_trip', 'occupancy']]
df4_anomaly_higher_75_percentile_weekends.head()

Unnamed: 0,trip_id,stop_id,date,Median_trip,occupancy
107,138429,1353.0,2019-10-05,0.0,1.0
108,138429,17.0,2019-10-05,0.0,1.0
109,138429,742.0,2019-10-05,0.0,1.0
110,138429,18.0,2019-10-05,0.0,1.0
111,138429,19.0,2019-10-05,0.0,1.0


In [38]:
df_anomaly_higher_75_percentile_weekends_stops_dates = df4_anomaly_higher_75_percentile_weekends.groupby(['trip_id','stop_id'])['occupancy'].describe()
df_anomaly_higher_75_percentile_weekends_stops_dates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,stop_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
133084,12.0,1.0,5.0,,5.0,5.0,5.0,5.0,5.0
133084,17.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,18.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,19.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0
133084,21.0,4.0,5.5,0.57735,5.0,5.0,5.5,6.0,6.0


In [92]:
z_75_percentile_weekdends = df4_anomaly_higher_75_percentile_weekends.groupby(['trip_id','stop_id'])['occupancy'].mean()
z_75_percentile_weekdends.head()

trip_id  stop_id
133084   12.0       5.0
         17.0       5.5
         18.0       5.5
         19.0       5.5
         21.0       5.5
Name: occupancy, dtype: float64

In [93]:
Z_score_75_percentile_weekdends = pd.DataFrame(ss.zscore(z_75_percentile_weekdends, ddof=1), z_75_percentile_weekdends.index)
Z_score_75_percentile_weekdends.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
133084,12.0,-1.144058
133084,17.0,-1.035522
133084,18.0,-1.035522
133084,19.0,-1.035522
133084,21.0,-1.035522


### Occupancy higher than the 90th percentile

In [40]:
df4_anomaly_higher_90_percentile_weekends = df4[Trips_Occup_higher_than_90thpercentile_weekend ]
df4_anomaly_higher_90_percentile_weekends.shape

(95422, 24)

In [41]:
df4_anomaly_higher_90_percentile_weekends.head()

Unnamed: 0,trip_id,arrival_time,stop_id,stop_sequence,stop_lat,stop_lon,route_id,direction_id,board_count,alight_count,...,date_time,trip_start_time,day_of_week,trip_date,hour,Mean_trip,Median_trip,percentile_75_trip,percentile_90_trip,max_trip
212,138429,06:09:20,26.0,27.0,35.027553,-85.31066,1,0.0,2.0,0.0,...,2020-01-11 06:09:20,05:51:00,5.0,2020-01-11,6.0,0.352357,0.0,0.0,2.0,4.0
213,138429,06:09:40,27.0,28.0,35.026735,-85.311025,1,0.0,0.0,0.0,...,2020-01-11 06:09:40,05:51:00,5.0,2020-01-11,6.0,0.352357,0.0,0.0,2.0,4.0
214,138429,06:10:00,28.0,29.0,35.025917,-85.311487,1,0.0,0.0,0.0,...,2020-01-11 06:10:00,05:51:00,5.0,2020-01-11,6.0,0.352357,0.0,0.0,2.0,4.0
215,138429,06:15:00,44.0,30.0,35.010633,-85.325823,1,0.0,0.0,0.0,...,2020-01-11 06:15:00,05:51:00,5.0,2020-01-11,6.0,0.352357,0.0,0.0,2.0,4.0
216,138429,06:30:00,95.0,31.0,34.989068,-85.319482,1,0.0,0.0,0.0,...,2020-01-11 06:30:00,05:51:00,5.0,2020-01-11,6.0,0.352357,0.0,0.0,2.0,4.0


In [42]:
df_anomaly_higher_90_percentile_weekends_stops_dates = df4_anomaly_higher_90_percentile_weekends.groupby(['trip_id','stop_id'])['occupancy'].describe()
df_anomaly_higher_90_percentile_weekends_stops_dates.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
trip_id,stop_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
133084,23.0,1.0,7.0,,7.0,7.0,7.0,7.0,7.0
133084,24.0,1.0,7.0,,7.0,7.0,7.0,7.0,7.0
133084,25.0,1.0,7.0,,7.0,7.0,7.0,7.0,7.0
133084,26.0,1.0,7.0,,7.0,7.0,7.0,7.0,7.0
133084,27.0,1.0,7.0,,7.0,7.0,7.0,7.0,7.0


In [94]:
z_90_percentile_weekdends = df4_anomaly_higher_90_percentile_weekends.groupby(['trip_id','stop_id'])['occupancy'].mean()
z_90_percentile_weekdends.head()

trip_id  stop_id
133084   23.0       7.0
         24.0       7.0
         25.0       7.0
         26.0       7.0
         27.0       7.0
Name: occupancy, dtype: float64

In [95]:
Z_score_90_percentile_weekdends = pd.DataFrame(ss.zscore(z_90_percentile_weekdends, ddof=1), z_90_percentile_weekdends.index)
Z_score_90_percentile_weekdends.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
trip_id,stop_id,Unnamed: 2_level_1
133084,23.0,-1.014479
133084,24.0,-1.014479
133084,25.0,-1.014479
133084,26.0,-1.014479
133084,27.0,-1.014479
