# Nashville Bus Occupancy Data Processing Steps

### **NOTE:** Convert this Jupyter Notebook to a .py file if you want to reproduce the same results on your machine.

The portions of the notebook that use multiprocessing may not work when run inside Jupyter Notebook.

In [None]:
import pandas as pd
import datetime as dt
from multiprocessing import Pool, cpu_count

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Load Nashville APC Data 

This data comes from Teams. You can find it at: General > covid-19 > Datasets >  WeGO-DATA > data-used-for-analysis > **Nashville APC**

The original data comes from  WeGO-DATA > **APC Data with check type flag.xlsx** and each Excel sheet was exported as a CSV. For example, the sheet called "January" in APC Data with check type flag.xlsx corresponds to nashville_apc_jan.csv.

Alternatively, you can use pd.read_excel if you want to work with the Excel sheet directly. However, reading from a CSV is much faster than reading from Excel.

In [None]:
jan_df = pd.read_csv('nashville_apc_jan.csv')
feb_df = pd.read_csv('nashville_apc_feb.csv')
mar_df = pd.read_csv('nashville_apc_mar.csv')
apr_df = pd.read_csv('nashville_apc_apr.csv')
may_df = pd.read_csv('nashville_apc_may.csv')
jun_df = pd.read_csv('nashville_apc_jun.csv')

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
apc_df = pd.concat([jan_df, feb_df, mar_df, apr_df, may_df, jun_df])

apc_df = apc_df.reset_index(drop=True)
print(apc_df.shape)
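The six reads and the concatenation above can also be written as a loop. This is a minimal sketch, assuming the monthly files follow the `nashville_apc_<month>.csv` naming used in this notebook; the helper name `load_monthly_apc` and the `data_dir` parameter are hypothetical.

```python
import os

import pandas as pd

# Month abbreviations used in the CSV file names in this notebook.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']


def load_monthly_apc(data_dir='.', months=MONTHS):
    """Read each monthly APC CSV and stack them into one DataFrame.

    `data_dir` is a hypothetical parameter: the folder that holds the CSVs.
    """
    frames = [
        pd.read_csv(os.path.join(data_dir, f'nashville_apc_{month}.csv'))
        for month in months
    ]
    # ignore_index=True gives a fresh 0..n-1 index, like reset_index(drop=True).
    return pd.concat(frames, ignore_index=True)
```

Calling `load_monthly_apc()` from the folder that contains the CSVs should produce the same rows as the cells above.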

In [None]:
# rename columns
apc_df.columns = ['initial_load', 'apc_stop_id', 'stop_number', 'e_time', 'actual_arrival_time', 'actual_depart_time',
                 'scheduled_arrival_time', 'scheduled_departure_time', 'sequence', 'board_count', 'alight_count',
                 'ride_check_type', 'line', 'block_name', 'bus_number', 'service_id', 'ride_check_date', 'pattern',
                 'pattern_id', 'apc_trip_id', 'apc_lat', 'apc_lon', 'stop_abbr', 'apc_stop_name', 'ride_check_mode']
apc_df.head(2)

In [None]:
# filter for where ride_check_mode = 2 --- resulting df should have 1121778 rows
apc_df = apc_df.loc[apc_df['ride_check_mode'] == 2]
apc_df.shape

## Load in Nashville GTFS

This data originally comes from TransitFeeds: https://transitfeeds.com/p/nashville-mta/220. 

Relevant GTFS files (stops.txt, stop_times.txt, and trips.txt) from GTFS feeds in service between January 2020 - June 2020 were joined and combined into one file. This file is in Teams under General > covid-19 > Datasets > WeGO-DATA > data-used-for-analysis > Nashville GTFS (static and realtime) > **gtfs.csv**

In [None]:
gtfs_df = pd.read_csv('gtfs.csv', index_col=0)
gtfs_df.head(2)

## Prepare APC Dataset for Join with GTFS Dataset

Create a gtfs_start_date column (gtfs_start_date = date on which the GTFS feed in service at the time was published). For example, an APC entry on 2020-01-01 in Nashville would have a gtfs_start_date of 2019-11-13.

In [None]:
# convert ride_check_date to YYYY-MM-DD format
apc_df['ride_check_date'] = pd.to_datetime(apc_df['ride_check_date']).dt.date
apc_df['ride_check_date'] = apc_df['ride_check_date'].astype(str)

In [None]:
def get_gtfs_start_date(date):
    if date < '2020-01-24':
        return '2019-11-13'
    elif date < '2020-03-29':
        return '2020-01-24'
    elif date < '2020-03-30':
        return '2020-03-29'
    elif date < '2020-05-22':
        return '2020-03-30'
    elif date < '2020-05-29':
        return '2020-05-22'
    elif date < '2020-06-13':
        return '2020-05-29'
    elif date < '2020-07-01': # last date in APC dataset is 2020-06-30
        return '2020-06-13'
    else: 
        return ''

In [None]:
# only one column is needed, so apply on the Series instead of row by row
apc_df['gtfs_start_date'] = apc_df['ride_check_date'].apply(get_gtfs_start_date)
apc_df.head(2)
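The if/elif chain above is a lookup of the most recent feed start date on or before a given day. A minimal sketch of the same lookup using a sorted list and `bisect` follows; the names `FEED_START_DATES` and `lookup_gtfs_start_date` are hypothetical, and the dates are copied from `get_gtfs_start_date`. It relies on YYYY-MM-DD strings sorting in date order.

```python
from bisect import bisect_right

# Feed start dates copied from get_gtfs_start_date, in ascending order.
FEED_START_DATES = ['2019-11-13', '2020-01-24', '2020-03-29', '2020-03-30',
                    '2020-05-22', '2020-05-29', '2020-06-13']
# Dates on or after this cutoff return '' (same as the else branch above).
CUTOFF = '2020-07-01'


def lookup_gtfs_start_date(date):
    """Return the start date of the feed in service on `date` (YYYY-MM-DD string)."""
    if date >= CUTOFF:
        return ''
    # Index of the last feed start date that is <= date.
    idx = bisect_right(FEED_START_DATES, date) - 1
    # Dates before the first feed fall back to the first feed, as in the chain above.
    return FEED_START_DATES[max(idx, 0)]
```

With this helper, adding a new feed only requires appending one date to the list.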

## Join APC and GTFS Datasets

The APC dataset only contains rows where board_count or alight_count > 0. To get occupancy values along a route, we need to fill in these "missing" rows. 

**Steps to calculate occupancy:**

1. For each trip (identified as a unique combo of trip_id and date), create the following: 
    - dataframe #1: df with all stops on the trip (using GTFS)
    - dataframe #2: associated board and alight counts that exist in the APC data
2. Create the "full" dataframe for the trip:
    - SELECT * FROM dataframe1 LEFT JOIN dataframe2 ON dataframe1.trip_id = dataframe2.trip_id AND dataframe1.stop_id = dataframe2.stop_id AND dataframe1.gtfs_start_date = dataframe2.gtfs_start_date
3. Fill resulting nan values from LEFT join:
    - fill in board count and alight count with 0 (these are the "missing" rows)
    - fill in fields that stay constant along a trip (e.g., initial_load, block_name, bus_number)
4. Fix timestamps
    - convert e_time and arrival_time to HH:MM format 
5. Determine which rows meet threshold and filter out rows above threshold
    - determine time difference between e_time (scheduled arrival time in APC) and arrival_time (scheduled arrival time in GTFS)
    - if time difference is above 10 min, mark as DROP, otherwise mark as KEEP
6. Calculate occupancy at a given stop using a cumulative sum function
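The join, fill, and cumulative-sum steps can be illustrated on a toy trip. This is a sketch with made-up data (one trip `T1` with four stops, APC rows at only two of them), not output from the real dataset, and it skips the timestamp check in steps 4 and 5.

```python
import pandas as pd

# dataframe #1: every stop on the trip (stand-in for the GTFS rows)
stops = pd.DataFrame({
    'trip_id': ['T1'] * 4,
    'stop_id': ['A', 'B', 'C', 'D'],
    'stop_sequence': [1, 2, 3, 4],
})

# dataframe #2: APC rows exist only where someone boarded or alighted
apc = pd.DataFrame({
    'apc_trip_id': ['T1', 'T1'],
    'stop_abbr': ['A', 'C'],
    'initial_load': [2, 2],
    'board_count': [3, 1],
    'alight_count': [0, 2],
})

# step 2: LEFT join keeps every stop, with NaN where APC has no row
full = stops.merge(apc, left_on=['trip_id', 'stop_id'],
                   right_on=['apc_trip_id', 'stop_abbr'], how='left')
full = full.sort_values('stop_sequence').reset_index(drop=True)

# step 3: missing counts are 0; trip-level fields are copied from existing rows
full[['board_count', 'alight_count']] = full[['board_count', 'alight_count']].fillna(0)
full['initial_load'] = full['initial_load'].ffill().bfill()

# step 6: occupancy = initial load + running sum of (boardings - alightings)
full['occupancy'] = full['initial_load'] + (full['board_count'] - full['alight_count']).cumsum()

print(full[['stop_id', 'board_count', 'alight_count', 'occupancy']])
```

Stops B and D have no APC row, so they carry forward the occupancy of the previous stop.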

#### Some apc_trip_ids in APC will not have corresponding matches in GTFS

These trip_ids are:

{204458,
 204459,
 204460,
 219293,
 219294,
 219295,
 219296,
 224376,
 224377,
 224378,
 224379,
 224381,
 224382,
 224383,
 224384,
 225263,
 225264,
 225265}
 
These trips are not on routes used by riders. They all belong to route 99, which is not found in GTFS.
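A list like the one above can be recomputed with a set difference. This is a minimal sketch, assuming the column names used in this notebook (`apc_trip_id` in APC, `trip_id` in GTFS); the function name is hypothetical.

```python
import pandas as pd


def apc_trip_ids_missing_from_gtfs(apc, gtfs):
    """Return the set of APC trip ids that never appear in the GTFS table."""
    # Cast both sides to str so that 204458 and '204458' compare equal.
    apc_ids = set(apc['apc_trip_id'].astype(str))
    gtfs_ids = set(gtfs['trip_id'].astype(str))
    return apc_ids - gtfs_ids
```

For example, `apc_trip_ids_missing_from_gtfs(apc_df, gtfs_df)` would return the ids as strings.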

In [None]:
def get_occupancy_at_all_stops_on_trip(group):
    """
    calculate occupancy at all stops on the trip

    :param group: a pandas DataFrame of all the apc readings for a unique trip
    :return: a pandas DataFrame with occupancy filled in for all stops in the trip
    """
    
    group = group.reset_index()

    trip_id = group.apc_trip_id[0]
    date = group.ride_check_date[0]
    gtfs_start_date = group.gtfs_start_date[0]

    # create dataframe #1: get all stops on trip
    # make sure that stops are sorted by order of stop sequence
    all_stops_on_trip = gtfs_df.loc[(gtfs_df['trip_id'] == trip_id)
                                    & (gtfs_df['gtfs_start_date'] == gtfs_start_date)].sort_values(
        ['stop_sequence']).copy()

    # create dataframe #2: get all apc board/alight counts sorted by sequence
    apc_for_trip = apc_df.loc[(apc_df['apc_trip_id'] == trip_id)
                              & (apc_df['ride_check_date'] == date)
                              & (apc_df['gtfs_start_date'] == gtfs_start_date)].copy()
    apc_for_trip = apc_for_trip.sort_values('sequence')

    # create "full" df for trip via LEFT join
    all_stops_on_trip['trip_id'] = all_stops_on_trip['trip_id'].astype(str)
    all_stops_on_trip['gtfs_start_date'] = all_stops_on_trip['gtfs_start_date'].astype(str)
    all_stops_on_trip['stop_id'] = all_stops_on_trip['stop_id'].astype(str)

    apc_for_trip['apc_trip_id'] = apc_for_trip['apc_trip_id'].astype(str)
    apc_for_trip['gtfs_start_date'] = apc_for_trip['gtfs_start_date'].astype(str)
    apc_for_trip['stop_abbr'] = apc_for_trip['stop_abbr'].astype(str)

    # left join
    combined = all_stops_on_trip.merge(apc_for_trip, left_on=['trip_id', 'gtfs_start_date', 'stop_id'],
                                       right_on=['apc_trip_id', 'gtfs_start_date', 'stop_abbr'],
                                       how='left')
    combined['stop_sequence'] = combined['stop_sequence'].astype(int)
    combined = combined.sort_values('stop_sequence').reset_index(drop=True) # has to be sorted for occupancy calculation

    # fill in missing values
    filled = fill_nan_values(combined)
    
    # fix timestamps and get rid of duplicate stops
    filled['e_time_fixed'] = filled.apply(lambda row: fix_times(row['e_time']), axis=1)
    filled['arrival_time_fixed'] = filled.apply(lambda row: fix_times(row['arrival_time']), axis=1)
    
    # use fixed timestamps to validate data join
    # mark rows where e_time (scheduled arrival time from APC data) is more than 10 minutes away from arrival_time (scheduled arrival time in GTFS)
    filled['keep'] = filled.apply(lambda row: flag_rows_to_keep(row['arrival_time_fixed'], row['e_time_fixed']), axis=1)
    
    # filter out rows above threshold
    df_keep = filled.loc[filled['keep'] == 'KEEP']
    
    # calc occupancy
    occupancy_df = calc_bus_occupancy(df_keep)

    return occupancy_df

In [None]:
def fix_times(time):
    """
    convert time to HH:MM format
    
    :param time: a time value (either e_time or arrival_time)
    :return: time converted to HH:MM format
    """
    time = str(time)
    if time == 'nan':
        return time
    # replace any whitespace with 0
    fixed_time = time.replace(' ', '0')

    # take first 5 characters for HH:MM format
    fixed_time = fixed_time[0:5]

    # fix times past midnight or ones that are missing a 0
    if (fixed_time > '24') or (fixed_time[0:2] == '0:'):
        fixed_time = '00:' + time[-2:]

    return fixed_time

In [None]:
def flag_rows_to_keep(arrival_time_fixed, e_time_fixed):
    """
    determine if a row should be kept in the final dataframe or not
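    a row is flagged as 'KEEP' if the difference between arrival_time_fixed and e_time_fixed is <= 10 min
    otherwise, it is flagged as 'DROP'
    
    :param arrival_time_fixed: scheduled arrival time from GTFS in HH:MM format
    :param e_time_fixed: scheduled arrival time from APC in HH:MM format, or 'nan' if there is no APC entry
    :return: 'KEEP' or 'DROP'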
    """
    
    # automatically keep rows that are nan (this means that no APC entry was found for this stop)
    if e_time_fixed == 'nan':
        return 'KEEP'
    
    # mark rows that have more than 10 min between the two times
    else:
        arrival_gtfs = dt.datetime.strptime(arrival_time_fixed, '%H:%M')
        arrival_apc = dt.datetime.strptime(e_time_fixed, '%H:%M')

        if arrival_gtfs < arrival_apc:
            diff = arrival_apc - arrival_gtfs
        else:
            diff = arrival_gtfs - arrival_apc

        # threshold is 10 minutes
        if diff.seconds > 600:
            return 'DROP'
        else:
            return 'KEEP'
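The threshold check reduces to the absolute gap between two HH:MM strings. A minimal sketch of that calculation follows; the helper name `minutes_apart` is hypothetical. Because the strings carry no date, two times on either side of midnight (for example 23:55 and 00:02) come out far apart, so such rows would be marked DROP by a 10-minute threshold.

```python
import datetime as dt


def minutes_apart(time_a, time_b):
    """Absolute difference in minutes between two HH:MM strings on the same day."""
    a = dt.datetime.strptime(time_a, '%H:%M')
    b = dt.datetime.strptime(time_b, '%H:%M')
    return abs((a - b).total_seconds()) / 60


# A gap of exactly 10 minutes is not above the threshold, so the row is kept.
print(minutes_apart('08:00', '08:10') > 10)
# Times on either side of midnight are treated as nearly a day apart.
print(minutes_apart('23:55', '00:02'))
```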

In [None]:
def fill_nan_values(df):
    """
    fill nan values in the dataframe that result from the left join
    
    :param df: the pandas DataFrame after APC and GTFS data have been combined
    """
    filled = df.copy()

    # missing board and alight counts are all 0
    filled[['board_count', 'alight_count']] = filled[['board_count', 'alight_count']].fillna(0)

    # otherwise, fill in missing information from existing apc rows
    constant_cols = ['initial_load', 'line', 'block_name', 'bus_number',
                     'service_id', 'ride_check_date', 'pattern', 'pattern_id', 'apc_trip_id']
    # fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are equivalent
    filled[constant_cols] = filled[constant_cols].ffill().bfill()
    return filled

In [None]:
def calc_bus_occupancy(df):
    """
    calculate occupancy at each stop along a route

    :param df: dataframe with board/alight values for all stops along a single trip
    """
    tmp = df.copy()

    tmp['initial_load'] = pd.to_numeric(tmp['initial_load'], errors='coerce')
    tmp['board_count'] = pd.to_numeric(tmp['board_count'], errors='coerce')
    tmp['alight_count'] = pd.to_numeric(tmp['alight_count'], errors='coerce')

    # calc occupancy net change
    tmp['occupancy_net_change'] = tmp['board_count'] - tmp['alight_count']

    # calc cumulative sum in occupancy net change as an intermediate step
    tmp_sum_df = pd.DataFrame(tmp['occupancy_net_change'].cumsum())
    tmp_sum_df.columns = ['tmp_sum']

    # merge tmp sum (cumulative sum) into tmp
    tmp = tmp.merge(tmp_sum_df, left_index=True, right_index=True)

    # calc occupancy for a particular stop
    tmp['occupancy'] = tmp['tmp_sum'] + tmp['initial_load']

    return tmp.drop(columns=['tmp_sum', 'occupancy_net_change'])

### The following cells use multiprocessing, which may not work with Jupyter Notebook

Convert this .ipynb to a .py file and run the .py file instead. Processing the entire APC dataset takes approximately 2 hours with 12 cores, and longer with fewer cores. You can check how many cores are available on your machine with cpu_count().
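Before committing to a full run, it can help to test the per-trip function serially on a few groups. This is a minimal sketch of a group-wise apply with an optional serial path, assuming the worker takes one group DataFrame and returns a DataFrame; `summarize_group`, `apply_to_groups`, and the `processes` parameter are hypothetical names, and the toy worker stands in for the real per-trip function.

```python
from multiprocessing import Pool, cpu_count

import pandas as pd


def summarize_group(group):
    """Toy per-group worker (hypothetical): total boardings for one trip."""
    return pd.DataFrame({
        'apc_trip_id': [group['apc_trip_id'].iloc[0]],
        'total_boardings': [group['board_count'].sum()],
    })


def apply_to_groups(df_grouped, func, processes=None):
    """Apply func to each group; processes=1 runs serially, which is easier to debug."""
    groups = [group for _, group in df_grouped]
    if processes == 1:
        results = [func(group) for group in groups]
    else:
        # The worker must be defined at module level so it can be pickled.
        with Pool(processes or cpu_count()) as pool:
            results = pool.map(func, groups)
    return pd.concat(results, ignore_index=True)


if __name__ == '__main__':
    toy = pd.DataFrame({'apc_trip_id': [1, 1, 2], 'board_count': [3, 1, 5]})
    print(apply_to_groups(toy.groupby('apc_trip_id'), summarize_group, processes=1))
```

The serial path also works inside a notebook, which makes it a convenient way to check a sample of trips before running the parallel version from a .py file.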

In [None]:
# do this in parallel (otherwise, this would take several days to run)
def apply_parallel(df_grouped, get_full_dataframe_for_trip):
    # the with-block shuts the pool down on exit, so no explicit close()/join() is needed
    with Pool(cpu_count()) as p:
        ret_list = p.map(get_full_dataframe_for_trip, [group for name, group in df_grouped])
    return pd.concat(ret_list)

In [None]:
def main(): 
    df_grouped = apc_df.groupby(['apc_trip_id', 'ride_check_date', 'gtfs_start_date'])

    start = dt.datetime.now()  # datetime is imported as dt at the top of the notebook
    parallel_result = apply_parallel(df_grouped, get_occupancy_at_all_stops_on_trip)
    end = dt.datetime.now()
    print("time elapsed:", end - start)

    parallel_result.to_csv('nashville_bus_occupancy_jan20_through_jun20_update.csv')

In [None]:
if __name__ == '__main__':
    main()

# TO DO: 
- double check that this code works (run it on a sample of trips and manually confirm)
- make sure that the sample has trips that run in a LOOP (if you want to join trip directions.xlsx first and filter for routes that run in a LOOP, that will work)
    - confirm that trips that run in a loop do not have duplicate entries for first/last stops
- upload CSV to Teams under WeGO-Data > data-used-for-analysis > Nashville Bus Occupancy Dashboard (and update the README in the folder)


## Create Dataset Used for Nashville Dashboard

# TO DO (add the following code to this notebook): 
1. filter out unnecessary columns (see fields in dataprep/nashville/readme.asciidoc) -- these are the fields we want in the dataset

2. calculate the following fields: (see the code in Chattanooga data prep notebook if you need help)

  - date_time (use ride_check_date and arrival_time)
  - service_period (use service_id field and map to Weekday, Saturday, Sunday based on GTFS)
  - trip_start_time (find where stop_sequence == 1 for each trip in GTFS and join into this dataset)
  - trip_name (combo of trip_start_time and trip_id)
  - day_of_week (value of 0 through 6, where 0 = Monday, 6 = Sunday)
  - direction_desc (join trip directions.xlsx file in Teams under WeGO-Data)

3. export as CSV and post to Teams under WeGO-Data > data-used-for-analysis > Nashville Bus Occupancy Dashboard (and update the README in the folder)

4. convert to .pbz2 file and post to GitHub
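Two of the steps above can be sketched briefly: deriving `date_time` and `day_of_week`, and writing a .pbz2 file. This is a rough sketch under stated assumptions, not the dashboard's actual code: it assumes `arrival_time` is an HH:MM:SS string, that `day_of_week` should follow the service date in `ride_check_date`, and that ".pbz2" means a bz2-compressed pickle. The function names are hypothetical.

```python
import bz2
import pickle

import pandas as pd


def add_date_fields(df):
    """Add date_time and day_of_week (0 = Monday, 6 = Sunday) columns."""
    out = df.copy()
    service_date = pd.to_datetime(out['ride_check_date'])
    # Adding a timedelta (rather than parsing a clock time) keeps the result a
    # proper timestamp relative to the service date.
    out['date_time'] = service_date + pd.to_timedelta(out['arrival_time'].str.strip())
    out['day_of_week'] = service_date.dt.dayofweek
    return out


def save_pbz2(df, path):
    """Write df as a bz2-compressed pickle (assumed meaning of .pbz2)."""
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(df, f)


def load_pbz2(path):
    """Read back a DataFrame written by save_pbz2."""
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.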