## Impute cruise time data using neighboring historical cruise time

At time interval t, if taxi zone z has no cruise time data, we look up the cruise times of the taxi zones adjacent to z at the same interval. The average of those adjacent values is used as the cruise time of zone z at time interval t. If none of the adjacent zones has data at t, the value for zone z stays missing.
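
As a small illustration of the rule above, the following sketch imputes a toy table by hand. The data, the zone ids, and the `neighbors` dictionary are hypothetical and only serve the example. It also reads neighbor values from the untouched input, which is a simplifying assumption rather than a description of the function defined later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are time intervals, columns are taxi zone ids.
cruise = pd.DataFrame(
    {1: [4.0, 6.0], 2: [np.nan, 8.0], 3: [10.0, np.nan]},
    index=pd.Index([0, 1], name="t"),
)

# Hypothetical adjacency: zone 2 borders zones 1 and 3.
neighbors = {1: [2], 2: [1, 3], 3: [2]}

imputed = cruise.copy()
for t in cruise.index:
    for z in cruise.columns:
        if pd.isna(cruise.loc[t, z]):
            # Average the neighbor values at the same interval; mean() skips NaN.
            imputed.loc[t, z] = cruise.loc[t, neighbors[z]].mean()

print(imputed)
```

Here zone 2 at interval 0 receives the mean of zones 1 and 3, and zone 3 at interval 1 receives the value of its only neighbor, zone 2.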

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [4]:
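def impute_from_adjacent_zones(cruise_time_df, adjacent_zone, cruise_time_col, time_interval):

    ## pivot to a (time interval x taxi zone) table of cruise times
    cruise_df_imputed = cruise_time_df.pivot_table(index='dropoff_datetime_index',
                                                   columns='taxizone_id',
                                                   values=cruise_time_col).copy()

    ## look up the adjacent zones of each zone once, keeping only zones that
    ## exist as columns (missing labels in .loc trigger the pandas warning
    ## and raise KeyError in newer pandas versions)
    taxi_zone_list = list(cruise_df_imputed.columns)
    timestamp_list = list(cruise_df_imputed.index)
    adjacent_lookup = {}
    for z in taxi_zone_list:
        neighbors = adjacent_zone.loc[adjacent_zone['zone1'] == z, 'zone2']
        adjacent_lookup[z] = [n for n in neighbors if n in cruise_df_imputed.columns]

    ## at each time interval, loop through each zone and, if its cruise time
    ## is NaN, replace it with the mean of its adjacent zones
    ## (values filled earlier in the same interval can feed later zones)
    for t in timestamp_list:
        for z in taxi_zone_list:
            if pd.isna(cruise_df_imputed.loc[t, z]) and adjacent_lookup[z]:
                ## mean() skips NaN neighbors; the result stays NaN if all are missing
                cruise_df_imputed.loc[t, z] = cruise_df_imputed.loc[t, adjacent_lookup[z]].mean()

    ## unpivot back to long format (stack drops cells that are still NaN)
    cruise_df_imputed = cruise_df_imputed.stack().reset_index(name=cruise_time_col)

    ## compute interval from the imputed cruise time
    cruise_df_imputed[cruise_time_col + '_INT'] = cruise_df_imputed[cruise_time_col] // time_interval
    return cruise_df_imputed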
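
The double loop above visits every (time interval, zone) cell, which can be slow for the 1-minute data. A possible vectorized variant is sketched below, assuming the same `zone1`/`zone2` adjacency columns. The function name `impute_adjacent_vectorized` is hypothetical. It is not identical to the loop: it averages only originally observed values (never values imputed earlier in the same pass) and counts each adjacent zone once even if the adjacency file lists it twice.

```python
import numpy as np
import pandas as pd


def impute_adjacent_vectorized(wide, adjacent_zone):
    """Fill NaN cells of a (time x zone) frame with the mean of observed neighbors."""
    zones = list(wide.columns)
    pos = {z: i for i, z in enumerate(zones)}

    # adj[i, j] = 1 when zone j is listed as adjacent to zone i.
    adj = np.zeros((len(zones), len(zones)))
    for z1, z2 in zip(adjacent_zone['zone1'], adjacent_zone['zone2']):
        if z1 in pos and z2 in pos:  # ignore zones without a column
            adj[pos[z1], pos[z2]] = 1.0

    values = wide.to_numpy(dtype=float)
    observed = ~np.isnan(values)

    # For every (time, zone) cell: sum and count of observed neighbor values.
    neighbor_sum = np.where(observed, values, 0.0) @ adj.T
    neighbor_cnt = observed.astype(float) @ adj.T
    with np.errstate(invalid='ignore', divide='ignore'):
        neighbor_mean = neighbor_sum / neighbor_cnt  # NaN where no neighbor has data

    neighbor_mean = pd.DataFrame(neighbor_mean, index=wide.index, columns=wide.columns)
    # fillna only touches cells that are NaN in `wide`.
    return wide.fillna(neighbor_mean)


# Hypothetical toy input: two time intervals, three zones.
wide = pd.DataFrame({1: [4.0, 6.0], 2: [np.nan, 8.0], 3: [10.0, np.nan]})
adjacent_zone = pd.DataFrame({'zone1': [1, 2, 2, 3], 'zone2': [2, 1, 3, 2]})
print(impute_adjacent_vectorized(wide, adjacent_zone))
```

The input here is the pivoted table, so the pivot and unpivot steps of `impute_from_adjacent_zones` would still be needed around it.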

In [3]:
adjacent_zone = pd.read_csv('../data/adjacent_zone.csv')
time_interval_list = [60, 30, 15, 10, 5, 1]
for it in time_interval_list:
    current_cruise_time_df = pd.read_csv('../data/cruise_time_{}m.csv'.format(int(it)))
    current_cruise_time_df = current_cruise_time_df.astype({'dropoff_datetime_index': 'int32'})
    
    cruise_imputed = impute_from_adjacent_zones(current_cruise_time_df,
                                                adjacent_zone, 
                                                'med_cruise_time', it)
    saved_path = '../data/cruise_time_imputed_{}m.csv'.format(int(it))
    cruise_imputed.to_csv(saved_path, index=False)
    print('saved at ', saved_path)

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return getattr(section, self.name)[new_key]


saved at  ../data/cruise_time_1m_imputed.csv
saved at  ../data/cruise_time_5m_imputed.csv
saved at  ../data/cruise_time_10m_imputed.csv
saved at  ../data/cruise_time_15m_imputed.csv
saved at  ../data/cruise_time_30m_imputed.csv
saved at  ../data/cruise_time_60m_imputed.csv
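
The pandas warning in the output above appears when `.loc` receives a list of labels and some of them are missing, which can happen when an adjacent zone has no column in the pivoted table. A minimal sketch of the `.reindex()` alternative that the warning recommends, using a hypothetical row and a hypothetical neighbor list:

```python
import numpy as np
import pandas as pd

# Hypothetical cruise times for one time interval, indexed by zone id.
row = pd.Series({1: 4.0, 2: np.nan, 3: 10.0})

# Hypothetical neighbor list; zone 99 has no entry in the data.
adjacent_list = [1, 3, 99]

# row.loc[adjacent_list] raises KeyError in pandas 1.0 and later because 99 is missing.
# reindex returns NaN for missing labels instead, and mean() skips NaN by default.
neighbor_mean = row.reindex(adjacent_list).mean()
print(neighbor_mean)
```

With `reindex`, a missing neighbor is treated the same way as a neighbor without data, so it does not affect the average.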
