<a href="https://colab.research.google.com/github/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/consolidate_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Consolidate Taxi and Weather Data

Moacir P. de Sá Pereira

This notebook builds a consolidated dataset that combines weather data and taxi data from New York. The taxi data are an hourly aggregation of intra-Manhattan trips by yellow cabs and Uber-like services between 2019-01-01 and 2024-08-31, limited to trips of under two hours and under ten miles. The taxi data are preprocessed by https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/pre_process_taxi_data.ipynb

The weather data are hourly weather data collected from the KNYC0 weather station in Central Park, for a timespan similar to that of the taxi data. The data are preprocessed by https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/pre_process_weather_data.ipynb

This notebook limits the data to the hours from 2019-02-01 through 2024-06-25 (the `start_datetime` and `end_datetime` values set below); the end date reflects the extent of the weather data.

It creates a blank dataframe that includes a row for each hour of each day of interest and then merges the weather and taxi data into that blank dataframe.
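
As a rough illustration of this grid-then-merge pattern, the sketch below builds a four-hour grid and left-merges two tiny tables onto it. The names `grid`, `toy_taxi`, `toy_weather`, and `toy`, and all values in them, are hypothetical and are not taken from the project data. It is only meant to show that hours missing from either source survive as rows with NaN values.

```python
import pandas as pd

# Hypothetical four-hour grid; the notebook itself builds a multi-year one.
grid = pd.DataFrame(
    {"datetime": pd.date_range("2024-01-01 00:00", "2024-01-01 03:00", freq="h")}
)
grid["date"] = grid["datetime"].dt.date
grid["hour"] = grid["datetime"].dt.hour

day = pd.Timestamp("2024-01-01").date()

# Made-up inputs: the taxi table lacks hour 2, the weather table lacks hour 1.
toy_taxi = pd.DataFrame(
    {"date": [day] * 3, "hour": [0, 1, 3], "trip_count": [10, 12, 9]}
)
toy_weather = pd.DataFrame(
    {"date": [day] * 3, "hour": [0, 2, 3], "temperature": [1.0, 0.5, 0.0]}
)

# Left merges keep every grid row; unmatched hours get NaN.
toy = grid.merge(toy_taxi, on=["date", "hour"], how="left").merge(
    toy_weather, on=["date", "hour"], how="left"
)
print(toy)
```

Because the grid is the left side of both merges, every hour keeps its row, which is what makes the later NA handling and lagged differences straightforward.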

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/drive/')

In [2]:
start_datetime = '2019-02-01 00:00:00'
end_datetime = '2024-06-25 23:00:00'

date_hour_grid = pd.date_range(start=start_datetime, end=end_datetime, freq='h')
merged_df = pd.DataFrame({'datetime': date_hour_grid})

merged_df['date'] = merged_df['datetime'].dt.date
merged_df['hour'] = merged_df['datetime'].dt.hour
merged_df["day_of_week"] = merged_df.datetime.dt.day_of_week
merged_df["month"] = merged_df.datetime.dt.month
merged_df["year"] = merged_df.datetime.dt.year

taxi_df = pd.read_parquet(
    "https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/raw/refs/heads/main/data/taxi-data/complete_hourly.parquet"
)
weather_df = pd.read_parquet(
    "https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/raw/refs/heads/main/data/GHCNh/GHCNh_USW00094728_2019_to_2024.parquet"
)

In [3]:
df = merged_df.merge(taxi_df, on=['date', 'hour'], how='left').merge(weather_df, on=['date', 'hour'], how='left')
df.set_index('datetime', inplace=True)
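
A left merge silently adds rows if the right-hand table repeats a `date`/`hour` key, which would break the one-row-per-hour assumption. One possible safeguard, sketched here on made-up data, is pandas' `validate` argument to `merge`. The helper `left_merge_checked` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def left_merge_checked(grid, other, keys):
    """Left-merge `other` onto `grid`, rejecting duplicated merge keys."""
    # validate="one_to_one" raises pandas.errors.MergeError if `keys`
    # are not unique on either side, so the grid cannot gain rows.
    return grid.merge(other, on=keys, how="left", validate="one_to_one")


# Made-up data: one clean table and one with a duplicated (date, hour) key.
toy_grid = pd.DataFrame({"date": ["2024-01-01"] * 3, "hour": [0, 1, 2]})
toy_ok = pd.DataFrame(
    {"date": ["2024-01-01"] * 2, "hour": [0, 2], "trip_count": [10, 9]}
)
toy_dupe = pd.DataFrame(
    {"date": ["2024-01-01"] * 2, "hour": [0, 0], "trip_count": [10, 11]}
)

checked = left_merge_checked(toy_grid, toy_ok, ["date", "hour"])

try:
    left_merge_checked(toy_grid, toy_dupe, ["date", "hour"])
    dupes_rejected = False
except pd.errors.MergeError:
    dupes_rejected = True
```

Whether such a check is needed depends on how the preprocessing notebooks aggregate their output; if they already guarantee one row per hour, it is only a cheap assertion.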

## Handle NA Values

In [4]:
# Zero out company NAs
for company in ["uber", "yellow", "lyft", "juno", "via"]:
  df[company] = df[company].fillna(0)

# Interpolate other taxi NAs
for col in [
    'trip_count', 'trip_duration_mean', 'trip_duration_median',
    'trip_duration_std_dev', 'trip_duration_1Q', 'trip_duration_3Q',
    'trip_distance_mean', 'trip_distance_median', 'trip_distance_std_dev',
    'trip_distance_1Q', 'trip_distance_3Q', 'half_mile_trips',
    'one_mile_trips', 'two_mile_trips', 'three_mile_trips',
    'five_mile_trips'
]:
  df[col] = df[col].interpolate(method='linear', limit_direction='both')

In [5]:
# Zero out weather NAs
for col in ["wind_speed", "precipitation", "cloud_cover"]:
  df[col] = df[col].fillna(0)

# Interpolate weather NAs
for col in ["temperature", "dew_point_temperature", "relative_humidity"]:
  df[col] = df[col].interpolate(method='linear', limit_direction='both')
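
To make the difference between the two strategies concrete, here is a small sketch on made-up values (the frame `toy` and its numbers are hypothetical). Zero-filling treats a missing reading as "nothing happened", while linear interpolation draws a straight line between neighbouring observations. With `limit_direction='both'`, a gap at the very start or end of the series is filled with the nearest observed value.

```python
import numpy as np
import pandas as pd

# Made-up readings with a leading gap and an interior gap.
toy = pd.DataFrame(
    {
        "precipitation": [np.nan, 0.2, np.nan, 0.0],
        "temperature": [np.nan, 10.0, np.nan, 14.0],
    }
)

# Missing precipitation is treated as zero.
zero_filled = toy["precipitation"].fillna(0)

# Missing temperature is interpolated; the interior gap becomes the
# midpoint of its neighbours, the leading gap copies the first reading.
interpolated = toy["temperature"].interpolate(
    method="linear", limit_direction="both"
)

print(zero_filled.tolist())
print(interpolated.tolist())
```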

## Calculate Weekly Taxi Delta Values

In [6]:
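def calculate_week_ago(row, col):
    """Return row[col] minus the value of col one week earlier (None if that hour is absent)."""
    # Relies on the global df, which is indexed by datetime
    week_ago_time = row.name - pd.Timedelta(weeks=1) # row.name is the index
    if week_ago_time in df.index:
        return row[col] - df.loc[week_ago_time, col]
    return None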

In [7]:
for column in [
       'trip_count', 'trip_duration_mean', 'trip_duration_median',
       'trip_duration_std_dev', 'trip_duration_1Q', 'trip_duration_3Q',
       'trip_distance_mean', 'trip_distance_median', 'trip_distance_std_dev',
       'trip_distance_1Q', 'trip_distance_3Q', 'half_mile_trips',
       'one_mile_trips', 'two_mile_trips', 'three_mile_trips',
       'five_mile_trips', 'juno', 'lyft', 'uber', 'via', 'yellow'
    ]:
    df[f"{column}_change_since_prev_week"] = df.apply(lambda row: calculate_week_ago(row, column), axis=1)
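
# The row-wise apply above calls a Python function once per hour per column,
# which can be slow on a multi-year hourly frame. The sketch below is a
# possible vectorized alternative, not the notebook's method: it assumes a
# unique DatetimeIndex and uses Series.shift(freq=...) to look up the earlier
# value by timestamp. The helper name change_since and the toy frame are
# hypothetical.

```python
import numpy as np
import pandas as pd


def change_since(frame, col, lag):
    """Value of `col` minus its value `lag` earlier, matched by timestamp."""
    # shift(freq=lag) moves the index forward by `lag` and leaves the data
    # untouched, so the value observed at t is relabelled as t + lag.
    previous = frame[col].shift(freq=lag)
    # Reindex back onto the original hours; hours with no earlier
    # observation become NaN.
    return frame[col] - previous.reindex(frame.index)


# Made-up series: eight days of hourly counts that grow by one each hour.
idx = pd.date_range("2024-01-01", periods=24 * 8, freq="h")
toy = pd.DataFrame({"trip_count": np.arange(len(idx), dtype=float)}, index=idx)

weekly_change = change_since(toy, "trip_count", pd.Timedelta(weeks=1))
```

# In the toy series the first seven days have no week-earlier value and come
# out as NaN, matching the None returned by calculate_week_ago. Passing
# pd.Timedelta(days=1) instead would give a day-over-day difference.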

## Calculate Daily Weather Delta Values

In [8]:
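def calculate_day_ago(row, col):
    """Return row[col] minus the value of col one day earlier (None if that hour is absent)."""
    # Relies on the global df, which is indexed by datetime
    day_ago_time = row.name - pd.Timedelta(days=1) # row.name is the index
    if day_ago_time in df.index:
        return row[col] - df.loc[day_ago_time, col]
    return None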

In [9]:
for column in [
    "temperature",
    "dew_point_temperature",
    "wind_speed",
    "precipitation",
    "cloud_cover",
    "relative_humidity"
    ]:
    df[f"{column}_change_since_prev_day"] = df.apply(lambda row: calculate_day_ago(row, column), axis=1)

In [10]:
df.to_parquet("complete_weather_and_taxi_data.parquet")