<a href="https://colab.research.google.com/github/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/consolidate_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Consolidate Taxi and Weather Data

Moacir P. de Sá Pereira

This notebook builds a consolidated dataset that combines weather data and taxi data for New York. The taxi data are an hourly aggregation of yellow-cab and Uber-like trips within Manhattan between 2019-01-01 and 2024-08-31, limited to trips of under two hours and under ten miles. The taxi data are preprocessed in https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/pre_process_taxi_data.ipynb

The weather data are hourly observations from the KNYC0 weather station in Central Park, covering a timespan similar to that of the taxi data. They are preprocessed in https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/pre_process_weather_data.ipynb

This notebook limits both datasets to the period from 2019-01-01 through 2024-06-25, which matches the extent of the weather data.
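
As a rough sanity check on that date range, the number of hourly rows it implies can be computed directly. This is a minimal sketch; the variable names `start`, `end`, and `expected_hours` are illustrative and do not appear in the notebook. It assumes timezone-naive timestamps, so every calendar day contributes exactly 24 rows, including days with daylight-saving transitions.

```python
import pandas as pd

# Endpoints of the period of interest (inclusive, hourly resolution).
start = pd.Timestamp("2019-01-01 00:00:00")
end = pd.Timestamp("2024-06-25 23:00:00")

# Whole hours between the endpoints, plus one to include both ends.
expected_hours = int((end - start) / pd.Timedelta(hours=1)) + 1

# The naive hourly range should contain exactly that many timestamps.
grid = pd.date_range(start=start, end=end, freq="h")
assert len(grid) == expected_hours

print(expected_hours)  # 48072 (2003 days x 24 hours)
```

Comparing this figure with the row count of the final dataframe is one way to confirm that the merges neither dropped nor duplicated hours.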

It creates a scaffold dataframe with one row for each hour of each day in that period and then left-merges the taxi and weather data onto it, so hours missing from either source remain as rows with null values.
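
The approach can be illustrated on a toy example. This is a sketch under stated assumptions: the frames `toy_taxi` and `toy_weather` and the columns `trips` and `temperature` are made up for illustration and are not the real column names. Only the merge keys `date` and `hour` follow the notebook.

```python
import datetime

import pandas as pd

# A four-hour grid standing in for the full hourly scaffold.
grid = pd.DataFrame(
    {"datetime": pd.date_range("2019-01-01 00:00", periods=4, freq="h")}
)
grid["date"] = grid["datetime"].dt.date
grid["hour"] = grid["datetime"].dt.hour

day = datetime.date(2019, 1, 1)

# Hypothetical sparse inputs: taxi data lack hours 1 and 3, weather lacks hour 3.
toy_taxi = pd.DataFrame({"date": [day, day], "hour": [0, 2], "trips": [120, 95]})
toy_weather = pd.DataFrame(
    {"date": [day, day, day], "hour": [0, 1, 2], "temperature": [3.5, 3.1, 2.8]}
)

# Left merges keep every grid row; unmatched hours get NaN in the new columns.
toy = (
    grid.merge(toy_taxi, on=["date", "hour"], how="left")
        .merge(toy_weather, on=["date", "hour"], how="left")
)
print(toy[["hour", "trips", "temperature"]])
```

All four hours survive the merge. Hours without a taxi or weather record appear with missing values rather than disappearing, which an inner merge would not guarantee.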

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
# Hourly range of interest, limited by the extent of the weather data.
start_datetime = '2019-01-01 00:00:00'
end_datetime = '2024-06-25 23:00:00'

# Scaffold with one row per hour; 'date' and 'hour' serve as the merge keys.
date_hour_grid = pd.date_range(start=start_datetime, end=end_datetime, freq='h')
merged_df = pd.DataFrame({'datetime': date_hour_grid})

merged_df['date'] = merged_df['datetime'].dt.date
merged_df['hour'] = merged_df['datetime'].dt.hour

# Load the preprocessed hourly taxi and weather data from the repository.
taxi_df = pd.read_parquet(
    "https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/raw/refs/heads/main/data/taxi-data/complete_hourly.parquet"
)
weather_df = pd.read_parquet(
    "https://github.com/sophiewagner7/its-too-nice-out-to-take-a-cab/raw/refs/heads/main/data/GHCNh/GHCNh_USW00094728_2019_to_2024.parquet"
)

In [3]:
# Left-merge taxi and weather data onto the hourly grid so that every hour keeps a row.
df = merged_df.merge(taxi_df, on=['date', 'hour'], how='left').merge(weather_df, on=['date', 'hour'], how='left')
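
A left merge only preserves the row count if each `(date, hour)` pair appears at most once in the right-hand table. The notebook does not check this, but pandas offers the `validate` and `indicator` arguments of `DataFrame.merge` for the purpose. The following is a minimal sketch on hypothetical toy frames (`toy_grid`, `toy_taxi`, and the `trips` column are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical toy frames standing in for merged_df and taxi_df.
toy_grid = pd.DataFrame({"date": ["2019-01-01"] * 3, "hour": [0, 1, 2]})
toy_taxi = pd.DataFrame(
    {"date": ["2019-01-01"] * 2, "hour": [0, 2], "trips": [120, 95]}
)

# validate raises if keys are duplicated; indicator records where each row matched.
checked = toy_grid.merge(
    toy_taxi, on=["date", "hour"], how="left",
    validate="one_to_one", indicator=True,
)
assert len(checked) == len(toy_grid)

# Hours present in the grid but absent from the taxi table.
missing_hours = checked.loc[checked["_merge"] == "left_only", "hour"].tolist()
print(missing_hours)  # [1]

# A duplicated key in the right-hand table is rejected instead of silently adding rows.
toy_dupes = pd.concat([toy_taxi, toy_taxi.iloc[[0]]], ignore_index=True)
try:
    toy_grid.merge(toy_dupes, on=["date", "hour"], how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)  # True
```

Applied to the real data, the same arguments could be passed to each of the two merges, assuming the preprocessed tables are meant to hold one row per hour.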


In [4]:
# Write the consolidated dataset to the current working directory.
df.to_parquet("complete_weather_and_taxi_data.parquet")