# Introduction

This notebook estimates the $\text{CO}_2$ that would have been saved if NYC yellow cab passengers with similar pickup times and the same pickup and dropoff locations had shared a cab during March 2017.

# Data cleaning

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display

dtypes = {
    "tpep_pickup_datetime": "object",
    "tpep_dropoff_datetime": "object",
    # taxi zone IDs go up to 265, which does not fit in uint8 (max 255)
    "PULocationID": "uint16",
    "DOLocationID": "uint16",
    "passenger_count": "uint8",
    "trip_distance": "float32"
}

df = pd.read_csv(
    "../input/nyc-yellow-cab-trip-data-201703/yellow_tripdata_2017-03.csv",
    usecols=list(dtypes.keys()),
    parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype=dtypes
)

# insert trip duration
df.insert(loc=2, column="trip_duration", value=df.tpep_dropoff_datetime-df.tpep_pickup_datetime)

# display df properties
display(df.head())
df.info()  # prints its report directly and returns None, so it is not wrapped in display()
display(df.describe())

# check for invalid values
print("Null values: \t{}".format(df.isnull().any().any()))
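# np.isinf is only meaningful for numeric columns, so datetime and timedelta columns are excluded
print("Inf values: \t{}".format(np.isinf(df.select_dtypes(include="number")).any().any()))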

The dataset contains no null or infinite values, so no rows need to be removed on that account.

However, the minimum value of `trip_duration` shows that some rides have a negative duration.

In [None]:
df.loc[df.trip_duration < pd.Timedelta(seconds=0)]

The pickup and dropoff times were probably recorded incorrectly for the rides above.
These rows are dropped.

In [None]:
df = df.drop(df.loc[df.trip_duration < pd.Timedelta(seconds=0)].index)

# Carpooling

The $\text{CO}_2$ produced by each ride can be estimated using the average value of $404\ \text{g}/\text{mile}$ for a typical passenger vehicle (retrieved from [EPA.gov](https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-typical-passenger-vehicle)).

In [None]:
# insert CO2 produced by each trip
df["CO2_grams"] = df.trip_distance * 404 # CO2 grams per mile

df.head()

To introduce carpooling, each pickup timestamp is aligned to the nearest of a set of reference times spaced a period $T$ apart.
After the alignment, only timestamps of the form

<center>$ H:\{kT\}:00 $</center>

are present in the dataframe.
This notebook uses

<center>$ T = 15\ \text{min} $</center>
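
As a rough illustration of the alignment, the sketch below rounds a few made-up pickup timestamps to the nearest multiple of $T$ using pandas' `Series.dt.round`. This is not the `align_time` function used in this notebook: `dt.round` also takes seconds into account, so results may differ for timestamps close to the midpoint of a slot. The timestamps and the `pickups` and `aligned` names are assumptions for illustration only.

```python
import pandas as pd

T = 15  # alignment period in minutes, as in this notebook

# Hypothetical pickup timestamps, made up for illustration
pickups = pd.Series(pd.to_datetime([
    "2017-03-01 10:05:10",
    "2017-03-01 10:08:40",
    "2017-03-01 10:29:59",
]))

# Round each timestamp to the nearest multiple of T minutes
aligned = pickups.dt.round(f"{T}min")

print(aligned.dt.strftime("%H:%M:%S").tolist())
# ['10:00:00', '10:15:00', '10:30:00']
```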

Rides are pooled by aligned pickup time, pickup location and dropoff location.
Each feature of a pool is aggregated as follows:
* `trip_duration` is the **mean** of the original values;
* `trip_distance` is the **mean** of the original values;
* `passenger_count` is the **sum** of the original values;
* `CO2_grams` is the **maximum** of the original values.

After pooling, `trip_duration` is replaced with `pool_dropoff_datetime`, computed as the aligned pickup time plus the mean trip duration.
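
The aggregation rules above can be sketched on a toy dataframe. The three rides below are made up and assumed to be already aligned to $T = 15$ min; the `rides`, `pooled` and `relative_change` names are hypothetical, and the string aggregator `"mean"` is used for `trip_duration`, which assumes a pandas version that supports it for timedelta columns.

```python
import pandas as pd

# Hypothetical rides, already aligned to T = 15 min (all values are made up)
rides = pd.DataFrame({
    "pool_pickup_datetime": pd.to_datetime(
        ["2017-03-01 10:00", "2017-03-01 10:00", "2017-03-01 10:15"]
    ),
    "PULocationID": [100, 100, 100],
    "DOLocationID": [200, 200, 200],
    "trip_duration": pd.to_timedelta([10, 14, 8], unit="min"),
    "trip_distance": [2.0, 3.0, 1.5],
    "passenger_count": [1, 2, 1],
})
rides["CO2_grams"] = rides.trip_distance * 404  # CO2 grams per mile

# Pool rides sharing aligned pickup time, pickup zone and dropoff zone
pooled = (
    rides.groupby(["pool_pickup_datetime", "PULocationID", "DOLocationID"])
    .agg(
        trip_duration=("trip_duration", "mean"),
        trip_distance=("trip_distance", "mean"),
        passenger_count=("passenger_count", "sum"),
        CO2_grams=("CO2_grams", "max"),
    )
    .reset_index()
)

# Replace the mean duration with a dropoff timestamp
pooled["pool_dropoff_datetime"] = pooled.pool_pickup_datetime + pooled.trip_duration
pooled = pooled.drop(columns="trip_duration")

# Relative change in CO2 after pooling (negative means CO2 is saved)
relative_change = (pooled.CO2_grams.sum() - rides.CO2_grams.sum()) / rides.CO2_grams.sum()

print(pooled)
print("Relative change: {:.0%}".format(relative_change))
```

In this toy case the two 10:00 rides merge into one pool with 3 passengers and the $\text{CO}_2$ of the longer ride, so the total drops from 2626 g to 1818 g.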

In [None]:
import joblib as jb
from tqdm.auto import trange

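# align a timestamp to the nearest multiple of `T` minutes (seconds are discarded)
def align_time(t, T):
    # move to the next slot when the minute is past the midpoint of its T-minute slot
    if round(t.minute % T / T) == 1:
        t += pd.Timedelta(minutes=T)
    # step back to the start of the slot
    # (note: `-t.minute % T` parses as `(-t.minute) % T`, which would move the time forward instead)
    t -= pd.Timedelta(minutes=t.minute % T, seconds=t.second)
    return t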

# introduce carpooling
def carpool(df, T=15):
    
    # leave original unchanged
    df=df.copy()
    
    # align timestamps in parallel
    aligned = jb.Parallel(n_jobs=-1)(
        jb.delayed(align_time)(df.tpep_pickup_datetime.iloc[i], T=T)
        for i in trange(df.shape[0], desc="Align timestamps")
    )
    df.insert(loc=0, column="pool_pickup_datetime", value=aligned)

    # introduce pools
    df_group = df.groupby(["pool_pickup_datetime", "PULocationID", "DOLocationID"])
    df = df_group.agg(
        trip_duration=pd.NamedAgg(column="trip_duration", aggfunc=lambda t: t.values.mean()),
        trip_distance=pd.NamedAgg(column="trip_distance", aggfunc="mean"),
        passenger_count=pd.NamedAgg(column="passenger_count", aggfunc="sum"),
        CO2_grams=pd.NamedAgg(column="CO2_grams", aggfunc="max"),
    ).reset_index()
    # record the original row labels of the rides in each pool
    df.insert(loc=0, column="pool", value=[list(x) for x in df_group.groups.values()])

    return df

In [None]:
df_pool = carpool(df)

# replace trip_duration with dropoff_datetime
df_pool.insert(loc=1, column="pool_dropoff_datetime", value=df_pool.pool_pickup_datetime+df_pool.trip_duration)
df_pool = df_pool.drop(columns="trip_duration")

display(df_pool.head())
display(df_pool.describe())

In [None]:
# print CO2 released, converting grams to metric tons

CO2 = df.CO2_grams.sum()/1e+6
CO2_pool = df_pool.CO2_grams.sum()/1e+6

print("CO2 produced without carpool: \t{:.0f} tons".format(CO2))
print("CO2 produced with carpool: \t{:.0f} tons ({:.0%})".format(CO2_pool, (CO2_pool - CO2)/CO2))

In [None]:
# save output
df.to_csv("nyc-yellow-cab-trip-data-201703-cleaned.csv", index=True) # keep the index: the lists in df_pool's column `pool` hold these row labels, which are needed to retrieve the original rides
df_pool.to_csv("nyc-yellow-cab-trip-data-201703-pooled.csv", index=False)
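
As a minimal sketch of how the original rides of a pool could be retrieved from the saved files, the example below round-trips two tiny, made-up dataframes through CSV text instead of reading the real files. It assumes that the rides file keeps its index as the first column and that the `pool` lists are serialized as plain integers; the `rides`, `pools` and `first_pool_rides` names are hypothetical.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-ins for the cleaned and pooled dataframes (values are made up).
# The gap in the index mimics a row dropped during cleaning.
rides = pd.DataFrame({"trip_distance": [2.0, 3.0, 1.5]}, index=[0, 1, 3])
pools = pd.DataFrame({"pool": [[0, 1], [3]], "passenger_count": [3, 1]})

# Round-trip through CSV text, keeping the ride index as the first column
rides_back = pd.read_csv(io.StringIO(rides.to_csv(index=True)), index_col=0)
pools_back = pd.read_csv(io.StringIO(pools.to_csv(index=False)))

# After a CSV round trip `pool` holds strings such as "[0, 1]"; parse them back into lists
pools_back["pool"] = pools_back["pool"].apply(ast.literal_eval)

# Look up the original rides of the first pool by their row labels
first_pool_rides = rides_back.loc[pools_back.pool.iloc[0]]
print(first_pool_rides)
```

Label-based lookup with `.loc` is used here because row positions no longer match the labels once rows have been dropped.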