# **NYC Trip Fare Analysis**

Dataset description extracted from [Kaggle](https://www.kaggle.com/datasets/diishasiing/revenue-for-cab-drivers):

- **VendorID**: A unique identifier for the taxi vendor or service provider.
- **tpep_pickup_datetime**: The date and time when the passenger was picked up.
- **tpep_dropoff_datetime**: The date and time when the passenger was dropped off.
- **passenger_count**: The number of passengers in the taxi.
- **trip_distance**: The total distance of the trip in miles or kilometers.
- **RatecodeID**: The rate code assigned to the trip, representing fare types.
- **store_and_fwd_flag**: Indicates whether the trip data was stored locally and then forwarded later (Y/N).
- **PULocationID**: The unique identifier for the pickup location (zone or area).
- **DOLocationID**: The unique identifier for the drop-off location (zone or area).
- **payment_type**: The method of payment used by the passenger (e.g., cash, card).
- **fare_amount**: The base fare for the trip.
- **extra**: Additional charges applied during the trip (e.g., night surcharge).
- **mta_tax**: The tax imposed by the Metropolitan Transportation Authority.
- **tip_amount**: The tip given to the driver, if applicable.
- **tolls_amount**: The total amount of tolls charged during the trip.
- **improvement_surcharge**: A surcharge imposed for the improvement of services.
- **total_amount**: The total fare amount, including all charges and surcharges.
- **congestion_surcharge**: An additional charge for trips taken during high traffic congestion times.

## **1. Set Environment and Import Libraries**

We set up the working environment by ensuring that all scripts and notebooks can access the project’s main directories dynamically. The `conf.py` module, located in the conf folder, defines key paths (`MAIN_DIR`, `NOTEBOOK_DIR`, `DATA_DIR`) and adds them to `sys.path`. This allows seamless imports and ensures portability across different environments without requiring hardcoded paths.

In [None]:
import sys
import os

# Add 'conf' folder to sys.path (if not already present)
conf_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'conf'))
if conf_path not in sys.path:
    sys.path.append(conf_path)

# Now import conf module from conf folder
import conf
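
For reference, a module of this kind could look like the minimal sketch below. The real `conf.py` is not shown in this notebook, so the folder names (`notebooks`, `data`) and the helpers `build_paths` and `register_paths` are assumptions made for illustration only.

```python
import sys
from pathlib import Path


def build_paths(main_dir):
    """Return the project paths derived from a main directory (hypothetical layout)."""
    main_dir = Path(main_dir).resolve()
    return {
        "MAIN_DIR": main_dir,
        "NOTEBOOK_DIR": main_dir / "notebooks",  # assumed folder name
        "DATA_DIR": main_dir / "data",           # assumed folder name
    }


def register_paths(paths):
    """Append each path to sys.path once, so imports work from any notebook."""
    for path in paths.values():
        if str(path) not in sys.path:
            sys.path.append(str(path))


# Assumption: the working directory sits one level below the project root
paths = build_paths(Path.cwd().parent)
register_paths(paths)
```

Deriving every path from a single root is what keeps the notebook free of hardcoded absolute paths.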

Now, import the main packages needed for this project:

In [2]:
# Import libraries
import pandas as pd
import datetime
import time
import numpy as np

## **2. Data Importation and Pre-Processing**

Import the dataset present within the `DATA_DIR` folder:

In [3]:
# Read data from the .csv file within data folder
dataset = pd.read_csv(f'{conf.DATA_DIR}/data.csv')


First of all, create a copy of the original dataset for recovery.

In [6]:
# Create a copy of the original dataset
dataset_old = dataset.copy()

Print the **first 5 rows** of the dataset to visualize a sample of data:

In [4]:
dataset.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


After the import, pandas raised a `DtypeWarning`. Print the **column types** to investigate it (the warning flags columns that contain mixed data types):

In [5]:
print(dataset.dtypes)

VendorID                 float64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
RatecodeID               float64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object


Consider the column `store_and_fwd_flag`: 

In [None]:
# Get the unique values present in the 'store_and_fwd_flag' column, to understand the distinct forwarding flags in the dataset
dataset.store_and_fwd_flag.unique()

array(['N', 'Y', nan], dtype=object)

The flags are stored as strings, while the missing values show up as `nan`. Normalize every representation of a missing value to `pd.NA`, then convert the column to the `category` dtype.

In [None]:
# Clean up missing values before conversion into type "category"
dataset['store_and_fwd_flag'] = dataset['store_and_fwd_flag'].replace(["", " ", "NaN", "nan"], pd.NA).astype("category")

# Get again unique categories
dataset.store_and_fwd_flag.unique()

['N', 'Y', NaN]
Categories (2, object): ['N', 'Y']
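
One possible way to avoid the warning at load time is to declare the column types while reading the file. The sketch below uses a small in-memory CSV with made-up rows (not the real data) and the `dtype` parameter of `pd.read_csv`. Whether the same mapping works on the full file depends on its contents.

```python
import io

import pandas as pd

# Made-up sample: the last row has both fields missing
csv_text = "VendorID,store_and_fwd_flag\n1,N\n2,Y\n,\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"VendorID": "Int64", "store_and_fwd_flag": "category"},
)

print(sample.dtypes)
print(sample["store_and_fwd_flag"].cat.categories.tolist())
```

Empty fields are read as missing values, so they do not become a category of their own.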

Moreover, dates must be converted into the proper `datetime` format:

In [9]:
# Convert datetime fields to proper format
dataset['tpep_pickup_datetime'] = pd.to_datetime(dataset['tpep_pickup_datetime'], errors='coerce') # errors='coerce' turns unparseable values into NaT instead of raising an error
dataset['tpep_dropoff_datetime'] = pd.to_datetime(dataset['tpep_dropoff_datetime'], errors='coerce')

Check new dtypes:

In [None]:
# Print formats
print(dataset[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'store_and_fwd_flag']].dtypes)

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
store_and_fwd_flag             category
dtype: object


Now, considering how data are described within the Kaggle datacard, try to understand if adjustments are required. 

Numerical IDs, counters and numerically encoded categorical variables should be stored as **integers**. The nullable `Int64` dtype can hold missing values in integer columns without falling back to `float64`.

Numerical IDs and categorical columns must then be converted to the **categorical** dtype.

The variables that require a conversion are `VendorID`, `RatecodeID`, `DOLocationID`, `PULocationID` (*numerical IDs*), `payment_type` (*categorical column*) and `passenger_count` (*counter column*).

In [None]:
# List of columns to modify
col_to_modify = ['VendorID', 'passenger_count', 'RatecodeID', 'payment_type', 'DOLocationID', 'PULocationID']

# Function to convert data types for optimization
def type_conversion(col_name):
    # Convert the column to nullable integers (Int64) to handle NaN values
    column_new = dataset[col_name].astype('Int64')
    
    # Convert to 'category' data type if it's not 'passenger_count'
    # 'category' type reduces memory usage for columns with repeated values
    if col_name != 'passenger_count':
        column_new = column_new.astype('category')

    return column_new

# Apply type conversion to each column in the list
for col in col_to_modify:
    dataset[col] = type_conversion(col)
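
The same conversion can also be written so that it does not depend on the global `dataset`. The following is a minimal sketch on made-up data. The function name `convert_types` and its `keep_integer` parameter are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd


def convert_types(df, columns, keep_integer=("passenger_count",)):
    """Return a copy of df with the given columns cast to Int64, then to category.

    Columns listed in keep_integer stay as nullable integers.
    """
    result = df.copy()
    for col in columns:
        converted = result[col].astype("Int64")  # nullable integer keeps missing values
        if col not in keep_integer:
            converted = converted.astype("category")
        result[col] = converted
    return result


# Toy data (made up) with missing values stored as float NaN
toy = pd.DataFrame({"VendorID": [1.0, 2.0, np.nan], "passenger_count": [1.0, np.nan, 3.0]})
converted = convert_types(toy, ["VendorID", "passenger_count"])
print(converted.dtypes)
```

Passing the DataFrame as an argument and returning a copy leaves the input untouched, which makes the step easier to rerun.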

Print how variable dtypes are changed:

In [13]:
for col in list(dataset_old.columns):
    print('Column :', col, '| before:', dataset_old[col].dtype, ' -> after:', dataset[col].dtype)

Column : VendorID | before: float64  -> after: category
Column : tpep_pickup_datetime | before: object  -> after: datetime64[ns]
Column : tpep_dropoff_datetime | before: object  -> after: datetime64[ns]
Column : passenger_count | before: float64  -> after: Int64
Column : trip_distance | before: float64  -> after: float64
Column : RatecodeID | before: float64  -> after: category
Column : store_and_fwd_flag | before: object  -> after: category
Column : PULocationID | before: int64  -> after: category
Column : DOLocationID | before: int64  -> after: category
Column : payment_type | before: float64  -> after: category
Column : fare_amount | before: float64  -> after: float64
Column : extra | before: float64  -> after: float64
Column : mta_tax | before: float64  -> after: float64
Column : tip_amount | before: float64  -> after: float64
Column : tolls_amount | before: float64  -> after: float64
Column : improvement_surcharge | before: float64  -> after: float64
Column : total_amount | before

Finally, create a rowID:

In [14]:
# Add a 1-based rowID column
dataset['rowID'] = range(1, len(dataset) + 1)

# Reorder columns to place 'rowID' as the first column
dataset = dataset[['rowID'] + [col for col in dataset.columns if col != 'rowID']]

## **3. Assignments**

### 3.1 Extract all trips with `trip_distance` larger than 50

In [15]:
# Extract all trips with distance > 50
dataset[dataset['trip_distance'] > 50]

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
23842,23843,2,2020-01-01 01:53:07,2020-01-01 03:54:41,1,52.30,5,N,262,265,1,300.00,0.00,0.0,61.78,6.12,0.3,370.70,2.5
39013,39014,2,2020-01-01 02:05:07,2020-01-01 03:03:10,1,51.23,5,N,264,264,1,329.00,0.00,0.5,100.78,6.12,0.3,436.70,0.0
41620,41621,1,2020-01-01 03:05:54,2020-01-01 04:16:26,1,53.80,5,N,132,265,1,250.00,0.00,0.0,53.35,16.62,0.3,320.27,0.0
58262,58263,2,2020-01-01 05:36:12,2020-01-01 06:40:06,1,55.23,5,N,132,265,2,170.00,0.00,0.5,0.00,18.26,0.3,189.06,0.0
63024,63025,2,2020-01-01 07:40:30,2020-01-01 08:40:01,1,54.19,5,N,132,265,1,230.00,0.00,0.0,0.00,12.24,0.3,242.54,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6326169,6326170,2,2020-01-31 22:47:26,2020-01-31 23:49:14,1,51.83,5,N,132,265,1,220.00,0.00,0.5,48.96,23.99,0.3,293.75,0.0
6331181,6331182,2,2020-01-31 23:45:36,2020-02-01 01:00:25,5,57.99,4,N,107,265,1,245.00,0.50,0.5,38.24,6.12,0.3,293.16,2.5
6333801,6333802,2,2020-01-31 23:24:16,2020-02-01 01:32:56,1,52.97,4,N,264,265,1,227.00,0.50,0.5,46.16,0.00,0.3,276.96,2.5
6397132,6397133,,2020-01-28 11:54:00,2020-01-28 19:35:00,,60.36,,,17,61,,12.04,0.00,0.5,0.00,12.24,0.3,25.08,0.0


### 3.2 Extract all trips where `payment_type` is missing

In [16]:
# Trips with a missing payment_type
dataset[dataset['payment_type'].isna()]

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,6339568,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.00,0.3,54.60,0.0
6339568,6339569,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.00,0.3,30.11,0.0
6339569,6339570,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.20,,,197,216,,24.36,2.75,0.5,0.0,0.00,0.3,27.91,0.0
6339570,6339571,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.00,0.3,29.63,0.0
6339571,6339572,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.00,0.3,28.83,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,6405004,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.0,0.00,0.3,21.14,0.0
6405004,6405005,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.0,12.24,0.3,62.46,0.0
6405005,6405006,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.0,0.00,0.3,51.90,0.0
6405006,6405007,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.0,0.00,0.3,30.22,0.0


### 3.3 For each (`PULocationID`, `DOLocationID`) pair, determine the number of trips

To count the trips for each unique pair of pickup (`PULocationID`) and drop-off (`DOLocationID`) locations, group by both columns and call `size()`, which counts the rows in each group. `reset_index(name='trip_count')` then turns the result into a DataFrame with a `trip_count` column holding the count for each location pair.

In [None]:
# Group the dataset by 'PULocationID' and 'DOLocationID' and count the trips for each pair.
# Both keys are categorical, so observed=False keeps every combination of categories,
# including pairs with no trips (trip_count == 0).
# reset_index() turns the result into a DataFrame and names the new column 'trip_count'.
dataset.groupby(['PULocationID', 'DOLocationID'], observed=False).size().reset_index(name='trip_count')

Unnamed: 0,PULocationID,DOLocationID,trip_count
0,1,1,638
1,1,2,0
2,1,3,0
3,1,4,0
4,1,5,0
...,...,...,...
68377,265,261,1
68378,265,262,0
68379,265,263,4
68380,265,264,317
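
The zero counts in the output come from the `observed` parameter. A minimal sketch on made-up categorical data shows the difference between the two settings:

```python
import pandas as pd

# Toy trips (made up); both key columns are categorical with categories 1 and 2
trips = pd.DataFrame({
    "PULocationID": pd.Categorical([1, 1, 2], categories=[1, 2]),
    "DOLocationID": pd.Categorical([1, 1, 1], categories=[1, 2]),
})

# observed=False: every combination of categories, unseen pairs get a count of 0
all_pairs = (
    trips.groupby(["PULocationID", "DOLocationID"], observed=False)
    .size()
    .reset_index(name="trip_count")
)

# observed=True: only the pairs that actually occur in the data
seen_pairs = (
    trips.groupby(["PULocationID", "DOLocationID"], observed=True)
    .size()
    .reset_index(name="trip_count")
)

print(len(all_pairs), len(seen_pairs))
```

With `observed=True` the result contains only pairs with at least one trip, which may be preferable when zero-count rows are not needed.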


### 3.4 Save all rows with missing `VendorID`, `passenger_count`, `store_and_fwd_flag`, `payment_type` in a new dataframe called `bad`, and remove those rows from the original dataframe

There are two ways to combine the per-column missing-value checks:
- `.any(axis=1)`: selects rows where **at least one** of the specified columns is **NaN**.
- `.all(axis=1)`: selects rows where **all specified columns** are **NaN** at the same time.

In this case, the proper solution is `.any(axis=1)`, since a row counts as bad as soon as one of these fields is missing:

In [None]:
# Filter rows where any of the specified columns ('VendorID', 'passenger_count', 'payment_type', 'store_and_fwd_flag') contain NaN values
# 'isna()' checks for missing values, and 'any(axis=1)' ensures that any row with NaN in any of these columns is included
bad = dataset[dataset[['VendorID', 'passenger_count', 'payment_type', 'store_and_fwd_flag']].isna().any(axis=1)]
bad

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,6339568,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.00,0.3,54.60,0.0
6339568,6339569,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.00,0.3,30.11,0.0
6339569,6339570,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.20,,,197,216,,24.36,2.75,0.5,0.0,0.00,0.3,27.91,0.0
6339570,6339571,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.00,0.3,29.63,0.0
6339571,6339572,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.00,0.3,28.83,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,6405004,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.0,0.00,0.3,21.14,0.0
6405004,6405005,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.0,12.24,0.3,62.46,0.0
6405005,6405006,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.0,0.00,0.3,51.90,0.0
6405006,6405007,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.0,0.00,0.3,30.22,0.0
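
The difference between `.any(axis=1)` and `.all(axis=1)` can be seen in a small sketch on made-up data (only two of the four columns are used, for brevity):

```python
import numpy as np
import pandas as pd

# Toy rows (made up): complete, partially missing, fully missing
toy = pd.DataFrame({
    "VendorID": [1, np.nan, np.nan],
    "payment_type": [1, 2, np.nan],
})

missing = toy[["VendorID", "payment_type"]].isna()

any_missing = toy[missing.any(axis=1)]   # at least one of the columns is NaN
all_missing = toy[missing.all(axis=1)]   # every column is NaN
kept = toy[~missing.any(axis=1)]         # rows with no missing value at all

print(any_missing.index.tolist(), all_missing.index.tolist(), kept.index.tolist())
```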


Now, clean `dataset` from rows saved in `bad` dataframe. 

In [19]:
# Keep only the rows whose index does not appear in `bad`
dataset = dataset[~dataset.index.isin(bad.index)]

# Show the cleaned dataset
dataset

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,1,2020-01-01 00:28:15,2020-01-01 00:33:03,1,1.20,1,N,238,239,1,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,2,1,2020-01-01 00:35:39,2020-01-01 00:43:04,1,1.20,1,N,239,238,1,7.0,3.0,0.5,1.50,0.0,0.3,12.30,2.5
2,3,1,2020-01-01 00:47:41,2020-01-01 00:53:52,1,0.60,1,N,238,238,1,6.0,3.0,0.5,1.00,0.0,0.3,10.80,2.5
3,4,1,2020-01-01 00:55:23,2020-01-01 01:00:14,1,0.80,1,N,238,151,1,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,5,2,2020-01-01 00:01:58,2020-01-01 00:04:16,1,0.00,1,N,193,193,2,3.5,0.5,0.5,0.00,0.0,0.3,4.80,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6339562,6339563,2,2020-01-31 23:38:07,2020-01-31 23:52:21,1,2.10,1,N,163,246,1,11.0,0.5,0.5,2.96,0.0,0.3,17.76,2.5
6339563,6339564,2,2020-01-31 23:00:18,2020-01-31 23:19:18,1,2.13,1,N,164,79,1,13.0,0.5,0.5,3.36,0.0,0.3,20.16,2.5
6339564,6339565,2,2020-01-31 23:24:22,2020-01-31 23:40:39,1,2.55,1,N,79,68,1,12.5,0.5,0.5,3.26,0.0,0.3,19.56,2.5
6339565,6339566,2,2020-01-31 23:44:22,2020-01-31 23:54:00,1,1.61,1,N,100,142,2,8.5,0.5,0.5,0.00,0.0,0.3,12.30,2.5


### 3.5 Add a duration column storing how long each trip has taken (use `tpep_pickup_datetime`, `tpep_dropoff_datetime`)

Since `tpep_pickup_datetime` is the date and time when the passenger was picked up and `tpep_dropoff_datetime` is the date and time when the passenger was dropped off, the duration is simply their difference: `tpep_dropoff_datetime` - `tpep_pickup_datetime`.

In [None]:
# Save within the column trip_duration the difference between dropoff time and pickup time
dataset['trip_duration'] = dataset['tpep_dropoff_datetime'] - dataset['tpep_pickup_datetime']

# Display results
dataset[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
0,2020-01-01 00:28:15,2020-01-01 00:33:03,0 days 00:04:48
1,2020-01-01 00:35:39,2020-01-01 00:43:04,0 days 00:07:25
2,2020-01-01 00:47:41,2020-01-01 00:53:52,0 days 00:06:11
3,2020-01-01 00:55:23,2020-01-01 01:00:14,0 days 00:04:51
4,2020-01-01 00:01:58,2020-01-01 00:04:16,0 days 00:02:18
...,...,...,...
6339562,2020-01-31 23:38:07,2020-01-31 23:52:21,0 days 00:14:14
6339563,2020-01-31 23:00:18,2020-01-31 23:19:18,0 days 00:19:00
6339564,2020-01-31 23:24:22,2020-01-31 23:40:39,0 days 00:16:17
6339565,2020-01-31 23:44:22,2020-01-31 23:54:00,0 days 00:09:38


The result is a `Timedelta` column, which represents the time difference as number of days, hours, minutes and seconds.
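
If a plain number is easier to work with, the duration can be expressed in minutes. A minimal sketch on two trips taken from the sample above (the column name `trip_duration_min` is an illustrative choice, not part of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2020-01-01 00:28:15", "2020-01-01 00:35:39"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2020-01-01 00:33:03", "2020-01-01 00:43:04"]),
})

# Timedelta column, as in the cell above
toy["trip_duration"] = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# Same information as a float number of minutes
toy["trip_duration_min"] = toy["trip_duration"].dt.total_seconds() / 60

print(toy[["trip_duration", "trip_duration_min"]])
```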

### 3.6 For each pickup location, determine how many trips have started there 

The pickup location is stored in the `PULocationID` column, so it is enough to count how many times each value occurs:

In [None]:
# Group the dataset by 'PULocationID', counting the number of trips for each pickup location
# 'observed=False' keeps every category of the column, including locations with no trips (count 0)
# 'size()' counts the number of rows (trips) in each group
# 'reset_index()' converts the result to a DataFrame and names the new column 'trip_count'

dataset.groupby('PULocationID', observed=False).size().reset_index(name='trip_count')

Unnamed: 0,PULocationID,trip_count
0,1,753
1,2,3
2,3,70
3,4,9902
4,5,39
...,...,...
256,261,34229
257,262,85591
258,263,123997
259,264,43779


### 3.7 Cluster the pickup time of the day into 30-minute intervals (e.g. from 02:00 to 02:30)

The following procedure clusters the pickup times of the day into 30-minute intervals. `pd.date_range()` generates the bin edges from 00:00 to 23:30 in 30-minute steps. Labels are then built to show each interval in a readable format (e.g., “02:00 - 02:30”). Finally, `pd.cut()` assigns each pickup time to its interval and stores it in a new column, `pickup_time_interval`.

Note: `pd.date_range` stops at 23:30, so the right edge of the last interval (**23:30 - 00:00**) is appended manually as 23:59:59.

In [22]:
# Generate time intervals (30 min bins)
bins = pd.date_range(start='00:00:00', end='23:59:59', freq='30min').time
# Append 23:59:59 as the right edge of the last interval (23:30 - 00:00)
bins = np.append(bins, datetime.time(23, 59, 59))

# Define labels (e.g., "02:00-02:30", ...)
labels = [f"{bins[i].strftime('%H:%M')} - {bins[i+1].strftime('%H:%M')}" for i in range(len(bins)-1)]
labels[-1] = '23:30 - 00:00'

In [None]:
# Assign each pickup time to its corresponding 30-minute interval with pd.cut
dataset['pickup_time_interval'] = pd.cut(dataset['tpep_pickup_datetime'].dt.time, bins=bins, labels=labels, include_lowest=True)
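
As an alternative sketch, not the approach used in this notebook, the same kind of label can be derived by rounding each timestamp down with `dt.floor('30min')`. The timestamps below are made up. Note that this variant closes each interval on the left (02:30:00 falls into "02:30 - 03:00"), whereas `pd.cut` with its default `right=True` closes intervals on the right.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2020-01-01 00:28:15",
    "2020-01-01 02:30:00",
    "2020-01-31 23:44:22",
]))

# Round down to the start of the 30-minute slot, then add 30 minutes for the end
slot_start = pickups.dt.floor("30min")
slot_end = slot_start + pd.Timedelta(minutes=30)

interval_labels = slot_start.dt.strftime("%H:%M") + " - " + slot_end.dt.strftime("%H:%M")
print(interval_labels.tolist())
# ['00:00 - 00:30', '02:30 - 03:00', '23:30 - 00:00']
```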

In [24]:
# Check the presence of null values within pickup_time_interval
dataset[dataset['pickup_time_interval'].isna()]

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval


In [None]:
# Print results
dataset

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
0,1,1,2020-01-01 00:28:15,2020-01-01 00:33:03,1,1.20,1,N,238,239,...,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5,0 days 00:04:48,00:00 - 00:30
1,2,1,2020-01-01 00:35:39,2020-01-01 00:43:04,1,1.20,1,N,239,238,...,7.0,3.0,0.5,1.50,0.0,0.3,12.30,2.5,0 days 00:07:25,00:30 - 01:00
2,3,1,2020-01-01 00:47:41,2020-01-01 00:53:52,1,0.60,1,N,238,238,...,6.0,3.0,0.5,1.00,0.0,0.3,10.80,2.5,0 days 00:06:11,00:30 - 01:00
3,4,1,2020-01-01 00:55:23,2020-01-01 01:00:14,1,0.80,1,N,238,151,...,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0,0 days 00:04:51,00:30 - 01:00
4,5,2,2020-01-01 00:01:58,2020-01-01 00:04:16,1,0.00,1,N,193,193,...,3.5,0.5,0.5,0.00,0.0,0.3,4.80,0.0,0 days 00:02:18,00:00 - 00:30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6339562,6339563,2,2020-01-31 23:38:07,2020-01-31 23:52:21,1,2.10,1,N,163,246,...,11.0,0.5,0.5,2.96,0.0,0.3,17.76,2.5,0 days 00:14:14,23:30 - 00:00
6339563,6339564,2,2020-01-31 23:00:18,2020-01-31 23:19:18,1,2.13,1,N,164,79,...,13.0,0.5,0.5,3.36,0.0,0.3,20.16,2.5,0 days 00:19:00,23:00 - 23:30
6339564,6339565,2,2020-01-31 23:24:22,2020-01-31 23:40:39,1,2.55,1,N,79,68,...,12.5,0.5,0.5,3.26,0.0,0.3,19.56,2.5,0 days 00:16:17,23:00 - 23:30
6339565,6339566,2,2020-01-31 23:44:22,2020-01-31 23:54:00,1,1.61,1,N,100,142,...,8.5,0.5,0.5,0.00,0.0,0.3,12.30,2.5,0 days 00:09:38,23:30 - 00:00


### 3.8 For each interval, determine the average number of passengers and the average fare amount

Group the dataset by the pickup time intervals (`pickup_time_interval`) and calculate the **average passenger count** and the **average fare amount** for each interval. The `agg()` function computes the mean of the `passenger_count` and `fare_amount` columns within each time interval.

In [None]:
# Group the dataset by 'pickup_time_interval' and compute, for each interval,
# the average 'passenger_count' and the average 'fare_amount'
dataset.groupby('pickup_time_interval', observed=False).agg(
    avg_passenger_count=('passenger_count', 'mean'),
    avg_fare_amount=('fare_amount', 'mean')
).reset_index() # 'reset_index()' turns the interval labels back into a regular column

Unnamed: 0,pickup_time_interval,avg_passenger_count,avg_fare_amount
0,00:00 - 00:30,1.572854,13.525922
1,00:30 - 01:00,1.584273,13.214849
2,01:00 - 01:30,1.578807,12.698122
3,01:30 - 02:00,1.589424,12.2657
4,02:00 - 02:30,1.587403,12.089757
5,02:30 - 03:00,1.587833,12.040527
6,03:00 - 03:30,1.581925,12.503048
7,03:30 - 04:00,1.585897,13.094565
8,04:00 - 04:30,1.580107,14.193595
9,04:30 - 05:00,1.516206,16.413


### 3.9 For each payment type and each interval, determine the average fare amount

In [27]:
# Compute the average fare amount for each (payment_type, pickup_time_interval)
avg_fare = dataset.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(avg_fare_amount=('fare_amount', 'mean')).reset_index()
avg_fare

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
0,1,00:00 - 00:30,13.868641
1,1,00:30 - 01:00,13.472738
2,1,01:00 - 01:30,12.822628
3,1,01:30 - 02:00,12.358248
4,1,02:00 - 02:30,12.008941
...,...,...,...
235,5,21:30 - 22:00,
236,5,22:00 - 22:30,
237,5,22:30 - 23:00,
238,5,23:00 - 23:30,


### 3.10 For each payment type, determine the interval when the average fare amount is maximum

In [28]:
# Find the interval with the maximum average fare for each payment type
avg_fare.loc[avg_fare.groupby('payment_type', observed=False)['avg_fare_amount'].idxmax()]

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
10,1,05:00 - 05:30,21.260986
58,2,05:00 - 05:30,14.856701
110,3,07:00 - 07:30,10.950938
154,4,05:00 - 05:30,6.634043
227,5,17:30 - 18:00,0.0


### 3.11 For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum

In [29]:
# Group by payment_type and pickup_time_interval, and for each pair sum the tip and fare amounts
ratio_amount_df = dataset.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(tot_tip_amount = ('tip_amount', 'sum'),
                                                                                                tot_fare_amount = ('fare_amount', 'sum')
                                                                                                ).reset_index()

# Now, get the ratio between total tip and fare amount for each pair
ratio_amount_df['tot_amount_ratio'] = ratio_amount_df['tot_tip_amount'] / ratio_amount_df['tot_fare_amount'].replace(0, np.nan)  # Avoid division by zero

# Drop rows where tot_amount_ratio is NaN before using idxmax(), to avoid errors
ratio_amount_df = ratio_amount_df.dropna(subset=['tot_amount_ratio'])

# Finally, get the interval for each payment type where the overall ratio is maximum
ratio_amount_df.loc[ratio_amount_df.groupby('payment_type', observed=True)['tot_amount_ratio'].idxmax()]

Unnamed: 0,payment_type,pickup_time_interval,tot_tip_amount,tot_fare_amount,tot_amount_ratio
37,1,18:30 - 19:00,485536.47,1998420.2,0.24296
58,2,05:00 - 05:30,15.0,109464.17,0.000137
138,3,21:00 - 21:30,35.62,5644.17,0.006311
170,4,13:00 - 13:30,36.48,170.05,0.214525
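
The "overall" ratio used here is the ratio of the totals, which in general differs from the average of the per-trip ratios. A minimal sketch on two made-up trips illustrates the distinction:

```python
import pandas as pd

# Toy trips (made up)
toy = pd.DataFrame({"tip_amount": [1.0, 0.0], "fare_amount": [10.0, 30.0]})

# Ratio of totals: total tips divided by total fares (the quantity computed above)
overall_ratio = toy["tip_amount"].sum() / toy["fare_amount"].sum()

# Mean of per-trip ratios: each trip weighs the same, regardless of its fare
mean_of_ratios = (toy["tip_amount"] / toy["fare_amount"]).mean()

print(overall_ratio, mean_of_ratios)
```

The ratio of totals gives more weight to expensive trips, so the choice between the two depends on how the question is read.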


Why do rows with `payment_type == 5` not appear? Let’s analyze the dataset rows where this condition is true:

In [30]:
dataset[['payment_type','pickup_time_interval','tip_amount','fare_amount']][dataset.payment_type == 5]

Unnamed: 0,payment_type,pickup_time_interval,tip_amount,fare_amount
4061635,5,17:30 - 18:00,0.0,0.0


This happens because the only trip with `payment_type == 5` has both amounts equal to zero. The zero fare total is replaced with `NaN` to avoid a division by zero, so the ratio is `NaN` and the row is dropped before `idxmax()` is applied.
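
The effect of replacing zero denominators can be checked on a small made-up example:

```python
import numpy as np
import pandas as pd

# Toy totals (made up): 0/0, 5/0 and a regular case
tips = pd.Series([0.0, 5.0, 2.0])
fares = pd.Series([0.0, 0.0, 10.0])

# Plain division: 0/0 gives NaN, 5/0 gives inf
raw_ratio = tips / fares

# Zero denominators replaced with NaN: both problematic rows become NaN
safe_ratio = tips / fares.replace(0, np.nan)

print(raw_ratio.tolist(), safe_ratio.tolist())
```

Without the replacement, an infinite ratio could win the `idxmax()` comparison, which is why the zero totals are masked first.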

### 3.12 Find the location with the highest average fare amount

There are two types of locations in the dataset: `PULocationID` (pickup location) and `DOLocationID` (drop-off location). Since it is unclear whether the fare amount is primarily determined by the pickup or the drop-off location, we compute the average fare amount of each location in both roles (pickup and drop-off) and then average the two values.

In [31]:
# Compute the average fare amount for each location, and find that with the highest average fare amount
locations = pd.concat([
    dataset.groupby('PULocationID', observed=False).agg(avg_fare=('fare_amount', 'mean')).reset_index().rename(columns={'PULocationID':'LocationID'}),
    dataset.groupby('DOLocationID', observed=False).agg(avg_fare=('fare_amount', 'mean')).reset_index().rename(columns={'DOLocationID':'LocationID'})
])

avg_fare_loc = locations.groupby(['LocationID'], observed=False)['avg_fare'].mean().reset_index()
highest_fare = avg_fare_loc.loc[avg_fare_loc['avg_fare'].idxmax()]

print(f"Location with Highest Avg Fare: {highest_fare['LocationID']}, Avg Fare: {highest_fare['avg_fare']:.2f}")

Location with Highest Avg Fare: 204.0, Avg Fare: 93.18
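
Averaging the two per-role averages gives both roles the same weight, however many trips each one has. A possible alternative, shown here only as a sketch on made-up data and not as the method used above, is to stack the trips so that each trip counts once for its pickup and once for its drop-off location:

```python
import pandas as pd

# Toy trips (made up)
trips = pd.DataFrame({
    "PULocationID": [1, 1, 2],
    "DOLocationID": [2, 2, 1],
    "fare_amount": [10.0, 20.0, 60.0],
})

# One row per (trip, role): the fare is attributed to both endpoints
stacked = pd.concat([
    trips[["PULocationID", "fare_amount"]].rename(columns={"PULocationID": "LocationID"}),
    trips[["DOLocationID", "fare_amount"]].rename(columns={"DOLocationID": "LocationID"}),
])

# Trip-weighted average fare per location
weighted_avg = stacked.groupby("LocationID")["fare_amount"].mean()
print(weighted_avg)
```

In this toy case location 1 gets (10 + 20 + 60) / 3 = 30, while averaging the per-role means would give (15 + 60) / 2 = 37.5.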


### 3.13 Build a new dataframe (called `common`) where, for each pickup location we keep all trips to the 5 most common destinations (i.e. each pickup location can have different common destinations)

In [32]:
# Group by PULocationID and DOLocationID to calculate the count of trips for each pair
common_destinations = dataset.groupby(['PULocationID', 'DOLocationID'], observed=False).size().reset_index(name='race_count')

# Sort by PULocationID and race_count to get the most common destinations for each pickup location
common_destinations = common_destinations.sort_values(by=['PULocationID', 'race_count'], ascending=[True, False])

# Add a counter within each PULocationID group
common_destinations['counter'] = common_destinations.groupby('PULocationID', observed=False).cumcount() + 1

# Filter the common_destinations to keep only the top 5 destinations per PULocationID
common_destinations = common_destinations[common_destinations['counter'] <= 5].drop(columns=['counter'])

In [33]:
len(set(zip(common_destinations['PULocationID'], common_destinations['DOLocationID'])))

1305

In [34]:
# Merge this filtered data back to the original dataset to get the corresponding rows
common = dataset.merge(common_destinations[['PULocationID', 'DOLocationID']], 
                       on=['PULocationID', 'DOLocationID'], 
                       how='right')

In [35]:
common

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
0,57670.0,2,2020-01-01 04:41:47,2020-01-01 04:42:22,4,0.00,5,N,1,1,...,89.0,0.0,0.5,5.00,0.00,0.3,94.80,0.0,0 days 00:00:35,04:30 - 05:00
1,61346.0,2,2020-01-01 06:54:53,2020-01-01 06:55:13,1,0.00,5,N,1,1,...,96.0,0.0,0.0,5.08,14.50,0.3,115.88,0.0,0 days 00:00:20,06:30 - 07:00
2,61378.0,1,2020-01-01 06:57:17,2020-01-01 06:58:01,4,0.00,5,N,1,1,...,84.0,0.0,0.0,5.00,0.00,0.3,89.30,0.0,0 days 00:00:44,06:30 - 07:00
3,61427.0,2,2020-01-01 06:20:55,2020-01-01 06:21:09,2,0.00,5,N,1,1,...,150.0,0.0,0.5,0.00,29.00,0.3,179.80,0.0,0 days 00:00:14,06:00 - 06:30
4,62601.0,2,2020-01-01 06:53:39,2020-01-01 06:53:49,1,0.00,5,N,1,1,...,60.0,0.0,0.0,0.00,0.00,0.3,60.30,0.0,0 days 00:00:10,06:30 - 07:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1956079,2802606.0,2,2020-01-15 10:54:42,2020-01-15 13:54:11,1,0.00,5,N,265,132,...,65.0,0.0,0.5,13.10,0.00,0.0,78.60,0.0,0 days 02:59:29,10:30 - 11:00
1956080,3741263.0,2,2020-01-19 19:12:56,2020-01-19 20:08:37,1,39.57,1,N,265,132,...,102.0,0.0,0.5,55.00,25.99,0.3,183.79,0.0,0 days 00:55:41,19:00 - 19:30
1956081,4261939.0,1,2020-01-22 15:20:32,2020-01-22 15:41:44,2,6.40,1,N,265,132,...,22.0,0.0,0.5,0.00,0.00,0.3,22.80,0.0,0 days 00:21:12,15:00 - 15:30
1956082,4717010.0,2,2020-01-24 15:31:54,2020-01-24 15:42:22,1,5.54,1,N,265,132,...,17.0,0.0,0.5,0.00,0.00,0.3,17.80,0.0,0 days 00:10:28,15:30 - 16:00


In [36]:
common.value_counts(['PULocationID', 'DOLocationID'])

PULocationID  DOLocationID
237           236             45537
236           236             38755
              237             38261
237           237             33897
264           264             27896
                              ...  
30            1                   1
99            1                   1
              2                   1
              3                   1
46            61                  1
Name: count, Length: 1305, dtype: int64

In [37]:
common[common['PULocationID'].isnull() | common['DOLocationID'].isnull()]

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,...,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
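
One thing to keep in mind: grouping categorical keys with `observed=False` also yields pairs with zero trips, and a `how='right'` merge may keep such pairs as rows filled with missing values. The following is a minimal sketch of an alternative on made-up data, assuming that only pairs that actually occur should be ranked. The names `top_n`, `pair_counts`, `top_pairs` and `common_sketch` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy trips (made up); location 3 never appears as a pickup
trips = pd.DataFrame({
    "PULocationID": pd.Categorical([1, 1, 1, 2], categories=[1, 2, 3]),
    "DOLocationID": pd.Categorical([1, 1, 2, 3], categories=[1, 2, 3]),
    "fare_amount": [5.0, 6.0, 7.0, 8.0],
})
top_n = 1  # the notebook uses 5; 1 keeps the toy example readable

# Count only the pairs that occur, most frequent destination first
pair_counts = (
    trips.groupby(["PULocationID", "DOLocationID"], observed=True)
    .size()
    .reset_index(name="trip_count")
    .sort_values(["PULocationID", "trip_count"], ascending=[True, False])
)

# Keep the top_n destinations of each pickup location
top_pairs = pair_counts.groupby("PULocationID", observed=True).head(top_n)

# Inner merge: keep only real trips whose pair is among the selected ones
common_sketch = trips.merge(
    top_pairs[["PULocationID", "DOLocationID"]],
    on=["PULocationID", "DOLocationID"],
    how="inner",
)
print(common_sketch)
```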


### 3.14 On the `common` dataframe, for each payment type and each interval, determine the average fare amount

In [38]:
# Compute the average fare amount for each (payment_type, pickup_time_interval) for common dataset
avg_fare_common = common.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(avg_fare_amount=('fare_amount', 'mean')).reset_index()
avg_fare_common

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
0,1,00:00 - 00:30,8.587600
1,1,00:30 - 01:00,8.693222
2,1,01:00 - 01:30,8.495154
3,1,01:30 - 02:00,8.025681
4,1,02:00 - 02:30,7.945085
...,...,...,...
235,5,21:30 - 22:00,
236,5,22:00 - 22:30,
237,5,22:30 - 23:00,
238,5,23:00 - 23:30,
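
The empty averages for payment type 5 above are what one would expect if `pickup_time_interval` is a categorical column (as the `observed=False` argument suggests): every (payment type, interval) pair is emitted, and pairs with no trips get `NaN`. A minimal sketch on made-up data, where `toy`, `avg_fare_by_pair`, and the fare values are illustrative assumptions rather than part of the notebook:

```python
import pandas as pd

# Toy data: three intervals are declared as categories,
# but payment type 5 only ever appears in the first one.
intervals = pd.Categorical(
    ["00:00 - 00:30", "00:30 - 01:00", "00:00 - 00:30", "00:00 - 00:30"],
    categories=["00:00 - 00:30", "00:30 - 01:00", "01:00 - 01:30"],
)
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 5],
    "pickup_time_interval": intervals,
    "fare_amount": [8.0, 10.0, 12.0, 60.0],
})


def avg_fare_by_pair(df, observed):
    """Average fare per (payment_type, pickup_time_interval) pair."""
    return (
        df.groupby(["payment_type", "pickup_time_interval"], observed=observed)
          .agg(avg_fare_amount=("fare_amount", "mean"))
          .reset_index()
    )


all_pairs = avg_fare_by_pair(toy, observed=False)   # all pairs, NaN where no trips exist
seen_pairs = avg_fare_by_pair(toy, observed=True)   # only pairs that occur in the data
```

Keeping `observed=False` gives both average tables the same set of pairs, at the cost of `NaN` rows for combinations that never occur.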


### 3.15 Compute the difference between the average fare amounts computed in the previous point and those computed at point 9

In [39]:
avg_fare_diff = avg_fare_common.rename(columns={'avg_fare_amount': 'avg_fare_amount_comm'}).merge(
    avg_fare.rename(columns={'avg_fare_amount': 'avg_fare_amount_tot'}), 
    on=['payment_type', 'pickup_time_interval'], 
    how='inner'
)

avg_fare_diff['average_fare_amount_diff'] = avg_fare_diff['avg_fare_amount_comm'] - avg_fare_diff['avg_fare_amount_tot']
avg_fare_diff

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount_comm,avg_fare_amount_tot,average_fare_amount_diff
0,1,00:00 - 00:30,8.587600,13.868641,-5.281042
1,1,00:30 - 01:00,8.693222,13.472738,-4.779516
2,1,01:00 - 01:30,8.495154,12.822628,-4.327474
3,1,01:30 - 02:00,8.025681,12.358248,-4.332567
4,1,02:00 - 02:30,7.945085,12.008941,-4.063855
...,...,...,...,...,...
235,5,21:30 - 22:00,,,
236,5,22:00 - 22:30,,,
237,5,22:30 - 23:00,,,
238,5,23:00 - 23:30,,,
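
To make the join step concrete, here is a small sketch of the same rename-and-subtract pattern on made-up numbers. It uses the `suffixes` argument of `merge` instead of two `rename` calls, which is only one possible alternative; the frames `common_avg` and `total_avg` and their values are assumptions for illustration.

```python
import pandas as pd

keys = ["payment_type", "pickup_time_interval"]

# Hypothetical averages on the restricted ("common") and full datasets
common_avg = pd.DataFrame({
    "payment_type": [1, 1, 2],
    "pickup_time_interval": ["00:00 - 00:30", "00:30 - 01:00", "00:00 - 00:30"],
    "avg_fare_amount": [8.5, 8.75, 9.0],
})
total_avg = pd.DataFrame({
    "payment_type": [1, 1, 3],
    "pickup_time_interval": ["00:00 - 00:30", "00:30 - 01:00", "00:00 - 00:30"],
    "avg_fare_amount": [13.5, 13.25, 20.0],
})

# An inner join keeps only the pairs present in both tables;
# the suffixes disambiguate the two avg_fare_amount columns.
diff = common_avg.merge(total_avg, on=keys, how="inner", suffixes=("_comm", "_tot"))
diff["average_fare_amount_diff"] = diff["avg_fare_amount_comm"] - diff["avg_fare_amount_tot"]
```

In this toy case the pairs `(2, "00:00 - 00:30")` and `(3, "00:00 - 00:30")` are dropped because each appears in only one table.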


### 3.16 Compute the ratio between the differences computed in the previous point and the average fare amounts computed in point 9. Note: you have to compute one ratio for each pair (payment type, interval)

In [40]:
avg_fare_diff['ratio_diff_tot_amounts'] = avg_fare_diff['average_fare_amount_diff'] / avg_fare_diff['avg_fare_amount_tot']
avg_fare_diff

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount_comm,avg_fare_amount_tot,average_fare_amount_diff,ratio_diff_tot_amounts
0,1,00:00 - 00:30,8.587600,13.868641,-5.281042,-0.380790
1,1,00:30 - 01:00,8.693222,13.472738,-4.779516,-0.354755
2,1,01:00 - 01:30,8.495154,12.822628,-4.327474,-0.337487
3,1,01:30 - 02:00,8.025681,12.358248,-4.332567,-0.350581
4,1,02:00 - 02:30,7.945085,12.008941,-4.063855,-0.338402
...,...,...,...,...,...,...
235,5,21:30 - 22:00,,,,
236,5,22:00 - 22:30,,,,
237,5,22:30 - 23:00,,,,
238,5,23:00 - 23:30,,,,
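
As a quick check of the ratio, the first row of the table can be recomputed by hand. The ratio is the relative change of the `common` average with respect to the overall average, so it equals `comm / tot - 1`. The variable names below are illustrative only.

```python
# Values copied from row 0 of the table above (payment type 1, 00:00 - 00:30)
comm = 8.587600    # avg_fare_amount_comm
tot = 13.868641    # avg_fare_amount_tot

diff = comm - tot          # average_fare_amount_diff
ratio = diff / tot         # ratio_diff_tot_amounts

# The same quantity written as a relative change
relative_change = comm / tot - 1
```

A ratio of about -0.38 means that, for this pair, the average fare on the `common` dataframe is roughly 38% lower than the average on the full dataset.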


### 3.17 Build chains of trips. Two trips are consecutive in a chain if (a) they have the same VendorID, (b) the pickup location of the second trip is also the dropoff location of the first trip, (c) the pickup time of the second trip is after the dropoff time of the first trip, and (d) the pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip.

**Hint**: Add a column `chain` to the dataset. A chain can have more than two trips.
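
The four conditions can be read as a predicate on a pair of trips. The sketch below spells them out for two trips given as dictionaries; `is_consecutive` and `MAX_GAP` are hypothetical names introduced for illustration, and the sample values are taken from the first vendor-1 rows of the dataset.

```python
import pandas as pd

MAX_GAP = pd.Timedelta(minutes=2)


def is_consecutive(first, second, max_gap=MAX_GAP):
    """True if `second` can directly follow `first` in a chain."""
    gap = second["tpep_pickup_datetime"] - first["tpep_dropoff_datetime"]
    return bool(
        first["VendorID"] == second["VendorID"]              # (a) same vendor
        and second["PULocationID"] == first["DOLocationID"]  # (b) pickup where the first trip ended
        and pd.Timedelta(0) < gap <= max_gap                 # (c) strictly after, (d) within 2 minutes
    )


trip_a = {"VendorID": 1, "PULocationID": 238, "DOLocationID": 239,
          "tpep_pickup_datetime": pd.Timestamp("2020-01-01 00:28:15"),
          "tpep_dropoff_datetime": pd.Timestamp("2020-01-01 00:33:03")}
trip_b = {"VendorID": 1, "PULocationID": 239, "DOLocationID": 238,
          "tpep_pickup_datetime": pd.Timestamp("2020-01-01 00:35:39"),
          "tpep_dropoff_datetime": pd.Timestamp("2020-01-01 00:43:04")}
trip_c = {"VendorID": 1, "PULocationID": 238, "DOLocationID": 238,
          "tpep_pickup_datetime": pd.Timestamp("2020-01-01 00:47:41"),
          "tpep_dropoff_datetime": pd.Timestamp("2020-01-01 00:53:52")}
trip_d = {"VendorID": 1, "PULocationID": 238, "DOLocationID": 151,
          "tpep_pickup_datetime": pd.Timestamp("2020-01-01 00:55:23"),
          "tpep_dropoff_datetime": pd.Timestamp("2020-01-01 01:00:14")}

a_then_b = is_consecutive(trip_a, trip_b)  # gap of 2 min 36 s: too long
c_then_d = is_consecutive(trip_c, trip_d)  # gap of 1 min 31 s at zone 238
```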

In [53]:
# Keep only necessary columns
dataset_clean = dataset[['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID', 'rowID']].copy()

In [85]:
dataset_clean_prova = dataset_clean.copy()
dataset_clean_prova = dataset_clean_prova.sort_values(['VendorID', 'tpep_pickup_datetime'])

In [None]:
import numpy as np
import pandas as pd

# Convert DataFrame columns to NumPy arrays for efficient access
vendor_ids = dataset_clean_prova['VendorID'].to_numpy()
pu_locs = dataset_clean_prova['PULocationID'].to_numpy()
do_locs = dataset_clean_prova['DOLocationID'].to_numpy()
pickup_times = (dataset_clean_prova["tpep_pickup_datetime"].astype("int64") // 10**9).to_numpy()
dropoff_times = (dataset_clean_prova["tpep_dropoff_datetime"].astype("int64") // 10**9).to_numpy()
row_ids = dataset_clean_prova['rowID'].to_numpy()

# Initialize chain array
chains = np.zeros(len(dataset_clean_prova), dtype=int)

chain_id = 0  # Counter for chain IDs
total_rows = len(vendor_ids)
progress_step = max(1, total_rows // 20)  # Print progress every 5% (at least 1 to avoid a modulo by zero on small inputs)

# Loop through all trips
for i in range(total_rows):
    if chains[i] == 0:  # If trip `i` is not yet in a chain, assign a new chain ID
        chain_id += 1
        chains[i] = chain_id

    vendor = vendor_ids[i]
    do_loc = do_locs[i]
    dropoff_time = dropoff_times[i]

    # Consider only future rows (i+1 onwards)
    valid_k = np.where(
        (vendor_ids[i+1:] == vendor) & 
        (pu_locs[i+1:] == do_loc) & 
        (pickup_times[i+1:] > dropoff_time) & 
        (pickup_times[i+1:] <= dropoff_time + 120)  # 2 minutes in seconds
    )[0] + (i + 1)  # Adjust indices

    if len(valid_k) > 0:
        # Select the trip `k` with the **minimum** pickup time
        k = valid_k[np.argmin(pickup_times[valid_k])]

        # Assign chain ID based on `i`
        chains[k] = chains[i]  # k takes the chain of i

    # Print progress
    if i % progress_step == 0:
        print(f"Progress: {i / total_rows * 100:.0f}%")

Progress: 0%
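
The loop above scans every later row for each trip, which is very slow on millions of rows. Because the data is sorted by `VendorID` and pickup time, the candidates for a given trip form a contiguous window that `np.searchsorted` can locate. The following is a sketch of that idea, not the notebook's cell: `build_chains` is a hypothetical helper, it assumes timezone-naive datetime columns, and it adds one assumption of its own, namely that a trip already assigned to a chain is not reassigned.

```python
import numpy as np
import pandas as pd


def build_chains(trips, max_gap_s=120):
    """Return a sorted copy of `trips` with a `chain` column (illustrative sketch)."""
    trips = trips.sort_values(["VendorID", "tpep_pickup_datetime"], kind="stable").copy()
    n = len(trips)
    vendor = trips["VendorID"].to_numpy()
    pu = trips["PULocationID"].to_numpy()
    do = trips["DOLocationID"].to_numpy()
    # Timestamps as whole seconds (assumes timezone-naive datetime64 columns)
    pickup = trips["tpep_pickup_datetime"].to_numpy().astype("datetime64[s]").astype("int64")
    dropoff = trips["tpep_dropoff_datetime"].to_numpy().astype("datetime64[s]").astype("int64")

    chains = np.zeros(n, dtype=np.int64)
    chain_id = 0

    # Boundaries of each vendor's contiguous block in the sorted data
    starts = np.flatnonzero(np.r_[True, vendor[1:] != vendor[:-1]])
    ends = np.r_[starts[1:], n]

    for s, e in zip(starts, ends):
        block_pickup = pickup[s:e]  # sorted within the vendor block
        for i in range(s, e):
            if chains[i] == 0:  # trip i starts a new chain
                chain_id += 1
                chains[i] = chain_id
            # Window of trips with dropoff < pickup <= dropoff + max_gap_s
            lo = s + np.searchsorted(block_pickup, dropoff[i], side="right")
            hi = s + np.searchsorted(block_pickup, dropoff[i] + max_gap_s, side="right")
            lo = max(lo, i + 1)  # only look at later rows
            for k in range(lo, hi):
                # Assumption: the earliest still-unassigned trip at the right zone continues the chain
                if pu[k] == do[i] and chains[k] == 0:
                    chains[k] = chains[i]
                    break

    trips["chain"] = chains
    return trips


# Toy input built from the first rows of the dataset
toy = pd.DataFrame({
    "VendorID": [1, 1, 1, 1, 2],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 00:28:15", "2020-01-01 00:35:39", "2020-01-01 00:47:41",
        "2020-01-01 00:55:23", "2020-01-01 00:01:58"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-01 00:33:03", "2020-01-01 00:43:04", "2020-01-01 00:53:52",
        "2020-01-01 01:00:14", "2020-01-01 00:04:16"]),
    "PULocationID": [238, 239, 238, 238, 193],
    "DOLocationID": [239, 238, 238, 151, 193],
    "rowID": [1, 2, 3, 4, 5],
})
result = build_chains(toy)
```

Each trip is compared only with the trips that start within its two-minute window, so the work per trip depends on the window size rather than on the size of the dataset.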


In [74]:
# Add the chain information back to the DataFrame
dataset_clean_prova['chain'] = chains

In [75]:
dataset_clean_prova

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,rowID,chain
0,1,2020-01-01 00:28:15,2020-01-01 00:33:03,238,239,1,1
1,1,2020-01-01 00:35:39,2020-01-01 00:43:04,239,238,2,2
2,1,2020-01-01 00:47:41,2020-01-01 00:53:52,238,238,3,3
3,1,2020-01-01 00:55:23,2020-01-01 01:00:14,238,151,4,3
4,2,2020-01-01 00:01:58,2020-01-01 00:04:16,193,193,5,4
...,...,...,...,...,...,...,...
6339562,2,2020-01-31 23:38:07,2020-01-31 23:52:21,163,246,6339563,4779912
6339563,2,2020-01-31 23:00:18,2020-01-31 23:19:18,164,79,6339564,4779913
6339564,2,2020-01-31 23:24:22,2020-01-31 23:40:39,79,68,6339565,4779821
6339565,2,2020-01-31 23:44:22,2020-01-31 23:54:00,100,142,6339566,4779421


In [76]:
len(dataset_clean_prova.chain.unique())

4779914

In [77]:
dataset_clean_prova.value_counts('chain')

chain
1730413    37
2952899    36
1875970    34
4387993    33
2947178    31
           ..
1769484     1
1769485     1
1769486     1
1769487     1
4779914     1
Name: count, Length: 4779914, dtype: int64

In [78]:
dataset_clean_prova[dataset_clean_prova.chain == 2947178]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,rowID,chain
3909821,2,2020-01-20 20:16:17,2020-01-20 20:46:27,161,132,3909822,2947178
3909832,2,2020-01-20 20:48:08,2020-01-20 21:23:28,132,100,3909833,2947178
3909851,2,2020-01-20 20:47:02,2020-01-20 21:22:49,132,263,3909852,2947178
3909936,2,2020-01-20 20:47:59,2020-01-20 21:18:35,132,164,3909937,2947178
3909991,2,2020-01-20 20:47:17,2020-01-20 21:23:34,132,181,3909992,2947178
3911417,2,2020-01-20 20:46:50,2020-01-20 21:25:10,132,249,3911418,2947178
3912013,2,2020-01-20 20:46:58,2020-01-20 21:21:37,132,230,3912014,2947178
3912075,2,2020-01-20 20:47:02,2020-01-20 21:22:48,132,236,3912076,2947178
3912527,2,2020-01-20 20:46:30,2020-01-20 20:47:14,132,132,3912528,2947178
3912528,2,2020-01-20 20:46:30,2020-01-20 20:47:14,132,132,3912529,2947178
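
One way to sanity-check a single chain is to order its trips by pickup time and test each trip against the one before it. The sketch below does this under the assumption that a chain is meant to be a linear sequence; `broken_links` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def broken_links(chain_df, max_gap=pd.Timedelta(minutes=2)):
    """Return the trips of a chain that do not validly follow the previous trip."""
    c = chain_df.sort_values("tpep_pickup_datetime", kind="stable")
    if len(c) == 0:
        return c
    gap = c["tpep_pickup_datetime"] - c["tpep_dropoff_datetime"].shift()
    ok = (
        (c["VendorID"] == c["VendorID"].shift())            # (a) same vendor
        & (c["PULocationID"] == c["DOLocationID"].shift())  # (b) pickup at previous drop-off zone
        & (gap > pd.Timedelta(0))                           # (c) pickup after previous drop-off
        & (gap <= max_gap)                                  # (d) within the allowed gap
    )
    ok.iloc[0] = True  # the first trip has no predecessor
    return c[~ok]


# Toy chain: the third trip starts in a zone where the second one did not end
toy_chain = pd.DataFrame({
    "VendorID": [1, 1, 1],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 00:47:41", "2020-01-01 00:55:23", "2020-01-01 01:00:20"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-01 00:53:52", "2020-01-01 01:00:14", "2020-01-01 01:10:00"]),
    "PULocationID": [238, 238, 100],
    "DOLocationID": [238, 151, 7],
    "rowID": [1, 2, 3],
})
flagged = broken_links(toy_chain)
```

An empty result means every trip in the chain follows its predecessor according to conditions (a) to (d); any returned rows are worth inspecting by hand.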
