# **NYC Trip Fare Analysis**

Dataset description extracted from [Kaggle](https://www.kaggle.com/datasets/diishasiing/revenue-for-cab-drivers):

- **VendorID**: A unique identifier for the taxi vendor or service provider.
- **tpep_pickup_datetime**: The date and time when the passenger was picked up.
- **tpep_dropoff_datetime**: The date and time when the passenger was dropped off.
- **passenger_count**: The number of passengers in the taxi.
- **trip_distance**: The total distance of the trip in miles or kilometers.
- **RatecodeID**: The rate code assigned to the trip, representing fare types.
- **store_and_fwd_flag**: Indicates whether the trip data was stored locally and then forwarded later (Y/N).
- **PULocationID**: The unique identifier for the pickup location (zone or area).
- **DOLocationID**: The unique identifier for the drop-off location (zone or area).
- **payment_type**: The method of payment used by the passenger (e.g., cash, card).
- **fare_amount**: The base fare for the trip.
- **extra**: Additional charges applied during the trip (e.g., night surcharge).
- **mta_tax**: The tax imposed by the Metropolitan Transportation Authority.
- **tip_amount**: The tip given to the driver, if applicable.
- **tolls_amount**: The total amount of tolls charged during the trip.
- **improvement_surcharge**: A surcharge imposed for the improvement of services.
- **total_amount**: The total fare amount, including all charges and surcharges.
- **congestion_surcharge**: An additional charge for trips taken during high traffic congestion times.

## **1. Set Environment and Import Libraries**

In [2]:
import sys
import os

# Add 'conf' folder to sys.path (if not already present)
conf_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'conf'))
if conf_path not in sys.path:
    sys.path.append(conf_path)

# Now import conf module from conf folder
import conf

# Verify that the paths are correct
#print("Main directory:", conf.MAIN_DIR)
#print("Notebook directory:", conf.NOTEBOOK_DIR)
#print("Data directory:", conf.DATA_DIR)
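
The `conf` module itself is not shown in this notebook. A minimal sketch of what it might contain is given below. It assumes the notebook is launched from a `notebooks` folder that sits next to `data` and `conf`; the folder names and the use of `pathlib` are assumptions, and only the three attribute names come from the commented `print` calls above.

```python
# conf/conf.py: hypothetical contents, not the project's actual file
from pathlib import Path

# Assumption: the working directory is <project>/notebooks
MAIN_DIR = Path.cwd().resolve().parent
NOTEBOOK_DIR = MAIN_DIR / "notebooks"
DATA_DIR = MAIN_DIR / "data"

# An f-string such as f"{DATA_DIR}/data.csv" then points at <project>/data/data.csv
csv_path = DATA_DIR / "data.csv"
```

Keeping the paths in one module means the notebook cells never hard-code a location on disk.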

Now, import the main packages needed for this project:

In [3]:
# Import libraries
import pandas as pd
import datetime
import time
import numpy as np

## **2. Data Importation and Pre-Processing**

In [4]:
# Read data from the .csv file within data folder
dataset = pd.read_csv(f'{conf.DATA_DIR}/data.csv')


In [5]:
dataset.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


Print the column types to investigate the warning raised while reading the file (columns with mixed data types):

In [6]:
print(dataset.dtypes)

VendorID                 float64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
RatecodeID               float64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object


First, create a copy of the original dataset so that it can be restored if needed.

In [7]:
# Create a copy of the original dataset
dataset_old = dataset.copy()

Now, inspect the values of the `object` column `store_and_fwd_flag`.

In [8]:
dataset.store_and_fwd_flag.unique()

array(['N', 'Y', nan], dtype=object)

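The column mixes strings with missing values (`nan`), so it is stored as `object`; it should become a categorical column with properly encoded missing values. The datetime columns must also be given an explicit datetime type.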

In [9]:
# Clean up missing values before conversion into type "category"
dataset['store_and_fwd_flag'] = dataset['store_and_fwd_flag'].replace(["", " ", "NaN", "nan"], pd.NA).astype("category")
dataset.store_and_fwd_flag.unique()

['N', 'Y', NaN]
Categories (2, object): ['N', 'Y']

Moreover, dates must be converted into the proper format:

In [10]:
# Convert datetime fields to proper format
dataset['tpep_pickup_datetime'] = pd.to_datetime(dataset['tpep_pickup_datetime'], errors='coerce')  # errors='coerce' turns unparseable values into NaT instead of raising
dataset['tpep_dropoff_datetime'] = pd.to_datetime(dataset['tpep_dropoff_datetime'], errors='coerce')

In [11]:
print(dataset[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'store_and_fwd_flag']].dtypes)

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
store_and_fwd_flag             category
dtype: object
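
The effect of `errors='coerce'` can be illustrated on a small example. The two input strings below are made up for the illustration; the second one is deliberately not a valid timestamp.

```python
import pandas as pd

# One valid timestamp and one value that cannot be parsed (toy data)
raw = pd.Series(["2020-01-01 00:28:15", "not a timestamp"])

# With errors="coerce" the unparseable entry becomes NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# Counting NaT values shows how many rows failed to parse
n_failed = int(parsed.isna().sum())
```

Checking `n_failed` after the conversion is a cheap way to see whether any timestamps were silently lost.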


In [12]:
dataset.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


Next, compare the column types with the descriptions in the Kaggle data card to see whether further adjustments are required.

Numerical IDs, counters and numerically encoded categorical variables should be stored as **integers**. The nullable `Int64` dtype can hold missing values in an integer column, which a plain `int64` column cannot (this is why pandas read these columns as `float64`).

Numerical IDs and categorical columns are then converted to the **categorical** dtype.

The columns that require conversion are `VendorID`, `RatecodeID`, `DOLocationID`, `PULocationID` (*numerical IDs*), `payment_type` (*categorical column*) and `passenger_count` (*counter column*).
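
The two-step conversion described here can be sketched on a toy column (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# A column of integer-like IDs is read as float64 when it contains a missing value
ids = pd.Series([1.0, 2.0, np.nan, 2.0])

# Step 1: nullable integer dtype, the missing value becomes <NA>
ids_int = ids.astype("Int64")

# Step 2: categorical dtype, the categories are the distinct non-missing integers
ids_cat = ids_int.astype("category")
```

Going through `Int64` first means the categories are the integers 1 and 2 rather than the floats 1.0 and 2.0.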

In [13]:
col_to_modify = ['VendorID', 'passenger_count', 'RatecodeID', 'payment_type', 'DOLocationID', 'PULocationID']

def type_conversion(col_name):
    column_new = dataset[col_name].astype('Int64')
    
    if col_name != 'passenger_count':
        column_new = column_new.astype('category')

    return column_new

for col in col_to_modify:
    dataset[col] = type_conversion(col)

In [14]:
for col in list(dataset_old.columns):
    print('Column :', col, '| before:', dataset_old[col].dtype, ' -> after:', dataset[col].dtype)

Column : VendorID | before: float64  -> after: category
Column : tpep_pickup_datetime | before: object  -> after: datetime64[ns]
Column : tpep_dropoff_datetime | before: object  -> after: datetime64[ns]
Column : passenger_count | before: float64  -> after: Int64
Column : trip_distance | before: float64  -> after: float64
Column : RatecodeID | before: float64  -> after: category
Column : store_and_fwd_flag | before: object  -> after: category
Column : PULocationID | before: int64  -> after: category
Column : DOLocationID | before: int64  -> after: category
Column : payment_type | before: float64  -> after: category
Column : fare_amount | before: float64  -> after: float64
Column : extra | before: float64  -> after: float64
Column : mta_tax | before: float64  -> after: float64
Column : tip_amount | before: float64  -> after: float64
Column : tolls_amount | before: float64  -> after: float64
Column : improvement_surcharge | before: float64  -> after: float64
Column : total_amount | before

## **3. Assignments**

### 3.1 Extract all trips with `trip_distance` larger than 50

In [15]:
# Extract all trips with distance > 50
dataset[dataset['trip_distance'] > 50]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
23842,2,2020-01-01 01:53:07,2020-01-01 03:54:41,1,52.30,5,N,262,265,1,300.00,0.00,0.0,61.78,6.12,0.3,370.70,2.5
39013,2,2020-01-01 02:05:07,2020-01-01 03:03:10,1,51.23,5,N,264,264,1,329.00,0.00,0.5,100.78,6.12,0.3,436.70,0.0
41620,1,2020-01-01 03:05:54,2020-01-01 04:16:26,1,53.80,5,N,132,265,1,250.00,0.00,0.0,53.35,16.62,0.3,320.27,0.0
58262,2,2020-01-01 05:36:12,2020-01-01 06:40:06,1,55.23,5,N,132,265,2,170.00,0.00,0.5,0.00,18.26,0.3,189.06,0.0
63024,2,2020-01-01 07:40:30,2020-01-01 08:40:01,1,54.19,5,N,132,265,1,230.00,0.00,0.0,0.00,12.24,0.3,242.54,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6326169,2,2020-01-31 22:47:26,2020-01-31 23:49:14,1,51.83,5,N,132,265,1,220.00,0.00,0.5,48.96,23.99,0.3,293.75,0.0
6331181,2,2020-01-31 23:45:36,2020-02-01 01:00:25,5,57.99,4,N,107,265,1,245.00,0.50,0.5,38.24,6.12,0.3,293.16,2.5
6333801,2,2020-01-31 23:24:16,2020-02-01 01:32:56,1,52.97,4,N,264,265,1,227.00,0.50,0.5,46.16,0.00,0.3,276.96,2.5
6397132,,2020-01-28 11:54:00,2020-01-28 19:35:00,,60.36,,,17,61,,12.04,0.00,0.5,0.00,12.24,0.3,25.08,0.0


### 3.2 Extract all trips where `payment_type` is missing

In [16]:
# Trips with missing payment_type
dataset[dataset['payment_type'].isna()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.00,0.3,54.60,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.00,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.20,,,197,216,,24.36,2.75,0.5,0.0,0.00,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.00,0.3,29.63,0.0
6339571,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.00,0.3,28.83,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.0,0.00,0.3,21.14,0.0
6405004,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.0,12.24,0.3,62.46,0.0
6405005,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.0,0.00,0.3,51.90,0.0
6405006,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.0,0.00,0.3,30.22,0.0


### 3.3 For each (`PULocationID`, `DOLocationID`) pair, determine the number of trips

A pair with a missing `PULocationID` or `DOLocationID` would not be meaningful, so only complete pairs should be counted. `DataFrame.value_counts` excludes rows containing missing values by default (`dropna=True`). In this dataset both columns were read as `int64`, so they contain no missing values in the first place.

In [17]:
dataset.value_counts(['PULocationID', 'DOLocationID']).reset_index(name='trip_count')

Unnamed: 0,PULocationID,DOLocationID,trip_count
0,237,236,45539
1,236,236,38775
2,236,237,38264
3,237,237,33909
4,264,264,27928
...,...,...,...
31272,80,108,1
31273,9,70,1
31274,229,6,1
31275,163,245,1
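
The counting step can be checked on a toy frame (illustrative values only). The sketch shows that `value_counts` on two columns skips a row with a missing key and gives the same counts as a `groupby(...).size()`.

```python
import pandas as pd

# Toy trips; the last row has no pickup location
trips = pd.DataFrame({
    "PULocationID": [237, 237, 236, None],
    "DOLocationID": [236, 236, 237, 236],
})

# Count trips per (pickup, drop-off) pair; rows with a missing key are dropped by default
pair_counts = (
    trips.value_counts(["PULocationID", "DOLocationID"])
         .reset_index(name="trip_count")
)

# The same counts via groupby, sorted from most to least frequent
same_with_groupby = (
    trips.groupby(["PULocationID", "DOLocationID"]).size()
         .sort_values(ascending=False)
         .reset_index(name="trip_count")
)
```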


### 3.4 Save all rows with missing `VendorID`, `passenger_count`, `store_and_fwd_flag`, `payment_type` in a new dataframe called `bad`, and remove those rows from the original dataframe

There are two ways to build the row mask:
- `.any(axis=1)`: selects rows where **at least one** of the specified columns is **NaN**.
- `.all(axis=1)`: selects rows where **all specified columns** are **NaN** at the same time.

In this case, the appropriate choice is `.any(axis=1)`, since a row is incomplete as soon as one of these fields is missing:

In [18]:
bad = dataset[dataset[['VendorID', 'passenger_count', 'payment_type', 'store_and_fwd_flag']].isna().any(axis=1)]
bad

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
6339567,,2020-01-01 08:51:00,2020-01-01 09:19:00,,13.69,,,136,232,,51.05,2.75,0.5,0.0,0.00,0.3,54.60,0.0
6339568,,2020-01-01 08:38:43,2020-01-01 08:51:08,,3.42,,,121,9,,27.06,2.75,0.0,0.0,0.00,0.3,30.11,0.0
6339569,,2020-01-01 08:27:00,2020-01-01 08:32:00,,2.20,,,197,216,,24.36,2.75,0.5,0.0,0.00,0.3,27.91,0.0
6339570,,2020-01-01 08:46:00,2020-01-01 08:57:00,,0.84,,,262,236,,26.08,2.75,0.5,0.0,0.00,0.3,29.63,0.0
6339571,,2020-01-01 08:21:00,2020-01-01 08:38:00,,7.24,,,45,142,,25.28,2.75,0.5,0.0,0.00,0.3,28.83,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.0,0.00,0.3,21.14,0.0
6405004,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.0,12.24,0.3,62.46,0.0
6405005,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.0,0.00,0.3,51.90,0.0
6405006,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.0,0.00,0.3,30.22,0.0
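
The task also asks to remove these rows from the original dataframe, while the cell above only builds `bad`. A minimal sketch of both steps on a toy frame follows; the values and the names `key_cols` and `df_clean` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "VendorID": [1, np.nan, 2, np.nan],
    "passenger_count": [1, np.nan, np.nan, 3],
    "payment_type": [1, np.nan, 2, 1],
    "store_and_fwd_flag": ["N", np.nan, "Y", "N"],
    "fare_amount": [6.0, 51.05, 7.0, 5.5],
})
key_cols = ["VendorID", "passenger_count", "payment_type", "store_and_fwd_flag"]

missing_any = df[key_cols].isna().any(axis=1)   # at least one key column is missing
missing_all = df[key_cols].isna().all(axis=1)   # every key column is missing

bad = df[missing_any]          # rows to set aside
df_clean = df[~missing_any]    # the negated mask keeps the remaining rows
```

On the notebook's data the equivalent removal would be `dataset = dataset.drop(bad.index)`. Note that applying it would change the row counts and outputs of the later cells, which were produced with the incomplete rows still present.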


### 3.5 Add a duration column storing how long each trip has taken (use `tpep_pickup_datetime`, `tpep_dropoff_datetime`)

Since `tpep_pickup_datetime` is the date and time at which the passenger was picked up and `tpep_dropoff_datetime` is the date and time at which the passenger was dropped off, the duration is simply the difference `tpep_dropoff_datetime` - `tpep_pickup_datetime`.

In [19]:
dataset['trip_duration'] = dataset['tpep_dropoff_datetime'] - dataset['tpep_pickup_datetime']
dataset[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
0,2020-01-01 00:28:15,2020-01-01 00:33:03,0 days 00:04:48
1,2020-01-01 00:35:39,2020-01-01 00:43:04,0 days 00:07:25
2,2020-01-01 00:47:41,2020-01-01 00:53:52,0 days 00:06:11
3,2020-01-01 00:55:23,2020-01-01 01:00:14,0 days 00:04:51
4,2020-01-01 00:01:58,2020-01-01 00:04:16,0 days 00:02:18
...,...,...,...
6405003,2020-01-31 22:51:00,2020-01-31 23:22:00,0 days 00:31:00
6405004,2020-01-31 22:10:00,2020-01-31 23:26:00,0 days 01:16:00
6405005,2020-01-31 22:50:07,2020-01-31 23:17:57,0 days 00:27:50
6405006,2020-01-31 22:25:53,2020-01-31 22:48:32,0 days 00:22:39


The result is a `timedelta64[ns]` column: each value is a `Timedelta` that expresses the difference in days, hours, minutes and seconds.
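
For later aggregation it is often convenient to have the duration as a plain number. A small sketch, using the first two trips shown above and a hypothetical column name `trip_duration_min`:

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2020-01-01 00:28:15", "2020-01-01 00:35:39"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2020-01-01 00:33:03", "2020-01-01 00:43:04"]),
})

# Timedelta column, as in the notebook
trips["trip_duration"] = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# Duration in minutes as a float (4 min 48 s becomes 4.8)
trips["trip_duration_min"] = trips["trip_duration"].dt.total_seconds() / 60
```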

### 3.6 For each pickup location, determine how many trips have started there 

The pickup location is stored in the `PULocationID` column, so it is enough to count how often each value occurs:

In [20]:
dataset.PULocationID.value_counts().reset_index(name='trip_count')

Unnamed: 0,PULocationID,trip_count
0,237,292989
1,161,282213
2,236,272592
3,162,235602
4,186,228746
...,...,...
256,176,2
257,245,2
258,172,1
259,30,1


### 3.7 Cluster the pickup time of the day into 30-minute intervals (e.g. from 02:00 to 02:30)

In [21]:
# Generate time intervals (30 min bins)
bins = pd.date_range(start='00:00:00', end='23:59:59', freq='30min').time
# Include also the interval "23:30 - 23:59"
bins = np.append(bins, datetime.time(23, 59, 59))

# Define labels (e.g., "02:00-02:30", ...)
labels = [f"{bins[i].strftime('%H:%M')} - {bins[i+1].strftime('%H:%M')}" for i in range(len(bins)-1)]
labels[-1] = '23:30 - 00:00'

In [22]:
dataset['pickup_time_interval'] = pd.cut(dataset['tpep_pickup_datetime'].dt.time, bins=bins, labels=labels, include_lowest=True)

In [23]:
# Check the presence of null values within pickup_time_interval
dataset[dataset['pickup_time_interval'].isna()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval


In [24]:
dataset

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
0,1,2020-01-01 00:28:15,2020-01-01 00:33:03,1,1.20,1,N,238,239,1,6.00,3.00,0.5,1.47,0.00,0.3,11.27,2.5,0 days 00:04:48,00:00 - 00:30
1,1,2020-01-01 00:35:39,2020-01-01 00:43:04,1,1.20,1,N,239,238,1,7.00,3.00,0.5,1.50,0.00,0.3,12.30,2.5,0 days 00:07:25,00:30 - 01:00
2,1,2020-01-01 00:47:41,2020-01-01 00:53:52,1,0.60,1,N,238,238,1,6.00,3.00,0.5,1.00,0.00,0.3,10.80,2.5,0 days 00:06:11,00:30 - 01:00
3,1,2020-01-01 00:55:23,2020-01-01 01:00:14,1,0.80,1,N,238,151,1,5.50,0.50,0.5,1.36,0.00,0.3,8.16,0.0,0 days 00:04:51,00:30 - 01:00
4,2,2020-01-01 00:01:58,2020-01-01 00:04:16,1,0.00,1,N,193,193,2,3.50,0.50,0.5,0.00,0.00,0.3,4.80,0.0,0 days 00:02:18,00:00 - 00:30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6405003,,2020-01-31 22:51:00,2020-01-31 23:22:00,,3.24,,,237,234,,17.59,2.75,0.5,0.00,0.00,0.3,21.14,0.0,0 days 00:31:00,22:30 - 23:00
6405004,,2020-01-31 22:10:00,2020-01-31 23:26:00,,22.13,,,259,45,,46.67,2.75,0.5,0.00,12.24,0.3,62.46,0.0,0 days 01:16:00,22:00 - 22:30
6405005,,2020-01-31 22:50:07,2020-01-31 23:17:57,,10.51,,,137,169,,48.85,2.75,0.0,0.00,0.00,0.3,51.90,0.0,0 days 00:27:50,22:30 - 23:00
6405006,,2020-01-31 22:25:53,2020-01-31 22:48:32,,5.49,,,50,42,,27.17,2.75,0.0,0.00,0.00,0.3,30.22,0.0,0 days 00:22:39,22:00 - 22:30
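
The 30-minute binning of this section can also be sketched without explicit bin edges, by flooring each pickup time to the half hour. This is an alternative to the `pd.cut` approach used above, not the notebook's method, and the sample timestamps are taken from rows shown earlier.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2020-01-01 00:28:15",
    "2020-01-01 00:35:39",
    "2020-01-31 23:45:36",
]))

# Start of the half-hour slot each pickup falls into
start = pickups.dt.floor("30min")
# End of the slot; for 23:30 this rolls over to 00:00 of the next day
end = start + pd.Timedelta(minutes=30)

# Labels in the same "HH:MM - HH:MM" style as the notebook
labels = start.dt.strftime("%H:%M") + " - " + end.dt.strftime("%H:%M")
```

One difference to keep in mind: flooring produces left-closed slots, so a pickup at exactly 00:30:00 is labelled `00:30 - 01:00`, whereas `pd.cut` with its default `right=True` assigns that boundary value to the earlier interval.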


### 3.8 For each interval, determine the average number of passengers and the average fare amount

In [25]:
dataset.groupby('pickup_time_interval', observed=False).agg(avg_passenger_count=('passenger_count', 'mean'),
                                                            avg_fare_amount=('fare_amount', 'mean')).reset_index()

Unnamed: 0,pickup_time_interval,avg_passenger_count,avg_fare_amount
0,00:00 - 00:30,1.572854,13.682459
1,00:30 - 01:00,1.584273,13.302491
2,01:00 - 01:30,1.578807,12.766396
3,01:30 - 02:00,1.589424,12.332639
4,02:00 - 02:30,1.587403,12.159536
5,02:30 - 03:00,1.587833,12.129466
6,03:00 - 03:30,1.581925,12.608487
7,03:30 - 04:00,1.585897,13.279612
8,04:00 - 04:30,1.580107,14.585496
9,04:30 - 05:00,1.516206,17.115791


### 3.9 For each payment type and each interval, determine the average fare amount

In [26]:
# Compute the average fare amount for each (payment_type, pickup_time_interval)
avg_fare = dataset.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(avg_fare_amount=('fare_amount', 'mean')).reset_index()
avg_fare

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
0,1,00:00 - 00:30,13.868641
1,1,00:30 - 01:00,13.472738
2,1,01:00 - 01:30,12.822628
3,1,01:30 - 02:00,12.358248
4,1,02:00 - 02:30,12.008941
...,...,...,...
235,5,21:30 - 22:00,
236,5,22:00 - 22:30,
237,5,22:30 - 23:00,
238,5,23:00 - 23:30,


### 3.10 For each payment type, determine the interval when the average fare amount is maximum

In [27]:
# Find the interval with the maximum average fare for each payment type
avg_fare.loc[avg_fare.groupby('payment_type', observed=False)['avg_fare_amount'].idxmax()]

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
10,1,05:00 - 05:30,21.260986
58,2,05:00 - 05:30,14.856701
110,3,07:00 - 07:30,10.950938
154,4,05:00 - 05:30,6.634043
227,5,17:30 - 18:00,0.0


### 3.11 For each payment type, determine the interval when the overall ratio between the tip and the fare amounts is maximum

In [28]:
# Group by payment_type and pickup_time_interval, and for each pair sum the tip and fare amounts
ratio_amount_df = dataset.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(tot_tip_amount = ('tip_amount', 'sum'),
                                                                                                tot_fare_amount = ('fare_amount', 'sum')
                                                                                                ).reset_index()

# Now, get the ratio between total tip and fare amount for each pair
ratio_amount_df['tot_amount_ratio'] = ratio_amount_df['tot_tip_amount'] / ratio_amount_df['tot_fare_amount'].replace(0, np.nan)  # Avoid division by zero

# Drop rows where tot_amount_ratio is NaN before using idxmax(), to avoid errors
ratio_amount_df = ratio_amount_df.dropna(subset=['tot_amount_ratio'])

# Finally, get the interval for each payment type where the overall ratio is maximum
ratio_amount_df.loc[ratio_amount_df.groupby('payment_type', observed=True)['tot_amount_ratio'].idxmax()]

Unnamed: 0,payment_type,pickup_time_interval,tot_tip_amount,tot_fare_amount,tot_amount_ratio
37,1,18:30 - 19:00,485536.47,1998420.2,0.24296
58,2,05:00 - 05:30,15.0,109464.17,0.000137
138,3,21:00 - 21:30,35.62,5644.17,0.006311
170,4,13:00 - 13:30,36.48,170.05,0.214525


Why do rows with `payment_type == 5` not appear? Let’s analyze the dataset rows where this condition is true:

In [29]:
dataset[['payment_type','pickup_time_interval','tip_amount','fare_amount']][dataset.payment_type == 5]

Unnamed: 0,payment_type,pickup_time_interval,tip_amount,fare_amount
4061635,5,17:30 - 18:00,0.0,0.0


This happens because the only row with this payment type has both amounts equal to zero. The code replaces a zero denominator with `NaN`, so the ratio is `NaN` and the row is dropped before `idxmax()` is applied.
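
The zero-denominator handling can be reproduced on two rows. The first row reuses the totals shown above for payment type 1; the second mimics the single payment type 5 trip.

```python
import numpy as np
import pandas as pd

totals = pd.DataFrame({
    "payment_type": [1, 5],
    "tot_tip_amount": [485536.47, 0.0],
    "tot_fare_amount": [1998420.2, 0.0],
})

# Replacing 0 with NaN in the denominator makes the ratio NaN instead of 0/0
totals["tot_amount_ratio"] = (
    totals["tot_tip_amount"] / totals["tot_fare_amount"].replace(0, np.nan)
)

# Rows without a usable ratio are dropped, so payment type 5 disappears
kept = totals.dropna(subset=["tot_amount_ratio"])
```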

### 3.12 Find the location with the highest average fare amount

There are two location columns: `PULocationID` (*pickup location*) and `DOLocationID` (*drop-off location*), and the task does not say which one to use. This analysis assumes that the fare is driven mainly by the **destination** (for example, trips to airports, stations, or tourist attractions may cost more than trips to other locations), so the data is grouped by the destination location (`DOLocationID`).

In [30]:
# Compute the average fare amount for each location, and find that with the highest average fare amount
avg_fare_locDO = dataset.groupby(['DOLocationID'], observed=False).agg(avg_fare_amount_DOLocation=('fare_amount', 'mean')).reset_index()
avg_fare_locDO.loc[avg_fare_locDO['avg_fare_amount_DOLocation'].idxmax()]

DOLocationID                       44.0
avg_fare_amount_DOLocation    85.548971
Name: 43, dtype: Float64

### 3.13 Build a new dataframe (called `common`) where, for each pickup location we keep all trips to the 5 most common destinations (i.e. each pickup location can have different common destinations)

In [31]:
# Group by PULocationID and DOLocationID to calculate the count of trips for each pair
common_destinations = dataset.groupby(['PULocationID', 'DOLocationID'], observed=False).size().reset_index(name='race_count')

# Sort by PULocationID and race_count to get the most common destinations for each pickup location
common_destinations = common_destinations.sort_values(by=['PULocationID', 'race_count'], ascending=[True, False])

# Add a counter within each PULocationID group
common_destinations['counter'] = common_destinations.groupby('PULocationID', observed=False).cumcount() + 1

# Filter the common_destinations to keep only the top 5 destinations per PULocationID
common_destinations = common_destinations[common_destinations['counter'] <= 5].drop(columns=['counter'])

In [32]:
len(set(zip(common_destinations['PULocationID'], common_destinations['DOLocationID'])))

1305

In [33]:
# Merge this filtered data back to the original dataset to get the corresponding rows
common = dataset.merge(common_destinations[['PULocationID', 'DOLocationID']], 
                       on=['PULocationID', 'DOLocationID'], 
                       how='right')

In [34]:
common

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
0,2,2020-01-01 04:41:47,2020-01-01 04:42:22,4,0.00,5,N,1,1,1,89.00,0.00,0.5,5.00,0.00,0.3,94.80,0.0,0 days 00:00:35,04:30 - 05:00
1,2,2020-01-01 06:54:53,2020-01-01 06:55:13,1,0.00,5,N,1,1,1,96.00,0.00,0.0,5.08,14.50,0.3,115.88,0.0,0 days 00:00:20,06:30 - 07:00
2,1,2020-01-01 06:57:17,2020-01-01 06:58:01,4,0.00,5,N,1,1,1,84.00,0.00,0.0,5.00,0.00,0.3,89.30,0.0,0 days 00:00:44,06:30 - 07:00
3,2,2020-01-01 06:20:55,2020-01-01 06:21:09,2,0.00,5,N,1,1,1,150.00,0.00,0.5,0.00,29.00,0.3,179.80,0.0,0 days 00:00:14,06:00 - 06:30
4,2,2020-01-01 06:53:39,2020-01-01 06:53:49,1,0.00,5,N,1,1,1,60.00,0.00,0.0,0.00,0.00,0.3,60.30,0.0,0 days 00:00:10,06:30 - 07:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1964591,,2020-01-21 06:01:00,2020-01-21 06:48:00,,22.21,,,265,239,,52.21,2.75,0.5,0.00,6.12,0.3,61.88,0.0,0 days 00:47:00,06:00 - 06:30
1964592,,2020-01-25 06:02:00,2020-01-25 06:37:00,,21.30,,,265,239,,52.21,2.75,0.5,0.00,6.12,0.3,61.88,0.0,0 days 00:35:00,06:00 - 06:30
1964593,,2020-01-27 05:59:32,2020-01-27 06:49:38,,21.87,,,265,239,,52.71,2.75,0.0,0.00,6.12,0.3,61.88,0.0,0 days 00:50:06,05:30 - 06:00
1964594,,2020-01-28 06:03:00,2020-01-28 06:53:00,,22.17,,,265,239,,52.21,2.75,0.5,0.00,6.12,0.3,61.88,0.0,0 days 00:50:00,06:00 - 06:30


In [35]:
common.value_counts(['PULocationID', 'DOLocationID'])

PULocationID  DOLocationID
237           236             45539
236           236             38775
              237             38264
237           237             33909
264           264             27928
                              ...  
176           2                   1
204           3                   1
176           1                   1
251           235                 1
84            23                  1
Name: count, Length: 1305, dtype: int64

In [36]:
common[common['PULocationID'].isnull() | common['DOLocationID'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,trip_duration,pickup_time_interval
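
One point that may be worth checking in the cells above: with `observed=False`, grouping two categorical columns also yields location pairs that have no trips at all. For a pickup zone with fewer than five distinct destinations, such zero-count pairs can enter the top five, and a `how='right'` merge then creates a row of missing values for each of them. The sketch below, on toy data with hypothetical values, keeps only observed pairs and uses an inner merge; `top_n` is an illustrative parameter (5 in the notebook).

```python
import pandas as pd

trips = pd.DataFrame({
    "PULocationID": pd.Categorical([1, 1, 1, 2], categories=[1, 2, 3]),
    "DOLocationID": pd.Categorical([2, 2, 3, 1], categories=[1, 2, 3]),
    "fare_amount": [5.0, 6.0, 7.0, 8.0],
})
keys = ["PULocationID", "DOLocationID"]

# observed=False also lists pairs that never occur (count 0)
all_pairs = trips.groupby(keys, observed=False).size().reset_index(name="race_count")
# observed=True keeps only pairs with at least one trip
seen_pairs = trips.groupby(keys, observed=True).size().reset_index(name="race_count")

# Rank destinations within each pickup location, most frequent first
top_n = 1
rank = (
    seen_pairs.groupby("PULocationID", observed=True)["race_count"]
              .rank(method="first", ascending=False)
)
top_pairs = seen_pairs[rank <= top_n]

# An inner merge returns only real trips that belong to a top pair
common_toy = trips.merge(top_pairs[keys], on=keys, how="inner")
```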


### 3.14 On the `common` dataframe, for each payment type and each interval, determine the average fare amount

In [37]:
# Compute the average fare amount for each (payment_type, pickup_time_interval) for common dataset
avg_fare_common = common.groupby(['payment_type', 'pickup_time_interval'], observed=False).agg(avg_fare_amount=('fare_amount', 'mean')).reset_index()
avg_fare_common

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount
0,1,00:00 - 00:30,8.588661
1,1,00:30 - 01:00,8.697597
2,1,01:00 - 01:30,8.489058
3,1,01:30 - 02:00,8.024828
4,1,02:00 - 02:30,7.948038
...,...,...,...
235,5,21:30 - 22:00,
236,5,22:00 - 22:30,
237,5,22:30 - 23:00,
238,5,23:00 - 23:30,


### 3.15 Compute the difference of the average fare amount computed in the previous point with those computed at point 9

In [38]:
avg_fare_diff = avg_fare_common.rename(columns={'avg_fare_amount': 'avg_fare_amount_comm'}).merge(
    avg_fare.rename(columns={'avg_fare_amount': 'avg_fare_amount_tot'}), 
    on=['payment_type', 'pickup_time_interval'], 
    how='inner'
)

avg_fare_diff['average_fare_amount_diff'] = avg_fare_diff['avg_fare_amount_comm'] - avg_fare_diff['avg_fare_amount_tot']
avg_fare_diff

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount_comm,avg_fare_amount_tot,average_fare_amount_diff
0,1,00:00 - 00:30,8.588661,13.868641,-5.279981
1,1,00:30 - 01:00,8.697597,13.472738,-4.775141
2,1,01:00 - 01:30,8.489058,12.822628,-4.333570
3,1,01:30 - 02:00,8.024828,12.358248,-4.333420
4,1,02:00 - 02:30,7.948038,12.008941,-4.060903
...,...,...,...,...,...
235,5,21:30 - 22:00,,,
236,5,22:00 - 22:30,,,
237,5,22:30 - 23:00,,,
238,5,23:00 - 23:30,,,


### 3.16 Compute the ratio between the differences computed in the previous point and those computed in point 9. Note: you have to compute a ratio for each pair (payment type, interval)

In [39]:
avg_fare_diff['ratio_diff_tot_amounts'] = avg_fare_diff['average_fare_amount_diff'] / avg_fare_diff['avg_fare_amount_tot']
avg_fare_diff

Unnamed: 0,payment_type,pickup_time_interval,avg_fare_amount_comm,avg_fare_amount_tot,average_fare_amount_diff,ratio_diff_tot_amounts
0,1,00:00 - 00:30,8.588661,13.868641,-5.279981,-0.380714
1,1,00:30 - 01:00,8.697597,13.472738,-4.775141,-0.354430
2,1,01:00 - 01:30,8.489058,12.822628,-4.333570,-0.337963
3,1,01:30 - 02:00,8.024828,12.358248,-4.333420,-0.350650
4,1,02:00 - 02:30,7.948038,12.008941,-4.060903,-0.338157
...,...,...,...,...,...,...
235,5,21:30 - 22:00,,,,
236,5,22:00 - 22:30,,,,
237,5,22:30 - 23:00,,,,
238,5,23:00 - 23:30,,,,


### 3.17 Build chains of trips. Two trips are consecutive in a chain if (a) they have the same VendorID, (b) the pickup location of the second trip is also the dropoff location of the first trip, (c) the pickup time of the second trip is after the dropoff time of the first trip, and (d) the pickup time of the second trip is at most 2 minutes later than the dropoff time of the first trip.

**Hint**: Add a column `chain` to the dataset. A chain can have more than two trips.

In [45]:
# Reduce data excluding NaN values for vendorID, tpep_pickup_datetime, tpep_dropoff_datetime, PULocationID, DOLocationID
dataset_sorted = dataset.dropna(subset=['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID']).copy()

dataset_sorted = dataset_sorted[['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_duration', 'pickup_time_interval']]

# Ensure data is sorted for proper chaining
dataset_sorted = dataset_sorted.sort_values(by=['VendorID', 'tpep_dropoff_datetime'])

# Shift columns to compare current trip with the next one
dataset_sorted['next_PULocationID'] = dataset_sorted.groupby('VendorID', observed = False)['PULocationID'].shift(-1)
dataset_sorted['next_pickup_time'] = dataset_sorted.groupby('VendorID', observed = False)['tpep_pickup_datetime'].shift(-1)

# Compute time difference between consecutive trips
# Expressed in seconds, as in the output shown for this dataframe
dataset_sorted['time_diff'] = (dataset_sorted['next_pickup_time'] - dataset_sorted['tpep_dropoff_datetime']).dt.total_seconds()

In [43]:
dataset_sorted

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,trip_duration,pickup_time_interval,next_PULocationID,next_pickup_time,time_diff
1487,1,2020-01-01 00:01:40,2020-01-01 00:01:52,79,79,0 days 00:00:12,00:00 - 00:30,158,2020-01-01 00:00:50,-62.0
10545,1,2020-01-01 00:00:50,2020-01-01 00:02:32,158,158,0 days 00:01:42,00:00 - 00:30,75,2020-01-01 00:00:07,-145.0
5050,1,2020-01-01 00:00:07,2020-01-01 00:03:26,75,75,0 days 00:03:19,00:00 - 00:30,141,2020-01-01 00:01:55,-91.0
7236,1,2020-01-01 00:01:55,2020-01-01 00:04:34,141,140,0 days 00:02:39,00:00 - 00:30,236,2020-01-01 00:01:01,-213.0
12297,1,2020-01-01 00:01:01,2020-01-01 00:04:46,236,236,0 days 00:03:45,00:00 - 00:30,181,2020-01-01 00:01:59,-167.0
...,...,...,...,...,...,...,...,...,...,...
4269480,2,2020-07-10 11:34:11,2020-07-10 11:42:41,236,262,0 days 00:08:30,11:30 - 12:00,236,2020-07-31 18:50:41,1840080.0
4282277,2,2020-07-31 18:50:41,2020-07-31 18:54:12,236,43,0 days 00:03:31,18:30 - 19:00,142,2021-01-02 00:22:00,13325268.0
275044,2,2021-01-02 00:22:00,2021-01-02 00:36:50,142,161,0 days 00:14:50,00:00 - 00:30,170,2021-01-02 00:44:08,438.0
275045,2,2021-01-02 00:44:08,2021-01-02 00:58:56,170,148,0 days 00:14:48,00:30 - 01:00,90,2021-01-02 01:12:10,794.0


In [39]:
# Reduce data excluding NaN values for vendorID, tpep_pickup_datetime, tpep_dropoff_datetime, PULocationID, DOLocationID
dataset_sorted_left = dataset.dropna(subset=['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID']).copy()

# Keep only necessary columns
dataset_sorted_left = dataset_sorted_left[['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_duration', 'pickup_time_interval']]

# Add a 1-based rowID (the dataframe is an explicit copy, so no SettingWithCopyWarning is raised)
dataset_sorted_left['rowID'] = range(1, len(dataset_sorted_left) + 1)

# Reorder columns to place 'rowID' as the first column
dataset_sorted_left = dataset_sorted_left[['rowID'] + [col for col in dataset_sorted_left.columns if col != 'rowID']]

In [40]:
dataset_sorted_left

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,trip_duration,pickup_time_interval
0,1,1,2020-01-01 00:28:15,2020-01-01 00:33:03,238,239,0 days 00:04:48,00:00 - 00:30
1,2,1,2020-01-01 00:35:39,2020-01-01 00:43:04,239,238,0 days 00:07:25,00:30 - 01:00
2,3,1,2020-01-01 00:47:41,2020-01-01 00:53:52,238,238,0 days 00:06:11,00:30 - 01:00
3,4,1,2020-01-01 00:55:23,2020-01-01 01:00:14,238,151,0 days 00:04:51,00:30 - 01:00
4,5,2,2020-01-01 00:01:58,2020-01-01 00:04:16,193,193,0 days 00:02:18,00:00 - 00:30
...,...,...,...,...,...,...,...,...
6339562,6339563,2,2020-01-31 23:38:07,2020-01-31 23:52:21,163,246,0 days 00:14:14,23:30 - 00:00
6339563,6339564,2,2020-01-31 23:00:18,2020-01-31 23:19:18,164,79,0 days 00:19:00,23:00 - 23:30
6339564,6339565,2,2020-01-31 23:24:22,2020-01-31 23:40:39,79,68,0 days 00:16:17,23:00 - 23:30
6339565,6339566,2,2020-01-31 23:44:22,2020-01-31 23:54:00,100,142,0 days 00:09:38,23:30 - 00:00


In [41]:
# Create a copy of the dataset
dataset_sorted_right = dataset_sorted_left.copy()

In [43]:
dataset_sorted_right[dataset_sorted_right.VendorID == 1]

Unnamed: 0,rowID,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,PULocationID,DOLocationID,trip_duration,pickup_time_interval
0,1,1,2020-01-01 00:28:15,2020-01-01 00:33:03,238,239,0 days 00:04:48,00:00 - 00:30
1,2,1,2020-01-01 00:35:39,2020-01-01 00:43:04,239,238,0 days 00:07:25,00:30 - 01:00
2,3,1,2020-01-01 00:47:41,2020-01-01 00:53:52,238,238,0 days 00:06:11,00:30 - 01:00
3,4,1,2020-01-01 00:55:23,2020-01-01 01:00:14,238,151,0 days 00:04:51,00:30 - 01:00
9,10,1,2020-01-01 00:29:01,2020-01-01 00:40:28,246,48,0 days 00:11:27,00:00 - 00:30
...,...,...,...,...,...,...,...,...
6339543,6339544,1,2020-01-31 23:31:46,2020-01-31 23:41:29,100,233,0 days 00:09:43,23:30 - 00:00
6339544,6339545,1,2020-01-31 23:26:26,2020-01-31 23:48:26,79,48,0 days 00:22:00,23:00 - 23:30
6339550,6339551,1,2020-01-31 23:02:57,2020-01-31 23:15:20,230,236,0 days 00:12:23,23:00 - 23:30
6339551,6339552,1,2020-01-31 23:25:53,2020-01-31 23:35:44,237,140,0 days 00:09:51,23:00 - 23:30


In [44]:
chain_dataset = dataset_sorted_left[dataset_sorted_left.VendorID == 1].merge(
    dataset_sorted_right[dataset_sorted_right.VendorID == 1],
    left_on=['VendorID', 'DOLocationID'],
    right_on=['VendorID', 'PULocationID'],
    suffixes=('_l', '_r'),
    how='inner')

In [222]:
dataset.shape

(6405008, 20)

In [None]:
dataset_sorted.merge( dataset_sorted,
                      left_on=['VendorID', 'DOLocationID'], 
                      right_on=['VendorID', 'PULocationID'],
                      suffixes=('_df1', '_df2'),
                      how='inner')
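
The self-merge above pairs every trip with every trip of the same vendor that starts at its drop-off location, which can become very large on 6.4 million rows. One possible alternative, sketched here on toy data, uses `pd.merge_asof` to find for each trip the earliest later pickup at its drop-off location within two minutes, and then propagates a chain id along those links. The helper name `add_chain_column`, its `max_gap` parameter and the toy values are hypothetical. The sketch assumes non-missing keys stored as plain integers, and when several trips point to the same successor only the first one keeps the link.

```python
import pandas as pd

def add_chain_column(trips: pd.DataFrame, max_gap: str = "2min") -> pd.DataFrame:
    """Hypothetical helper: give chained trips a shared integer id in a `chain` column."""
    t = trips.reset_index(drop=True).copy()
    t["trip_id"] = t.index

    # Candidate successors: renamed so that a trip's drop-off zone is matched
    # against the successor's pickup zone through the shared `by` columns.
    successors = (
        t[["trip_id", "VendorID", "PULocationID", "tpep_pickup_datetime"]]
        .rename(columns={"trip_id": "next_trip_id",
                         "PULocationID": "DOLocationID",
                         "tpep_pickup_datetime": "next_pickup"})
        .sort_values("next_pickup")
    )
    # For each trip: the earliest pickup strictly after its drop-off, within max_gap.
    linked = pd.merge_asof(
        t.sort_values("tpep_dropoff_datetime"), successors,
        left_on="tpep_dropoff_datetime", right_on="next_pickup",
        by=["VendorID", "DOLocationID"],
        direction="forward", allow_exact_matches=False,
        tolerance=pd.Timedelta(max_gap),
    )
    next_of = linked.set_index("trip_id")["next_trip_id"].dropna().astype(int).to_dict()

    # Walk trips in pickup order and pass each chain id on to the linked successor.
    chain, next_id = {}, 0
    for trip_id in t.sort_values("tpep_pickup_datetime")["trip_id"]:
        if trip_id not in chain:
            chain[trip_id] = next_id
            next_id += 1
        nxt = next_of.get(trip_id)
        if nxt is not None and nxt not in chain:
            chain[nxt] = chain[trip_id]
    t["chain"] = t["trip_id"].map(chain)
    return t.drop(columns="trip_id")

# Toy data: trips 0 -> 1 -> 2 are linked; trip 3 starts 5 minutes after trip 2 ends;
# trip 4 belongs to another vendor.
toy = pd.DataFrame({
    "VendorID": [1, 1, 1, 1, 2],
    "PULocationID": [10, 20, 30, 40, 20],
    "DOLocationID": [20, 30, 40, 50, 30],
    "tpep_pickup_datetime": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:11:00", "2020-01-01 00:21:30",
        "2020-01-01 00:35:00", "2020-01-01 00:11:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2020-01-01 00:10:00", "2020-01-01 00:20:00", "2020-01-01 00:30:00",
        "2020-01-01 00:40:00", "2020-01-01 00:15:00"]),
})
chained = add_chain_column(toy)
```

The Python loop is easy to follow but would be slow on the full dataset; a vectorised labelling of connected trips would be the natural next step if this approach turns out to fit the task.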