# **NYC Trip Fare Analysis**

Dataset description extracted from [Kaggle](https://www.kaggle.com/datasets/diishasiing/revenue-for-cab-drivers):

- **VendorID**: A unique identifier for the taxi vendor or service provider.
- **tpep_pickup_datetime**: The date and time when the passenger was picked up.
- **tpep_dropoff_datetime**: The date and time when the passenger was dropped off.
- **passenger_count**: The number of passengers in the taxi.
- **trip_distance**: The total distance of the trip in miles or kilometers.
- **RatecodeID**: The rate code assigned to the trip, representing fare types.
- **store_and_fwd_flag**: Indicates whether the trip data was stored locally and then forwarded later (Y/N).
- **PULocationID**: The unique identifier for the pickup location (zone or area).
- **DOLocationID**: The unique identifier for the drop-off location (zone or area).
- **payment_type**: The method of payment used by the passenger (e.g., cash, card).
- **fare_amount**: The base fare for the trip.
- **extra**: Additional charges applied during the trip (e.g., night surcharge).
- **mta_tax**: The tax imposed by the Metropolitan Transportation Authority.
- **tip_amount**: The tip given to the driver, if applicable.
- **tolls_amount**: The total amount of tolls charged during the trip.
- **improvement_surcharge**: A surcharge imposed for the improvement of services.
- **total_amount**: The total fare amount, including all charges and surcharges.
- **congestion_surcharge**: An additional charge for trips taken during high traffic congestion times.

## **1. Set Environment and Import Libraries**

In [2]:
import sys
import os

# Add 'conf' folder to sys.path (if not already present)
conf_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'conf'))
if conf_path not in sys.path:
    sys.path.append(conf_path)

# Now import conf module from conf folder
import conf

# Check that the paths are correct
#print("Main directory:", conf.MAIN_DIR)
#print("Notebook directory:", conf.NOTEBOOK_DIR)
#print("Data directory:", conf.DATA_DIR)
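
The `conf` module itself is not shown in this notebook. A minimal sketch of what it might contain follows, assuming a project root with `notebooks` and `data` subfolders. Only the names `MAIN_DIR`, `NOTEBOOK_DIR` and `DATA_DIR` come from the commented-out prints above; the folder names and the `build_paths` helper are hypothetical.

```python
from pathlib import Path


def build_paths(main_dir):
    """Return the project directories derived from the project root.

    The folder names 'notebooks' and 'data' are assumptions, not taken
    from the original project.
    """
    main_dir = Path(main_dir).resolve()
    return {
        "MAIN_DIR": main_dir,
        "NOTEBOOK_DIR": main_dir / "notebooks",
        "DATA_DIR": main_dir / "data",
    }


# In a real conf.py the root would more likely be derived from __file__;
# here the parent of the current working directory stands in for it.
paths = build_paths(Path.cwd().parent)
MAIN_DIR = paths["MAIN_DIR"]
NOTEBOOK_DIR = paths["NOTEBOOK_DIR"]
DATA_DIR = paths["DATA_DIR"]
```

Keeping the paths in one module means the notebook can refer to `conf.DATA_DIR` without hard-coding a location.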

Now import the main packages needed for this project:

In [3]:
# Import libraries
import pandas as pd
import datetime

## **2. Data Importation and Pre-Processing**

In [18]:
# Read data from the .csv file within data folder
dataset = pd.read_csv(f'{conf.DATA_DIR}/data.csv')



In [19]:
dataset.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0
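
As a quick sanity check on the column descriptions, the sketch below recomputes `total_amount` from its components for the five rows printed above. The values are copied from that output. In these rows the total is reproduced without adding `congestion_surcharge` separately; whether this holds for the full dataset is not verified here.

```python
import pandas as pd

# The five rows printed by dataset.head(), monetary columns only
sample = pd.DataFrame({
    "fare_amount": [6.0, 7.0, 6.0, 5.5, 3.5],
    "extra": [3.0, 3.0, 3.0, 0.5, 0.5],
    "mta_tax": [0.5, 0.5, 0.5, 0.5, 0.5],
    "tip_amount": [1.47, 1.5, 1.0, 1.36, 0.0],
    "tolls_amount": [0.0, 0.0, 0.0, 0.0, 0.0],
    "improvement_surcharge": [0.3, 0.3, 0.3, 0.3, 0.3],
    "total_amount": [11.27, 12.3, 10.8, 8.16, 4.8],
})

components = [
    "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge",
]
recomputed = sample[components].sum(axis=1)

# Compare with a tolerance because the amounts are stored as floats
matches = (recomputed - sample["total_amount"]).abs() < 0.005
print(matches)
```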


Print the column types to investigate the warning raised while reading the file (columns with mixed data types):

In [20]:
print(dataset.dtypes)

VendorID                 float64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
RatecodeID               float64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object


In [21]:
dataset.store_and_fwd_flag.unique()

array(['N', 'Y', nan], dtype=object)

The issue is the missing values (`nan`), which are floats mixed into a column of strings. The datetime columns also need an explicit datetime type.
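
One way to make the mixed types visible is to count the Python type of each value in the column. The sketch below uses a small made-up Series in place of `dataset.store_and_fwd_flag`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the store_and_fwd_flag column
flags = pd.Series(["N", "Y", np.nan], dtype=object)

# Count how many values of each Python type the column holds
type_counts = flags.map(lambda value: type(value).__name__).value_counts().to_dict()
print(type_counts)
```

The strings show up as `str`, while the missing value shows up as `float`, which is why the column is stored with the generic `object` dtype.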

In [22]:
dataset['tpep_pickup_datetime'] = pd.to_datetime(dataset['tpep_pickup_datetime'])
dataset['tpep_dropoff_datetime'] = pd.to_datetime(dataset['tpep_dropoff_datetime'])

In [23]:
dataset['store_and_fwd_flag'] = dataset['store_and_fwd_flag'].astype('category')
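
The same conversions could also be requested at load time through the `dtype` and `parse_dates` arguments of `pd.read_csv`. The sketch below is an alternative to the cells above, not what this notebook does. It reads a three-row in-memory CSV, with timestamps taken from the rows shown earlier and the last flag left empty on purpose to stand in for a missing value.

```python
import io

import pandas as pd

# Tiny in-memory CSV; the empty flag in the last row is made up
csv_text = """tpep_pickup_datetime,tpep_dropoff_datetime,store_and_fwd_flag
2020-01-01 00:28:15,2020-01-01 00:33:03,N
2020-01-01 00:35:39,2020-01-01 00:43:04,Y
2020-01-01 00:47:41,2020-01-01 00:53:52,
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"store_and_fwd_flag": "category"},
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

# With real datetime columns, durations come from plain subtraction
duration = sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
print(sample.dtypes)
print(duration)
```

Declaring the types up front keeps pandas from having to guess them chunk by chunk, which is what usually triggers a mixed-type warning on large files.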

In [24]:
print(dataset.dtypes)

VendorID                        float64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag             category
PULocationID                      int64
DOLocationID                      int64
payment_type                    float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object


In [25]:
dataset.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


Next, compare each column with its description in the Kaggle data card to see whether further type adjustments are needed.

In [26]:
# Define a function to convert float to categorical
def float_to_cat(column):
    column_new = column.astype('Int64') # nullable Int64 supports NaN, unlike plain 'int'
    column_new = column_new.astype('category')
    return column_new
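
To illustrate what the function does, the sketch below applies the same two-step conversion to a small made-up Series containing a missing value.

```python
import numpy as np
import pandas as pd


def float_to_cat(column):
    # Same two-step conversion as the function above
    return column.astype("Int64").astype("category")


ids = pd.Series([1.0, 2.0, np.nan, 2.0])  # made-up values
converted = float_to_cat(ids)

categories = list(converted.cat.categories)
missing = int(converted.isna().sum())
print(converted)
```

Going through `Int64` first means the category labels are the integers `1` and `2` rather than the floats `1.0` and `2.0`, and the missing value stays missing instead of becoming a category of its own.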

In [27]:
# VendorID
dataset.VendorID.unique()

array([ 1.,  2., nan])

Since it is an identifier, it is better to convert it to an integer and then to a categorical:

In [28]:
dataset['VendorID'] = float_to_cat(dataset.VendorID)

In [29]:
dataset.VendorID.unique()

[1, 2, NaN]
Categories (2, Int64): [1, 2]

The same applies to `RatecodeID`:

In [31]:
dataset.RatecodeID.unique()

array([ 1.,  5.,  3.,  2.,  4., 99.,  6., nan])

In [32]:
dataset['RatecodeID'] = float_to_cat(dataset.RatecodeID)
dataset.RatecodeID.unique()

[1, 5, 3, 2, 4, 99, 6, NaN]
Categories (7, Int64): [1, 2, 3, 4, 5, 6, 99]

`PULocationID` and `DOLocationID` are converted too, because they are identifiers:

In [34]:
dataset['PULocationID'] = float_to_cat(dataset.PULocationID)
dataset.PULocationID.unique()

[238, 239, 193, 7, 246, ..., 59, 245, 176, 204, 27]
Length: 261
Categories (261, Int64): [1, 2, 3, 4, ..., 262, 263, 264, 265]

In [35]:
dataset['DOLocationID'] = float_to_cat(dataset.DOLocationID)
dataset.DOLocationID.unique()

[239, 238, 151, 193, 48, ..., 109, 84, 172, 105, 2]
Length: 262
Categories (262, Int64): [1, 2, 3, 4, ..., 262, 263, 264, 265]

`payment_type` is categorical as well:

In [40]:
dataset['payment_type'] = float_to_cat(dataset.payment_type)
dataset.payment_type.unique()

[1, 2, 4, 3, 5, NaN]
Categories (5, Int64): [1, 2, 3, 4, 5]

In [36]:
dataset.dtypes

VendorID                       category
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                     category
store_and_fwd_flag             category
PULocationID                   category
DOLocationID                   category
payment_type                    float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object

Columns that hold counts must be integers (here, `passenger_count`):

In [None]:
# Use the nullable 'Int64' dtype: plain 'int' cannot hold the missing values
dataset['passenger_count'] = dataset['passenger_count'].astype('Int64')

In [44]:
dataset.passenger_count.unique()

<IntegerArray>
[1, 4, 2, 3, 6, 5, 0, 8, 7, 9, <NA>]
Length: 11, dtype: Int64
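
The `<NA>` in the output is why the nullable `Int64` dtype is needed for this column. A short sketch on made-up values shows the difference from a plain `int` cast:

```python
import numpy as np
import pandas as pd

counts = pd.Series([1.0, 4.0, np.nan])  # made-up values

try:
    counts.astype("int")
    plain_int_failed = False
except ValueError:
    # NumPy integers have no representation for missing values
    plain_int_failed = True

# The nullable extension dtype keeps the gap as <NA>
nullable = counts.astype("Int64")
print(plain_int_failed)
print(nullable)
```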

In [38]:
dataset.dtypes

VendorID                       category
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   Int64
trip_distance                   float64
RatecodeID                     category
store_and_fwd_flag             category
PULocationID                   category
DOLocationID                   category
payment_type                    float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object