### Jupyter Notebook to clean up the nytaxi2022.csv dataset

[This notebook is not for submission, only for personal use]
This notebook performs basic exploratory data analysis to check for missing or invalid data before running Stochastic Gradient Descent with MPI.
The description of the attributes from [Kaggle](https://www.kaggle.com/datasets/diishasiing/revenue-for-cab-drivers/data) is as follows:

* VendorID: A unique identifier for the taxi vendor or service provider.
* tpep_pickup_datetime: The date and time when the passenger was picked up.
* tpep_dropoff_datetime: The date and time when the passenger was dropped off.
* passenger_count: The number of passengers in the taxi.
* trip_distance: The total distance of the trip in miles or kilometers.
* RatecodeID: The rate code assigned to the trip, representing fare types.
* store_and_fwd_flag: Indicates whether the trip data was stored locally and then forwarded later (Y/N).
* PULocationID: The unique identifier for the pickup location (zone or area).
* DOLocationID: The unique identifier for the drop-off location (zone or area).
* payment_type: The method of payment used by the passenger (e.g., cash, card).
* fare_amount: The base fare for the trip.
* extra: Additional charges applied during the trip (e.g., night surcharge).
* mta_tax: The tax imposed by the Metropolitan Transportation Authority.
* tip_amount: The tip given to the driver, if applicable.
* tolls_amount: The total amount of tolls charged during the trip.
* improvement_surcharge: A surcharge imposed for the improvement of services.
* total_amount: The total fare amount, including all charges and surcharges.
* congestion_surcharge: An additional charge for trips taken during high traffic congestion times.


Below we import the file as a pandas DataFrame and show the first five rows of the dataset.

In [1]:
import pandas as pd
import numpy as np

taxi_data = pd.read_csv('../../data/nytaxi2022.csv', header=0)

taxi_data.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,01/01/2022 12:35:40 AM,01/01/2022 12:53:29 AM,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,01/01/2022 12:33:43 AM,01/01/2022 12:42:07 AM,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,01/01/2022 12:53:21 AM,01/01/2022 01:02:19 AM,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,01/01/2022 12:25:21 AM,01/01/2022 12:35:23 AM,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,01/01/2022 12:36:48 AM,01/01/2022 01:14:20 AM,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0


Below we restrict the dataset to the feature columns and then gather descriptive statistics for them, to get a high-level understanding of how the values of each attribute are distributed.

In [2]:
# Restricting the dataset to the feature columns mentioned in the problem statement
taxi_data = taxi_data[[
    'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count',
    'trip_distance', 'RatecodeID', 'PULocationID', 'DOLocationID',
    'payment_type', 'extra', 'total_amount',
]]
taxi_data.columns

Index(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count',
       'trip_distance', 'RatecodeID', 'PULocationID', 'DOLocationID',
       'payment_type', 'extra', 'total_amount'],
      dtype='object')

In [3]:
taxi_data.describe()

Unnamed: 0,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,extra,total_amount
count,38287800.0,39656100.0,38287800.0,39656100.0,39656100.0,39656100.0,39656100.0,39656100.0
mean,1.401149,5.959399,1.424172,164.866,162.5752,1.18955,1.007532,21.67127
std,0.9628938,599.1907,5.794343,65.31082,70.23146,0.5190411,1.262564,96.3736
min,0.0,0.0,1.0,1.0,1.0,0.0,-22.18,-2567.8
25%,1.0,1.1,1.0,132.0,113.0,1.0,0.0,12.3
50%,1.0,1.9,1.0,162.0,162.0,1.0,0.5,15.96
75%,1.0,3.56,1.0,234.0,234.0,1.0,2.5,23.16
max,9.0,389678.5,99.0,265.0,265.0,5.0,33.5,401095.6


In [4]:
len(taxi_data)

39656098

With the code below, we compute the percentage of missing data in each column to decide whether to drop the affected rows or impute the values.

In [5]:
taxi_data.isna().sum() * 100 / len(taxi_data)

tpep_pickup_datetime     0.000000
tpep_dropoff_datetime    0.000000
passenger_count          3.450423
trip_distance            0.000000
RatecodeID               3.450423
PULocationID             0.000000
DOLocationID             0.000000
payment_type             0.000000
extra                    0.000000
total_amount             0.000000
dtype: float64
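
Since passenger_count and RatecodeID show the same missing percentage, it may be worth checking whether they are missing on the same rows. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for taxi_data; the values are made up for illustration.
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, np.nan],
    "RatecodeID": [1.0, np.nan, 1.0, np.nan],
    "trip_distance": [3.8, 2.1, 0.97, 1.09],
})

# Percentage of missing values per column (equivalent to isna().sum() * 100 / len)
missing_pct = toy.isna().mean() * 100

# True when the two columns are missing on exactly the same rows
same_rows = toy["passenger_count"].isna().equals(toy["RatecodeID"].isna())

print(missing_pct)
print("missing on the same rows:", same_rows)
```

If the two columns turn out to be missing together, a single `dropna()` removes about the same share of rows as either column alone.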

Next we convert the pickup and dropoff timestamps to datetime format, if they are not already, and extract the date, month, year, day of week, hour, minute, and second as separate features, since cyclical trends in taxi rides may be more useful than raw timestamps.

In [6]:
# Get date, month, year, day of the week, hour, minute, second for pickup and dropoff datetime
# Make a method to do this and then apply it to both pickup and dropoff datetime
def get_datetime_features(df, col_name):
    df[col_name + '_date'] = df[col_name].dt.day
    df[col_name + '_month'] = df[col_name].dt.month
    df[col_name + '_year'] = df[col_name].dt.year
    df[col_name + '_dayofweek'] = df[col_name].dt.dayofweek
    df[col_name + '_hour'] = df[col_name].dt.hour
    df[col_name + '_minute'] = df[col_name].dt.minute
    df[col_name + '_second'] = df[col_name].dt.second
    return df

taxi_data['tpep_pickup_datetime'] = pd.to_datetime(taxi_data['tpep_pickup_datetime'])
taxi_data['tpep_dropoff_datetime'] = pd.to_datetime(taxi_data['tpep_dropoff_datetime'])

taxi_data = get_datetime_features(taxi_data, 'tpep_pickup_datetime')
taxi_data = get_datetime_features(taxi_data, 'tpep_dropoff_datetime')

#taxi_data.drop(columns=['tpep_pickup_datetime', 'tpep_dropoff_datetime'], inplace=True)
taxi_data.head()


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,extra,total_amount,...,tpep_pickup_datetime_hour,tpep_pickup_datetime_minute,tpep_pickup_datetime_second,tpep_dropoff_datetime_date,tpep_dropoff_datetime_month,tpep_dropoff_datetime_year,tpep_dropoff_datetime_dayofweek,tpep_dropoff_datetime_hour,tpep_dropoff_datetime_minute,tpep_dropoff_datetime_second
0,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,142,236,1,3.0,21.95,...,0,35,40,1,1,2022,5,0,53,29
1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,236,42,1,0.5,13.3,...,0,33,43,1,1,2022,5,0,42,7
2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,166,166,1,0.5,10.56,...,0,53,21,1,1,2022,5,1,2,19
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,114,68,2,0.5,11.8,...,0,25,21,1,1,2022,5,0,35,23
4,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,68,163,1,0.5,30.3,...,0,36,48,1,1,2022,5,1,14,20
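
The raw timestamps follow a month/day/year layout with a 12-hour clock, so the format can be passed to `pd.to_datetime` explicitly instead of being inferred. The sketch below also shows one common way to expose the cyclical nature of the hour feature, a sine/cosine encoding. This encoding is an assumption for illustration and is not part of the notebook's pipeline; the names `stamps`, `hour_sin`, and `hour_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two timestamps in the same layout as the raw CSV
stamps = pd.Series(["01/01/2022 12:35:40 AM", "01/01/2022 01:02:19 AM"])

# Explicit format: month/day/year, 12-hour clock with AM/PM marker
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %I:%M:%S %p")

hour = parsed.dt.hour

# Map the hour onto the unit circle so that 23:00 and 00:00 end up close together
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

print(parsed)
print(hour_sin.round(3).tolist(), hour_cos.round(3).tolist())
```

The same transformation could be applied to day of week (period 7) or month (period 12) if those features are used in the model.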


Rows with missing values make up only about 3.45% of the dataset, so we drop them instead of imputing. Also, per the descriptive statistics above, the minimum values of extra and total_amount are negative, which doesn't make sense: a passenger pays the taxi fare rather than being paid to ride. We therefore keep only rows with non-negative extra and positive total_amount, and only trips whose dropoff time is later than the pickup time.

In [7]:
taxi_data = taxi_data.dropna()
taxi_data = taxi_data[taxi_data['extra'] >= 0]
taxi_data = taxi_data[taxi_data['total_amount'] > 0]
taxi_data = taxi_data[taxi_data['tpep_dropoff_datetime'] > taxi_data['tpep_pickup_datetime']]
taxi_data.isna().sum() * 100 / len(taxi_data)

tpep_pickup_datetime               0.0
tpep_dropoff_datetime              0.0
passenger_count                    0.0
trip_distance                      0.0
RatecodeID                         0.0
PULocationID                       0.0
DOLocationID                       0.0
payment_type                       0.0
extra                              0.0
total_amount                       0.0
tpep_pickup_datetime_date          0.0
tpep_pickup_datetime_month         0.0
tpep_pickup_datetime_year          0.0
tpep_pickup_datetime_dayofweek     0.0
tpep_pickup_datetime_hour          0.0
tpep_pickup_datetime_minute        0.0
tpep_pickup_datetime_second        0.0
tpep_dropoff_datetime_date         0.0
tpep_dropoff_datetime_month        0.0
tpep_dropoff_datetime_year         0.0
tpep_dropoff_datetime_dayofweek    0.0
tpep_dropoff_datetime_hour         0.0
tpep_dropoff_datetime_minute       0.0
tpep_dropoff_datetime_second       0.0
dtype: float64

Checking the length of the dataframe before writing the cleaned data to a new file

In [8]:
len(taxi_data)

38009485
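
The row count drops from 39656098 to 38009485 after cleaning. To see how much each rule contributes, the masks can be counted separately before filtering. This is a sketch on a made-up toy frame; `toy`, `rules`, and `removed` are hypothetical names, and a row can match more than one rule, so the counts need not sum to the total removed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for taxi_data; the values are made up for illustration.
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 1.0],
    "extra": [0.5, 0.5, -1.0, 0.0],
    "total_amount": [13.3, 10.56, 11.8, -5.0],
})

# One boolean mask per cleaning rule; True marks a row to be removed
rules = {
    "missing values": toy.isna().any(axis=1),
    "negative extra": toy["extra"] < 0,
    "non-positive total_amount": toy["total_amount"] <= 0,
}

# Number of rows flagged by each rule
removed = {name: int(mask.sum()) for name, mask in rules.items()}

# Keep the rows that no rule flags
keep = ~pd.DataFrame(rules).any(axis=1)
cleaned = toy[keep]

print(removed)
print(len(toy), "->", len(cleaned))
```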

In [9]:
taxi_data.dtypes

tpep_pickup_datetime               datetime64[ns]
tpep_dropoff_datetime              datetime64[ns]
passenger_count                           float64
trip_distance                             float64
RatecodeID                                float64
PULocationID                                int64
DOLocationID                                int64
payment_type                                int64
extra                                     float64
total_amount                              float64
tpep_pickup_datetime_date                   int32
tpep_pickup_datetime_month                  int32
tpep_pickup_datetime_year                   int32
tpep_pickup_datetime_dayofweek              int32
tpep_pickup_datetime_hour                   int32
tpep_pickup_datetime_minute                 int32
tpep_pickup_datetime_second                 int32
tpep_dropoff_datetime_date                  int32
tpep_dropoff_datetime_month                 int32
tpep_dropoff_datetime_year                  int32


In [10]:
taxi_data['RatecodeID'].unique()

array([ 1.,  2.,  5.,  3.,  4., 99.,  6.])

In [11]:
taxi_data['PULocationID'].unique().shape

(262,)

In [12]:
taxi_data['DOLocationID'].unique().shape

(262,)

In [13]:
taxi_data['payment_type'].unique().shape

(5,)

In [14]:
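# calculating trip duration in minutes
taxi_data['trip_duration'] = (taxi_data["tpep_dropoff_datetime"] - taxi_data["tpep_pickup_datetime"]).dt.total_seconds() / 60
# counting trips with a non-positive duration or a duration above 180 minutes
# (each comparison is parenthesized because | binds more tightly than <= and >)
len(taxi_data[(taxi_data['trip_duration'] <= 0) | (taxi_data['trip_duration'] > 180)])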


In [None]:
# keeping only trips whose duration is above 0 and at most 180 minutes (3 hours)
taxi_data = taxi_data[(taxi_data['trip_duration'] > 0) & (taxi_data['trip_duration'] <= 180)]
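
When combining conditions on a pandas Series, `&` and `|` bind more tightly than comparison operators, so each comparison needs its own parentheses, and a range filter needs `&` rather than `|`. A minimal sketch on made-up durations (the name `toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Made-up trip durations in minutes
toy = pd.DataFrame({"trip_duration": [-2.0, 0.0, 8.4, 37.5, 180.0, 512.0]})

# Range filter: strictly positive and at most 180 minutes
in_range = (toy["trip_duration"] > 0) & (toy["trip_duration"] <= 180)

# Equivalent mask using Series.between with a half-open interval (0, 180]
between = toy["trip_duration"].between(0, 180, inclusive="right")

kept = toy[in_range]
n_out = int((~in_range).sum())

print(kept["trip_duration"].tolist(), n_out)
```

`Series.between` with `inclusive="right"` requires pandas 1.3 or later; the explicit mask works on older versions as well.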

In [None]:
len(taxi_data)

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("../../data/processed/nytaxi2022_preprocessed_final.csv", header=0)

In [4]:
df['total_amount'].describe()

count    3.796028e+07
mean     2.168057e+01
std      9.833151e+01
min      1.000000e-02
25%      1.230000e+01
50%      1.596000e+01
75%      2.280000e+01
max      4.010956e+05
Name: total_amount, dtype: float64