#### In this project I use a Decision Tree Regressor to predict taxi tips. The dataset is publicly available from the official NYC government website. It was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). Because of its size, the data is stored in Parquet format, and I use the PyArrow library to read it efficiently.

#### Finally, I will train the model with both the sklearn and snapml libraries and compare their performance.
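
As a rough preview of what such a comparison could look like, here is a minimal sketch on synthetic data (not the taxi data). The helper `fit_and_score`, the hyperparameters, and the random data are all illustrative assumptions. The snapml import is assumed to expose a scikit-learn-style `DecisionTreeRegressor`; it is skipped if the package is not installed.

```python
import time

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the sketch self-contained.
rng = np.random.default_rng(0)
X = rng.random((5000, 5)).astype(np.float32)
y = (2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(5000)).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


def fit_and_score(model):
    """Hypothetical helper: return (training seconds, test MSE) for a model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    mse = mean_squared_error(y_test, model.predict(X_test))
    return elapsed, mse


results = {
    "sklearn": fit_and_score(DecisionTreeRegressor(max_depth=8, random_state=35))
}

# Assumption: snapml offers a DecisionTreeRegressor with a similar constructor.
try:
    from snapml import DecisionTreeRegressor as SnapDecisionTreeRegressor

    results["snapml"] = fit_and_score(
        SnapDecisionTreeRegressor(max_depth=8, random_state=35)
    )
except ImportError:
    pass  # snapml not installed; only the sklearn result is reported

for name, (seconds, mse) in results.items():
    print(f"{name}: fit in {seconds:.4f} s, test MSE = {mse:.4f}")
```

Timing a single fit is noisy, so repeating the measurement a few times would give a fairer comparison.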

In [2]:
# install pyarrow library
# !pip install pyarrow  

import pyarrow.parquet as pq  # pyarrow's Parquet module, used to read the file
raw_data = pq.read_table('yellow_tripdata_2023-09 .parquet')  # read the Parquet file into an Arrow table
raw_data = raw_data.to_pandas()  # convert the Arrow table to a pandas DataFrame

#### Each row in the dataset represents a taxi trip. As the shape below shows, each row has 19 variables. One of them, tip_amount, is the target variable that I am going to predict; the remaining 18 are candidate features.

In [3]:
raw_data.shape # Check the size of the dataset

(2846722, 19)
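
Separating the target from the features could be sketched as follows, using a toy DataFrame with a few values copied from the sample rows. Dropping `total_amount` is my own suggestion rather than a step from the original notebook: in the sample rows it appears to include the tip (for example 14.2 + 1.0 + 0.5 + 2.0 + 1.0 + 2.5 = 21.2), so keeping it as a feature might leak the target.

```python
import pandas as pd

# Toy stand-in for raw_data (values copied from the first sample rows).
toy = pd.DataFrame({
    "trip_distance": [2.34, 1.62, 0.74],
    "fare_amount": [14.2, 8.6, 5.1],
    "tip_amount": [2.0, 2.0, 1.0],
    "total_amount": [21.2, 15.6, 11.1],
})

# Target: the column to predict.
y = toy["tip_amount"]

# Features: everything else, minus total_amount (assumed to contain the tip).
X = toy.drop(columns=["tip_amount", "total_amount"])

print(X.columns.tolist())
print(y.tolist())
```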

In [9]:
raw_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
1,2,2023-09-01 00:18:40,2023-09-01 00:30:28,2.0,2.34,1.0,N,236,233,1,14.2,1.0,0.5,2.0,0.0,1.0,21.2,2.5,0.0
2,2,2023-09-01 00:35:01,2023-09-01 00:39:04,1.0,1.62,1.0,N,162,236,1,8.6,1.0,0.5,2.0,0.0,1.0,15.6,2.5,0.0
3,2,2023-09-01 00:45:45,2023-09-01 00:47:37,1.0,0.74,1.0,N,141,229,1,5.1,1.0,0.5,1.0,0.0,1.0,11.1,2.5,0.0
4,2,2023-09-01 00:01:23,2023-09-01 00:38:05,1.0,9.85,1.0,N,138,230,1,45.0,6.0,0.5,17.02,0.0,1.0,73.77,2.5,1.75
6,1,2023-09-01 00:51:50,2023-09-01 01:10:21,0.0,10.9,1.0,N,93,255,1,41.5,1.0,0.5,3.0,0.0,1.0,47.0,0.0,0.0


### Data Cleaning

In [8]:
# A tip amount of 0 most likely means the tip was paid in cash (cash tips are not recorded), so drop those rows
raw_data = raw_data[raw_data["tip_amount"] > 0]

# Drop rows where the tip amount is larger than the fare amount
raw_data = raw_data[raw_data["tip_amount"] <= raw_data["fare_amount"]]

# Keep only rows whose fare is at least 2 and below 200, dropping very small and very large fares
raw_data = raw_data[((raw_data["fare_amount"] >= 2) & (raw_data["fare_amount"] < 200))]
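
To see how many rows each rule removes, the three filters can also be written as named boolean masks and combined in one step. This is a minimal sketch on a made-up toy DataFrame; the mask names are illustrative. Because each rule is evaluated row by row, combining the masks keeps the same rows as applying the filters one after another.

```python
import pandas as pd

# Made-up rows, each chosen to trip a different rule.
toy = pd.DataFrame({
    "fare_amount": [14.2, 6.5, 1.0, 250.0, 10.0],
    "tip_amount": [2.0, 0.0, 0.5, 20.0, 15.0],
})

has_tip = toy["tip_amount"] > 0
tip_not_above_fare = toy["tip_amount"] <= toy["fare_amount"]
fare_in_range = (toy["fare_amount"] >= 2) & (toy["fare_amount"] < 200)

# Report how many rows each rule would drop on its own.
for name, mask in [
    ("has_tip", has_tip),
    ("tip_not_above_fare", tip_not_above_fare),
    ("fare_in_range", fare_in_range),
]:
    print(f"{name}: drops {int((~mask).sum())} rows")

# Keep only rows that satisfy all three rules.
clean = toy[has_tip & tip_not_above_fare & fare_in_range]
print(clean)
```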




In [6]:
# Show the rows where the fare amount is larger than the tip amount
raw_data[raw_data['fare_amount'] > raw_data['tip_amount']]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,1,2023-09-01 00:15:37,2023-09-01 00:20:21,1.0,0.80,1.0,N,163,230,2,6.50,3.5,0.5,0.00,0.0,1.0,11.50,2.5,0.00
1,2,2023-09-01 00:18:40,2023-09-01 00:30:28,2.0,2.34,1.0,N,236,233,1,14.20,1.0,0.5,2.00,0.0,1.0,21.20,2.5,0.00
2,2,2023-09-01 00:35:01,2023-09-01 00:39:04,1.0,1.62,1.0,N,162,236,1,8.60,1.0,0.5,2.00,0.0,1.0,15.60,2.5,0.00
3,2,2023-09-01 00:45:45,2023-09-01 00:47:37,1.0,0.74,1.0,N,141,229,1,5.10,1.0,0.5,1.00,0.0,1.0,11.10,2.5,0.00
4,2,2023-09-01 00:01:23,2023-09-01 00:38:05,1.0,9.85,1.0,N,138,230,1,45.00,6.0,0.5,17.02,0.0,1.0,73.77,2.5,1.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846717,2,2023-09-30 23:31:12,2023-09-30 23:48:29,,2.43,,,125,107,0,17.69,0.0,0.5,4.34,0.0,1.0,26.03,,
2846718,1,2023-09-30 23:42:18,2023-09-30 23:47:45,,0.00,,,236,75,0,11.33,0.0,0.5,0.00,0.0,1.0,15.33,,
2846719,1,2023-09-30 23:03:35,2023-09-30 23:14:50,,1.80,,,211,90,0,12.10,1.0,0.5,2.57,0.0,1.0,19.67,,
2846720,2,2023-09-30 23:57:05,2023-10-01 00:17:36,,3.39,,,209,97,0,20.33,0.0,0.5,4.87,0.0,1.0,29.20,,
