# Case Study: Yellow Taxi

One of the icons of New York City is the yellow taxi. Since 2009, the New York City Taxi and Limousine Commission (TLC) has been publishing [monthly taxi trip data online](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). In this case study, we will dig into this large source of data and tidy it up.

## Retrieving the data

As an example, we will focus on the taxi trip data from 2022. We will grab the data directly from the web. Because a large amount of data is involved, the code in this notebook may take a while to run.

In [1]:
import pandas as pd

yellow_taxi_2022_monthly = []
for i in range(1, 13):
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-{i:02}.parquet"
    yellow_taxi_2022_monthly.append(pd.read_parquet(url))

yellow_taxi_2022 = pd.concat(yellow_taxi_2022_monthly)
yellow_taxi_2022.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39656098 entries, 0 to 3399548
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee          

In the code above, we have downloaded 12 datasets into a list and used `pd.concat` to combine them into a single dataset that contains almost 40 million rows. Let's take a closer look.
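
One detail worth noting in the `info()` output is the index range `0 to 3399548` for roughly 39.7 million rows, which suggests that row labels restart with each monthly table. The sketch below reproduces this on two tiny made-up frames (`jan` and `feb` are illustrative names, not part of the original notebook) and shows the `ignore_index=True` option of `pd.concat` as one possible way to get unique labels, should later steps need them.

```python
import pandas as pd

# Two tiny stand-ins for monthly tables (values are made up for illustration).
jan = pd.DataFrame({"trip_distance": [3.8, 2.1]})
feb = pd.DataFrame({"trip_distance": [0.97, 1.09, 4.3]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([jan, feb])
print(stacked.index.tolist())       # [0, 1, 0, 1, 2]
print(stacked.index.is_unique)      # False

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([jan, feb], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for the cleaning steps in this case study, since none of them looks rows up by label.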

In [2]:
data = yellow_taxi_2022
data

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 3.80 | 1.0 | N | 142 | 236 | 1 | 14.50 | 3.0 | 0.5 | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 | 0.0 |
| 1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 2.10 | 1.0 | N | 236 | 42 | 1 | 8.00 | 0.5 | 0.5 | 4.00 | 0.0 | 0.3 | 13.30 | 0.0 | 0.0 |
| 2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 0.97 | 1.0 | N | 166 | 166 | 1 | 7.50 | 0.5 | 0.5 | 1.76 | 0.0 | 0.3 | 10.56 | 0.0 | 0.0 |
| 3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 1.09 | 1.0 | N | 114 | 68 | 2 | 8.00 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 11.80 | 2.5 | 0.0 |
| 4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 4.30 | 1.0 | N | 68 | 163 | 1 | 23.50 | 0.5 | 0.5 | 3.00 | 0.0 | 0.3 | 30.30 | 2.5 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3399544 | 2 | 2022-12-31 23:46:00 | 2023-01-01 00:11:00 | NaN | 11.56 | NaN | None | 16 | 36 | 0 | 39.55 | 0.0 | 0.5 | 8.21 | 0.0 | 1.0 | 49.26 | NaN | NaN |
| 3399545 | 2 | 2022-12-31 23:13:24 | 2022-12-31 23:29:08 | NaN | 5.06 | NaN | None | 75 | 50 | 0 | 26.23 | 0.0 | 0.5 | 0.00 | 0.0 | 1.0 | 30.23 | NaN | NaN |
| 3399546 | 2 | 2022-12-31 23:00:49 | 2022-12-31 23:26:57 | NaN | 13.35 | NaN | None | 168 | 197 | 0 | 47.73 | 0.0 | 0.5 | 9.85 | 0.0 | 1.0 | 59.08 | NaN | NaN |
| 3399547 | 1 | 2022-12-31 23:02:50 | 2022-12-31 23:16:05 | NaN | 0.00 | NaN | None | 238 | 116 | 0 | 12.74 | 0.0 | 0.5 | 0.00 | 0.0 | 1.0 | 16.74 | NaN | NaN |


You can find out more about the meaning of each column from the [yellow trip data dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). Overall, the dataset seems to be tidy. Each row represents an observation, i.e. a single trip. Each column corresponds to some attribute of the observation. Each cell records the measurement of the attribute for the given observation.

However, we can also quickly spot a few problems in the dataset. For example, the presence of `NaN` and `None` indicates that the dataset contains missing values. Let's clean it up.
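
To see how widespread the missing values are rather than relying on the few rows displayed, one option is to count them per column. The following is a minimal sketch on a small made-up frame (`sample` and its values are illustrative, not taken from the real data); the same two expressions should work on the full dataset.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the pattern seen in the tail of the table; values are illustrative.
sample = pd.DataFrame({
    "passenger_count": [2.0, 1.0, np.nan, np.nan],
    "store_and_fwd_flag": ["N", "N", None, None],
    "trip_distance": [3.80, 2.10, 11.56, 5.06],
})

# isna() marks both NaN and None as missing; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Fraction of rows that have at least one missing value.
rows_with_missing = sample.isna().any(axis=1).mean()
print(rows_with_missing)
```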

## Selecting columns

Suppose we are mostly interested in the columns related to the number of passengers, pickup/dropoff time and location, total cost, and total distance traveled. It is best to discard the unused columns first.

In [3]:
selected_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "PULocationID", "DOLocationID", "fare_amount", "total_amount", "trip_distance"]
data = data[selected_columns]
data

| | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | PULocationID | DOLocationID | fare_amount | total_amount | trip_distance |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 142 | 236 | 14.50 | 21.95 | 3.80 |
| 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 236 | 42 | 8.00 | 13.30 | 2.10 |
| 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 166 | 166 | 7.50 | 10.56 | 0.97 |
| 3 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 114 | 68 | 8.00 | 11.80 | 1.09 |
| 4 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 68 | 163 | 23.50 | 30.30 | 4.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3399544 | 2022-12-31 23:46:00 | 2023-01-01 00:11:00 | NaN | 16 | 36 | 39.55 | 49.26 | 11.56 |
| 3399545 | 2022-12-31 23:13:24 | 2022-12-31 23:29:08 | NaN | 75 | 50 | 26.23 | 30.23 | 5.06 |
| 3399546 | 2022-12-31 23:00:49 | 2022-12-31 23:26:57 | NaN | 168 | 197 | 47.73 | 59.08 | 13.35 |
| 3399547 | 2022-12-31 23:02:50 | 2022-12-31 23:16:05 | NaN | 238 | 116 | 12.74 | 16.74 | 0.00 |
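
Indexing a DataFrame with a list of names, as done above, returns a new DataFrame whose columns follow the order of the list. The small sketch below (the `trips` frame and the misspelled name are made up for illustration) also shows that a name that does not exist raises a `KeyError`, which helps catch typos in a long column list.

```python
import pandas as pd

# A tiny illustrative frame; the values are not from the real dataset.
trips = pd.DataFrame({
    "passenger_count": [2.0, 1.0],
    "trip_distance": [3.80, 2.10],
    "total_amount": [21.95, 13.30],
})

# A list of names returns a DataFrame with the columns in the listed order.
subset = trips[["total_amount", "passenger_count"]]
print(subset.columns.tolist())

# A misspelled name raises KeyError instead of silently being skipped.
try:
    trips[["total_amount", "passenger_cnt"]]
except KeyError as err:
    print("KeyError:", err)
```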


## Remove rows with missing data

Next, let us remove the rows that contain missing values.

In [4]:
data = data.dropna()
data

| | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | PULocationID | DOLocationID | fare_amount | total_amount | trip_distance |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 142 | 236 | 14.5 | 21.95 | 3.80 |
| 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 236 | 42 | 8.0 | 13.30 | 2.10 |
| 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 166 | 166 | 7.5 | 10.56 | 0.97 |
| 3 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 114 | 68 | 8.0 | 11.80 | 1.09 |
| 4 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 68 | 163 | 23.5 | 30.30 | 4.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3273081 | 2022-12-31 23:36:15 | 2022-12-31 23:52:36 | 1.0 | 233 | 7 | 19.8 | 29.76 | 4.02 |
| 3273082 | 2022-12-31 23:09:34 | 2022-12-31 23:17:46 | 1.0 | 161 | 142 | 8.6 | 13.60 | 1.12 |
| 3273083 | 2022-12-31 23:39:06 | 2022-12-31 23:51:55 | 1.0 | 161 | 141 | 12.8 | 22.25 | 1.81 |
| 3273084 | 2022-12-31 23:09:37 | 2022-12-31 23:23:07 | 1.0 | 229 | 142 | 14.9 | 19.90 | 2.35 |
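
Called without arguments, `dropna` removes a row if any of its columns is missing, so which columns are present at that point affects how many rows survive. The sketch below illustrates this on a made-up frame; the `airport_fee` column stands in for a column that was not selected, and the values are invented. The `subset` parameter is shown as one way to limit the check to specific columns.

```python
import numpy as np
import pandas as pd

# Illustrative data only; airport_fee plays the role of an unselected column.
trips = pd.DataFrame({
    "passenger_count": [2.0, np.nan, 1.0],
    "trip_distance": [3.80, 11.56, 0.97],
    "airport_fee": [0.0, np.nan, np.nan],
})

# With no arguments, a row is dropped if ANY column is missing.
all_columns = trips.dropna()
print(len(all_columns))          # 1

# subset= restricts the check to the listed columns.
only_passengers = trips.dropna(subset=["passenger_count"])
print(len(only_passengers))      # 2
```

Because the columns were selected before `dropna` was called, rows are only discarded for gaps in the columns that are actually kept.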


## Filter out invalid rows

So far so good. Let's take a quick look at the statistics of the columns.

In [5]:
data.describe()

| | passenger_count | PULocationID | DOLocationID | fare_amount | total_amount | trip_distance |
|---|---|---|---|---|---|---|
| count | 38287800.0 | 38287800.0 | 38287800.0 | 38287800.0 | 38287800.0 | 38287800.0 |
| mean | 1.401149 | 164.9293 | 162.8563 | 14.62763 | 21.42373 | 3.514399 |
| std | 0.9628938 | 64.94462 | 70.15708 | 97.43135 | 98.01355 | 56.43246 |
| min | 0.0 | 1.0 | 1.0 | -2564.0 | -2567.8 | 0.0 |
| 25% | 1.0 | 132.0 | 113.0 | 7.0 | 12.3 | 1.1 |
| 50% | 1.0 | 162.0 | 162.0 | 10.0 | 15.95 | 1.86 |
| 75% | 1.0 | 234.0 | 234.0 | 16.0 | 22.77 | 3.49 |
| max | 9.0 | 265.0 | 265.0 | 401092.3 | 401095.6 | 184340.8 |


If you look closely at the statistics above, you will see some values that do not make sense. For example, the `min` of `passenger_count` is 0, which means some trips were recorded without a passenger! Furthermore, the `min` of both `fare_amount` and `total_amount` is negative, i.e. someone earned money by taking a taxi trip?! Likewise, the `min` of `trip_distance` is 0, a trip that went nowhere. These observations are clearly invalid and should be removed.

In [6]:
data = data[(data.passenger_count > 0) & (data.total_amount > 0) & (data.fare_amount > 0) & (data.trip_distance > 0)]
data.describe()

| | passenger_count | PULocationID | DOLocationID | fare_amount | total_amount | trip_distance |
|---|---|---|---|---|---|---|
| count | 36813320.0 | 36813320.0 | 36813320.0 | 36813320.0 | 36813320.0 | 36813320.0 |
| mean | 1.431075 | 164.9946 | 162.8103 | 14.65599 | 21.5385 | 3.569197 |
| std | 0.9538774 | 64.84743 | 70.10588 | 94.26327 | 94.83311 | 54.38823 |
| min | 1.0 | 1.0 | 1.0 | 0.01 | 0.31 | 0.01 |
| 25% | 1.0 | 132.0 | 113.0 | 7.0 | 12.3 | 1.13 |
| 50% | 1.0 | 162.0 | 162.0 | 10.0 | 15.96 | 1.9 |
| 75% | 1.0 | 234.0 | 234.0 | 16.0 | 22.75 | 3.53 |
| max | 9.0 | 265.0 | 265.0 | 401092.3 | 401095.6 | 184340.8 |
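
The filter above combines four conditions with `&`, each wrapped in parentheses because `&` binds more tightly than `>`. When several rules are applied at once, it can be useful to see how many rows each rule rejects. The following is a minimal sketch on a made-up frame; the `checks` table and its column names are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Illustrative rows: one valid trip and three that each break a rule.
trips = pd.DataFrame({
    "passenger_count": [1, 0, 2, 1],
    "fare_amount": [14.5, 8.0, -7.5, 23.5],
    "total_amount": [21.95, 13.3, -10.56, 30.3],
    "trip_distance": [3.8, 2.1, 0.97, 0.0],
})

# One boolean column per rule; True means the row passes that rule.
checks = pd.DataFrame({
    "has_passenger": trips["passenger_count"] > 0,
    "positive_fare": trips["fare_amount"] > 0,
    "positive_total": trips["total_amount"] > 0,
    "positive_distance": trips["trip_distance"] > 0,
})

# Number of rows each rule rejects on its own.
rejected = (~checks).sum()
print(rejected)

# Keep only the rows that pass every rule (same result as chaining with &).
valid = trips[checks.all(axis=1)]
print(len(valid))
```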


## Specify column types

Next, it is always good practice to ensure each column has the correct type. From the `info()` output earlier, we know that `tpep_pickup_datetime` and `tpep_dropoff_datetime` already have a datetime type. Good! However, `passenger_count` should be an integer, and `PULocationID` and `DOLocationID` should be categories. The `astype` method can be used to update the column types.

In [7]:
data = data.astype({"passenger_count": int, "PULocationID": "category", "DOLocationID": "category"})
data.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
PULocationID                   category
DOLocationID                   category
fare_amount                     float64
total_amount                    float64
trip_distance                   float64
dtype: object
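
The order of the steps matters here: a plain integer dtype cannot represent a missing value, so converting `passenger_count` before removing the missing rows would fail. The sketch below shows this on a small made-up Series (`counts` is an illustrative name). It also shows pandas' nullable `"Int64"` dtype as a possible alternative when the gaps need to be kept; the original notebook does not use it.

```python
import numpy as np
import pandas as pd

# Illustrative values, with one missing entry.
counts = pd.Series([2.0, 1.0, np.nan], name="passenger_count")

# A plain integer dtype has no way to represent NaN, so the cast raises an error.
try:
    counts.astype("int64")
except ValueError as err:
    print("cast failed:", err)

# After dropping the missing value, the cast succeeds.
clean = counts.dropna().astype("int64")
print(clean.tolist())            # [2, 1]

# The nullable "Int64" dtype (capital I) keeps the gap as <NA>.
nullable = counts.astype("Int64")
print(nullable.isna().sum())     # 1
```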

## Saving the data
Lastly, let's save the cleaned data for future use.

In [9]:
data.to_parquet("yellow_taxi_2022.parquet", index=False)

## Summary

In this case study, we demonstrated the process of cleaning up the New York yellow taxi trip data. Here are some of the pandas functions and methods we used:
* `pd.read_parquet` [doc](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html)
* `pd.concat` [doc](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
* `pd.DataFrame` [doc](https://pandas.pydata.org/docs/reference/frame.html)
  * `pd.DataFrame.info` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)
  * `pd.DataFrame.describe` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
  * `pd.DataFrame.dtypes` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html)
  * `pd.DataFrame.astype` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)
  * `pd.DataFrame.dropna` [doc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)