# Data Skills Lab

Materials:

- Download the January 2023 Yellow Taxi trip data (Parquet file): https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Download the Taxi Zone Lookup Table (CSV file) from the same page
- Read the Yellow Taxi data dictionary https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

*Assignment:*

Use pandas to read the two data files into your Python notebook, then answer the following questions and upload your results here.

Tip: the data covers three airports: JFK, LaGuardia, and Newark (EWR).

1. Answer the following questions:

- How many pickups happened at each airport?
- How many dropoffs happened at each airport?
- What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)
- What borough destination had the most tips?
- What were the top 10 pickup locations by number of passengers?

2. Create a data visualization of your choice

In [12]:
import pandas as pd
import seaborn as sns

In [23]:
# taxi_link = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

trips = pd.read_parquet("/Users/sam/Documents/projects/ds course/yellow_tripdata_2023-01.parquet")
taxi_zones = pd.read_csv("/Users/sam/Documents/projects/ds course/taxi_zone_lookup.csv")

In [24]:
trips.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
1,2,2023-01-01 00:55:08,2023-01-01 01:01:27,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0
2,2,2023-01-01 00:25:04,2023-01-01 00:37:49,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0
3,1,2023-01-01 00:03:48,2023-01-01 00:13:25,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01 00:10:29,2023-01-01 00:21:19,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0


In [35]:
taxi_zones.head()
# Airport LocationIDs: 1 = Newark (EWR), 132 = JFK, 138 = LaGuardia
airport_list = [1, 132, 138]
airport_zones = taxi_zones.query("LocationID in @airport_list")

In [72]:
# trips has 3,066,766 rows before the merge
trips_merged_pu = (
    trips.merge(
        taxi_zones,
        left_on=["PULocationID"],
        right_on=["LocationID"],
        how="inner")
)
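
The `# rows before` comment suggests the row count was compared by hand before and after the merge. A minimal sketch of letting pandas do that check, assuming `LocationID` is unique in the lookup table. The helper name `merge_zones` is hypothetical and not part of the original notebook, and it uses a left join rather than the inner join above so that unmatched trips stay visible:

```python
import pandas as pd


def merge_zones(trips: pd.DataFrame, zones: pd.DataFrame, location_col: str) -> pd.DataFrame:
    """Attach zone details to each trip via `location_col` (hypothetical helper)."""
    merged = trips.merge(
        zones,
        left_on=location_col,
        right_on="LocationID",
        how="left",              # keep every trip, even with an unmatched ID
        validate="many_to_one",  # raise MergeError if LocationID repeats in `zones`
        indicator=True,          # adds a `_merge` column: "both" or "left_only"
    )
    # A many-to-one left merge can never add or drop trip rows.
    assert len(merged) == len(trips)
    return merged
```

Calling `merge_zones(trips, taxi_zones, "PULocationID")["_merge"].value_counts()` would then show how many trips had no matching zone.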

In [40]:
trips_merged_pu.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 249628 entries, 0 to 249627
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   VendorID               249628 non-null  int64         
 1   tpep_pickup_datetime   249628 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  249628 non-null  datetime64[ns]
 3   passenger_count        249029 non-null  float64       
 4   trip_distance          249628 non-null  float64       
 5   RatecodeID             249029 non-null  float64       
 6   store_and_fwd_flag     249029 non-null  object        
 7   PULocationID           249628 non-null  int64         
 8   DOLocationID           249628 non-null  int64         
 9   payment_type           249628 non-null  int64         
 10  fare_amount            249628 non-null  float64       
 11  extra                  249628 non-null  float64       
 12  mta_tax                249628 non-null  floa

In [46]:
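# How many pickups happened at each airport?
# Restrict to the airport zones so the result holds however trips_merged_pu was built.
result_1 = (
    trips_merged_pu.query("LocationID in @airport_list")
    .groupby(["Zone"])
    .agg({"Zone": "count", "passenger_count": "sum"})
)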

In [47]:
result_1.head()

Zone,Zone (count),passenger_count (sum)
JFK Airport,160030,228407.0
LaGuardia Airport,89188,119617.0
Newark Airport,410,648.0


In [50]:
# trips has 3,066,766 rows before the merge; keep only dropoffs at an airport
trips_merged_do = (
    trips.merge(
        taxi_zones.query("LocationID in @airport_list"),
        left_on=["DOLocationID"],
        right_on=["LocationID"],
        how="inner")
)

In [51]:
trips_merged_do.shape

(72747, 23)

In [52]:
trips_merged_do.groupby(["Zone"]).agg({"Zone": "count", "passenger_count": "sum"})

Zone,Zone (count),passenger_count (sum)
JFK Airport,33190,49805.0
LaGuardia Airport,32031,42552.0
Newark Airport,7526,12156.0
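
The pickup and dropoff counts above come from two separate merges. As a rough alternative, the sketch below counts both directly from the ID columns and puts them side by side. The IDs come from `airport_list` and the labels from the `Zone` values in the outputs above; the `AIRPORTS` mapping and the helper name `airport_trip_counts` are hypothetical.

```python
import pandas as pd

# LocationIDs taken from `airport_list` in the notebook; the labels follow the
# Zone names shown in the outputs above.
AIRPORTS = {1: "Newark Airport", 132: "JFK Airport", 138: "LaGuardia Airport"}


def airport_trip_counts(trips: pd.DataFrame) -> pd.DataFrame:
    """Count pickups and dropoffs per airport (hypothetical helper)."""
    ids = list(AIRPORTS)
    pickups = trips.loc[trips["PULocationID"].isin(ids), "PULocationID"].value_counts()
    dropoffs = trips.loc[trips["DOLocationID"].isin(ids), "DOLocationID"].value_counts()
    counts = pd.DataFrame({"pickups": pickups, "dropoffs": dropoffs})
    # An airport missing on one side is NaN after alignment; treat it as zero.
    counts = counts.reindex(ids).fillna(0).astype(int)
    counts.index = counts.index.map(AIRPORTS)
    counts.index.name = "Zone"
    return counts
```

`airport_trip_counts(trips)` should reproduce the `Zone` count columns of the two tables above without building the merged frames.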


In [54]:
# - What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)

trips.query("PULocationID in @airport_list").groupby("PULocationID").agg({"airport_fee": "sum", "PULocationID": "count"})



PULocationID,airport_fee (sum),PULocationID (count)
1,2.5,410
132,187165.0,160030
138,108615.0,89188


In [55]:
trips.query("PULocationID == 1 and airport_fee > 0")

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
261195,2,2023-01-04 14:49:22,2023-01-04 14:49:42,2.0,0.0,5.0,N,1,1,1,150.0,0.0,0.0,40.69,11.75,1.0,204.69,0.0,1.25
2559949,2,2023-01-27 15:15:51,2023-01-27 15:19:06,1.0,0.0,5.0,N,1,1,2,125.0,0.0,0.0,0.0,0.0,1.0,127.25,0.0,1.25


In [56]:
trips.query("trip_distance == 0")

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
278,2,2023-01-01 00:39:02,2023-01-01 00:46:03,1.0,0.0,1.0,N,137,162,1,7.90,1.0,0.5,3.22,0.0,1.0,16.12,2.5,0.0
279,2,2023-01-01 00:47:29,2023-01-01 00:55:49,1.0,0.0,1.0,N,233,141,1,8.60,1.0,0.5,2.72,0.0,1.0,16.32,2.5,0.0
280,2,2023-01-01 00:59:24,2023-01-01 01:14:26,1.0,0.0,1.0,N,141,193,2,13.50,1.0,0.5,0.00,0.0,1.0,18.50,2.5,0.0
333,1,2023-01-01 00:57:44,2023-01-01 00:57:59,1.0,0.0,1.0,N,137,137,3,3.00,3.5,0.5,0.00,0.0,1.0,8.00,2.5,0.0
398,2,2023-01-01 00:28:04,2023-01-01 00:28:35,1.0,0.0,2.0,N,142,142,2,70.00,0.0,0.5,0.00,0.0,1.0,74.00,2.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3066753,1,2023-01-31 23:12:06,2023-01-31 23:32:16,,0.0,,,164,13,0,12.64,0.0,0.5,0.00,0.0,1.0,16.64,,
3066755,1,2023-01-31 23:28:56,2023-01-31 23:45:11,,0.0,,,144,48,0,13.08,0.0,0.5,0.00,0.0,1.0,17.08,,
3066756,1,2023-01-31 23:05:36,2023-01-31 23:20:37,,0.0,,,161,148,0,12.74,0.0,0.5,0.00,0.0,1.0,16.74,,
3066758,1,2023-01-31 23:10:56,2023-01-31 23:23:37,,0.0,,,162,151,0,12.00,1.0,0.5,9.40,0.0,1.0,28.40,,
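
The two Newark rows with an airport fee last only a few seconds and cover zero distance, and many other trips also report `trip_distance == 0`. One possible way to mark such records before aggregating is sketched below. The helper name `flag_suspect_trips` and the 60-second threshold are assumptions for illustration, not rules from the TLC data dictionary.

```python
import pandas as pd


def flag_suspect_trips(trips: pd.DataFrame, min_seconds: float = 60.0) -> pd.DataFrame:
    """Add columns marking zero-distance and very short trips (hypothetical helper).

    `min_seconds` is an arbitrary illustrative threshold, not a TLC rule.
    """
    out = trips.copy()  # leave the caller's frame untouched
    duration = out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    out["duration_s"] = duration.dt.total_seconds()
    out["zero_distance"] = out["trip_distance"] == 0
    out["too_short"] = out["duration_s"] < min_seconds
    return out
```

Filtering with `flag_suspect_trips(trips).query("zero_distance and too_short")` would isolate rows like the two Newark records for manual review.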


In [69]:
# what borough destination had the most tips?

trips_merged_do_all = (
    trips.merge(
        taxi_zones,
        left_on=["DOLocationID"],
        right_on=["LocationID"],
        how="left")
)

trips_merged_do_all.groupby("Borough").agg({"tip_amount": ["sum", "mean"] , "DOLocationID": "count", "trip_distance": "mean"})

Borough,tip_amount (sum),tip_amount (mean),DOLocationID (count),trip_distance (mean)
Bronx,61818.26,3.375649,18313,10.332796
Brooklyn,704746.4,5.92712,118902,9.061841
EWR,108362.21,14.39838,7526,17.885436
Manhattan,8382541.67,3.075169,2725880,3.189211
Queens,873584.81,5.405044,161624,8.808632
Staten Island,5859.28,6.028066,972,17.963426
Unknown,191773.31,5.716215,33549,7.840597
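
The table above is in alphabetical order, so the answer has to be read off by eye. A small sketch that ranks the boroughs by total tips instead, assuming a frame that already has `Borough` and `tip_amount` columns (such as `trips_merged_do_all`); the helper name `tips_by_borough` and its output column names are hypothetical:

```python
import pandas as pd


def tips_by_borough(trips_with_zones: pd.DataFrame) -> pd.DataFrame:
    """Summarise tips per dropoff borough, largest total first (hypothetical helper)."""
    summary = (
        trips_with_zones
        .groupby("Borough")["tip_amount"]
        .agg(total_tips="sum", mean_tip="mean", trips="count")
        .sort_values("total_tips", ascending=False)
    )
    return summary
```

With the figures shown above, the first row would be Manhattan by total tips, while EWR has the highest mean tip per trip, so the answer depends on which reading of "most tips" is intended.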


In [82]:
# - What were the top 10 pickup locations by number of passengers?
# Select the column before summing so only passenger_count is aggregated.
trips_merged_pu.groupby("Zone")["passenger_count"].sum().sort_values(ascending=False).head(10)


Zone
JFK Airport                     228407.0
Upper East Side South           192476.0
Midtown Center                  181236.0
Upper East Side North           180238.0
Penn Station/Madison Sq West    143349.0
Times Sq/Theatre District       142150.0
Midtown East                    137405.0
Lincoln Square East             134096.0
LaGuardia Airport               119617.0
Upper West Side South           115799.0
Name: passenger_count, dtype: float64
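
Part 2 of the assignment asks for a visualization, and `seaborn` is imported at the top of the notebook. One option, sketched here under the assumption that a bar chart of the top pickup zones is an acceptable choice, is shown below. The helper names `top_pickup_zones` and `plot_top_pickup_zones` are hypothetical, as are the styling choices.

```python
import pandas as pd


def top_pickup_zones(trips_with_zones: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the `n` pickup zones with the most passengers (hypothetical helper)."""
    return (
        trips_with_zones
        .groupby("Zone")["passenger_count"]
        .sum()
        .nlargest(n)
        .reset_index()
    )


def plot_top_pickup_zones(top_zones: pd.DataFrame):
    """Draw a horizontal bar chart of the output of `top_pickup_zones`."""
    # Imported here so the aggregation helper runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 5))
    sns.barplot(data=top_zones, x="passenger_count", y="Zone", color="steelblue", ax=ax)
    ax.set_xlabel("Passengers (January 2023)")
    ax.set_ylabel("Pickup zone")
    ax.set_title("Top pickup zones by passenger count")
    fig.tight_layout()
    return ax
```

In the notebook this could be called as `plot_top_pickup_zones(top_pickup_zones(trips_merged_pu))`, provided `trips_merged_pu` was built from the full lookup table.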