# Data Skills Lab

Materials:

- Download the January 2023 Yellow Taxi Data Parquet file from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Download the Taxi Zone Lookup table CSV file from the same page
- Read the Yellow Taxi data dictionary https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

*Assignment:*

Use pandas to read the two data files into your Python notebook, then answer the questions below and upload your results here.

Tip: there are three airports in the data: JFK, LaGuardia, and Newark (EWR).

1. Answer the following questions:

- How many pickups happened at each airport?
- How many dropoffs happened at each airport?
- What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)
- What borough destination had the most tips?
- What were the top 10 pickup locations by number of passengers?

2. Create a data visualization of your choice

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
taxi_link = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
)
zone_link = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"

trips = pd.read_parquet(taxi_link, engine="pyarrow")
taxi_zones = pd.read_csv(zone_link)

In [None]:
trips.head()

In [None]:
# Derive calendar features with the vectorized .dt accessor instead of row-wise apply
trips["pickup_day"] = trips["tpep_pickup_datetime"].dt.day
trips["pickup_dow"] = trips["tpep_pickup_datetime"].dt.day_name()
trips["pickup_dow_num"] = trips["tpep_pickup_datetime"].dt.day_of_week

In [None]:
taxi_zones.head()
# LocationIDs of the three airports: 1 = Newark (EWR), 132 = JFK, 138 = LaGuardia
airport_list = [1, 132, 138]
airport_zones = taxi_zones.query("LocationID in @airport_list")
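
The airport IDs above are hard-coded. As a possible alternative, they could be looked up by name in the zone table. The sketch below runs on a small made-up excerpt (`demo_zones` and `demo_airport_list` are hypothetical names) and assumes the airport zone names all contain the word "Airport", which is worth verifying against the real CSV.

```python
import pandas as pd

# Made-up excerpt of the zone lookup; zone names are assumed to match the real file.
demo_zones = pd.DataFrame(
    {
        "LocationID": [1, 4, 132, 138],
        "Borough": ["EWR", "Manhattan", "Queens", "Queens"],
        "Zone": ["Newark Airport", "Alphabet City", "JFK Airport", "LaGuardia Airport"],
    }
)

# Keep rows whose zone name mentions "Airport"; na=False treats missing names as no match.
is_airport = demo_zones["Zone"].str.contains("Airport", na=False)
demo_airport_list = demo_zones.loc[is_airport, "LocationID"].tolist()
print(demo_airport_list)
```

Deriving the list this way keeps the IDs and the zone names in sync if the lookup table ever changes.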

In [None]:
# Row count before the merge: 3,066,766 (an inner merge drops trips with no matching zone)
trips_merged_pu = trips.merge(
    taxi_zones, left_on=["PULocationID"], right_on=["LocationID"], how="inner"
)
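
Because an inner merge silently drops trips whose `PULocationID` has no match in the lookup table, it can help to count those rows explicitly. The sketch below is one way to do that with a left merge and `indicator=True`; it uses small made-up frames (`demo_trips`, `demo_zones`, and `checked` are hypothetical names), not the real data.

```python
import pandas as pd

# Made-up stand-ins for the trips and zone tables (illustrative values only).
demo_trips = pd.DataFrame(
    {"PULocationID": [132, 138, 999, 132], "passenger_count": [1.0, 2.0, 1.0, 3.0]}
)
demo_zones = pd.DataFrame(
    {"LocationID": [132, 138], "Zone": ["JFK Airport", "LaGuardia Airport"]}
)

# A left merge keeps every trip; the _merge column records whether a zone matched.
checked = demo_trips.merge(
    demo_zones,
    left_on="PULocationID",
    right_on="LocationID",
    how="left",
    indicator=True,
    validate="many_to_one",  # raises if a LocationID is duplicated in the lookup
)

unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{len(demo_trips)} trips, {unmatched} without a matching zone")
```

If `unmatched` is zero on the real data, the inner and left merges return the same rows.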

In [None]:
trips_merged_pu.head()

In [None]:
trips_merged_pu.info()

In [None]:
# 1 - How many pickups happened at each airport?
result_1 = (
    trips_merged_pu.query("PULocationID in @airport_list")
    .groupby(["Zone"])
    .agg({"Zone": "count", "passenger_count": "sum"})
)
result_1.columns = ["pickup_count", "passenger_count"]
result_1.reset_index(inplace=True)

In [None]:
result_1

In [None]:
sns.barplot(result_1, x="Zone", y="pickup_count")

In [None]:
# 2 - How many dropoffs happened at each airport?
trips_merged_do = trips.merge(
    taxi_zones.query("LocationID in @airport_list"),
    left_on=["DOLocationID"],
    right_on=["LocationID"],
    how="inner",
)

In [None]:
trips_merged_do.shape

In [None]:
result_2 = trips_merged_do.groupby(["Zone"]).agg(
    {"Zone": "count", "passenger_count": "sum"}
)
result_2.columns = ["dropoff_count", "passenger_count"]
result_2.reset_index(inplace=True)

In [None]:
result_2

In [None]:
sns.barplot(result_2, x="Zone", y="dropoff_count")

In [None]:
# 3 - What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)
result_3 = (
    trips.query("PULocationID in @airport_list")
    .groupby("PULocationID")
    .agg({"airport_fee": "sum", "PULocationID": "count"})
)
result_3.columns = ["airport_fee_sum", "pickup_count"]
result_3.reset_index(inplace=True)

In [None]:
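# Drop the EWR row (LocationID 1) by value rather than by position;
# the question covers only JFK and LaGuardia, and this cell can be re-run safely
result_3 = result_3.query("PULocationID != 1")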

In [None]:
result_3 = result_3.merge(
    taxi_zones, left_on="PULocationID", right_on="LocationID", how="inner"
)
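
The filter, aggregation, and labeling steps for question 3 could also be combined into one chain. The following is a minimal sketch on made-up rows, assuming the fee applies only to JFK (132) and LaGuardia (138) pickups as the question states; `demo_trips`, `fee_airports`, and `fees` are hypothetical names.

```python
import pandas as pd

# Made-up trips; the None mimics a missing airport_fee value.
demo_trips = pd.DataFrame(
    {
        "PULocationID": [1, 132, 132, 138, 138, 4],
        "airport_fee": [0.0, 1.25, 1.25, 1.25, None, 0.0],
    }
)

# Only the two airports named in the question.
fee_airports = {132: "JFK Airport", 138: "LaGuardia Airport"}

fees = (
    demo_trips[demo_trips["PULocationID"].isin(list(fee_airports))]
    .assign(Zone=lambda df: df["PULocationID"].map(fee_airports))
    .groupby("Zone", as_index=False)
    # Named aggregation sets the output column names directly; sum skips missing fees.
    .agg(airport_fee_sum=("airport_fee", "sum"), pickup_count=("PULocationID", "count"))
)
print(fees)
```

Named aggregation avoids the separate column-renaming step and makes the output columns explicit.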

In [None]:
trips.query("PULocationID == 1 and airport_fee > 0")

In [None]:
sns.barplot(result_3, x="Zone", y="airport_fee_sum")

In [None]:
# 4 - What borough destination had the most tips?

trips_merged_do_all = trips.merge(
    taxi_zones, left_on=["DOLocationID"], right_on=["LocationID"], how="left"
)

borough_metrics = (
    trips_merged_do_all.groupby("Borough")
    .agg(
        {
            "tip_amount": "sum",
            "DOLocationID": "count",
            "trip_distance": "mean",
        }
    )
    .reset_index()
)

In [None]:
borough_metrics.head()

In [None]:
borough_metrics[["Borough", "tip_amount"]]
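
To read the answer to question 4 off the table programmatically instead of by eye, `idxmax` can pick the row with the largest tip total. The sketch below uses a made-up frame shaped like `borough_metrics` (`demo_metrics` and `top_row` are hypothetical names, and the values are illustrative, not results).

```python
import pandas as pd

# Made-up per-borough totals; not taken from the real data.
demo_metrics = pd.DataFrame(
    {
        "Borough": ["Bronx", "Manhattan", "Queens"],
        "tip_amount": [120.5, 9800.0, 2100.25],
    }
)

# idxmax returns the index label of the largest tip total; loc fetches that row.
top_row = demo_metrics.loc[demo_metrics["tip_amount"].idxmax()]
print(top_row["Borough"], top_row["tip_amount"])
```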

In [None]:
sns.barplot(borough_metrics, x="Borough", y="tip_amount")

In [None]:
trips_merged_pu.head()

In [None]:
# Note: trips_merged_pu carries the pickup borough, not the dropoff borough used for question 4
sns.boxplot(
    trips_merged_pu.query("tip_amount < 30 and tip_amount >= 0"),
    x="Borough",
    y="tip_amount",
)

In [None]:
# Distribution of non-zero tips under $30 for trips picked up in the Bronx
sns.histplot(
    trips_merged_pu.query("tip_amount < 30 and tip_amount > 0 and Borough == 'Bronx'"),
    x="tip_amount",
    binwidth=0.5,
)

In [None]:
# 5 - What were the top 10 pickup locations by number of passengers?
result_5 = (
    trips_merged_pu.groupby("Zone")["passenger_count"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
)
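
A top-N ranking like this can also be written with `Series.nlargest`, which combines the sort and the slice. The sketch below shows the idea on a made-up frame (`demo` and `top` are hypothetical names) with N set to 2 for brevity.

```python
import pandas as pd

# Made-up pickups; the None mimics a missing passenger_count, which sum() skips.
demo = pd.DataFrame(
    {
        "Zone": ["A", "B", "A", "C", "B", "A"],
        "passenger_count": [1.0, 2.0, None, 4.0, 3.0, 2.0],
    }
)

# Total passengers per zone, then keep the two largest totals in descending order.
top = demo.groupby("Zone")["passenger_count"].sum().nlargest(2).reset_index()
print(top)
```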

In [None]:
sns.barplot(result_5, x="Zone", y="passenger_count")
plt.xticks(rotation=75)