# Data Skills Lab

Links:

- We will be using NYC taxi data. The code will automatically download the files, but you can find the files and other links here: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Read the Yellow Taxi data dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

*Assignment:*

Use pandas to read the two data files into your Python notebook. Complete the following tasks and upload your results here:

1. Answer the following questions:

- How many pickups happened at each NYC airport?
- How many dropoffs happened at each NYC airport?
- What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)
- What borough destination had the most tips?
- What were the top 10 pickup locations by number of passengers?

2. Create a data visualization of your choice

In [57]:
# import libraries (if running locally, install these with pip; pd.read_parquet below also needs pyarrow)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [58]:
# links to data (pandas can load files from links as well as file paths)
# January 2024 data
taxi_link = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
)
zone_link = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"

# read files using appropriate pd.read_* function for each format
trips = pd.read_parquet(taxi_link, engine="pyarrow")
taxi_zones = pd.read_csv(zone_link)

In [None]:
# use .head() to display the first n rows of the dataframe
trips.head()

In [None]:
taxi_zones.head()

In [61]:
# extract date parts from the pickup timestamp for grouping later
# (the vectorized .dt accessor gives the same values as a row-by-row .apply, but faster)
trips["pickup_day"] = trips["tpep_pickup_datetime"].dt.day
trips["pickup_dow"] = trips["tpep_pickup_datetime"].dt.day_name()
trips["pickup_dow_num"] = trips["tpep_pickup_datetime"].dt.day_of_week

In [None]:
trips[["tpep_pickup_datetime", "pickup_day", "pickup_dow", "pickup_dow_num"]].head()
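
The date parts above are meant for grouping. As a minimal sketch of what that can look like, the cell below counts trips per day of the week on a tiny made-up table. `toy_trips` and its timestamps are hypothetical stand-ins, not real TLC records; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the trips table (hypothetical values, not real TLC data)
toy_trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            [
                "2024-01-01 08:15",  # a Monday
                "2024-01-01 17:40",  # a Monday
                "2024-01-02 09:05",  # a Tuesday
                "2024-01-06 23:30",  # a Saturday
            ]
        )
    }
)

# Day name for display, day number (Monday=0) for sorting
toy_trips["pickup_dow"] = toy_trips["tpep_pickup_datetime"].dt.day_name()
toy_trips["pickup_dow_num"] = toy_trips["tpep_pickup_datetime"].dt.dayofweek

# Group on both columns so rows come out in weekday order, not alphabetical order
trips_by_dow = (
    toy_trips.groupby(["pickup_dow_num", "pickup_dow"])
    .size()
    .reset_index(name="trip_count")
)
print(trips_by_dow)
```

Grouping on the numeric column first is one way to keep the weekdays in calendar order; an ordered categorical would be another.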

In [63]:
# I looked up the airport zone IDs (LocationID in the zone lookup) so you don't have to
airport_list = [132, 138]  # JFK, LaGuardia

# use df.query() to use a SQL-like expression on your dataframe (@ is used to refer to a variable outside the df)
airport_zones = taxi_zones.query("LocationID in @airport_list")

In [None]:
airport_zones
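
If the `@` syntax in `.query()` is new to you, this small sketch shows that it selects the same rows as a boolean mask built with `.isin()`. The `toy_zones` table and `toy_airports` list are hypothetical and only mimic the shape of the zone lookup.

```python
import pandas as pd

# Toy zone table (hypothetical rows shaped like the zone lookup file)
toy_zones = pd.DataFrame(
    {
        "LocationID": [1, 132, 138, 161],
        "Zone": ["Zone A", "Airport One", "Airport Two", "Zone B"],
    }
)
toy_airports = [132, 138]

# query(): "@" refers to the Python variable toy_airports
via_query = toy_zones.query("LocationID in @toy_airports")

# Equivalent boolean mask with .isin()
via_isin = toy_zones[toy_zones["LocationID"].isin(toy_airports)]

# Both approaches keep the same rows
print(via_query.equals(via_isin))
```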

In [65]:
# merge taxi zones and trip data to get the name of each pickup zone
trips_merged_pu = trips.merge(
    taxi_zones, left_on=["PULocationID"], right_on=["LocationID"], how="inner"
)

In [None]:
trips_merged_pu.head()

In [None]:
trips_merged_pu.info()
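
An inner merge silently drops any trip whose location ID has no match in the zone table. One way to check for that, sketched here on made-up data, is a left merge with the `indicator` and `validate` parameters of `DataFrame.merge`. The tables and the ID 999 are hypothetical; whether the real data has unmatched IDs is something to check, not something this sketch asserts.

```python
import pandas as pd

# Toy tables (hypothetical values); 999 has no row in toy_zones
toy_trips = pd.DataFrame(
    {"PULocationID": [132, 138, 132, 999], "fare_amount": [52.0, 30.5, 70.0, 12.0]}
)
toy_zones = pd.DataFrame(
    {"LocationID": [132, 138], "Zone": ["Airport One", "Airport Two"]}
)

checked = toy_trips.merge(
    toy_zones,
    left_on="PULocationID",
    right_on="LocationID",
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises MergeError if a LocationID repeats in toy_zones
)

# Trips that an inner merge would have dropped
unmatched = (checked["_merge"] == "left_only").sum()
print(unmatched)
```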

### 1 - How many pickups happened at each airport?

In [68]:
result_1 = (
    # filter to just airport locations
    trips_merged_pu.query("PULocationID in @airport_list")
    # group by location (Zone)
    .groupby(["Zone"])
    # use .agg to pass a dict of {column: function} pairs for aggregation
    .agg({"Zone": "count", "passenger_count": "sum"})
)

result_1.columns = ["pickup_count", "passenger_count"]
result_1.reset_index(inplace=True)

In [None]:
result_1
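
The dict form of `.agg` above needs a follow-up step to rename the columns. As an alternative sketch (not the approach this notebook uses), pandas named aggregation sets the output names in the same call. The `toy` table is hypothetical.

```python
import pandas as pd

# Toy merged table (hypothetical values)
toy = pd.DataFrame(
    {
        "Zone": ["Airport One", "Airport One", "Airport Two"],
        "passenger_count": [1.0, 2.0, 3.0],
    }
)

summary = toy.groupby("Zone", as_index=False).agg(
    # output_name=(input_column, function)
    pickup_count=("passenger_count", "size"),  # rows per group, nulls included
    passenger_count=("passenger_count", "sum"),
)
print(summary)
```

`"size"` counts every row in the group, while `"count"` would skip rows where the chosen column is null, so the two can differ if `passenger_count` has missing values.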

In [None]:
sns.barplot(result_1, x="Zone", y="pickup_count")

### 2 - How many dropoffs happened at each NYC airport?

In [71]:
# we are going to do the opposite merge on dropoff ID (DOLocationID)
trips_merged_do = trips.merge(
    taxi_zones.query("LocationID in @airport_list"),
    left_on=["DOLocationID"],
    right_on=["LocationID"],
    how="inner",
)

In [None]:
trips_merged_do.shape

In [73]:
result_2 = trips_merged_do.groupby(["Zone"]).agg(
    {"Zone": "count", "passenger_count": "sum"}
)
result_2.columns = ["dropoff_count", "passenger_count"]
result_2.reset_index(inplace=True)

In [None]:
result_2

In [None]:
sns.barplot(result_2, x="Zone", y="dropoff_count")

In [None]:
trips.columns

### 3 - What is the total amount of airport fees collected at each NYC airport? (JFK and LaGuardia)

Tip: the airport fee is collected by the taxi meter when a trip is picked up at an airport, so filter and group on the pickup location.
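
To see why the pickup side matters, here is a small sketch on made-up trips where the fee is charged only on airport pickups. The fee values and table are hypothetical and chosen only to illustrate the grouping choice.

```python
import pandas as pd

# Toy trips (hypothetical values): fee charged on airport pickups only
toy_trips = pd.DataFrame(
    {
        "PULocationID": [132, 138, 161, 161],
        "DOLocationID": [161, 161, 132, 138],
        "Airport_fee": [1.75, 1.75, 0.0, 0.0],
    }
)
airports = [132, 138]

# Fees for trips that start at an airport
fee_by_pickup = (
    toy_trips[toy_trips["PULocationID"].isin(airports)]
    .groupby("PULocationID")["Airport_fee"]
    .sum()
)

# Fees for trips that end at an airport (zero in this toy table)
fee_by_dropoff = (
    toy_trips[toy_trips["DOLocationID"].isin(airports)]
    .groupby("DOLocationID")["Airport_fee"]
    .sum()
)
print(fee_by_pickup)
print(fee_by_dropoff)
```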

In [77]:
result_3 = (
    trips.query("PULocationID in @airport_list")
    .groupby("PULocationID")
    .agg({"Airport_fee": "sum", "PULocationID": "count"})
)

result_3.columns = ["airport_fee_sum", "pickup_count"]

In [78]:
result_3 = result_3.merge(
    taxi_zones, left_on="PULocationID", right_on="LocationID", how="inner"
)

In [None]:
result_3

In [None]:
sns.barplot(result_3, x="Zone", y="airport_fee_sum")

### 4 - What borough destination had the most tips?

In [81]:
trips_merged_do_all = trips.merge(
    taxi_zones, left_on=["DOLocationID"], right_on=["LocationID"], how="left"
)

borough_metrics = (
    trips_merged_do_all.groupby("Borough")
    .agg(
        {
            "tip_amount": "sum",
            "DOLocationID": "count",
            "trip_distance": "mean",
        }
    )
    .reset_index()
)

In [None]:
borough_metrics.head()

In [None]:
borough_metrics[["Borough", "tip_amount"]]
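
"Most tips" can mean the largest total or the largest average, and the two can point at different boroughs. The sketch below, on hypothetical data, computes both and uses `idxmax()` to pick the winner. It also passes `dropna=False`, because a left merge can leave `Borough` empty for unmatched IDs and `groupby` drops those rows by default.

```python
import pandas as pd

# Toy merged table (hypothetical values); None mimics an unmatched dropoff ID
toy = pd.DataFrame(
    {
        "Borough": ["Manhattan", "Manhattan", "Queens", None],
        "tip_amount": [3.0, 2.0, 4.0, 1.0],
    }
)

# dropna=False keeps the missing-borough group visible in the result
tips = toy.groupby("Borough", dropna=False)["tip_amount"].agg(["sum", "mean"])

top_by_total = tips["sum"].idxmax()
top_by_average = tips["mean"].idxmax()
print(tips)
print(top_by_total, top_by_average)
```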

In [None]:
sns.barplot(borough_metrics, x="Borough", y="tip_amount")

In [None]:
trips_merged_pu.head()

### 5 - What were the top 10 pickup locations by number of passengers?

In [88]:
result_5 = (
    trips_merged_pu.groupby("Zone")["passenger_count"]
    .sum()
    .sort_values(ascending=False)
    # keep the 10 zones with the most passengers
    .head(10)
    .reset_index()
)
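
As a variation on the top-10 step, `Series.nlargest` combines the sort and the cut in one call. A minimal sketch on a hypothetical table, taking the top 2 rather than the top 10:

```python
import pandas as pd

# Toy merged table (hypothetical zones and counts)
toy = pd.DataFrame(
    {
        "Zone": ["A", "B", "A", "C", "B", "A"],
        "passenger_count": [1, 2, 3, 1, 1, 2],
    }
)

top2 = (
    toy.groupby("Zone")["passenger_count"]
    .sum()
    .nlargest(2)  # largest totals first
    .reset_index()
)
print(top2)
```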

In [None]:
ax = sns.barplot(result_5, x="Zone", y="passenger_count")
# rotate the x-axis tick labels so the long zone names don't overlap
plt.xticks(rotation=80)
plt.show()