# Visual Analytics

## Assignment 3

**Instructor:** Dr. Marco D'Ambros  
**TAs:** Carmen Armenti, Mattia Giannaccari

**Contacts:** marco.dambros@usi.ch, carmen.armenti@usi.ch, mattia.giannaccari@usi.ch

**Due Date:** May 16, 2025 @ 23:55

---
The goal of this assignment is to use **Spark (PySpark)** and **Polars** in Jupyter notebooks.  
The files `trip_data.csv`, `trip_fare.csv`, and `nyc_boroughs.geojson` are available in the provided folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=2I7SMC).

You may clean the data as needed; however, please note that specific data cleaning steps will be required in **Exercise 5**. If you choose to clean the data before Exercise 5, make sure to retain the **original dataset** for use with the Polars exercises.

- Use **Spark** to solve **Exercises 1–4**
- Use **Polars** to solve **Exercises 5–8**

You are encouraged to use [Spark window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever appropriate.

Please name your notebook file `SurnameName_Assignment3.ipynb`.

## Spark

### Exercise 1
Join the `trip_data` and `trip_fare` dataframes into one and keep only the data for 2013-01-01. Report the number of rows obtained after joining the two datasets.

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DateType
import geopandas as gpd

In [6]:
session = SparkSession.builder.getOrCreate()
session.conf.set("spark.sql.shuffle.partitions", 400)
session.conf.set('spark.sql.repl.eagerEval.enabled', True)

In [7]:
trip_data_df = session.read.option("inferSchema", "true").option("header", "true").csv("./datasets/trip_data.csv")
trip_fare_df = session.read.option("inferSchema", "true").option("header", "true").csv("./datasets/trip_fare.csv")


                                                                                

In [8]:
trip_data_df = trip_data_df.select([col(c).alias(c.strip()) for c in trip_data_df.columns])
trip_fare_df = trip_fare_df.select([col(c).alias(c.strip()) for c in trip_fare_df.columns])

In [9]:
trip_data_df = trip_data_df.select(
    "*",
    col("pickup_datetime").cast(DateType()).alias("pickup_date")
)

In [10]:
joined_df = trip_data_df.join(trip_fare_df, on=["medallion", "hack_license", "pickup_datetime"])

In [11]:
filtered_df = joined_df.filter(joined_df.pickup_date == "2013-01-01")

In [12]:
print("Number of records for 2013-01-01: ", filtered_df.count())



Number of records for 2013-01-01:  412630


                                                                                

In [13]:
#print schema
filtered_df.printSchema()

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- rate_code: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_time_in_secs: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- pickup_date: date (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- t

### Exercise 2
Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the boroughs. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#categorical-heatmaps

In [14]:
nyc_df = gpd.read_file('./datasets/nyc-boroughs.geojson')

In [15]:
filtered_df_pandas = filtered_df.toPandas()
pickup_gdf = gpd.GeoDataFrame(
    filtered_df_pandas,
    geometry=gpd.points_from_xy(filtered_df_pandas['pickup_longitude'], filtered_df_pandas['pickup_latitude']),
    crs=nyc_df.crs
)

dropoff_gdf = gpd.GeoDataFrame(
    filtered_df_pandas,
    geometry=gpd.points_from_xy(filtered_df_pandas['dropoff_longitude'], filtered_df_pandas['dropoff_latitude']),
    crs=nyc_df.crs
)

                                                                                

In [16]:
import geopandas as gpd

In [17]:
# Spatial join to get borough names
pickup_boroughs = gpd.sjoin(pickup_gdf, nyc_df, how="left", predicate="within")
dropoff_boroughs = gpd.sjoin(dropoff_gdf, nyc_df, how="left", predicate="within")

In [18]:
# Add to original DataFrame
filtered_df_pandas["pickup_borough"] = pickup_boroughs["borough"].values
filtered_df_pandas["dropoff_borough"] = dropoff_boroughs["borough"].values

In [19]:
filtered_df_pandas.head()

Unnamed: 0,medallion,hack_license,pickup_datetime,vendor_id,rate_code,store_and_fwd_flag,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,...,vendor_id.1,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount,pickup_borough,dropoff_borough
0,00005007A9F30E289E760362F69E4EAD,43468C5D35F828693D96CB7CC9FDF341,2013-01-01 06:48:43,CMT,1,N,2013-01-01 06:50:38,1,114,0.4,...,CMT,DIS,3.5,0.0,0.5,0.0,0.0,4.0,Manhattan,Manhattan
1,00005007A9F30E289E760362F69E4EAD,43468C5D35F828693D96CB7CC9FDF341,2013-01-01 10:04:50,CMT,1,N,2013-01-01 10:12:33,1,463,1.5,...,CMT,CSH,8.0,0.0,0.5,0.0,0.0,8.5,Manhattan,Manhattan
2,00005007A9F30E289E760362F69E4EAD,43468C5D35F828693D96CB7CC9FDF341,2013-01-01 12:31:47,CMT,1,N,2013-01-01 12:39:41,1,473,2.3,...,CMT,CSH,9.0,0.0,0.5,0.0,0.0,9.5,Manhattan,Manhattan
3,00005007A9F30E289E760362F69E4EAD,43468C5D35F828693D96CB7CC9FDF341,2013-01-01 15:13:35,CMT,1,N,2013-01-01 15:31:07,1,1052,3.4,...,CMT,CSH,14.5,0.0,0.5,0.0,0.0,15.0,Manhattan,Manhattan
4,00005007A9F30E289E760362F69E4EAD,43468C5D35F828693D96CB7CC9FDF341,2013-01-01 16:14:19,CMT,1,N,2013-01-01 16:18:46,1,266,0.8,...,CMT,CSH,5.0,0.0,0.5,0.0,0.0,5.5,Manhattan,Manhattan


In [20]:
# Back to Spark for aggregation
spark_df = session.createDataFrame(filtered_df_pandas)

trip_group_df = spark_df \
    .groupBy(['pickup_borough', 'dropoff_borough']) \
    .avg('fare_amount') \
    .withColumnRenamed('avg(fare_amount)', 'avg_fare')

unique_borough = nyc_df['borough'].unique()

In [37]:
from bokeh.models import BasicTicker, PrintfTickFormatter
from bokeh.plotting import figure, show
from pyspark.sql.functions import max as spark_max, min as spark_min
from bokeh.transform import linear_cmap

min_val = trip_group_df.agg(spark_min('avg_fare')).collect()[0][0]
max_val = trip_group_df.agg(spark_max('avg_fare')).collect()[0][0]

colors = ["#03045e", "#023e8a", "#0077b6", "#0096c7", "#00b4d8", "#48cae4", "#90e0ef", "#ade8f4", "#caf0f8"]

TOOLS = "hover"
# Tooltip fields must reference columns of the plotted data source
TOOLTIPS = [
    ('Pickup Borough', '@pickup_borough'),
    ('Dropoff Borough', '@dropoff_borough'),
    ('Average Fare Amount', '@avg_fare{0.2f}')
]

p = figure(title="Average Fare Amount for Pickup and Dropoff Boroughs",
           x_range=unique_borough, y_range=unique_borough,
           x_axis_location="above", width=900, height=400,
           tools=TOOLS, toolbar_location='below', tooltips=TOOLTIPS)

p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "7px"
p.axis.major_label_standoff = 0

r = p.rect(x="pickup_borough", y="dropoff_borough", width=1, height=1, source=trip_group_df.toPandas(),
           fill_color=linear_cmap("avg_fare", colors[::-1], low=min_val, high=max_val),
           line_color=None)

p.add_layout(r.construct_color_bar(
    major_label_text_font_size="7px",
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format="$%.2f"),  # fares are amounts, not percentages
    label_standoff=6,
    border_line_color=None,
    padding=5,
), 'right')

show(p)



### Exercise 3
Consider only the Manhattan, Bronx, and Brooklyn boroughs. Then create a dataframe that shows the total number of trips *within* the same borough and *across* the other boroughs in this set (Manhattan, Bronx, and Brooklyn), counting only trips with at least 3 passengers.

For example, for Manhattan borough you should consider the total number of the following trips:
- Manhattan → Manhattan
- Manhattan → Bronx
- Manhattan → Brooklyn

You should then do the same for Bronx and Brooklyn boroughs.

In [22]:
from pyspark.sql.functions import col, when
boroughs = ["Manhattan", "Bronx", "Brooklyn"]

filtered_df = spark_df.filter(
    (col("pickup_borough").isin(boroughs)) &
    (col("dropoff_borough").isin(boroughs)) &
    (col("passenger_count") >= 3)
)

In [23]:
# Define trip type: 'within' or 'across'
labeled_df = filtered_df.withColumn(
    "trip_type",
    when(col("pickup_borough") == col("dropoff_borough"), "within")
    .otherwise("across")
)

# Group by trip_type and count
result_df = labeled_df.groupBy("trip_type").count()
result_df



trip_type,count
within,64683
across,4706
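
The exercise statement asks for totals per borough, while the cell above reports one total per trip type. The following is a minimal pure-Python sketch of a per-pickup-borough breakdown. The sample rows are hypothetical, and the sketch only illustrates the grouping logic; in Spark the same idea would presumably be a `groupBy` on `pickup_borough` and `trip_type`.

```python
from collections import Counter

# Hypothetical (pickup_borough, dropoff_borough, passenger_count) rows.
trips = [
    ("Manhattan", "Manhattan", 3),
    ("Manhattan", "Bronx", 4),
    ("Bronx", "Bronx", 5),
    ("Brooklyn", "Manhattan", 3),
    ("Manhattan", "Queens", 6),  # dropped: Queens is not one of the three boroughs
    ("Bronx", "Brooklyn", 2),    # dropped: fewer than 3 passengers
]
boroughs = {"Manhattan", "Bronx", "Brooklyn"}

# Count trips per (pickup borough, trip type).
counts = Counter()
for pickup, dropoff, passengers in trips:
    if pickup in boroughs and dropoff in boroughs and passengers >= 3:
        trip_type = "within" if pickup == dropoff else "across"
        counts[(pickup, trip_type)] += 1

for (borough, trip_type), n in sorted(counts.items()):
    print(borough, trip_type, n)
```

Keying the counter on the pickup borough follows the Manhattan example in the statement, where every listed trip starts in Manhattan.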


### Exercise 4
Create a dataframe where each row represents a driver, and there is one column per borough.
For each driver-borough, the dataframe provides the maximum number of consecutive trips
for the given driver, within the given borough. Please consider only trips that were paid by card.

For example, if for driver A we have (sorted by time):
- Trip 1: Bronx → Bronx
- Trip 2: Bronx → Bronx
- Trip 3: Bronx → Manhattan
- Trip 4: Manhattan → Bronx.
    
The maximum number of consecutive trips for Bronx is 2.

In [24]:
card_trips = spark_df.filter(
    (col("payment_type") == "CRD") &
    (col("pickup_borough") == col("dropoff_borough"))
).orderBy("hack_license", "pickup_datetime")
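
The cell above stops after filtering. Note that dropping cross-borough trips before looking for streaks also removes the trips that interrupt a streak (Trip 3 in the example). The sketch below, written in plain Python on hypothetical rows, keeps every card trip, labels each one with its borough only when pickup and dropoff match, and then measures the longest run per label. It is an illustration of the logic under those assumptions, not the Spark solution; in Spark the same idea would likely be expressed with window functions partitioned by `hack_license`.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical card-paid trips: (driver, pickup_time, pickup_borough, dropoff_borough).
trips = [
    ("A", 1, "Bronx", "Bronx"),
    ("A", 2, "Bronx", "Bronx"),
    ("A", 3, "Bronx", "Manhattan"),
    ("A", 4, "Manhattan", "Bronx"),
    ("B", 1, "Manhattan", "Manhattan"),
    ("B", 2, "Brooklyn", "Brooklyn"),
    ("B", 3, "Manhattan", "Manhattan"),
]

def max_consecutive(rows):
    """Return {driver: {borough: longest run of within-borough trips}}."""
    result = defaultdict(dict)
    ordered = sorted(rows, key=lambda t: (t[0], t[1]))
    for driver, driver_trips in groupby(ordered, key=lambda t: t[0]):
        # A trip is labelled with its borough only if it stays inside it;
        # cross-borough trips get None and therefore break any streak.
        labels = [p if p == d else None for _, _, p, d in driver_trips]
        for label, run in groupby(labels):
            if label is not None:
                length = len(list(run))
                result[driver][label] = max(result[driver].get(label, 0), length)
    return {driver: dict(per_borough) for driver, per_borough in result.items()}

runs = max_consecutive(trips)
print(runs)
```

For driver A this reproduces the example from the statement: the longest Bronx streak is 2.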

## Polars

### Exercise 5

Please work on the merged dataset of trips and fares and perform the following data cleaning tasks:

1. Remove trips with invalid locations (i.e. not in New York City);
2. Remove trips with invalid amounts:
    - Total amount must be greater than zero;
    - Total amount must correspond to the sum of all the other amounts.
3. Remove trips with invalid time:
    - Pick-up must be before drop-off;
    - Duration must be valid.

After each data cleaning task, report how many rows were removed. Finally report:
- Are there **duplicate trips**?
- How many trips remain after cleaning?

In [133]:
import polars as pl

In [134]:
trip_data_df = pl.read_csv(
    "./datasets/trip_data.csv",
    infer_schema_length=250,
    has_header=True,
    schema_overrides={"pickup_datetime": pl.Datetime, "dropoff_datetime": pl.Datetime},
)
trip_fare_df = pl.read_csv(
    "./datasets/trip_fare.csv",
    infer_schema_length=250,
    has_header=True,
    schema_overrides={" pickup_datetime": pl.Datetime, " dropoff_datetime": pl.Datetime},
)

In [135]:
trip_fare_df.head(5)

medallion,hack_license,vendor_id,pickup_datetime,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
str,str,str,datetime[μs],str,f64,f64,f64,f64,f64,f64
"""89D227B655E5C82AECF13C3F540D4C…","""BA96DE419E711691B9445D6A6307C1…","""CMT""",2013-01-01 15:11:48,"""CSH""",6.5,0.0,0.5,0.0,0.0,7.0
"""0BD7C8F5BA12B88E0B67BED28BEA73…","""9FD8F69F0804BDB5549F40E9DA1BE4…","""CMT""",2013-01-06 00:18:35,"""CSH""",6.0,0.5,0.5,0.0,0.0,7.0
"""0BD7C8F5BA12B88E0B67BED28BEA73…","""9FD8F69F0804BDB5549F40E9DA1BE4…","""CMT""",2013-01-05 18:49:41,"""CSH""",5.5,1.0,0.5,0.0,0.0,7.0
"""DFD2202EE08F7A8DC9A57B02ACB81F…","""51EE87E3205C985EF8431D850C7863…","""CMT""",2013-01-07 23:54:15,"""CSH""",5.0,0.5,0.5,0.0,0.0,6.0
"""DFD2202EE08F7A8DC9A57B02ACB81F…","""51EE87E3205C985EF8431D850C7863…","""CMT""",2013-01-07 23:25:03,"""CSH""",9.5,0.5,0.5,0.0,0.0,10.5


In [136]:
print("Trip data columns: ", trip_data_df.columns)
print("Trip fare columns: ", trip_fare_df.columns)

Trip data columns:  ['medallion', 'hack_license', 'vendor_id', 'rate_code', 'store_and_fwd_flag', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
Trip fare columns:  ['medallion', ' hack_license', ' vendor_id', ' pickup_datetime', ' payment_type', ' fare_amount', ' surcharge', ' mta_tax', ' tip_amount', ' tolls_amount', ' total_amount']


In [137]:
trip_data_df = trip_data_df.rename({name: name.strip() for name in trip_data_df.columns})
trip_fare_df = trip_fare_df.rename({name: name.strip() for name in trip_fare_df.columns})

In [138]:
joined_df = trip_data_df.join(trip_fare_df, on=["medallion", "hack_license", "pickup_datetime"], how="inner")

In [139]:
joined_df.head(5)

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_right,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
str,str,str,i64,str,datetime[μs],datetime[μs],i64,i64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64
"""89D227B655E5C82AECF13C3F540D4C…","""BA96DE419E711691B9445D6A6307C1…","""CMT""",1,"""N""",2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.0,-73.978165,40.757977,-73.989838,40.751171,"""CMT""","""CSH""",6.5,0.0,0.5,0.0,0.0,7.0
"""0BD7C8F5BA12B88E0B67BED28BEA73…","""9FD8F69F0804BDB5549F40E9DA1BE4…","""CMT""",1,"""N""",2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.5,-74.006683,40.731781,-73.994499,40.75066,"""CMT""","""CSH""",6.0,0.5,0.5,0.0,0.0,7.0
"""0BD7C8F5BA12B88E0B67BED28BEA73…","""9FD8F69F0804BDB5549F40E9DA1BE4…","""CMT""",1,"""N""",2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.1,-74.004707,40.73777,-74.009834,40.726002,"""CMT""","""CSH""",5.5,1.0,0.5,0.0,0.0,7.0
"""DFD2202EE08F7A8DC9A57B02ACB81F…","""51EE87E3205C985EF8431D850C7863…","""CMT""",1,"""N""",2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,0.7,-73.974602,40.759945,-73.984734,40.759388,"""CMT""","""CSH""",5.0,0.5,0.5,0.0,0.0,6.0
"""DFD2202EE08F7A8DC9A57B02ACB81F…","""51EE87E3205C985EF8431D850C7863…","""CMT""",1,"""N""",2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.1,-73.97625,40.748528,-74.002586,40.747868,"""CMT""","""CSH""",9.5,0.5,0.5,0.0,0.0,10.5


In [140]:
import json
from pathlib import Path
from typing import Any


def get_min_max_coordinates(
    geojson: dict[str, Any],
) -> tuple[float, float, float, float]:
    """
    Get the min/max coordinates from a geojson file.
    """
    # get all the coordinates from the geojson file
    coordinates: list[tuple[float, float]] = []
    for feature in geojson["features"]:
        coordinates.extend(feature["geometry"]["coordinates"][0])
        
    # get the min/max coordinates
    min_lon = min([point[0] for point in coordinates])
    max_lon = max([point[0] for point in coordinates])
    min_lat = min([point[1] for point in coordinates])
    max_lat = max([point[1] for point in coordinates])

    return min_lon, max_lon, min_lat, max_lat

nyc_boroughs_geojson_path = Path("./datasets/nyc-boroughs.geojson")

with nyc_boroughs_geojson_path.open("r") as f:
    json_data = f.read()
    nyc_boroughs_geo_data: dict[str, Any] = json.loads(json_data)
# get the min/max coordinates
min_lon, max_lon, min_lat, max_lat = get_min_max_coordinates(nyc_boroughs_geo_data)

print(f"Min lon: {min_lon}, Max lon: {max_lon}")
print(f"Min lat: {min_lat}, Max lat: {max_lat}")

Min lon: -74.25559136315215, Max lon: -73.70002020503293
Min lat: 40.4961339876118, Max lat: 40.91553277700519
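
The helper above reads only `coordinates[0]` of each feature, which is the outer ring of a `Polygon`. If any feature in the file were a `MultiPolygon` (an assumption, since the file contents are not shown here), that index would point at its first polygon only. A hedged sketch of a bounding-box helper that walks any nesting depth, demonstrated on hypothetical toy data:

```python
from typing import Any, Iterator

def iter_points(coords: Any) -> Iterator[tuple[float, float]]:
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # Reached a single position: [lon, lat, ...]
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_points(part)

def bounding_box(geojson: dict[str, Any]) -> tuple[float, float, float, float]:
    """Return (min_lon, max_lon, min_lat, max_lat) over all features."""
    lons: list[float] = []
    lats: list[float] = []
    for feature in geojson["features"]:
        for lon, lat in iter_points(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return min(lons), max(lons), min(lats), max(lats)

# Hypothetical toy collection: one Polygon and one two-part MultiPolygon.
toy = {
    "features": [
        {"geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
        {"geometry": {"type": "MultiPolygon",
                      "coordinates": [[[[2, 2], [3, 2], [3, 5], [2, 2]]],
                                      [[[-1, 4], [0, 4], [0, 6], [-1, 4]]]]}},
    ]
}
bbox = bounding_box(toy)
print(bbox)
```

A bounding box is still a coarse test: points inside the rectangle but outside every borough polygon pass the filter, so a point-in-polygon check would be stricter.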


In [141]:
filtered_ny_df = joined_df.filter(
    pl.col("pickup_longitude") > min_lon,
    pl.col("pickup_longitude") < max_lon,
    pl.col("pickup_latitude") > min_lat,
    pl.col("pickup_latitude") < max_lat,
    pl.col("dropoff_longitude") > min_lon,
    pl.col("dropoff_longitude") < max_lon,
    pl.col("dropoff_latitude") > min_lat,
    pl.col("dropoff_latitude") < max_lat,
)
print("Data after filtering invalid coordinates: ", filtered_ny_df.shape[0])

Data after filtering invalid coordinates:  14478922


In [142]:
filtered_ny_df = filtered_ny_df.filter(
        pl.col("total_amount") > 0,
        pl.col("total_amount")
        == (
            pl.col("fare_amount")
            + pl.col("surcharge")
            + pl.col("mta_tax")
            + pl.col("tip_amount")
            + pl.col("tolls_amount")
        ),
    )
print("Data after filtering invalid amount: ", filtered_ny_df.shape[0])

Data after filtering invalid amount:  14211370
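
The amount check above compares floating-point values with exact equality, which can reject rows whose components add up correctly on paper but differ in the last binary digit. A minimal sketch of a tolerance-based comparison in plain Python follows; the half-cent tolerance and the function name `amounts_consistent` are assumptions chosen for illustration.

```python
import math

def amounts_consistent(fare, surcharge, mta_tax, tip, tolls, total, tol=0.005):
    """True if total is positive and matches the sum of its parts within tol."""
    expected = fare + surcharge + mta_tax + tip + tolls
    return total > 0 and math.isclose(total, expected, rel_tol=0.0, abs_tol=tol)

# Exact equality fails on a sum that is correct to the cent...
exact_match = (0.1 + 0.2) == 0.3
# ...while the tolerant comparison accepts it.
tolerant_match = amounts_consistent(0.1, 0.2, 0.0, 0.0, 0.0, 0.3)
print(exact_match, tolerant_match)
```

Whether the stricter or the tolerant rule is the right reading of "must correspond to the sum" is a judgment call; the row counts reported in this notebook come from the exact comparison.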


In [143]:
filtered_ny_df = filtered_ny_df.filter(
        pl.col("pickup_datetime") < pl.col("dropoff_datetime"),
        (pl.col("dropoff_datetime") - pl.col("pickup_datetime")).dt.total_seconds() == pl.col("trip_time_in_secs"),
        pl.col("trip_distance") > 0,
        pl.col("trip_time_in_secs") < 24 * 60 * 60
)
print("Data after filtering invalid trip time: ", filtered_ny_df.shape[0])

Data after filtering invalid trip time:  10341916


In [144]:
print("Number of records before filtering: ", joined_df.shape[0])
print("Number of records after filtering: ", filtered_ny_df.shape[0])
print("Deleted records: ", joined_df.shape[0] - filtered_ny_df.shape[0])

Number of records before filtering:  14776615
Number of records after filtering:  10341916
Deleted records:  4434699


In [145]:
# Check for duplicates based on "hack_license", "pickup_datetime" and "dropoff_datetime"
# is_duplicated() flags every row of a duplicated group, so this counts all copies
duplicates = filtered_ny_df.filter(filtered_ny_df.select(["hack_license", "pickup_datetime", "dropoff_datetime"]).is_duplicated())
print("Number of duplicates: ", duplicates.shape[0])

filtered_ny_df = filtered_ny_df.unique(subset=["hack_license", "pickup_datetime", "dropoff_datetime"])

Number of duplicates:  8


### Exercise 6

Compute the **total revenue** (total_amount) grouped by:
- Pick-up hour of the day (0–23)
- Passenger count (group >=6 into “6+”)

Create a heatmap where:
- X-axis = hour
- Y-axis = passenger count group
- Cell value = average revenue per trip

In [146]:
filtered_ny_df_new_columns = filtered_ny_df.with_columns([
    # Extract pickup hour
    pl.col("pickup_datetime").dt.hour().alias("pickup_hour"),

    # Group passenger count: >=6 as "6+"
    pl.when(pl.col("passenger_count") >= 6)
      .then(pl.lit("6+"))
      .otherwise(pl.col("passenger_count").cast(pl.Utf8))
      .alias("passenger_number_per_group"),
])

In [147]:
filtered_ny_df_new_columns.head(5)

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_right,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount,pickup_hour,passenger_number_per_group
str,str,str,i64,str,datetime[μs],datetime[μs],i64,i64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,i8,str
"""23CCCBA2153418E69D8937D0148799…","""B1D8BF0131E690A334AC172FDC4808…","""VTS""",2,,2013-01-24 12:29:00,2013-01-24 13:11:00,1,2520,18.4,-73.983864,40.754971,-73.781837,40.647072,"""VTS""","""CRD""",52.0,0.0,0.5,8.0,0.0,60.5,12,"""1"""
"""419FD46C50CFA0CFD57170AC0D5803…","""B1DC132973C11FD20F05BAF177D6D6…","""CMT""",1,"""N""",2013-01-12 21:06:01,2013-01-12 21:12:22,1,381,0.8,-73.984901,40.742165,-73.986343,40.750359,"""CMT""","""CSH""",6.0,0.5,0.5,0.0,0.0,7.0,21,"""1"""
"""6111CEDC91A2B6369C3BC5AD02C4F6…","""70288814593DC8D1D8BD6B2753C715…","""VTS""",1,,2013-01-12 12:19:00,2013-01-12 12:37:00,1,1080,4.16,-73.949883,40.768585,-74.001358,40.755497,"""VTS""","""CRD""",16.5,0.0,0.5,3.3,0.0,20.3,12,"""1"""
"""7D5BEC7C39560EB229DF829FF300AF…","""1AE1813BAAC09E7D3B89B7E3B1214C…","""VTS""",1,,2013-01-12 22:41:00,2013-01-12 22:49:00,1,480,0.99,-73.989059,40.757942,-73.989883,40.752247,"""VTS""","""CSH""",7.0,0.5,0.5,0.0,0.0,8.0,22,"""1"""
"""0778FF54211E1DC38256E88A13C366…","""C81E4905F7D13F99DFC8E44E76F8AF…","""VTS""",1,,2013-01-15 12:57:00,2013-01-15 13:08:00,1,660,2.38,-73.999748,40.721966,-73.997063,40.746464,"""VTS""","""CRD""",10.0,0.0,0.5,2.5,0.0,13.0,12,"""1"""


In [148]:
grouped_df = filtered_ny_df_new_columns.group_by(
    ["pickup_hour", "passenger_number_per_group"]
).agg(
    pl.col("total_amount").mean().alias("avg_total_amount")
).sort(["pickup_hour", "passenger_number_per_group"])

In [149]:
grouped_df.head(5)

pickup_hour,passenger_number_per_group,avg_total_amount
i8,str,f64
0,"""0""",78.75
0,"""1""",14.639476
0,"""2""",14.525148
0,"""3""",14.01229
0,"""4""",13.819334
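
A heatmap needs one cell per (hour, passenger group) pair, including pairs with no trips. The sketch below shows, in plain Python on hypothetical rows, how the grouped averages can be laid out as a dense grid with rows for passenger groups and columns for hours. The list of group labels is an assumption; the real data also contains a `"0"` group, as the output above shows.

```python
from collections import defaultdict

# Hypothetical (pickup_hour, passenger_count, total_amount) rows.
trips = [
    (0, 1, 10.0),
    (0, 1, 20.0),
    (0, 7, 30.0),
    (13, 6, 12.0),
]

def passenger_group(count: int) -> str:
    """Collapse counts of 6 or more into the single label '6+'."""
    return "6+" if count >= 6 else str(count)

# Accumulate [sum, count] per (hour, group) cell.
cells = defaultdict(lambda: [0.0, 0])
for hour, passengers, total in trips:
    cell = cells[(hour, passenger_group(passengers))]
    cell[0] += total
    cell[1] += 1

groups = ["1", "2", "3", "4", "5", "6+"]
# grid[row][col]: row = passenger group, col = hour; None marks an empty cell.
grid = [
    [cells[(h, g)][0] / cells[(h, g)][1] if (h, g) in cells else None
     for h in range(24)]
    for g in groups
]
print(grid[0][0], grid[5][0], grid[5][13])
```

Each row of `grid` can then be fed to a plotting library as one horizontal band of the heatmap.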


### Exercise 7

Define an "anomalous trip" as one that satisfies at least two of the following:
- Fare per mile is above the 95th percentile
- Tip amount > 100% of fare
- trip_time_in_secs is less than 60 seconds but distance is more than 1 mile

Create a dataframe of anomalous trips and:
- Report how many such trips exist
- Create a scatterplot to visualize the anomaly metrics
- Describe the visualization identifying groups and outliers

In [187]:
anomalous_df = filtered_ny_df.filter(
    pl.col("trip_distance") > 0
)

# Build on the filtered frame so the trip_distance > 0 guard is not discarded
anomalous_df = anomalous_df.with_columns(
    (pl.col("fare_amount") / pl.col("trip_distance")).alias("fare_per_mile")
)

fare_per_mile_95th = anomalous_df["fare_per_mile"].quantile(0.95)

anomalous_df = anomalous_df.with_columns(
    (pl.col("fare_per_mile") > fare_per_mile_95th).alias("wrong_fare_per_mile"),

    (pl.col("tip_amount") > pl.col("fare_amount")).alias("wrong_tip_amount"),

    ((pl.col("trip_time_in_secs") < 60) & (pl.col("trip_distance") > 1)).alias("wrong_trip_time"),
)

In [188]:
anomalous_df.head(5)

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_right,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount,fare_per_mile,wrong_fare_per_mile,wrong_tip_amount,wrong_trip_time
str,str,str,i64,str,datetime[μs],datetime[μs],i64,i64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,bool,bool,bool
"""23CCCBA2153418E69D8937D0148799…","""B1D8BF0131E690A334AC172FDC4808…","""VTS""",2,,2013-01-24 12:29:00,2013-01-24 13:11:00,1,2520,18.4,-73.983864,40.754971,-73.781837,40.647072,"""VTS""","""CRD""",52.0,0.0,0.5,8.0,0.0,60.5,2.826087,False,False,False
"""419FD46C50CFA0CFD57170AC0D5803…","""B1DC132973C11FD20F05BAF177D6D6…","""CMT""",1,"""N""",2013-01-12 21:06:01,2013-01-12 21:12:22,1,381,0.8,-73.984901,40.742165,-73.986343,40.750359,"""CMT""","""CSH""",6.0,0.5,0.5,0.0,0.0,7.0,7.5,False,False,False
"""6111CEDC91A2B6369C3BC5AD02C4F6…","""70288814593DC8D1D8BD6B2753C715…","""VTS""",1,,2013-01-12 12:19:00,2013-01-12 12:37:00,1,1080,4.16,-73.949883,40.768585,-74.001358,40.755497,"""VTS""","""CRD""",16.5,0.0,0.5,3.3,0.0,20.3,3.966346,False,False,False
"""7D5BEC7C39560EB229DF829FF300AF…","""1AE1813BAAC09E7D3B89B7E3B1214C…","""VTS""",1,,2013-01-12 22:41:00,2013-01-12 22:49:00,1,480,0.99,-73.989059,40.757942,-73.989883,40.752247,"""VTS""","""CSH""",7.0,0.5,0.5,0.0,0.0,8.0,7.070707,False,False,False
"""0778FF54211E1DC38256E88A13C366…","""C81E4905F7D13F99DFC8E44E76F8AF…","""VTS""",1,,2013-01-15 12:57:00,2013-01-15 13:08:00,1,660,2.38,-73.999748,40.721966,-73.997063,40.746464,"""VTS""","""CRD""",10.0,0.0,0.5,2.5,0.0,13.0,4.201681,False,False,False


In [189]:
anomalous_df = anomalous_df.filter(
    (
       ( pl.col("wrong_fare_per_mile").cast(pl.Int8)) + 
        (pl.col("wrong_tip_amount").cast(pl.Int8)) + 
       ( pl.col("wrong_trip_time").cast(pl.Int8))
    ) >= 2
)
print("Number of anomalous records: ", anomalous_df.shape[0])

Number of anomalous records:  1182
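
The "at least two of three" rule used above can be checked on single trips with a small plain-Python sketch. The function names and the sample threshold of 10 dollars per mile are hypothetical; in the notebook the threshold is the 95th percentile computed from the data.

```python
def anomaly_flags(fare, tip, distance, seconds, fare_per_mile_p95):
    """Return the three boolean criteria for one trip (distance must be > 0)."""
    return (
        fare / distance > fare_per_mile_p95,  # fare per mile above the threshold
        tip > fare,                           # tip above 100% of the fare
        seconds < 60 and distance > 1,        # more than 1 mile in under a minute
    )

def is_anomalous(fare, tip, distance, seconds, fare_per_mile_p95):
    """A trip is anomalous when at least two criteria hold."""
    # Booleans sum as 0/1, mirroring the Int8 cast in the Polars cell.
    return sum(anomaly_flags(fare, tip, distance, seconds, fare_per_mile_p95)) >= 2

print(is_anomalous(fare=50, tip=0, distance=2, seconds=30, fare_per_mile_p95=10))
print(is_anomalous(fare=50, tip=0, distance=2, seconds=600, fare_per_mile_p95=10))
```

The first trip meets the fare-per-mile and time criteria and is flagged; the second meets only one criterion and is not.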


In [196]:
from bokeh.plotting import figure, show
from bokeh.layouts import row
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Viridis256

# Enable notebook output
output_notebook()

# Convert to ColumnDataSource for better performance
source = ColumnDataSource(anomalous_df.to_pandas())

# Create hover tool with detailed information
hover = HoverTool(
    tooltips=[
        ("Distance", "@trip_distance miles"),
        ("Fare/mile", "@fare_per_mile{$0.2f}"),
        ("Tip", "@tip_amount{$0.2f}"),
        ("Time", "@trip_time_in_secs{0} secs"),
        ("Total Fare", "@total_amount{$0.2f}")
    ],
    formatters={
        '@trip_distance': 'printf',
        '@fare_per_mile': 'numeral',
        '@tip_amount': 'numeral',
        '@total_amount': 'numeral'
    }
)


# 1. Fare per mile vs trip distance
p1 = figure(
    title="Fare Anomalies",
    x_axis_label="Trip Distance (miles)",
    y_axis_label="Fare per Mile ($)",
    height=400,
    width=400,
    tools="",
    
)
p1.scatter(
    source=source,
    x="trip_distance",
    y="fare_per_mile",
    size=8,
    alpha=0.6,
    line_alpha=1
)

# 2. Tip amount vs fare per mile
p2 = figure(
    title="Tip Anomalies",
    x_axis_label="Fare per Mile ($)",
    y_axis_label="Tip Amount ($)",
    height=400,
    width=400,
    tools="",
)
p2.scatter(
    source=source,
    x="fare_per_mile",
    y="tip_amount",
    size=8,
    alpha=0.6,
    line_alpha=1
)

# 3. Time vs distance
p3 = figure(
    title="Time-Distance Anomalies",
    x_axis_label="Trip Distance (miles)",
    y_axis_label="Trip Time (seconds)",
    height=400,
    width=400,
    tools="",
    
)
p3.scatter(
    source=source,
    x="trip_distance",
    y="trip_time_in_secs",
    size=8,
    alpha=0.6,
    line_alpha=1
)
# Add hover tool to each plot
p1.add_tools(hover)
p2.add_tools(hover)
p3.add_tools(hover)

# Display the plots
show(row(p1, p2, p3))

### Exercise 8
For each driver (hack_license), calculate the **total profit per hour worked**, where:
> profit = 0.7 * (fare_amount + tip_amount) when the trip starts between 7:01 AM and 7:00 PM\
> profit = 0.8 * (fare_amount + tip_amount) when the trip starts between 7:01 PM and 7:00 AM

Estimate "hours worked" by summing trip_time_in_secs.

Plot a line chart showing the distribution of average profit per hour **for the top 10% drivers** in terms of total trips.

Which time of day offers the **best earning efficiency**?
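
The two profit rates above depend only on the pickup time, which can be sketched in plain Python as follows. The constants `DAY_START` and `DAY_END` and the function name `trip_profit` are hypothetical, and treating the day window as inclusive from 07:01:00 to 19:00:00 is one possible reading of the statement, since it does not say how the seconds around each boundary should be handled.

```python
from datetime import datetime, time

DAY_START = time(7, 1)   # 7:01 AM, assumed inclusive
DAY_END = time(19, 0)    # 7:00 PM, assumed inclusive

def trip_profit(pickup: datetime, fare: float, tip: float) -> float:
    """Apply the 0.7 day rate or the 0.8 night rate to fare plus tip."""
    rate = 0.7 if DAY_START <= pickup.time() <= DAY_END else 0.8
    return rate * (fare + tip)

# Trips one minute apart across the morning boundary get different rates.
night = trip_profit(datetime(2013, 1, 1, 7, 0), fare=10.0, tip=0.0)
day = trip_profit(datetime(2013, 1, 1, 7, 1), fare=10.0, tip=0.0)
print(night, day)
```

Under this reading a pickup at 19:00:01 already falls in the night window.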

In [230]:
driver_profit_df = filtered_ny_df_new_columns.with_columns(
    pl.when(
        (pl.col("pickup_datetime").dt.time() >= pl.time(7, 1)) & 
        (pl.col("pickup_datetime").dt.time() <= pl.time(19, 0))
    )
    .then(0.7 * (pl.col("fare_amount") + pl.col("tip_amount")))
    .otherwise(0.8 * (pl.col("fare_amount") + pl.col("tip_amount")))
    .alias("total_profit"),
    (pl.col("trip_time_in_secs") / 3600).alias("trip_hours")
).group_by("hack_license", "pickup_hour").agg(
    pl.col("total_profit").sum().alias("total_profit"),
    pl.col("trip_hours").sum().alias("total_hours_worked"),
    pl.col("trip_hours").count().alias("total_trips"),
).with_columns(
    (pl.col("total_profit") / pl.col("total_hours_worked")).round(2).alias("profit_per_hour"),
    (pl.col("total_profit") / pl.col("total_trips")).round(2).alias("profit_per_trip")
).sort("profit_per_hour", descending=True)

In [231]:
driver_profit_df.head(5)

hack_license,pickup_hour,total_profit,total_hours_worked,total_trips,profit_per_hour,profit_per_trip
str,i8,f64,f64,u32,f64,f64
"""003C68DFE1EBE120556D011948C788…",4,30.0,0.000278,1,108000.0,30.0
"""CF74B361ABBD77E73448BC536C6D6A…",2,16.4,0.000556,1,29520.0,16.4
"""5EE1CD0A797F45CA403AD5D4AB224E…",17,12.334,0.000556,1,22201.2,12.33
"""C6C26153997715375783FB29E9D100…",5,48.8,0.0025,1,19520.0,48.8
"""5EE1CD0A797F45CA403AD5D4AB224E…",15,32.2,0.001944,3,16560.0,10.73


In [236]:
top_10_pct_threshold = driver_profit_df["total_trips"].quantile(0.9)
top_drivers = driver_profit_df.filter(pl.col("total_trips") >= top_10_pct_threshold)
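
In the cell above, `total_trips` is counted per (driver, hour) group, so the 0.9 quantile selects busy driver-hour rows rather than drivers. If the intent is to rank drivers by their overall number of trips, the counts can be summed per driver first. A plain-Python sketch on hypothetical rows, keeping the top 10% of drivers by rank:

```python
import math
from collections import Counter

# Hypothetical (driver, pickup_hour, trips) rows, shaped like driver_profit_df.
rows = [(f"d{i}", hour, i) for i in range(1, 11) for hour in (8, 20)]

# Total trips per driver across all hours.
totals = Counter()
for driver, _hour, trips in rows:
    totals[driver] += trips

# Keep the top 10% of drivers (at least one), then select all of their rows.
n_top = max(1, math.ceil(0.10 * len(totals)))
top_driver_ids = {driver for driver, _ in totals.most_common(n_top)}
top_rows = [row for row in rows if row[0] in top_driver_ids]
print(top_driver_ids, len(top_rows))
```

Selecting by rank rather than by a quantile threshold is a design choice made here to keep the example dependency-free; ties at the cut-off are broken arbitrarily.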

In [250]:
best_hours = top_drivers.group_by("pickup_hour").agg(
    (pl.col("total_profit").sum() / pl.col("total_hours_worked").sum()).round(2).alias("avg_profit_per_hour")
).sort("pickup_hour")

best_hours

pickup_hour,avg_profit_per_hour
i8,f64
0,58.52
1,60.25
2,61.39
3,63.43
4,66.96
…,…
19,51.28
20,54.94
21,56.61
22,56.74


In [263]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# Convert Polars DataFrame to Pandas for Bokeh compatibility
best_hours_pd = best_hours.to_pandas()

# Create ColumnDataSource
source = ColumnDataSource(best_hours_pd)

# Create the plot
p = figure(
    title="Average Profit per Hour for Top 10% Drivers",
    x_axis_label='Hour of Day',
    y_axis_label='Average Profit per Hour ($)',
    x_range=(0, 23),  # Numeric range is fine
    width=800,
    height=400,
    tools="pan,wheel_zoom,box_zoom,reset"
)

# Add hover tool
hover = HoverTool(
    tooltips=[
        ("Hour", "@pickup_hour:00"),
        ("Avg Profit/Hour", "$@avg_profit_per_hour{0.00}")
    ],
    mode='vline'
)
p.add_tools(hover)

# Add line plot
p.line(
    x='pickup_hour',
    y='avg_profit_per_hour',
    source=source,
    line_width=2,
    color='navy',
    legend_label='Hourly Profit'
)

# Customize axes
p.xaxis.ticker = list(range(24))
p.xgrid.ticker = list(range(24))
p.yaxis.minor_tick_line_color = None

# Show the plot
show(p)


In [264]:
best_hour_row = best_hours.sort('avg_profit_per_hour', descending=True).row(0, named=True)
best_hour = f"{best_hour_row['pickup_hour']}:00"
print(f"The most profitable hour is at {best_hour} with ${best_hour_row['avg_profit_per_hour']} average profit per hour")

The most profitable hour is at 5:00 with $74.46 average profit per hour
