# **COMP 3610 - ASSIGNMENT 1**
### _**Samuel Soman - 816039318**_
Data Pipeline & Visualization Dashboard
- This notebook builds an end-to-end data pipeline that ingests, transforms, and analyzes the NYC Yellow Taxi Trip dataset (January 2024). 
- We download the data programmatically, clean and validate it, perform SQL-based analysis with DuckDB, and prototype interactive visualizations for the Streamlit dashboard.

## _**Part 1: Data Ingestion & Storage**_
This part covers programmatic download, data validation, and file organisation.

In [29]:
# Import all libraries required for the assignment
import os
import requests
import pandas as pd
import numpy as np
import duckdb
import plotly.express as px
import plotly.graph_objects as go

print('All Libraries Imported Successfully!')

All Libraries Imported Successfully!


### 1. Programmatic Download

In [30]:
TRIP_DATA_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
ZONE_LOOKUP_URL = "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"

RAW_DIR = os.path.join("data", "raw")
os.makedirs(RAW_DIR, exist_ok=True)

TRIP_DATA_PATH = os.path.join(RAW_DIR, "yellow_tripdata_2024-01.parquet")
ZONE_LOOKUP_PATH = os.path.join(RAW_DIR, "taxi_zone_lookup.csv")

In [31]:
def download_file(url, dest_path):
    if os.path.exists(dest_path):
        print(f"Already exists, skipping: {dest_path}")
        return

    print(f"Downloading {url} ...")
    response = requests.get(url, stream=True, timeout=60)  # requests has no default timeout
    response.raise_for_status()

    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

    size_mb = os.path.getsize(dest_path) / (1024 * 1024)
    print(f"Saved to {dest_path} ({size_mb:.1f} MB)")


# Download both datasets
download_file(TRIP_DATA_URL, TRIP_DATA_PATH)
download_file(ZONE_LOOKUP_URL, ZONE_LOOKUP_PATH)

Downloading https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet ...
Saved to data\raw\yellow_tripdata_2024-01.parquet (47.6 MB)
Downloading https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv ...
Saved to data\raw\taxi_zone_lookup.csv (0.0 MB)
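
Because `download_file` skips any path that already exists, an interrupted download could leave a truncated file that later runs treat as complete. One possible guard, sketched here on the assumption that writing to a temporary file and renaming it is acceptable for this project, is shown below. The helper name `write_atomically` and the `.part` suffix are hypothetical and not part of the original notebook.

```python
import os


def write_atomically(chunks, dest_path):
    """Write byte chunks to a temporary file, then rename it into place.

    Hypothetical helper: `chunks` is any iterable of bytes, for example
    response.iter_content(chunk_size=8192) from the requests library.
    """
    tmp_path = dest_path + ".part"  # assumed temporary suffix
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # os.replace moves the finished file over the destination in one step
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial file so a later run does not mistake it for a finished download
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside `download_file`, the `with open(...)` loop could then be replaced by `write_atomically(response.iter_content(chunk_size=8192), dest_path)`.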


### 2. Data Validation
Implement validation checks that:
- a) Verify all expected columns exist in the dataset
- b) Check that date columns are valid datetime types
- c) Report total row count and print a summary to the console
- d) Raise an exception or exit with an error message if validation fails

In [32]:
# Validate that the downloads completed successfully
for label, path in [("Trip data", TRIP_DATA_PATH), ("Zone lookup", ZONE_LOOKUP_PATH)]:
    if os.path.exists(path):
        size_mb = os.path.getsize(path) / (1024 * 1024)
        print(f"{label}: {path} ({size_mb:.2f} MB)")
    else:
        raise FileNotFoundError(f"{label}: {path} NOT FOUND!")

# Validate parquet file is readable
df_raw = pd.read_parquet(TRIP_DATA_PATH)
print(f"\nTrip dataset shape: {df_raw.shape}")
print(f"Columns: {list(df_raw.columns)}")

# Validate zone lookup CSV
zone_df = pd.read_csv(ZONE_LOOKUP_PATH)
print(f"\nZone lookup shape: {zone_df.shape}")
print(f"Columns: {list(zone_df.columns)}")

Trip data: data\raw\yellow_tripdata_2024-01.parquet (47.65 MB)
Zone lookup: data\raw\taxi_zone_lookup.csv (0.01 MB)

Trip dataset shape: (2964624, 19)
Columns: ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'Airport_fee']

Zone lookup shape: (265, 4)
Columns: ['LocationID', 'Borough', 'Zone', 'service_zone']
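
The cell above confirms that both files are readable and prints their shapes and columns. The column and dtype checks from the list, with an exception raised on failure, could be sketched as follows. This is a minimal sketch: the function name `validate_trip_data` is hypothetical, and `EXPECTED_COLUMNS` is an assumed subset of the columns printed above, not a list given by the assignment.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Assumed subset of required columns, taken from the column listing above
EXPECTED_COLUMNS = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime", "PULocationID",
    "DOLocationID", "passenger_count", "trip_distance", "fare_amount",
    "tip_amount", "total_amount", "payment_type",
]
DATETIME_COLUMNS = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]


def validate_trip_data(df, expected_columns=EXPECTED_COLUMNS,
                       datetime_columns=DATETIME_COLUMNS):
    """Raise ValueError if a column is missing or a date column is not datetime."""
    errors = []

    # (a) every expected column must be present
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")

    # (b) date columns must have a datetime dtype
    for col in datetime_columns:
        if col in df.columns and not is_datetime64_any_dtype(df[col]):
            errors.append(f"Column {col} is {df[col].dtype}, expected datetime")

    # (d) stop with an exception if any check failed
    if errors:
        raise ValueError("Validation failed: " + "; ".join(errors))

    # (c) report the row count and a short summary
    print(f"Validation passed: {len(df):,} rows, {len(df.columns)} columns")
    return len(df)
```

In the notebook this would be called as `validate_trip_data(df_raw)` right after the parquet file is loaded.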


### 3. File Organisation & Initial Inspection
Save downloaded files to a data/raw/ directory. Include a .gitignore file that excludes the data directory from version control.
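
The `.gitignore` requirement can also be handled from Python so that the notebook stays self-contained. The snippet below is a sketch, not the author's method: the helper name `ensure_gitignore_entry` is hypothetical, and it assumes the pattern `data/` and a `.gitignore` in the current working directory.

```python
import os


def ensure_gitignore_entry(entry="data/", gitignore_path=".gitignore"):
    """Append `entry` to the .gitignore file unless it is already listed.

    Returns True if the file was modified, False if the entry was present.
    """
    content = ""
    if os.path.exists(gitignore_path):
        with open(gitignore_path, "r", encoding="utf-8") as f:
            content = f.read()

    # Compare against whole lines so "data/" does not match "mydata/"
    if entry in [line.strip() for line in content.splitlines()]:
        return False

    # Start on a fresh line if the existing file lacks a trailing newline
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(gitignore_path, "a", encoding="utf-8") as f:
        f.write(prefix + entry + "\n")
    return True
```

Calling `ensure_gitignore_entry()` once from the project root adds the `data/` line, and repeated calls leave the file unchanged.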

In [33]:
print("data/ directory structure:")
for root, dirs, files in os.walk("data"):
    level = root.replace("data", "").count(os.sep)
    indent = "  " * level
    print(f"{indent}{os.path.basename(root)}/")
    for f in files:
        size_mb = os.path.getsize(os.path.join(root, f)) / (1024 * 1024)
        print(f"{indent}  {f} ({size_mb:.2f} MB)")

print("\n--- Data Types ---")
print(df_raw.dtypes)

print("\n--- First 5 Rows ---")
df_raw.head()

data/ directory structure:
data/
  raw/
    taxi_zone_lookup.csv (0.01 MB)
    yellow_tripdata_2024-01.parquet (47.65 MB)

--- Data Types ---
VendorID                          int32
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int32
DOLocationID                      int32
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
dtype: object

--- First 5 Rows ---


| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2024-01-01 00:57:55 | 2024-01-01 01:17:43 | 1.0 | 1.72 | 1.0 | N | 186 | 79 | 2 | 17.7 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 22.7 | 2.5 | 0.0 |
| 1 | 1 | 2024-01-01 00:03:00 | 2024-01-01 00:09:36 | 1.0 | 1.8 | 1.0 | N | 140 | 236 | 1 | 10.0 | 3.5 | 0.5 | 3.75 | 0.0 | 1.0 | 18.75 | 2.5 | 0.0 |
| 2 | 1 | 2024-01-01 00:17:06 | 2024-01-01 00:35:01 | 1.0 | 4.7 | 1.0 | N | 236 | 79 | 1 | 23.3 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 31.3 | 2.5 | 0.0 |
| 3 | 1 | 2024-01-01 00:36:38 | 2024-01-01 00:44:56 | 1.0 | 1.4 | 1.0 | N | 79 | 211 | 1 | 10.0 | 3.5 | 0.5 | 2.0 | 0.0 | 1.0 | 17.0 | 2.5 | 0.0 |
| 4 | 1 | 2024-01-01 00:46:51 | 2024-01-01 00:52:57 | 1.0 | 0.8 | 1.0 | N | 211 | 148 | 1 | 7.9 | 3.5 | 0.5 | 3.2 | 0.0 | 1.0 | 16.1 | 2.5 | 0.0 |


## _**Part 2: Data Transformation & Analysis**_
This part covers Data Cleaning, Feature Engineering, Saving the Processed Dataset, and Running SQL queries with DuckDB.