This repository contains my solution for Module 2 Homework of the Data Engineering Zoomcamp.
The goal of this homework is to orchestrate an ETL pipeline using Kestra, extend existing flows to process NYC Taxi data for 2021, and answer analytical questions using real executions.

Tools used:

- Kestra (local setup)
- DuckDB (for SQL analytics)
- Shell Script tasks
- HTTP Download tasks
- NYC TLC Green & Yellow Taxi datasets

Dataset: NYC Taxi data from DataTalksClub:
https://github.com/DataTalksClub/nyc-tlc-data

Both the Yellow and Green taxi datasets were processed. The files are published as gzipped CSV release assets named {taxi}_tripdata_{YYYY}-{MM}.csv.gz.

Flow design:

- Parameterized ETL flow (a minimal sketch follows this list)
- Inputs:
  - taxi (yellow / green)
  - year
  - month (MM format)
- Downloads the .csv.gz files over HTTP
- Extracts them to plain CSV files
- Extract outputs used to inspect file sizes and naming
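
A minimal sketch of such a flow, assuming a recent Kestra release with the shell Commands plugin; the flow id, namespace, and value lists are illustrative, not the exact homework flow:

```yaml
id: taxi_etl
namespace: zoomcamp

inputs:
  - id: taxi
    type: SELECT
    values: [yellow, green]
    defaults: yellow
  - id: year
    type: SELECT
    values: ["2020", "2021"]
    defaults: "2021"
  - id: month
    type: SELECT
    values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    defaults: "01"

variables:
  # Rendered per execution, e.g. green_tripdata_2020-04.csv
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"

tasks:
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    outputFiles:
      - "*.csv"
    commands:
      # Download the gzipped CSV release asset and uncompress it in one step
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}
```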

Execution runs:

- Data processed for January–July 2021
- Executed for both yellow and green taxi data
- Monthly runs triggered manually (compatible with older Kestra versions)

Analytics:

- DuckDB used to count rows (see the query sketch below)
- Verified:
  - Yellow Taxi 2020 total rows
  - Green Taxi 2020 total rows
  - Yellow Taxi March 2021 row count
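
A row-count sketch in DuckDB; the data/ directory and file globs are assumptions about where the extracted CSVs were saved:

```sql
-- Total rows across all Yellow Taxi 2020 CSVs (glob over the extracted files)
SELECT count(*) FROM read_csv_auto('data/yellow_tripdata_2020-*.csv');

-- Total rows across all Green Taxi 2020 CSVs
SELECT count(*) FROM read_csv_auto('data/green_tripdata_2020-*.csv');

-- Rows in the single Yellow Taxi March 2021 CSV
SELECT count(*) FROM read_csv_auto('data/yellow_tripdata_2021-03.csv');
```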

Homework questions and answers:

#1. Question: Within the execution for the Yellow Taxi data for year 2020 and month 12, what is the uncompressed file size (i.e., the output file yellow_tripdata_2020-12.csv of the extract task)? Answer: 364.7 MiB

#2. Question: What is the rendered value of the variable file when the input taxi is set to green, year is set to 2020, and month is set to 04 during execution? Answer: green_tripdata_2020-04.csv

#3. Question: How many rows are there for the Yellow Taxi data across all CSV files in the year 2020? Answer: 24,648,499

#4. Question: How many rows are there for the Green Taxi data across all CSV files in the year 2020? Answer: 5,327,301

#5. Question: How many rows are there for the Yellow Taxi data in the March 2021 CSV file? Answer: 1,925,152

#6. Question: How would you configure the timezone to New York in a Schedule trigger? Answer: Add a timezone property set to America/New_York in the Schedule trigger configuration (see the snippet below).
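
A snippet for #6, assuming a recent Kestra release (where the trigger type is io.kestra.plugin.core.trigger.Schedule; older versions used io.kestra.core.models.triggers.types.Schedule). The trigger id and cron expression are illustrative:

```yaml
triggers:
  - id: monthly_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    # Fire at 09:00 on the first day of every month, evaluated in New York time
    cron: "0 9 1 * *"
    timezone: America/New_York
```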