1. question

Module 2 Homework – Orchestration

This repository contains my solution for Module 2 Homework of the Data Engineering Zoomcamp.

The goal of this homework is to orchestrate an ETL pipeline using Kestra, extend existing flows to process NYC Taxi data for 2021, and answer analytical questions using real executions.

Tools & Technologies

Kestra (local setup)
DuckDB (for SQL analytics)
Shell Script tasks
HTTP Download tasks
NYC TLC Green & Yellow Taxi datasets

Dataset

NYC Taxi data from DataTalksClub:

https://github.com/DataTalksClub/nyc-tlc-data

Both Yellow and Green taxi datasets were processed.

Flows Overview

1. `ETL_pipeline.yaml`

Parameterized ETL flow
Inputs:
- taxi (yellow / green)
- year
- month (MM format)
Downloads .csv.gz files
Extracts CSV files
Used to inspect file sizes and naming

2. Backfill for 2021

Data processed for January–July 2021
Executed for both yellow and green taxi data
Monthly runs performed manually (compatible with older Kestra versions)

3. Row Count Flows

DuckDB used to count rows
Verified:
- Yellow Taxi 2020 total rows
- Green Taxi 2020 total rows
- Yellow Taxi March 2021 row count

1. question

Within the execution for Yellow Taxi data for the year 2020 and month 12: what is the uncompressed file size (i.e. the output file yellow_tripdata_2020-12.csv of the extract task)? answer : 364.7 MiB

#2. question What is the rendered value of the variable file when the inputs taxi is set to green, year is set to 2020, and month is set to 04 during execution? answer : green_tripdata_2020-04.csv

#3. question How many rows are there for the Yellow Taxi data for all CSV files in the year 2020? answer : 24,648,499

#4. question How many rows are there for the Green Taxi data for all CSV files in the year 2020? answer : 5,327,301

#5. question How many rows are there for the Yellow Taxi data for the March 2021 CSV file? answer : 1,925,152

#6. question How would you configure the timezone to New York in a Schedule trigger? answer : Add a timezone property set to America/New_York in the Schedule trigger configuration

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
flows		flows
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Module 2 Homework – Orchestration

Tools & Technologies

Dataset

Flows Overview

1. `ETL_pipeline.yaml`

2. Backfill for 2021

3. Row Count Flows

1. question

About

Uh oh!

Releases

Packages

yaaronn/module-2-homework

Folders and files

Latest commit

History

Repository files navigation

Module 2 Homework – Orchestration

Tools & Technologies

Dataset

Flows Overview

1. ETL_pipeline.yaml

2. Backfill for 2021

3. Row Count Flows

1. question

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

1. `ETL_pipeline.yaml`

Packages