Skip to content

This homework is to orchestrate an ETL pipeline using **Kestra**, extend existing flows to process **NYC Taxi data for 2021**, and answer analytical questions using real executions.

Notifications You must be signed in to change notification settings

yaaronn/module-2-homework

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

Module 2 Homework – Orchestration

This repository contains my solution for Module 2 Homework of the Data Engineering Zoomcamp.

The goal of this homework is to orchestrate an ETL pipeline using Kestra, extend existing flows to process NYC Taxi data for 2021, and answer analytical questions using real executions.


Tools & Technologies

  • Kestra (local setup)
  • DuckDB (for SQL analytics)
  • Shell Script tasks
  • HTTP Download tasks
  • NYC TLC Green & Yellow Taxi datasets

Dataset

NYC Taxi data from DataTalksClub:

https://github.com/DataTalksClub/nyc-tlc-data

Both Yellow and Green taxi datasets were processed.


Flows Overview

1. ETL_pipeline.yaml

  • Parameterized ETL flow
  • Inputs:
    • taxi (yellow / green)
    • year
    • month (MM format)
  • Downloads .csv.gz files
  • Extracts CSV files
  • Used to inspect file sizes and naming

2. Backfill for 2021

  • Data processed for January–July 2021
  • Executed for both yellow and green taxi data
  • Monthly runs performed manually (compatible with older Kestra versions)

3. Row Count Flows

  • DuckDB used to count rows
  • Verified:
    • Yellow Taxi 2020 total rows
    • Green Taxi 2020 total rows
    • Yellow Taxi March 2021 row count

1. question

Within the execution for Yellow Taxi data for the year 2020 and month 12: what is the uncompressed file size (i.e. the output file yellow_tripdata_2020-12.csv of the extract task)? answer : 364.7 MiB

#2. question What is the rendered value of the variable file when the inputs taxi is set to green, year is set to 2020, and month is set to 04 during execution? answer : green_tripdata_2020-04.csv

#3. question How many rows are there for the Yellow Taxi data for all CSV files in the year 2020? answer : 24,648,499

#4. question How many rows are there for the Green Taxi data for all CSV files in the year 2020? answer : 5,327,301

#5. question How many rows are there for the Yellow Taxi data for the March 2021 CSV file? answer : 1,925,152

#6. question How would you configure the timezone to New York in a Schedule trigger? answer : Add a timezone property set to America/New_York in the Schedule trigger configuration

About

This homework is to orchestrate an ETL pipeline using **Kestra**, extend existing flows to process **NYC Taxi data for 2021**, and answer analytical questions using real executions.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published