# Module 2 Homework — Workflow Orchestration (Kestra)

This notebook contains the answers to the Module 2 quiz questions and the code used to derive them.
Data source: [NYC TLC data](https://github.com/DataTalksClub/nyc-tlc-data/releases)

## Q1 — Uncompressed file size (Yellow Taxi 2020-12)

**Question:** Within the execution for Yellow Taxi data for the year 2020 and month 12: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the extract task)?

- 128.3 MiB  
- 134.5 MiB  
- 364.7 MiB  
- 692.6 MiB

In [1]:
# Download yellow 2020-12, decompress it, and measure the uncompressed CSV size
# (the same file the Kestra extract task outputs)
import gzip
import os
import shutil
import urllib.request

BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"
taxi, year, month = "yellow", "2020", "12"
filename = f"{taxi}_tripdata_{year}-{month}.csv"
url = f"{BASE}/{taxi}/{filename}.gz"
path_gz = f"temp_{filename}.gz"
path_csv = filename

urllib.request.urlretrieve(url, path_gz)
# Stream-decompress in chunks instead of loading the whole file into memory
with gzip.open(path_gz, "rb") as f_in, open(path_csv, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
size_bytes = os.path.getsize(path_csv)
size_mib = size_bytes / (1024 * 1024)  # 1 MiB = 1,048,576 bytes
print(f"Uncompressed file size: {size_mib:.1f} MiB")
# Clean up the temporary files
os.remove(path_gz)
os.remove(path_csv)

Uncompressed file size: 128.3 MiB


**Answer Q1:** **128.3 MiB**
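
The uncompressed size can also be measured without writing the CSV to disk. The sketch below is an illustrative alternative to the cell above, not what the Kestra flow does: the helper name `uncompressed_size` and the 1 MiB chunk size are assumptions, and a small in-memory gzip payload stands in for the downloaded file.

```python
import gzip
import io


def uncompressed_size(gz_stream, chunk_size=1024 * 1024):
    """Return the decompressed byte count of a gzip stream, reading in chunks."""
    total = 0
    with gzip.GzipFile(fileobj=gz_stream) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Stand-in payload (made-up data, not the real taxi file)
payload = b"VendorID,fare_amount\n" + b"1,12.50\n" * 100_000
compressed = gzip.compress(payload)

size_bytes = uncompressed_size(io.BytesIO(compressed))
print(f"{size_bytes} bytes = {size_bytes / (1024 * 1024):.1f} MiB")
```

Passing an open file or HTTP response instead of the `io.BytesIO` wrapper should work the same way, since the helper only needs an object with a `read` method.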

## Q2 — Rendered value of variable `file`

**Question:** What is the rendered value of the variable `file` when taxi=green, year=2020, month=04?

In `04_postgres_taxi.yaml` the variable is defined as:
```yaml
file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
```
Substituting inputs: `green` + `_tripdata_` + `2020` + `-` + `04` + `.csv`

**Answer Q2:** **green_tripdata_2020-04.csv**
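
The substitution can be reproduced outside Kestra with a few lines of Python. This is only a sketch: Kestra renders expressions with its own template engine, whereas the hypothetical `render` helper below handles nothing beyond plain `{{inputs.name}}` placeholders.

```python
import re


def render(template, inputs):
    """Replace each {{inputs.name}} placeholder with the matching input value."""
    return re.sub(
        r"\{\{\s*inputs\.(\w+)\s*\}\}",
        lambda m: str(inputs[m.group(1)]),
        template,
    )


template = "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
rendered = render(template, {"taxi": "green", "year": "2020", "month": "04"})
print(rendered)  # green_tripdata_2020-04.csv
```

The month is passed as the string `"04"`; an integer `4` would lose the leading zero and produce a different file name.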

## Q3 — Total rows: Yellow Taxi 2020 (all months)

**Question:** How many rows are there for the Yellow Taxi data for all CSV files in the year 2020?

- 13,537,299  
- 24,648,499  
- 18,324,219  
- 29,430,127

In [2]:
# Yellow taxi 2020: all 12 months — total row count
import urllib.request
import gzip
import io

BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"
taxi, year = "yellow", "2020"
total_rows = 0
failed = []  # files whose download or decompression failed
for month in [f"{m:02d}" for m in range(1, 13)]:
    filename = f"{taxi}_tripdata_{year}-{month}.csv"
    url = f"{BASE}/{taxi}/{filename}.gz"
    try:
        with urllib.request.urlopen(url) as r:
            raw = r.read()
        with gzip.open(io.BytesIO(raw), "rt") as f:
            n_lines = sum(1 for _ in f)
        rows = n_lines - 1  # exclude header
        total_rows += rows
        print(f"{filename}: {rows:,} rows")
    except Exception as e:
        failed.append(filename)
        print(f"{filename}: failed - {e}")
print(f"\nTotal Yellow 2020: {total_rows:,} rows")
if failed:
    # A partial total would be misleading, so flag it explicitly
    print(f"WARNING: total excludes {len(failed)} failed file(s): {', '.join(failed)}")

yellow_tripdata_2020-01.csv: 6,405,008 rows
yellow_tripdata_2020-02.csv: 6,299,354 rows
yellow_tripdata_2020-03.csv: 3,007,292 rows
yellow_tripdata_2020-04.csv: 237,993 rows
yellow_tripdata_2020-05.csv: 348,371 rows
yellow_tripdata_2020-06.csv: 549,760 rows
yellow_tripdata_2020-07.csv: 800,412 rows
yellow_tripdata_2020-08.csv: 1,007,284 rows
yellow_tripdata_2020-09.csv: 1,341,012 rows
yellow_tripdata_2020-10.csv: 1,681,131 rows
yellow_tripdata_2020-11.csv: 1,508,985 rows
yellow_tripdata_2020-12.csv: 1,461,897 rows

Total Yellow 2020: 24,648,499 rows


**Answer Q3:** **24,648,499**
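
Counting lines equals counting rows only when no field contains an embedded newline. As a general precaution, records can be counted with the `csv` module instead. The sketch below uses made-up sample data, and `count_records` is a hypothetical helper, not part of the homework flow.

```python
import csv
import io


def count_records(text_stream):
    """Count CSV data records (header excluded), honoring quoted newlines."""
    reader = csv.reader(text_stream)
    next(reader, None)  # skip the header row, if present
    return sum(1 for _ in reader)


# Made-up sample: the second record has a newline inside a quoted field
sample = 'id,note\n1,plain\n2,"two\nlines"\n3,last\n'

n_lines = sum(1 for _ in io.StringIO(sample)) - 1  # line-based count
n_records = count_records(io.StringIO(sample, newline=""))  # record-based count
print(n_lines, n_records)
```

For the real files, the same helper could be given the handle from `gzip.open(..., "rt", newline="")`. If both counts agree, the simpler line-based approach is safe for that file.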

## Q4 — Total rows: Green Taxi 2020 (all months)

**Question:** How many rows are there for the Green Taxi data for all CSV files in the year 2020?

- 5,327,301  
- 936,199  
- 1,734,051  
- 1,342,034

In [3]:
# Green taxi 2020: all 12 months — total row count
import urllib.request
import gzip
import io

BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"
taxi, year = "green", "2020"
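total_rows = 0
failed = []  # files whose download or decompression failed
for month in [f"{m:02d}" for m in range(1, 13)]:
    filename = f"{taxi}_tripdata_{year}-{month}.csv"
    url = f"{BASE}/{taxi}/{filename}.gz"
    try:
        with urllib.request.urlopen(url) as r:
            raw = r.read()
        with gzip.open(io.BytesIO(raw), "rt") as f:
            n_lines = sum(1 for _ in f)
        rows = n_lines - 1  # exclude header
        total_rows += rows
        print(f"{filename}: {rows:,} rows")
    except Exception as e:
        failed.append(filename)
        print(f"{filename}: failed - {e}")
print(f"\nTotal Green 2020: {total_rows:,} rows")
if failed:
    # A partial total would be misleading, so flag it explicitly
    print(f"WARNING: total excludes {len(failed)} failed file(s): {', '.join(failed)}")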

green_tripdata_2020-01.csv: 447,770 rows
green_tripdata_2020-02.csv: 398,632 rows
green_tripdata_2020-03.csv: 223,406 rows
green_tripdata_2020-04.csv: 35,612 rows
green_tripdata_2020-05.csv: 57,360 rows
green_tripdata_2020-06.csv: 63,109 rows
green_tripdata_2020-07.csv: 72,257 rows
green_tripdata_2020-08.csv: 81,063 rows
green_tripdata_2020-09.csv: 87,987 rows
green_tripdata_2020-10.csv: 95,120 rows
green_tripdata_2020-11.csv: 88,605 rows
green_tripdata_2020-12.csv: 83,130 rows

Total Green 2020: 1,734,051 rows


**Answer Q4:** **1,734,051**
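
As a quick offline cross-check, the per-month counts printed for Q3 and Q4 can be summed directly. The lists below are copied from the cell outputs in this notebook; the variable names are only illustrative.

```python
# Monthly row counts for 2020, January through December, copied from the outputs above
yellow_2020 = [6_405_008, 6_299_354, 3_007_292, 237_993, 348_371, 549_760,
               800_412, 1_007_284, 1_341_012, 1_681_131, 1_508_985, 1_461_897]
green_2020 = [447_770, 398_632, 223_406, 35_612, 57_360, 63_109,
              72_257, 81_063, 87_987, 95_120, 88_605, 83_130]

yellow_total = sum(yellow_2020)
green_total = sum(green_2020)
print(f"Yellow 2020: {yellow_total:,}")  # Yellow 2020: 24,648,499
print(f"Green 2020: {green_total:,}")    # Green 2020: 1,734,051
```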

## Q5 — Rows in Yellow Taxi March 2021

**Question:** How many rows are there for the Yellow Taxi data for the March 2021 CSV file?

- 1,428,092  
- 706,911  
- 1,925,152  
- 2,561,031

In [4]:
# Yellow taxi 2021-03 — row count
import urllib.request
import gzip
import io

BASE = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"
url = f"{BASE}/yellow/yellow_tripdata_2021-03.csv.gz"
with urllib.request.urlopen(url) as r:
    raw = r.read()
with gzip.open(io.BytesIO(raw), "rt") as f:
    n_lines = sum(1 for _ in f)
rows = n_lines - 1  # exclude header
print(f"yellow_tripdata_2021-03.csv: {rows:,} rows")

yellow_tripdata_2021-03.csv: 1,925,152 rows


**Answer Q5:** **1,925,152**

## Q6 — Timezone for New York in Schedule trigger

**Question:** How would you configure the timezone to New York in a Schedule trigger?

Kestra's Schedule trigger takes standard **IANA timezone** names, which set the local time in which the cron expression is evaluated. `EST` is an ambiguous abbreviation that does not account for daylight saving time; `UTC-5` is a fixed offset, not a named timezone; and there is no `location` property.

**Answer Q6:** **Add a `timezone` property set to `America/New_York`** in the Schedule trigger configuration.
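
The difference between a named zone and a fixed offset can be seen with Python's standard `zoneinfo` module. This sketch does not call Kestra; it only illustrates why `America/New_York` is preferable to a `UTC-5` offset. It assumes Python 3.9+ and that IANA time zone data is available on the system (or via the `tzdata` package), and the dates are arbitrary examples.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
fixed = timezone(timedelta(hours=-5))  # what a bare UTC-5 offset means

# 09:00 New York local time on a winter day and on a summer day
winter = datetime(2021, 1, 15, 9, 0, tzinfo=ny)
summer = datetime(2021, 7, 15, 9, 0, tzinfo=ny)

winter_offset_h = winter.utcoffset() / timedelta(hours=1)
summer_offset_h = summer.utcoffset() / timedelta(hours=1)
print(winter_offset_h, summer_offset_h)  # -5.0 -4.0

# The same summer instant expressed at a fixed UTC-5 offset is one hour earlier
summer_hour_at_fixed = summer.astimezone(fixed).hour
print(summer_hour_at_fixed)  # 8
```

A schedule pinned to a fixed offset would therefore drift by one hour relative to New York wall-clock time during daylight saving months.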

---

## Summary of answers

| Q | Answer |
|---|--------|
| 1 | **128.3 MiB** |
| 2 | **green_tripdata_2020-04.csv** |
| 3 | **24,648,499** |
| 4 | **1,734,051** |
| 5 | **1,925,152** |
| 6 | **America/New_York** (timezone property) |