## Week 2 Homework

> If your result doesn't match any option exactly, select the closest one.

For the homework, we'll be working with the _green_ taxi dataset located here:

`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`

### Assignment

The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).

- Create a new pipeline, call it `green_taxi_etl`
- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months `10`, `11`, `12`).
  - You can use the same datatypes and date parsing methods shown in the course.
  - `BONUS`: load the final three months using a for loop and `pd.concat`
- Add a transformer block and perform the following:
  - Remove rows where the passenger count is equal to 0 _or_ the trip distance is equal to zero.
  - Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.
  - Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.
  - Add three assertions:
    - `vendor_id` is one of the values currently present in the column
    - `passenger_count` is greater than 0
    - `trip_distance` is greater than 0
- Using a Postgres data exporter (SQL or Python), write the dataset to a table called `green_taxi` in a schema `mage`. Replace the table if it already exists.
- Write your data as Parquet files to a bucket in GCP, partitioned by `lpep_pickup_date`. Use the `pyarrow` library!
- Schedule your pipeline to run daily at 5AM UTC.

## Question 1. Data Loading

Once the dataset is loaded, what's the shape of the data?

* 266,855 rows x 20 columns
* 544,898 rows x 18 columns
* 544,898 rows x 20 columns
* 133,744 rows x 20 columns

### **Answer =>** `266,855 rows x 20 columns`


```shell
1/1 tests passed.
=== After Loading: ===
+-------------------------+--------+
|          Metric         | Value  |
+-------------------------+--------+
|        Total Rows       | 266855 |
|      Total Columns      |   20   |
|  NaN in Passenger Count | 119462 |
|   NaN in Trip Distance  |   0    |
| Zero in Passenger Count |  661   |
|  Zero in Trip Distance  |  8862  |
+-------------------------+--------+
DataFrame Column Types Summary:
--------------------------------
VendorID                      : Int64
lpep_pickup_datetime          : datetime64[ns]
lpep_dropoff_datetime         : datetime64[ns]
store_and_fwd_flag            : object
RatecodeID                    : Int64
PULocationID                  : Int64
DOLocationID                  : Int64
passenger_count               : Int64
trip_distance                 : float64
fare_amount                   : float64
extra                         : float64
mta_tax                       : float64
tip_amount                    : float64
tolls_amount                  : float64
ehail_fee                     : float64
improvement_surcharge         : float64
total_amount                  : float64
payment_type                  : Int64
trip_type                     : float64
congestion_surcharge          : float64
--------------------------------
```
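
A minimal sketch of the loader step (for loop plus `pd.concat`), not the exact block used to produce the output above. The helper names `month_url` and `load_months`, the trimmed dtype map, and the file-naming pattern under `releases/download/green` are assumptions; check the pattern against the release page before relying on it.

```python
import pandas as pd

# Assumed location and naming of the monthly files; verify against the release page.
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green"

# Trimmed dtype map for illustration; extend it to the remaining columns as needed.
TAXI_DTYPES = {
    "VendorID": pd.Int64Dtype(),
    "passenger_count": pd.Int64Dtype(),
    "trip_distance": float,
}
PARSE_DATES = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]


def month_url(year, month):
    """Build the (assumed) URL of one monthly green taxi file."""
    return f"{BASE_URL}/green_tripdata_{year}-{month:02d}.csv.gz"


def load_months(sources):
    """Read every source (URL, path, or file-like object) and stack the frames."""
    frames = []
    for source in sources:
        frames.append(
            pd.read_csv(source, dtype=TAXI_DTYPES, parse_dates=PARSE_DATES)
        )
    return pd.concat(frames, ignore_index=True)
```

In a loader block this could be called as `load_months([month_url(2020, m) for m in (10, 11, 12)])`, after which `df.shape` gives the answer to this question.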

## Question 2. Data Transformation

Upon filtering the dataset where the passenger count is greater than 0 _and_ the trip distance is greater than zero, how many rows are left?

* 544,897 rows
* 266,855 rows
* 139,370 rows
* 266,856 rows

### **Answer =>** `139,370 rows`

```shell
=== Before Transform : ===
+-------------------------+--------+
|          Metric         | Value  |
+-------------------------+--------+
|        Total Rows       | 266855 |
|      Total Columns      |   20   |
|  NaN in Passenger Count | 119462 |
|   NaN in Trip Distance  |   0    |
| Zero in Passenger Count |  661   |
|  Zero in Trip Distance  |  8862  |
+-------------------------+--------+
=== After Transform: ===
+-------------------------+--------+
|          Metric         | Value  |
+-------------------------+--------+
|        Total Rows       | 139370 |
|      Total Columns      |   21   |
|  NaN in Passenger Count |   0    |
|   NaN in Trip Distance  |   0    |
| Zero in Passenger Count |   0    |
|  Zero in Trip Distance  |   0    |
+-------------------------+--------+
DataFrame Column Types Summary:
--------------------------------
vendor_id                     : Int64
lpep_pickup_datetime          : datetime64[ns]
lpep_dropoff_datetime         : datetime64[ns]
store_and_fwd_flag            : object
ratecode_id                   : Int64
pu_location_id                : Int64
do_location_id                : Int64
passenger_count               : Int64
trip_distance                 : float64
fare_amount                   : float64
extra                         : float64
mta_tax                       : float64
tip_amount                    : float64
tolls_amount                  : float64
ehail_fee                     : float64
improvement_surcharge         : float64
total_amount                  : float64
payment_type                  : Int64
trip_type                     : float64
congestion_surcharge          : float64
lpep_pickup_date              : object
--------------------------------
```
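
The filter itself can be sketched as below. The function name `filter_trips` is hypothetical. Note that `passenger_count` is a nullable `Int64` column, so comparing it with `> 0` yields `<NA>` for missing values; filling those with `False` drops the rows, which would be consistent with the NaN count falling to 0 in the "After Transform" table.

```python
import pandas as pd


def filter_trips(df):
    """Keep rows with a positive passenger count and a positive trip distance."""
    keep = (df["passenger_count"] > 0) & (df["trip_distance"] > 0)
    # Missing passenger counts produce <NA> in the mask; treat them as "drop".
    return df[keep.fillna(False)].reset_index(drop=True)
```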

## Question 3. Data Transformation

Which of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?

* `data = data['lpep_pickup_datetime'].date`
* `data('lpep_pickup_date') = data['lpep_pickup_datetime'].date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()`

### **Answer =>** `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`
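
As a small illustration on made-up timestamps, the `.dt` accessor exposes `date` as a property (not a method), and the result is a column of plain `datetime.date` objects. That is why `lpep_pickup_date` is listed with dtype `object` in the summary above.

```python
import datetime

import pandas as pd

# Two made-up pickup timestamps, used only to show the conversion.
data = pd.DataFrame(
    {
        "lpep_pickup_datetime": pd.to_datetime(
            ["2020-10-01 00:10:00", "2020-12-31 23:59:59"]
        )
    }
)

# .dt.date drops the time component and returns datetime.date objects.
data["lpep_pickup_date"] = data["lpep_pickup_datetime"].dt.date

first_date = data["lpep_pickup_date"].iloc[0]
print(first_date, type(first_date).__name__)
```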


## Question 4. Data Transformation

What are the existing values of `VendorID` in the dataset?

* 1, 2, or 3
* 1 or 2
* 1, 2, 3, 4
* 1

### **Answer =>** `1 or 2`
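
One way to find the existing values and turn them into the three assertions from the assignment is sketched here. Both function names are hypothetical, and the column names assume the snake case renaming has already been applied.

```python
import pandas as pd


def existing_vendor_ids(df, column="vendor_id"):
    """Return the distinct non-missing vendor IDs as a sorted list of ints."""
    return sorted(int(v) for v in df[column].dropna().unique())


def check_output(df, allowed_vendor_ids):
    """Raise AssertionError if the transformed frame violates the assignment rules."""
    assert df["vendor_id"].isin(allowed_vendor_ids).all(), "unexpected vendor_id"
    assert (df["passenger_count"] > 0).all(), "passenger_count must be > 0"
    assert (df["trip_distance"] > 0).all(), "trip_distance must be > 0"
```

In a Mage transformer, the same checks would typically be split into separate test functions so that each failure is reported on its own.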

## Question 5. Data Transformation

How many columns need to be renamed to snake case?

* 3
* 6
* 2
* 4

### **Answer =>** `4`
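
A possible regex-based renaming is sketched below; the helper name `camel_to_snake` is hypothetical, and the two patterns are only checked against the column names that appear in this dataset. Counting the names that change reproduces the answer.

```python
import re


def camel_to_snake(name):
    """Convert names such as 'PULocationID' to 'pu_location_id'."""
    # Insert "_" between a lowercase letter or digit and an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Insert "_" between an acronym and a following capitalised word.
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    return name.lower()


# Column names as listed in the loader output above.
columns = [
    "VendorID", "lpep_pickup_datetime", "lpep_dropoff_datetime",
    "store_and_fwd_flag", "RatecodeID", "PULocationID", "DOLocationID",
    "passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "ehail_fee", "improvement_surcharge",
    "total_amount", "payment_type", "trip_type", "congestion_surcharge",
]

renamed = {c: camel_to_snake(c) for c in columns if camel_to_snake(c) != c}
print(len(renamed), renamed)
```

With a DataFrame, the mapping can be applied as `df.rename(columns=renamed)` or `df.columns = [camel_to_snake(c) for c in df.columns]`.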

## Question 6. Data Exporting

Once exported, how many partitions (folders) are present in Google Cloud?

* 96
* 56
* 67
* 108

### **Answer =>** `96`

```shell
 gsutil \
    ls \
    -p shamb0-zoomcamp-lab-01 \
    gs://shamb0_zcamp_2024_hcl_demo_v1_bucket/green_trip_2020_q3 | \
    grep "/" | \
    wc -l
96
```
