## Enrich the Previously Ingested Dataset with the Holidays Dataset
Now that the taxi data has been downloaded, prepared, and saved as a table on the Silver Layer, let's add holiday data as additional features. Holiday-specific features should help model accuracy, because major holidays are times when taxi demand rises sharply and supply becomes limited.

Let's load the [public holidays]() from the Bronze Layer.


<font size="2" color="red" face="sans-serif" bold> 

<b> <i> <u>Make sure that you pin this Notebook to the SilverLakehouse. To do this, click Lakehouses and check that the SilverLakehouse is the one showing the pin icon for this Notebook.</u> </i> </b>
</font>






Import the modules we need for this Notebook.

In [11]:
from datetime import datetime
import pyspark.sql.functions as f


## Data Ingestion -- Holiday Dataset
We are going to load the holiday data from the Bronze Layer.

<font size="2" color="red" face="sans-serif" bold> 

<b> <i> <u>To load the holiday data from the Bronze Layer, you need to retrieve the ABFSS path of the BronzeLakehouse.
This is required because the SilverLakehouse, not the BronzeLakehouse, is pinned to this Notebook.</u> </i> </b>
</font>

In [8]:
# Load the holiday data from the Bronze files into the hol_df DataFrame
# Sample ABFSS Path: abfss://d239837d-5508-4a0c-acf9-8699feb71c5a@msit-onelake.dfs.fabric.microsoft.com/22c420ed-63ea-4c95-9d01-302573d1d5db

hol_df = spark.read.format("parquet").load("<Replace with your BronzeLakehouse ABFSS path>/Files/HolidayRawFiles/")

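The load path in the cell above is the BronzeLakehouse ABFSS root followed by `Files/HolidayRawFiles/`. A minimal sketch of assembling it, assuming a hypothetical helper `bronze_files_path` (not part of Fabric or PySpark) that also rejects the unreplaced placeholder:

```python
def bronze_files_path(abfss_root: str, relative: str = "Files/HolidayRawFiles/") -> str:
    """Join a lakehouse ABFSS root with a relative folder (hypothetical helper)."""
    if not abfss_root.startswith("abfss://"):
        # Catches the "<Replace with ...>" placeholder and other non-ABFSS strings.
        raise ValueError(f"Expected an abfss:// path, got: {abfss_root!r}")
    # Avoid doubled or missing slashes at the join point.
    return abfss_root.rstrip("/") + "/" + relative.lstrip("/")


# Sample root copied from the comment in the cell above; replace it with your own.
sample_root = (
    "abfss://d239837d-5508-4a0c-acf9-8699feb71c5a"
    "@msit-onelake.dfs.fabric.microsoft.com/22c420ed-63ea-4c95-9d01-302573d1d5db"
)
holiday_path = bronze_files_path(sample_root)
```

The resulting `holiday_path` could then be passed to `spark.read.format("parquet").load(...)` in place of the hand-edited string.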

Display the Data Frame for the Holiday data. We are going to limit the result to 10 rows

In [None]:
# hol_df is now a Spark DataFrame containing the Parquet data from the above path
display(hol_df.limit(10))

Rename `countryRegionCode` to `country_code` to match the field name in the taxi data, and derive a `datetime` column from `date` with the time of day stripped so it can be used as a join key.

In [None]:
hol_df_clean = hol_df \
                .withColumnRenamed('countryRegionCode','country_code') \
                .withColumn('datetime',f.to_date('date'))

hol_df_clean.show(5)
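`f.to_date` keeps only the calendar date of its input, which is what lets the new `datetime` column act as a key. The effect can be sketched in plain Python; this is an analogy using the standard library, not the Spark implementation, and the sample timestamps are made up:

```python
from datetime import date, datetime


def normalize_to_date(ts: datetime) -> date:
    """Drop the time-of-day component, keeping only the calendar date."""
    return ts.date()


# Illustrative timestamps only; not taken from the holiday dataset.
midnight = datetime(2017, 1, 1, 0, 0, 0)
afternoon = datetime(2017, 1, 1, 15, 30, 0)

# Both timestamps fall on the same day, so they normalize to the same key.
same_key = normalize_to_date(midnight) == normalize_to_date(afternoon)
```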

## Writing a Delta Table
Before proceeding, it's a good idea to check whether the table already exists. If it does and you want to recreate it from scratch, uncomment the command below and execute the cell to drop it.

In [None]:
%%sql
--DROP TABLE IF EXISTS Holiday_Clean

## Save the Clean File as a Table in the Silver Layer by Providing the ABFSS Path

To retrieve the ABFSS path, follow these steps:

1. Go to the Workspace and click on the Silver Lakehouse.
1. Navigate to the Tables section and select the desired table.
1. Click on Properties and copy the ABFSS path provided.
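A copied path can be sanity-checked before it is used. The sketch below relies on the standard library's `urlsplit` and assumes the path follows the layout of the sample paths in this notebook (`abfss://<workspace>@<host>/<lakehouse>/...`); the function name and the returned keys are hypothetical:

```python
from urllib.parse import urlsplit


def describe_abfss_path(path: str) -> dict:
    """Split an ABFSS path into its parts (assumes the sample-path layout)."""
    parts = urlsplit(path)
    if parts.scheme != "abfss" or not parts.username or not parts.hostname:
        raise ValueError(f"Not a well-formed abfss:// path: {path!r}")
    segments = [s for s in parts.path.split("/") if s]
    return {
        "workspace": parts.username,                      # text before the '@'
        "host": parts.hostname,                           # OneLake endpoint
        "lakehouse": segments[0] if segments else None,   # first path segment
        "remainder": "/".join(segments[1:]),              # e.g. Tables/<name>
    }


# Sample Silver path from this notebook, extended with a table location.
info = describe_abfss_path(
    "abfss://d239837d-5508-4a0c-acf9-8699feb71c5a"
    "@msit-onelake.dfs.fabric.microsoft.com/a6114b59-17f1-45b7-a5f1-9d4fd93a92d8"
    "/Tables/Holiday_Clean"
)
```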

In [4]:
# Save the clean file as a table in the Silver Lakehouse
# Sample ABFSS Path: abfss://d239837d-5508-4a0c-acf9-8699feb71c5a@msit-onelake.dfs.fabric.microsoft.com/a6114b59-17f1-45b7-a5f1-9d4fd93a92d8
table_name = 'Holiday_Clean'

hol_df_clean.write \
    .mode("overwrite") \
    .format("delta") \
    .save("Tables/" + table_name)


Since we saved `NYCTaxi_Clean` as a table in the Silver layer, we can create a new DataFrame, `nyc_tlc_df_clean`, by reading the `NYCTaxi_Clean` table.

<font size="2" color="red" face="sans-serif" bold> 

<b> <i> <u>If you encounter an error indicating that the table or the lakehouse cannot be located, try refreshing your SilverLakehouse and verifying the table name once more.</u> </i> </b>
</font>


In [17]:
# Enrich taxi data with holiday data.
# Creating a DataFrame lets you read tables from different lakehouses in the same workspace.

nyc_tlc_df_clean = spark.read.table("SilverLakehouse.NYCTaxi_Clean")


## Adding Transformations
Next, join the holiday data with the taxi data using a left join. This preserves all records from the taxi data and adds holiday data where it exists for the corresponding `datetime` and `country_code`, which in this case is always "US". Preview the result to verify that the two datasets were merged correctly.

In [None]:
nyc_taxi_holiday_df = nyc_tlc_df_clean.join(hol_df_clean, on = ['datetime', 'country_code'] , how = 'left')

nyc_taxi_holiday_df.show(5)
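The left-join behavior can be illustrated without Spark. The sketch below is a plain-Python analogy with made-up rows (the `fare` and `holidayName` fields are assumptions for illustration): every taxi row survives, and holiday fields are `None` where no `(datetime, country_code)` match exists.

```python
from datetime import date

# Made-up rows for illustration; not taken from the real tables.
taxi_rows = [
    {"datetime": date(2017, 1, 1), "country_code": "US", "fare": 12.5},
    {"datetime": date(2017, 1, 3), "country_code": "US", "fare": 8.0},
]
holiday_rows = [
    {"datetime": date(2017, 1, 1), "country_code": "US", "holidayName": "New Year's Day"},
]


def left_join(left, right, keys):
    """Left join two lists of dicts on `keys` (right-side keys assumed unique)."""
    index = {tuple(r[k] for k in keys): r for r in right}
    right_cols = {c for r in right for c in r} - set(keys)
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        # Unmatched left rows are kept, with None for every right-side column.
        extra = {c: (match.get(c) if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


result = left_join(taxi_rows, holiday_rows, ["datetime", "country_code"])
```

In the notebook itself, comparing `nyc_taxi_holiday_df.count()` with `nyc_tlc_df_clean.count()` is one way to confirm that no taxi rows were lost. The two counts match only if the holiday data has at most one row per `(datetime, country_code)` pair; duplicate keys on the holiday side would multiply the matching taxi rows.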

## Writing a Delta Table with Transformed Data
Before proceeding, it's a good idea to check whether the table already exists: right-click Tables and select Refresh. If it does and you want to recreate it from scratch, uncomment the command below and execute the cell to drop it.

In [None]:
%%sql
-- DROP TABLE IF EXISTS SilverLakehouse.NYCTaxi_Holiday

We're now ready to store `nyc_taxi_holiday_df` as a table named `NYCTaxi_Holiday` on the Silver layer. By specifying the ABFSS path, we'll be able to easily create DataFrames from this table and use them in future data transformations.

In [None]:
# Save the joined taxi and holiday DataFrame as a table in the Silver Layer

table_name = 'NYCTaxi_Holiday'

nyc_taxi_holiday_df.write \
    .mode("overwrite") \
    .format("delta") \
    .save("Tables/" + table_name)