### Bronze Layer: Ingest drivers.json

Load the raw JSON file (each record contains a nested `name` object) from the landing zone and save it as Parquet in the bronze/raw layer.

In [1]:
dbutils.widgets.text("p_data_source", "")
v_data_source = dbutils.widgets.get("p_data_source")

In [2]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
from pyspark.sql.functions import current_timestamp, lit

In [3]:
from formula1.formula1_constants import landing_folder_path, raw_folder_path

##### Step 1 - Define schema and read JSON from landing zone

In [4]:
name_schema = StructType(fields=[
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])

drivers_schema = StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", IntegerType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_schema, True),  # nested struct: forename, surname
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
])
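To make the nesting concrete, here is a minimal sketch of the record shape the schema above describes. It uses only the standard library. The sample values are invented for illustration and are not taken from `drivers.json`; the sketch also assumes one JSON object per line, which is what Spark's JSON reader expects by default.

```python
import json

# Hypothetical record for illustration only; values are NOT from drivers.json.
sample_line = (
    '{"driverId": 1, "driverRef": "example", "number": 7, "code": "EXA", '
    '"name": {"forename": "Ada", "surname": "Example"}, '
    '"dob": "1990-01-01", "nationality": "British", "url": "http://example.com"}'
)

record = json.loads(sample_line)

# Top-level keys line up with the fields declared in drivers_schema.
top_level_fields = list(record.keys())

# The nested object lines up with name_schema (forename, surname).
full_name = f'{record["name"]["forename"]} {record["name"]["surname"]}'

print(top_level_fields)
print(full_name)
```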

In [5]:
drivers_df = spark.read \
    .schema(drivers_schema) \
    .json(f"{landing_folder_path}/drivers.json")

##### Step 2 - Add ingestion metadata

In [6]:
drivers_df = drivers_df \
    .withColumn("ingestion_date", current_timestamp()) \
    .withColumn("data_source", lit(v_data_source))

##### Step 3 - Write to bronze/raw layer as parquet

In [7]:
drivers_df.write.mode("overwrite").parquet(f"{raw_folder_path}/drivers")