Silver Layer: **Clean** and **Standardize** data from the Bronze layer

At this layer we clean and transform the raw Bronze data:
  - Cast date columns to a proper date type
  - Handle missing/null values
  - Add `return_delay_days` to track delays
  
You can think of this as librarians organizing, tagging, and fixing messy book records.

#### We import the functions and types needed for this task

In [0]:
from pyspark.sql.functions import col, to_date, datediff, lit, coalesce, current_date
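# Import only the types used by the schemas below instead of a wildcard
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType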


#### We define the following schemas for the silver tables. They record the column names and types each table is expected to have; the transformation cells cast each column to match.

In [0]:
books_schema = StructType([
    StructField("isbn", StringType(), False),
    StructField("title", StringType(), True),
    StructField("author", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("publish_date", DateType(), True),
    StructField("pages", IntegerType(), True)
])

borrowers_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("book_isbn", StringType(), True),
    StructField("borrow_date", DateType(), True),
    StructField("return_date", DateType(), True),
    StructField("return_delay_days", IntegerType(), True)
])

staff_schema = StructType([
    StructField("staff_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("role", StringType(), True),
    StructField("hire_date", DateType(), True)
])

#### We load the data from the bronze tables into DataFrames.

In [0]:
# Load Bronze tables
books_bronze = spark.table("books_bronze")
borrowers_bronze = spark.table("borrowers_bronze")
staff_bronze = spark.table("staff_bronze") 

#### This code creates books_silver from the books_bronze DataFrame by cleaning and transforming the selected columns.

In [0]:
# Clean books data
books_silver = books_bronze.select(
    col("isbn").cast("string"),
    col("title"),
    col("author"),
    col("genre"),
    to_date(col("publish_date")).alias("publish_date"),
    col("pages").cast("int")
)


In [0]:
books_silver.write.mode("overwrite").format("delta").option("overwriteSchema", True).saveAsTable("books_silver")

#### We view the table to have a look at our data, limited to 10 rows.

In [0]:
books_silver_df = spark.table("books_silver").limit(10)
books_silver_df.show()


#### This code creates borrowers_silver from the borrowers_bronze DataFrame by cleaning and transforming the selected columns.

In [0]:
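# Clean borrowers data
borrowers_silver = borrowers_bronze.select(
    col("user_id"),
    col("name"),
    col("book_isbn"),
    # Cast borrow_date so it matches the DateType declared in borrowers_schema
    to_date(col("borrow_date")).alias("borrow_date"),
    # A missing return_date is replaced with today's date
    to_date(coalesce(col("return_date"), current_date())).alias("return_date"),
    # Days from borrowing to return; a missing borrow_date falls back to 2000-01-01
    datediff(
        to_date(coalesce(col("return_date"), current_date())),
        to_date(coalesce(col("borrow_date"), lit("2000-01-01")))
    ).alias("return_delay_days")
)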
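
To make the null handling concrete, here is the same rule worked for single records in plain Python. It mirrors the `coalesce` and `datediff` logic of the cell above: a missing `return_date` falls back to today, and a missing `borrow_date` falls back to 2000-01-01. The function name and the sample dates are illustrative assumptions, not values from the tables.

```python
from datetime import date

# Same fallback as lit("2000-01-01") in the Spark expression
DEFAULT_BORROW_DATE = date(2000, 1, 1)

def return_delay_days(borrow_date, return_date, today):
    """Days between borrowing and return, with the notebook's null fallbacks."""
    effective_return = return_date if return_date is not None else today
    effective_borrow = borrow_date if borrow_date is not None else DEFAULT_BORROW_DATE
    return (effective_return - effective_borrow).days

today = date(2024, 3, 15)  # fixed here so the example is reproducible

# Book borrowed on 1 March and returned on 11 March
returned = return_delay_days(date(2024, 3, 1), date(2024, 3, 11), today)

# Book borrowed on 1 March and not yet returned: today stands in for return_date
outstanding = return_delay_days(date(2024, 3, 1), None, today)

# Missing borrow_date: the 2000-01-01 default is used instead
no_borrow = return_delay_days(None, date(2000, 1, 10), today)
```

As computed, the value is the number of days between borrowing and return (or today), so it measures how long the book was out. Turning it into an overdue figure would need a due date, which these tables do not appear to contain.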

In [0]:
borrowers_silver.write.mode("overwrite").format("delta").option("overwriteSchema", True).saveAsTable("borrowers_silver")

#### We view the table to have a look at our data, limited to 10 rows.

In [0]:
borrowers_silver_df = spark.table("borrowers_silver").limit(10)
borrowers_silver_df.show()


#### This code creates staff_silver from the staff_bronze DataFrame by cleaning and transforming the selected columns.

In [0]:
# Clean staff data
staff_silver = staff_bronze.select(
    col("staff_id").cast("string"),
    col("name"),
    col("role"),
    to_date(col("hire_date")).alias("hire_date")
)


In [0]:
staff_silver.write.mode("overwrite").format("delta").option("overwriteSchema", True).saveAsTable("staff_silver")

#### We view the table to have a look at our data, limited to 10 rows.

In [0]:
staff_silver_df = spark.table("staff_silver").limit(10)
staff_silver_df.show()
