# 03_Silver_Cleaning_and_Privacy
This notebook transforms the raw Bronze data into a higher-quality "Silver" table by cleaning text artifacts, parsing dates, and replacing the patient identifier with a hashed token.

## Architecture Mapping
* **Source:** `safety_signal_catalog.raw_data.bronze_drug_reviews`
* **Destination:** `safety_signal_catalog.raw_data.silver_drug_reviews`

## Transformations
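1.  **Privacy (HIPAA):** The `uniqueID` is hashed with SHA-256 into a `patient_token`. The raw ID is not carried into Silver, but rows with the same ID still share one token, so joins and grouping keep working. The hash is unsalted here; because a plain hash of a guessable ID can be reversed by enumeration, a production pipeline would normally add a secret salt or key.
2.  **Text Cleaning:** The HTML entities `&#039;` and `&quot;` in `review` are replaced with `'` and `"`, producing `clean_review`.
3.  **Data Quality:** `date` strings such as "May 20, 2012" are parsed into a proper date column, `event_date`, for time-series analysis.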
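
For readers who want to see what the three steps do to a single row, here is a minimal plain-Python sketch using only the standard library. It is not the notebook's implementation (that runs in PySpark further down); the helper names `tokenize_id`, `clean_review`, and `parse_event_date` and the sample record are invented for illustration. Note that `html.unescape` decodes every HTML entity, which is broader than the two entities handled in this notebook, and that returning `None` for an unparseable date is an assumption meant to resemble a null result.

```python
import hashlib
import html
from datetime import date, datetime
from typing import Optional


def tokenize_id(unique_id) -> str:
    """SHA-256 hex digest of the ID's string form (unsalted, as in this notebook)."""
    return hashlib.sha256(str(unique_id).encode("utf-8")).hexdigest()


def clean_review(text: str) -> str:
    """Decode HTML entities; html.unescape covers more than &#039; and &quot;."""
    return html.unescape(text)


def parse_event_date(raw: str) -> Optional[date]:
    """Parse 'May 20, 2012' style strings; return None when the format does not match."""
    try:
        return datetime.strptime(raw, "%B %d, %Y").date()
    except ValueError:
        return None


# Hypothetical record, invented for illustration only.
record = {
    "uniqueID": 12345,
    "review": "&quot;It didn&#039;t help&quot;",
    "date": "May 20, 2012",
}

silver = {
    "patient_token": tokenize_id(record["uniqueID"]),
    "clean_review": clean_review(record["review"]),
    "event_date": parse_event_date(record["date"]),
}
print(silver)
```

The token is a 64-character hex string, and the same ID always yields the same token, which is what keeps the tokenized rows joinable.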

#### 1. LOAD BRONZE DATA

In [0]:
from pyspark.sql.functions import col, sha2, to_date, regexp_replace

catalog = "safety_signal_catalog"
schema  = "raw_data"

# Read the Bronze Delta Table
df_bronze = spark.read.table(f"{catalog}.{schema}.bronze_drug_reviews")

print(f"Loaded {df_bronze.count()} records from Bronze.")

#### 2. APPLY TRANSFORMATIONS (Privacy & Cleaning)

In [0]:

print("Applying Silver Transformations...")

df_silver = (df_bronze
    # Hash the ID with SHA-256 (unsalted here; production code would normally add a secret salt)
    .withColumn("patient_token", sha2(col("uniqueID").cast("string"), 256))

    # Replace HTML entities with literal characters (&#039; -> ', &quot; -> ")
    .withColumn("clean_review", regexp_replace(col("review"), "&#039;", "'"))
    .withColumn("clean_review", regexp_replace(col("clean_review"), "&quot;", '"'))

    # Convert string date ("May 20, 2012") -> DateType (2012-05-20)
    .withColumn("event_date", to_date(col("date"), "MMMM d, yyyy"))

    # Keep only the Silver columns; the raw uniqueID, review, and date columns are dropped
    .select(
        "patient_token",
        "drugName",
        "condition",
        "clean_review",
        "rating",
        "event_date",
        "usefulCount"
    )
)

print("Transformations applied. Sample rows:")
display(df_silver.limit(5))
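
The hashing step above is unsalted, and its comment notes that a salt is the usual practice. As a hedged illustration of that idea, the plain-Python sketch below derives the token with HMAC-SHA256 under a secret key, so someone without the key cannot rebuild tokens by hashing candidate IDs. The function name `keyed_token` and the key values are placeholders invented for this example; where the real key is stored and how it would be applied inside Spark are left open.

```python
import hashlib
import hmac


def keyed_token(unique_id, secret_key: bytes) -> str:
    """HMAC-SHA256 of the ID's string form, hex-encoded."""
    message = str(unique_id).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Placeholder key for illustration; a real key would come from a secret store.
DEMO_KEY = b"replace-with-a-managed-secret"

token_a = keyed_token(12345, DEMO_KEY)
token_b = keyed_token(12345, DEMO_KEY)
token_other_key = keyed_token(12345, b"a-different-key")

print(token_a == token_b)          # same ID and same key: tokens match
print(token_a == token_other_key)  # same ID, different key: tokens differ
```

Because the token depends on the key, the same key has to be used on every run for tokens to stay joinable across loads; rotating the key changes every token.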

#### 3. WRITE TO SILVER (Delta Lake)


In [0]:

table_name = f"{catalog}.{schema}.silver_drug_reviews"

print(f"Saving to Silver Table: {table_name}")

(df_silver.write
    .format("delta")
    .mode("overwrite")                  # replace the table contents on each run
    .option("overwriteSchema", "true")  # allow the stored schema to be replaced as well
    .saveAsTable(table_name)
)

print("Silver Table Created!")