# 02_Bronze_Ingestion
This notebook shows how to ingest the raw TSV files from the Landing Zone into a structured **Delta Lake Table** (Bronze Layer). 

## Architecture Mapping
* **Source:** Raw TSV files in `safety_signal_catalog.raw_data.landing_zone`.
* **Destination:** Delta Table `safety_signal_catalog.raw_data.bronze_drug_reviews`.
* **Pattern:** Schema-on-Read with enforcement.

## Technical Decisions
* **Schema Enforcement:** TSV files carry no type metadata, so I defined the schema manually (`StructType`) to prevent data type mismatches (e.g., ensuring `rating` is read as a Double rather than as a string).
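
An explicit schema makes Spark cast each column. Under the CSV reader's default `PERMISSIVE` mode, though, a value that cannot be cast becomes `null` instead of raising an error. The sketch below shows one way to surface such records early by switching the reader to `FAILFAST`. The helper `read_tsv_strict` is hypothetical and not part of this notebook. It assumes an active SparkSession is passed in as `spark`, and whether the real file passes a `FAILFAST` read has not been checked here.

```python
def read_tsv_strict(spark, path, schema, mode="FAILFAST"):
    """Read a tab-separated file with an explicit schema and a chosen parse mode.

    Hypothetical helper: `spark` is an active SparkSession and `schema` is a
    StructType (or a DDL string) describing the expected columns.
    """
    return (
        spark.read
        .format("csv")
        .option("delimiter", "\t")
        .option("header", "true")
        .option("multiLine", "true")
        .option("quote", '"')
        .option("escape", '"')
        # PERMISSIVE (the default) sets malformed fields to null;
        # FAILFAST raises an exception on the first malformed record instead.
        .option("mode", mode)
        .schema(schema)
        .load(path)
    )
```

With the names used later in this notebook, a call might look like `read_tsv_strict(spark, f"/Volumes/{catalog}/{schema}/{volume}/train_data.tsv", drug_schema)`.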

#### 1. SETUP CONFIGURATION & SCHEMA DEFINITION

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

In [0]:
# Setup Context
catalog = "safety_signal_catalog"
schema  = "raw_data"
volume  = "landing_zone"

spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"USE SCHEMA {schema}")

In [0]:
# Define the Schema (Schema-on-Read)
# TSV files are plain text; nothing in the file says that 'rating' is a number.
# Tell Spark exactly what to expect.
drug_schema = StructType([
    StructField("uniqueID", IntegerType(), True),
    StructField("drugName", StringType(), True),
    StructField("condition", StringType(), True),
    StructField("review", StringType(), True),
    StructField("rating", DoubleType(), True),
    StructField("date", StringType(), True), 
    StructField("usefulCount", IntegerType(), True)
])

print("Schema Defined.")

#### 2. READ RAW DATA (Ingestion)

In [0]:
# Read the 'train_data.tsv' file. 
# Set multiLine=True because patient reviews often have newlines.
print("Reading TSV file...")

df_raw = (spark.read
    .format("csv")
    .option("delimiter", "\t")       # It is Tab-Separated, not Comma-Separated
    .option("header", "true")        # The first row contains column names
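    .option("multiLine", "true")     # A review can span several lines inside one quoted field
    .option("quote", "\"")           # Fields containing tabs, newlines, or quotes are wrapped in "
    .option("escape", "\"")          # A literal quote inside a quoted field is written as ""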
    .schema(drug_schema)             # Enforce our strict schema
    .load(f"/Volumes/{catalog}/{schema}/{volume}/train_data.tsv")
)
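
The effect of the `multiLine`, `quote`, and `escape` options can be illustrated without Spark. The sketch below uses Python's standard `csv` module on a made-up two-line sample (the drug name and review text are invented for illustration). It is only an analogy for the parsing rules, not Spark's own parser.

```python
import csv
import io

# One header row plus ONE data row whose review contains a newline and doubled quotes.
raw = (
    "uniqueID\tdrugName\treview\trating\n"
    '1\tExampleDrug\t"Worked well.\nNo ""major"" side effects."\t9.0\n'
)

# Naive line splitting breaks the single record into two physical lines.
naive_lines = raw.splitlines()
print(len(naive_lines))  # 3

# A quote-aware parser keeps the embedded newline inside the review field.
reader = csv.reader(
    io.StringIO(raw, newline=""),
    delimiter="\t",
    quotechar='"',
    doublequote=True,  # "" inside a quoted field means one literal quote
)
rows = list(reader)
print(len(rows))  # 2

# Every parsed value is still a string, which is why the schema is needed.
print(repr(rows[1][2]))
print(type(rows[1][3]).__name__)
```

Without quote-aware, multi-line parsing, the second half of the review would be read as a separate, malformed record.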

#### 3. WRITE TO BRONZE (Delta Lake)

In [0]:
# Save this as a Delta table so it supports ACID transactions.
table_name = "bronze_drug_reviews"
print(f"Saving to Delta Table: {table_name}...")

(df_raw.write
    .format("delta")
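    .mode("overwrite")                    # Replace any existing table so re-runs are idempotent
    .option("overwriteSchema", "true")    # Let the overwrite replace the table schema as well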
    .saveAsTable(table_name)
)

print("Bronze Table Created Successfully.")
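
The success message above is printed whenever the write does not raise. A minimal sketch of a stronger check is to compare row counts between the source DataFrame and the saved table. The helper `assert_row_counts_match` is hypothetical and assumes an active SparkSession is passed in as `spark`.

```python
def assert_row_counts_match(spark, df_source, table_name):
    """Compare the source DataFrame's row count with the saved table's.

    Hypothetical helper: `df_source` is the DataFrame that was written and
    `table_name` is the Delta table it was saved to.
    """
    source_rows = df_source.count()
    table_rows = spark.table(table_name).count()
    if source_rows != table_rows:
        raise ValueError(
            f"Row count mismatch for {table_name}: "
            f"source={source_rows}, table={table_rows}"
        )
    return table_rows
```

In this notebook it could be called as `assert_row_counts_match(spark, df_raw, table_name)`. Note that counting `df_raw` reads the TSV file a second time.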

###### View the bronze table

In [0]:
%sql
-- View the first 5 rows
SELECT * FROM safety_signal_catalog.raw_data.bronze_drug_reviews LIMIT 5