### Bronze Layer - Raw Data Ingestion

### Purpose
The Bronze layer is responsible for ingesting raw data from the source system into the Lakehouse without applying any business logic.

### What happens in this notebook?
- Reads raw transaction data from the source table
- Adds ingestion metadata such as:
  - Ingestion timestamp (`ingest_ts`)
  - Source system name (`source_name`, set to the source table name)
  - Ingestion date (`ingest_date`)
- Writes data into a Delta table in the **Bronze schema**

### Why a Bronze Layer?
- Preserves raw data for traceability
- Enables reprocessing if upstream data changes
- Acts as the foundation for Silver & Gold layers

### Execution
This notebook is parameterized and executed as part of a **scheduled Databricks multi-task job**.
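
As an illustration of how job parameters might reach the notebook, the sketch below reads them through `dbutils.widgets.get` and falls back to defaults when no value is available. The helper `get_param` and the parameter names `source_table` and `target_table` are assumptions made for this example; the defaults mirror the table names used in the cells that follow.

```python
def get_param(name: str, default: str) -> str:
    """Read a notebook parameter; fall back to `default` when it is unavailable."""
    try:
        # `dbutils` is injected by the Databricks runtime and is undefined elsewhere.
        return dbutils.widgets.get(name)
    except Exception:
        # Covers both "not running on Databricks" and "parameter not defined".
        return default


# Hypothetical parameter names; defaults match the tables used in this notebook.
source_table = get_param("source_table", "default.ecommerce_transactions")
target_table = get_param("target_table", "ecom_bronze.transactions_bronze")
```

With this pattern the same notebook can run interactively with the defaults and pick up different table names when a job task supplies them.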


In [0]:
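from pyspark.sql.functions import current_timestamp, lit, to_date

# Source table is read as-is; no business logic is applied at this layer.
SOURCE_TABLE = "default.ecommerce_transactions"
events = spark.table(SOURCE_TABLE)

# Add ingestion metadata so each row can be traced back to its load.
bronze_df = (
    events
    .withColumn("ingest_ts", current_timestamp())
    .withColumn("source_name", lit(SOURCE_TABLE))
    .withColumn("ingest_date", to_date(current_timestamp()))
)

# mode("overwrite") replaces the full contents of the Bronze table on each run.
bronze_df.write.format("delta").mode("overwrite").saveAsTable("ecom_bronze.transactions_bronze")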
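
The write above replaces the whole table on every run. If the intent were instead to keep earlier loads and make a rerun of a single day idempotent, one option is Delta Lake's `replaceWhere` write option. The following is a minimal sketch under that assumption, not the notebook's actual behavior: `replace_where_for` and `write_bronze_slice` are hypothetical helper names, and the sketch assumes every row in the DataFrame carries the same `ingest_date`.

```python
from datetime import date


def replace_where_for(ingest_date: date) -> str:
    """Build a replaceWhere predicate that targets a single ingestion date."""
    return f"ingest_date = '{ingest_date.isoformat()}'"


def write_bronze_slice(df, table_name: str, ingest_date: date) -> None:
    """Overwrite only the rows matching `ingest_date`, leaving other dates intact.

    Assumes all rows in `df` have this ingest_date; Delta checks written rows
    against the predicate and rejects rows that fall outside it.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", replace_where_for(ingest_date))
        .saveAsTable(table_name)
    )
```

A call such as `write_bronze_slice(bronze_df, "ecom_bronze.transactions_bronze", date.today())` would then touch only the current day's rows, provided the date passed in matches the `ingest_date` values Spark computed for the DataFrame.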

In [0]:
# Sanity check: confirm the row count and preview a few rows of the written table.
bronze = spark.table("ecom_bronze.transactions_bronze")
print("Bronze count:", bronze.count())
display(bronze.limit(5))

Bronze count: 50000


Transaction_ID,User_Name,Age,Country,Product_Category,Purchase_Amount,Payment_Method,Transaction_Date,ingest_ts,source_name,ingest_date
1,Ava Hall,63,Mexico,Clothing,780.69,Debit Card,2023-04-14,2026-01-15T16:20:24.516Z,default.ecommerce_transactions,2026-01-15
2,Sophia Hall,59,India,Beauty,738.56,PayPal,2023-07-30,2026-01-15T16:20:24.516Z,default.ecommerce_transactions,2026-01-15
3,Elijah Thompson,26,France,Books,178.34,Credit Card,2023-09-17,2026-01-15T16:20:24.516Z,default.ecommerce_transactions,2026-01-15
4,Elijah White,43,Mexico,Sports,401.09,UPI,2023-06-21,2026-01-15T16:20:24.516Z,default.ecommerce_transactions,2026-01-15
5,Ava Harris,48,Germany,Beauty,594.83,Net Banking,2024-10-29,2026-01-15T16:20:24.516Z,default.ecommerce_transactions,2026-01-15


In [0]:
from pyspark.sql import functions as F

bronze = spark.table("ecom_bronze.transactions_bronze")

# Basic data-quality profile: volume, key uniqueness, nulls, and non-positive amounts.
display(
    bronze.select(
        F.count("*").alias("rows"),
        F.countDistinct("Transaction_ID").alias("distinct_txn_id"),
        # Casting a boolean to int and summing counts the rows where the condition holds.
        F.sum(F.col("Transaction_ID").isNull().cast("int")).alias("null_txn_id"),
        F.sum(F.col("Purchase_Amount").isNull().cast("int")).alias("null_purchase_amount"),
        F.sum(F.col("Transaction_Date").isNull().cast("int")).alias("null_transaction_date"),
        F.sum((F.col("Purchase_Amount") <= 0).cast("int")).alias("invalid_purchase_amount"),
    )
)

rows,distinct_txn_id,null_txn_id,null_purchase_amount,null_transaction_date,invalid_purchase_amount
50000,50000,0,0,0,0
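
Because the notebook runs as a job task, the metrics above could also be checked programmatically so that a bad load fails the task instead of passing silently. The sketch below is one possible way to do that; `failed_checks` is a hypothetical helper, and the `metrics` dictionary is filled with the values from the output above (in the notebook it could be obtained from the aggregated DataFrame with `.first().asDict()`).

```python
def failed_checks(metrics: dict) -> list:
    """Return a description of every data-quality check that did not pass."""
    failures = []
    if not metrics["rows"]:
        failures.append("table is empty")
    if metrics["distinct_txn_id"] != metrics["rows"]:
        failures.append("duplicate Transaction_ID values")
    # Each of these counters is expected to be zero (or None for an empty table).
    for name in (
        "null_txn_id",
        "null_purchase_amount",
        "null_transaction_date",
        "invalid_purchase_amount",
    ):
        if metrics[name]:
            failures.append(f"{name} = {metrics[name]}")
    return failures


# Values copied from the quality-check output above.
metrics = {
    "rows": 50000,
    "distinct_txn_id": 50000,
    "null_txn_id": 0,
    "null_purchase_amount": 0,
    "null_transaction_date": 0,
    "invalid_purchase_amount": 0,
}

problems = failed_checks(metrics)
if problems:
    # Raising an exception marks the notebook task as failed in the job run.
    raise ValueError("Bronze checks failed: " + "; ".join(problems))
```

For the run shown here all checks pass, so no exception is raised.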
