# 📥 Notebook: 00 ETL Bronze Layer

This notebook is the **first stage** of the AI-powered claims processing pipeline and covers the **Bronze Layer (Raw Ingestion)** of the Medallion Architecture. It creates the catalog, schema, and volume if they are missing, then registers the raw file references that downstream stages on the Databricks platform consume.

---

## 🧱 Purpose
To register the raw call audio files stored in a Unity Catalog volume as rows in a Delta Lake table (one row per file, holding its path, name, size, and modification time) so that later pipeline stages can locate and process them.

In [0]:
dbutils.library.restartPython()

In [0]:
%run "./resources/init"

In [0]:
# Create the catalog only if it is not already visible to this session.
catalog_exists = spark.sql(f"SHOW CATALOGS LIKE '{CATALOG}'").count() > 0

if not catalog_exists:
    spark.sql(f"CREATE CATALOG `{CATALOG}`")

# Schema and volume creation are idempotent.
spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{CATALOG}`.`{SCHEMA}`")
spark.sql(f"CREATE VOLUME IF NOT EXISTS `{CATALOG}`.`{SCHEMA}`.`{VOLUME}`")

DataFrame[]

In [0]:
raw_audio_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/raw_recordings/"
# mkdirs creates the directory if needed and is a no-op when it already exists.
dbutils.fs.mkdirs(raw_audio_path)

In [0]:
import pyspark.sql.functions as F

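files = dbutils.fs.ls(raw_audio_path)
if not files:
    raise ValueError(f"No files found in {raw_audio_path}")

file_reference_df = (
    spark.createDataFrame(files)
    # Drop the leading "dbfs:" scheme (5 characters) to get the /Volumes/... path.
    .withColumn("file_path", F.expr("substring(path, 6, length(path))"))
    # Drop the last 4 characters, which assumes a 3-letter extension such as ".m4a".
    .withColumn("file_name", F.expr("substring(name, 1, length(name) - 4)"))
)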

display(file_reference_df)

path,name,size,modificationTime,file_path,file_name
dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/5e7e3k53_AGT002_2025-01-15 13_35_10.m4a,5e7e3k53_AGT002_2025-01-15 13_35_10.m4a,787392,1743602105000,/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/5e7e3k53_AGT002_2025-01-15 13_35_10.m4a,5e7e3k53_AGT002_2025-01-15 13_35_10
dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/ct4m50n5_AGT005_2025-03-01 12_36_07.m4a,ct4m50n5_AGT005_2025-03-01 12_36_07.m4a,939809,1743602105000,/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/ct4m50n5_AGT005_2025-03-01 12_36_07.m4a,ct4m50n5_AGT005_2025-03-01 12_36_07
dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/nv7032f9_AGT001_2025-02-27 12_40_45.m4a,nv7032f9_AGT001_2025-02-27 12_40_45.m4a,993088,1743602105000,/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/nv7032f9_AGT001_2025-02-27 12_40_45.m4a,nv7032f9_AGT001_2025-02-27 12_40_45
dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/pxvlh18a_AGT001_2025-02-11 11_33_33.m4a,pxvlh18a_AGT001_2025-02-11 11_33_33.m4a,1028483,1743602105000,/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/pxvlh18a_AGT001_2025-02-11 11_33_33.m4a,pxvlh18a_AGT001_2025-02-11 11_33_33
dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/ulnocrnh_AGT005_2025-02-04 05_42_51.m4a,ulnocrnh_AGT005_2025-02-04 05_42_51.m4a,1038857,1743602105000,/Volumes/samantha_wise/ai_claims_processing/audio_recordings/raw_recordings/ulnocrnh_AGT005_2025-02-04 05_42_51.m4a,ulnocrnh_AGT005_2025-02-04 05_42_51
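
The two derived columns can be illustrated in plain Python. The sketch below mirrors the Spark expressions used above: `file_path` drops the five-character `dbfs:` prefix, and `file_name` drops the last four characters, which only works for a three-letter extension such as `.m4a`. The helper names `strip_scheme`, `strip_extension_fixed`, and `strip_extension` are hypothetical and not part of the notebook. `strip_extension` uses `os.path.splitext` as one possible alternative that does not depend on the extension length.

```python
import os


def strip_scheme(path: str) -> str:
    # Mirrors substring(path, 6, length(path)); Spark SQL positions are
    # 1-based, so position 6 corresponds to Python index 5.
    return path[5:]


def strip_extension_fixed(name: str) -> str:
    # Mirrors substring(name, 1, length(name) - 4): always drops 4 characters.
    return name[:-4]


def strip_extension(name: str) -> str:
    # Hypothetical alternative: drop whatever follows the last dot.
    return os.path.splitext(name)[0]


path = (
    "dbfs:/Volumes/samantha_wise/ai_claims_processing/audio_recordings/"
    "raw_recordings/5e7e3k53_AGT002_2025-01-15 13_35_10.m4a"
)
name = "5e7e3k53_AGT002_2025-01-15 13_35_10.m4a"

print(strip_scheme(path))
print(strip_extension_fixed(name))
print(strip_extension(name))
```

For `.m4a` files both extension helpers agree. For a longer extension such as `.flac`, the fixed-width version would leave a trailing dot in `file_name`.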


In [0]:
meta_table = f"{CATALOG}.{SCHEMA}.meta_data"
bronze_table = f"{CATALOG}.{SCHEMA}.recordings_file_reference_bronze"

# First run: create the tracking table with every file marked as unprocessed.
if not spark.catalog.tableExists(meta_table):
    metadata_df = file_reference_df.select("file_name").withColumn("processed", F.lit(False))
    metadata_df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(meta_table)

metadata_df = spark.table(meta_table)

if not spark.catalog.tableExists(bronze_table):
    # First run: write the full file listing as the bronze table.
    file_reference_df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(bronze_table)
else:
    # Later runs: append files that are not marked as processed in meta_data.
    new_files_df = file_reference_df.join(metadata_df.filter(F.col("processed")), "file_name", "left_anti")
    if new_files_df.count() > 0:
        new_files_df.select(file_reference_df.columns).write.mode("append").saveAsTable(bronze_table)
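
The append branch relies on a left anti join: it keeps the files from the current listing whose `file_name` has no match among the metadata rows where `processed` is true. A minimal plain-Python sketch of that selection follows. The names `select_new_files`, `listing`, and `metadata`, and the sample file names, are hypothetical and stand in for the notebook's DataFrames.

```python
def select_new_files(listing, metadata):
    """Return listing rows whose file_name is not marked processed in metadata."""
    # Equivalent of metadata_df.filter(processed), reduced to the join key.
    processed = {row["file_name"] for row in metadata if row["processed"]}
    # Equivalent of the left anti join on file_name.
    return [row for row in listing if row["file_name"] not in processed]


# Made-up rows for illustration only.
listing = [
    {"file_name": "call_a", "size": 100},
    {"file_name": "call_b", "size": 200},
    {"file_name": "call_c", "size": 300},
]
metadata = [
    {"file_name": "call_a", "processed": True},
    {"file_name": "call_b", "processed": False},
]

new_files = select_new_files(listing, metadata)
print([row["file_name"] for row in new_files])  # ['call_b', 'call_c']
```

In this sketch `call_b` is selected even though it is already tracked, because only rows with `processed` set to true are excluded. Whether that matches the intended deduplication depends on when downstream stages flip the flag.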

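## ✅ Output
- A Delta table, `recordings_file_reference_bronze`, with one row per raw audio file (`path`, `name`, `size`, `modificationTime`, `file_path`, `file_name`).
- A tracking table, `meta_data`, with a `processed` flag per `file_name`.
- The bronze table serves as the source of truth for all raw audio ingested into the pipeline.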