### Medallion Architecture
##### The Medallion Architecture is a data design pattern that organizes data processing into three layers: Bronze, Silver, and Gold. Each layer refines the output of the previous one, which improves data quality, scalability, and reusability.

### Bronze Layer (Raw ingestion)
##### Source data is ingested as-is (often with Auto Loader).

##### The data may include duplicates, nulls, and other unclean values.

##### It is stored unchanged for audit and traceability.



In [0]:
dbutils.fs.ls('/FileStore')

Out[1]: [FileInfo(path='dbfs:/FileStore/2015_summary.csv', name='2015_summary.csv', size=7080, modificationTime=1742561274000),
 FileInfo(path='dbfs:/FileStore/BigMart_Sales.csv', name='BigMart_Sales.csv', size=869537, modificationTime=1741607242000),
 FileInfo(path='dbfs:/FileStore/Pyspark.ipynb', name='Pyspark.ipynb', size=88247, modificationTime=1741607310000),
 FileInfo(path='dbfs:/FileStore/checkpoints/', name='checkpoints/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/shared_uploads/', name='shared_uploads/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/sonu/', name='sonu/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/table/', name='table/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0, modificationTime=0)]

In [0]:
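# Read the raw CSV; the header row supplies column names and types are inferred.
bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/FileStore/BigMart_Sales.csv")
)

# Persist the raw data unchanged as a Delta table (overwritten on every run).
bronze_df.write.format("delta").mode("overwrite").save("/mnt/bronze/bigmart_sales")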


### Silver Layer
##### Data is filtered, cleaned, and joined here.

##### Performs basic transformations such as:

###### Removing duplicates

###### Converting data types

###### Enriching with lookup tables


In [0]:
bronze_df = spark.read.format("delta").load("/mnt/bronze/bigmart_sales")


In [0]:
from pyspark.sql.functions import expr

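silver_df = (
    bronze_df
    # Drop rows that are missing any of the key measures.
    .dropna(subset=["Item_Weight", "Item_Visibility", "Item_Outlet_Sales"])
    # Scale Item_Visibility by 100 (for example, to read it as a percentage).
    .withColumn("Item_Visibility", expr("Item_Visibility * 100"))
)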

In [0]:
silver_df.write.format("delta").mode("overwrite").save("/mnt/silver/bigmart_sales_cleaned")


### Gold Layer
##### Data is aggregated and modeled for analytics or reporting.

##### Used by BI tools like Power BI, Tableau, or Databricks SQL dashboards.

In [0]:
silver_df = spark.read.format("delta").load("/mnt/silver/bigmart_sales_cleaned")


In [0]:
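from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum

# Make sure the sales column is numeric before aggregating.
silver_df = silver_df.withColumn("Item_Outlet_Sales", F.col("Item_Outlet_Sales").cast("double"))


In [0]:
# Total and average sales per outlet.
gold_df = silver_df.groupBy("Outlet_Identifier").agg(
    F.sum("Item_Outlet_Sales").alias("Total_Sales"),
    F.avg("Item_Outlet_Sales").alias("Average_Sales")
)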

In [0]:
gold_df.write.format("delta").mode("overwrite").save("/mnt/gold/bigmart_sales_metrics")


In [0]:
gold_df = spark.read.format("delta").load("/mnt/gold/bigmart_sales_metrics")
gold_df.display()


Outlet_Identifier,Total_Sales,Average_Sales
OUT046,2118395.168199999,2277.8442668817192
OUT013,2142663.5781999985,2298.995255579397
OUT018,1851822.8300000008,1995.4987392241392
OUT010,188340.17240000013,339.3516619819822
OUT045,2036725.4769999988,2192.3847976318607
OUT035,2268122.935400002,2438.8418660215075
OUT017,2167465.294,2340.67526349892
OUT049,2183969.8102,2348.354634623656
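
##### BI tools usually query named tables rather than storage paths, so the Gold Delta folder can be registered in the metastore once it has been written. The sketch below only builds the SQL statement; `register_delta_table_sql` and the table name are hypothetical, and it assumes a workspace where `CREATE TABLE ... USING DELTA LOCATION` is permitted for this path.

```python
def register_delta_table_sql(table_name, delta_path):
    """Build a statement that exposes an existing Delta path as a named table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{delta_path}'"
    )


# Hypothetical table name; the path is the Gold location written above.
statement = register_delta_table_sql("bigmart_sales_metrics", "/mnt/gold/bigmart_sales_metrics")
print(statement)
```

##### On a cluster, passing the statement to `spark.sql(statement)` would create the table if it does not already exist.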
