## Medallion Architecture: Bronze → Silver → Gold


### Our dataset looks like this

![Medallion architecture overview](../images/di.png)


A **medallion architecture** is a layered design pattern where data quality and structure improve as data flows through successive layers:

- **Bronze:** raw landing
- **Silver:** cleaned and standardized
- **Gold:** curated, business-ready outputs

This is sometimes called a **multi-hop architecture** because data progresses through multiple refinement steps rather than jumping directly from source to final outputs.

![Medallion architecture overview](../images/ma.png)

## Bronze Layer

The Bronze layer stores **raw and unprocessed data** as it arrives from sources. The core value is **traceability**:

- You can go back to the raw records when downstream issues appear
- You can investigate problems at the source level
- You can support auditing and controlled reprocessing


A strict rule in Bronze is: **do not manipulate the data**. If the source data is incomplete or malformed, it remains that way in Bronze. Corrections belong in downstream layers.
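As a minimal sketch of the "land it as-is" rule, the snippet below loads a raw CSV extract into a Bronze table without trimming, casting, or filtering anything. The table name `bronze_crm_cust_info`, the sample columns other than `cst_id`, and the use of an in-memory SQLite database are assumptions made for illustration, not part of the project setup.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract: note the padded name and the row with no cst_id.
RAW_CSV = "cst_id,cst_firstname,cst_gndr\n1,  Ana ,F\n,Bob,\n"

def load_bronze(conn, table, raw_text):
    """Land rows exactly as received: every column is TEXT, nothing is cleaned."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')  # full reload of the raw table
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_bronze(conn, "bronze_crm_cust_info", RAW_CSV)
bronze_rows = conn.execute("SELECT * FROM bronze_crm_cust_info").fetchall()
```

The padded name and the empty `cst_id` survive the load unchanged, which is the point: the defects stay visible in Bronze so they can be traced and fixed downstream.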

## Silver Layer

The Silver layer stores **clean and standardized data**. This is the “heavy lifting” layer for technical quality improvements such as:

- Data cleansing (null handling, duplicates, malformed values)
- Standardization (formats, naming, canonical representations)
- Normalization and conformance where appropriate
- Derived columns and lightweight enrichment


Silver focuses on **technical data quality** and alignment to standards. Complex, use-case-specific business rules are usually deferred to Gold.
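The cleansing and standardization steps above can be sketched in plain Python. This is an illustration under stated assumptions: the column names other than `cst_id`, the code-to-label mappings, the `n/a` default, and the choice to keep the most recent row per customer are all hypothetical and would need to be confirmed against the actual sources.

```python
# Assumed decodings; the real code lists must be confirmed with the source systems.
MARITAL_STATUS = {"S": "Single", "M": "Married"}
GENDER = {"F": "Female", "M": "Male"}

def decode(value, mapping):
    """Map a short code to its label; anything unrecognized becomes 'n/a'."""
    return mapping.get((value or "").strip().upper(), "n/a")

def clean_customers(bronze_rows):
    """Return Silver-style rows: trimmed, decoded, and de-duplicated by cst_id."""
    latest = {}
    for row in bronze_rows:
        cst_id = (row.get("cst_id") or "").strip()
        if not cst_id:
            continue  # assumption: rows without an ID cannot be keyed, so they are skipped
        kept = latest.get(cst_id)
        # Keep the most recent record per customer (ISO dates sort correctly as text).
        if kept is None or row["cst_create_date"] > kept["cst_create_date"]:
            latest[cst_id] = row
    return [
        {
            "cst_id": cst_id,
            "cst_firstname": row["cst_firstname"].strip(),
            "cst_marital_status": decode(row["cst_marital_status"], MARITAL_STATUS),
            "cst_gndr": decode(row["cst_gndr"], GENDER),
            "cst_create_date": row["cst_create_date"],
        }
        for cst_id, row in sorted(latest.items())
    ]

sample = [
    {"cst_id": "1", "cst_firstname": "  Ana ", "cst_marital_status": "s",
     "cst_gndr": "F", "cst_create_date": "2025-01-01"},
    {"cst_id": "1", "cst_firstname": "Ana", "cst_marital_status": "M",
     "cst_gndr": "F ", "cst_create_date": "2025-03-01"},
    {"cst_id": "", "cst_firstname": "Bob", "cst_marital_status": "",
     "cst_gndr": "", "cst_create_date": "2025-02-01"},
]
silver = clean_customers(sample)
```

Nothing here encodes a reporting rule; the function only fixes technical quality, which keeps the boundary with Gold clear.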

## Gold Layer

The Gold layer contains **consumption-ready**, business-aligned outputs intended for reporting, analytics, and other downstream use cases.

Typical Gold work includes:

- Data integration across sources (enterprise or domain views)
- Business logic and rules aligned to reporting needs
- Aggregations and read-optimized structures
- Modeling patterns such as star schema (facts and dimensions) or curated marts


![Gold boundary: business transformations](../images/gold.png)

Gold is where you apply the **business transformations** that consumers need. Basic cleansing belongs in Silver, not in Gold.
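A rough sketch of a Gold object is shown below: a view that integrates two already-clean Silver tables and applies one reporting rule (revenue per country). The table names, columns, and sample values are hypothetical, and SQLite stands in for whatever warehouse engine the project actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_customers (cst_id INTEGER, country TEXT);
CREATE TABLE silver_sales (order_id TEXT, cst_id INTEGER, amount REAL);
INSERT INTO silver_customers VALUES (1, 'Germany'), (2, 'France');
INSERT INTO silver_sales VALUES ('O1', 1, 100.0), ('O2', 1, 50.0), ('O3', 2, 70.0);

-- Gold: integrate Silver tables and aggregate for reporting.
CREATE VIEW gold_revenue_by_country AS
SELECT c.country, COUNT(*) AS orders, SUM(s.amount) AS revenue
FROM silver_sales AS s
JOIN silver_customers AS c ON c.cst_id = s.cst_id
GROUP BY c.country;
""")

report = conn.execute(
    "SELECT country, orders, revenue FROM gold_revenue_by_country ORDER BY country"
).fetchall()
```

The view contains no trimming or code decoding; it assumes Silver has already delivered clean keys and values.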

## Modeling Responsibilities by Layer

- **Bronze:** preserve source structures; do not reshape core models
- **Silver:** keep structures close to sources while improving data quality and standardization
- **Gold:** build the analytics model (star schema, snowflake, curated marts, or aggregated outputs)


## Target Audience and Access by Layer


- **Bronze:** typically restricted to data engineers (raw data is not suitable for broad consumption)
- **Silver:** accessible to data engineers and technical analysts/data scientists who need high-quality data close to sources
- **Gold:** designed for analysts and business consumers who need stable, understandable datasets for reporting and decisions


## Todo: Analyze & Explore the Bronze Layer Data

- Explore each Bronze table using **limited queries** (e.g., top 1000 rows).
- Understand the **content and structure** of the data.
- Identify what each table represents and the meaning of each column.
- Identify whether tables contain **current data, historical data, or transactional data**.
- Identify how tables are connected, both within each source system and across systems (CRM ↔ ERP).
- Determine which columns are valid for joins (ID vs. key columns).
- Identify indirect relationships (e.g., category embedded in product key).
- Check for data quality issues and note which problems exist in the Bronze layer.

**Important:** No transformations should be written at this stage.
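The exploration steps above can be sketched as a small read-only profiling helper. It only runs `SELECT` statements, so it respects the no-transformations rule. The function name `profile_table`, the table `bronze_crm_cust_info`, and the SQLite connection are assumptions for illustration; the table and key names are interpolated directly into the SQL, so they should come from trusted input only.

```python
import sqlite3

def profile_table(conn, table, key, limit=1000):
    """Read-only profiling: limited sample, row count, missing keys, duplicated keys."""
    sample = conn.execute(f"SELECT * FROM {table} LIMIT ?", (limit,)).fetchall()
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    missing = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL OR TRIM({key}) = ''"
    ).fetchone()[0]
    duplicates = conn.execute(
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"WHERE {key} IS NOT NULL AND TRIM({key}) <> '' "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    ).fetchall()
    return {"sample": sample, "rows": total,
            "missing_key": missing, "duplicate_keys": duplicates}

# Hypothetical Bronze table with a duplicated ID and two missing IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_crm_cust_info (cst_id TEXT, cst_firstname TEXT)")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?)",
    [("1", "  Ana "), ("1", "Ana"), ("", "Bob"), ("2", "Cy"), (None, "Dee")],
)
profile = profile_table(conn, "bronze_crm_cust_info", "cst_id", limit=2)
```

The returned counts are findings to record, not fixes; any correction they suggest is written later in Silver.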

### Observations

- Extra spaces in customer names
- Marital status stored as short codes (S/M) or inconsistent text
- Gender stored as short codes (F/M) or inconsistent text
- Duplicate customer rows for the same customer ID
- Customer records with missing `cst_id`
- Product key mixes category + product identifier (not join-ready)
- Category ID format mismatch
- Product line stored as single-letter codes
- Sales order dates stored as integers
- Some sales dates are invalid 
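To illustrate the last two observations, here is a hedged sketch of a validity check for integer-encoded dates. It assumes the integers follow a `YYYYMMDD` layout, which the notes above do not state; the function name `parse_int_date` is hypothetical. A check like this can be used to count invalid values during exploration and later reused when the Silver transformation is written.

```python
from datetime import date, datetime

def parse_int_date(value):
    """Return a date for a YYYYMMDD integer, or None if the value is not a valid date."""
    text = str(value)
    if len(text) != 8 or not text.isdigit():
        return None  # wrong length, negative numbers, zero, None, and similar
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None  # eight digits, but not a real calendar date

# Hypothetical order-date values as they might appear in Bronze.
raw_dates = [20101229, 0, 20101340, 5489]
parsed = [parse_int_date(value) for value in raw_dates]
invalid_count = sum(1 for item in parsed if item is None)
```

Returning `None` instead of raising keeps the check usable for counting bad values across a whole column.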