# Layer: Gold (Business)
**Project:** Lean Logistics Data Pipeline  
**Business Domain:** E-commerce (Olist Dataset)\
**Table Name:** `dm_sellers`

---
## 📑 Notebook Information
| Version | Date | Author | Summary of Changes |
| :--- | :--- | :--- | :--- |
| v1.0 | 2026-02-20 | Tássia Marchito | Consolidated script: Seller and Geolocation join, Business comments, and Tags. |

---
## 🎯 Objectives
This notebook creates the Seller Dimension by enriching seller data with geographic coordinates.
* **Data Enrichment:** Joining `tb_sellers` with aggregated `tb_geolocation` from Silver.
* **Data Quality:** Aggregating geolocation to one row per zip code prefix so the join does not duplicate `cd_seller_id` records.
* **Governance:** Applying standardized prefixes, column comments, and discovery tags for Unity Catalog.
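
To illustrate why the geolocation data is aggregated before the join, here is a minimal plain-Python sketch (no Spark required). The sample rows and the helper names `aggregate_geo` and `left_join_sellers` are hypothetical and exist only for illustration; only the column names come from this notebook. Under those assumptions, it shows that averaging coordinates per zip code prefix leaves one geolocation row per prefix, so a left join returns exactly one row per seller.

```python
from collections import defaultdict
from statistics import mean


def aggregate_geo(geo_rows):
    """Average latitude/longitude per zip code prefix (one entry per prefix)."""
    grouped = defaultdict(list)
    for row in geo_rows:
        grouped[row["cd_geolocation_zip_code_prefix"]].append(
            (row["cd_geolocation_lat"], row["cd_geolocation_lng"])
        )
    return {
        prefix: {
            "vl_latitude": mean(lat for lat, _ in points),
            "vl_longitude": mean(lng for _, lng in points),
        }
        for prefix, points in grouped.items()
    }


def left_join_sellers(sellers, geo_by_prefix):
    """Left join: keep every seller; coordinates are None for unknown prefixes."""
    missing = {"vl_latitude": None, "vl_longitude": None}
    return [
        {**seller, **geo_by_prefix.get(seller["cd_seller_zip_code_prefix"], missing)}
        for seller in sellers
    ]


# Hypothetical sample data: prefix "01001" appears twice in the geolocation rows.
geo_rows = [
    {"cd_geolocation_zip_code_prefix": "01001", "cd_geolocation_lat": -23.0, "cd_geolocation_lng": -46.0},
    {"cd_geolocation_zip_code_prefix": "01001", "cd_geolocation_lat": -24.0, "cd_geolocation_lng": -47.0},
    {"cd_geolocation_zip_code_prefix": "20000", "cd_geolocation_lat": -22.5, "cd_geolocation_lng": -43.25},
]
sellers = [
    {"cd_seller_id": "s1", "cd_seller_zip_code_prefix": "01001"},
    {"cd_seller_id": "s2", "cd_seller_zip_code_prefix": "99999"},
]

joined = left_join_sellers(sellers, aggregate_geo(geo_rows))
```

Joining against the raw, unaggregated rows would instead emit seller `s1` twice, once per matching geolocation row.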

In [0]:
from pyspark.sql.functions import col, current_timestamp, avg

In [0]:
# 1. Configuration
source_sellers = "cat_tm_services_silver.db_logistics.tb_sellers"
source_geo = "cat_tm_services_silver.db_logistics.tb_geolocation"
target_table = "cat_tm_services_gold.db_logistics.dm_sellers"

print(f"🚀 Building {target_table}...")

# 2. Geolocation preparation (aggregate to avoid duplicate zip code prefixes)
df_geo_agg = spark.read.table(source_geo) \
    .groupBy("cd_geolocation_zip_code_prefix") \
    .agg(
        avg("cd_geolocation_lat").alias("vl_latitude"),
        avg("cd_geolocation_lng").alias("vl_longitude")
    )

# 3. Join with sellers
df_sellers = spark.read.table(source_sellers)

dm_sellers = df_sellers.join(
    df_geo_agg, 
    df_sellers.cd_seller_zip_code_prefix == df_geo_agg.cd_geolocation_zip_code_prefix, 
    "left"
).select(
    col("cd_seller_id"),
    col("cd_seller_zip_code_prefix").alias("cd_zip_code"),
    col("nm_seller_city").alias("nm_city"),
    col("nm_seller_state").alias("nm_state"),
    col("vl_latitude"),
    col("vl_longitude")
).withColumn("ts_gold_at", current_timestamp())

# 4. Write the table
dm_sellers.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(target_table)

# 5. Apply governance (tags and comments)
print(f"📝 Applying metadata to {target_table}...")

# Tags
spark.sql(f"ALTER TABLE {target_table} SET TAGS ('quality' = 'gold', 'domain' = 'logistics', 'type' = 'dimension')")

# Comments
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN cd_seller_id COMMENT 'Unique identifier for the seller'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN cd_zip_code COMMENT 'Seller zip code prefix'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN nm_city COMMENT 'City name of the seller'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN nm_state COMMENT 'State abbreviation of the seller'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_latitude COMMENT 'Average latitude for the seller zip code'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_longitude COMMENT 'Average longitude for the seller zip code'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN ts_gold_at COMMENT 'Timestamp of Gold layer processing'")

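# Constraints
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN cd_seller_id SET NOT NULL")
try:
    spark.sql(f"ALTER TABLE {target_table} ADD CONSTRAINT pk_dm_sellers PRIMARY KEY(cd_seller_id) RELY")
except Exception as e:
    # Report the failure (e.g. if the constraint already exists) instead of silently ignoring it
    print(f"⚠️ Could not add primary key constraint: {e}")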

print(f"✅ Table {target_table} is complete!")
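
The repeated `ALTER COLUMN ... COMMENT` calls in step 5 could also be generated from a single dictionary, which keeps the column descriptions in one place. The sketch below is one possible approach, not the notebook's actual implementation: the helper `build_comment_statements` is a hypothetical name, and it assumes comments contain no single quotes (it raises an error otherwise rather than guessing at an escaping rule). It only builds the SQL strings, so it runs without a Spark session.

```python
def build_comment_statements(table_name, column_comments):
    """Return one ALTER TABLE ... ALTER COLUMN ... COMMENT statement per column."""
    statements = []
    for column, comment in column_comments.items():
        # Assumption: comments are plain text without single quotes.
        if "'" in comment:
            raise ValueError(f"Comment for {column} contains a single quote")
        statements.append(
            f"ALTER TABLE {table_name} ALTER COLUMN {column} COMMENT '{comment}'"
        )
    return statements


# Column descriptions copied from the notebook's step 5.
column_comments = {
    "cd_seller_id": "Unique identifier for the seller",
    "cd_zip_code": "Seller zip code prefix",
    "nm_city": "City name of the seller",
    "nm_state": "State abbreviation of the seller",
    "vl_latitude": "Average latitude for the seller zip code",
    "vl_longitude": "Average longitude for the seller zip code",
    "ts_gold_at": "Timestamp of Gold layer processing",
}

statements = build_comment_statements(
    "cat_tm_services_gold.db_logistics.dm_sellers", column_comments
)
```

In the notebook, each generated statement could then be executed with `spark.sql(statement)` inside a loop.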