# Layer: Gold (Business)
**Project:** Lean Logistics Data Pipeline  
**Business Domain:** E-commerce (Olist Dataset)\
**Table Name:** `ft_sales`

---
## 📑 Notebook Information
| Version | Date | Author | Summary of Changes |
| :--- | :--- | :--- | :--- |
| v1.0 | 2026-02-20 | Tássia Marchito | Initial creation of Sales Fact table (`ft_sales`) with delivery KPIs. |
| v1.1 | 2026-02-20 | Tássia Marchito | Refactored primary keys to `id_` prefix and added full column metadata. |

---
## 🎯 Objectives
This notebook assembles the central Fact table at order-item grain, aggregating sales metrics and calculating business performance KPIs.
* **Refined Keys:** Standardizes IDs with the `id_` prefix for easier downstream consumption.
* **Metric Aggregation:** Combines order items, total payments per order, and review scores.
* **KPI Calculation:** Derives delivery lead time and estimated vs. actual delivery performance, both in days (positive performance means early, negative means late).
* **Full Governance:** Documents every column with a comment and applies discovery tags for Unity Catalog.
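
The aggregation and KPI logic described above can be sketched in plain Python, without Spark, to make the sign convention explicit. This is a minimal illustration, not the notebook's implementation: the helper names `total_per_order` and `delivery_kpis` and the sample rows are hypothetical, and the day differences are whole calendar days, which is how Spark's `datediff(end, start)` behaves.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; the values are illustrative, not taken from the Olist data.
payments = [
    {"cd_order_id": "o1", "vl_payment_value": 50.0},
    {"cd_order_id": "o1", "vl_payment_value": 25.5},
    {"cd_order_id": "o2", "vl_payment_value": 10.0},
]


def total_per_order(rows):
    """Sum payment values per order so each order yields exactly one total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["cd_order_id"]] += row["vl_payment_value"]
    return dict(totals)


def delivery_kpis(purchased, delivered, estimated):
    """Return (nr_days_to_deliver, nr_days_delivery_performance).

    Performance is estimated minus delivered, so a positive value means the
    order arrived early and a negative value means it arrived late.
    """
    if delivered is None:
        # Undelivered orders have no delivery KPIs (Spark would yield NULL).
        return None, None
    return (delivered - purchased).days, (estimated - delivered).days


totals = total_per_order(payments)
early = delivery_kpis(date(2018, 1, 2), date(2018, 1, 10), date(2018, 1, 15))
late = delivery_kpis(date(2018, 1, 2), date(2018, 1, 20), date(2018, 1, 15))
```

Summing payments per order before joining keeps one total per order, so the join to order items does not multiply rows when an order was paid in several installments.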

In [0]:
from pyspark.sql.functions import col, current_timestamp, datediff, sum as _sum

In [0]:
# 1. Source and target configuration
silver_db = "cat_tm_services_silver.db_logistics"
target_table = "cat_tm_services_gold.db_logistics.ft_sales"

print(f"🚀 Starting the build for {target_table}...")

# 2. Load the Silver tables
df_orders = spark.read.table(f"{silver_db}.tb_orders")
df_items = spark.read.table(f"{silver_db}.tb_order_items")
df_payments = spark.read.table(f"{silver_db}.tb_order_payments")
df_reviews = spark.read.table(f"{silver_db}.tb_order_reviews")

# 3. Pre-aggregate payments (one total per order, so the join below does not multiply rows)
df_pay_agg = df_payments.groupBy("cd_order_id").agg(_sum("vl_payment_value").alias("vl_total_order"))

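# 4. Build the fact table (one row per order item)
# Note: this assumes tb_order_reviews holds at most one review per order;
# if an order has several reviews, the left join would duplicate its item rows.
ft_sales = df_items.join(df_orders, "cd_order_id", "inner") \
    .join(df_pay_agg, "cd_order_id", "left") \
    .join(df_reviews.select("cd_order_id", "vl_review_score"), "cd_order_id", "left")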

# 5. Column selection, renaming, and KPIs
ft_sales_final = ft_sales.select(
    col("cd_order_id").alias("id_order"),
    col("cd_customer_id").alias("id_customer"),
    col("cd_product_id").alias("id_product"),
    col("cd_seller_id").alias("id_seller"),
    col("ts_order_purchase"),
    col("ts_order_delivered_customer_date").alias("ts_order_delivered_customer"),
    col("dt_order_estimated_delivery"),
    col("vl_price"),
    col("vl_freight_value"),
    col("vl_total_order"),
    col("vl_review_score"),
    datediff(col("ts_order_delivered_customer_date"), col("ts_order_purchase")).alias("nr_days_to_deliver"),
    datediff(col("dt_order_estimated_delivery"), col("ts_order_delivered_customer_date")).alias("nr_days_delivery_performance")
).withColumn("ts_gold_at", current_timestamp())

# 6. Write the table
ft_sales_final.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(target_table)

# 7. Governance and full metadata
print(f"📝 Applying tags and full metadata to {target_table}...")

# Table Tags
spark.sql(f"ALTER TABLE {target_table} SET TAGS ('quality' = 'gold', 'domain' = 'logistics', 'type' = 'fact')")

# Column Comments (full data dictionary)
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_order COMMENT 'Unique identifier for the order'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_customer COMMENT 'Unique identifier for the customer'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_product COMMENT 'Unique identifier for the product'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_seller COMMENT 'Unique identifier for the seller'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN ts_order_purchase COMMENT 'Timestamp of when the order was placed'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN ts_order_delivered_customer COMMENT 'Timestamp of actual delivery to customer'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN dt_order_estimated_delivery COMMENT 'Estimated delivery date informed at purchase'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_price COMMENT 'Item price'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_freight_value COMMENT 'Shipping cost for the item'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_total_order COMMENT 'Total order value (sum of all payments for the order id)'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN vl_review_score COMMENT 'Customer satisfaction score (1 to 5)'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN nr_days_to_deliver COMMENT 'Number of days between purchase and actual delivery'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN nr_days_delivery_performance COMMENT 'Days difference: estimated vs actual (positive is early, negative is late)'")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN ts_gold_at COMMENT 'Processing timestamp in the Gold layer'")

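# Constraints (PK)
# Every primary key column must be NOT NULL before the constraint can be added.
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_order SET NOT NULL")
spark.sql(f"ALTER TABLE {target_table} ALTER COLUMN id_product SET NOT NULL")
try:
    # The constraint is informational (not enforced); RELY assumes (id_order, id_product) is unique.
    spark.sql(f"ALTER TABLE {target_table} ADD CONSTRAINT pk_ft_sales PRIMARY KEY(id_order, id_product) RELY")
except Exception as e:
    # For example, the constraint may already exist on a rerun; report it instead of hiding it.
    print(f"PK constraint not added: {e}")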

print(f"✅ Fact Table {target_table} is now fully documented!")