# Silver Fact Order

**Table type:** Fact (sales detail)

**Source:** `orders_header`

**Target:** `fact_order`

## Reading Parquet data from Bronze

In [None]:
from pyspark.sql import SparkSession
from dotenv import load_dotenv
import os

load_dotenv("/home/jovyan/work/.env")
spark = SparkSession.builder.appName("silver_order").getOrCreate()

bronze_path = os.getenv("BRONZE_PATH")
silver_path = os.getenv("SILVER_PATH")
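
`os.getenv` returns `None` when a variable is missing, and the later `os.path.join(bronze_path, ...)` call would then fail with a less obvious `TypeError`. As a minimal sketch (the helper name `require_env` and the example paths are hypothetical, not part of the notebook), the paths could be checked up front:

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable or fail with a clear message."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise KeyError(f"Environment variable {name} is not set; check the .env file")
    return value

# Demo with an explicit mapping so the example does not depend on the real environment.
# These paths are hypothetical placeholders.
example_env = {"BRONZE_PATH": "/data/bronze", "SILVER_PATH": "/data/silver"}
bronze_example = require_env("BRONZE_PATH", example_env)
print(os.path.join(bronze_example, "orders_header"))
```

In the notebook itself this would be called as `require_env("BRONZE_PATH")` after `load_dotenv(...)`.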

A new Spark DataFrame is created from the Bronze tables.

In [None]:
# Read the order headers from Bronze
directory_path = os.path.join(bronze_path, "orders_header")
bronze_df = spark.read.parquet(directory_path)

# Show the schema and the row count
bronze_df.printSchema()
bronze_df.count()

root
 |-- order_id: integer (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- canal: string (nullable = true)
 |-- tipo_pago: string (nullable = true)
 |-- flag_promo_pedido: integer (nullable = true)
 |-- order_net_amount: double (nullable = true)
 |-- order_gross_amount: double (nullable = true)
 |-- currency: string (nullable = true)
 |-- ingestion_timestamp: timestamp (nullable = true)
 |-- source_file: string (nullable = true)



2810000

In [4]:
# Create a staging temp view for cleaning,
# so that SQL can be used in the standardization steps
bronze_df.createOrReplaceTempView("temp_orders_header")

In [None]:
# Dimension tables used for lookups
customer_path = os.path.join(silver_path, "dim_customer.parquet")
spark.read.parquet(customer_path).createOrReplaceTempView("dim_customer")

## Silver cleaning
### Cleaning rules applied:
* Correct data types
* Duplicates removed by `order_id` (the query keeps the row with the latest `order_date`)
* Rows with a non-positive `order_net_amount` are dropped
* Avoid `*` in SELECT statements

### Transformation rules:
* Missing `order_id` and `customer_id` values are replaced with the dummy value `0` for consistency
* Column names standardized to a single language
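
The deduplication in the query relies on `ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC)` and keeps the row numbered 1. The sketch below mirrors that logic on plain Python dictionaries, purely to illustrate which row survives; the function name `dedupe_latest` and the sample rows are made up for the example. When two rows tie on `order_date`, this sketch keeps the first one seen, whereas Spark makes no guarantee about which tied row gets number 1.

```python
from datetime import datetime

def dedupe_latest(rows, key="order_id", order_by="order_date"):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())

# Hypothetical sample data: order 1 appears twice, order 2 has no net amount.
rows = [
    {"order_id": 1, "order_date": datetime(2024, 1, 1, 10), "order_net_amount": 100.0},
    {"order_id": 1, "order_date": datetime(2024, 1, 1, 12), "order_net_amount": 100.0},
    {"order_id": 2, "order_date": datetime(2024, 1, 2, 9), "order_net_amount": 0.0},
]

deduped = dedupe_latest(rows)
# Second filter from the query: keep only orders with a positive net amount.
cleaned = [r for r in deduped if r["order_net_amount"] > 0]
print(len(rows), len(deduped), len(cleaned))
```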

In [13]:
# ORDER HEADERS
query = """
WITH cte_order_header AS (
    SELECT MD5(concat_ws('||', order_id, customer_id, order_date, channel, pay_type, currency)) AS order_key
        , COALESCE(order_id, 0) AS order_id
        , COALESCE(customer_id, 0) AS customer_id
        , order_date
        , channel
        , pay_type
        , currency
        , flag_promo_pedido
        , order_net_amount
        , order_gross_amount
    FROM (
        SELECT  CAST(order_id AS INT) AS order_id
            , CAST(customer_id AS INT) AS customer_id
            , CAST(order_date AS DATE) AS order_date
            , TRIM(CAST(canal AS VARCHAR(30))) AS channel
            , TRIM(CAST(tipo_pago AS VARCHAR(30))) AS pay_type
            , CAST(flag_promo_pedido AS INT) AS flag_promo_pedido
            , CAST(order_net_amount AS NUMERIC(18,2)) AS order_net_amount
            , CAST(order_gross_amount AS NUMERIC(18,2)) AS order_gross_amount
            , TRIM(CAST(currency AS VARCHAR(5))) AS currency
            , ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY order_date DESC) AS RW
        FROM temp_orders_header
    ) AS A
    WHERE RW = 1
    AND order_net_amount > 0
)
SELECT cte.order_key
    , cte.order_id
    , cus.customer_id AS customer_key
    , CAST(date_format(cte.order_date, 'yyyyMMdd') AS INT) AS date_key
    , cte.channel AS channel_name
    , cte.pay_type
    , cte.currency
    , cte.order_net_amount
    , cte.order_gross_amount
FROM cte_order_header AS cte
LEFT JOIN dim_customer AS cus
    ON cte.customer_id = cus.customer_id
ORDER BY cte.order_id
"""

# Execute the SQL query and get the result as a new DataFrame
orders_header_df = spark.sql(query)

# Display the results
orders_header_df.printSchema()
orders_header_df.count()

root
 |-- order_key: string (nullable = false)
 |-- order_id: integer (nullable = false)
 |-- customer_key: integer (nullable = true)
 |-- date_key: integer (nullable = true)
 |-- channel_name: string (nullable = true)
 |-- pay_type: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- order_net_amount: decimal(18,2) (nullable = true)
 |-- order_gross_amount: decimal(18,2) (nullable = true)



1405000
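
To make the two derived columns concrete, here is a rough Python equivalent of `order_key` and `date_key`. It assumes that Spark renders integers and `DATE` values as strings the same way `str()` does in Python (`2024-03-15` for dates) and it has not been checked against the notebook's data, so treat it as an illustration only. One behavior worth knowing: `concat_ws` skips NULL inputs, so two rows whose NULLs sit in different columns can produce the same key.

```python
import hashlib
from datetime import date

def order_key(order_id, customer_id, order_date, channel, pay_type, currency):
    """Approximate MD5(concat_ws('||', ...)) from the query."""
    parts = [order_id, customer_id, order_date, channel, pay_type, currency]
    # concat_ws skips NULL inputs, so None values are dropped, not rendered.
    text = "||".join(str(p) for p in parts if p is not None)
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def date_key(order_date):
    """Approximate CAST(date_format(order_date, 'yyyyMMdd') AS INT)."""
    return int(order_date.strftime("%Y%m%d"))

# Hypothetical sample values.
d = date(2024, 3, 15)
print(order_key(1, 2, d, "web", "card", "CLP"))
print(date_key(d))

# Collision caused by skipped NULLs: channel missing vs. pay_type missing.
same = order_key(1, 2, d, None, "web", "CLP") == order_key(1, 2, d, "web", None, "CLP")
print(same)
```

If that collision matters, one option is to wrap each column in `COALESCE(col, '')` before hashing so that every position is always present.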

## Writing the data
* Everything is written as Parquet, in the Silver folders.
* The cell below writes with `mode("overwrite")`; an upsert strategy is intended for later bulk ingestions.
* SCD could be implemented to preserve historical changes in dimensions.

**Improvement:** use Delta tables to control merges and keep a transaction log (ACID).
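
For readers unfamiliar with the term, an upsert updates rows whose key already exists in the target and inserts the rest. The sketch below shows only those semantics on in-memory dictionaries, keyed by `order_key`; it is not how the notebook writes data and not a Delta Lake implementation, and the function name `upsert` and the sample rows are hypothetical.

```python
def upsert(target_rows, source_rows, key="order_key"):
    """Merge source into target: update matching keys, insert new ones."""
    merged = {row[key]: row for row in target_rows}
    for row in source_rows:
        # Matched key: the source row replaces the target row.
        # Unmatched key: the source row is added.
        merged[row[key]] = row
    return list(merged.values())

# Hypothetical sample data.
target = [
    {"order_key": "a", "order_net_amount": 10},
    {"order_key": "b", "order_net_amount": 20},
]
source = [
    {"order_key": "b", "order_net_amount": 25},  # updates "b"
    {"order_key": "c", "order_net_amount": 30},  # inserts "c"
]

result = upsert(target, source)
print(result)
```

A plain Parquet `overwrite` replaces the whole folder instead, which is why a table format with merge support is listed as an improvement.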

In [None]:
# Write fact_order to Silver as Parquet
output_path = os.path.join(silver_path, "fact_order.parquet")

orders_header_df.write.mode("overwrite").parquet(output_path)

## Validations

In [16]:
# 1. Rows computed for Silver = rows written to Parquet
n_rows_silver = orders_header_df.count()
n_rows_bronze = bronze_df.count()  # not compared: cleaning removes rows from Bronze
n_rows_silver_parquet = spark.read.parquet(output_path).count()

print(f'Silver row count: {n_rows_silver}')
print(f'Silver Parquet row count: {n_rows_silver_parquet}')

if n_rows_silver != n_rows_silver_parquet:
    raise Exception("Validation error: fact row count does not match the rows written to Parquet")
else:
    print('Table data validated.')


Silver row count: 1405000
Silver Parquet row count: 1405000
Table data validated.
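
Matching row counts do not show that `order_key` is unique or that every order found its customer in the `LEFT JOIN`. A possible extra check is sketched below on plain Python rows; the function name `validate_fact_rows` and the sample data are hypothetical, and on the real table the same idea would be expressed with Spark aggregations.

```python
from collections import Counter

def validate_fact_rows(rows, key="order_key", fk="customer_key"):
    """Report duplicate keys and rows whose foreign key is missing."""
    key_counts = Counter(r[key] for r in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    orphans = sum(1 for r in rows if r[fk] is None)
    return {"rows": len(rows), "duplicate_keys": duplicates, "orphan_rows": orphans}

# Hypothetical sample data: one repeated key and one order without a customer match.
sample = [
    {"order_key": "k1", "customer_key": 10},
    {"order_key": "k2", "customer_key": None},
    {"order_key": "k1", "customer_key": 11},
]

report = validate_fact_rows(sample)
print(report)
```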


In [17]:
summary_stats = spark.read.parquet(output_path).describe()
summary_stats.show()

+-------+--------------------+------------------+------------------+--------------------+------------+--------+--------+-----------------+------------------+
|summary|           order_key|          order_id|      customer_key|            date_key|channel_name|pay_type|currency| order_net_amount|order_gross_amount|
+-------+--------------------+------------------+------------------+--------------------+------------+--------+--------+-----------------+------------------+
|  count|             1405000|           1405000|           1405000|             1405000|     1405000| 1405000| 1405000|          1405000|           1405000|
|   mean|5.107322637177585...|          702500.5| 5474.031326690391|2.0246453561268326E7|        NULL|    NULL|    NULL|    159507.677985|     166806.662408|
| stddev|                NULL|405588.70844325377|3145.1389739658484|   5160.631140377572|        NULL|    NULL|    NULL|70844.50706689163| 73805.22091492597|
|    min|00002b0824548b3af...|                 1|   