# Silver Dim_product

**Table type:** Institution product master

**Source:** `bronze_products`

**Target:** `silver_dim_product`

## Reading Parquet data from Bronze

In [1]:
from pyspark.sql import SparkSession
from dotenv import load_dotenv
import os

load_dotenv("/home/jovyan/work/.env")
spark = SparkSession.builder.appName("silver_products").getOrCreate()


A new Spark DataFrame is created from the Bronze tables.

In [2]:
bronze_path = os.getenv("BRONZE_PATH")
silver_path = os.getenv("SILVER_PATH")

directory_path = os.path.join(bronze_path, "product")
bronze_df = spark.read.parquet(directory_path)

# Display the schema and the first rows
bronze_df.printSchema()
bronze_df.show(5)

root
 |-- product_id: integer (nullable = true)
 |-- product_category: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- ingestion_timestamp: timestamp (nullable = true)
 |-- source_file: string (nullable = true)

+----------+----------------+------------+--------------------+--------------------+--------------------+
|product_id|product_category|       brand|        sub_category| ingestion_timestamp|         source_file|
+----------+----------------+------------+--------------------+--------------------+--------------------+
|         1|        Confites|   Ambrosoli| Chocolates surtidos|2025-12-04 14:49:...|file:///home/jovy...|
|         2|       Tortillas|Pancho Villa|   Tortillas de maíz|2025-12-04 14:49:...|file:///home/jovy...|
|         3|           Jugos|       Calaf|Jugos listos para...|2025-12-04 14:49:...|file:///home/jovy...|
|         4|        Confites|       Calaf|           Pastillas|2025-12-04 14:49:...|file:
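
One caveat for the path setup in this cell: `os.getenv` returns `None` when a variable is missing, and `os.path.join(None, "product")` then fails with a `TypeError` that does not name the variable. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper: the original notebook calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In the cell above it could replace the direct lookups, for example `bronze_path = require_env("BRONZE_PATH")`.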

In [3]:
# Create a temporary view for cleaning,
# so that SQL can be used in the standardization steps
bronze_df.createOrReplaceTempView("temp_dim_product")

## Silver cleaning
### Cleaning rules applied:
* Correct data types
* Duplicate removal
* Avoid `*` in SELECT statements

### Transformation rules:
* Deduplicate by product_id
* A dummy record (product_id 0) is added so that references to non-existent products stay consistent
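
The rules above can be illustrated without Spark. The following is a minimal pure-Python sketch, assuming rows arrive as dictionaries with the Bronze column names. `clean_products`, `_trim`, and `DUMMY_ROW` are hypothetical names introduced for illustration, and keeping the first row seen per `product_id` is an assumption, since the rules do not say which duplicate wins.

```python
# Illustrative sketch only: mirrors the cleaning rules in plain Python.
DUMMY_ROW = {
    "product_id": 0,
    "product_category": "Sin Categoria",
    "product_sub_category": "Sin Subcategoria",
    "product_brand": "Sin Marca",
}


def _trim(value):
    """Strip surrounding whitespace, leaving None untouched."""
    return value.strip() if value is not None else None


def clean_products(bronze_rows):
    """Rename columns, trim strings, deduplicate by product_id, add the dummy row."""
    kept = {}
    for row in bronze_rows:
        key = row["product_id"]
        if key in kept:
            # Assumption: the first row seen for a product_id is the one kept.
            continue
        kept[key] = {
            # A missing product_id maps to 0, the same key as the dummy row.
            "product_id": 0 if key is None else key,
            "product_category": _trim(row["product_category"]),
            "product_sub_category": _trim(row["sub_category"]),
            "product_brand": _trim(row["brand"]),
        }
    rows = list(kept.values()) + [dict(DUMMY_ROW)]
    return sorted(rows, key=lambda r: r["product_id"])
```

Note that a Bronze row with a missing `product_id` would end up sharing key 0 with the dummy record, so such rows are worth checking for before loading.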

In [4]:
# Data-cleaning query for Silver
query = """
SELECT product_id
    , product_category
    , product_sub_category
    , product_brand
FROM (
    SELECT COALESCE(product_id, 0) AS product_id
        , TRIM(CAST(product_category AS VARCHAR(100))) AS product_category
        , TRIM(CAST(sub_category AS VARCHAR(100))) AS product_sub_category
        , TRIM(CAST(brand AS VARCHAR(50))) AS product_brand
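        -- Order by ingestion time so the row kept per product_id is deterministic (latest wins)
        , ROW_NUMBER() OVER(PARTITION BY product_id ORDER BY ingestion_timestamp DESC) AS RW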
    FROM temp_dim_product
)
WHERE RW = 1
UNION ALL
SELECT 0 AS product_id
    , 'Sin Categoria' AS product_category
    , 'Sin Subcategoria' AS product_sub_category
    , 'Sin Marca' AS product_brand
ORDER BY product_id ASC
"""

# Execute the SQL query and get the result as a new DataFrame
sql_result_df = spark.sql(query)

# Display the results
sql_result_df.printSchema()
sql_result_df.show(20)

root
 |-- product_id: integer (nullable = false)
 |-- product_category: string (nullable = true)
 |-- product_sub_category: string (nullable = true)
 |-- product_brand: string (nullable = true)

+----------+----------------+--------------------+-------------+
|product_id|product_category|product_sub_category|product_brand|
+----------+----------------+--------------------+-------------+
|         0|   Sin Categoria|    Sin Subcategoria|    Sin Marca|
|         1|        Confites| Chocolates surtidos|    Ambrosoli|
|         2|       Tortillas|   Tortillas de maíz| Pancho Villa|
|         3|           Jugos|Jugos listos para...|        Calaf|
|         4|        Confites|           Pastillas|        Calaf|
|         5|          Snacks|  Barritas de cereal|    Ambrosoli|
|         6|        Confites|   Caramelos blandos|    Ambrosoli|
|         7|          Snacks|              Nachos| Pancho Villa|
|         8|           Jugos|Jugos listos para...|        Calaf|
|         9|          Sna

## Writing data
* Everything is written as Parquet, in the Silver folders.
* Writes use overwrite mode.
* An SCD (slowly changing dimension) approach could be adopted to preserve historical changes in dimensions.
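
To make the SCD option concrete, here is a minimal pure-Python sketch of a Type 2 merge, in which a changed product closes its old version and opens a new one instead of being overwritten. It is not the notebook's implementation: `apply_scd2` and the `valid_from`, `valid_to`, and `is_current` columns are hypothetical names chosen for illustration.

```python
# Attributes whose changes should produce a new version (assumed set).
TRACKED = ("product_category", "product_sub_category", "product_brand")


def apply_scd2(current_rows, incoming_rows, load_date):
    """Return the dimension history after applying one incoming snapshot."""
    incoming = {r["product_id"]: r for r in incoming_rows}
    result = []
    open_ids = set()  # product_ids that still have an open (current) version
    for row in current_rows:
        new = incoming.get(row["product_id"])
        changed = (
            row["is_current"]
            and new is not None
            and any(row[c] != new[c] for c in TRACKED)
        )
        if changed:
            # Close the old version instead of overwriting it.
            result.append({**row, "valid_to": load_date, "is_current": False})
        else:
            result.append(dict(row))
            if row["is_current"]:
                open_ids.add(row["product_id"])
    for pid, new in incoming.items():
        if pid not in open_ids:
            # New product, or a changed one whose old version was just closed.
            result.append({
                "product_id": pid,
                **{c: new[c] for c in TRACKED},
                "valid_from": load_date,
                "valid_to": None,
                "is_current": True,
            })
    return result
```

Applying the same snapshot twice leaves the history unchanged, which is a useful property for reruns. With overwrite mode, by contrast, only the latest state of each product is kept.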

In [5]:
# Write dim_product to Silver as Parquet
output_path = os.path.join(silver_path, "dim_product.parquet")

sql_result_df.write.mode("overwrite").parquet(output_path)

# If needed, the logic can be changed to an SCD approach to preserve history

## Validations

In [6]:
# 1. Source rows = target rows
n_rows_silver = sql_result_df.count() - 1  # subtract the dummy record
n_rows_bronze = bronze_df.count()
print(f'Silver row count: {n_rows_silver}')
print(f'Bronze row count: {n_rows_bronze}')

if n_rows_silver != n_rows_bronze:
    raise Exception("Validation error: loaded row counts do not match")
else:
    print('Table data validated.')

# The difference of 1 is due to the dummy record

Silver row count: 400
Bronze row count: 400
Table data validated.
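
A row-count comparison does not by itself show that `product_id` is a valid key. As a complementary check, the sketch below verifies that keys are non-null and unique and that the dummy record is present. `check_product_keys` is a hypothetical helper that takes a plain list of ids so it can be run without Spark.

```python
from collections import Counter


def check_product_keys(product_ids, dummy_id=0):
    """Raise ValueError if keys are missing, duplicated, or the dummy row is absent."""
    if any(pid is None for pid in product_ids):
        raise ValueError("NULL product_id found")
    counts = Counter(product_ids)
    duplicates = sorted(pid for pid, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate product_id values: {duplicates}")
    if counts[dummy_id] == 0:
        raise ValueError(f"Dummy row with product_id {dummy_id} is missing")
```

In this notebook the ids could be gathered with `[row["product_id"] for row in sql_result_df.select("product_id").collect()]`, which is reasonable for a small dimension such as this one.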
