![Data Profiling flow](asset/raw_dataprofiling_01.png)

**Data Profiling** aims to build a thorough understanding of a dataset in its initial state (**RAW / PRE**), before any transformation, modeling, or advanced analytics is applied.

Here the process is applied to the file **`dirty_cafe_sales.csv`**, as part of the project:

**📦 CoffeeSales – Data Profiling (RAW PRE)**

This analysis is meant to:

- Understand the **actual structure** of the dataset as received.
- Identify **data quality problems** (nulls, duplicates, invalid values).
- Assess the **consistency of types and formats**.
- Detect possible **business rule violations**.
- Determine whether the dataset is fit to move on to later pipeline stages (**GO / NO-GO decision**).

The process runs on the **RAW PRE** layer, with no transformations applied, so that the diagnosis faithfully reflects the original state of the data.

---

## 🧪 Queries used in the exploration
**Project:** CoffeeSales – Data Profiling (RAW PRE)
**Dataset analyzed:** `dirty_cafe_sales.csv`

The following SQL queries provide a structured, repeatable exploration of the dataset:

- **Initial view of the dataset**
  _Shows the first rows to get a feel for the structure, data types, and typical values._

- **Total record volume**
  _Confirms the total number of rows in `dirty_cafe_sales.csv`._

- **Null count per column**
  _Measures the presence and distribution of null values in each field._

- **Duplicate record detection**
  _Identifies repeated rows or keys that could compromise the integrity of the analysis._

- **Uniqueness and cardinality of key columns**
  _Checks whether certain fields fulfil their role as identifiers or dimensions._

- **Date format validation**
  _Checks consistency in date and timestamp fields._

- **Invalid or out-of-domain values**
  _Detects values such as `NaN`, `UNKNOWN`, `ERROR`, or similar inconsistencies._

- **Text field length analysis**
  _Evaluates lengths to spot truncation, empty values, or design problems._

- **Detection of whitespace and unwanted characters**
  _Identifies blank spaces, line breaks, or other dirty characters._

- **Overall dataset summary**
  _Consolidates key metrics such as totals, date ranges, and distinct values._
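Two of the checks listed above, date format validation and whitespace detection, can be sketched in plain Python (the notebook's host language). This is only a minimal illustration: `is_iso_date` and `has_stray_whitespace` are hypothetical helper names, the `YYYY-MM-DD` format is assumed from the dates visible in the sample rows, and all input values except the first are made up for the example.

```python
import re
from datetime import datetime

# Assumed target format: YYYY-MM-DD, as seen in the sample rows.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """Return True only for a strictly formatted, real calendar date."""
    if value is None or not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:  # e.g. month 13 or day 32
        return False
    return True


def has_stray_whitespace(value):
    """Return True for leading/trailing whitespace or embedded tabs and line breaks."""
    if value is None:
        return False
    return value != value.strip() or any(ch in value for ch in "\n\r\t")


# Illustrative values: only the first one appears in the notebook output.
candidates = ["2023-09-08", "ERROR", "2023-13-01", " 2023-05-16", None]

for text in candidates:
    print(repr(text), is_iso_date(text), has_stray_whitespace(text))
```

The date check is deliberately strict: a value with a leading space is reported as an invalid date and as having stray whitespace, so both problems stay visible in the RAW PRE diagnosis.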

---

## 🧠 Expected result

The outcome of this process is a **documented technical diagnosis** of the **`dirty_cafe_sales.csv`** dataset, which makes it possible to:

- Justify a **GO / NO-GO** decision.
- Define the transformations needed in later stages.
- Lay the technical foundation for **modeling, ETL, and a Medallion architecture**.

![Data Profiling flow](asset/01_vista_inicial_dataset.png)

In [1]:
%%sql
-- ========================================
-- EDA_SAMPLE_ROWS
-- Description: View a sample of rows to understand the content
-- ========================================

SELECT *
FROM profiling.data_profiling_summary
LIMIT 10;

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,,2023-03-31
6,TXN_4433211,UNKNOWN,3,3.0,9.0,ERROR,Takeaway,2023-10-06
7,TXN_6699534,Sandwich,4,4.0,16.0,Cash,UNKNOWN,2023-10-28
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31


![Data Profiling flow](asset/02_estructura_tabla.png)

In [2]:
%%sql
-- ========================================
-- TEMPLATE: RAW_DESCRIBE_TABLE
-- Description: Show the structure (columns and types) of a table
-- ========================================

DESCRIBE profiling.data_profiling_summary;

Unnamed: 0,column_name,column_type,null,key,default,extra
0,Transaction ID,VARCHAR,YES,,,
1,Item,VARCHAR,YES,,,
2,Quantity,VARCHAR,YES,,,
3,Price Per Unit,VARCHAR,YES,,,
4,Total Spent,VARCHAR,YES,,,
5,Payment Method,VARCHAR,YES,,,
6,Location,VARCHAR,YES,,,
7,Transaction Date,VARCHAR,YES,,,
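Since `DESCRIBE` reports every column as `VARCHAR`, the numeric-looking columns have to be tested for castability before any arithmetic is attempted. The Python sketch below shows one way to do that. `try_float` is a hypothetical helper that loosely mirrors the idea of a safe cast, and treating `NaN` and infinities as failures is a choice made for this sketch. The sample values are copied from the first ten rows shown earlier.

```python
import math


def try_float(text):
    """Return a finite float parsed from text, or None when parsing fails."""
    if text is None:
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    # float("NaN") and float("inf") parse successfully in Python,
    # so they are rejected explicitly here (sketch-level choice).
    return number if math.isfinite(number) else None


# Values taken from the ten sample rows of the notebook output.
sample = {
    "Quantity": ["2", "4", "4", "2", "2", "5", "3", "4", "5", "5"],
    "Total Spent": ["4.0", "12.0", "ERROR", "10.0", "4.0",
                    "20.0", "9.0", "16.0", "15.0", "20.0"],
}

castable = {
    column: sum(try_float(v) is not None for v in values)
    for column, values in sample.items()
}

for column, count in castable.items():
    print(f"{column}: {count}/{len(sample[column])} values castable to float")
```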


![Data Profiling flow](asset/03_contar_nulos_pseudo.png)

In [3]:
%%sql
-- ========================================
-- TEMPLATE: EDA_NULL_PSEUDONULL_COUNT
-- Description: Count real NULLs and pseudo-NULLs ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR') per column
-- ========================================

SELECT
    -- Column 1
    SUM(CASE
            WHEN "Transaction ID" IS NULL
              OR TRIM("Transaction ID") IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Transaction_ID_null_like,

    -- Column 2
    SUM(CASE
            WHEN Item IS NULL
              OR TRIM(Item) IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Item_null_like,

    -- Column 3
    SUM(CASE
            WHEN Quantity IS NULL
              OR TRIM(Quantity) IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Quantity_null_like,

    -- Column 4
    SUM(CASE
            WHEN "Price Per Unit" IS NULL
              OR TRIM("Price Per Unit") IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Price_Per_Unit_null_like,

    -- Column 5
    SUM(CASE
            WHEN "Total Spent" IS NULL
              OR TRIM("Total Spent") IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Total_Spent_null_like,

    -- Column 6
    SUM(CASE
            WHEN "Payment Method" IS NULL
              OR TRIM("Payment Method") IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Payment_Method_null_like,

    -- Column 7
    SUM(CASE
            WHEN Location IS NULL
              OR TRIM(Location) IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Location_null_like,

    -- Column 8
    SUM(CASE
            WHEN "Transaction Date" IS NULL
              OR TRIM("Transaction Date") IN ('', 'NA', 'NaN', 'UNKNOWN', 'ERROR')
        THEN 1 ELSE 0 END) AS Transaction_Date_null_like

FROM profiling.data_profiling_summary;

Unnamed: 0,Transaction_ID_null_like,Item_null_like,Quantity_null_like,Price_Per_Unit_null_like,Total_Spent_null_like,Payment_Method_null_like,Location_null_like,Transaction_Date_null_like
0,0,969,479,533,502,3178,3961,460
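Raw counts are easier to weigh in a GO / NO-GO discussion when expressed as a share of the table. The short Python sketch below converts the counts reported above into percentages, using the 10,000-row total that the per-column duplicates query reports later in this notebook. The variable names are illustrative only.

```python
# Null-like counts copied from the query output above.
null_like = {
    "Transaction ID": 0,
    "Item": 969,
    "Quantity": 479,
    "Price Per Unit": 533,
    "Total Spent": 502,
    "Payment Method": 3178,
    "Location": 3961,
    "Transaction Date": 460,
}

# Row total as reported by the duplicates query (total_registros).
TOTAL_ROWS = 10000

pct = {
    column: round(100 * count / TOTAL_ROWS, 2)
    for column, count in null_like.items()
}

# Print columns from most to least affected.
for column, share in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column:<18}{share:>6.2f}%")
```

On these figures `Location` (39.61%) and `Payment Method` (31.78%) stand out, while every other column stays below 10%.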


![Data Profiling flow](asset/04_duplicado_por_columna.png)

In [5]:
%%sql
-- ========================================
-- TEMPLATE: EDA_DUPLICATES_ALL_COLUMNS
-- Description: Count repeated values per column (total rows minus distinct values)
-- Note: COUNT(DISTINCT ...) skips NULLs, so NULL rows are included in "duplicados"
-- ========================================

SELECT
    columna,
    COUNT(*) AS total_registros,
    COUNT(DISTINCT valor) AS valores_unicos,
    COUNT(*) - COUNT(DISTINCT valor) AS duplicados
FROM (
    -- Column 1
    SELECT 'Transaction ID' AS columna, CAST("Transaction ID" AS VARCHAR) AS valor
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 2
    SELECT 'Item', CAST("Item" AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 3
    SELECT 'Quantity', CAST(Quantity AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 4
    SELECT 'Price Per Unit', CAST("Price Per Unit" AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 5
    SELECT 'Total Spent', CAST("Total Spent" AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 6
    SELECT 'Payment Method', CAST("Payment Method" AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 7
    SELECT 'Location', CAST(Location AS VARCHAR)
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 8
    SELECT 'Transaction Date', CAST("Transaction Date" AS VARCHAR)
    FROM profiling.data_profiling_summary
) t
GROUP BY columna
ORDER BY duplicados DESC;

Unnamed: 0,columna,total_registros,valores_unicos,duplicados
0,Location,10000,4,9996
1,Payment Method,10000,5,9995
2,Quantity,10000,7,9993
3,Price Per Unit,10000,8,9992
4,Item,10000,10,9990
5,Total Spent,10000,19,9981
6,Transaction Date,10000,367,9633
7,Transaction ID,10000,10000,0
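The `duplicados` figure is simply the row count minus the number of distinct non-NULL values, so low-cardinality columns such as `Location` score high by construction, while `Transaction ID` (10,000 distinct values in 10,000 rows) behaves like a unique key. The Python sketch below reproduces the same arithmetic on the `Location` values of the ten sample rows. `duplicate_metric` is a hypothetical helper, and reading the empty cell in the sample as a NULL is an assumption.

```python
def duplicate_metric(values):
    """Return (total, distinct, duplicates) the way the SQL query computes them."""
    total = len(values)  # COUNT(*): every row, NULLs included
    # COUNT(DISTINCT valor): NULLs (None here) are not counted
    distinct = len({v for v in values if v is not None})
    return total, distinct, total - distinct


# Location values from the ten sample rows; None stands for the empty cell
# (assumed to be a NULL in the underlying table).
location = [
    "Takeaway", "In-store", "In-store", "UNKNOWN", "In-store",
    None, "Takeaway", "UNKNOWN", "Takeaway", "In-store",
]

total, distinct, duplicates = duplicate_metric(location)
print(f"total={total} distinct={distinct} duplicates={duplicates}")
```

Because NULLs count toward the total but not toward the distinct values, they inflate `duplicados`; the metric is best read as a cardinality indicator rather than as a count of duplicated records.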


![Data Profiling flow](asset/05_duplicado_multiple_columna.png)

In [6]:
%%sql
-- ========================================
-- TEMPLATE: EDA_DUPLICATES_MULTI_COLUMN
-- Description: List values that appear more than once, for several columns
-- ========================================

SELECT
    columna,
    valor,
    COUNT(*) AS repeticiones
FROM (
    -- Column 1
    SELECT 'Location' AS columna, Location AS valor
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 2
    SELECT 'Payment Method', "Payment Method"
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 3
    SELECT 'Quantity', Quantity
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 4
    SELECT 'Price Per Unit', "Price Per Unit"
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 5
    SELECT 'Item', Item
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 6
    SELECT 'Total Spent', "Total Spent"
    FROM profiling.data_profiling_summary

    UNION ALL
    -- Column 7
    SELECT 'Transaction Date', "Transaction Date"
    FROM profiling.data_profiling_summary
) t
GROUP BY columna, valor
HAVING COUNT(*) > 1
ORDER BY columna, repeticiones DESC;

Unnamed: 0,columna,valor,repeticiones
0,Item,Juice,1171
1,Item,Coffee,1165
2,Item,Salad,1148
3,Item,Cake,1139
4,Item,Sandwich,1131
...,...,...,...
422,Transaction Date,2023-07-30,15
423,Transaction Date,2023-11-24,15
424,Transaction Date,2023-03-11,14
425,Transaction Date,2023-02-17,14


![Data Profiling flow](asset/06_validar_regla_negocio.png)

In [7]:
%%sql
-- ========================================
-- TEMPLATE: EDA_VALIDATE_BUSINESS_RULES
-- Description: Validate a business rule row by row
-- Example rule: Quantity * Price Per Unit = Total Spent
-- ========================================

SELECT
    *,
    CASE
        -- Known placeholder strings are flagged before any cast is attempted
        WHEN Quantity IN ('NaN','UNKNOWN','ERROR')
          OR "Price Per Unit" IN ('NaN','UNKNOWN','ERROR')
          OR "Total Spent" IN ('NaN','UNKNOWN','ERROR')
        THEN 'INVALID_VALUE'

        -- Anything else that cannot be cast (including NULL) is a cast failure
        WHEN TRY_CAST(Quantity AS DOUBLE) IS NULL
          OR TRY_CAST("Price Per Unit" AS DOUBLE) IS NULL
          OR TRY_CAST("Total Spent" AS DOUBLE) IS NULL
        THEN 'CAST_FAIL'

        WHEN TRY_CAST(Quantity AS DOUBLE)
             * TRY_CAST("Price Per Unit" AS DOUBLE)
             = TRY_CAST("Total Spent" AS DOUBLE)
        THEN 'OK'
        ELSE 'BUSINESS_ERROR'
    END AS check_total
FROM profiling.data_profiling_summary
LIMIT 10;

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date,check_total
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08,OK
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16,OK
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19,INVALID_VALUE
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27,OK
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11,OK
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,,2023-03-31,OK
6,TXN_4433211,UNKNOWN,3,3.0,9.0,ERROR,Takeaway,2023-10-06,OK
7,TXN_6699534,Sandwich,4,4.0,16.0,Cash,UNKNOWN,2023-10-28,OK
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28,OK
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31,OK
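The same four-way classification can be sketched in Python, which makes it easy to unit-test the rule before running it over the full table. This is an illustration under assumptions, not the notebook's method: the `check_total` function name mirrors the SQL alias, the comparison uses a small tolerance instead of the strict equality in the query, and the last two input rows are hypothetical cases added to exercise the `CAST_FAIL` and `BUSINESS_ERROR` branches.

```python
import math

# Placeholder strings flagged by the SQL query before casting.
INVALID = {"NaN", "UNKNOWN", "ERROR"}


def check_total(quantity, price, total, tol=1e-9):
    """Classify one row against Quantity * Price Per Unit = Total Spent."""
    fields = (quantity, price, total)
    if any(f in INVALID for f in fields):
        return "INVALID_VALUE"
    try:
        q, p, t = (float(f) for f in fields)
    except (TypeError, ValueError):  # None or non-numeric text
        return "CAST_FAIL"
    # Tolerance-based comparison (sketch-level choice; the SQL uses "=").
    return "OK" if math.isclose(q * p, t, abs_tol=tol) else "BUSINESS_ERROR"


rows = [
    ("2", "2.0", "4.0"),    # from the sample output: OK
    ("4", "1.0", "ERROR"),  # from the sample output: INVALID_VALUE
    ("5", "3.0", None),     # hypothetical: missing total
    ("2", "2.0", "5.0"),    # hypothetical: arithmetic mismatch
]

for row in rows:
    print(row, "->", check_total(*row))
```

Counting rows per `check_total` label over the whole table, rather than inspecting the first ten, would give the figure needed to judge how widely the rule is broken.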
