# CDC Ingestion Pipeline - Upcell

This notebook implements the ingestion pipeline that loads CDC (Change Data Capture) data from S3 into the Bronze layer on Databricks.

## Objective
- Full load: complete initial load of the tables
- CDC: incremental ingestion with Insert, Update, and Delete operations
- Delta Lake: atomic merge into the Bronze layer

## Requirements
- Tables in S3: `s3://meudatalake-raw/upcell/`
- Catalog: `bronze.upcell`
- Control column: `DtAtualizacao` (present in all files)

## 1. Imports and Setup

In [0]:
import delta

def table_exists(catalog, database, table):
    """Return True when the table is listed in the given catalog and database."""
    count = (spark.sql(f"SHOW TABLES IN `{catalog}`.`{database}`")
               .filter(f"database = '{database}' AND tableName = '{table}'")
               .count())
    return count == 1
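
As an alternative to filtering the output of `SHOW TABLES`, the same check can be sketched with `spark.catalog.tableExists`. This is a minimal sketch and not part of the original notebook: the helper name `table_exists_via_catalog` and the explicit `spark` parameter are illustrative, and it assumes a runtime where the table name may be qualified with the catalog (PySpark 3.4+ or a recent Databricks Runtime).

```python
def table_exists_via_catalog(spark, catalog, database, table):
    """Return True when `catalog.database.table` is known to the session catalog."""
    # Assumption: the runtime accepts a catalog-qualified, three-part table name.
    return bool(spark.catalog.tableExists(f"{catalog}.{database}.{table}"))
```

Passing `spark` explicitly keeps the helper usable outside a notebook, where the global `spark` session may not be defined.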

In [0]:
catalog = "bronze"
schema = "upcell"

tablename = "clientes"
id_field = "idcliente"
timefield = "DtAtualizacao"


In [0]:
# Batch-read the CDC files once to capture their schema for the streaming read.
df_full = spark.read.format("parquet").load(f"/Volumes/raw/upcell/cdc/{tablename}/")
df_schema = df_full.schema

In [0]:
if not table_exists(catalog, schema, tablename):
    print("table does not exist")
    # Create the Bronze table from the full-load files.
    df_full = spark.read.format("parquet").load(f"/Volumes/raw/upcell/full-load/{tablename}/")
    
    (df_full.coalesce(1)
        .write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(f"{catalog}.{schema}.{tablename}"))
else:
    print("table already exists")


## 3. Full Load (Initial Load)

If the table does not exist, it is created from the full-load data.

In [0]:
bronze = delta.DeltaTable.forName(spark, f"{catalog}.{schema}.{tablename}")

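def upsert(df, deltatable):
    """Deduplicate a micro-batch and merge it into the Bronze Delta table."""

    # Global temp views are registered in the `global_temp` database.
    df.createOrReplaceGlobalTempView(f"view_{tablename}")
    
    # Keep only the most recent change per key.
    query = f"""
    SELECT *  
    FROM global_temp.view_{tablename}
    QUALIFY ROW_NUMBER() OVER (PARTITION BY {id_field} ORDER BY {timefield} DESC) = 1    
    """
    df_cdc = spark.sql(query)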
        
    (deltatable.alias("b") 
      .merge(df_cdc.alias("d"), f"b.{id_field} = d.{id_field}") 
      .whenMatchedDelete(condition = "d.op = 'D'")           # Delete when op = 'D'
      .whenMatchedUpdateAll(condition = "d.op = 'U'")        # Update when op = 'U'
      # Insert on 'I', and on 'U' when the deduplication kept only the update of a new key
      .whenNotMatchedInsertAll(condition = "d.op IN ('I', 'U')")
      .execute()
    )

df_stream = (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .schema(df_schema)  # schema captured from the CDC files, not the `schema` name string
                .load(f"/Volumes/raw/upcell/cdc/{tablename}/"))

stream = (df_stream.writeStream
                  .option("checkpointLocation", f"/Volumes/raw/upcell/cdc/{tablename}/_checkpoints/")
                  .foreachBatch(lambda df, batchId: upsert(df, bronze))                
)
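
To make the behavior of `upsert` easier to reason about, the sketch below models its two steps (keep the latest change per key, then apply Delete, Update, and Insert) in plain Python, without Spark. It is an illustration only: the function names, the `insert_ops` parameter, and the sample rows are made up for this example, and timestamps are compared as ISO-formatted strings.

```python
def latest_per_key(rows, id_field, timefield):
    """Keep the most recent change per key, like ROW_NUMBER() ... DESC = 1."""
    latest = {}
    for row in rows:
        key = row[id_field]
        if key not in latest or row[timefield] > latest[key][timefield]:
            latest[key] = row
    return list(latest.values())


def apply_changes(target, changes, id_field, insert_ops):
    """Apply deduplicated changes to `target` (a dict keyed by id) and return a copy."""
    result = dict(target)
    for row in changes:
        key, op = row[id_field], row["op"]
        if key in result:
            if op == "D":
                del result[key]       # matched and 'D': delete
            elif op == "U":
                result[key] = row     # matched and 'U': update
        elif op in insert_ops:
            result[key] = row         # not matched: insert if the op is allowed
    return result


# Hypothetical sample data: key 1 already exists, key 2 is new.
target = {1: {"idcliente": 1, "DtAtualizacao": "2024-01-01", "op": "I"}}
batch = [
    {"idcliente": 1, "DtAtualizacao": "2024-01-02", "op": "U"},
    {"idcliente": 1, "DtAtualizacao": "2024-01-03", "op": "D"},
    {"idcliente": 2, "DtAtualizacao": "2024-01-02", "op": "I"},
    {"idcliente": 2, "DtAtualizacao": "2024-01-04", "op": "U"},
]
changes = latest_per_key(batch, "idcliente", "DtAtualizacao")
strict = apply_changes(target, changes, "idcliente", insert_ops=("I",))
lenient = apply_changes(target, changes, "idcliente", insert_ops=("I", "U"))
print(sorted(strict))   # []
print(sorted(lenient))  # [2]
```

The two results differ because key 2 is inserted and updated within the same batch: after deduplication only the `U` row remains, so a not-matched clause that accepts only `I` drops the row, while one that also accepts `U` keeps it.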




In [0]:
start = stream.start()
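
If the notebook runs as a scheduled job rather than as a continuously running stream, one option is to process only the files that are currently available and then stop. The helper below is a hedged sketch of that variant, not the notebook's own method: the name `run_until_caught_up` and its parameters are illustrative, and `trigger(availableNow=True)` assumes Spark 3.3 or later.

```python
def run_until_caught_up(df_stream, checkpoint_location, batch_fn):
    """Process the current backlog with `batch_fn`, then let the query stop."""
    query = (df_stream.writeStream
             .option("checkpointLocation", checkpoint_location)
             .trigger(availableNow=True)   # assumption: Spark 3.3+ runtime
             .foreachBatch(batch_fn)
             .start())
    query.awaitTermination()  # blocks until the available files are processed
    return query
```

It would be called with the streaming DataFrame, the checkpoint path, and the same batch function that is passed to `foreachBatch` above.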