# Chapter 03: Introduction to the Delta Extension

Learn how to create, explore, and work with Delta Lake tables using DuckDB.

## 📦 Initial Setup

In [None]:
%pip install duckdb deltalake pandas pyarrow -q

import duckdb
from deltalake import write_deltalake
import importlib.util

def safe_install_ext(con, ext_name):
    try:
        con.execute(f"INSTALL {ext_name}")
        con.execute(f"LOAD {ext_name}")
        return True
    except Exception as e:
        print(f"⚠ {ext_name}: {e}")
        return False

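con = duckdb.connect(':memory:')
print(f"DuckDB {duckdb.__version__}")

# Install and load the delta extension so that delta_scan() is available
safe_install_ext(con, "delta")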

## 🔨 Creating a Partitioned Delta Table

Create a Delta table partitioned by a column, so that queries filtering on that column can read fewer files.

In [None]:
# Create a DataFrame with sample data
df = con.execute("""
    SELECT
        i as id,
        i % 10 as category,
        i % 2 as partition_col,
        'value-' || i as description,
        CURRENT_DATE - (i % 100) * INTERVAL '1 day' as created_date,
        RANDOM() * 1000 as amount
    FROM range(0, 10000) tbl(i)
""").df()

# Write it as a partitioned Delta table
write_deltalake(
    "./my_delta_table",
    df,
    partition_by=["partition_col"],
    mode="overwrite"
)

print("✓ Delta table created with 10,000 partitioned records")
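
Partitioning is visible directly in the table's directory layout. The sketch below uses only the standard library to list the partition directories and commit files under a table path. It assumes the Hive-style `column=value` directory naming that the `deltalake` writer normally produces on a local file system; `describe_delta_layout` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path

def describe_delta_layout(table_path):
    """Return the partition directories and commit files found under a Delta table path."""
    root = Path(table_path)
    # Assumption: partition directories follow the Hive-style "column=value" naming
    partitions = sorted(p.name for p in root.iterdir() if p.is_dir() and "=" in p.name)
    # Each commit is recorded as a JSON file inside the _delta_log directory
    log_dir = root / "_delta_log"
    commits = sorted(p.name for p in log_dir.glob("*.json")) if log_dir.is_dir() else []
    return {"partitions": partitions, "commits": commits}
```

For the table written above, `describe_delta_layout("./my_delta_table")` would be expected to report one directory per distinct value of `partition_col`.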

## 📊 Reading and Analysis with DuckDB

In [None]:
# Read and analyze the Delta table
result = con.execute("""
    SELECT
        partition_col,
        COUNT(*) as total_rows,
        ROUND(AVG(amount), 2) as avg_amount,
        MIN(created_date) as earliest_date,
        MAX(created_date) as latest_date
    FROM delta_scan('./my_delta_table')
    GROUP BY partition_col
    ORDER BY partition_col
""").fetchdf()

print("Analysis by partition:")
print(result)
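
Because the table is partitioned on `partition_col`, a query that filters on that column is a natural follow-up. The sketch below restricts a similar aggregation to one partition and binds the value as a query parameter. Whether files from other partitions are actually skipped depends on the delta extension version, so treat the pruning as an assumption. `summarize_partition` is a hypothetical helper, and `con` is any open DuckDB connection.

```python
def summarize_partition(con, table_path, partition_value):
    """Aggregate a single partition of a Delta table through delta_scan."""
    # The path is interpolated into the SQL text, so only pass trusted paths;
    # the filter value is bound as a parameter.
    sql = f"""
        SELECT
            COUNT(*) AS total_rows,
            ROUND(AVG(amount), 2) AS avg_amount
        FROM delta_scan('{table_path}')
        WHERE partition_col = ?
    """
    return con.execute(sql, [partition_value]).fetchdf()
```

A call such as `summarize_partition(con, "./my_delta_table", 1)` returns a one-row DataFrame for the rows where `partition_col` is 1.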

## 📁 Creating a Complete Dataset

Create multiple Delta tables (customers, products, sales) for more complex analyses.

In [None]:
from pathlib import Path

# 1. Customers table
customers_df = con.execute("""
    SELECT
        i as customer_id,
        'Customer ' || i as customer_name,
        ['US', 'UK', 'BR', 'JP'][i % 4 + 1] as country,
        CURRENT_DATE - (i % 1000) * INTERVAL '1 day' as signup_date
    FROM range(1, 1001) tbl(i)
""").df()

write_deltalake("./delta_tables/customers", customers_df, mode="overwrite")

# 2. Products table
products_df = con.execute("""
    SELECT
        i as product_id,
        'Product ' || i as product_name,
        ['Electronics', 'Clothing', 'Food', 'Books'][i % 4 + 1] as category,
        10.0 + RANDOM() * 1000 as price
    FROM range(1, 101) tbl(i)
""").df()

write_deltalake("./delta_tables/products", products_df, mode="overwrite")

# 3. Sales table (partitioned by date)
sales_df = con.execute("""
    SELECT
        i as order_id,
        1 + (i % 1000) as customer_id,
        1 + (i % 100) as product_id,
        1 + (RANDOM() * 5)::INTEGER as quantity,
        CURRENT_DATE - (i % 365) * INTERVAL '1 day' as order_date,
        RANDOM() * 1000 as amount
    FROM range(1, 50001) tbl(i)
""").df()

write_deltalake(
    "./delta_tables/sales",
    sales_df,
    partition_by=["order_date"],
    mode="overwrite"
)

print("✓ Complete dataset created:")
print("  - customers: 1,000 records")
print("  - products: 100 records")
print("  - sales: 50,000 records (partitioned)")
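
With the three tables in place, one example of a more complex analysis is revenue by customer country and product category. The following is a minimal sketch, not a prescribed query: it joins the tables through `delta_scan` using the column names defined above, and the helper names `sales_by_country_sql` and `sales_by_country` are hypothetical.

```python
def sales_by_country_sql(base_dir="./delta_tables"):
    """Build a query that joins the sales, customers, and products Delta tables."""
    return f"""
        SELECT
            c.country,
            p.category,
            COUNT(*) AS orders,
            ROUND(SUM(s.amount), 2) AS revenue
        FROM delta_scan('{base_dir}/sales') AS s
        JOIN delta_scan('{base_dir}/customers') AS c ON s.customer_id = c.customer_id
        JOIN delta_scan('{base_dir}/products') AS p ON s.product_id = p.product_id
        GROUP BY c.country, p.category
        ORDER BY revenue DESC
    """

def sales_by_country(con, base_dir="./delta_tables"):
    """Run the report on an open DuckDB connection and return a DataFrame."""
    return con.execute(sales_by_country_sql(base_dir)).fetchdf()
```

Calling `sales_by_country(con)` on the connection created in the setup cell returns one row per country and category combination present in the data.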

Notebook automatically generated from the Python source code.
