# Chapter 03 - Reading Iceberg Tables

## 📋 Objective

In this chapter we will learn to:
1. Read Iceberg tables with DuckDB
2. Apply filters and aggregations
3. Process data in batches
4. Join multiple tables
5. Monitor statistics

## 🔧 Requirements

- DuckDB with the Iceberg extension
- An Iceberg table already created (Chapter 01)
- PyIceberg installed

## Initial Setup

In [37]:
import duckdb
import os
from datetime import datetime, timedelta
import pandas as pd

print("✅ Imports loaded")

✅ Imports loaded


In [38]:
# Helper function
def safe_install_ext(con, ext_name):
    """Install and load a DuckDB extension; return True on success."""
    try:
        con.execute(f"INSTALL {ext_name}")
        con.execute(f"LOAD {ext_name}")
        print(f"✅ Extension '{ext_name}' loaded")
        return True
    except Exception as e:
        print(f"❌ Error: {e}")
        return False

print("✅ Helper defined")

✅ Helper defined


## 1. Basic Reading with iceberg_scan

In [39]:
con = duckdb.connect()
safe_install_ext(con, "iceberg")

# Prepare test data
# Use the warehouse from Chapter 01
warehouse_path = './iceberg_warehouse'

if os.path.exists(warehouse_path):
    print(f"✅ Warehouse found: {warehouse_path}")

    # Check whether there are any tables
    tables_path = os.path.join(warehouse_path, 'default')
    if os.path.exists(tables_path):
        tables = [d for d in os.listdir(tables_path)
                 if os.path.isdir(os.path.join(tables_path, d))]
        print(f"\nAvailable tables: {tables}")
else:
    print("⚠️  Warehouse not found")
    print("   Run Chapter 01 first")

✅ Extension 'iceberg' loaded
✅ Warehouse found: ./iceberg_warehouse

Available tables: ['my_table', 'sales']


### 1.1 Create a Sample Table with Data

In [40]:
# Create a sales table for testing
from pyiceberg.catalog.sql import SqlCatalog
import pyarrow as pa

# Catalog
WAREHOUSE_PATH = './iceberg_warehouse'
catalog = SqlCatalog(
    "local",
    **{
        "uri": f"sqlite:///{WAREHOUSE_PATH}/catalog.db",
        "warehouse": f"file://{os.path.abspath(WAREHOUSE_PATH)}"
    }
)

# Create sales data
sales_data = pd.DataFrame({
    'order_id': range(1, 101),
    'customer_id': [f'CUST{i%20:03d}' for i in range(100)],
    'product_id': [f'PROD{i%10:03d}' for i in range(100)],
    'order_date': pd.date_range('2024-01-01', periods=100, freq='1D').astype('datetime64[us]'),
    'quantity': [i % 10 + 1 for i in range(100)],
    'total_amount': [100 + (i * 10) % 500 for i in range(100)]
})

print(f"✅ Data created: {len(sales_data)} sales")
print("\nPreview:")
print(sales_data.head())

✅ Data created: 100 sales

Preview:
   order_id customer_id product_id order_date  quantity  total_amount
0         1     CUST000    PROD000 2024-01-01         1           100
1         2     CUST001    PROD001 2024-01-02         2           110
2         3     CUST002    PROD002 2024-01-03         3           120
3         4     CUST003    PROD003 2024-01-04         4           130
4         5     CUST004    PROD004 2024-01-05         5           140


In [41]:
# Convert to Arrow
sales_arrow = pa.Table.from_pandas(sales_data)

# Create/recreate the table
from pyiceberg.exceptions import NoSuchTableError

try:
    catalog.drop_table("default.sales")
    print("Old table removed")
except NoSuchTableError:
    # Nothing to drop on the first run
    pass

sales_table = catalog.create_table(
    "default.sales",
    schema=sales_arrow.schema
)

print(f"✅ Table 'sales' created: {sales_table.name()}")

Old table removed
✅ Table 'sales' created: ('default', 'sales')


In [42]:
# Insert data
sales_table.append(sales_arrow)
print(f"✅ Data inserted: {len(sales_data)} rows")

✅ Data inserted: 100 rows
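
The cells so far load the `iceberg` extension but never call `iceberg_scan()` itself. A minimal sketch of a direct scan follows. It assumes that the catalog writes metadata files named `<sequence>-<uuid>.metadata.json` without a `version-hint.text` file, and that `iceberg_scan()` accepts the path of one specific metadata file. `latest_metadata_file` and `build_scan_query` are hypothetical helpers, not part of DuckDB or PyIceberg.

```python
import os
import re

# Assumed naming scheme: "00001-<uuid>.metadata.json" (sequence number first).
_METADATA_RE = re.compile(r"^(\d+)-.*\.metadata\.json$")

def latest_metadata_file(metadata_dir):
    """Return the path of the metadata file with the highest sequence number, or None."""
    candidates = []
    for name in os.listdir(metadata_dir):
        match = _METADATA_RE.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    _, newest = max(candidates)
    return os.path.join(metadata_dir, newest)

def build_scan_query(metadata_file):
    """Build a row-count query over iceberg_scan() for one metadata file."""
    # Double any single quote so the path stays a valid SQL string literal.
    escaped = metadata_file.replace("'", "''")
    return f"SELECT count(*) FROM iceberg_scan('{escaped}')"
```

With the extension loaded, `con.execute(build_scan_query(latest_metadata_file('./iceberg_warehouse/default/sales/metadata'))).fetchone()` would return the row count, provided the data-file paths recorded in the metadata resolve on the local machine. The functions in the rest of this chapter go through PyIceberg and Arrow instead.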


## 2. Reading with Filters

In [43]:
# Function to read with filters
def read_iceberg_filtered(metadata_path, date_filter=None):
    """
    Reads an Iceberg table with optional filters
    Note: uses PyIceberg -> Arrow -> DuckDB for robustness on Windows
    """
    con = duckdb.connect()

    if not os.path.exists(metadata_path):
        print(f"❌ Path not found: {metadata_path}")
        return None

    try:
        # Load using PyIceberg
        tbl = catalog.load_table("default.sales")
        # In a real filtered scenario, we could push filters to PyIceberg scan(), e.g.:
        # tbl.scan(row_filter=GreaterThan("order_date", date_filter))
        # For simplicity here, we load the full snapshot to Arrow and query it with DuckDB
        arrow_table = tbl.scan().to_arrow()

        query = "SELECT * FROM arrow_table"

        if date_filter:
            query += f" WHERE order_date >= '{date_filter}'"

        result = con.execute(query).df()
        print(f"✅ Read {len(result)} records")
        return result
    except Exception as e:
        print(f"❌ Read error: {e}")
        return None

print("✅ Function read_iceberg_filtered defined (DuckDB+Arrow integration)")

✅ Function read_iceberg_filtered defined (DuckDB+Arrow integration)


In [44]:
# Test the read
metadata_path = './iceberg_warehouse/default/sales/metadata'

if os.path.exists(metadata_path):
    df = read_iceberg_filtered(metadata_path, date_filter='2024-02-01')

    if df is not None and len(df) > 0:
        print("\nData preview:")
        print(df.head())
else:
    print("The sales table has no data (expected because of the PyArrow incompatibility)")

✅ Read 69 records

Data preview:
   order_id customer_id product_id order_date  quantity  total_amount
0        32     CUST011    PROD001 2024-02-01         2           410
1        33     CUST012    PROD002 2024-02-02         3           420
2        34     CUST013    PROD003 2024-02-03         4           430
3        35     CUST014    PROD004 2024-02-04         5           440
4        36     CUST015    PROD005 2024-02-05         6           450
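
`read_iceberg_filtered` interpolates `date_filter` straight into the SQL text, and its comments mention pushing the filter down to PyIceberg. The sketch below illustrates both ideas under stated assumptions: DuckDB's `?` placeholders with `con.execute(sql, params)`, `con.register()` to expose the Arrow table under the name `sales`, and PyIceberg's `GreaterThanOrEqual` expression accepting an ISO timestamp string. `build_filtered_query`, `query_filtered`, and `scan_with_pushdown` are hypothetical helpers.

```python
from datetime import date, datetime, time

def build_filtered_query(date_filter=None):
    """Return (sql, params) with the date bound as a parameter, not interpolated."""
    sql = "SELECT * FROM sales"
    params = []
    if date_filter:
        # fromisoformat raises ValueError if the text is not a valid ISO date.
        start = datetime.combine(date.fromisoformat(date_filter), time.min)
        sql += " WHERE order_date >= ?"
        params.append(start)
    return sql, params

def query_filtered(con, arrow_table, date_filter=None):
    """Run the parameterized query against an Arrow table with DuckDB."""
    con.register("sales", arrow_table)  # expose the Arrow table as a view
    sql, params = build_filtered_query(date_filter)
    return con.execute(sql, params).df()

def scan_with_pushdown(tbl, date_filter):
    """Let PyIceberg apply the filter during the scan (assumed API usage)."""
    from pyiceberg.expressions import GreaterThanOrEqual
    # Assumption: an ISO-formatted string literal is converted to a timestamp.
    return tbl.scan(
        row_filter=GreaterThanOrEqual("order_date", f"{date_filter}T00:00:00")
    ).to_arrow()
```

Binding the value means a malformed `date_filter` fails in Python before any SQL runs, and pushdown lets PyIceberg skip data files that cannot match instead of loading the full snapshot.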


## 3. Reading with a Time Window

In [45]:
def read_last_n_days(metadata_path, n_days=7):
    """
    Reads the last N days of an Iceberg table
    """
    cutoff_date = (datetime.now() - timedelta(days=n_days)).strftime('%Y-%m-%d')

    print(f"📅 Reading data since: {cutoff_date}")

    return read_iceberg_filtered(metadata_path, date_filter=cutoff_date)

# Test
if os.path.exists(metadata_path):
    recent_data = read_last_n_days(metadata_path, n_days=30)

    if recent_data is not None:
        print(f"\n✅ Recent data: {len(recent_data)} records")
else:
    print("⚠️  No data to test (normal - PyArrow issue)")

📅 Reading data since: 2025-12-23
✅ Read 0 records

✅ Recent data: 0 records
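
The run above returns 0 records because the cutoff is computed from the current date (2025-12-23 here), while the sample data covers 2024-01-01 through 2024-04-09. One option, sketched below, is to anchor the window to the most recent `order_date` in the table instead of `datetime.now()`. `cutoff_from_latest` is a hypothetical helper, and the date list only imitates the chapter's sample data.

```python
from datetime import datetime, timedelta

def cutoff_from_latest(latest, n_days):
    """Return the cutoff string for a window of n_days ending at `latest`."""
    return (latest - timedelta(days=n_days)).strftime('%Y-%m-%d')

# Stand-in for the sample data: 100 consecutive days starting 2024-01-01.
order_dates = [datetime(2024, 1, 1) + timedelta(days=i) for i in range(100)]

cutoff = cutoff_from_latest(max(order_dates), n_days=30)
recent = [d for d in order_dates if d.strftime('%Y-%m-%d') >= cutoff]

print(cutoff)       # 2024-03-10
print(len(recent))  # 31
```

In the notebook, the latest date could come from a `max(order_date)` query, and the resulting string could be passed as `date_filter` to `read_iceberg_filtered`.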


## 4. Aggregate Analysis

In [46]:
def analyze_monthly_sales(metadata_path):
    """
    Analyzes sales aggregated by month
    Note: uses PyIceberg -> Arrow -> DuckDB to avoid path problems on Windows
    """
    con = duckdb.connect()

    if not os.path.exists(metadata_path):
        print("❌ Path not found")
        return None

    try:
        # Use the already configured catalog to load the table
        # This ensures we use the same configuration as the write path
        tbl = catalog.load_table("default.sales")

        # Converting to Arrow makes it easier for DuckDB to read on Windows
        arrow_table = tbl.scan().to_arrow()
        
        result = con.execute(f"""
            SELECT
                date_trunc('month', order_date) as month,
                count(DISTINCT customer_id) as unique_customers,
                count(*) as total_orders,
                sum(total_amount) as revenue,
                round(avg(total_amount), 2) as avg_order_value
            FROM arrow_table
            GROUP BY month
            ORDER BY month
        """).df()
        
        print(f"✅ Monthly analysis: {len(result)} months")
        return result
    except Exception as e:
        print(f"❌ Error: {e}")
        return None

# Run
if os.path.exists(metadata_path):
    monthly = analyze_monthly_sales(metadata_path)

    if monthly is not None and len(monthly) > 0:
        print("\n📊 Monthly sales:")
        print(monthly)
else:
    print("⚠️  No data (expected)")

✅ Monthly analysis: 4 months

📊 Monthly sales:
       month  unique_customers  total_orders  revenue  avg_order_value
0 2024-01-01                20            31   7750.0           250.00
1 2024-02-01                20            29  10950.0           377.59
2 2024-03-01                20            31  10850.0           350.00
3 2024-04-01                 9             9   4950.0           550.00
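
Because the sample data is generated by simple formulas, the monthly figures can be cross-checked without DuckDB. The sketch below recomputes orders, revenue, and unique customers per month in plain Python, using the same formulas as the data-creation cell. `monthly_summary` is a hypothetical helper introduced only for this check.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_summary(n_orders=100, start=date(2024, 1, 1)):
    """Recompute per-month order count, revenue and unique customers."""
    orders = defaultdict(int)
    revenue = defaultdict(int)
    customers = defaultdict(set)
    for i in range(n_orders):
        month = (start + timedelta(days=i)).strftime('%Y-%m')
        orders[month] += 1
        revenue[month] += 100 + (i * 10) % 500     # same formula as total_amount
        customers[month].add(f'CUST{i % 20:03d}')  # same formula as customer_id
    return {
        m: (orders[m], revenue[m], len(customers[m]))
        for m in sorted(orders)
    }

for month, (n, rev, uniq) in monthly_summary().items():
    print(month, n, rev, uniq, round(rev / n, 2))
```

The recomputed values agree with the DuckDB result: for example, January has 31 orders, revenue of 7750, and 20 unique customers.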


## 5. Batch Processing

In [47]:
def process_iceberg_to_parquet(metadata_path, output_file):
    """
    Reads an Iceberg table, processes it and saves the result as Parquet
    """
    con = duckdb.connect()

    if not os.path.exists(metadata_path):
        print("❌ Metadata not found")
        return False

    try:
        # Use PyIceberg -> Arrow -> DuckDB
        tbl = catalog.load_table("default.sales")
        arrow_table = tbl.scan().to_arrow()
        
        con.execute(f"""
            COPY (
                SELECT
                    customer_id,
                    sum(total_amount) as lifetime_value,
                    count(*) as order_count,
                    max(order_date) as last_order_date
                FROM arrow_table
                GROUP BY customer_id
                HAVING lifetime_value > 100
            ) TO '{output_file}'
            (FORMAT parquet, COMPRESSION zstd)
        """)
        
        print(f"✅ Processed and saved to {output_file}")

        # Check the file
        if os.path.exists(output_file):
            size = os.path.getsize(output_file)
            print(f"   Size: {size:,} bytes")

        return True
    except Exception as e:
        print(f"❌ Error: {e}")
        return False

# Test
if os.path.exists(metadata_path):
    process_iceberg_to_parquet(
        metadata_path,
        './customer_ltv.parquet'
    )
else:
    print("⚠️  No data (expected)")

✅ Processed and saved to ./customer_ltv.parquet
   Size: 2,238 bytes
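
`process_iceberg_to_parquet` places `output_file` directly inside the `COPY ... TO '...'` statement, so a file name containing a single quote would break the SQL. A small sketch of one way to guard against that is to double the quote, the standard SQL escape inside string literals. `sql_string_literal` and `build_copy_statement` are hypothetical helpers, and the `SELECT` passed in is only an example.

```python
def sql_string_literal(value):
    """Quote a Python string as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"

def build_copy_statement(select_sql, output_file):
    """Wrap a SELECT in a COPY ... TO statement with the options used in this chapter."""
    return (
        f"COPY ({select_sql}) TO {sql_string_literal(output_file)} "
        "(FORMAT parquet, COMPRESSION zstd)"
    )

print(build_copy_statement(
    "SELECT customer_id FROM arrow_table",
    "./customer's_ltv.parquet",
))
```

The returned string can be passed to `con.execute()` in place of the inline f-string.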


## 7. Table Monitoring

In [48]:
def monitor_iceberg_table(metadata_path):
    """
    Monitors statistics of an Iceberg table
    """
    con = duckdb.connect()

    if not os.path.exists(metadata_path):
        print("❌ Path not found")
        return

    try:
        # Use PyIceberg -> Arrow -> DuckDB
        tbl = catalog.load_table("default.sales")
        arrow_table = tbl.scan().to_arrow()
        
        stats = con.execute(f"""
            SELECT
                count(*) as total_rows,
                count(DISTINCT date_trunc('day', order_date)) as days_of_data,
                min(order_date) as earliest_date,
                max(order_date) as latest_date
            FROM arrow_table
        """).fetchone()
        
        print(f"""
🔍 TABLE STATISTICS
{'='*50}
Total rows: {stats[0]:,}
Days of data: {stats[1]}
Earliest date: {stats[2]}
Latest date: {stats[3]}
Analysis timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
{'='*50}
        """)
    except Exception as e:
        print(f"❌ Error: {e}")

# Test
if os.path.exists(metadata_path):
    monitor_iceberg_table(metadata_path)
else:
    print("⚠️  Empty table (expected)")


🔍 TABLE STATISTICS
==================================================
Total rows: 100
Days of data: 100
Earliest date: 2024-01-01 00:00:00
Latest date: 2024-04-09 00:00:00
Analysis timestamp: 2026-01-22 18:28:36
==================================================
        


## 8. Safe Reading with Error Handling

In [49]:
def safe_iceberg_scan(metadata_path):
    """
    Reads an Iceberg table with error handling
    """
    con = duckdb.connect()

    # Checks
    if not os.path.exists(metadata_path):
        print(f"❌ Path does not exist: {metadata_path}")
        return False

    print(f"📄 Metadata Path: {metadata_path}")

    try:
        # Use PyIceberg -> Arrow -> DuckDB
        tbl = catalog.load_table("default.sales")
        arrow_table = tbl.scan().to_arrow()

        result = con.execute("""
            SELECT count(*) FROM arrow_table
        """).fetchone()

        print(f"✅ Table found: {result[0]:,} rows")
        # Show the schema as well
        print("Schema:")
        print(tbl.schema())
        return True
    except Exception as e:
        print(f"❌ Error reading table: {e}")
        return False

# Robustness test
print("\nRobustness test:")
safe_iceberg_scan(metadata_path)
safe_iceberg_scan("./path_invalido")


Robustness test:
📄 Metadata Path: ./iceberg_warehouse/default/sales/metadata
✅ Table found: 100 rows
Schema:
table {
  1: order_id: optional long
  2: customer_id: optional string
  3: product_id: optional string
  4: order_date: optional timestamp
  5: quantity: optional long
  6: total_amount: optional long
}
❌ Path does not exist: ./path_invalido


False

## ✅ Summary

**We learned how to:**
1. ✅ Read Iceberg tables from DuckDB (the functions here go through PyIceberg and Arrow; `iceberg_scan()` is the direct route)
2. ✅ Apply date filters
3. ✅ Write functions for time windows
4. ✅ Run aggregate analyses (monthly, per customer)
5. ✅ Batch-process data → Parquet
6. ✅ Monitor statistics
7. ✅ Handle errors
 