# Chapter 09: Cloud Storage Integration

This chapter demonstrates how to integrate DuckDB + Iceberg with cloud storage services (AWS S3 and Azure Blob Storage).

## 📦 Installation

Install and load the extensions needed to work with DuckDB and Iceberg.

In [None]:
import duckdb

con = duckdb.connect()

# Install and load the extensions
con.execute("INSTALL iceberg")
con.execute("INSTALL httpfs")
con.execute("INSTALL azure")

con.execute("LOAD iceberg")
con.execute("LOAD httpfs")
con.execute("LOAD azure")

print("✅ Extensions installed and loaded successfully!")
con.close()

## 🔐 S3 Credential Configuration

Configure access to AWS S3 so that DuckDB can read Iceberg tables.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

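# Option 1: explicit credentials (example values only; never commit real keys)
con.execute("""
    CREATE SECRET s3_secret (
        TYPE s3,
        PROVIDER config,
        KEY_ID 'AKIAIOSFODNN7EXAMPLE',
        SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        REGION 'us-east-1'
    )
""")

# Option 2: credential chain (AWS CLI profile, environment variables)
# OR REPLACE is needed here because Option 1 already created a secret
# named s3_secret; in practice, pick one of the two options.
con.execute("""
    CREATE OR REPLACE SECRET s3_secret (
        TYPE s3,
        PROVIDER credential_chain
    )
""")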

## 📊 Reading Iceberg Tables on S3

Example query against Iceberg tables stored in AWS S3.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

# Configure S3
con.execute("""
    CREATE SECRET s3_secret (
        TYPE s3,
        PROVIDER credential_chain
    )
""")

# Read the Iceberg table
result = con.execute("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/sales')
""").fetchone()

print(f"Total rows: {result[0]:,}")

## 🔑 Iceberg Catalog Integration

Configure secrets to authenticate against an Iceberg catalog.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

# Secret for S3
con.execute("""
    CREATE SECRET s3_secret (
        TYPE s3,
        PROVIDER credential_chain
    )
""")

# Secret for the Iceberg catalog
con.execute("""
    CREATE SECRET iceberg_secret (
        TYPE iceberg,
        CLIENT_ID 'catalog_client',
        CLIENT_SECRET 'catalog_secret',
        OAUTH2_SERVER_URI 'https://catalog.example.com/oauth/tokens'
    )
""")

# Attach the catalog
con.execute("""
    ATTACH 's3://my-bucket/warehouse' AS iceberg_cat (
        TYPE iceberg,
        SECRET iceberg_secret,
        ENDPOINT 'https://catalog.example.com'
    )
""")

# Query a table through the catalog
result = con.execute("""
    SELECT * FROM iceberg_cat.default.sales LIMIT 10
""").df()

print(result)

## ☁️ Azure Credential Configuration

Configure access to Azure Blob Storage.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD azure")

# Configure Azure credentials
con.execute("""
    CREATE SECRET azure_secret (
        TYPE azure,
        PROVIDER config,
        ACCOUNT_NAME 'mystorageaccount',
        ACCOUNT_KEY 'my_account_key=='
    )
""")

## 🗂️ Reading Iceberg Tables on Azure

Example query against Iceberg tables stored in Azure Blob Storage.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD azure")

# Configure Azure
con.execute("""
    CREATE SECRET azure_secret (
        TYPE azure,
        PROVIDER config,
        ACCOUNT_NAME 'mystorageaccount',
        ACCOUNT_KEY 'key=='
    )
""")

# Read the Iceberg table
result = con.execute("""
    SELECT *
    FROM iceberg_scan('az://container/path/to/iceberg/table')
    LIMIT 100
""").df()

print(result.head())

## 🎫 Azure SAS Token

An alternative authentication method that uses a SAS token.


In [None]:
con.execute("""
    CREATE SECRET azure_secret (
        TYPE azure,
        PROVIDER sas_token,
        ACCOUNT_NAME 'mystorageaccount',
        SAS_TOKEN 'sp=r&st=...'
    )
""")

## ⚡ Performance Optimization for Cloud

Settings that improve performance when reading data from the cloud.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

# Increase threads for parallel I/O
con.execute("SET threads = 16")

# Read large datasets
result = con.execute("""
    SELECT
        date_trunc('month', order_date) as month,
        count(*) as orders
    FROM iceberg_scan('s3://bucket/large_table')
    WHERE order_date >= '2024-01-01'
    GROUP BY month
""").df()

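## 🎯 Best Practices: Column Selection

A comparison of good and bad practices when selecting columns. Naming only the columns you need lets the scan skip the others, which reduces the data transferred from the bucket.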


In [None]:
-- ✅ GOOD: name the columns you need
SELECT customer_id, total_amount
FROM iceberg_scan('s3://bucket/sales');

-- ❌ BAD: SELECT *
SELECT *
FROM iceberg_scan('s3://bucket/sales');

## 🔍 Predicate Pushdown

Filters are applied during the scan, so less data is read and queries run faster.


In [None]:
-- Filters are applied during the scan
SELECT *
FROM iceberg_scan('s3://bucket/sales')
WHERE order_date >= '2024-01-01'  -- pushed down to the Parquet scan
  AND region = 'US';               -- pushed down to the Parquet scan

## 📈 Query Analysis with EXPLAIN

Use EXPLAIN ANALYZE to understand the execution plan.


In [None]:
import duckdb

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

# Analyze the query
explain = con.execute("""
    EXPLAIN ANALYZE
    SELECT
        region,
        sum(total_amount) as revenue
    FROM iceberg_scan('s3://bucket/sales')
    WHERE order_date >= '2024-01-01'
    GROUP BY region
""").fetchall()

# Each row is (explain_key, explain_value); the plan text is the second column
for row in explain:
    print(row[1])

## ⏱️ Performance Benchmark

Measure execution time and throughput.


In [None]:
import duckdb
import time

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

# Measure read time (perf_counter is a monotonic clock suited to timing)
start = time.perf_counter()
result = con.execute("""
    SELECT count(*)
    FROM iceberg_scan('s3://bucket/large_table')
    WHERE event_date = '2024-01-15'
""").fetchone()
elapsed = time.perf_counter() - start

print(f"Time: {elapsed:.2f}s")
print(f"Rows: {result[0]:,}")
# Matching rows per second, not rows scanned
print(f"Throughput: {result[0]/elapsed:,.0f} rows/s")
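
A single timed run against cloud storage can be noisy, since network latency varies between runs. The following is a minimal sketch of repeating the measurement and reporting the median. `time_runs` and `summarize` are hypothetical helper names, and the workload at the end is a stand-in so the block runs without an S3 bucket.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> List[float]:
    """Call fn `warmup` times untimed, then `repeats` times timed; return the timings in seconds."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Return min, median, and max of a list of timings."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


# Stand-in workload (an assumption for illustration); with DuckDB you would
# pass something like: lambda: con.execute(query).fetchone()
timings = time_runs(lambda: sum(range(100_000)), repeats=5, warmup=1)
summary = summarize(timings)
print(f"median: {summary['median']:.4f}s (min {summary['min']:.4f}s, max {summary['max']:.4f}s)")
```

The median is less sensitive than the mean to one unusually slow request.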

## 🔒 Security: Environment Credentials

Read credentials from environment variables instead of writing them into the notebook.


In [None]:
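import duckdb
import os

# Read credentials from the environment
aws_key = os.getenv('AWS_ACCESS_KEY_ID')
aws_secret = os.getenv('AWS_SECRET_ACCESS_KEY')

# Fail early: an unset variable would otherwise be interpolated as the string 'None'
if not aws_key or not aws_secret:
    raise RuntimeError("Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first")

con = duckdb.connect()
con.execute("LOAD iceberg")
con.execute("LOAD httpfs")

con.execute(f"""
    CREATE SECRET s3_secret (
        TYPE s3,
        PROVIDER config,
        KEY_ID '{aws_key}',
        SECRET '{aws_secret}',
        REGION 'us-east-1'
    )
""")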
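
Interpolating values into the `CREATE SECRET` statement with an f-string breaks if a value contains a single quote. The sketch below shows one way to guard against that by doubling single quotes before building the statement. The helper names `sql_literal` and `build_s3_secret_sql` are hypothetical and not part of DuckDB; the generated statement mirrors the one above.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_s3_secret_sql(name: str, key_id: str, secret: str, region: str) -> str:
    """Build a CREATE OR REPLACE SECRET statement for the S3 config provider."""
    # The secret name is an identifier, not a literal, so restrict it instead of quoting it
    if not name.isidentifier():
        raise ValueError(f"invalid secret name: {name!r}")
    return (
        f"CREATE OR REPLACE SECRET {name} (\n"
        "    TYPE s3,\n"
        "    PROVIDER config,\n"
        f"    KEY_ID {sql_literal(key_id)},\n"
        f"    SECRET {sql_literal(secret)},\n"
        f"    REGION {sql_literal(region)}\n"
        ")"
    )


# Example values only; the embedded quote shows the escaping
statement = build_s3_secret_sql("s3_secret", "AKIAIOSFODNN7EXAMPLE", "abc'def", "us-east-1")
print(statement)
```

The resulting string can be passed to `con.execute(statement)`. Avoid printing or logging it when it contains real credentials.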

## 🛡️ Security: Credential Chain

Use the credential chain to avoid hardcoded credentials.


In [None]:
# Use the credential chain instead of hardcoded credentials
con.execute("""
    CREATE SECRET s3_secret (
        TYPE s3,
        PROVIDER credential_chain
    )
""")

# Works with:
# - AWS CLI (~/.aws/credentials)
# - Environment variables
# - IAM roles (EC2/ECS)

## 🏗️ Complete Pipeline: IcebergS3Pipeline Class

A complete pipeline implementation for reading and processing Iceberg data on S3.


In [None]:
import duckdb

class IcebergS3Pipeline:
    def __init__(self, bucket, prefix):
        self.bucket = bucket
        self.prefix = prefix

        self.con = duckdb.connect()
        self.con.execute("LOAD iceberg")
        self.con.execute("LOAD httpfs")

        # Use the credential chain
        self.con.execute("""
            CREATE SECRET s3_secret (
                TYPE s3,
                PROVIDER credential_chain
            )
        """)

        # Performance settings
        self.con.execute("SET threads = 16")
        self.con.execute("SET memory_limit = '8GB'")

    def read_table(self, table_name, filter_date=None):
        """Read an Iceberg table from S3."""
        table_path = f"s3://{self.bucket}/{self.prefix}/{table_name}"

        query = f"SELECT * FROM iceberg_scan('{table_path}')"
        if filter_date:
            query += f" WHERE event_date >= '{filter_date}'"

        return self.con.execute(query).df()

    def aggregate_data(self, table_name, date_from):
        """Monthly aggregation."""
        table_path = f"s3://{self.bucket}/{self.prefix}/{table_name}"

        return self.con.execute(f"""
            SELECT
                date_trunc('month', event_date) as month,
                count(*) as total_events,
                count(DISTINCT user_id) as unique_users
            FROM iceberg_scan('{table_path}')
            WHERE event_date >= '{date_from}'
            GROUP BY month
            ORDER BY month
        """).df()

    def export_to_parquet(self, query, output_path):
        """Export a query result to Parquet."""
        self.con.execute(f"""
            COPY ({query})
            TO '{output_path}'
            (FORMAT parquet, COMPRESSION zstd)
        """)

# Usage
pipeline = IcebergS3Pipeline(
    bucket='analytics-bucket',
    prefix='warehouse'
)

# Monthly analysis
monthly = pipeline.aggregate_data('events', '2024-01-01')
print(monthly)

# Export
pipeline.export_to_parquet(
    "SELECT * FROM iceberg_scan('s3://analytics-bucket/warehouse/events') WHERE event_date >= '2024-01-01'",
    'local_export.parquet'
)
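
The pipeline above interpolates `table_name` and the date arguments directly into SQL strings. Below is a minimal sketch of validating those inputs first, assuming table names are plain identifiers and dates are ISO `YYYY-MM-DD` strings. `validate_table_name`, `validate_iso_date`, and `build_scan_query` are hypothetical helpers, not part of DuckDB or of the class above.

```python
import re
from datetime import date
from typing import Optional

# Letters, digits, and underscores only, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_table_name(name: str) -> str:
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid table name: {name!r}")
    return name


def validate_iso_date(value: str) -> str:
    """Parse an ISO date and return it normalized as YYYY-MM-DD."""
    # date.fromisoformat raises ValueError for anything that is not a valid date
    return date.fromisoformat(value).isoformat()


def build_scan_query(bucket: str, prefix: str, table_name: str,
                     date_from: Optional[str] = None) -> str:
    """Build the iceberg_scan query from validated parts."""
    if "'" in bucket or "'" in prefix:
        raise ValueError("bucket and prefix must not contain single quotes")
    path = f"s3://{bucket}/{prefix}/{validate_table_name(table_name)}"
    query = f"SELECT * FROM iceberg_scan('{path}')"
    if date_from is not None:
        query += f" WHERE event_date >= DATE '{validate_iso_date(date_from)}'"
    return query


print(build_scan_query("analytics-bucket", "warehouse", "events", "2024-01-01"))
```

A method such as `read_table` could call `build_scan_query` instead of assembling the string inline, so malformed input fails before any request reaches S3.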

## ✅ Connection Tests

Test the connection and troubleshoot failures.


In [None]:
import duckdb

try:
    con = duckdb.connect()
    con.execute("LOAD iceberg")
    con.execute("LOAD httpfs")

    con.execute("""
        CREATE SECRET test_secret (
            TYPE s3,
            PROVIDER credential_chain
        )
    """)

    result = con.execute("""
        SELECT count(*)
        FROM iceberg_scan('s3://bucket/table')
    """).fetchone()

    print(f"✅ Success: {result[0]} rows")

except Exception as e:
    print(f"❌ Error: {e}")
    print("Check:")
    print("1. AWS credentials are configured")
    print("2. S3 permissions")
    print("3. Correct region")

## ⚙️ Timeout Configuration

Adjust the timeout for slow networks.


In [None]:
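# Increase the timeout for slow networks.
# The unit of http_timeout depends on the DuckDB version (milliseconds in older
# releases, seconds in newer ones), so check the documentation for your version.
# 120000 means 120 seconds only when the unit is milliseconds.
con.execute("SET http_timeout = 120000")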
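
On slow or unreliable networks, a longer timeout can be combined with retrying the query at the application level. The following is a minimal sketch of retrying with exponential backoff. `run_with_retries` is a hypothetical helper, not a DuckDB API, and `flaky_query` simulates a failing call so the block runs without a network.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds, waiting base_delay * 2**k between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated query (an assumption for illustration): fails twice, then succeeds
calls = {"n": 0}


def flaky_query() -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return 42


delays = []  # record the waits instead of sleeping, so the demo runs instantly
result = run_with_retries(flaky_query, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

With DuckDB the call would look like `run_with_retries(lambda: con.execute(query).fetchone())`. In real use it is safer to catch only the I/O errors you expect rather than every `Exception`, so that genuine bugs such as SQL syntax errors are not retried.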