## 🚧 Setup Zone

### Prerequisites

Before running this notebook, ensure you have:
1. Completed the environment setup from the main tutorial
2. The `stocks.db` DuckDB database in the `data/` directory
3. DuckDB package installed

```bash
pip install duckdb
```

---
## ✅ End of Setup Zone
---

## 📦 Import Libraries

We'll use three main packages:

| Package | Purpose |
|---------|---------|
| `duckdb` | Read data from local DuckDB database |
| `snowflake.core` | Python API for managing Snowflake objects |
| `snowflake.snowpark` | DataFrame API for data transformations |

In [1]:
# Standard library
import os
import time
from datetime import date, datetime

# DuckDB for local database access
import duckdb
import pandas as pd

# Snowflake imports
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.core import Root, CreateMode
from snowflake.core.database import Database
from snowflake.core.schema import Schema
from snowflake.core.warehouse import Warehouse
from snowflake.core.table import Table, TableColumn
from snowflake.core.role import Role, Securable
from snowflake.core.dynamic_table import DynamicTableCollection

## 🦆 Explore DuckDB Database

First, let's connect to the local DuckDB database and explore its structure.

In [2]:
# Connect to DuckDB database
duckdb_path = "./data/stocks.db"
duck_conn = duckdb.connect(duckdb_path, read_only=True)

print(f"🦆 Connected to DuckDB database: {duckdb_path}")

🦆 Connected to DuckDB database: ./data/stocks.db


In [3]:
# List all tables in the DuckDB database
tables_df = duck_conn.execute("SHOW TABLES").fetchdf()
print("Tables in DuckDB database:")
display(tables_df)

Tables in DuckDB database:


Unnamed: 0,name
0,exchanges
1,index_list
2,index_quotes
3,risk_premium
4,stock_metrics
5,stock_profiles
6,stock_quotes
7,stock_tickers


In [4]:
# Get detailed schema information for each table
for table_name in tables_df['name'].tolist():
    print(f"\n📋 Schema for table '{table_name}':")
    schema_df = duck_conn.execute(f"DESCRIBE {table_name}").fetchdf()
    display(schema_df)
    
    # Get row count
    count = duck_conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    print(f"Row count: {count:,}")


📋 Schema for table 'exchanges':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,exchange,VARCHAR,YES,,,
1,name,VARCHAR,YES,,,
2,countryName,VARCHAR,YES,,,
3,countryCode,VARCHAR,YES,,,
4,symbolSuffix,VARCHAR,YES,,,
5,delay,VARCHAR,YES,,,


Row count: 14

📋 Schema for table 'index_list':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,symbol,VARCHAR,YES,,,
1,name,VARCHAR,YES,,,
2,exchange,VARCHAR,YES,,,
3,currency,VARCHAR,YES,,,


Row count: 36

📋 Schema for table 'index_quotes':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,symbol,VARCHAR,YES,,,
1,date,DATE,YES,,,
2,price,DOUBLE,YES,,,
3,volume,BIGINT,YES,,,


Row count: 111,946

📋 Schema for table 'risk_premium':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,country,VARCHAR,YES,,,
1,continent,VARCHAR,YES,,,
2,countryRiskPremium,DOUBLE,YES,,,
3,totalEquityRiskPremium,DOUBLE,YES,,,


Row count: 44

📋 Schema for table 'stock_metrics':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,symbol,VARCHAR,YES,,,
1,date,TIMESTAMP_MS,YES,,,
2,fiscalYear,BIGINT,YES,,,
3,period,VARCHAR,YES,,,
4,reportedCurrency,VARCHAR,YES,,,
...,...,...,...,...,...,...
102,priceToFairValue,DOUBLE,YES,,,
103,debtToMarketCap,DOUBLE,YES,,,
104,effectiveTaxRate,DOUBLE,YES,,,
105,enterpriseValueMultiple,DOUBLE,YES,,,


Row count: 129,384

📋 Schema for table 'stock_profiles':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,marketCap,BIGINT,YES,,,
1,ipoDate,VARCHAR,YES,,,
2,ceo,VARCHAR,YES,,,
3,cusip,VARCHAR,YES,,,
4,volume,BIGINT,YES,,,
5,sector,VARCHAR,YES,,,
6,image,VARCHAR,YES,,,
7,address,VARCHAR,YES,,,
8,price,DOUBLE,YES,,,
9,companyName,VARCHAR,YES,,,


Row count: 11,895

📋 Schema for table 'stock_quotes':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,symbol,VARCHAR,YES,,,
1,date,TIMESTAMP_MS,YES,,,
2,open,DOUBLE,YES,,,
3,low,DOUBLE,YES,,,
4,high,DOUBLE,YES,,,
5,close,DOUBLE,YES,,,
6,adjClose,DOUBLE,YES,,,
7,volume,BIGINT,YES,,,


Row count: 12,556,103

📋 Schema for table 'stock_tickers':


Unnamed: 0,column_name,column_type,null,key,default,extra
0,symbol,VARCHAR,YES,,,


Row count: 11,895
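
The loops above interpolate table names directly into SQL strings. That works here because the names come from `SHOW TABLES`, but a name containing spaces or quotes would break the statement. A minimal sketch of a quoting helper, assuming standard SQL double-quote escaping for identifiers (which DuckDB follows); `quote_ident` is a hypothetical helper, not part of the `duckdb` package:

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    if not name:
        raise ValueError("identifier must be a non-empty string")
    return '"' + name.replace('"', '""') + '"'


# Build the same statements as the cell above, with quoted identifiers.
for table in ["stock_quotes", 'odd "name"']:
    print(f"DESCRIBE {quote_ident(table)}")
    print(f"SELECT COUNT(*) FROM {quote_ident(table)}")
```

Quoted identifiers are case-sensitive in many databases, so this is best applied to names exactly as the database reports them.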


In [5]:
# Preview data from each table
for table_name in tables_df['name'].tolist():
    print(f"\n📊 Sample data from '{table_name}':")
    sample_df = duck_conn.execute(f"SELECT * FROM {table_name} LIMIT 5").fetchdf()
    display(sample_df)


📊 Sample data from 'exchanges':


Unnamed: 0,exchange,name,countryName,countryCode,symbolSuffix,delay
0,AMEX,New York Stock Exchange Arca,United States of America,US,,Real-time
1,BUE,Buenos Aires Stock Exchange,Argentina,AR,.BA,20 min
2,BVC,Colombia Stock Exchange,Colombia,CO,.CL,15 min
3,CBOE,Chicago Board Options Exchange,United States of America,US,,Real-time
4,CNQ,Canadian Securities Exchange,Canada,CA,.CN,Real-time



📊 Sample data from 'index_list':


Unnamed: 0,symbol,name,exchange,currency
0,^TTIN,S&P/TSX Capped Industrials Index,TSX,CAD
1,^NYA,NYSE Composite,NYSE,USD
2,^XAX,NYSE American Composite Index,NYSE,USD
3,^NYITR,NYSE International 100 Index,NYSE,USD
4,^DJU,Dow Jones Utility Average,NASDAQ,USD



📊 Sample data from 'index_quotes':


Unnamed: 0,symbol,date,price,volume
0,TX60.TS,2020-11-03,948.200012,94504821
1,TX60.TS,2020-11-04,952.299988,179295563
2,TX60.TS,2020-11-05,968.52002,127428692
3,TX60.TS,2020-11-06,966.869995,118405024
4,TX60.TS,2020-11-09,980.429993,246182781



📊 Sample data from 'risk_premium':


Unnamed: 0,country,continent,countryRiskPremium,totalEquityRiskPremium
0,Haiti,North America,16.02,20.35
1,Greenland,North America,1.12,5.45
2,Uruguay,South America,2.13,6.46
3,Montserrat,North America,2.93,7.26
4,Barbados,North America,8.68,13.01



📊 Sample data from 'stock_metrics':


Unnamed: 0,symbol,date,fiscalYear,period,reportedCurrency,marketCap,enterpriseValue,evToSales,evToOperatingCashFlow,evToFreeCashFlow,...,operatingCashFlowPerShare,capexPerShare,freeCashFlowPerShare,netIncomePerEBT,ebtPerEbit,priceToFairValue,debtToMarketCap,effectiveTaxRate,enterpriseValueMultiple,dividendPerShare
0,AACQU,2025-03-31,2025,Q1,USD,97451210.0,61180210.0,11.267074,-7.249699,-3.964246,...,-0.057466,0.047626,-0.105092,1.002236,0.808693,0.309754,0.074191,-0.002236,-2.590187,0.0
1,AACQU,2024-12-31,2024,Q4,USD,184392400.0,137768400.0,14.939101,-24.140247,-14.436591,...,-0.039616,0.026628,-0.066245,1.020066,0.81807,0.544951,0.029839,-0.020066,-13.210125,0.0
2,AACQU,2024-09-30,2024,Q3,USD,220816900.0,174171900.0,21.2353,-12.972734,-10.907561,...,-0.093634,0.017728,-0.111362,1.006323,2.11964,0.615336,0.024917,-0.006323,-5.1973,0.0
3,AACQU,2024-06-30,2024,Q2,USD,119108400.0,76534430.0,10.882188,-5.115254,-4.906682,...,-0.104626,0.004447,-0.109074,1.002777,1.065071,0.304907,0.076124,-0.002777,-4.668726,0.0
4,AACQU,2024-03-31,2024,Q1,USD,72205090.0,9640090.0,1.412467,-0.576044,-0.516231,...,-0.117994,0.013671,-0.131666,1.008554,0.766517,0.176893,0.121349,-0.008554,-0.859495,0.0



📊 Sample data from 'stock_profiles':


Unnamed: 0,marketCap,ipoDate,ceo,cusip,volume,sector,image,address,price,companyName,...,defaultImage,city,currency,website,zip,exchangeFullName,symbol,changePercentage,fullTimeEmployees,industry
0,10695127780,2021-02-10,Domenic J. Dell'Osso Jr.,165167735.0,2992655,Energy,https://images.financialmodelingprep.com/symbo...,6100 North Western Avenue,81.46,Chesapeake Energy Corporation,...,False,Oklahoma City,USD,https://www.chk.com,73118,NASDAQ Global Select,CHK,-0.9605,1000,Oil & Gas Exploration & Production
1,8652631527,2014-11-05,John C. Malone,530307305.0,22997,Communication Services,https://images.financialmodelingprep.com/symbo...,12300 Liberty Boulevard,60.28,Liberty Broadband Corporation,...,False,Englewood,USD,https://www.libertybroadband.com,80112,NASDAQ Global Select,LBRDK,-0.42947,1900,Telecommunications Services
2,60807002560,2010-01-05,Thomas Schafer,,100,Consumer Cyclical,https://images.financialmodelingprep.com/symbo...,Berliner Ring 2,,Volkswagen AG,...,False,Wolfsburg,USD,https://www.volkswagenag.com,38440,Other OTC,VLKAF,1.94554,639608,Auto - Manufacturers
3,9477156427,2009-09-29,Mikael Staffas MBA,,135,Basic Materials,https://images.financialmodelingprep.com/symbo...,Klarabergsviadukten 90,34.65,Boliden AB (publ),...,True,Stockholm,USD,https://www.boliden.com,101 20,Other OTC,BOLIF,0.0,6153,Industrial Materials
4,2491479594,2001-06-27,Daniel Llewellyn Rees,,90,Financial Services,https://images.financialmodelingprep.com/symbo...,33 City Centre Drive,,goeasy Ltd.,...,False,Mississauga,USD,https://www.goeasy.com,L5B 2N5,Other OTC,EHMEF,2.90817,2600,Financial - Credit Services



📊 Sample data from 'stock_quotes':


Unnamed: 0,symbol,date,open,low,high,close,adjClose,volume
0,0P00016G44,2020-12-24,10.23,10.23,10.23,10.23,10.23,0
1,A,2020-12-24,117.04,116.84,118.37,117.31,113.62,733600
2,AA,2020-12-24,22.34,21.81,22.34,21.96,21.1,1075501
3,AAALF,2020-12-24,20.775801,20.775801,20.775801,20.775801,18.550209,0
4,AAALY,2020-12-24,19.3,19.3,19.3,19.3,17.17,0



📊 Sample data from 'stock_tickers':


Unnamed: 0,symbol
0,0P00016G44
1,A
2,AA
3,AAALF
4,AAALY


## 🔌 Connect to Snowflake

We'll open a Snowpark session using connection parameters read from environment variables (loaded from a local `.env` file).

> ⚠️ **Security Note:** Never commit credentials to version control. Use environment variables or a secrets manager.

In [6]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Build connection parameters from environment variables
connection_parameters = {
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "user": os.getenv("SNOWFLAKE_USER"),
    "password": os.getenv("SNOWFLAKE_PASSWORD"),
    "role": os.getenv("SNOWFLAKE_ROLE"),
    "warehouse": os.getenv("SNOWFLAKE_WAREHOUSE")
}

# Create a session
session = Session.builder.configs(connection_parameters).create()
print(f"❄️ Connected to Snowflake account: {session.get_current_account()}")
print(f"Current role: {session.get_current_role()}")

❄️ Connected to Snowflake account: "OWVFCQY-OUB97142"
Current role: "ACCOUNTADMIN"
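
`os.getenv` returns `None` for an unset variable, so a missing entry in `.env` only surfaces when the session is created. A minimal sketch of a pre-flight check, assuming the five variable names used in the cell above; `missing_env_vars` is a hypothetical helper, not part of Snowpark or `python-dotenv`:

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_WAREHOUSE",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Snowflake connection variables are set")
```

Running this right after `load_dotenv()` would report every missing name at once, before any connection attempt.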


### Initialize Root API Object

The `Root` object is the entry point for Snowflake's Python API.

In [7]:
# Create root object
root = Root(session)
print("Root API object created successfully")

Root API object created successfully


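### 🔐 Create Lab Role (RBAC)

Create a dedicated role so the lab does not run as `ACCOUNTADMIN`. In the spirit of the **principle of least privilege**, the role receives only the two account-level privileges the lab needs: `CREATE WAREHOUSE` and `CREATE DATABASE`.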

In [8]:
# Create lab role using Python API
stocks_lab_role = Role(name="stocks_lab_role")
root.roles.create(stocks_lab_role, mode=CreateMode.if_not_exists)
print("Role 'stocks_lab_role' created successfully")

# Grant role to SYSADMIN
session.sql("GRANT ROLE stocks_lab_role TO ROLE SYSADMIN").collect()

# Grant necessary privileges to role
session.sql("GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE stocks_lab_role").collect()
session.sql("GRANT CREATE DATABASE ON ACCOUNT TO ROLE stocks_lab_role").collect()

print("Privileges granted to stocks_lab_role")

Role 'stocks_lab_role' created successfully
Privileges granted to stocks_lab_role
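
If the list of privileges grows, it can help to keep it in one place and generate the statements from it. A minimal sketch that only builds the SQL text for the two grants above; `account_grant_statements` is a hypothetical helper, and executing the statements is left to `session.sql(...)` as in the notebook cell:

```python
def account_grant_statements(role: str, privileges: list[str]) -> list[str]:
    """Build one GRANT ... ON ACCOUNT statement per privilege."""
    return [f"GRANT {priv} ON ACCOUNT TO ROLE {role}" for priv in privileges]


statements = account_grant_statements(
    "stocks_lab_role", ["CREATE WAREHOUSE", "CREATE DATABASE"]
)
for stmt in statements:
    print(stmt)
    # In the notebook this would be: session.sql(stmt).collect()
```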


### Switch to Lab Role

In [9]:
session.use_role("stocks_lab_role")
print(f"Switched to role: {session.get_current_role()}")

Switched to role: "STOCKS_LAB_ROLE"


---
# Part 1: Define Data Objects 📐

Create the foundational infrastructure:
- Database and schemas
- Compute warehouse
- Raw data tables (matching DuckDB schema)

### 🗄️ Create Database

In [10]:
# Create database
database_name = "stocks_db"
new_database = Database(name=database_name)
root.databases.create(new_database, mode=CreateMode.or_replace)

print(f"Database '{database_name}' created successfully")

# Set database context
session.use_database(database_name)
print(f"Session is now using database: {session.get_current_database()}")

Database 'stocks_db' created successfully
Session is now using database: "STOCKS_DB"


### 📁 Create Schemas

| Schema | Purpose |
|--------|---------|
| `raw` | Source data from DuckDB |
| `analytics` | Transformed, business-ready data |

In [11]:
# Get database reference
db = root.databases[database_name]

# Create raw schema
raw_schema = Schema(name="raw")
db.schemas.create(raw_schema, mode=CreateMode.or_replace)
print("Schema 'raw' created successfully")

# Create analytics schema
analytics_schema = Schema(name="analytics")
db.schemas.create(analytics_schema, mode=CreateMode.or_replace)
print("Schema 'analytics' created successfully")

Schema 'raw' created successfully
Schema 'analytics' created successfully


### ⚡ Create Compute Warehouse

In [12]:
# Create compute resource
virtual_warehouse_name = "stocks_lab_wh"

new_warehouse = Warehouse(
    name=virtual_warehouse_name,
    warehouse_size="MEDIUM"
)

root.warehouses.create(new_warehouse, mode=CreateMode.or_replace)
print(f"Warehouse '{virtual_warehouse_name}' created successfully")

# Grant usage on warehouse to stocks_lab_role
root.roles["stocks_lab_role"].grant_privileges(
    privileges=["USAGE"],
    securable_type="WAREHOUSE",
    securable=Securable(name=virtual_warehouse_name)
)
print("Granted USAGE on warehouse to stocks_lab_role")

# Use the warehouse
session.use_warehouse(virtual_warehouse_name)
print(f"Using virtual warehouse: {session.get_current_warehouse()}")

Warehouse 'stocks_lab_wh' created successfully
Granted USAGE on warehouse to stocks_lab_role
Using virtual warehouse: "STOCKS_LAB_WH"


### 📊 Create Raw Tables

We'll dynamically create tables in Snowflake based on the DuckDB schema.

In [13]:
# Helper function to map DuckDB types to Snowflake types
def duckdb_to_snowflake_type(duckdb_type: str, max_length: int = None) -> str:
    """Map DuckDB data types to Snowflake data types.
    
    Args:
        duckdb_type: The DuckDB column type
        max_length: Optional max string length for VARCHAR optimization
    """
    type_mapping = {
        'BIGINT': 'NUMBER(38,0)',
        'INTEGER': 'NUMBER(38,0)',
        'SMALLINT': 'NUMBER(38,0)',
        'TINYINT': 'NUMBER(38,0)',
        'DOUBLE': 'FLOAT',
        'FLOAT': 'FLOAT',
        'REAL': 'FLOAT',
        'DECIMAL': 'NUMBER(38,10)',
        'DATE': 'DATE',
        'TIMESTAMP': 'TIMESTAMP_NTZ(9)',
        'TIMESTAMP WITH TIME ZONE': 'TIMESTAMP_TZ(9)',
        'TIME': 'TIME(9)',
        'BOOLEAN': 'BOOLEAN',
        'BLOB': 'BINARY',
        'HUGEINT': 'NUMBER(38,0)',
    }
    
    duckdb_type_upper = duckdb_type.upper()
    
    # Handle VARCHAR/TEXT/STRING types with optimized sizing
    if any(t in duckdb_type_upper for t in ['VARCHAR', 'TEXT', 'STRING']):
        if max_length is not None and max_length > 0:
            # Add 20% buffer and round up to nearest power of 2 or nice number
            buffer_size = max(int(max_length * 1.2), max_length + 10)
            # Round to nice sizes: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, etc.
            nice_sizes = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
            for size in nice_sizes:
                if buffer_size <= size:
                    return f'VARCHAR({size})'
            # Larger than every nice size: use the buffered value as-is
            return f'VARCHAR({buffer_size})'
        else:
            # No data or empty strings - use minimal default
            return 'VARCHAR(256)'
    
    # Check for exact match
    if duckdb_type_upper in type_mapping:
        return type_mapping[duckdb_type_upper]
    
    # Check for partial matches (e.g., DECIMAL(10,2))
    for duck_type, snow_type in type_mapping.items():
        if duckdb_type_upper.startswith(duck_type):
            return snow_type
    
    # Default fallback
    return 'VARCHAR(256)'


def get_varchar_max_lengths(conn, table_name: str) -> dict:
    """Query DuckDB to get max string lengths for VARCHAR columns."""
    # Get schema info
    schema_df = conn.execute(f"DESCRIBE {table_name}").fetchdf()
    
    max_lengths = {}
    varchar_cols = []
    
    for _, row in schema_df.iterrows():
        col_name = row['column_name']
        col_type = row['column_type'].upper()
        
        if any(t in col_type for t in ['VARCHAR', 'TEXT', 'STRING']):
            varchar_cols.append(col_name)
    
    if varchar_cols:
        # Build a single query to get max lengths for all VARCHAR columns
        length_expressions = [f'MAX(LENGTH("{col}")) as "{col}"' for col in varchar_cols]
        query = f"SELECT {', '.join(length_expressions)} FROM {table_name}"
        result = conn.execute(query).fetchone()
        
        for i, col in enumerate(varchar_cols):
            max_lengths[col] = result[i] if result[i] is not None else 0
    
    return max_lengths

print("✅ Type mapping function with VARCHAR optimization created")

✅ Type mapping function with VARCHAR optimization created
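
One case the prefix match in `duckdb_to_snowflake_type` handles loosely is a parameterized type such as `DECIMAL(10,2)`: it falls through to `NUMBER(38,10)`, which changes the declared precision and scale. None of the column types listed earlier are `DECIMAL`, so this does not affect those columns. A minimal sketch of carrying the parameters across instead, assuming the type string has the form `DECIMAL(p,s)` with precision at most 38; `map_decimal` is a hypothetical helper:

```python
import re
from typing import Optional

_DECIMAL_RE = re.compile(r"^(?:DECIMAL|NUMERIC)\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def map_decimal(duckdb_type: str) -> Optional[str]:
    """Return NUMBER(p,s) for a DECIMAL(p,s) type string, else None."""
    match = _DECIMAL_RE.match(duckdb_type.strip().upper())
    if match is None:
        return None
    precision, scale = int(match.group(1)), int(match.group(2))
    if not (1 <= precision <= 38 and 0 <= scale <= precision):
        return None  # outside the range this sketch assumes
    return f"NUMBER({precision},{scale})"


print(map_decimal("DECIMAL(10,2)"))
print(map_decimal("decimal(18, 4)"))
print(map_decimal("DOUBLE"))
```

A caller could try `map_decimal` first and fall back to the existing mapping when it returns `None`.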


In [None]:
# Get schema reference
raw_schema_ref = db.schemas["raw"]

# Get list of tables from DuckDB
duckdb_tables = duck_conn.execute("SHOW TABLES").fetchdf()['name'].tolist()

print("📊 Creating tables with optimized VARCHAR sizes...\n")

for table_name in duckdb_tables:
    print(f"📋 Creating table '{table_name}' in Snowflake...")
    
    # Get schema from DuckDB
    schema_df = duck_conn.execute(f"DESCRIBE {table_name}").fetchdf()
    
    # Get max lengths for VARCHAR columns
    max_lengths = get_varchar_max_lengths(duck_conn, table_name)
    if max_lengths:
        print(f"   VARCHAR max lengths: {max_lengths}")
    
    # Build Snowflake column definitions with optimized types
    columns = []
    for _, row in schema_df.iterrows():
        col_name = row['column_name']
        duckdb_type = row['column_type']
        max_len = max_lengths.get(col_name)
        col_type = duckdb_to_snowflake_type(duckdb_type, max_len)
        
        columns.append(TableColumn(name=col_name, datatype=col_type))
        print(f"   - {col_name}: {duckdb_type} -> {col_type}")
    
    # Create the table
    snowflake_table = Table(name=table_name, columns=columns)
    raw_schema_ref.tables.create(snowflake_table, mode=CreateMode.or_replace)
    print(f"   ✅ Table '{table_name}' created\n")
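
To see what sizes the table-creation loop would pick, the buffer rule from `duckdb_to_snowflake_type` can be run on its own. A minimal standalone sketch that mirrors that rule (20% or 10 characters of headroom, whichever is larger, then rounded up to the next listed size); the observed lengths below are made-up inputs, not values measured from `stocks.db`:

```python
NICE_SIZES = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]


def padded_varchar_size(max_length: int) -> int:
    """Apply the headroom rule, then round up to the next listed size."""
    buffered = max(int(max_length * 1.2), max_length + 10)
    for size in NICE_SIZES:
        if buffered <= size:
            return size
    return buffered  # beyond the list: keep the buffered value unchanged


# Hypothetical observed maximum lengths.
for observed in (4, 24, 100, 300):
    print(f"max length {observed:>3} -> VARCHAR({padded_varchar_size(observed)})")
```

Because the declared length is derived from the current data, a later re-sync containing a longer value than the declared length would be rejected, which is worth keeping in mind before treating these sizes as permanent.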

---
# Part 2: Ingest Data 📥

Load data from DuckDB into Snowflake using bulk upload.

> 💡 **Pattern:** DuckDB → pandas → `write_pandas()` (bulk upload via Parquet staging)

### ⚡ Performance Optimizations:
| Optimization | Benefit |
|--------------|---------|
| **Optimized VARCHAR sizes** | Declared lengths reflect the observed data and reject unexpectedly long values. Snowflake stores only the actual string data, so this does not reduce storage. |
| **`write_pandas()`** | Bulk upload via internal Parquet staging, typically much faster than row-by-row inserts |
| **Pre-created tables** | Correct data types (DATE, NUMBER, etc.) preserved |
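
The loader in the next cell reads each table into a single pandas DataFrame, which for `stock_quotes` means about 12.6 million rows in memory at once. If memory is tight, one option is to load in slices. A minimal sketch of the slicing arithmetic only, under the assumption that each slice would be fetched with `LIMIT`/`OFFSET` over a stable ordering and appended rather than overwritten; `batch_ranges` is a hypothetical helper:

```python
from typing import Iterator


def batch_ranges(total_rows: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (offset, limit) pairs that cover total_rows without overlap."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total_rows, batch_size):
        yield offset, min(batch_size, total_rows - offset)


# stock_quotes has 12,556,103 rows; 2,000,000-row slices give 7 batches.
ranges = list(batch_ranges(12_556_103, 2_000_000))
print(len(ranges), ranges[-1])
```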

In [15]:
def load_table_to_snowflake(table_name: str) -> tuple[str, int, float]:
    """Load a single table from DuckDB to Snowflake using bulk upload."""
    start_time = time.perf_counter()
    
    # DuckDB -> pandas
    pandas_df = duck_conn.execute(f"SELECT * FROM {table_name}").fetchdf()
    row_count = len(pandas_df)
    
    if row_count == 0:
        print(f"  ⚠️ Table '{table_name}' is empty, skipping...")
        return table_name, 0, time.perf_counter() - start_time
    
    # Uppercase column names for Snowflake
    pandas_df.columns = [col.upper() for col in pandas_df.columns]
    
    # Convert date/datetime columns to ISO format strings
    # Snowflake auto-casts to DATE/TIMESTAMP based on pre-defined schema
    for col in pandas_df.columns:
        dtype_str = str(pandas_df[col].dtype)
        
        if 'datetime64' in dtype_str:
            pandas_df[col] = pandas_df[col].dt.strftime('%Y-%m-%d %H:%M:%S')
        elif pandas_df[col].dtype == 'object' and len(pandas_df[col].dropna()) > 0:
            first_val = pandas_df[col].dropna().iloc[0]
            if isinstance(first_val, date) and not isinstance(first_val, datetime):
                # pd.notna also skips NaT/NaN, which "is not None" would let through
                pandas_df[col] = pandas_df[col].apply(lambda x: x.strftime('%Y-%m-%d') if pd.notna(x) else None)
            elif isinstance(first_val, datetime):
                pandas_df[col] = pandas_df[col].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S') if pd.notna(x) else None)
    
    # Bulk upload using pre-created tables with optimized schema
    session.write_pandas(
        pandas_df,
        table_name=table_name.upper(),
        database="STOCKS_DB",
        schema="RAW",
        auto_create_table=False,
        overwrite=True,
        quote_identifiers=False
    )
    
    elapsed = time.perf_counter() - start_time
    return table_name, row_count, elapsed


# Load all tables
print("🚀 Starting data ingestion...\n")
total_start = time.perf_counter()

results = []
for table_name in duckdb_tables:
    name, row_count, elapsed = load_table_to_snowflake(table_name)
    results.append((name, row_count, elapsed))
    print(f"✅ {name}: {row_count:,} rows in {elapsed:.2f}s")

total_elapsed = time.perf_counter() - total_start
total_rows = sum(r[1] for r in results)

print(f"\n📊 Summary: Loaded {total_rows:,} total rows in {total_elapsed:.2f}s")
print(f"   Throughput: {total_rows/total_elapsed:,.0f} rows/second")

🚀 Starting data ingestion...

✅ exchanges: 14 rows in 3.43s
✅ index_list: 36 rows in 2.85s
✅ index_quotes: 111,946 rows in 6.97s
✅ risk_premium: 44 rows in 3.03s
✅ stock_metrics: 129,384 rows in 103.13s
✅ stock_profiles: 11,895 rows in 10.46s


✅ stock_quotes: 12,556,103 rows in 325.48s
✅ stock_tickers: 11,895 rows in 3.51s

📊 Summary: Loaded 12,821,317 total rows in 459.40s
   Throughput: 27,909 rows/second
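
The date handling in `load_table_to_snowflake` excludes `datetime` when testing for `date` because `datetime` is a subclass of `date`, so a plain `isinstance(value, date)` check matches both. A minimal standalone sketch of the same distinction, written by testing `datetime` first; `to_iso` is a hypothetical single-value helper, not the per-column logic used in the loader:

```python
from datetime import date, datetime
from typing import Optional, Union


def to_iso(value: Union[date, datetime, None]) -> Optional[str]:
    """Format a date or datetime the way the loader does; pass None through."""
    if value is None:
        return None
    # Test datetime first: every datetime is also an instance of date.
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, date):
        return value.strftime("%Y-%m-%d")
    raise TypeError(f"unsupported type: {type(value).__name__}")


print(to_iso(date(2020, 12, 24)))
print(to_iso(datetime(2025, 3, 31, 0, 0, 0)))
print(to_iso(None))
```

Note that the `%S` format keeps whole seconds only, so any sub-second part of a timestamp is dropped by this formatting.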


### ✅ Verify Data Loading

Confirm data was loaded correctly by checking row counts.

In [16]:
# Verify row counts match
print("📊 Data Verification Summary:\n")
print(f"{'Table':<30} {'DuckDB Rows':>15} {'Snowflake Rows':>15} {'Match':>10}")
print("-" * 75)

for table_name in duckdb_tables:
    # DuckDB count
    duckdb_count = duck_conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    
    # Snowflake count
    snowflake_count = session.table(f"stocks_db.raw.{table_name}").count()
    
    # Check match
    match = "✅" if duckdb_count == snowflake_count else "❌"
    
    print(f"{table_name:<30} {duckdb_count:>15,} {snowflake_count:>15,} {match:>10}")

📊 Data Verification Summary:

Table                              DuckDB Rows  Snowflake Rows      Match
---------------------------------------------------------------------------
exchanges                                   14              14          ✅
index_list                                  36              36          ✅
index_quotes                           111,946         111,946          ✅
risk_premium                                44              44          ✅
stock_metrics                          129,384         129,384          ✅
stock_profiles                          11,895          11,895          ✅
stock_quotes                        12,556,103      12,556,103          ✅
stock_tickers                           11,895          11,895          ✅
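
The printed comparison is easy to read, but it does not stop the notebook when counts differ. A minimal sketch of a check that can fail fast, assuming the counts have already been collected into two dictionaries; `count_mismatches` is a hypothetical helper, and the sample counts are three of the values reported above:

```python
def count_mismatches(source: dict[str, int], target: dict[str, int]) -> dict[str, tuple]:
    """Return {table: (source_count, target_count)} for every disagreement.

    A table missing from one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        pair = (source.get(table), target.get(table))
        if pair[0] != pair[1]:
            mismatches[table] = pair
    return mismatches


duckdb_counts = {"exchanges": 14, "index_list": 36, "risk_premium": 44}
snowflake_counts = {"exchanges": 14, "index_list": 36, "risk_premium": 44}
assert not count_mismatches(duckdb_counts, snowflake_counts)
print("All row counts match")
```

Matching row counts confirm that nothing was dropped, but not that values survived intact; spot checks on a few columns would be a reasonable complement.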


In [None]:
# Preview data in Snowflake
for table_name in duckdb_tables:
    print(f"\n📊 Sample data from 'stocks_db.raw.{table_name}':")
    session.table(f"stocks_db.raw.{table_name}").limit(5).show()

---
# Part 3: Transform Raw Data 🔄

Build transformation pipelines using **Snowflake Dynamic Tables**.

> 💡 **Note:** The transformations below are templates. Customize them based on your actual DuckDB schema!

### 💡 What are Dynamic Tables?

| Feature | Benefit |
|---------|---------|
| **Automatic Scheduling** | No manual orchestration needed |
| **Incremental Refresh** | Processes only changed rows when the defining query supports it; otherwise falls back to a full refresh |
| **Dependency Tracking** | Native DAG visualization in UI |
## Tier 1: Enrichment Layer

Let's create an enriched dynamic table on top of the raw data. The exact transformations depend on your DuckDB schema.

In [None]:
# First, let's examine what tables we have to work with
print("Available tables in raw schema:")
for table_name in duckdb_tables:
    print(f"\n📋 Table: {table_name}")
    session.table(f"stocks_db.raw.{table_name}").printSchema()

### 📈 Create Stock Price Enriched Table

This is a template - adjust based on your actual stock data columns!

In [19]:
# Use the first table as the template input. In this database the first name
# is 'exchanges', not a stock price table, so change the index (or name a
# table explicitly) to match your data.
primary_table = duckdb_tables[0] if duckdb_tables else None

if primary_table:
    print(f"Creating enriched view from: {primary_table}")
    
    # Get column names for the primary table
    table_df = session.table(f"stocks_db.raw.{primary_table}")
    columns = table_df.columns
    print(f"Available columns: {columns}")

Creating enriched view from: exchanges
Available columns: ['EXCHANGE', 'NAME', 'COUNTRYNAME', 'COUNTRYCODE', 'SYMBOLSUFFIX', 'DELAY']


In [20]:
# Example: Create a stocks_enriched dynamic table
# CUSTOMIZE THIS based on your actual schema!

if primary_table:
    # Build the enriched DataFrame
    # This is a generic example - modify based on your columns
    raw_df = session.table(f"stocks_db.raw.{primary_table}")
    
    # Create a simple enriched view (customize based on your data)
    # Adding a timestamp column as an example transformation
    stocks_enriched_df = raw_df.select(
        "*"
    ).with_column(
        "ingestion_timestamp", 
        F.current_timestamp()
    )
    
    # Create the dynamic table
    stocks_enriched_df.create_or_replace_dynamic_table(
        name=f"stocks_db.analytics.{primary_table}_enriched",
        warehouse="STOCKS_LAB_WH",
        lag="12 hours"
    )
    
    print(f"✅ Dynamic table '{primary_table}_enriched' created successfully")

✅ Dynamic table 'exchanges_enriched' created successfully
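
For orientation, the Snowpark call above corresponds roughly to a `CREATE OR REPLACE DYNAMIC TABLE` statement with `TARGET_LAG` and `WAREHOUSE` clauses. A minimal sketch that only builds such a statement as text; `dynamic_table_ddl` is a hypothetical helper, and the `SELECT` shown is an assumption about an equivalent query, not the SQL that Snowpark actually generates:

```python
def dynamic_table_ddl(name: str, warehouse: str, lag: str, query: str) -> str:
    """Assemble a CREATE OR REPLACE DYNAMIC TABLE statement as text."""
    return (
        f"CREATE OR REPLACE DYNAMIC TABLE {name}\n"
        f"  TARGET_LAG = '{lag}'\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"AS\n{query}"
    )


ddl = dynamic_table_ddl(
    name="stocks_db.analytics.exchanges_enriched",
    warehouse="STOCKS_LAB_WH",
    lag="12 hours",
    query=(
        "SELECT *, CURRENT_TIMESTAMP() AS ingestion_timestamp\n"
        "FROM stocks_db.raw.exchanges"
    ),
)
print(ddl)
```

Because `CURRENT_TIMESTAMP()` is evaluated when the dynamic table refreshes, the `ingestion_timestamp` column reflects refresh time, not the moment each row was loaded into the raw table.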


### ✅ Verify Tier 1 Dynamic Tables

In [21]:
# Check the enriched table
if primary_table:
    print(f"Sample data from {primary_table}_enriched:")
    session.table(f"stocks_db.analytics.{primary_table}_enriched").limit(10).show()

Sample data from exchanges_enriched:
----------------------------------------------------------------------------------------------------------------------------------------------------------
|"EXCHANGE"  |"NAME"                          |"COUNTRYNAME"             |"COUNTRYCODE"  |"SYMBOLSUFFIX"  |"DELAY"    |"INGESTION_TIMESTAMP"             |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|AMEX        |New York Stock Exchange Arca    |United States of America  |US             |N/A             |Real-time  |2025-12-24 14:53:09.619000-08:00  |
|BUE         |Buenos Aires Stock Exchange     |Argentina                 |AR             |.BA             |20 min     |2025-12-24 14:53:09.619000-08:00  |
|BVC         |Colombia Stock Exchange         |Colombia                  |CO             |.CL             |15 min     |2025-12-24 14:53:09.619000-08:00  |
|CBOE        |Chicago Board Optio

## Tier 2: Aggregated Metrics

Create aggregated views for analytics. Customize based on your data!

In [22]:
# Example: Create a summary metrics dynamic table
# CUSTOMIZE THIS based on your actual schema and business requirements!

if primary_table:
    # Get the enriched table
    enriched_df = session.table(f"stocks_db.analytics.{primary_table}_enriched")
    
    # Create summary metrics (customize based on your columns)
    summary_df = enriched_df.agg(
        F.count("*").alias("total_records"),
        F.max("ingestion_timestamp").alias("last_updated")
    )
    
    # Save as a regular table to keep the example simple. This is a one-time
    # snapshot: it will not refresh when the enriched table changes.
    summary_df.write.mode("overwrite").save_as_table(
        "stocks_db.analytics.data_summary"
    )
    
    print("✅ Summary metrics table created successfully")

✅ Summary metrics table created successfully


In [23]:
# View summary
print("Data Summary:")
session.table("stocks_db.analytics.data_summary").show()

Data Summary:
------------------------------------------------------
|"TOTAL_RECORDS"  |"LAST_UPDATED"                    |
------------------------------------------------------
|14               |2025-12-24 14:53:09.619000-08:00  |
------------------------------------------------------



### 📋 List All Dynamic Tables

In [24]:
# List all dynamic tables in the analytics schema
analytics_schema_ref = root.databases["stocks_db"].schemas["analytics"]
dynamic_table_collection = DynamicTableCollection(analytics_schema_ref)
dynamic_tables = list(dynamic_table_collection.iter())

print("Dynamic tables in 'analytics' schema:")
if dynamic_tables:
    for dt in dynamic_tables:
        print(f"  - {dt.name}")
else:
    print("  No dynamic tables found")

Dynamic tables in 'analytics' schema:
  - EXCHANGES_ENRICHED


---
# Part 4: Incremental Refresh ⚡

Optionally re-sync data from DuckDB, then refresh the pipeline and inspect the refresh history.

### 📊 Capture Initial State

In [25]:
# Get current row counts
print("Current row counts in Snowflake:")
for table_name in duckdb_tables:
    count = session.table(f"stocks_db.raw.{table_name}").count()
    print(f"  {table_name}: {count:,} rows")

Current row counts in Snowflake:
  exchanges: 14 rows
  index_list: 36 rows
  index_quotes: 111,946 rows
  risk_premium: 44 rows
  stock_metrics: 129,384 rows
  stock_profiles: 11,895 rows
  stock_quotes: 12,556,103 rows
  stock_tickers: 11,895 rows


### ➕ Re-sync Data (Optional)

Re-run the ingestion to refresh data from DuckDB.

In [26]:
# Re-sync uses the same load_table_to_snowflake function defined earlier
# Uncomment to re-sync all tables from DuckDB:

# for table_name in duckdb_tables:
#     name, row_count, elapsed = load_table_to_snowflake(table_name)
#     print(f"🔄 Re-synced {name}: {row_count:,} rows in {elapsed:.2f}s")

### 🔄 Trigger Dynamic Table Refresh

In [27]:
# Refresh dynamic tables
if primary_table:
    print(f"Refreshing {primary_table}_enriched...")
    session.sql(f"ALTER DYNAMIC TABLE stocks_db.analytics.{primary_table}_enriched REFRESH").collect()
    print("✅ Dynamic tables refreshed")

Refreshing exchanges_enriched...
✅ Dynamic tables refreshed


### 📜 Check Refresh History

In [28]:
# Check refresh history
if primary_table:
    print(f"Refresh history for {primary_table}_enriched:")
    session.sql(f"""
        SELECT name, refresh_action, state, refresh_start_time, refresh_trigger
        FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
            NAME => 'stocks_db.ANALYTICS.{primary_table.upper()}_ENRICHED'
        ))
        ORDER BY refresh_start_time DESC 
        LIMIT 5
    """).show()

Refresh history for exchanges_enriched:
------------------------------------------------------------------------------------------------------------
|"NAME"              |"REFRESH_ACTION"  |"STATE"    |"REFRESH_START_TIME"              |"REFRESH_TRIGGER"  |
------------------------------------------------------------------------------------------------------------
|EXCHANGES_ENRICHED  |NO_DATA           |SUCCEEDED  |2025-12-24 14:53:15.019000-08:00  |MANUAL             |
|EXCHANGES_ENRICHED  |FULL              |SUCCEEDED  |2025-12-24 14:53:09.758000-08:00  |CREATION           |
------------------------------------------------------------------------------------------------------------
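
In this output the creation refresh ran as `FULL`, and the manual refresh reported `NO_DATA`, which is consistent with the raw table not having changed in between. A minimal sketch of summarizing such history rows, assuming they have been copied into plain tuples (in the notebook they would come from `.collect()` on the query above):

```python
from collections import Counter

# Rows copied from the refresh history output above:
# (refresh_action, state, refresh_trigger)
history = [
    ("NO_DATA", "SUCCEEDED", "MANUAL"),
    ("FULL", "SUCCEEDED", "CREATION"),
]

# Count refreshes per action and collect any that did not succeed.
actions = Counter(action for action, _state, _trigger in history)
failed = [row for row in history if row[1] != "SUCCEEDED"]

print(dict(actions))
print(f"failed refreshes: {len(failed)}")
```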



---
# 🧹 Cleanup (Optional)

> ⚠️ **Warning:** Uncommenting the cleanup cell permanently deletes the database and warehouse created in this notebook, including all loaded data. The `stocks_lab_role` role is not dropped.

In [29]:
# Close DuckDB connection
duck_conn.close()
print("🦆 DuckDB connection closed")

🦆 DuckDB connection closed


In [30]:
# Uncomment the following lines to perform Snowflake cleanup

# # Drop database
# root.databases["stocks_db"].delete()
# print("Database 'stocks_db' dropped")

# # Drop warehouse
# root.warehouses["STOCKS_LAB_WH"].delete()
# print("Warehouse 'STOCKS_LAB_WH' dropped")

print("Cleanup section ready (uncomment to execute)")

Cleanup section ready (uncomment to execute)


In [31]:
# Close the Snowflake session
session.close()
print("❄️ Snowflake session closed")

❄️ Snowflake session closed


---
# 🎉 Summary & Resources

## What We Built

| Component | Description |
|-----------|-------------|
| **DuckDB Integration** | Read data from local DuckDB database |
| **Optimized Schema** | VARCHAR columns sized based on actual data |
| **Data Ingestion** | Bulk upload to Snowflake raw tables |
| **Dynamic Tables** | Automatic transformation pipeline |
| **Incremental Refresh** | Efficient data updates |

## Technologies Used

| Tool | Purpose |
|------|---------|
| **DuckDB** | Local analytical database |
| **Pandas** | Data interchange format |
| **Snowflake Python APIs** | Database/schema/table management |
| **Snowpark DataFrames** | Data querying and transformation |
| **Dynamic Tables** | Declarative pipeline orchestration |

---

## 📚 Additional Resources

- [DuckDB Documentation](https://duckdb.org/docs/)
- [Snowflake Dynamic Tables](https://docs.snowflake.com/en/user-guide/dynamic-tables-intro)
- [Snowpark Python Developer Guide](https://docs.snowflake.com/en/developer-guide/snowpark/python/index)
- [Snowflake Python API Reference](https://docs.snowflake.com/developer-guide/snowflake-python-api/reference/latest/index)