# R in Snowflake Workspace Notebooks

This notebook demonstrates how to use R within Snowflake Workspace Notebooks using rpy2.

**Capabilities:**
- Execute R code in `%%R` magic cells alongside Python
- Transfer data bidirectionally between Python and R
- Connect to Snowflake from R using ADBC, Reticulate, or DuckDB

**Quick Start:**
1. Run Section 1 (Installation)
2. Run Section 3.1 (Session Setup)
3. Choose your preferred data access method (Sections 3-7)

**Sections:**
1. [Installation & Configuration](#section-1-installation--configuration)
2. [Python & R Interoperability](#section-2-python--r-interoperability)
3. [Snowflake via ADBC](#section-3-snowflake-database-connectivity) (direct R connection)
4. [Key Pair Authentication](#section-4-alternative-authentication---key-pair-jwt) (alternative to PAT)
5. [Snowflake via Reticulate](#section-5-reticulate---access-snowpark-from-r) (easiest - no auth setup)
6. [Data Visualization with ggplot2](#section-6-data-visualization-with-ggplot2)
7. [DuckDB Integration](#section-7-duckdb-integration-experimental) (dplyr + dbplyr)
8. [Iceberg Integration](#section-8-iceberg-integration) (experimental)

---

# Section 1: Installation & Configuration

This section sets up R and rpy2 in the Workspace Notebook environment.

## Overview

Snowflake Workspace Notebooks run in containers with a managed Python kernel. To use R:

1. **Install R** via micromamba (lightweight conda-compatible package manager)
2. **Install rpy2** into the notebook's Python kernel
3. **Register `%%R` magic** for R cell support
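
Before running the setup, it can help to see which of these pieces are already visible to the Python kernel. The sketch below is an illustration only: `check_r_prerequisites` is a hypothetical helper (it is not part of `r_helpers`) and it uses nothing beyond the standard library. It cannot tell whether the `%%R` magic is registered; it only covers the first two steps.

```python
import importlib.util
import os
import shutil


def check_r_prerequisites(env=None):
    """Report which pieces of the R setup are already visible to Python.

    Hypothetical helper for illustration; not part of r_helpers.
    """
    env = os.environ if env is None else env
    r_home = env.get("R_HOME", "")
    return {
        # Step 1: is an R executable reachable through PATH?
        "r_on_path": shutil.which("R", path=env.get("PATH", "")) is not None,
        # R_HOME, when set, should point at an existing directory
        "r_home_valid": bool(r_home) and os.path.isdir(r_home),
        # Step 2: is rpy2 importable in this kernel?
        "rpy2_installed": importlib.util.find_spec("rpy2") is not None,
    }


status = check_r_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If all three report `yes`, the remaining work is usually just registering the magic (Section 1.2).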

## Customizing R Packages

Edit `r_packages.yaml` to customize which R packages are installed:

```yaml
# Conda-forge packages (installed via micromamba)
conda_packages:
  - r-base           # Required: Base R
  - r-tidyverse      # Data manipulation
  - r-yourpackage    # Add packages here

# CRAN packages (installed via install.packages)
cran_packages:
  - somepackage      # Packages not available on conda-forge
```
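
How the two lists might translate into install commands can be sketched as follows. This is an assumption about what a setup script could do, not a description of `setup_r_environment.sh`. The function name `build_install_commands` is hypothetical; the environment name `r_env` and the CRAN mirror are taken from cells later in this notebook.

```python
def build_install_commands(packages, env_name="r_env"):
    """Turn the parsed r_packages.yaml content into install commands.

    `packages` is the dict a YAML parser would return for the file above.
    Hypothetical helper; the real setup script may behave differently.
    """
    commands = []
    conda_pkgs = packages.get("conda_packages") or []
    cran_pkgs = packages.get("cran_packages") or []

    if conda_pkgs:
        # conda-forge packages are installed in a single micromamba call
        commands.append(
            f"micromamba install -n {env_name} -c conda-forge -y "
            + " ".join(conda_pkgs)
        )
    if cran_pkgs:
        # CRAN packages go through install.packages() inside R
        quoted = ", ".join(f'"{p}"' for p in cran_pkgs)
        commands.append(
            f"Rscript -e 'install.packages(c({quoted}), "
            'repos = "https://cloud.r-project.org/")\''
        )
    return commands


example = {
    "conda_packages": ["r-base", "r-tidyverse"],
    "cran_packages": ["somepackage"],
}
for cmd in build_install_commands(example):
    print(cmd)
```

Keeping the two lists separate matters because conda-forge packages arrive precompiled, while CRAN packages may need to be built from source.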

## Installation Options

| Command | Description |
|---------|-------------|
| `bash setup_r_environment.sh` | Basic R installation |
| `bash setup_r_environment.sh --adbc` | R + ADBC driver for Snowflake connectivity |
| `bash setup_r_environment.sh --verbose` | Show detailed logging |
| `bash setup_r_environment.sh --help` | Show all options |

### 1.1 Install R Environment

Run the setup script. Choose `--basic` for R only, or `--adbc` to also install the ADBC Snowflake driver.

**Note:** This step takes 2-5 minutes on first run. The `--adbc` option takes longer as it compiles the Snowflake driver.

The script includes:
- Pre-flight checks (disk space, network connectivity)
- Automatic retry for network operations
- Logging to `setup_r.log` for debugging
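
The pre-flight and retry ideas listed above can be sketched in Python. The setup script itself is a shell script, so this only illustrates the pattern. The names `has_free_space` and `retry`, the 2 GB threshold, and the delays are assumptions chosen for the example.

```python
import shutil
import time


def has_free_space(path=".", required_gb=2.0):
    """Pre-flight check: is there at least `required_gb` free at `path`?"""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:  # ConnectionError and TimeoutError are OSError subclasses
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Usage: a stand-in download that fails twice before succeeding
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"


print(has_free_space("."))
print(retry(flaky_download, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect.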

In [None]:
# Choose ONE of the following:

# Option A: Basic R installation (faster)
# !bash setup_r_environment.sh --basic

# Option B: R + ADBC for Snowflake connectivity (required for Section 3)
!bash setup_r_environment.sh --adbc

### 1.2 Configure Python Environment & Install rpy2

This cell uses the helper module to:
1. Point Python to the R environment
2. Install rpy2 into the notebook kernel
3. Register the `%%R` magic
4. Load output helper functions for cleaner display

**Run this cell after the installation script completes.**

**Output Helpers:** Workspace Notebooks add extra line breaks to R output. After setup, use these R functions for cleaner formatting:

| Function | Usage | Description |
|----------|-------|-------------|
| `rprint(x)` | `rprint(df)` | Print any object cleanly |
| `rview(df, n)` | `rview(iris, n=10)` | View data frame with optional row limit |
| `rglimpse(df)` | `rglimpse(df)` | Glimpse data frame structure |

In [None]:
# Setup R helpers
import sys
sys.path.insert(0, '.')  # Ensure current directory is in path

from r_helpers import setup_r_environment

result = setup_r_environment()

if result['success']:
    print("✓ R environment configured successfully")
    print(f"  R version: {result['r_version']}")
    print(f"  rpy2 installed: {result['rpy2_installed']}")
    print(f"  %%R magic registered: {result['magic_registered']}")
else:
    print("✗ Setup failed:")
    for error in result['errors']:
        print(f"  - {error}")

In [None]:
# Manual R configuration
# Uncomment and run if the helper-based setup in the previous cell fails

# import os
# import sys
# import subprocess

# ENV_PREFIX = "/root/.local/share/mamba/envs/r_env"
# os.environ["PATH"] = f"{ENV_PREFIX}/bin:" + os.environ["PATH"]
# os.environ["R_HOME"] = f"{ENV_PREFIX}/lib/R"

# subprocess.run([sys.executable, "-m", "pip", "install", "rpy2", "-q"], check=True)

# from rpy2.ipython import rmagic
# get_ipython().register_magics(rmagic.RMagics)
# print("R environment configured")

### 1.3 Verify R Installation

Test that R is working correctly.

In [None]:
%%R
# R__version_check
# Print R version (simple output works fine)
R.version.string

In [None]:
%%R
# R__list_packages
# List installed packages
# Use rprint() for cleaner output in Workspace Notebooks
ip <- as.data.frame(installed.packages()[, c(1, 3:4)])
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]
rprint(ip)

### 1.4 Run Diagnostics (Optional)

Run comprehensive environment diagnostics to verify all components are working.

In [None]:
from r_helpers import check_environment, print_diagnostics

# Run and display diagnostics
print_diagnostics()

### 1.5 Installing Additional R Packages

You can install R packages in two ways:

1. **Via `r_packages.yaml`** - Add packages before running the setup script (recommended for reproducibility)
2. **From within a `%%R` cell** - Install packages interactively during your session

The examples below show how to install packages from within the notebook.

In [None]:
%%R
# R__install_packages
# Install R packages into the micromamba environment
# Set the library path to ensure packages go to the right location
lib_path <- "/root/.local/share/mamba/envs/r_env/lib/R/library"
.libPaths(lib_path)

# Example: Install 'forecast' package if not already installed
if (!require("forecast", quietly = TRUE)) {
    cat("Installing forecast package...\n")
    install.packages("forecast", repos = "https://cloud.r-project.org/", lib = lib_path)
}

# Verify installation
library(forecast)
cat("forecast version:", as.character(packageVersion("forecast")), "\n")

In [None]:
%%R
# R__install_via_micromamba
# Alternative: Use micromamba for packages with complex dependencies
# This is better for compiled packages that need system libraries

# Install via micromamba (system() blocks until the install finishes)
system("/root/.local/share/mamba/bin/micromamba install -n r_env -c conda-forge r-forecast -y", 
       ignore.stdout = TRUE)

# Reload library path and verify
.libPaths("/root/.local/share/mamba/envs/r_env/lib/R/library")
library(forecast)
cat("forecast installed via micromamba\n")
cat("Version:", as.character(packageVersion("forecast")), "\n")

---

# Section 2: Python & R Interoperability

This section demonstrates how to work with data in both Python and R, including:
- Using the `%%R` magic for R cells
- Passing data from Python to R
- Passing data from R to Python
- Running R functions from Python

## 2.1 Using %%R Magic Cells

The `%%R` magic lets you write R code directly in a cell. The magic supports flags:

| Flag | Description |
|------|-------------|
| `-i var` | Import Python variable `var` into R |
| `-o var` | Export R variable `var` back to Python |
| `-w WIDTH` | Set plot width |
| `-h HEIGHT` | Set plot height |

In [None]:
%%R
# R__basic_operations
# Basic R operations
x <- c(1, 2, 3, 4, 5)
mean(x)

In [None]:
%%R
# R__tidyverse_demo
# Using tidyverse
library(dplyr)

rprint(
data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(85, 92, 78)
) %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE ~ "C"
  ))
)  

## 2.2 Passing Data: Python → R

Use the `-i` flag to pass Python objects into R cells.

In [None]:
# Create a pandas DataFrame in Python
import pandas as pd

python_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'population': [8336817, 3979576, 2693976, 2320268],
    'area_sq_mi': [302.6, 468.7, 227.3, 670.6]
})

print("Python DataFrame:")
python_df

In [None]:
%%R -i python_df
# R__python_to_r
# The Python DataFrame is now available in R as 'python_df'
library(dplyr)

cat("Received DataFrame in R:\n")
rglimpse(python_df)  # Use rglimpse() for clean output

# Perform R operations
result <- python_df %>%
  mutate(density = population / area_sq_mi) %>%
  arrange(desc(density))

rprint(result)  # Use rprint() for clean output

## 2.3 Passing Data: R → Python

Use the `-o` flag to export R objects back to Python.

In [None]:
%%R -o r_result
# R__r_to_python
# Create a data frame in R
r_result <- data.frame(
  x = 1:10,
  y = (1:10)^2,
  label = paste0("Point_", 1:10)
)

cat("Created R data.frame:\n")
rprint(r_result)  # Use rprint() for clean output

In [None]:
# The R data.frame is now available in Python
print("R result in Python:")
print(type(r_result))
print(r_result)

## 2.4 Using R from Python (without magic)

For more control, you can use rpy2's Python API directly.

In [None]:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R packages
base = importr('base')
stats = importr('stats')

# Run R code and get results
result = ro.r('sum(1:100)')
print(f"Sum of 1 to 100: {result[0]}")

In [None]:
# Convert pandas DataFrame to R and run R functions on it
import pandas as pd
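import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

# Re-import the R packages so this cell also runs on its own
base = importr('base')
stats = importr('stats')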

# Create sample data
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2.1, 3.9, 6.2, 7.8, 10.1]
})

# Convert to R and run linear regression
with (ro.default_converter + pandas2ri.converter).context():
    r_df = ro.conversion.get_conversion().py2rpy(df)

# Run linear regression in R
lm_result = stats.lm('y ~ x', data=r_df)
print("Linear Regression Results:")
print(base.summary(lm_result))

## 2.5 Working with R's Built-in Datasets

Access R's built-in datasets and convert them to pandas.

In [None]:
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Load the iris dataset in R
ro.r("data(iris)")

# Get the R data.frame
iris_r = ro.r["iris"]

# Convert to pandas DataFrame
with localconverter(ro.default_converter + pandas2ri.converter):
    iris_df = pandas2ri.rpy2py(iris_r)

print("Iris dataset (first 10 rows):")
iris_df.head(10)

---

# Section 3: Snowflake Database Connectivity

This section demonstrates connecting to Snowflake from R using ADBC.

**Prerequisites:**
- Run the setup script with `--adbc` flag (Section 1.1)
- Have appropriate Snowflake permissions

## Authentication Options

| Method | Status | Notes |
|--------|--------|-------|
| Python `get_active_session()` | ✅ Works | Use for Snowpark queries, bridge to R via rpy2 |
| ADBC with PAT | ✅ Works | Direct R-to-Snowflake, requires PAT token |
| SPCS OAuth Token | ❌ Blocked | Container token not authorized for ADBC |
| Username/Password | ❌ Blocked | SPCS requires OAuth |

## Connection Management

This notebook reuses a single ADBC connection rather than opening one per query: the connection is stored as `r_sf_con` in R's global environment and shared across cells, which avoids the overhead of reconnecting for each query.
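
The get-or-create pattern behind this can be sketched in Python. The R helpers keep the connection in `r_sf_con`; the class below is only an analogy, and its names (`ConnectionCache`, `get`, `close`) are hypothetical. The `connect` callable stands in for whatever opens the real connection.

```python
class ConnectionCache:
    """Get-or-create holder for a single, reusable connection."""

    def __init__(self, connect):
        self._connect = connect      # callable that opens a new connection
        self._connection = None

    def get(self):
        # Open a connection only if none is cached yet
        if self._connection is None:
            self._connection = self._connect()
        return self._connection

    def is_connected(self):
        return self._connection is not None

    def close(self):
        # Forget the cached connection so the next get() opens a fresh one.
        # A real implementation would also release the underlying resource.
        self._connection = None


opened = []


def fake_connect():
    # Stand-in for an expensive connection attempt
    opened.append(object())
    return opened[-1]


cache = ConnectionCache(fake_connect)
first = cache.get()
second = cache.get()
print("same object:", first is second, "| connections opened:", len(opened))
```

Every cell that calls `get()` receives the same object, so the cost of connecting is paid once per session.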

## 3.1 Setup Python Session

This cell loads configuration and establishes the Snowflake session.

**Configuration:**
- Copy `notebook_config.yaml.template` to `notebook_config.yaml`
- Edit with your account details (for Local IDE)
- The config provides database, schema, warehouse, and query settings

**Environments:**
- **Workspace Notebook**: Uses `get_active_session()` (built-in OAuth)
- **Local IDE (VSCode/Cursor)**: Uses config file for key-pair auth
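
The cell below resolves each setting from more than one source. A simplified sketch of that precedence (config file first, then the live session, then a built-in default) is shown here. `resolve_settings` is a hypothetical name, and the logic is deliberately reduced compared with the actual cell; the built-in defaults are the sample database and schema used throughout this notebook.

```python
def resolve_settings(config, session_values=None):
    """Pick each setting from the config, then the session, then a default."""
    builtin = {"database": "SNOWFLAKE_SAMPLE_DATA", "schema": "TPCH_SF1"}
    defaults = (config or {}).get("defaults") or {}
    session_values = session_values or {}

    resolved = {}
    for key in ("database", "schema", "warehouse"):
        # `or` falls through on missing keys as well as empty strings
        resolved[key] = (
            defaults.get(key) or session_values.get(key) or builtin.get(key)
        )
    return resolved


config = {"defaults": {"database": "MY_DB"}}
session_values = {"database": "SESSION_DB", "warehouse": "MY_WH"}
print(resolve_settings(config, session_values))
```

A value left blank in the YAML file behaves like a missing one, so the session value still applies.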

In [None]:
# Setup Snowflake session and load configuration
import os
import sys

# =============================================================================
# Load Configuration File
# =============================================================================
CONFIG_FILE = 'notebook_config.yaml'
CONFIG_TEMPLATE = 'notebook_config.yaml.template'

def load_config():
    """Load configuration from YAML file."""
    try:
        import yaml
    except ImportError:
        print("Installing PyYAML...")
        import subprocess
        subprocess.run([sys.executable, "-m", "pip", "install", "pyyaml", "-q"], check=True)
        import yaml
    
    if os.path.exists(CONFIG_FILE):
        with open(CONFIG_FILE) as f:
            config = yaml.safe_load(f)
        print(f"✓ Loaded config from {CONFIG_FILE}")
        return config
    elif os.path.exists(CONFIG_TEMPLATE):
        print(f"✗ Config not found!")
        print(f"  Copy {CONFIG_TEMPLATE} to {CONFIG_FILE} and customize.")
        return {}
    else:
        print("✗ No config file found, using defaults")
        return {}

CONFIG = load_config()

# Extract config sections for easy access
CONN_CONFIG = CONFIG.get('connection', {})
DEFAULTS = CONFIG.get('defaults', {})
QUERY_CONFIG = CONFIG.get('sample_queries', {})
ICEBERG_CONFIG = CONFIG.get('iceberg', {})

# =============================================================================
# Environment Detection and Session Setup
# =============================================================================
def detect_environment():
    """
    Detect if running in Snowflake Workspace Notebook or local IDE.
    Returns: ('workspace', session), ('workspace_error', message), or ('local', None)
    """
    workspace_indicators = [
        os.path.exists('/snowflake/session/token'),
        'SNOWFLAKE_HOST' in os.environ,
        '/home/udf' in os.getcwd(),
    ]
    
    if any(workspace_indicators):
        try:
            from snowflake.snowpark.context import get_active_session
            session = get_active_session()
            return ('workspace', session)
        except Exception as e:
            return ('workspace_error', str(e))
    else:
        return ('local', None)

# Detect environment
ENV_TYPE, ENV_RESULT = detect_environment()

if ENV_TYPE == 'workspace':
    session = ENV_RESULT
    
    # Get connection details from session, with config overrides
    ACCOUNT = session.sql('SELECT CURRENT_ACCOUNT()').collect()[0][0]
    USER = session.sql('SELECT CURRENT_USER()').collect()[0][0]
    DATABASE = DEFAULTS.get('database') or session.get_current_database()
    SCHEMA = DEFAULTS.get('schema') or session.get_current_schema()
    WAREHOUSE = DEFAULTS.get('warehouse') or session.get_current_warehouse()
    ROLE = session.get_current_role()
    
    # Build unified config
    ENV_CONFIG = {
        'account': ACCOUNT,
        'user': USER,
        'database': DATABASE,
        'schema': SCHEMA,
        'warehouse': WAREHOUSE,
        'role': ROLE,
    }
    
    print(f"\nEnvironment: Workspace Notebook")
    print(f"  Account:   {ACCOUNT}")
    print(f"  User:      {USER}")
    print(f"  Role:      {ROLE}")
    print(f"  Database:  {DATABASE}")
    print(f"  Schema:    {SCHEMA}")
    print(f"  Warehouse: {WAREHOUSE}")
    
elif ENV_TYPE == 'local':
    session = None
    
    # Use config file values for local IDE
    ENV_CONFIG = {
        'account': CONN_CONFIG.get('account', '<YOUR_ACCOUNT>'),
        'user': CONN_CONFIG.get('user', '<YOUR_USER>'),
        'database': DEFAULTS.get('database', 'SNOWFLAKE_SAMPLE_DATA'),
        'schema': DEFAULTS.get('schema', 'TPCH_SF1'),
        'warehouse': DEFAULTS.get('warehouse', '<YOUR_WAREHOUSE>'),
        'role': DEFAULTS.get('role', 'PUBLIC'),
        'private_key_path': CONN_CONFIG.get('private_key_path', '~/.ssh/snowflake_rsa_key.p8'),
    }
    
    print(f"\nEnvironment: Local IDE")
    print(f"  Account:   {ENV_CONFIG['account']}")
    print(f"  User:      {ENV_CONFIG['user']}")
    print(f"  Database:  {ENV_CONFIG['database']}")
    print(f"  Warehouse: {ENV_CONFIG['warehouse']}")
    print(f"  Key path:  {ENV_CONFIG['private_key_path']}")
    
    if any('<YOUR_' in str(value) for value in ENV_CONFIG.values()):
        print("\n⚠️  Some config values need to be set!")
        print(f"   Edit {CONFIG_FILE} with your values")
else:
    print(f"Warning: Environment detection issue: {ENV_RESULT}")
    ENV_CONFIG = {}
    session = None

# Make query config easily accessible
ROW_LIMIT = QUERY_CONFIG.get('default_row_limit', 1000)
LARGE_ROW_LIMIT = QUERY_CONFIG.get('large_row_limit', 10000)
SAMPLE_START_DATE = QUERY_CONFIG.get('sample_start_date', '1995-01-01')
TABLES = QUERY_CONFIG.get('tables', {'nation': 'NATION', 'customer': 'CUSTOMER', 'orders': 'ORDERS'})

### Authentication Methods Overview

This notebook supports multiple authentication methods for different connectivity approaches:

| Section | Method | Auth Type | Status | Environment |
|---------|--------|-----------|--------|-------------|
| **3.1** | `get_active_session()` | Built-in OAuth | ✅ Working | Workspace |
| **3.3-3.6** | ADBC + PAT | PAT Token | ✅ Working | Workspace |
| **4.2** | ADBC + Key Pair | JWT | ✅ Working | Both |
| **5** | Reticulate | Session OAuth | ✅ Working | Workspace |
| **7** (DuckDB) | Key Pair | JWT | ✅ Working | Local IDE |
| **7.3.1** | Python Bridge | Session OAuth | ✅ Working | Both |
| **8** | Horizon Catalog | JWT | ✅ Working | Local IDE |

**Recommended Path for Most Users:**
1. Run **Section 3.1** (required - sets up session)
2. Choose ONE of:
   - **Section 5** (Reticulate) - Easiest, uses built-in auth, works everywhere
   - **Section 7.3.1** (Python Bridge) - For dplyr workflows, works everywhere  
   - **Section 7** (DuckDB Direct) - For Local IDE only (key-pair required)
   - **Section 3** (ADBC) - For direct R-to-Snowflake in Workspace

## 3.2 Create Programmatic Access Token (PAT) (For ADBC - Optional)

**Used by:** Section 3 (R-ADBC) only. Skip if using Section 5 (Reticulate) or Section 7 (DuckDB).

PAT enables direct R-to-Snowflake ADBC connections. The session from Section 3.1 is used to create the token.
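
Because a PAT is a secret with a limited lifetime, two small habits are worth illustrating: never print the full token, and check the expiry before relying on it. The helpers below are hypothetical (they are not part of `PATManager`), and the one-day lifetime mirrors the comment in the cell below.

```python
from datetime import datetime, timedelta, timezone


def mask_token(token, visible=4):
    """Show only the last few characters of a secret for logging."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


def is_expired(expires_at, now=None):
    """Return True once the expiry time has passed (timezone-aware datetimes)."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at


created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
expires_at = created + timedelta(days=1)   # one-day lifetime

print(mask_token("example-token-value"))
print(is_expired(expires_at, now=created + timedelta(hours=23)))
```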

In [None]:
# Create PAT for authentication (requires session from 3.1)
from r_helpers import PATManager

# Uses the 'session' variable from Section 3.1
if session is None:
    print("Please run Section 3.1 first!")
else:
    pat_manager = PATManager(session)
    pat_result = pat_manager.create_pat()  # Creates PAT with 1 day expiry
    
    if pat_result['success']:
        print(f"✓ PAT created successfully")
        print(f"  Token: {pat_result['token'][:20]}...")
        print(f"  Expires: {pat_result['expires_at']}")
    else:
        print(f"✗ PAT creation failed: {pat_result.get('error', 'Unknown error')}")

In [None]:
# Check PAT status at any time (uses 'pat_manager' from the previous cell)
status = pat_manager.get_status()
print("PAT Status:")
for key, value in status.items():
    print(f"  {key}: {value}")

## 3.3 Validate ADBC Prerequisites

Before connecting, validate that all ADBC prerequisites are met.

In [None]:
from r_helpers import validate_adbc_connection

valid, message = validate_adbc_connection()
print(message)

## 3.4 Initialize R Connection Management

Load the connection management functions into R. This provides:
- `get_snowflake_connection()` - Get or create connection (stored as `r_sf_con`)
- `close_snowflake_connection()` - Close and release connection
- `is_snowflake_connected()` - Check connection status
- `snowflake_connection_status()` - Get detailed status

In [None]:
from r_helpers import init_r_connection_management

success, msg = init_r_connection_management()
print(msg)

## 3.5 Connect to Snowflake from R (ADBC)

Use `get_snowflake_connection()` to establish or reuse the ADBC connection.

The connection is stored as `r_sf_con` in R's global environment and is automatically reused in subsequent cells.

In [None]:
%%R
# R__adbc_connect
# Get or create the Snowflake connection
# Connection is stored globally as r_sf_con
r_sf_con <- get_snowflake_connection()

# Show connection status (uses print_connection_status() for clean output)
print_connection_status()

## 3.6 Query Snowflake from R

Run queries using the `r_sf_con` connection. The connection is automatically reused across cells.

In [None]:
%%R
# R__adbc_test_query
# Simple test query using r_sf_con
r_sf_con |>
  read_adbc("SELECT CURRENT_USER() AS USER, CURRENT_ROLE() AS ROLE, CURRENT_WAREHOUSE() AS WAREHOUSE") |>
  tibble::as_tibble()

In [None]:
%%R
# R__adbc_query_nations
# Query sample data from Snowflake
# Using the shared SNOWFLAKE_SAMPLE_DATA database
nations <- r_sf_con |>
  read_adbc("
    SELECT N_NATIONKEY, N_NAME, N_REGIONKEY 
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION 
    ORDER BY N_NATIONKEY
    LIMIT 10
  ") |>
  tibble::as_tibble()

nations

In [None]:
%%R
# R__adbc_query_orders
# More complex query with aggregation
library(dplyr)

orders_summary <- r_sf_con |>
  read_adbc("
    SELECT 
      O_ORDERSTATUS,
      COUNT(*) as ORDER_COUNT,
      SUM(O_TOTALPRICE) as TOTAL_VALUE,
      AVG(O_TOTALPRICE) as AVG_VALUE
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
    GROUP BY O_ORDERSTATUS
    ORDER BY ORDER_COUNT DESC
  ") |>
  tibble::as_tibble()

orders_summary

In [None]:
%%R
# R__adbc_verify_connection
# Verify connection is being reused (not recreated)
cat("Connection still valid:", is_snowflake_connected(), "\n")

## 3.7 Query from Python, Analyze in R

An alternative pattern: use Python's Snowpark session for querying, then pass data to R for analysis.

In [None]:
# Query Snowflake via Python
customers_df = session.sql(f"""
    SELECT 
        C_CUSTKEY,
        C_NAME,
        C_NATIONKEY,
        C_ACCTBAL
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
    LIMIT 100
""").to_pandas()

print(f"Retrieved {len(customers_df)} rows")
customers_df.head()

In [None]:
%%R -i customers_df
# R__analyze_customers
# Analyze the data in R
library(dplyr)

cat("Summary Statistics for Customer Account Balance:\n")
rprint(summary(customers_df$C_ACCTBAL))

cat("\nCustomers by Nation (top 5):\n")
result <- customers_df %>%
  group_by(C_NATIONKEY) %>%
  summarise(
    count = n(),
    avg_balance = mean(C_ACCTBAL),
    total_balance = sum(C_ACCTBAL)
  ) %>%
  arrange(desc(count)) %>%
  head(5)

rprint(result)  # Use rprint() for clean output

## 3.8 Check Connection Status

You can check the connection status from either Python or R.

In [None]:
# Check status from Python
from r_helpers import get_r_connection_status

status = get_r_connection_status()
print("R Connection Status (from Python):")
for key, value in status.items():
    print(f"  {key}: {value}")

In [None]:
%%R
# R__check_connection
# Get or create the Snowflake connection
# Connection is stored globally as r_sf_con
r_sf_con <- get_snowflake_connection()

# Show connection status (uses print_connection_status() for clean output)
print_connection_status()

## 3.9 Clean Up

Close ADBC connection and optionally remove the PAT.

In [None]:
%%R
# R__close_connection
# Close the Snowflake connection
close_snowflake_connection()

In [None]:
# Alternative: Close from Python
# from r_helpers import close_r_connection
# success, msg = close_r_connection()
# print(msg)

In [None]:
# Cleanup - remove PAT
# pat_manager.remove_pat()
# print("PAT removed")

---

# Section 4: Alternative Authentication - Key Pair (JWT)

This section demonstrates Key Pair (JWT) authentication as an alternative to PAT.

## Authentication Methods for R ADBC

| Method | Status | Notes |
|--------|--------|-------|
| **PAT (Programmatic Access Token)** | ✅ Working | **Recommended** - easiest to set up (see Section 3) |
| **Key Pair (JWT)** | ✅ Working | Alternative - no token expiry, shown below |
| SPCS OAuth Token | ❌ Blocked | Container token restricted to specific connectors |
| Username/Password | ❌ Blocked | SPCS enforces OAuth for internal connections |

> **Note:** For tests of non-working methods, see `archive/auth_methods_not_working.ipynb`

## Prerequisites

- ADBC installed (`--adbc` flag during setup)
- RSA key pair generated
- Public key registered with your Snowflake user

## 4.1 Load Alternative Auth Test Functions

Load the R functions for testing different authentication methods.

In [None]:
from r_helpers import init_r_alt_auth

success, msg = init_r_alt_auth()
print(msg)

## 4.2 Key Pair (JWT) Authentication (Alternative for ADBC)

**Used by:** Section 3 (R-ADBC) as an alternative to PAT, and Section 8 (Iceberg) for Horizon Catalog API.

Key pair authentication uses RSA keys instead of passwords or PATs. This method is MFA-compatible, and the registered key does not expire the way a PAT does.
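
To make the "JWT" part concrete, the sketch below assembles the claims of such a token using only the standard library. It assumes the claim layout Snowflake documents for key-pair JWTs (issuer `ACCOUNT.USER.SHA256:<fingerprint>`, subject `ACCOUNT.USER`). The function names are hypothetical, the key bytes are a placeholder, and the signing step (RS256 with the private key) is omitted because it needs a third-party library. Drivers normally build and sign this token for you.

```python
import base64
import hashlib
import time


def public_key_fingerprint(public_key_der):
    """SHA-256 fingerprint of a DER-encoded public key, base64-encoded."""
    digest = hashlib.sha256(public_key_der).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii")


def build_jwt_claims(account, user, public_key_der, lifetime_s=3600, now=None):
    """Assemble the claims of a key-pair JWT (signing is not shown)."""
    issued_at = int(time.time() if now is None else now)
    qualified_user = f"{account.upper()}.{user.upper()}"
    return {
        "iss": f"{qualified_user}.{public_key_fingerprint(public_key_der)}",
        "sub": qualified_user,
        "iat": issued_at,
        "exp": issued_at + lifetime_s,
    }


# Placeholder bytes stand in for a real DER-encoded public key
claims = build_jwt_claims(
    "myorg-myaccount", "jane", b"not-a-real-key", now=1_700_000_000
)
print(claims["sub"], claims["exp"] - claims["iat"])
```

The token is short-lived even though the key is not: each JWT carries its own `exp`, and a new one is issued from the same key when needed.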

### Step 1: Generate a Key Pair

In [None]:
# Key-pair authentication setup
from r_helpers import KeyPairAuth

# Initialize key pair auth helper
kp_auth = KeyPairAuth()

# Generate a new key pair (or use load_private_key() for existing key)
# Note: Requires 'cryptography' package: pip install cryptography
result = kp_auth.generate_key_pair(
    key_size=2048,
    output_dir="/tmp",
    passphrase=None  # Set a passphrase for encrypted key
)

if result['success']:
    print("✓ Key pair generated successfully")
    print(f"  Private key: {result['private_key_path']}")
    print(f"  Public key:  {result['public_key_path']}")
    print(f"\n  Public key for Snowflake registration:")
    print(f"  {result['public_key_for_snowflake'][:50]}...")
else:
    print(f"✗ Key generation failed: {result['error']}")

### Step 2: Register Public Key with Snowflake

Run this SQL to register the public key with your user (requires ACCOUNTADMIN or appropriate privileges).
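
The statement that `register_public_key_sql()` produces is not shown here, but a registration statement typically has the shape sketched below: the PEM header, footer, and line breaks are stripped, and the remaining base64 body is assigned to the user's `RSA_PUBLIC_KEY` property. The function names are hypothetical, the key is a truncated placeholder, and no validation or quoting of the user name is attempted.

```python
def public_key_body(pem_text):
    """Strip the PEM header, footer, and line breaks from a public key."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    return "".join(
        line for line in lines if line and not line.startswith("-----")
    )


def register_key_sql(user, pem_text):
    """Build an ALTER USER statement that attaches the public key to a user."""
    return f"ALTER USER {user} SET RSA_PUBLIC_KEY='{public_key_body(pem_text)}';"


# Truncated placeholder, not a usable key
pem = """-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOC
AQ8AMIIBCgKCAQEA
-----END PUBLIC KEY-----"""

print(register_key_sql("JANE", pem))
```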

In [None]:
# Generate key registration SQL
if result['success']:
    sql = kp_auth.register_public_key_sql(result['public_key_for_snowflake'])
    print("Run this SQL to register the public key:")
    print("-" * 60)
    print(sql)
    print("-" * 60)
    print("\nOr run via Snowpark session:")
    print("  session.sql(sql).collect()")

In [None]:
# Register public key in Snowflake
session.sql(sql).collect()

### Step 3: Configure and Test Key Pair Auth

In [None]:
# Configure environment for key pair auth
config = kp_auth.configure_for_adbc()
print("Key Pair Auth Configuration:")
for key, value in config.items():
    print(f"  {key}: {value}")

In [None]:
%%R
# R__keypair_auth_test
# Test key pair authentication
# Note: Public key must be registered with user first!
result <- test_keypair_auth()
rprint(result)

## 4.3 Authentication Summary

### Working Methods

| Method | Auth Type | Best For |
|--------|-----------|----------|
| **PAT** | `auth_pat` | Most use cases - easy programmatic setup |
| **Key Pair** | `auth_jwt` | Long-lived credentials without expiry |

### Non-Working Methods (Blocked by SPCS)

| Method | Reason |
|--------|--------|
| SPCS OAuth Token | Restricted to specific Snowflake connectors |
| Username/Password | SPCS enforces OAuth internally |

> See `archive/auth_methods_not_working.ipynb` for test code if needed.

---

# Section 5: Reticulate - Access Snowpark from R

This section demonstrates using **reticulate** to access the Python Snowpark session directly from R. This is an alternative to ADBC that leverages the notebook's built-in authentication.

## Advantages of Reticulate Approach

| Feature | Reticulate + Snowpark | ADBC |
|---------|----------------------|------|
| Authentication | Uses notebook's built-in auth | Requires PAT or Key Pair |
| Setup | No additional auth setup | PAT creation or key registration |
| Connection | Shares Python session | Separate R connection |
| Best for | Quick queries, prototyping | Production R pipelines |

## How It Works

1. R accesses Python's Snowpark session via reticulate
2. Execute SQL queries using `session$sql()`
3. Convert results to pandas DataFrame with `.to_pandas()`
4. Reticulate automatically converts pandas → R data.frame

## Output Pattern

For best display in Notebooks, use `%%R -o variable` to export R data frames to Python, then display them in a subsequent Python cell. This lets the Notebook render the DataFrame with proper formatting.

## 5.1 Setup Reticulate

Configure reticulate to use the notebook's Python environment.

> **Note:** You may see a warning about reticulate/rpy2 compatibility. This is safe to ignore if using reticulate >= 1.25 (installed by default). The issue was fixed in reticulate PR #1188.

In [None]:
%%R
# R__setup_reticulate
library(reticulate)

# Use the same Python that's running the notebook kernel
# This ensures we access the same Snowpark session
use_python(Sys.which("python3"), required = TRUE)

# Verify Python is accessible
py_config()

## 5.2 Access Snowpark Session from R

Import the Snowpark module and get the active session. This uses the notebook's built-in authentication - no PAT required!

In [None]:
%%R
# R__access_snowpark
# Import Snowpark module
snowpark <- import("snowflake.snowpark")

# Get the active session (uses notebook's built-in auth)
session <- snowpark$Session$builder$getOrCreate()

# Verify connection
rcat("Connected to Snowflake via Snowpark!")
rcat("Account: ", session$get_current_account())
rcat("User: ", session$get_current_user())
rcat("Database: ", session$get_current_database())
rcat("Schema: ", session$get_current_schema())

## 5.3 Query Snowflake and Get R DataFrame

Execute SQL queries and convert results to R data frames.

**Output Pattern:** Use `%%R -o variable` to export results to Python, then display in the next cell for nice Notebook formatting.

In [None]:
%%R -o nations_df
# R__query_nations
# Execute a query and get Snowpark DataFrame
# Use -o to export result to Python for nice display
nations_df <- session$sql("
    SELECT N_NATIONKEY, N_NAME, N_REGIONKEY 
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION 
    LIMIT 10
")$to_pandas()

# Print data type (R sees this as a data.frame)
cat("R data type:", class(nations_df), "\n")

In [None]:
# Display the exported DataFrame (nice Notebook rendering)
nations_df

## 5.4 R Analysis on Snowflake Data

Perform R analysis using dplyr on data retrieved via Snowpark. Use `-o` to export the result for display.

In [None]:
%%R -o customer_analysis
# R__customer_analysis
# Query customer data with aggregation
customers_df <- session$sql("
    SELECT 
        C_MKTSEGMENT,
        COUNT(*) as CUSTOMER_COUNT,
        AVG(C_ACCTBAL) as AVG_BALANCE,
        MIN(C_ACCTBAL) as MIN_BALANCE,
        MAX(C_ACCTBAL) as MAX_BALANCE
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
    GROUP BY C_MKTSEGMENT
    ORDER BY AVG_BALANCE DESC
")$to_pandas()

# Use dplyr for additional analysis
library(dplyr)

customer_analysis <- customers_df %>%
    mutate(
        BALANCE_RANGE = MAX_BALANCE - MIN_BALANCE,
        SEGMENT_SIZE = case_when(
            CUSTOMER_COUNT > 30000 ~ "Large",
            CUSTOMER_COUNT > 29000 ~ "Medium",
            TRUE ~ "Small"
        )
    )

cat("Analysis complete - result exported to Python\n")

In [None]:
# Display the R analysis result (exported via -o)
customer_analysis

## 5.5 Helper Function for Snowpark Queries

Create a convenience function to simplify querying. Use `-o` to export results for display.

In [None]:
%%R -o orders_summary
# R__snowpark_helper
#' Query Snowflake via Snowpark and return R data.frame
#' 
#' @param sql SQL query string
#' @return R data.frame with query results
snowpark_query <- function(sql) {
    session$sql(sql)$to_pandas()
}

# Example usage - export result with -o
orders_summary <- snowpark_query("
    SELECT 
        O_ORDERSTATUS,
        COUNT(*) as ORDER_COUNT,
        SUM(O_TOTALPRICE) as TOTAL_VALUE
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
    GROUP BY O_ORDERSTATUS
")

cat("Query complete - orders_summary exported to Python\n")

In [None]:
# Display the orders summary (exported via -o)
orders_summary

## 5.6 Reticulate vs ADBC Comparison

| Aspect | Reticulate + Snowpark | ADBC (Section 3 & 4) |
|--------|----------------------|----------------------|
| **Authentication** | Automatic (notebook's session) | PAT or Key Pair required |
| **Setup complexity** | Minimal | Moderate |
| **Data path** | Snowflake → Snowpark → pandas → R | Snowflake → Arrow → R |
| **Performance** | Good for moderate data | Better for large data (Arrow) |
| **R-native** | No (via Python) | Yes (native R driver) |
| **Best for** | Quick analysis, prototyping | Production R workflows |

### When to Use Each

**Use Reticulate + Snowpark when:**
- You need quick access without auth setup
- Working interactively/prototyping
- Data sizes are moderate (< 1M rows)
- You're already using Python and R together

**Use ADBC when:**
- You're building production R pipelines
- You're working with large datasets
- You need a pure R solution
- You require connection pooling/management
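
The guidance above can be condensed into a rough rule of thumb. The function below is only a sketch: its name and parameters are hypothetical, and the one-million-row cutoff is taken from the list above rather than from any benchmark.

```python
def suggest_access_method(expected_rows, production=False, needs_pure_r=False):
    """Return 'adbc' or 'reticulate' following the guidance in this section."""
    moderate_row_limit = 1_000_000  # "moderate" threshold from the list above
    if production or needs_pure_r or expected_rows >= moderate_row_limit:
        return "adbc"
    return "reticulate"


print(suggest_access_method(50_000))                   # reticulate
print(suggest_access_method(25_000_000))               # adbc
print(suggest_access_method(10_000, production=True))  # adbc
```

Treat the result as a starting point; authentication constraints in your account may matter more than row counts.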

---

# Section 6: Data Visualization with ggplot2

This section demonstrates creating visualizations with **ggplot2** and displaying them in the Notebook.

## Key Points

- ggplot2 is included via `tidyverse` (installed by default)
- Use `%%R -w WIDTH -h HEIGHT` to control plot dimensions (in pixels)
- Call `print(p)` explicitly to render the plot
- Plots render inline in the Notebook output

## Plot Size Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `-w` | Width in pixels | `-w 800` |
| `-h` | Height in pixels | `-h 500` |
| `--type` | Graphics device | `--type=cairo` (optional, better quality) |

## 6.1 Basic ggplot2 Example

Create a simple scatter plot using the built-in `mtcars` dataset.

In [None]:
%%R -w 700 -h 450
library(ggplot2)

# Basic scatter plot with mtcars
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3) +
    labs(
        title = "Fuel Efficiency vs Weight",
        x = "Weight (1000 lbs)",
        y = "Miles per Gallon",
        color = "Cylinders"
    ) +
    theme_minimal()

print(p)

## 6.2 Visualize Snowflake Data

Query Snowflake data and create a bar chart. Bar charts work best when values have meaningful differences from zero.

> **Tip:** Avoid bar charts when values are clustered in a narrow range (e.g., all ~$4,500). Use dot plots or adjust the visualization instead.
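
One way to turn this tip into a quick check is to compare the spread of the values with their magnitude. The sketch below is illustrative only: `suggest_chart` and the 10% threshold are assumptions made for this example, not part of ggplot2 or this notebook, and the numbers are made up.

```python
def suggest_chart(values, min_relative_spread=0.10):
    """Suggest 'bar' or 'dot' for a list of non-negative values.

    Hypothetical heuristic: if (max - min) / max falls below the
    threshold, bars anchored at zero would all look about the same height.
    """
    if not values:
        raise ValueError("values must not be empty")
    top = max(values)
    if top <= 0:
        return "dot"
    relative_spread = (top - min(values)) / top
    return "bar" if relative_spread >= min_relative_spread else "dot"


# Illustrative numbers: balances clustered near 4,500 vs. widely spread totals
print(suggest_chart([4480.0, 4495.5, 4510.2]))  # dot
print(suggest_chart([55.8, 111.7, 4.1]))        # bar
```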

In [None]:
%%R -w 800 -h 500
library(ggplot2)
library(dplyr)

# Query Snowflake for order data by status
# This data has more variance for a meaningful bar chart
orders <- session$sql("
    SELECT 
        O_ORDERSTATUS,
        COUNT(*) as ORDER_COUNT,
        ROUND(SUM(O_TOTALPRICE) / 1e9, 2) as TOTAL_VALUE_BILLIONS
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
    GROUP BY O_ORDERSTATUS
    ORDER BY TOTAL_VALUE_BILLIONS DESC
")$to_pandas()

# Create bar chart - good when values have meaningful differences
p <- ggplot(orders, aes(x = reorder(O_ORDERSTATUS, -TOTAL_VALUE_BILLIONS), 
                         y = TOTAL_VALUE_BILLIONS)) +
    geom_col(aes(fill = ORDER_COUNT), width = 0.6) +
    geom_text(aes(label = paste0("$", TOTAL_VALUE_BILLIONS, "B")), 
              vjust = -0.5, size = 4) +
    scale_fill_viridis_c(option = "plasma", labels = scales::comma) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(
        title = "Total Order Value by Status",
        subtitle = "Data from Snowflake TPC-H Sample",
        x = "Order Status",
        y = "Total Value ($ Billions)",
        fill = "Order\nCount"
    ) +
    theme_minimal(base_size = 12) +
    theme(
        plot.title = element_text(face = "bold")
    )

print(p)

## 6.3 Multi-Panel Visualization (Facets)

Create faceted plots to compare distributions across categories.

In [None]:
%%R -w 900 -h 600
library(ggplot2)
library(dplyr)

# Query order data by status and priority
orders <- session$sql("
    SELECT 
        O_ORDERSTATUS,
        O_ORDERPRIORITY,
        O_TOTALPRICE
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
    LIMIT 5000
")$to_pandas()

# Create faceted histogram
p <- ggplot(orders, aes(x = O_TOTALPRICE, fill = O_ORDERSTATUS)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    facet_wrap(~O_ORDERPRIORITY, scales = "free_y", ncol = 3) +
    scale_x_continuous(labels = scales::dollar_format(scale = 0.001, suffix = "K")) +
    scale_fill_brewer(palette = "Set2") +
    labs(
        title = "Order Value Distribution by Priority",
        subtitle = "Colored by Order Status",
        x = "Total Price",
        y = "Count",
        fill = "Status"
    ) +
    theme_light(base_size = 11) +
    theme(
        plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold")
    )

print(p)

## 6.4 Saving and Loading Plots

Use `ggsave()` to export plots to files, then display them from Python using `IPython.display.Image`.
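
`ggsave()` sizes are given in inches by default, while the `-w`/`-h` cell options are in pixels. As a rough sketch (the helper name `png_pixel_size` is made up for this example), the pixel size of a saved PNG is the size in inches multiplied by the DPI:

```python
def png_pixel_size(width_in, height_in, dpi):
    """Return (width_px, height_px) for a raster image saved at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# An 8 x 5 inch plot saved at 150 dpi
print(png_pixel_size(8, 5, 150))  # (1200, 750)
```

This helps when choosing `width`, `height`, and `dpi` so the saved file matches the size you expect when it is displayed from Python.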

In [None]:
%%R -w 700 -h 450
library(ggplot2)

# Create a plot
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point(aes(color = factor(gear)), size = 3) +
    geom_smooth(method = "lm", se = TRUE, color = "darkgray") +
    labs(
        title = "MPG vs Horsepower",
        x = "Horsepower",
        y = "Miles per Gallon",
        color = "Gears"
    ) +
    theme_bw()

# Display the plot inline
print(p)

# Save to file
ggsave("/tmp/mpg_vs_hp.png", p, width = 8, height = 5, dpi = 150)
cat("Plot saved to /tmp/mpg_vs_hp.png\n")

In [None]:
# Display the saved PNG file in the notebook
from IPython.display import Image, display

display(Image(filename="/tmp/mpg_vs_hp.png"))

---

# Section 7: DuckDB Integration (Experimental)

This section demonstrates using **DuckDB** as an intermediary between R and Snowflake, enabling:
- **dbplyr workflows** with Snowflake data via DuckDB's Snowflake extension
- **Local caching** of Snowflake query results for iterative analysis
- **Cross-environment compatibility** - works in both Workspace Notebooks and local IDEs (VSCode/Cursor)

## Architecture

```
R (dplyr/dbplyr)
    ↕ DBI
DuckDB (in-process analytics)
    ↕ Snowflake extension (ADBC)
Snowflake (key-pair auth)
```

## Prerequisites

1. Run setup with `--full` flag: `bash setup_r_environment.sh --full`
2. For local IDE: Configure key-pair authentication
3. The DuckDB Snowflake extension will be installed automatically

## 7.1 Verify Session Setup

**Session Setup**: Environment detection and session setup are handled in **Section 3.1**.
Make sure you've run that cell before proceeding with DuckDB integration.

The `ENV_TYPE` variable tells you which environment you're in:
- `'workspace'` - Workspace Notebook (uses OAuth)
- `'local'` - Local IDE (uses key-pair auth)

In [None]:
# Verify session from Section 3.1
if 'ENV_TYPE' not in dir():
    print("Please run Section 3.1 first to set up the session!")
else:
    print(f"Environment: {ENV_TYPE}")
    print(f"Session: {'Available' if session else 'Not available (local mode)'}")
    if ENV_CONFIG:
        print(f"Account: {ENV_CONFIG.get('account', 'N/A')}")

## 7.2 Configure Connection (Local IDE Only)

**Skip this section if running in Workspace Notebook.**

For local IDEs, set these environment variables before starting your notebook:

```bash
export SNOWFLAKE_ACCOUNT="your_account"     # e.g., "xy12345"  
export SNOWFLAKE_USER="your_user"
export SNOWFLAKE_DATABASE="SNOWFLAKE_SAMPLE_DATA"
export SNOWFLAKE_WAREHOUSE="COMPUTE_WH"
export SNOWFLAKE_PRIVATE_KEY_PATH="~/.ssh/snowflake_rsa_key.p8"
```

Then restart your notebook and run Section 3.1 to detect the environment.
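
How Section 3.1 reads these variables is not shown here, so the following is only a sketch of one way to collect them into a dictionary with the same keys that the verification cell reads from `ENV_CONFIG`. The function name `load_local_config` is hypothetical.

```python
import os

# Environment variable -> config key (keys match those read from ENV_CONFIG)
ENV_VARS = {
    "SNOWFLAKE_ACCOUNT": "account",
    "SNOWFLAKE_USER": "user",
    "SNOWFLAKE_DATABASE": "database",
    "SNOWFLAKE_WAREHOUSE": "warehouse",
    "SNOWFLAKE_PRIVATE_KEY_PATH": "private_key_path",
}


def load_local_config(environ=None):
    """Return (config, missing) built from Snowflake environment variables."""
    environ = os.environ if environ is None else environ
    config, missing = {}, []
    for var, key in ENV_VARS.items():
        value = environ.get(var, "").strip()
        if value:
            config[key] = value
        else:
            missing.append(var)
    return config, missing


config, missing = load_local_config()
print("Missing variables:", ", ".join(missing) or "none")
```

Reporting the missing variables up front makes it easier to spot a typo in a variable name before any connection attempt fails.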

In [None]:
# Local IDE verification (optional)
# Environment variables should be set before starting the notebook
# This cell just displays the current configuration

import os

if ENV_TYPE == 'local':
    print("Local IDE Configuration:")
    for key in ['account', 'user', 'database', 'warehouse', 'private_key_path']:
        print(f"  {key}: {ENV_CONFIG.get(key, 'N/A')}")

    # Check if key file exists
    key_path = os.path.expanduser(ENV_CONFIG.get('private_key_path', ''))
    if os.path.exists(key_path):
        print("\n✓ Private key file found")
    else:
        print(f"\n✗ Private key file NOT found: {key_path}")
else:
    print("Running in Workspace - no local configuration needed")

## 7.3 DuckDB + Snowflake Setup in R

This cell configures DuckDB with the Snowflake extension and creates a connection.

**Authentication:**
- **Local IDE**: Uses key-pair authentication (✅ tested, recommended)
- **Workspace Notebooks**: OAuth has known issues with the ADBC driver; **use the Python Bridge (Section 7.3.1) instead**

**Note**: The DuckDB Snowflake extension's OAuth support has known issues with token validation. 
For Workspace Notebooks, the **Python Bridge** approach in Section 7.3.1 is the recommended method.
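
The secret-creation cell in this section interpolates credentials into a `CREATE SECRET` statement as SQL string literals, so any single quote inside a value has to be doubled. A minimal sketch of that escaping, with a hypothetical helper name `sql_literal`:

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


print(sql_literal("xy12345"))  # 'xy12345'
print(sql_literal("O'Brien"))  # 'O''Brien'
```

Applying the same escaping to every interpolated field, not only the private key, avoids a malformed statement if an account, user, or warehouse name ever contains a quote.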

In [None]:
%%R
library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

cat("Loading DuckDB with Snowflake extension...\n")

# Connect to DuckDB (in-memory for speed, or file for persistence)
duckdb_con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Load the Snowflake extension
tryCatch({
    dbExecute(duckdb_con, "INSTALL snowflake FROM community")
    dbExecute(duckdb_con, "LOAD snowflake")
    cat("✓ Snowflake extension loaded\n")
}, error = function(e) {
    cat("✗ Error loading extension:", conditionMessage(e), "\n")
})

cat("DuckDB ready. Configure Snowflake secret in next cell.\n")

In [None]:
# Create DuckDB Snowflake secret (for Local IDE)
# NOTE: OAuth has known issues with DuckDB's ADBC driver
# For Workspace Notebooks, use the Python Bridge approach in Section 7.3.1 instead

import os

if ENV_TYPE == 'workspace':
    print("=" * 60)
    print("WORKSPACE NOTEBOOK DETECTED")
    print("=" * 60)
    print("\nDuckDB's direct Snowflake extension has OAuth issues in SPCS.")
    print("\nRECOMMENDED: Use the Python Bridge approach in Section 7.3.1")
    print("This queries Snowflake via Python and transfers data to R/DuckDB.")
    print("\nSkip the next R cell and proceed to Section 7.3.1.")
    # Use a placeholder dict instead of None for R interop
    duckdb_auth = {'method': 'none'}
    
else:
    # Local IDE: Use key-pair auth (tested and working)
    key_path = os.path.expanduser(ENV_CONFIG.get('private_key_path', ''))
    if os.path.exists(key_path):
        with open(key_path, 'r') as f:
            private_key = f.read()
        
        duckdb_auth = {
            'method': 'keypair',
            'account': ENV_CONFIG['account'],
            'user': ENV_CONFIG['user'],
            'database': ENV_CONFIG['database'],
            'warehouse': ENV_CONFIG['warehouse'],
            'private_key': private_key,
        }
        print(f"✓ Private key loaded from {key_path}")
        print(f"  Account: {ENV_CONFIG['account']}")
        print("\nRun the next R cell to create the Snowflake secret")
    else:
        print(f"✗ Private key not found: {key_path}")
        print("  Set SNOWFLAKE_PRIVATE_KEY_PATH environment variable")
        duckdb_auth = {'method': 'none'}

In [None]:
%%R -i duckdb_auth
# Create DuckDB Snowflake secret (Local IDE only)

if (duckdb_auth$method == "none") {
    cat("DuckDB direct Snowflake connection not configured.\n")
    cat("\nFor Workspace Notebooks: Use Python Bridge (Section 7.3.1)\n")
    cat("For Local IDE: Configure key-pair auth in previous cell\n")
} else if (duckdb_auth$method == "keypair") {
    cat("Creating Snowflake secret with key-pair auth...\n")
    cat("  Account:", duckdb_auth$account, "\n")
    cat("  User:", duckdb_auth$user, "\n")
    
    secret_sql <- sprintf("
CREATE OR REPLACE SECRET snowflake_secret (
    TYPE snowflake,
    ACCOUNT '%s',
    USER '%s',
    DATABASE '%s',
    WAREHOUSE '%s',
    AUTH_TYPE 'key_pair',
    PRIVATE_KEY '%s'
)",
        duckdb_auth$account,
        duckdb_auth$user,
        duckdb_auth$database,
        duckdb_auth$warehouse,
        gsub("'", "''", duckdb_auth$private_key)
    )
    
    tryCatch({
        dbExecute(duckdb_con, secret_sql)
        cat("✓ Key-pair secret created successfully\n")
        cat("\nRun next cell to attach Snowflake database\n")
    }, error = function(e) {
        cat("✗ Error:", conditionMessage(e), "\n")
    })
}

In [None]:
%%R
# Attach Snowflake as a catalog in DuckDB (Local IDE only)
# For Workspace Notebooks, skip to Section 7.3.1 Python Bridge

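# duckdb_auth is never NULL (the Python cell sets method = "none"), so check the method
if (!exists("duckdb_auth") || !identical(as.character(duckdb_auth$method), "keypair")) {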
    cat("Skipping - use Python Bridge (Section 7.3.1) for Workspace Notebooks\n")
} else {
    cat("Attaching Snowflake database...\n")
    
    tryCatch({
        dbExecute(duckdb_con, "ATTACH '' AS sf (TYPE snowflake, SECRET snowflake_secret, READ_ONLY)")
        cat("✓ Snowflake attached as 'sf' catalog\n\n")
        
        # List schemas
        cat("Available schemas:\n")
        schemas <- dbGetQuery(duckdb_con, 
            "SELECT schema_name FROM sf.information_schema.schemata ORDER BY schema_name LIMIT 10")
        rprint(schemas)
        
    }, error = function(e) {
        cat("✗ Error:", conditionMessage(e), "\n")
        cat("\nTroubleshooting:\n")
        cat("  - Verify account name format (e.g., 'xy12345' not full URL)\n")
        cat("  - Check private key is valid PKCS8 format\n")
        cat("  - Ensure public key is registered in Snowflake\n")
    })
}

### 7.3.1 Python Bridge (Recommended for Workspace Notebooks)

This approach queries Snowflake via Python Snowpark and transfers data to R/DuckDB for local analysis.

**Why use this?**
- Works reliably in both Workspace Notebooks and Local IDEs
- Uses the existing Snowpark session authentication
- No additional credential setup needed
- Ideal for dplyr/dbplyr workflows on fetched data

In [None]:
# Python Bridge: Query Snowflake, analyze with R/DuckDB
# This example uses SNOWFLAKE_SAMPLE_DATA (available to all Snowflake accounts)

if session is None:
    print("No session available. Run Section 3.1 first!")
else:
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter
    
    # Query TPCH sample data
    # Note: Uses SNOWFLAKE_SAMPLE_DATA which is available to all accounts
    query = """
        SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, 
               O_TOTALPRICE::FLOAT as O_TOTALPRICE, 
               O_ORDERDATE
        FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
        WHERE O_ORDERDATE >= '1995-01-01'
        LIMIT 10000
    """
    
    print("Querying SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS...")
    orders_df = session.sql(query).to_pandas()
    
    # Transfer to R environment
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_orders = ro.conversion.py2rpy(orders_df)
        ro.globalenv['sf_orders'] = r_orders
    
    print(f"✓ Transferred {len(orders_df):,} rows to R as 'sf_orders'")
    print(f"  Columns: {', '.join(orders_df.columns)}")
    print("\nUse %%R cells to analyze with dplyr:")
    print("  library(dplyr)")
    print("  sf_orders %>% group_by(O_ORDERSTATUS) %>% summarize(n = n())")

In [None]:
%%R
# Analyze data transferred via Python bridge
# This cell works in Workspace Notebooks without DuckDB

if (exists("sf_orders")) {
    library(dplyr)
    
    result <- sf_orders %>%
        mutate(order_year = format(as.Date(O_ORDERDATE), "%Y")) %>%
        group_by(order_year, O_ORDERSTATUS) %>%
        summarise(
            orders = n(),
            total_value = sum(O_TOTALPRICE, na.rm = TRUE),
            .groups = "drop"
        ) %>%
        arrange(order_year, desc(orders))
    
    cat("Order analysis using dplyr (via Python bridge):\n")
    rprint(result)
} else {
    cat("Note: sf_orders not found. Run the Python bridge cell above first.\n")
    cat("Or use the DuckDB approach if in local IDE.\n")
}

## 7.4 Query Snowflake with dplyr

The recommended pattern for dplyr workflows:
1. **Direct SQL** for fetching data from Snowflake
2. **Cache locally** in DuckDB for iterative analysis
3. **Use dplyr** on local cached tables

**Important**: When the database is set in the secret, reference tables as `sf.schema.table`: the `sf` catalog alias followed by a 2-part `schema.table` name, with no database component.
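
As a small illustration of this naming rule, the hypothetical helper below builds `sf.schema.table` names and rejects anything that is not a plain identifier. The alias `sf` comes from the `ATTACH` statement in Section 7.3; the function itself is an assumption made for this example.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def sf_table(schema, table, catalog="sf"):
    """Build a catalog.schema.table name from plain (unquoted) identifiers."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a plain identifier: {part!r}")
    return ".".join(parts)


print(sf_table("tpch_sf1", "customer"))  # sf.tpch_sf1.customer
```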

In [None]:
%%R
# Pattern 1: Direct SQL Query to Snowflake
# Best for: simple aggregations, data exploration

cat("Direct SQL query to Snowflake...\n\n")

tryCatch({
    # Note: Use sf.schema.table format (database set in secret)
    customers <- dbGetQuery(duckdb_con, "
        SELECT C_MKTSEGMENT, COUNT(*) as customers, ROUND(AVG(C_ACCTBAL), 2) as avg_balance
        FROM sf.tpch_sf1.customer
        GROUP BY C_MKTSEGMENT
        ORDER BY customers DESC
    ")
    
    cat("Customer analysis by market segment:\n")
    rprint(customers)
    
}, error = function(e) {
    cat("Error:", conditionMessage(e), "\n")
    cat("Note: Ensure database is set in secret (e.g., SNOWFLAKE_SAMPLE_DATA)\n")
})

In [None]:
%%R
# Pattern 2: Cache locally, then use dplyr
# Best for: complex analysis, joins, window functions

cat("Caching Snowflake data locally for dplyr analysis...\n\n")

tryCatch({
    # Cache data from Snowflake into local DuckDB table
    dbExecute(duckdb_con, "
        CREATE OR REPLACE TABLE orders_local AS 
        SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY
        FROM sf.tpch_sf1.orders
        LIMIT 50000
    ")
    cat("✓ Cached 50,000 orders locally\n\n")
    
    # Now use dplyr on the local table - fast and featureful!
    analysis <- tbl(duckdb_con, "orders_local") %>%
        mutate(
            order_year = year(O_ORDERDATE),
            priority = case_when(
                O_ORDERPRIORITY %in% c("1-URGENT", "2-HIGH") ~ "High",
                TRUE ~ "Normal"
            )
        ) %>%
        group_by(order_year, O_ORDERSTATUS, priority) %>%
        summarise(
            orders = n(),
            total_value = sum(O_TOTALPRICE, na.rm = TRUE),
            .groups = 'drop'
        ) %>%
        arrange(order_year, O_ORDERSTATUS) %>%
        collect()
    
    cat("Order analysis with dplyr:\n")
    rprint(head(analysis, 10))
    
}, error = function(e) {
    cat("Error:", conditionMessage(e), "\n")
})

## 7.5 Advanced Patterns

Additional patterns for DuckDB + Snowflake workflows.

In [None]:
%%R
# Pattern 3: Join local cached tables
# Best for: combining reference data with transactional data

cat("Caching reference tables for joins...\n\n")

tryCatch({
    # Cache reference tables
    dbExecute(duckdb_con, "CREATE OR REPLACE TABLE nations AS SELECT * FROM sf.tpch_sf1.nation")
    dbExecute(duckdb_con, "CREATE OR REPLACE TABLE regions AS SELECT * FROM sf.tpch_sf1.region")
    cat("✓ Reference tables cached\n\n")
    
    # Join using dplyr
    result <- tbl(duckdb_con, "nations") %>%
        inner_join(tbl(duckdb_con, "regions"), by = c("N_REGIONKEY" = "R_REGIONKEY")) %>%
        select(nation = N_NAME, region = R_NAME) %>%
        arrange(region, nation) %>%
        collect()
    
    cat("Nations by Region:\n")
    rprint(result)
    
}, error = function(e) {
    cat("Error:", conditionMessage(e), "\n")
})

In [None]:
%%R
# Pattern 4: Window functions with dplyr
# Best for: rankings, running totals, lead/lag analysis

cat("Window functions on cached data...\n\n")

tryCatch({
    # Ensure orders_local exists from previous cell
    if (!dbExistsTable(duckdb_con, "orders_local")) {
        dbExecute(duckdb_con, "
            CREATE TABLE orders_local AS 
            SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE
            FROM sf.tpch_sf1.orders
            LIMIT 50000
        ")
    }
    
    # Rank customers by total order value; dbplyr translates min_rank()
    # into a SQL window function (RANK() OVER ...)
    top_customers <- tbl(duckdb_con, "orders_local") %>%
        group_by(O_CUSTKEY) %>%
        summarise(
            orders = n(),
            total_value = sum(O_TOTALPRICE, na.rm = TRUE),
            avg_order = mean(O_TOTALPRICE, na.rm = TRUE),
            .groups = 'drop'
        ) %>%
        mutate(value_rank = min_rank(desc(total_value))) %>%
        filter(value_rank <= 10) %>%
        arrange(value_rank) %>%
        collect()
    
    cat("Top 10 customers by total order value:\n")
    rprint(top_customers)
    
}, error = function(e) {
    cat("Error:", conditionMessage(e), "\n")
})

## 7.6 Cleanup

Close the DuckDB connection when done.

In [None]:
%%R
# Cleanup: Disconnect from DuckDB
# Uncomment to close connection

# dbDisconnect(duckdb_con)
# cat("DuckDB connection closed\n")

---

# Section 8: Iceberg Integration via Horizon Catalog (Experimental)

This section covers accessing Snowflake-managed Iceberg tables from external query engines using the Horizon Catalog REST API.

**Status**: 🔬 Experimental - Some features may require additional configuration.

### Key Concepts

- **Horizon Catalog**: Snowflake's implementation of the Apache Iceberg REST API
- **Vended Credentials**: Temporary S3/Azure/GCS credentials provided by the catalog
- **Authentication**: JWT/OAuth flow using the same key-pair as Snowflake

### What Works Now
- ✅ JWT generation and token exchange
- ✅ Catalog metadata queries (list namespaces, tables)
- ✅ Table metadata retrieval (schema, partition specs, snapshots)
- ✅ DuckDB iceberg extension ATTACH

### In Progress
- ⚠️ Full DuckDB query support (requires vended credentials)
- ⚠️ PyIceberg integration

## 8.1 Create an Iceberg Table (One-Time Setup)

First, let's create a Snowflake-managed Iceberg table from the TPCH Nation data for testing.

**Prerequisites:**
- An external volume configured with S3/Azure/GCS storage
- `CREATE ICEBERG TABLE` privilege

**Configuration:** Update `iceberg.external_volume` in `notebook_config.yaml`

In [None]:
# Create Iceberg table from TPCH Nation data
# Uses ICEBERG_CONFIG from notebook_config.yaml

if session is None:
    print("No session available. Run Section 3.1 first!")
else:
    iceberg_table = ICEBERG_CONFIG.get('test_table_name', 'NATION_ICEBERG')
    external_vol = ICEBERG_CONFIG.get('external_volume', '<YOUR_EXTERNAL_VOLUME>')
    target_db = ENV_CONFIG.get('database', 'SIMON')
    target_schema = ENV_CONFIG.get('schema', 'PUBLIC')
    
    print(f"Creating Iceberg table: {target_db}.{target_schema}.{iceberg_table}")
    print(f"External volume: {external_vol}")
    
    if '<YOUR_' in external_vol:
        print("\n⚠️  Configure iceberg.external_volume in notebook_config.yaml first!")
    else:
        create_sql = f"""
        CREATE OR REPLACE ICEBERG TABLE {target_db}.{target_schema}.{iceberg_table}
            EXTERNAL_VOLUME = '{external_vol}'
            CATALOG = 'SNOWFLAKE'
            BASE_LOCATION = '{iceberg_table.lower()}/'
        AS
        SELECT 
            N_NATIONKEY::NUMBER(38,0) as N_NATIONKEY,
            N_NAME::STRING as N_NAME,
            N_REGIONKEY::NUMBER(38,0) as N_REGIONKEY,
            N_COMMENT::STRING as N_COMMENT
        FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION
        """
        
        try:
            session.sql(create_sql).collect()
            print(f"✓ Iceberg table created: {target_db}.{target_schema}.{iceberg_table}")
            
            # Verify
            count = session.sql(f"SELECT COUNT(*) FROM {target_db}.{target_schema}.{iceberg_table}").collect()[0][0]
            print(f"  Row count: {count}")
        except Exception as e:
            print(f"✗ Error: {e}")

## 8.2 Horizon Catalog Authentication

Generate a JWT and exchange it for an access token to authenticate with the Horizon Catalog REST API.

**Note**: This uses key-pair authentication. You need a private key configured (see Section 4.2).

The `ENV_CONFIG` from Section 3.1 provides the account and user details.
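
The JWT claims used for key-pair authentication can be sketched with the standard library alone. The example below mirrors the claim layout used by the helper in this section (issuer `ACCOUNT.USER.SHA256:<fingerprint>`, one-hour lifetime). `build_jwt_claims` is a hypothetical name, signing is not shown, and the bytes passed in stand in for a real DER-encoded public key.

```python
import base64
import hashlib
import time


def build_jwt_claims(account, user, public_key_der, lifetime_s=3600, now=None):
    """Build the claim set for a key-pair JWT (signing is not shown)."""
    now = int(time.time()) if now is None else int(now)
    # Fingerprint = base64(SHA-256(DER-encoded public key))
    digest = hashlib.sha256(public_key_der).digest()
    fingerprint = base64.b64encode(digest).decode()
    qualified_user = f"{account.upper()}.{user.upper()}"
    return {
        "iss": f"{qualified_user}.SHA256:{fingerprint}",
        "sub": qualified_user,
        "iat": now,
        "exp": now + lifetime_s,
    }


# Placeholder bytes, NOT a real key
claims = build_jwt_claims(
    "myorg-myaccount", "jdoe", b"placeholder-der-bytes", now=1_700_000_000
)
print(claims["sub"], claims["exp"] - claims["iat"])  # MYORG-MYACCOUNT.JDOE 3600
```

Separating claim construction from signing makes it easy to inspect the issuer string when a token exchange is rejected.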

In [None]:
# Horizon Catalog auth helper (uses ENV_CONFIG from Section 3.1)
import base64
import hashlib
import os
import time

import jwt  # PyJWT
import requests
from cryptography.hazmat.primitives import serialization

def get_horizon_access_token(account, user, private_key_path, role=None):
    """Generate a key-pair JWT and exchange it for a Horizon Catalog access token."""
    
    # Load private key
    key_path = os.path.expanduser(private_key_path)
    with open(key_path, 'rb') as f:
        private_key = serialization.load_pem_private_key(f.read(), password=None)
    
    # Get public key fingerprint
    public_key = private_key.public_key()
    public_key_bytes = public_key.public_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PublicFormat.SubjectPublicKeyInfo
    )
    fingerprint = hashlib.sha256(public_key_bytes).digest()
    fingerprint_b64 = base64.b64encode(fingerprint).decode()
    
    # Build JWT
    qualified_user = f"{account.upper()}.{user.upper()}"
    now = int(time.time())
    payload = {
        "iss": f"{qualified_user}.SHA256:{fingerprint_b64}",
        "sub": qualified_user,
        "iat": now,
        "exp": now + 3600,
    }
    
    token = jwt.encode(payload, private_key, algorithm="RS256")
    
    # Exchange JWT for access token
    url = f"https://{account}.snowflakecomputing.com/polaris/api/catalog/v1/oauth/tokens"
    
    data = {
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": token,
    }
    if role:
        data["scope"] = f"PRINCIPAL_ROLE:{role}"
    
    resp = requests.post(
        url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        timeout=30,  # avoid hanging the cell if the endpoint is unreachable
    )

    if resp.status_code == 200:
        return resp.json().get("access_token")
    raise RuntimeError(f"Token exchange failed: {resp.status_code} - {resp.text}")

# Get access token using settings from ENV_CONFIG
if ENV_TYPE == 'local':
    key_path = ENV_CONFIG.get('private_key_path', '~/.ssh/snowflake_rsa_key.p8')
    try:
        ACCESS_TOKEN = get_horizon_access_token(
            account=ENV_CONFIG['account'],
            user=ENV_CONFIG['user'],
            private_key_path=key_path
        )
        print(f"✓ Access token obtained (first 20 chars): {ACCESS_TOKEN[:20]}...")
    except Exception as e:
        print(f"✗ Error getting access token: {e}")
        ACCESS_TOKEN = None
else:
    print("Note: Horizon Catalog auth requires key-pair (local IDE mode)")
    print("In Workspace, use the Python Bridge approach from Section 7.3.1")
    ACCESS_TOKEN = None

## 8.3 Query Horizon Catalog Metadata

Use the REST API to list namespaces and tables in the Iceberg catalog.
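
The endpoint paths used in this section follow a simple pattern under the catalog base URL. Below is a sketch of a URL builder, assuming the `/polaris/api/catalog/v1` base path shown in Section 8.2. The helper name `horizon_url` and the account and database values are made up for illustration.

```python
from urllib.parse import quote


def horizon_url(account, database, namespace=None, table=None):
    """Build a Horizon Catalog REST URL for namespaces, tables, or one table."""
    base = f"https://{account}.snowflakecomputing.com/polaris/api/catalog/v1"
    path = f"{quote(database, safe='')}/namespaces"
    if namespace is not None:
        path += f"/{quote(namespace, safe='')}/tables"
        if table is not None:
            path += f"/{quote(table, safe='')}"
    elif table is not None:
        raise ValueError("table requires a namespace")
    return f"{base}/{path}"


# List namespaces, then address a single table
print(horizon_url("myorg-myaccount", "MY_DB"))
print(horizon_url("myorg-myaccount", "MY_DB", "PUBLIC", "NATION_ICEBERG"))
```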

In [None]:
# Query Horizon Catalog API
import requests
import json
import os

# Configuration - update these for your environment
HORIZON_CONFIG = {
    'account': os.environ.get('SNOWFLAKE_ACCOUNT', '<YOUR_ORG>-<YOUR_ACCOUNT>'),
    'user': os.environ.get('SNOWFLAKE_USER', '<YOUR_USER>'),
    'role': 'SYSADMIN',
    'database': '<YOUR_DATABASE>',  # Database with Iceberg tables
    'private_key_path': os.environ.get(
        'SNOWFLAKE_PRIVATE_KEY_PATH',
        '~/.snowflake/keys/rsa_key.p8'
    )
}

def query_horizon_catalog(endpoint, access_token):
    """Query the Horizon Catalog REST API."""
    account = HORIZON_CONFIG['account']
    base_url = f"https://{account}.snowflakecomputing.com/polaris/api/catalog/v1"
    
    response = requests.get(
        f"{base_url}/{endpoint}",
        headers={
            'Authorization': f'Bearer {access_token}',
            'Content-Type': 'application/json'
        }
    )
    
    if response.status_code == 200:
        return response.json()
    else:
        return {'error': response.status_code, 'message': response.text}

# Example usage (uncomment when configured):
"""
# Get access token (keyword arguments match the signature from Section 8.2)
access_token = get_horizon_access_token(
    account=HORIZON_CONFIG['account'],
    user=HORIZON_CONFIG['user'],
    private_key_path=HORIZON_CONFIG['private_key_path'],
    role=HORIZON_CONFIG['role'],
)

# List namespaces (schemas)
database = HORIZON_CONFIG['database']
namespaces = query_horizon_catalog(f"{database}/namespaces", access_token)
print("Namespaces:", json.dumps(namespaces, indent=2))

# List tables in PUBLIC schema
tables = query_horizon_catalog(f"{database}/namespaces/PUBLIC/tables", access_token)
print("Tables:", json.dumps(tables, indent=2))

# Get table metadata
table_meta = query_horizon_catalog(
    f"{database}/namespaces/PUBLIC/tables/NATION_ICEBERG",
    access_token
)
print("Table metadata:", json.dumps(table_meta, indent=2)[:500])
"""

print("Horizon Catalog query functions defined.")
print("Configure HORIZON_CONFIG and uncomment example code to test.")

## 8.4 DuckDB Iceberg Extension (Experimental)

DuckDB's iceberg extension can connect to REST catalogs including Snowflake Horizon.

**Current Status**: Catalog attachment works, but data queries may fail due to vended credentials limitations. See design document for workarounds.

In [None]:
%%R
# DuckDB Iceberg Integration (Experimental)
# 
# NOTE: This demonstrates attaching to Horizon Catalog.
# Full query support may require vended credentials configuration.

library(DBI)
library(duckdb)

# Connect to DuckDB
iceberg_con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Install and load iceberg extension
dbExecute(iceberg_con, "INSTALL iceberg")
dbExecute(iceberg_con, "LOAD iceberg")

cat("Iceberg extension loaded\n")

# Configuration - update for your environment
# Uncomment and configure when you have an access token
# (R has no triple-quoted strings, so the example is kept as comments)
#
# ACCOUNT <- 'MYORG-MYACCOUNT'
# DATABASE <- 'MY_DATABASE'
# ACCESS_TOKEN <- '<your_access_token_from_section_8.2>'
#
# # Attach to Horizon Catalog
# attach_sql <- sprintf(
#     "ATTACH '%s' AS horizon (
#         TYPE ICEBERG,
#         ENDPOINT 'https://%s.snowflakecomputing.com/polaris/api/catalog',
#         TOKEN '%s'
#     )",
#     DATABASE,
#     ACCOUNT,
#     ACCESS_TOKEN
# )
#
# dbExecute(iceberg_con, attach_sql)
# cat('Horizon Catalog attached\n')
#
# # List tables (this works!)
# tables <- dbGetQuery(iceberg_con,
#     "SELECT * FROM duckdb_tables() WHERE database_name = 'horizon'")
# print(tables)
#
# # Query table (may fail with current vended credentials limitations)
# # tryCatch({
# #     result <- dbGetQuery(iceberg_con, 'SELECT * FROM horizon.PUBLIC.NATION_ICEBERG')
# #     print(result)
# # }, error = function(e) {
# #     cat('Query failed - see design doc for workarounds:\n')
# #     cat(conditionMessage(e), '\n')
# # })

cat("\nDuckDB Iceberg demo configured.\n")
cat("Configure variables and uncomment code to test.\n")

## 8.5 Recommended Alternative: Snowflake + DuckDB Hybrid

Until full Iceberg REST catalog support is available, use the working DuckDB Snowflake extension approach from Section 7:

1. **Query Snowflake via ADBC** (using DuckDB Snowflake extension)
2. **Cache results locally** in DuckDB
3. **Use dplyr/dbplyr** on the local cache

This provides the same benefits (local processing, R ecosystem) with full support today.

In [None]:
%%R
# Hybrid Approach: Best of Both Worlds
# Use the working DuckDB + Snowflake pattern for Iceberg-like benefits

# Assuming DuckDB connection from Section 7 is active (duckdb_con)
# If not, re-run Section 7.3

# Example: Query Snowflake Iceberg table, cache locally
# (Even though it's an Iceberg table in Snowflake, query via SQL works!)

# Query the Iceberg table via standard Snowflake SQL
# (kept as comments because R has no triple-quoted strings)
#
# dbExecute(duckdb_con, "
#     CREATE OR REPLACE TABLE nation_iceberg_local AS
#     SELECT * FROM sf.PUBLIC.NATION_ICEBERG
# ")
#
# # Now use dplyr on the local cache
# library(dplyr)
# library(dbplyr)
#
# tbl(duckdb_con, 'nation_iceberg_local') %>%
#     group_by(N_REGIONKEY) %>%
#     summarise(
#         nations = n(),
#         sample_name = first(N_NAME)
#     ) %>%
#     collect() %>%
#     print()

cat("Hybrid pattern example ready.\n")
cat("Uncomment code after running Section 7 DuckDB setup.\n")

---

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| `ModuleNotFoundError: No module named 'rpy2'` | Run Section 1.2 to install rpy2 |
| `R.version.string` returns error | Verify PATH and R_HOME are set correctly |
| ADBC `auth_pat` error | Ensure PAT was created and stored in `SNOWFLAKE_PAT` |
| Network policy error | PAT may need `MINS_TO_BYPASS_NETWORK_POLICY_REQUIREMENT` |
| `adbcsnowflake` not found | Ensure setup script ran with `--adbc` flag |
| Setup script fails | Check `setup_r.log` for detailed error messages |
| `r_sf_con` not found | Run `get_snowflake_connection()` to create connection |

### Run Full Diagnostics

In [None]:
# Comprehensive diagnostic check
from r_helpers import print_diagnostics
print_diagnostics()

In [None]:
# Environment diagnostics
import os
import shutil

print("Quick Environment Check:")
print(f"  R_HOME: {os.environ.get('R_HOME', 'NOT SET')}")
print(f"  R binary: {shutil.which('R') or 'NOT FOUND'}")
print(f"  SNOWFLAKE_ACCOUNT: {os.environ.get('SNOWFLAKE_ACCOUNT', 'NOT SET')}")
print(f"  SNOWFLAKE_PAT: {'SET' if os.environ.get('SNOWFLAKE_PAT') else 'NOT SET'}")

In [None]:
# View setup log if something went wrong
# !tail -50 setup_r.log