# Logging R ARIMAX Model to Snowflake Model Registry (rpy2 Version)

This notebook demonstrates the **rpy2-based** approach for wrapping an R model in Python and logging it to Snowflake's Model Registry.

## Key Differences from Original

| Aspect | Original (subprocess) | This Version (rpy2) |
|--------|----------------------|---------------------|
| Data transfer | CSV files | In-memory |
| R execution | Subprocess | Embedded |
| Per-prediction latency | ~200-500ms | ~10-50ms |
| Code complexity | Higher | Lower |

## Benefits
- **5-20x faster** predictions (no file I/O)
- **Cleaner code** (~50% less code in the wrapper)
- **Better error handling** (Python exceptions)
- **Type fidelity** (no CSV conversion)
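
The type-fidelity point can be illustrated without R at all. The sketch below is not the notebook's wrapper; it only uses Python's standard `csv` module to show what a text round trip does to values. Numbers come back as strings and a missing value becomes an empty string, so the receiving side has to re-infer types. The sample row is made up for illustration.

```python
import csv
import io

# Hypothetical sample row, for illustration only.
row = {"exog_var1": 5.25, "exog_var2": None}

# Write the row to an in-memory CSV, as a file-based hand-off would.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now plain text.
buffer.seek(0)
round_tripped = next(csv.DictReader(buffer))

print(type(row["exog_var1"]).__name__, "->", type(round_tripped["exog_var1"]).__name__)
print(repr(row["exog_var2"]), "->", repr(round_tripped["exog_var2"]))
```

An in-memory conversion avoids this re-parsing step because the values never pass through text.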

## Step 1: Setup and Connect to Snowflake

```
# We don't have root permission in the Workspace. One way to install R without it is Miniconda.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
# Accept the channel terms of service (needs the conda binary installed above)
"$HOME/miniconda3/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
"$HOME/miniconda3/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
# Now install base R and the forecast package
"$HOME/miniconda3/bin/conda" create -n r_env -c conda-forge r-base -y
"$HOME/miniconda3/envs/r_env/bin/R" --version
"$HOME/miniconda3/bin/conda" install -n r_env -c conda-forge r-forecast -y

# Put R on the PATH
export PATH="$HOME/miniconda3/envs/r_env/bin:$PATH"
which R
# Belt and braces: set R_HOME explicitly
export R_HOME="$(R RHOME)"

# Test that R runs and that forecast is available
Rscript -e "library(forecast); packageVersion('forecast')"

# Now install rpy2 with pip
# (inside a virtualenv, drop --user; the notebook cell below installs into the kernel's Python)
python -m pip install --user rpy2

# Unversioned symlinks for libz and liblzma (run in the directory that holds these libraries)
ln -s libz.so.1 libz.so
ln -s liblzma.so.5 liblzma.so
```

In [None]:
!whoami

In [None]:
import os, sys, subprocess

# Make sure R from your conda env is on PATH for this process
os.environ["PATH"] = "/root/miniconda3/envs/r_env/bin:" + os.environ["PATH"]

# (Optional but robust) set R_HOME explicitly
try:
    r_home = subprocess.check_output(
        ["/root/miniconda3/envs/r_env/bin/R", "RHOME"],
        text=True,
    ).strip()
    os.environ["R_HOME"] = r_home
    print("R_HOME set to:", r_home)
except Exception as e:
    print("Warning: could not determine R_HOME:", e)

# IMPORTANT: install into the venv; DO NOT use --user
print("Installing rpy2 into kernel Python:", sys.executable)
subprocess.run(
    [sys.executable, "-m", "pip", "install", "rpy2"],
    check=True,
)
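
Before importing rpy2 it can help to confirm that an R binary is actually reachable from the kernel process. The helper below is a minimal sketch using only the standard library; `find_r_home` is a hypothetical name, not part of rpy2 or of this project. It returns `None` instead of raising when R cannot be found or `R RHOME` fails.

```python
import shutil
import subprocess
from typing import Optional


def find_r_home(r_binary: str = "R") -> Optional[str]:
    """Return the directory reported by `R RHOME`, or None if R is unavailable."""
    resolved = shutil.which(r_binary)
    if resolved is None:
        return None
    try:
        result = subprocess.run(
            [resolved, "RHOME"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a non-zero exit code, a timeout, or a binary that cannot be executed.
        return None
    return result.stdout.strip() or None


r_home = find_r_home()
print("R_HOME:", r_home if r_home else "R not found on PATH")
```

If this prints "R not found on PATH", the `PATH` update in the cell above has not taken effect for the current process.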

In [None]:
import sys, os
print("Kernel Python:", sys.executable)
print("Version:", sys.version)
print("PATH:", os.environ.get("PATH", "")[:200], "...")

In [None]:
import rpy2.robjects as ro

# Check that we can talk to R
print("R version via rpy2:", ro.r("R.version.string")[0])

# Simple test
print("2 + 3 in R =", ro.r("2 + 3")[0])


In [None]:
import os
import pandas as pd
import numpy as np
from snowflake.snowpark import Session
from snowflake.ml.registry import Registry

# Import the rpy2-based wrapper
from r_model_wrapper_rpy2 import ARIMAXModelWrapperRpy2

# Option 1: Use active session (Snowflake Notebooks)
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Option 2: Create session from connection params (local development)
# connection_params = {
#     "connection_name": os.getenv("SNOWFLAKE_CONNECTION_NAME") or "MY_DEMO"
# }
# session = Session.builder.configs(connection_params).create()

print(f"Connected to Snowflake: {session.get_current_database()}.{session.get_current_schema()}")

In [None]:
session.sql("""
USE SCHEMA SIMON.SNOWFLAKE_MODEL_REG_RPY2
""").collect()

## Step 2: Generate Synthetic Test Data

Create exogenous variables for forecasting:

In [None]:
np.random.seed(42)
n_forecast = 10

test_data = pd.DataFrame({
    'exog_var1': np.random.normal(5, 1, n_forecast),
    'exog_var2': np.random.normal(10, 2, n_forecast)
})

print("Test Data:")
print(test_data)

## Step 3: Fetch Model Artifact from Stage

Download the R model from Snowflake stage:

In [None]:
# Download R model artifact from Snowflake stage
#r_artifact_stage_path = "@E2E_SNOW_MLOPS_DB.MLOPS_SCHEMA.ML_ARTIFACTS_STAGE/r_models/arimax_model_artifact.rds"
r_artifact_stage_path = "@SIMON.SNOWFLAKE_MODEL_REG_RPY2.ML_ARTIFACTS_STAGE/r_models/arimax_model_artifact.rds"

# Download file from stage to /tmp/
session.file.get(r_artifact_stage_path, "/tmp/")

print("Downloaded R model artifact from stage to /tmp/")
print("\nNote: With rpy2, we don't need the separate predict_arimax.R script!")
print("The prediction logic is embedded in the Python wrapper.")
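
A quick check that the download produced the expected file can save a confusing failure later, when the wrapper tries to read the `.rds`. The sketch below assumes the downloaded file keeps its stage file name in the target directory; `expected_local_path` and `require_artifact` are hypothetical helper names.

```python
from pathlib import Path, PurePosixPath


def expected_local_path(stage_path: str, target_dir: str) -> Path:
    """Local path a staged file should have, assuming the file name is kept."""
    file_name = PurePosixPath(stage_path.lstrip("@")).name
    return Path(target_dir) / file_name


def require_artifact(path: Path) -> Path:
    """Raise FileNotFoundError if the artifact is missing or empty."""
    if not path.is_file() or path.stat().st_size == 0:
        raise FileNotFoundError(f"Model artifact missing or empty: {path}")
    return path


local_path = expected_local_path(
    "@SIMON.SNOWFLAKE_MODEL_REG_RPY2.ML_ARTIFACTS_STAGE/r_models/arimax_model_artifact.rds",
    "/tmp/",
)
print("Expecting artifact at:", local_path)
# After the download cell has run: require_artifact(local_path)
```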

## Step 4: Test Model Locally (Optional)

If rpy2 is available in the notebook environment, test locally first:

In [None]:
# Optional: Test locally if rpy2 is available
try:
    from predict_arimax_rpy2 import load_arimax_model, predict_arimax_from_dataframe
    
    print("Testing rpy2 model locally...")
    model = load_arimax_model('/tmp/arimax_model_artifact.rds')
    local_predictions = predict_arimax_from_dataframe(model, test_data)
    
    print("\nLocal predictions (rpy2):")
    print(local_predictions)
    print("\n✓ Local test passed!")
except ImportError:
    print("rpy2 not available in notebook environment")
    print("Model will be tested after deployment to SPCS")
except Exception as e:
    print(f"Local test skipped: {e}")
    print("Model will be tested after deployment to SPCS")

## Step 5: Create Model Registry

In [None]:
session.sql("""
USE SCHEMA SIMON.SNOWFLAKE_MODEL_REG_RPY2
""").collect()

In [None]:
session.sql("""
USE ROLE SNOWFLAKE_MODEL_REG_RPY2;
""").collect()

In [None]:
# Alternative registry location:
# reg = Registry(
#     session=session,
#     database_name='E2E_SNOW_MLOPS_DB',
#     schema_name='MLOPS_SCHEMA'
# )

reg = Registry(
    session=session,
    database_name='SIMON',
    schema_name='SNOWFLAKE_MODEL_REG_RPY2'
)

print("Registry initialized")

## Step 6: Log Model to Snowflake Registry

### Key Configuration:
- **target_platforms**: `["SNOWPARK_CONTAINER_SERVICES"]` - Required for R execution
- **conda_dependencies**: Now includes `rpy2` (pinned to `>=3.6.4` in the cell below) in addition to the R packages
- **Note**: No predict script artifact needed - logic is in Python wrapper!

In [None]:
# Reload the wrapper module so the latest version is picked up before logging
import importlib
import r_model_wrapper_rpy2
importlib.reload(r_model_wrapper_rpy2)
from r_model_wrapper_rpy2 import ARIMAXModelWrapperRpy2

In [None]:
from snowflake.ml.model import custom_model
from snowflake.ml.model.model_signature import (
    ModelSignature,
    FeatureSpec,
    DataType
)

# Create ModelContext - NOTE: Only model_rds needed (no predict script!)
model_context = custom_model.ModelContext(
    model_rds='/tmp/arimax_model_artifact.rds'
)

# Instantiate the rpy2-based custom model
my_model = ARIMAXModelWrapperRpy2(model_context)

# Define explicit signature
predict_signature = ModelSignature(
    inputs=[
        FeatureSpec(name="exog_var1", dtype=DataType.DOUBLE),
        FeatureSpec(name="exog_var2", dtype=DataType.DOUBLE)
    ],
    outputs=[
        FeatureSpec(name="forecast", dtype=DataType.DOUBLE),
        FeatureSpec(name="lower_80", dtype=DataType.DOUBLE),
        FeatureSpec(name="upper_80", dtype=DataType.DOUBLE),
        FeatureSpec(name="lower_95", dtype=DataType.DOUBLE),
        FeatureSpec(name="upper_95", dtype=DataType.DOUBLE)
    ]
)

# Log to registry with rpy2 dependency
model_version = reg.log_model(
    my_model,
    model_name="ARIMAX_R_MODEL_RPY2",
    version_name="V1",
    target_platforms=["SNOWPARK_CONTAINER_SERVICES"],
    conda_dependencies=[
        "r-base>=4.5.2",
        "r-forecast>=9.0.0",
        "rpy2>=3.6.4"  # New dependency for rpy2 approach
    ],
    signatures={"predict": predict_signature},
    sample_input_data=test_data,
    comment="R ARIMAX model with rpy2 Python wrapper (faster, no CSV I/O)"
)

print("\nModel logged successfully!")
print(f"Model: {model_version.model_name}")
print(f"Version: {model_version.version_name}")
print("\nKey improvement: Using rpy2 for direct R execution (no subprocess/CSV)")

In [None]:
reg.show_models()

In [None]:
model = reg.get_model("ARIMAX_R_MODEL_RPY2")
model_version = model.version('V1')
print(f"Model: {model_version.model_name}")
print(f"Version: {model_version.version_name}")

## Step 7: Create SPCS Resources

Create compute pool and image repository for model deployment:

In [None]:
# Create compute pool for R model inference
session.sql("""
CREATE COMPUTE POOL IF NOT EXISTS R_MODEL_POOL_RPY2
    MIN_NODES = 1
    MAX_NODES = 2
    INSTANCE_FAMILY = 'CPU_X64_M'
    AUTO_RESUME = TRUE
    COMMENT = 'Compute pool for R ARIMAX model inference (rpy2)'
""").collect()
print("✓ Compute pool created: R_MODEL_POOL_RPY2")

# Create image repository
session.sql("""
-- CREATE IMAGE REPOSITORY IF NOT EXISTS E2E_SNOW_MLOPS_DB.MLOPS_SCHEMA.R_MODEL_IMAGE_REPO_RPY2
CREATE IMAGE REPOSITORY IF NOT EXISTS SIMON.SNOWFLAKE_MODEL_REG_RPY2.R_MODEL_IMAGE_REPO_RPY2
    COMMENT = 'Repository for R model container images (rpy2 version)'
""").collect()
print("✓ Image repository created: R_MODEL_IMAGE_REPO_RPY2")

## Step 8: Deploy Model to SPCS

In [None]:
# Deploy model to SPCS
model_version.create_service(
    service_name="arimax_rpy2_deployment",
    service_compute_pool="R_MODEL_POOL_RPY2",
    image_repo="R_MODEL_IMAGE_REPO_RPY2",
    ingress_enabled=True,
    max_instances=1
)

print("Model deployed to SPCS: arimax_rpy2_deployment")
print("Building container image and starting service...")
print("This may take 5-10 minutes for first deployment.")
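
Because a first deployment can take several minutes, it may be convenient to poll the service status before running inference. The helper below is a generic sketch, not part of the Snowflake API: `wait_for_status` is a hypothetical name, and the status strings `"RUNNING"` and `"FAILED"` are assumptions that should be checked against what `SHOW SERVICES` reports in your account. The status lookup is passed in as a callable so the loop can be exercised with a stub.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    ready: str = "RUNNING",              # assumed "ready" status name
    failed: Iterable[str] = ("FAILED",),  # assumed terminal failure status names
    timeout_s: float = 900.0,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns `ready`, a failure status, or time runs out."""
    failed = tuple(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == ready:
            return status
        if status in failed:
            raise RuntimeError(f"Service entered status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Service still {status!r} after {timeout_s} seconds")
        sleep(poll_s)


# Stubbed demonstration: two PENDING polls, then RUNNING (no real waiting).
fake_statuses = iter(["PENDING", "PENDING", "RUNNING"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("Final status:", final_status)

# In the notebook, get_status could wrap the status query used in Step 9, e.g.:
# lambda: session.sql("SHOW SERVICES LIKE 'arimax_rpy2_deployment'").collect()[0]["status"]
```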

## Step 9: Run Inference

In [None]:
import time

# Check service status
service_status = session.sql("SHOW SERVICES LIKE 'arimax_rpy2_deployment'").collect()
print(f"Service status: {service_status[0]['status'] if service_status else 'Not found'}")

# Create test data for inference
test_snowpark_df = session.create_dataframe(test_data)

# Call the model via SPCS service
print("\n=== Running Inference via SPCS (rpy2) ===")
start_time = time.time()

predictions = model_version.run(
    test_snowpark_df,
    function_name="predict",
    service_name="arimax_rpy2_deployment"
)

elapsed_time = time.time() - start_time
print(f"\nInference completed in {elapsed_time:.2f} seconds")
print("\nPredictions:")
predictions.show()

## Step 11: Run Performance Test

In [None]:
import benchmark_model
importlib.reload(benchmark_model)
from benchmark_model import run_benchmark, compare_benchmarks

# Run a single benchmark
stats = run_benchmark(
    session=session,
    model_version=model_version,
    service_name="arimax_rpy2_deployment",  # Your service name
    total_rows=100,      # Total rows to test
    rows_per_request=10, # Batch size per request
    verbose=True
)

In [None]:
import benchmark_model
importlib.reload(benchmark_model)
from benchmark_model import run_benchmark, compare_benchmarks, run_and_save_benchmark


# Compare different batch sizes
results = []
labels = []
for batch_size in [10, 25, 50]:
    stats = run_benchmark(
        session=session,
        model_version=model_version,
        service_name="arimax_rpy2_deployment",
        total_rows=100,
        rows_per_request=batch_size,
        verbose=False
    )
    results.append(stats)
    labels.append(f"{batch_size} rows")

compare_benchmarks(results, labels)

In [None]:
import benchmark_model
importlib.reload(benchmark_model)
from benchmark_model import run_benchmark, compare_benchmarks, run_and_save_benchmark

# Run benchmarks and save to Snowflake
for batch_size in [10, 25, 50]:
    run_and_save_benchmark(
        session=session,
        model_version=model_version,
        service_name="arimax_rpy2_deployment",
        model_type="rpy2",  # Label for this approach
        total_rows=100,
        rows_per_request=batch_size,
        table_name="BENCHMARK_RESULTS",
        run_id="comparison_001"  # Same run_id to group results
    )

In [None]:
from benchmark_model import compare_from_table, load_benchmark_results

# Quick comparison
compare_from_table(session, table_name="BENCHMARK_RESULTS")

# Or load raw data for custom analysis
df = load_benchmark_results(session, table_name="BENCHMARK_RESULTS")
print(df)

In [None]:
df

```
=== Benchmark Configuration ===
Total rows: 100
Rows per request: 10
Number of iterations: 10
  - Full batches: 10

=== Running Benchmark ===
  Iteration 1/10: 10 rows, 2211.59ms
UserWarning: Pandas Dataframe has non-standard index of type <class 'pandas.core.indexes.range.RangeIndex'> which will not be written. Consider changing the index to pd.RangeIndex(start=0,...,step=1) or call reset_index() to keep index as column(s)
  Iteration 2/10: 10 rows, 1981.82ms
  Iteration 3/10: 10 rows, 2124.08ms
  Iteration 4/10: 10 rows, 2225.68ms
  Iteration 5/10: 10 rows, 740.08ms
  Iteration 6/10: 10 rows, 911.45ms
  Iteration 7/10: 10 rows, 2182.01ms
  Iteration 8/10: 10 rows, 2162.82ms
  Iteration 9/10: 10 rows, 3156.90ms
  Iteration 10/10: 10 rows, 2686.21ms

=== Benchmark Results ===
Total rows processed: 100
Rows per request: 10
Successful iterations: 10/10

Timing Statistics (milliseconds):
  Total time:    20382.64ms
  Average:       2038.26ms
  Min:           740.08ms
  Max:           3156.90ms
  Std Dev:       688.34ms

Percentiles:
  P50 (median):  2172.42ms
  P90:           2733.28ms
  P95:           2945.09ms
  P99:           3114.54ms

Throughput: 4.91 rows/second

=== Benchmark Comparison ===

Metric                       1 rows         5 rows        10 rows        25 rows        50 rows
-----------------------------------------------------------------------------------------------
Rows/request                      1              5             10             25             50
Iterations                      100             20             10              4              2
Total time (ms)           182429.39       43037.81       28315.65        5186.02        1984.77
Avg (ms)                    1824.29        2151.89        2831.56        1296.50         992.38
Min (ms)                     826.04         903.07         927.95         977.74         899.85
Max (ms)                   23315.01       20102.75       19045.37        1617.60        1084.92
P50 (ms)                    1205.70        1110.48        1014.96        1295.33         992.38
P90 (ms)                    2068.99        1826.84        2945.05        1605.68        1066.41
P95 (ms)                    2767.64        3887.67       10995.21        1611.64        1075.66
P99 (ms)                   20814.39       16859.73       17435.34        1616.41        1083.06
Throughput (r/s)               0.55           2.32           3.53          19.28          50.38
```
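
The summary statistics in the first log can be recomputed from the ten per-request timings. The internals of `benchmark_model` are not shown here, so the conventions below are assumptions: percentiles use linear interpolation between closest ranks, and the standard deviation is the population form. With those choices the results appear to agree with the logged figures to two decimals.

```python
import statistics

# Per-request timings (ms) copied from the "Running Benchmark" log above.
timings_ms = [2211.59, 1981.82, 2124.08, 2225.68, 740.08,
              911.45, 2182.01, 2162.82, 3156.90, 2686.21]
rows_per_request = 10


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


total_ms = sum(timings_ms)
summary = {
    "total_ms": total_ms,
    "avg_ms": statistics.mean(timings_ms),
    "min_ms": min(timings_ms),
    "max_ms": max(timings_ms),
    "std_ms": statistics.pstdev(timings_ms),  # population std dev (assumption)
    "p50_ms": percentile(timings_ms, 50),
    "p90_ms": percentile(timings_ms, 90),
    "p95_ms": percentile(timings_ms, 95),
    "p99_ms": percentile(timings_ms, 99),
    # Throughput = total rows processed / total elapsed seconds.
    "rows_per_second": len(timings_ms) * rows_per_request / (total_ms / 1000),
}

for name, value in summary.items():
    print(f"{name:>16}: {value:.2f}")
```

The same throughput formula (total rows divided by total time) is consistent with the comparison table, for example 100 rows in 1984.77 ms gives roughly 50.38 rows/second.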

## Step 10: View Models in Registry

In [None]:
models_df = reg.show_models()
print("\nRegistered Models:")
print(models_df[['name', 'versions', 'comment']])

In [None]:
reg.delete_model('ARIMAX_R_MODEL_RPY2')

In [None]:
%%sql -r dataframe_5
SHOW SERVICES IN SCHEMA SIMON.SNOWFLAKE_MODEL_REG_RPY2;

In [None]:
# Drop by the exact name from SHOW SERVICES output
session.sql("DROP SERVICE IF EXISTS SIMON.SNOWFLAKE_MODEL_REG_RPY2.MODEL_BUILD_D68F5176").collect()

# Also check for any MODEL_BUILD services that may be lingering
# session.sql("DROP SERVICE IF EXISTS SIMON.SNOWFLAKE_MODEL_REG_RPY2.MODEL_BUILD_D68F5176").collect()

## Cleanup (Optional)

In [None]:
# Uncomment to clean up resources

# Delete the service
# model_version.delete_service("arimax_rpy2_deployment")

# Delete the model
# reg.delete_model("ARIMAX_R_MODEL_RPY2")

# Drop compute pool and image repo
# session.sql("DROP COMPUTE POOL IF EXISTS R_MODEL_POOL_RPY2").collect()
# session.sql("DROP IMAGE REPOSITORY IF EXISTS R_MODEL_IMAGE_REPO_RPY2").collect()

# print("Resources cleaned up")

## Summary

### What Changed

| Component | Original | rpy2 Version |
|-----------|----------|--------------|
| Wrapper | `ARIMAXModelWrapper` | `ARIMAXModelWrapperRpy2` |
| Data transfer | CSV files | In-memory |
| R execution | subprocess | Embedded |
| Artifacts | model.rds + predict.R | model.rds only |
| Dependencies | r-base, r-forecast | + rpy2 |

### Benefits Achieved

1. **Faster predictions** - No file I/O overhead
2. **Cleaner code** - ~50% less code in wrapper
3. **Better errors** - Python-native exception handling
4. **Simpler artifacts** - No separate R script needed
5. **Type fidelity** - Direct pandas ↔ R conversion

In [None]:
# one-time registration of rpy2 magics in this notebook
from rpy2.ipython import rmagic
ip = get_ipython()
ip.register_magics(rmagic.RMagics)

In [None]:
%%R
x <- rnorm(100)
mean(x)

In [None]:
%%R
mean(x)