# R Forecasting Model: Train, Register, and Inference

This notebook demonstrates an end-to-end workflow for R-based forecasting in Snowflake:

1. **Query Data** from Snowflake (TPC-H sample)
2. **Train** a time series model in R using the `forecast` package
3. **Register** the model to Snowflake Model Registry
4. **Run Inference** via Snowpark Container Services
5. **Visualize** results with ggplot2

**Target Audience:** Data scientists who prefer R but need to integrate with Snowflake's MLOps ecosystem.

---

## Table of Contents

1. [Configuration](#section-1-configuration)
2. [Environment Setup](#section-2-environment-setup)
3. [Data Exploration](#section-3-data-exploration)
4. [Time Series Preparation](#section-4-time-series-preparation)
5. [Model Training (R)](#section-5-model-training-r)
6. [Model Packaging](#section-6-model-packaging)
7. [Model Registration](#section-7-model-registration)
8. [Inference](#section-8-inference)
9. [Visualization](#section-9-visualization)
10. [Cleanup](#section-10-cleanup)

---

# Section 1: Configuration

Configure your Snowflake environment settings below. These variables are used throughout the notebook.

**Modify these values to match your Snowflake account:**

In [None]:
# =============================================================================
# USER CONFIGURATION - Modify these values for your environment
# =============================================================================

# Database and schema for model artifacts and registry
MODEL_DATABASE = "USER$SIMON"         # Your database (USER$<username> for personal DB)
MODEL_SCHEMA = "R_FORECAST_DEMO"      # Schema for models and artifacts

# Warehouse for queries (use a warehouse you have access to)
WAREHOUSE = "WH_XS"                   # Your warehouse name

# Model naming
MODEL_NAME = "TPCH_ORDERS_FORECAST"   # Name in Model Registry
MODEL_VERSION = "V1"                  # Version identifier

# SPCS resources (will be created if they don't exist)
COMPUTE_POOL = "R_FORECAST_POOL"      # Compute pool for inference
IMAGE_REPO = "R_FORECAST_IMAGES"      # Image repository
SERVICE_NAME = "orders_forecast_svc"  # Service name for deployment

# Stage for model artifacts
ARTIFACTS_STAGE = "ML_ARTIFACTS_STAGE"

# Data source (TPC-H sample data - available in all accounts)
SOURCE_DATABASE = "SNOWFLAKE_SAMPLE_DATA"
SOURCE_SCHEMA = "TPCH_SF1"            # SF1 = Scale Factor 1 (smallest)

print("Configuration loaded:")
print(f"  Model location: {MODEL_DATABASE}.{MODEL_SCHEMA}")
print(f"  Warehouse: {WAREHOUSE}")
print(f"  Model name: {MODEL_NAME}")
print(f"  Data source: {SOURCE_DATABASE}.{SOURCE_SCHEMA}")
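The configuration values are later interpolated directly into SQL strings, so a quick sanity check can catch typos before any statement runs. The sketch below is an illustration and not part of the original notebook: `is_plain_identifier` and `qualified_name` are hypothetical helpers, and the pattern encodes the rule for unquoted Snowflake identifiers (a letter or underscore first, then letters, digits, underscores, or dollar signs).

```python
import re

# Unquoted Snowflake identifiers: a letter or underscore first, then
# letters, digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def is_plain_identifier(name: str) -> bool:
    """Return True if `name` can be used unquoted in a SQL statement."""
    return _IDENTIFIER.fullmatch(name) is not None


def qualified_name(*parts: str) -> str:
    """Join identifier parts with dots, rejecting anything that would need quoting."""
    for part in parts:
        if not is_plain_identifier(part):
            raise ValueError(f"not a plain identifier: {part!r}")
    return ".".join(parts)


# Values copied from the configuration cell
print(qualified_name("USER$SIMON", "R_FORECAST_DEMO", "ML_ARTIFACTS_STAGE"))
```

A helper like this could wrap each name before it is placed in an f-string; names that genuinely need double quotes would require a different approach.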

---

# Section 2: Environment Setup

Set up the R environment and connect to Snowflake.

## 2.1 Install R Environment

Run the setup script to install R and required packages. This only needs to run once per session.

In [None]:
# Install R environment with ADBC support
# --adbc includes the forecast package needed for time series modeling
!bash setup_r_environment.sh --adbc 2>&1 | tail -20

## 2.2 Configure Python-R Bridge

In [None]:
# Configure R environment and register %%R magic
from r_helpers import setup_r_environment, print_diagnostics

result = setup_r_environment()

if result['success']:
    print(f"✓ R environment configured successfully")
    print(f"  R version: {result['r_version']}")
    print(f"  rpy2 installed: {result['rpy2_installed']}")
    print(f"  %%R magic registered: {result['magic_registered']}")
else:
    print(f"✗ Setup failed: {result['errors']}")

In [None]:
%%R
# Verify R is working and forecast package is available
library(forecast)
library(ggplot2)
library(dplyr)

cat("R packages loaded successfully\n")
cat("forecast version:", as.character(packageVersion("forecast")), "\n")

## 2.3 Connect to Snowflake

In [None]:
from snowflake.snowpark import Session
from snowflake.snowpark.context import get_active_session
from snowflake.ml.registry import Registry
import pandas as pd
import numpy as np

# Get the active Snowpark session (built-in to Workspace Notebooks)
session = get_active_session()

# Set warehouse
session.sql(f"USE WAREHOUSE {WAREHOUSE}").collect()

print(f"Connected to Snowflake")
print(f"  Account: {session.get_current_account()}")
print(f"  User: {session.get_current_user()}")
print(f"  Warehouse: {session.get_current_warehouse()}")

## 2.4 Create Schema and Artifacts Stage

In [None]:
# Create schema if it doesn't exist
session.sql(f"CREATE SCHEMA IF NOT EXISTS {MODEL_DATABASE}.{MODEL_SCHEMA}").collect()
session.sql(f"USE SCHEMA {MODEL_DATABASE}.{MODEL_SCHEMA}").collect()

# Create stage for model artifacts
session.sql(f"""
    CREATE STAGE IF NOT EXISTS {ARTIFACTS_STAGE}
    COMMENT = 'Stage for R model artifacts'
""").collect()

print(f"✓ Using schema: {MODEL_DATABASE}.{MODEL_SCHEMA}")
print(f"✓ Artifacts stage: {ARTIFACTS_STAGE}")

---

# Section 3: Data Exploration

Explore the TPC-H orders data that we'll use for forecasting.

In [None]:
# Query order volume by month
orders_query = f"""
SELECT
    DATE_TRUNC('MONTH', O_ORDERDATE) as ORDER_MONTH,
    COUNT(*) as ORDER_COUNT,
    SUM(O_TOTALPRICE) as TOTAL_REVENUE,
    AVG(O_TOTALPRICE) as AVG_ORDER_VALUE
FROM {SOURCE_DATABASE}.{SOURCE_SCHEMA}.ORDERS
GROUP BY DATE_TRUNC('MONTH', O_ORDERDATE)
ORDER BY ORDER_MONTH
"""

orders_df = session.sql(orders_query).to_pandas()

print(f"Loaded {len(orders_df)} months of order data")
print(f"Date range: {orders_df['ORDER_MONTH'].min()} to {orders_df['ORDER_MONTH'].max()}")
orders_df.head(10)

In [None]:
%%R -i orders_df -w 900 -h 400
library(ggplot2)
library(dplyr)
library(scales)

# Convert to proper date type
orders_df$ORDER_MONTH <- as.Date(orders_df$ORDER_MONTH)

# Plot order count time series
p <- ggplot(orders_df, aes(x = ORDER_MONTH, y = ORDER_COUNT)) +
    geom_line(color = "steelblue", linewidth = 1) +
    geom_point(color = "steelblue", size = 2) +
    scale_y_continuous(labels = comma) +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    labs(
        title = "Monthly Order Volume (TPC-H)",
        subtitle = "Time series data for forecasting",
        x = "Month",
        y = "Number of Orders"
    ) +
    theme_minimal(base_size = 12) +
    theme(plot.title = element_text(face = "bold"))

print(p)

---

# Section 4: Time Series Preparation

Prepare the data as an R time series object for modeling.

In [None]:
%%R -i orders_df
library(forecast)

# Ensure proper ordering
orders_df <- orders_df[order(orders_df$ORDER_MONTH), ]

# Extract the target variable (order count)
order_counts <- orders_df$ORDER_COUNT

# Get start date for ts object
start_date <- as.Date(min(orders_df$ORDER_MONTH))
start_year <- as.numeric(format(start_date, "%Y"))
start_month <- as.numeric(format(start_date, "%m"))

# Create time series object (monthly frequency = 12)
orders_ts <- ts(order_counts, start = c(start_year, start_month), frequency = 12)

cat("Time Series Summary:\n")
cat("  Length:", length(orders_ts), "observations\n")
cat("  Start:", start(orders_ts), "\n")
cat("  End:", end(orders_ts), "\n")
cat("  Frequency:", frequency(orders_ts), "(monthly)\n")

In [None]:
%%R -w 900 -h 500
# Decompose the time series to understand components
# Use STL decomposition (works well for monthly data)
decomp <- stl(orders_ts, s.window = "periodic")
plot(decomp, main = "Time Series Decomposition")

---

# Section 5: Model Training (R)

Train a forecasting model using R's `forecast` package. We'll use `auto.arima()`, which searches over candidate ARIMA orders and keeps the one with the best information criterion (AICc by default).

In [None]:
%%R
library(forecast)

# Split data: use last 12 months as test set
n_test <- 12
n_train <- length(orders_ts) - n_test

# Split by observation index. forecast::subset() keeps the ts attributes,
# so the boundaries stay correct whatever month the series starts in
# (the previous window() arithmetic only held for a January start).
train_ts <- subset(orders_ts, end = n_train)
test_ts <- subset(orders_ts, start = n_train + 1)

cat("Training set:", length(train_ts), "months\n")
cat("Test set:", length(test_ts), "months\n")
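The train/test boundary can be expressed either as observation indices or as calendar months. The sketch below shows, in plain Python, the month arithmetic that converts one into the other. The helper `month_at` is hypothetical, and the example values (80 monthly observations starting in January 1992) are assumptions chosen for illustration, not figures taken from the query output.

```python
def month_at(start_year: int, start_month: int, offset: int) -> tuple[int, int]:
    """Return the (year, month) that lies `offset` months after the start month."""
    index = (start_month - 1) + offset  # months counted from January of start_year
    return start_year + index // 12, index % 12 + 1


# Example values (assumed): 80 monthly observations starting in January 1992
start_year, start_month, n_obs, n_test = 1992, 1, 80, 12
n_train = n_obs - n_test

train_end = month_at(start_year, start_month, n_train - 1)  # last training month
test_start = month_at(start_year, start_month, n_train)     # first test month
print("train ends:", train_end, "test starts:", test_start)
```

Note that the year offset depends on the start month as well as on the observation count, which is easy to miss when the series happens to begin in January.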

In [None]:
%%R
# Train ARIMA model with automatic parameter selection
cat("Training ARIMA model (auto parameter selection)...\n")

arima_model <- auto.arima(train_ts,
                          seasonal = TRUE,
                          stepwise = FALSE,  # More thorough search
                          trace = FALSE)

cat("\nModel Summary:\n")
print(summary(arima_model))

In [None]:
%%R -w 900 -h 400
# Generate forecast for test period
forecast_result <- forecast(arima_model, h = n_test)

# Plot forecast vs actuals
# The model description is read from the forecast object, which carries
# a `method` string such as "ARIMA(p,d,q)(P,D,Q)[12]"
autoplot(forecast_result) +
    autolayer(test_ts, series = "Actual", color = "red") +
    labs(
        title = "ARIMA Forecast vs Actual",
        subtitle = paste("Model:", forecast_result$method),
        x = "Time",
        y = "Order Count"
    ) +
    theme_minimal() +
    theme(legend.position = "bottom")

In [None]:
%%R
# Calculate accuracy metrics
accuracy_metrics <- accuracy(forecast_result, test_ts)

cat("\nModel Accuracy:\n")
print(accuracy_metrics)
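For readers less familiar with the columns that `accuracy()` prints, the following Python sketch computes four of them (ME, RMSE, MAE, MAPE) from scratch. The function name `error_metrics` and the sample numbers are illustrative assumptions; the definitions are the standard ones, with each error taken as actual minus forecast.

```python
import math


def error_metrics(actual: list[float], predicted: list[float]) -> dict[str, float]:
    """Compute ME, RMSE, MAE and MAPE with errors defined as actual - predicted."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,                              # mean error (bias)
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),  # root mean squared error
        "MAE": sum(abs(e) for e in errors) / n,             # mean absolute error
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


# Made-up numbers, for illustration only
metrics = error_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(metrics)
```

MAPE divides by the actual values, so it is undefined when any actual value is zero; that is not a concern for monthly order counts but matters for sparse series.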

## 5.2 Train Final Model on Full Data

Now train the final model on all available data.

In [None]:
%%R
# Train final model on full dataset
cat("Training final model on full dataset...\n")

final_model <- auto.arima(orders_ts,
                          seasonal = TRUE,
                          stepwise = FALSE)

cat("\nFinal Model:\n")
print(final_model)

In [None]:
%%R
# Save model to file
model_path <- "/tmp/orders_forecast_model.rds"
saveRDS(final_model, file = model_path)

cat("Model saved to:", model_path, "\n")
cat("File size:", file.size(model_path), "bytes\n")

---

# Section 6: Model Packaging

Create a Python wrapper class that enables the R model to work with Snowflake Model Registry.

In [None]:
# Define the model wrapper class
wrapper_code = '''
"""
Python wrapper for R forecast model using rpy2.
"""
import pandas as pd
import numpy as np
import uuid
from snowflake.ml.model import custom_model


def _get_rpy2_components():
    """Lazy import of rpy2 components."""
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri, r
    from rpy2.robjects.vectors import FloatVector, IntVector
    from rpy2.robjects.conversion import localconverter
    from rpy2.rinterface_lib.embedded import RRuntimeError
    from rpy2.robjects import numpy2ri

    combined_converter = ro.default_converter + pandas2ri.converter + numpy2ri.converter
    return ro, r, FloatVector, IntVector, localconverter, RRuntimeError, combined_converter


class ForecastModelWrapper(custom_model.CustomModel):
    """Python wrapper for R ARIMA/ETS forecast models."""

    def __init__(self, context: custom_model.ModelContext):
        super().__init__(context)
        self._initialized = False
        self._r_model_name = f"forecast_model_{uuid.uuid4().hex[:8]}"

    def _ensure_initialized(self):
        # Load the forecast package and the saved model once per wrapper instance
        if self._initialized:
            return

        ro, _, _, _, localconverter, _, combined_converter = _get_rpy2_components()

        with localconverter(combined_converter):
            ro.r("library(forecast)")
            model_path = self.context["model_rds"]
            ro.r(f\'{self._r_model_name} <- readRDS("{model_path}")\')

        self._initialized = True

    @custom_model.inference_api
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        self._ensure_initialized()

        # Forecast horizon: first value of column "h" if present, else the row count
        if "h" in X.columns:
            h = int(X["h"].iloc[0])
        else:
            h = len(X)

        ro, r, _, _, localconverter, RRuntimeError, combined_converter = _get_rpy2_components()

        # Unique R variable names so concurrent calls do not overwrite each other
        uid = uuid.uuid4().hex[:8]
        var_pred = f"pred_{uid}"
        var_mean = f"mean_{uid}"
        var_lower = f"lower_{uid}"
        var_upper = f"upper_{uid}"

        try:
            with localconverter(combined_converter):
                ro.r(f\'\'\'
                    {var_pred} <- forecast({self._r_model_name}, h={h})
                    {var_mean} <- as.numeric({var_pred}$mean)
                    {var_lower} <- as.matrix({var_pred}$lower)
                    {var_upper} <- as.matrix({var_pred}$upper)
                \'\'\')

                forecast_mean = np.array(ro.globalenv[var_mean]).flatten()
                lower_intervals = np.array(ro.globalenv[var_lower])
                upper_intervals = np.array(ro.globalenv[var_upper])

                if lower_intervals.ndim == 1:
                    lower_intervals = lower_intervals.reshape(-1, 2)
                if upper_intervals.ndim == 1:
                    upper_intervals = upper_intervals.reshape(-1, 2)

                # Remove the temporary variables from the R global environment
                ro.r(f"rm({var_pred}, {var_mean}, {var_lower}, {var_upper})")

            # Column 0 holds the 80% interval, column 1 the 95% interval
            return pd.DataFrame({
                "period": range(1, h + 1),
                "point_forecast": forecast_mean,
                "lower_80": lower_intervals[:, 0],
                "upper_80": upper_intervals[:, 0],
                "lower_95": lower_intervals[:, 1],
                "upper_95": upper_intervals[:, 1]
            })

        except RRuntimeError as e:
            raise RuntimeError(f"R execution error: {str(e)}")
'''

# Write wrapper to file
wrapper_path = "/tmp/forecast_model_wrapper.py"
with open(wrapper_path, "w") as f:
    f.write(wrapper_code)

print(f"Model wrapper saved to: {wrapper_path}")

In [None]:
# Test the wrapper locally
import sys
sys.path.insert(0, '/tmp')

from forecast_model_wrapper import ForecastModelWrapper
from snowflake.ml.model import custom_model

# Create model context pointing to the saved model
test_context = custom_model.ModelContext(
    model_rds='/tmp/orders_forecast_model.rds'
)

# Instantiate wrapper
wrapper = ForecastModelWrapper(test_context)

# Test prediction (forecast 6 periods ahead)
test_input = pd.DataFrame({'h': [6]})
test_predictions = wrapper.predict(test_input)

print("Local test predictions (6 months ahead):")
test_predictions

---

# Section 7: Model Registration

Register the model to Snowflake Model Registry for managed deployment.

In [None]:
# Upload model artifact to stage
session.file.put(
    "/tmp/orders_forecast_model.rds",
    f"@{MODEL_DATABASE}.{MODEL_SCHEMA}.{ARTIFACTS_STAGE}/r_models/",
    auto_compress=False,
    overwrite=True
)

print(f"Model uploaded to stage: @{ARTIFACTS_STAGE}/r_models/orders_forecast_model.rds")

In [None]:
# Initialize Model Registry
from snowflake.ml.registry import Registry

reg = Registry(
    session=session,
    database_name=MODEL_DATABASE,
    schema_name=MODEL_SCHEMA
)

print(f"Registry initialized: {MODEL_DATABASE}.{MODEL_SCHEMA}")

In [None]:
from snowflake.ml.model import custom_model
from snowflake.ml.model.model_signature import ModelSignature, FeatureSpec, DataType

# Create model context
model_context = custom_model.ModelContext(
    model_rds='/tmp/orders_forecast_model.rds'
)

# Instantiate wrapper
model_wrapper = ForecastModelWrapper(model_context)

# Define model signature
predict_signature = ModelSignature(
    inputs=[
        FeatureSpec(name="h", dtype=DataType.INT64)
    ],
    outputs=[
        FeatureSpec(name="period", dtype=DataType.INT64),
        FeatureSpec(name="point_forecast", dtype=DataType.DOUBLE),
        FeatureSpec(name="lower_80", dtype=DataType.DOUBLE),
        FeatureSpec(name="upper_80", dtype=DataType.DOUBLE),
        FeatureSpec(name="lower_95", dtype=DataType.DOUBLE),
        FeatureSpec(name="upper_95", dtype=DataType.DOUBLE)
    ]
)

sample_input = pd.DataFrame({'h': [12]})

print("Model signature defined")
print("  Input: h (forecast horizon)")
print("  Output: period, point_forecast, lower_80, upper_80, lower_95, upper_95")

In [None]:
# Log model to registry
model_version = reg.log_model(
    model_wrapper,
    model_name=MODEL_NAME,
    version_name=MODEL_VERSION,
    target_platforms=["SNOWPARK_CONTAINER_SERVICES"],
    conda_dependencies=[
        "r-base>=4.1",
        "r-forecast>=8.0",
        "rpy2>=3.5"
    ],
    signatures={"predict": predict_signature},
    sample_input_data=sample_input,
    comment="R ARIMA forecast model for TPC-H orders (trained with auto.arima)"
)

print(f"\n✓ Model registered successfully!")
print(f"  Name: {model_version.model_name}")
print(f"  Version: {model_version.version_name}")

In [None]:
# View registered models
reg.show_models()

---

# Section 8: Inference

Deploy the model and run predictions via SPCS.

In [None]:
# Create SPCS resources
session.sql(f"""
    CREATE COMPUTE POOL IF NOT EXISTS {COMPUTE_POOL}
    MIN_NODES = 1
    MAX_NODES = 2
    INSTANCE_FAMILY = 'CPU_X64_M'
    AUTO_RESUME = TRUE
    COMMENT = 'Compute pool for R forecast model inference'
""").collect()
print(f"✓ Compute pool: {COMPUTE_POOL}")

session.sql(f"""
    CREATE IMAGE REPOSITORY IF NOT EXISTS {MODEL_DATABASE}.{MODEL_SCHEMA}.{IMAGE_REPO}
    COMMENT = 'Repository for R forecast model images'
""").collect()
print(f"✓ Image repository: {IMAGE_REPO}")

In [None]:
# Deploy model to SPCS
model_version.create_service(
    service_name=SERVICE_NAME,
    service_compute_pool=COMPUTE_POOL,
    image_repo=IMAGE_REPO,
    ingress_enabled=True,
    max_instances=1
)

print(f"Model deployment started: {SERVICE_NAME}")
print("Building container image... (this may take 5-10 minutes for first deployment)")

In [None]:
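# Check service status (polls every 30 seconds, up to 20 times)
import time

for i in range(20):
    status = session.sql(f"SHOW SERVICES LIKE '{SERVICE_NAME}'").collect()
    if status:
        current_status = status[0]['status']
        print(f"Service status: {current_status}")
        # SHOW SERVICES typically reports RUNNING for a started service;
        # READY is also accepted in case the status wording differs
        if current_status in ('READY', 'RUNNING'):
            print("✓ Service is ready!")
            break
    time.sleep(30)
else:
    # for/else: runs only if the loop finished without hitting `break`
    print("Service not ready yet - check status manually")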
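The polling loop above follows a common pattern: check a status, stop on success, otherwise wait and retry a bounded number of times. A minimal, generic sketch of that pattern is shown below. `wait_for_status` is a hypothetical helper, not a Snowflake API, and the status strings in the demonstration are placeholders; the status source is injected as a callable so the logic can be exercised without a running service.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    ready_states: Iterable[str],
    attempts: int = 20,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `get_status` until it returns a ready state or the attempts run out."""
    ready = set(ready_states)
    for attempt in range(attempts):
        if get_status() in ready:
            return True
        if attempt < attempts - 1:  # no need to sleep after the final check
            sleep(delay_seconds)
    return False


# Demonstration with a fake status source instead of a real service
fake_statuses = iter(["PENDING", "PENDING", "RUNNING"])
ok = wait_for_status(
    lambda: next(fake_statuses),
    {"RUNNING"},
    attempts=5,
    delay_seconds=0,
    sleep=lambda seconds: None,
)
print("ready:", ok)
```

Returning a boolean rather than printing lets the caller decide whether to continue to inference or stop and inspect the service.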

In [None]:
# Run inference - forecast 12 months ahead
inference_input = session.create_dataframe(pd.DataFrame({'h': [12]}))

print("Running inference via SPCS...")
start_time = time.time()

predictions = model_version.run(
    inference_input,
    function_name="predict",
    service_name=SERVICE_NAME
)

elapsed = time.time() - start_time
print(f"Inference completed in {elapsed:.2f} seconds")

# Convert to pandas for display
predictions_df = predictions.to_pandas()
predictions_df

---

# Section 9: Visualization

Visualize the forecast results with ggplot2.

In [None]:
%%R -i predictions_df -i orders_df -w 900 -h 500
library(ggplot2)
library(dplyr)
library(scales)

# Get the last date from historical data
orders_df$ORDER_MONTH <- as.Date(orders_df$ORDER_MONTH)
last_date <- max(orders_df$ORDER_MONTH)

# Create future dates for predictions: the first day of each month after
# the last observed month (stepping by calendar month, not by 30 days)
predictions_df$forecast_date <- seq.Date(
    from = seq(last_date, by = "month", length.out = 2)[2],
    by = "month",
    length.out = nrow(predictions_df)
)

# Prepare data for plotting
forecast_df <- data.frame(
    date = predictions_df$forecast_date,
    value = predictions_df$POINT_FORECAST,
    lower_80 = predictions_df$LOWER_80,
    upper_80 = predictions_df$UPPER_80,
    lower_95 = predictions_df$LOWER_95,
    upper_95 = predictions_df$UPPER_95
)

# Create the plot
p <- ggplot() +
    geom_ribbon(data = forecast_df,
                aes(x = date, ymin = lower_95, ymax = upper_95),
                fill = "steelblue", alpha = 0.2) +
    geom_ribbon(data = forecast_df,
                aes(x = date, ymin = lower_80, ymax = upper_80),
                fill = "steelblue", alpha = 0.3) +
    geom_line(data = orders_df,
              aes(x = ORDER_MONTH, y = ORDER_COUNT),
              color = "black", linewidth = 1) +
    geom_line(data = forecast_df,
              aes(x = date, y = value),
              color = "steelblue", linewidth = 1, linetype = "dashed") +
    geom_point(data = forecast_df,
               aes(x = date, y = value),
               color = "steelblue", size = 2) +
    scale_y_continuous(labels = comma) +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    labs(
        title = "TPC-H Orders Forecast",
        subtitle = "12-month forecast with 80% and 95% prediction intervals",
        x = "Date",
        y = "Order Count"
    ) +
    theme_minimal(base_size = 12) +
    theme(plot.title = element_text(face = "bold"))

print(p)
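The forecast rows carry only a period number, so their calendar dates have to be derived from the last observed month. The Python sketch below illustrates that derivation; `next_month_starts` is a hypothetical helper and the August 1998 starting point is an assumed example value. Stepping by calendar month rather than by a fixed number of days keeps every date on the first of its month.

```python
from datetime import date


def next_month_starts(last_month: date, periods: int) -> list[date]:
    """Return the first day of each of the `periods` months after `last_month`."""
    dates = []
    year, month = last_month.year, last_month.month
    for _ in range(periods):
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
        dates.append(date(year, month, 1))
    return dates


# Example (assumed) last observed month: August 1998
future = next_month_starts(date(1998, 8, 1), 6)
print(future[0], "...", future[-1])
```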

In [None]:
%%R -w 700 -h 400
# Save the forecast plot
ggsave("/tmp/orders_forecast.png", p, width = 10, height = 6, dpi = 150)
cat("Forecast plot saved to /tmp/orders_forecast.png\n")

In [None]:
# Display saved plot
from IPython.display import Image, display
display(Image(filename="/tmp/orders_forecast.png"))

---

# Section 10: Cleanup

Optional cleanup of resources created in this notebook.

In [None]:
# Uncomment to clean up resources

# Delete service
# model_version.delete_service(SERVICE_NAME)
# print(f"Deleted service: {SERVICE_NAME}")

# Delete model from registry
# reg.delete_model(MODEL_NAME)
# print(f"Deleted model: {MODEL_NAME}")

# Drop SPCS resources
# session.sql(f"DROP COMPUTE POOL IF EXISTS {COMPUTE_POOL}").collect()
# session.sql(f"DROP IMAGE REPOSITORY IF EXISTS {MODEL_DATABASE}.{MODEL_SCHEMA}.{IMAGE_REPO}").collect()
# print("Dropped compute pool and image repository")

print("Cleanup section - uncomment lines above to delete resources")

---

## Summary

This notebook demonstrated:

1. **R Environment Setup** - Installing R and forecast package in Workspace Notebooks
2. **Data Exploration** - Querying TPC-H data from Snowflake
3. **Time Series Preparation** - Creating R time series objects
4. **Model Training** - Using `auto.arima()` for automatic model selection
5. **Model Packaging** - Creating a Python wrapper with rpy2 for registry compatibility
6. **Model Registration** - Logging to Snowflake Model Registry
7. **Inference** - Running predictions via SPCS
8. **Visualization** - Creating publication-quality charts with ggplot2

### Key Technologies

| Component | Purpose |
|-----------|---------|
| rpy2 | Python-R bridge for %%R magic cells |
| forecast (R) | Time series modeling (ARIMA, ETS, etc.) |
| Snowflake Model Registry | Model versioning and management |
| SPCS | Container-based inference runtime |
| ggplot2 | Publication-quality visualizations |

### Next Steps

- Try different forecasting models (ETS, Prophet, etc.)
- Add exogenous variables for ARIMAX models
- Set up scheduled inference with Snowflake Tasks
- Create dashboards with Streamlit in Snowflake