# R Setup & Python/R Interoperability

This notebook covers **installing R** in Snowflake Workspace Notebooks and **exchanging data between Python and R**.

**What You'll Learn:**
1. Install and configure R via rpy2
2. Use `%%R` magic cells alongside Python
3. Transfer data bidirectionally (DataFrames, lists, arrays)
4. Create visualizations with ggplot2

**For Snowflake connectivity**, see the companion notebook: `r_snowflake_connectivity.ipynb`

---


# Section 1: Installation & Configuration

This section sets up R and rpy2 in the Workspace Notebook environment.

## Overview

Snowflake Workspace Notebooks run in containers with a managed Python kernel. To use R:

1. **Install R** via micromamba (lightweight conda-compatible package manager)
2. **Install rpy2** into the notebook's Python kernel
3. **Register `%%R` magic** for R cell support

## Customizing R Packages

Edit `r_packages.yaml` to customize which R packages are installed:

```yaml
# Conda-forge packages (installed via micromamba)
conda_packages:
  - r-base           # Required: Base R
  - r-tidyverse      # Data manipulation
  - r-yourpackage    # Add packages here

# CRAN packages (installed via install.packages)
cran_packages:
  - somepackage      # Packages not available on conda-forge
```
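
As a rough illustration of how the two lists map onto the two installers, the sketch below turns an already-parsed config into a micromamba command and an `install.packages()` call. It is not the setup script's actual logic: `build_install_commands` is a hypothetical helper, and the `r_env` environment name and CRAN mirror are borrowed from cells later in this notebook.

```python
import shlex

def build_install_commands(config, env_name="r_env"):
    """Return (micromamba_argv, r_expression) for a parsed r_packages.yaml dict."""
    conda_pkgs = config.get("conda_packages") or []
    cran_pkgs = config.get("cran_packages") or []

    # Conda-forge packages go into a single micromamba call.
    micromamba_argv = ["micromamba", "install", "-n", env_name,
                       "-c", "conda-forge", "-y", *conda_pkgs]

    # CRAN packages are installed from inside R; None when the list is empty.
    r_expression = None
    if cran_pkgs:
        quoted = ", ".join(f'"{name}"' for name in cran_pkgs)
        r_expression = (f'install.packages(c({quoted}), '
                        'repos = "https://cloud.r-project.org/")')
    return micromamba_argv, r_expression

config = {
    "conda_packages": ["r-base", "r-tidyverse"],
    "cran_packages": ["somepackage"],
}
argv, r_expr = build_install_commands(config)
print(shlex.join(argv))
print(r_expr)
```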

## Installation Options

| Command | Description |
|---------|-------------|
| `bash setup_r_environment.sh` | Basic R installation |
| `bash setup_r_environment.sh --adbc` | R + ADBC driver for Snowflake connectivity |
| `bash setup_r_environment.sh --verbose` | Show detailed logging |
| `bash setup_r_environment.sh --help` | Show all options |

### 1.1 Install R Environment

Run the setup script. Choose `--basic` for R only, or `--adbc` to also install the ADBC Snowflake driver.

**Note:** This step takes 2-5 minutes on first run. The `--adbc` option takes longer as it compiles the Snowflake driver.

The script includes:
- Pre-flight checks (disk space, network connectivity)
- Automatic retry for network operations
- Logging to `setup_r.log` for debugging

In [None]:
# Choose ONE of the following:

# Option A: Basic R installation (faster)
# !bash setup_r_environment.sh --basic

# Option B: R + ADBC for Snowflake connectivity (needed by r_snowflake_connectivity.ipynb)
!bash setup_r_environment.sh --adbc
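
The pre-flight disk check and automatic retry listed above can be pictured with the following sketch. The real script is written in bash and its thresholds are not shown in this notebook, so the 2 GiB minimum, the three attempts, and the helper names are illustrative assumptions.

```python
import shutil
import time

def has_free_space(path, min_bytes):
    """Pre-flight check: True if the filesystem holding `path` has enough room."""
    return shutil.disk_usage(path).free >= min_bytes

def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Example: a flaky "download" that fails twice, then succeeds.
calls = {"count": 0}

def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"

print(has_free_space(".", 2 * 1024**3))             # assumed 2 GiB minimum
print(retry(flaky_download, sleep=lambda s: None))  # skip real waiting in the demo
```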

### 1.2 Configure Python Environment & Install rpy2

This cell uses the helper module to:
1. Point Python to the R environment
2. Install rpy2 into the notebook kernel
3. Register the `%%R` magic
4. Load output helper functions for cleaner display

**Run this cell after the installation script completes.**

**Output Helpers:** Workspace Notebooks add extra line breaks to R output. After setup, use these R functions for cleaner formatting:

| Function | Usage | Description |
|----------|-------|-------------|
| `rprint(x)` | `rprint(df)` | Print any object cleanly |
| `rview(df, n)` | `rview(iris, n=10)` | View data frame with optional row limit |
| `rglimpse(df)` | `rglimpse(df)` | Glimpse data frame structure |

In [None]:
# Setup R helpers
import sys
sys.path.insert(0, '.')  # Ensure current directory is in path

from r_helpers import setup_r_environment

result = setup_r_environment()

if result['success']:
    print("✓ R environment configured successfully")
    print(f"  R version: {result['r_version']}")
    print(f"  rpy2 installed: {result['rpy2_installed']}")
    print(f"  %%R magic registered: {result['magic_registered']}")
else:
    print("✗ Setup failed:")
    for error in result['errors']:
        print(f"  - {error}")

In [None]:
# Manual R configuration (fallback)
# Uncomment and run only if the helper-based setup in the previous cell fails

# import os
# import sys
# import subprocess

# ENV_PREFIX = "/root/.local/share/mamba/envs/r_env"
# os.environ["PATH"] = f"{ENV_PREFIX}/bin:" + os.environ["PATH"]
# os.environ["R_HOME"] = f"{ENV_PREFIX}/lib/R"

# subprocess.run([sys.executable, "-m", "pip", "install", "rpy2", "-q"], check=True)

# from rpy2.ipython import rmagic
# get_ipython().register_magics(rmagic.RMagics)
# print("R environment configured")

### 1.3 Verify R Installation

Test that R is working correctly.

In [None]:
%%R
# Print R version (simple output works fine)
R.version.string

In [None]:
%%R
# List installed packages
# Use rprint() for cleaner output in Workspace Notebooks
ip <- as.data.frame(installed.packages()[, c(1, 3:4)])
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]
rprint(ip)

### 1.4 Run Diagnostics (Optional)

Run comprehensive environment diagnostics to verify all components are working.

In [None]:
from r_helpers import check_environment, print_diagnostics

# Run and display diagnostics
print_diagnostics()
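
If the helper module is unavailable, a few similar checks can be approximated with the standard library alone. This is a minimal sketch, not what `print_diagnostics()` does internally. `quick_r_diagnostics` is a hypothetical name, and the default prefix is the path used in the manual configuration cell above.

```python
import importlib.util
import os
import shutil

def quick_r_diagnostics(env_prefix="/root/.local/share/mamba/envs/r_env"):
    """Collect basic facts about the R setup without importing rpy2."""
    r_home = os.environ.get("R_HOME")
    return {
        "env_prefix_exists": os.path.isdir(env_prefix),
        "r_home": r_home,
        "r_home_exists": bool(r_home) and os.path.isdir(r_home),
        "r_on_path": shutil.which("R") is not None,
        "rpy2_importable": importlib.util.find_spec("rpy2") is not None,
    }

for check, value in quick_r_diagnostics().items():
    print(f"{check:>18}: {value}")
```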

### 1.5 Installing Additional R Packages

You can install R packages in two ways:

1. **Via `r_packages.yaml`** - Add packages before running the setup script (recommended for reproducibility)
2. **From within a `%%R` cell** - Install packages interactively during your session

**Recommendation**: Use **micromamba** when the package exists on conda-forge. It installs pre-compiled binaries, so it is faster and resolves dependencies more reliably than `install.packages()` from CRAN, which compiles from source.

In [None]:
%%R
# FALLBACK: Install from CRAN (slower - compiles from source)
# Use this only if package is not available in conda-forge
#
# lib_path <- "/root/.local/share/mamba/envs/r_env/lib/R/library"
# .libPaths(lib_path)
#
# if (!require("somepackage", quietly = TRUE)) {
#     install.packages("somepackage", repos = "https://cloud.r-project.org/", lib = lib_path)
# }

cat("See the micromamba approach below (recommended)\n")

In [None]:
%%R
# RECOMMENDED: Install via micromamba (pre-compiled, faster)
# Run this AFTER Section 1.1 has installed R via the setup script

# Check common micromamba paths (setup script uses /root/micromamba)
micromamba_paths <- c(
    "/root/micromamba/bin/micromamba",
    "/root/.local/share/mamba/bin/micromamba"
)

micromamba_bin <- NULL
for (path in micromamba_paths) {
    if (file.exists(path)) {
        micromamba_bin <- path
        break
    }
}

r_lib <- "/root/.local/share/mamba/envs/r_env/lib/R/library"

if (!is.null(micromamba_bin)) {
    cat("Found micromamba at:", micromamba_bin, "\n")
    cat("Installing r-forecast via micromamba...\n\n")
    
    # Install the package
    result <- system(paste(micromamba_bin, "install -n r_env -c conda-forge r-forecast -y"),
                     ignore.stdout = TRUE)
    
    if (result == 0) {
        .libPaths(r_lib)
        library(forecast)
        cat("✓ forecast installed via micromamba\n")
        cat("Version:", as.character(packageVersion("forecast")), "\n")
    } else {
        cat("✗ Installation failed. Try install.packages() instead.\n")
    }
} else {
    cat("micromamba not found.\n")
    cat("Run Section 1.1 first to set up the R environment.\n\n")
    cat("Alternative: Use install.packages() from CRAN:\n")
    cat("  install.packages('forecast', repos='https://cloud.r-project.org/')\n")
}

# To find conda-forge packages: https://anaconda.org/conda-forge/r-<packagename>

---

# Section 2: Python & R Interoperability

This section demonstrates how to work with data in both Python and R, including:
- Using the `%%R` magic for R cells
- Passing data from Python to R
- Passing data from R to Python
- Running R functions from Python

## 2.1 Using %%R Magic Cells

The `%%R` magic lets you write R code directly in a cell. It accepts the following flags:

| Flag | Description |
|------|-------------|
| `-i var` | Import Python variable `var` into R |
| `-o var` | Export R variable `var` back to Python |
| `-w WIDTH` | Set plot width |
| `-h HEIGHT` | Set plot height |

In [None]:
%%R
# Basic R operations
x <- c(1, 2, 3, 4, 5)
mean(x)

In [None]:
%%R
# Using tidyverse
library(dplyr)

rprint(
data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(85, 92, 78)
) %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE ~ "C"
  ))
)  

## 2.2 Passing Data: Python → R

Use the `-i` flag to pass Python objects into R cells.

In [None]:
# Create a pandas DataFrame in Python
import pandas as pd

python_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'population': [8336817, 3979576, 2693976, 2320268],
    'area_sq_mi': [302.6, 468.7, 227.3, 670.6]
})

print("Python DataFrame:")
python_df

In [None]:
%%R -i python_df
# The Python DataFrame is now available in R as 'python_df'
library(dplyr)

cat("Received DataFrame in R:\n")
rglimpse(python_df)  # Use rglimpse() for clean output

# Perform R operations
result <- python_df %>%
  mutate(density = population / area_sq_mi) %>%
  arrange(desc(density))

rprint(result)  # Use rprint() for clean output

## 2.3 Passing Data: R → Python

Use the `-o` flag to export R objects back to Python.

In [None]:
%%R -o r_result
# Create a data frame in R
r_result <- data.frame(
  x = 1:10,
  y = (1:10)^2,
  label = paste0("Point_", 1:10)
)

cat("Created R data.frame:\n")
rprint(r_result)  # Use rprint() for clean output

In [None]:
# The R data.frame is now available in Python
print("R result in Python:")
print(type(r_result))
print(r_result)

## 2.4 Using R from Python (without magic)

For more control, you can use rpy2's Python API directly.

In [None]:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Import R packages
base = importr('base')
stats = importr('stats')

# Run R code and get results
result = ro.r('sum(1:100)')
print(f"Sum of 1 to 100: {result[0]}")

In [None]:
# Convert pandas DataFrame to R and run R functions on it
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

# Create sample data
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2.1, 3.9, 6.2, 7.8, 10.1]
})

# Convert to R and run linear regression
with (ro.default_converter + pandas2ri.converter).context():
    r_df = ro.conversion.get_conversion().py2rpy(df)

# Run linear regression in R
lm_result = stats.lm('y ~ x', data=r_df)
print("Linear Regression Results:")
print(base.summary(lm_result))
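
As a sanity check on the R output, the coefficients for this small dataset can be worked out by hand with the ordinary least-squares formulas. The sketch below uses only plain Python and is independent of rpy2. `ols_fit` is a hypothetical helper name.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = ols_fit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# intercept = 0.05, slope = 1.99
```

The `(Intercept)` and `x` estimates in the `summary(lm_result)` output should match these values.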

## 2.5 Working with R's Built-in Datasets

Access R's built-in datasets and convert them to pandas.

In [None]:
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Load the iris dataset in R
ro.r("data(iris)")

# Get the R data.frame
iris_r = ro.r["iris"]

# Convert to pandas DataFrame
with localconverter(ro.default_converter + pandas2ri.converter):
    iris_df = pandas2ri.rpy2py(iris_r)

print("Iris dataset (first 10 rows):")
iris_df.head(10)

---

# Section 3: Data Visualization with ggplot2

This section demonstrates creating visualizations with **ggplot2** and displaying them in the Notebook.

## Key Points

- ggplot2 is included via `tidyverse` (installed by default)
- Use `%%R -w WIDTH -h HEIGHT` to control plot dimensions (in pixels)
- Call `print(p)` explicitly to render the plot
- Plots render inline in the Notebook output

## Plot Size Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `-w` | Width in pixels | `-w 800` |
| `-h` | Height in pixels | `-h 500` |
| `--type` | Graphics device | `--type=cairo` (optional, better quality) |

## 3.1 Basic ggplot2 Example

Create a simple scatter plot using the built-in `mtcars` dataset.

In [None]:
%%R -w 700 -h 450
library(ggplot2)

# Basic scatter plot with mtcars
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3) +
    labs(
        title = "Fuel Efficiency vs Weight",
        x = "Weight (1000 lbs)",
        y = "Miles per Gallon",
        color = "Cylinders"
    ) +
    theme_minimal()

print(p)

## 3.2 Bar Charts and Statistical Summaries

Create bar charts with the `diamonds` dataset to show aggregated data.

In [None]:
%%R -w 800 -h 500
library(ggplot2)
library(dplyr)

# Use built-in diamonds dataset - aggregate by cut quality
diamond_summary <- diamonds %>%
    group_by(cut) %>%
    summarise(
        count = n(),
        avg_price = mean(price),
        avg_carat = mean(carat)
    ) %>%
    arrange(desc(avg_price))

cat("Diamond summary by cut:\n")
print(diamond_summary)

# Create bar chart with dual encoding (height + color)
p <- ggplot(diamond_summary, aes(x = reorder(cut, avg_price), y = avg_price)) +
    geom_col(aes(fill = count), width = 0.7) +
    geom_text(aes(label = paste0("$", round(avg_price))), 
              vjust = -0.5, size = 4) +
    scale_fill_viridis_c(option = "plasma", labels = scales::comma) +
    scale_y_continuous(labels = scales::dollar, expand = expansion(mult = c(0, 0.1))) +
    labs(
        title = "Average Diamond Price by Cut Quality",
        subtitle = "Color intensity shows count of diamonds in each category",
        x = "Cut Quality",
        y = "Average Price ($)",
        fill = "Count"
    ) +
    theme_minimal(base_size = 12) +
    theme(
        plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 0)
    )

print(p)

## 3.3 Multi-Panel Visualization (Facets)

Create faceted plots to compare distributions across categories using the `mpg` dataset.

In [None]:
%%R -w 900 -h 600
library(ggplot2)

# Use built-in mpg dataset - fuel economy data
cat("mpg dataset preview:\n")
print(head(mpg))

# Create faceted plot by vehicle class
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
    geom_point(alpha = 0.7, size = 2) +
    geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
    facet_wrap(~class, scales = "free_y", ncol = 4) +
    scale_color_brewer(palette = "Set1", 
                       labels = c("4" = "4-wheel", "f" = "Front", "r" = "Rear")) +
    labs(
        title = "Highway MPG vs Engine Displacement by Vehicle Class",
        subtitle = "Colored by drive type, with trend lines",
        x = "Engine Displacement (liters)",
        y = "Highway MPG",
        color = "Drive Type"
    ) +
    theme_light(base_size = 11) +
    theme(
        plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"),
        legend.position = "bottom"
    )

print(p)

## 3.4 Saving and Loading Plots

Use `ggsave()` to export plots to files, then display them from Python using `IPython.display.Image`.

In [None]:
%%R -w 700 -h 450
library(ggplot2)

# Create a plot
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point(aes(color = factor(gear)), size = 3) +
    geom_smooth(method = "lm", se = TRUE, color = "darkgray") +
    labs(
        title = "MPG vs Horsepower",
        x = "Horsepower",
        y = "Miles per Gallon",
        color = "Gears"
    ) +
    theme_bw()

# Display the plot inline
print(p)

# Save to file
ggsave("/tmp/mpg_vs_hp.png", p, width = 8, height = 5, dpi = 150)
cat("Plot saved to /tmp/mpg_vs_hp.png\n")

In [None]:
# Display the saved PNG file in the notebook
from IPython.display import Image, display

display(Image(filename="/tmp/mpg_vs_hp.png"))
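
`ggsave()` measures `width` and `height` in inches by default, so the saved file's pixel size is the product of those values and `dpi`, independent of the `-w`/`-h` settings used for the inline plot. The sketch below computes the expected size and reads the actual size from a PNG header using only the standard library. Both helper names are hypothetical.

```python
import struct

def expected_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a raster image saved at the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)

def png_size(path):
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    is_png = header[:8] == b"\x89PNG\r\n\x1a\n" and header[12:16] == b"IHDR"
    if len(header) < 24 or not is_png:
        raise ValueError(f"{path} does not look like a PNG file")
    return struct.unpack(">II", header[16:24])

print(expected_pixels(8, 5, 150))  # (1200, 750)
# After the ggsave() cell has run, this should report the same size:
# print(png_size("/tmp/mpg_vs_hp.png"))
```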

---

# Section 4: Reticulate - Python/R Object Exchange

The `reticulate` package lets R code import Python modules and convert objects between the two languages.

## Key Conversions

| Python Type | R Type |
|------------|--------|
| `pandas.DataFrame` | `data.frame` |
| `list` | `list` |
| `dict` | Named `list` |
| `numpy.ndarray` | `matrix` / `array` |
| Scalar types | Atomic vectors |

## When to Use

- **rpy2 `-i`/`-o` flags**: Quick transfers in magic cells
- **reticulate in R**: Access Python objects from R code
- **rpy2.robjects**: Programmatic Python-to-R in Python code

In [None]:
%%R
# Access Python objects from R using reticulate
library(reticulate)

cat("=== Accessing Python Pandas DataFrame from R ===\n\n")

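# Build a pandas DataFrame through reticulate, then convert it to R
tryCatch({
    # convert = FALSE keeps results as Python objects until py_to_r() is called;
    # with the default (convert = TRUE) pd$DataFrame() would already return an R data.frame
    pd <- import("pandas", convert = FALSE)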
    
    # Create a sample DataFrame in Python (via reticulate)
    py_df <- pd$DataFrame(list(
        name = c("Alice", "Bob", "Charlie"),
        age = c(25L, 30L, 35L),
        score = c(85.5, 92.0, 78.5)
    ))
    
    cat("Python DataFrame (pandas):\n")
    print(py_df)
    
    # Convert to R data.frame
    r_df <- py_to_r(py_df)
    cat("\nConverted to R data.frame:\n")
    print(r_df)
    cat("\nClass:", class(r_df), "\n")
    
    # Manipulate with dplyr
    library(dplyr)
    result <- r_df %>%
        filter(age >= 30) %>%
        mutate(grade = ifelse(score >= 90, "A", "B"))
    
    cat("\nFiltered & transformed with dplyr:\n")
    print(result)
    
}, error = function(e) {
    cat("Error:", conditionMessage(e), "\n")
})

In [None]:
# Create a Pandas DataFrame that R can access
import pandas as pd
import numpy as np

# Sample sales data
sales_df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North', 'South'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Gadget', 'Widget'],
    'revenue': [1200.50, 980.25, 1500.00, 1100.75, 890.00, 1350.25],
    'units': [100, 82, 120, 88, 74, 112]
})

print("Created sales_df in Python:")
print(sales_df)
print(f"\nShape: {sales_df.shape}")

In [None]:
%%R -i sales_df
# Access the Python DataFrame using rpy2's -i flag
# sales_df is automatically converted to R data.frame

cat("=== Python DataFrame accessed in R ===\n\n")

cat("Class:", class(sales_df), "\n")
cat("Dimensions:", nrow(sales_df), "x", ncol(sales_df), "\n\n")

print(sales_df)

# Analyze with R
library(dplyr)
summary_by_region <- sales_df %>%
    group_by(region) %>%
    summarise(
        total_revenue = sum(revenue),
        total_units = sum(units),
        avg_price = mean(revenue / units)
    ) %>%
    arrange(desc(total_revenue))

cat("\n=== Summary by Region (dplyr) ===\n")
print(summary_by_region)

In [None]:
%%R -o r_summary
# Create an R data.frame and export to Python with -o flag

r_summary <- data.frame(
    metric = c("mean_score", "sd_score", "n_samples"),
    value = c(85.3, 12.7, 150),
    category = c("performance", "performance", "count")
)

cat("Created r_summary in R:\n")
print(r_summary)
cat("\nThis will be available as 'r_summary' in Python after this cell runs.\n")

In [None]:
# Access the R data.frame that was exported with -o flag
print("=== R DataFrame accessed in Python ===\n")
print(f"Type: {type(r_summary)}")
print(f"\n{r_summary}")

# Convert to proper pandas if needed
if hasattr(r_summary, 'to_pandas'):
    r_summary = r_summary.to_pandas()
    print(f"\nConverted to pandas DataFrame")
    print(f"Columns: {list(r_summary.columns)}")

In [None]:
%%R
# Working with Python lists and dicts from R
library(reticulate)

cat("=== Python Lists & Dicts in R ===\n\n")

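# Access Python builtins (convert = FALSE keeps results as Python objects,
# so the explicit py_to_r() calls below perform the conversion)
py <- import_builtins(convert = FALSE)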

# Create Python list
py_list <- py$list(c(1, 2, 3, 4, 5))
cat("Python list:", py_to_r(py_list), "\n")

# Create Python dict
py_dict <- py$dict(list(a = 1, b = 2, c = 3))
cat("Python dict as R list:\n")
r_list <- py_to_r(py_dict)
print(r_list)

# R list to Python dict
cat("\n=== R to Python ===\n")
r_config <- list(
    model = "random_forest",
    n_estimators = 100L,
    max_depth = 10L,
    features = c("age", "income", "score")
)
cat("R list that can be passed to Python:\n")
print(r_config)
cat("\nUse r_to_py() or rpy2 -o flag to send to Python\n")

In [None]:
%%R
# NumPy arrays and R matrices
library(reticulate)

cat("=== NumPy / R Matrix Exchange ===\n\n")

# convert = FALSE so np$array() returns a NumPy array rather than an R matrix
np <- import("numpy", convert = FALSE)

# Create NumPy array
py_array <- np$array(matrix(1:12, nrow=3, ncol=4))
cat("NumPy array created from R matrix:\n")
print(py_array)

# Convert back to R
r_matrix <- py_to_r(py_array)
cat("\nBack to R matrix:\n")
print(r_matrix)
cat("Class:", class(r_matrix), "\n")

# Do R matrix operations
cat("\nR matrix operations (column means):\n")
print(colMeans(r_matrix))
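
R fills `matrix(1:12, nrow=3, ncol=4)` column by column, which is worth keeping in mind when comparing against NumPy's default row-major layout. The plain-Python sketch below reproduces R's fill order and the column means the cell above should print. `r_style_matrix` is a hypothetical helper, not part of reticulate or NumPy.

```python
def r_style_matrix(values, nrow, ncol):
    """Arrange `values` column by column, as R's matrix() does by default."""
    values = list(values)
    if len(values) != nrow * ncol:
        raise ValueError("values must contain exactly nrow * ncol items")
    return [[values[col * nrow + row] for col in range(ncol)] for row in range(nrow)]

m = r_style_matrix(range(1, 13), nrow=3, ncol=4)
for row in m:
    print(row)
# [1, 4, 7, 10]
# [2, 5, 8, 11]
# [3, 6, 9, 12]

col_means = [sum(col) / len(col) for col in zip(*m)]
print(col_means)  # [2.0, 5.0, 8.0, 11.0]
```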

---

## Summary: Data Exchange Cheat Sheet

### rpy2 Magic Flags (Quick)

```python
%%R -i py_var        # Python → R (input)
%%R -o r_var         # R → Python (output)
%%R -i df -o result  # Both directions
```

### reticulate in R (Full Control)

```r
library(reticulate)
pd <- import("pandas")           # Import Python module
py_obj <- r_to_py(r_obj)         # R → Python
r_obj <- py_to_r(py_obj)         # Python → R
```

### Type Mappings

| Python | R | Notes |
|--------|---|-------|
| `pandas.DataFrame` | `data.frame` | Automatic |
| `list` | `list` | Unnamed list |
| `dict` | `list` | Named list |
| `numpy.ndarray` | `matrix`/`array` | Shape preserved |
| `int`, `float`, `str` | Atomic vectors | Length-1 vectors |

---

**Next Steps:** See `r_snowflake_connectivity.ipynb` for Snowflake database access from R.