# Using DataFrames with RLM

This tutorial shows how to pass pandas DataFrames to DSPy's RLM (Recursive Language Model) module while preserving data types.

## Setup

First, install the required packages:

In [None]:
# Uncomment to install the dependencies from within a notebook:
# %pip install -q dspy
# %pip install -q pandas

In [None]:
import dspy
import pandas as pd

# Configure your LM
lm = dspy.LM("openai/gpt-4.1-mini")
dspy.configure(lm=lm)

## Create Sample Data

Let's create a DataFrame with various data types to demonstrate type preservation:

In [14]:
sales_data = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']),
    'product': pd.Categorical(['Widget', 'Gadget', 'Widget', 'Gizmo', 'Gadget']),
    'quantity': [10, 25, 15, 8, 30],
    'price': [29.99, 49.99, 29.99, 99.99, 49.99],
    'is_promotion': [True, False, True, False, True]
})

print("DataFrame:")
print(sales_data)
print("\nData types:")
print(sales_data.dtypes)

DataFrame:
        date product  quantity  price  is_promotion
0 2024-01-15  Widget        10  29.99          True
1 2024-01-16  Gadget        25  49.99         False
2 2024-01-17  Widget        15  29.99          True
3 2024-01-18   Gizmo         8  99.99         False
4 2024-01-19  Gadget        30  49.99          True

Data types:
date            datetime64[us]
product               category
quantity                 int64
price                  float64
is_promotion              bool
dtype: object


## Define a Signature with DataFrame

Declare the input field as `dspy.DataFrame`. RLM can then work with the full DataFrame in its sandbox, not only the text preview shown in the prompt:

In [15]:
class AnalyzeSales(dspy.Signature):
    """Analyze sales data and provide insights."""
    
    data: dspy.DataFrame = dspy.InputField(desc="Sales transaction data")
    total_revenue: float = dspy.OutputField(desc="Total revenue across all sales")
    top_product: str = dspy.OutputField(desc="Product with highest quantity sold")
    summary: str = dspy.OutputField(desc="Brief analysis summary")

## Run with RLM

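RLM executes Python code in a sandbox where the DataFrame is available under its field name (`data` here). Column types are intended to carry over, with the JSON caveat described under "How It Works":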

In [16]:
analyzer = dspy.RLM(AnalyzeSales, verbose=True)
result = analyzer(data=sales_data)

2026/02/09 21:31:56 INFO dspy.predict.rlm: RLM iteration 1/20
Reasoning: I have a sales transaction DataFrame named `data` with columns: date, product, quantity, price, and is_promotion. The goal is to analyze this data to provide three outputs: total revenue (as a float), top product by revenue, and a summary.

First, I will explore the data by printing its structure and a few rows to confirm the contents and understand the data types. Then, I will calculate the total revenue as quantity * price summed over all rows. Next, I will determine the top product by total revenue and also generate a brief summary describing insights from the data.

My initial step is to print a preview of the data and some basic information about the DataFrame.
Code:
print(data.info())
print(data.head())
2026/02/09 21:32:00 INFO dspy.predict.rlm: RLM iteration 2/20
Reasoning: I have confirmed the data structure and types. The 'date' column is datetime, 'product' and 'is_promotion' are categorical (though curr

In [17]:
print(f"Total Revenue: ${result.total_revenue:.2f}")
print(f"Top Product: {result.top_product}")
print(f"Summary: {result.summary}")

Total Revenue: $4299.12
Top Product: Gadget
Summary: The sales data covers the period from 2024-01-15 to 2024-01-19. Total revenue generated is $4299.12. The top product by revenue is 'Gadget', contributing $2749.45. Revenue from promotional sales is $2249.45, while non-promotional sales account for $2049.67.
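
The figures reported above can be cross-checked with plain pandas. The sketch below rebuilds the sample data so it runs on its own and does not call RLM; the variable names are illustrative. It computes the top product both by quantity (what the signature asks for) and by revenue (what the model's summary reports), which for this data give the same answer.

```python
import pandas as pd

# Same sample data as in the tutorial, rebuilt so this cell is self-contained.
sales_data = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19"]),
    "product": pd.Categorical(["Widget", "Gadget", "Widget", "Gizmo", "Gadget"]),
    "quantity": [10, 25, 15, 8, 30],
    "price": [29.99, 49.99, 29.99, 99.99, 49.99],
    "is_promotion": [True, False, True, False, True],
})

# Revenue per row, then the overall total rounded to cents.
revenue = sales_data["quantity"] * sales_data["price"]
total_revenue = round(float(revenue.sum()), 2)

# Top product by units sold (the criterion named in the signature).
quantity_by_product = sales_data.groupby("product", observed=True)["quantity"].sum()
top_by_quantity = str(quantity_by_product.idxmax())

# Top product by revenue (the criterion the model's summary uses).
revenue_by_product = revenue.groupby(sales_data["product"], observed=True).sum()
top_by_revenue = str(revenue_by_product.idxmax())

# Revenue from rows flagged as promotional.
promo_revenue = round(float(revenue[sales_data["is_promotion"]].sum()), 2)

print(total_revenue, top_by_quantity, top_by_revenue, promo_revenue)
```

For this sample, the recomputed total is 4299.12, the promotional revenue is 2249.45, and Gadget leads on both criteria, in agreement with the RLM result.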


## How It Works

Behind the scenes:

1. **Wrapping**: RLM auto-wraps raw pandas DataFrames into `dspy.DataFrame` when the signature field declares that type
2. **Serialization**: The DataFrame is serialized to JSON records via pandas' `DataFrame.to_json()`
3. **Injection**: The JSON is written into the Pyodide sandbox's virtual filesystem
4. **Reconstruction**: RLM generates and runs Python code that reads the JSON back into a pandas DataFrame

Note: Because the data passes through JSON, some dtypes may need to be re-inferred on the sandbox side (for example, dates arrive as strings). The LLM typically handles this as part of its analysis code.
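
The effect of the JSON round trip described above can be reproduced with pandas alone. The following is a minimal sketch, not DSPy's actual code: the `orient="records"` and `date_format="iso"` arguments are assumptions chosen for illustration, and the restoration step stands in for what the LLM-generated code would typically do inside the sandbox.

```python
import json

import pandas as pd

# A two-row slice of the tutorial's data is enough to show the effect.
sales_data = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    "product": pd.Categorical(["Widget", "Gadget"]),
    "quantity": [10, 25],
    "price": [29.99, 49.99],
    "is_promotion": [True, False],
})

# Assumption: records orientation with ISO-formatted dates.
payload = sales_data.to_json(orient="records", date_format="iso")

# Rebuild a DataFrame from the parsed records without any dtype hints.
raw = pd.DataFrame(json.loads(payload))
print(raw.dtypes)  # date and product are no longer datetime / category

# Restore the dtypes that JSON cannot represent.
restored = raw.assign(
    date=pd.to_datetime(raw["date"]),
    product=raw["product"].astype("category"),
)
print(restored.dtypes)
```

Numeric and boolean columns survive the round trip unchanged in this sketch; only the datetime and categorical columns need explicit conversion.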

The LLM sees a structured preview of the data in the prompt:

In [18]:
# Print the preview that the LLM sees in its prompt
df_wrapped = dspy.DataFrame(sales_data)
print(df_wrapped.rlm_preview())

DataFrame: 5 rows x 5 columns

Columns:
  date: datetime64[us]
  product: category
  quantity: int64
  price: float64
  is_promotion: bool

Sample (first 3 rows):
        date product  quantity  price  is_promotion
0 2024-01-15  Widget        10  29.99          True
1 2024-01-16  Gadget        25  49.99         False
2 2024-01-17  Widget        15  29.99          True
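
A preview in this shape is straightforward to approximate with plain pandas. The helper below, `preview_frame`, is hypothetical and is not how `rlm_preview()` is implemented; it only illustrates the kind of summary (shape, column dtypes, first rows) that ends up in the prompt.

```python
import pandas as pd


def preview_frame(df: pd.DataFrame, n_rows: int = 3) -> str:
    """Build a short text summary of a DataFrame (hypothetical helper)."""
    shown = min(n_rows, len(df))
    lines = [f"DataFrame: {len(df)} rows x {len(df.columns)} columns", "", "Columns:"]
    # One line per column with its dtype.
    lines += [f"  {name}: {dtype}" for name, dtype in df.dtypes.items()]
    # A few leading rows rendered the way pandas prints them.
    lines += ["", f"Sample (first {shown} rows):", df.head(shown).to_string()]
    return "\n".join(lines)


example = pd.DataFrame({
    "product": ["Widget", "Gadget", "Widget", "Gizmo", "Gadget"],
    "quantity": [10, 25, 15, 8, 30],
})
text = preview_frame(example)
print(text)
```

Keeping the preview small matters because it is sent with the prompt, while the full data stays in the sandbox for the generated code to query.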
