# Tutorial: DataFrames with RLM

DSPy's RLM (Recursive Language Model) module accepts pandas DataFrames as inputs. Each DataFrame is automatically serialized to Parquet, which preserves column dtypes, and the LLM receives a metadata summary of the data.

Install the latest DSPy via `pip install -U dspy` and follow along.

## 1) Setup

First, let's configure DSPy with an LM and create a sample DataFrame.

In [None]:
import warnings
warnings.filterwarnings("ignore", message="Pydantic serializer warnings")

import dspy
import pandas as pd

# Configure your LM
lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929", max_tokens=16000)
dspy.configure(lm=lm)

In [None]:
# Create a sample DataFrame
dataframe = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"]
})

dataframe

## 2) Using `dspy.DataFrame` in Signatures

To pass a DataFrame as an input field in a DSPy Signature, annotate the field with `dspy.DataFrame`. The annotation tells Pydantic how to handle the DataFrame type.

In [None]:
class DocWriter(dspy.Signature):
    """Write documentation for the provided data."""
    
    dataframe: dspy.DataFrame = dspy.InputField()
    documentation: str = dspy.OutputField(desc="Generated markdown documentation.")

## 3) Running RLM with DataFrames

Now we can use `dspy.RLM` to process the DataFrame. The RLM module will:

1. Serialize the DataFrame to Parquet format (preserving dtypes)
2. Provide rich metadata to the LLM (shape, columns, dtypes, null counts, sample rows)
3. Make the DataFrame available in the Python sandbox for code execution

In [None]:
doc_writer = dspy.RLM(
    DocWriter,
    max_iterations=10,
    verbose=True
)

result = doc_writer(dataframe=dataframe)

In [None]:
print(result.documentation)

## 4) How It Works

When you pass a DataFrame to RLM:

1. **Serialization**: The DataFrame is serialized to Parquet format using PyArrow, which preserves data types (int, float, datetime, categorical, etc.)

2. **Metadata**: The LLM receives rich metadata about the DataFrame:
   - Shape (rows x columns)
   - Column names and dtypes
   - Null value counts
   - Sample rows (first and last 3 rows)

3. **Sandbox Access**: The DataFrame is made available in the Python sandbox, where the LLM-generated code can access it directly using pandas operations.
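
The metadata step can be approximated with plain pandas. The helper below is a minimal sketch, not DSPy's actual implementation: the function name `summarize_dataframe`, its `n_sample` parameter, and the shape of the returned dictionary are all assumptions made for illustration. It collects the same kinds of facts listed above (shape, dtypes, null counts, first and last rows).

```python
import pandas as pd


def summarize_dataframe(df: pd.DataFrame, n_sample: int = 3) -> dict:
    """Hypothetical helper: gather the kind of metadata described above."""
    return {
        "shape": df.shape,  # (rows, columns)
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "null_counts": {col: int(n) for col, n in df.isna().sum().items()},
        "head": df.head(n_sample).to_dict(orient="records"),
        "tail": df.tail(n_sample).to_dict(orient="records"),
    }


# Same columns as the sample DataFrame, with one missing age added
# so that the null count is non-trivial.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, None, 40, 45],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
})

summary = summarize_dataframe(people)
print(summary["shape"])
print(summary["dtypes"])
print(summary["null_counts"])
```

A compact summary like this lets the model plan its code without having to see every row, which matters once the DataFrame is larger than a few rows.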

## 5) Advanced Example: Data Analysis

Let's try a more complex example where the LLM analyzes the data.

In [None]:
# Create a more complex DataFrame with categorical and datetime columns
import numpy as np

# Seeded generator so the random prices and quantities are reproducible
rng = np.random.default_rng(0)

sales_data = pd.DataFrame({
    "product": ["Widget A", "Widget B", "Widget C", "Widget D", "Widget E"] * 20,
    "category": pd.Categorical(["Electronics", "Home", "Electronics", "Home", "Garden"] * 20),
    "price": rng.uniform(10, 100, 100).round(2),
    "quantity": rng.integers(1, 50, 100),
    "date": pd.date_range("2024-01-01", periods=100),
})

sales_data.head()

In [None]:
class DataAnalyst(dspy.Signature):
    """Analyze the sales data and provide insights."""
    
    sales_data: dspy.DataFrame = dspy.InputField()
    analysis: str = dspy.OutputField(desc="Detailed analysis with statistics and insights.")

analyst = dspy.RLM(DataAnalyst, max_iterations=10, verbose=True)
result = analyst(sales_data=sales_data)

In [None]:
print(result.analysis)
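
For orientation, the code the LLM writes in the sandbox for a task like this is ordinary pandas. The snippet below is a hypothetical example of one such step, not output captured from an actual RLM run. It uses a small fixed stand-in frame named `sales` so the numbers are reproducible; the variable name under which the real DataFrame appears in the sandbox is an assumption here.

```python
import pandas as pd

# Small fixed stand-in for sales_data so the numbers are reproducible.
sales = pd.DataFrame({
    "category": ["Electronics", "Home", "Electronics", "Home", "Garden"],
    "price": [10.0, 20.0, 30.0, 5.0, 6.0],
    "quantity": [2, 1, 3, 4, 5],
})

# Revenue per row, then total revenue and units sold per category,
# sorted from highest to lowest revenue.
sales["revenue"] = sales["price"] * sales["quantity"]
by_category = (
    sales.groupby("category")
    .agg(total_revenue=("revenue", "sum"), units=("quantity", "sum"))
    .sort_values("total_revenue", ascending=False)
)
print(by_category)
```

With `verbose=True`, the RLM run above logs its intermediate steps, so the generated code can be compared against a sketch like this one.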

## Summary

- Use `dspy.DataFrame` as the type annotation for DataFrame input fields
- DataFrames are serialized via Parquet, preserving dtypes such as int, float, datetime, and categorical
- Rich metadata (shape, columns, dtypes, null counts, sample rows) is provided to the LLM
- The DataFrame is available in the RLM sandbox for pandas operations
- DataFrame inputs can be combined with other input types (strings, numbers, etc.) in the same Signature