# Using an LLM to interact with Pandas

It is possible to use an LLM to interact with a Pandas dataframe, which can be useful when exploring a large, unfamiliar dataset. The setup can be thought of as a simple [RAG][1] system combined with [tool-calling][2]. As the tool calls are fairly complex, we'll use the [LlamaIndex framework][3] to do the [heavy lifting (docs)][4].

**Caveat**: This is a bleeding-edge technique and thus error-prone, _and_ it uses Python's `eval` function to execute code written by the LLM on your behalf, which is a potential security risk. **Consider yourself warned.**

[1]: ../RAG-tutorial/intro.ipynb
[2]: ../LLM-tool-calling/LLM-tool-calling.ipynb
[3]: https://www.llamaindex.ai
[4]: https://docs.llamaindex.ai/en/stable/api_reference/query_engine/pandas/
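
To make the caveat concrete, here is a minimal sketch of why `eval` on untrusted text is risky. The string below is a harmless, made-up stand-in for something an LLM might return; it is not output from any real model.

```python
import os

# Hypothetical string standing in for an expression returned by an LLM.
untrusted_expression = "__import__('os').getcwd()"

# eval() runs the expression with access to the whole interpreter,
# so a single expression can import modules and reach the operating system.
result = eval(untrusted_expression)

# The expression really did call into the `os` module.
print(result == os.getcwd())
```

This example only reads the current working directory, but the same mechanism could just as well delete files, which is why generated code should only be run on data and machines you can afford to lose.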

## The big picture

The process will look something like the following figure (compare to the [RAG setup][1]):

![Query pipeline](img/pipeline.png)

1. The user inputs a query which is transformed into a first prompt, adding information about the Pandas dataframe to target before feeding the prompt to the LLM.
2. The LLM processes the first prompt and makes [_tool calls_][2] to Pandas to produce intermediary output (containing code, data and further instructions).
3. The output is combined with the original query to form a second prompt.
4. The second prompt is fed to the LLM to generate an answer to the user query.

[1]: ../RAG-tutorial/intro.ipynb
[2]: ../LLM-tool-calling/LLM-tool-calling.ipynb
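
The four steps above can be mimicked in plain Python. This is only a rough sketch under stated assumptions: `fake_llm` is a hypothetical stub that replaces the real model, and `table` is a toy dict of columns standing in for the dataframe, so nothing here uses LlamaIndex or Pandas.

```python
# Toy stand-in for the dataframe (made-up values, not the Titanic data).
table = {"Survived": [0, 1, 1], "Age": [22, 38, 26]}


def fake_llm(prompt: str) -> str:
    """Hypothetical stub replacing both LLM calls."""
    if prompt.rstrip().endswith("Expression:"):
        # First call: return code that computes the answer, not the answer itself.
        return "sum(table['Survived']) / len(table['Survived'])"
    # Second call: turn the last line of the prompt into a sentence.
    return "Answer based on: " + prompt.splitlines()[-1]


def answer(query: str) -> str:
    # Step 1: build the first prompt from the query plus information about the data.
    first_prompt = f"Columns: {list(table)}\nQuery: {query}\nExpression:"
    # Step 2: the LLM returns an expression, which is evaluated against the data.
    expression = fake_llm(first_prompt)
    output = eval(expression, {"table": table})
    # Step 3: combine the query, the expression and its output into a second prompt.
    second_prompt = f"Query: {query}\nInstructions: {expression}\nOutput: {output}"
    # Step 4: the LLM turns the second prompt into the final answer.
    return fake_llm(second_prompt)


print(answer("What fraction of the passengers survived?"))
```

The important detail is that the first call returns an expression, which is evaluated separately, and only the second call produces the answer shown to the user.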

## Install prerequisites

In [None]:
!pip -q install ollama llama-index llama-index-experimental  llama-index-llms-ollama

## A simple Pandas agent

As an example, we will use the [titanic dataset](https://jkarakas.github.io/Exploratory-Analysis-of-the-Titanic-Dataset/Titanic_Dataset_Exploratory_Analysis_No_Code.html) in `data/titanic.csv`, so let's import Pandas and load the dataset:

In [None]:
import pandas as pd

df = pd.read_csv("data/titanic.csv")

In [None]:
df.info()

In [None]:
df.head()

Let's first try a "vanilla" call with a simple question about the dataset:

In [None]:
OLLAMA_HOST = 'http://10.129.20.4:9090'
OLLAMA_MODEL = 'qwen2.5-coder:latest' # 'deepseek-coder-v2:latest' # 'deepseek-r1:70b' # 'llama3.3:latest'

In [None]:
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings
from llama_index.experimental.query_engine import PandasQueryEngine

Settings.llm = Ollama(model=OLLAMA_MODEL, base_url=OLLAMA_HOST)
query_engine = PandasQueryEngine(df=df, verbose=False)
response = query_engine.query(
    "What is the key for the column outlining survival?",
)

In [None]:
print(response)

OK, hopefully you got "Survived" as the response, in accordance with the output from `df.info()` above. If not, just re-run the above cell until you do :)

## Query pipelines

Let's dig into what just happened in detail by building our own query pipeline from scratch.

### Prompt templates

We can get a peek at the prompt templates used by retrieving them from the query_engine:

In [None]:
prompts = query_engine.get_prompts()
for key in prompts.keys():
    print(f"--- {key} ---\n")
    print(prompts[key].template)
    print()

We *could* make up our own templates, but for now we'll use them as-is.

In [None]:
pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

The default `instruction_str` is:

In [None]:
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

## Build a Query Pipeline

As we'll be diving into `llamaindex`, consider keeping the [docs for llamaindex *pipelines*][1] readily available.

But first, let's set up a small helper function to aid in debugging:

[1]: https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline/

In [None]:
!pip -q install pyvis

In [None]:
## Helper function to visualize a pipeline
from pyvis.network import Network
from IPython.display import display, HTML

def showPipeline(pipeline):
    dag = Network(notebook=True, cdn_resources="in_line", directed=True)
    dag.from_nx(pipeline.dag)
    display(HTML(dag.generate_html()))

Start with importing the necessary modules, and set up an LLM client:

In [None]:
from llama_index.llms.ollama import Ollama
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import QueryPipeline
from llama_index.core.query_pipeline import InputComponent
from llama_index.experimental.query_engine.pandas import PandasInstructionParser

OLLAMA_HOST = 'http://10.129.20.4:9090'
OLLAMA_MODEL = 'qwen2.5-coder:latest' # 'deepseek-coder-v2:latest' # 'deepseek-r1:70b' # 'llama3.3:latest'

llm = Ollama(model=OLLAMA_MODEL, base_url=OLLAMA_HOST)

Then prepare the prompts using the above templates:

In [None]:
pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(instruction_str=instruction_str, df_str=df.head(5))

response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
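
Partially formatting a template means filling in some placeholders now and leaving the rest for later. The plain-Python sketch below mimics that idea with `functools.partial`; it does not use LlamaIndex, and the shortened template and placeholder values are made up for illustration.

```python
from functools import partial

# Shortened, hypothetical template with the same three placeholders as above.
template = (
    "This is the result of `print(df.head())`:\n{df_str}\n\n"
    "Follow these instructions:\n{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

# Fix the parts that never change between queries ...
fill_query = partial(
    template.format,
    df_str="<output of df.head()>",
    instruction_str="1. Return a single Python expression.",
)

# ... and supply only the query when it is known.
prompt = fill_query(query_str="How many rows are there?")
print(prompt)
```

In the pipeline, this is why only `query_str` has to be supplied at run time: the dataframe preview and the instructions are already baked into `pandas_prompt`.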

To interact with Pandas, we'll use [PandasInstructionParser][1]:

[1]: https://docs.llamaindex.ai/en/stable/api_reference/query_engine/pandas/#llama_index.experimental.query_engine.PandasQueryEngine

In [None]:
pandas_output_parser = PandasInstructionParser(df)

Next we'll instantiate an empty pipeline and define the pieces we'll use. Setting `verbose=True` provides output from each stage, which helps with debugging.

In [None]:

query_pipeline = QueryPipeline(verbose=True)
query_pipeline.add_modules(
    {
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    }
)

To make the pipeline useful, we need to specify the processing steps. Take a look at the figure at the start of this tutorial and you'll see that the processing steps form a *graph*, more precisely a [Directed Acyclic Graph][1] (DAG). Starting from the input, we can add the first *edges* between the *nodes* defined in `add_modules` above in a from->to style:

[1]: https://en.wikipedia.org/wiki/Directed_acyclic_graph

In [None]:
query_pipeline.add_link("input", "pandas_prompt")
query_pipeline.add_link("pandas_prompt", "llm1")
query_pipeline.add_link("llm1", "pandas_output_parser")

As you recall, the `response_synthesis_prompt` takes three different inputs (`query_str`, `pandas_output`, and `pandas_instructions`) that are provided from three different processing nodes. In order to make each input end up in the right place, use the optional argument `dest_key` when adding the edges:

In [None]:
query_pipeline.add_link("pandas_output_parser", "response_synthesis_prompt", dest_key="pandas_output")
query_pipeline.add_link("llm1", "response_synthesis_prompt", dest_key="pandas_instructions")
query_pipeline.add_link("input", "response_synthesis_prompt", dest_key="query_str")

Finally, pass the combined result as a prompt to the final LLM processing step:

In [None]:
query_pipeline.add_link("response_synthesis_prompt", "llm2")
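
To see why the links form a DAG, the same edges can be written down as plain data and sorted with `graphlib` from the standard library (Python 3.9+). This sketch is independent of LlamaIndex; the `links` list simply repeats the `add_link` calls, with `None` where no `dest_key` was given.

```python
from graphlib import TopologicalSorter

# (source, destination, dest_key) for every link in the pipeline.
links = [
    ("input", "pandas_prompt", None),
    ("pandas_prompt", "llm1", None),
    ("llm1", "pandas_output_parser", None),
    ("pandas_output_parser", "response_synthesis_prompt", "pandas_output"),
    ("llm1", "response_synthesis_prompt", "pandas_instructions"),
    ("input", "response_synthesis_prompt", "query_str"),
    ("response_synthesis_prompt", "llm2", None),
]

# TopologicalSorter expects a mapping from each node to its predecessors.
predecessors = {}
for source, destination, _ in links:
    predecessors.setdefault(source, set())
    predecessors.setdefault(destination, set()).add(source)

# A valid execution order exists only if the graph has no cycles;
# otherwise static_order() raises graphlib.CycleError.
order = list(TopologicalSorter(predecessors).static_order())
print(order)

# Which node feeds which placeholder of the response synthesis prompt.
synthesis_inputs = {
    key: source
    for source, destination, key in links
    if destination == "response_synthesis_prompt"
}
print(synthesis_inputs)
```

The second dictionary shows what `dest_key` is for: three different nodes feed the same prompt, and the key decides which placeholder each output fills.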

We can use the previously defined helper function to visualize the pipeline DAG and check that it looks OK:

In [None]:
showPipeline(query_pipeline)

With that we are ready to run our homegrown pipeline:

In [None]:
response = query_pipeline.run(
    query_str="What is the key for the column outlining survival?",
)
print(response)

## Examining the fragile pipeline

Sometimes the first LLM invocation jumps to conclusions and responds directly with the name of the column rather than the _Python code required_ by the next step, the Pandas output parser (see below). This causes an exception (try to explain why to yourself).

**Bad output example**

```
> Running module input with input:
query_str: What is the key for the column outlining survival?

> Running module pandas_prompt with input:
query_str: What is the key for the column outlining survival?

> Running module llm1 with input:
messages: You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
   Survived  Pclass                                               Name  ...

INFO:httpx:HTTP Request: POST http://10.129.20.4:9090/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://10.129.20.4:9090/api/chat "HTTP/1.1 200 OK"
> Running module pandas_output_parser with input:
input: assistant: Survived

> Running module response_synthesis_prompt with input:
query_str: What is the key for the column outlining survival?
pandas_instructions: assistant: Survived
pandas_output: There was an error running the output as Python code. Error message: name 'Survived' is not defined

...
```

Looking at line 2 of `instruction_str`:
```
"2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
```
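the problem is most likely that the first LLM derives and outputs the key "Survived" directly instead of _the code_ to retrieve the key, e.g. `df.columns[0]`. When `eval()` is run on the bare word `Survived`, Python treats it as a variable name, and since no such variable exists it raises the `NameError` seen in the log above.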
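
The difference between the two kinds of output can be reproduced without an LLM. In this small sketch, `fake_df` is a hypothetical stand-in that only mimics the `columns` attribute of the real dataframe, and `run_expression` is an illustrative helper, not part of LlamaIndex.

```python
from types import SimpleNamespace

# Stand-in for the dataframe; only `columns` is mimicked.
fake_df = SimpleNamespace(columns=["Survived", "Pclass", "Name"])


def run_expression(expression: str):
    """Evaluate an expression with `df` bound to the stand-in dataframe."""
    try:
        return eval(expression, {"df": fake_df})
    except Exception as exc:
        return f"Error: {exc}"


# Code that retrieves the key evaluates fine.
print(run_expression("df.columns[0]"))

# The bare key is read as a variable name, which is not defined.
print(run_expression("Survived"))
```

The first call returns the column name, while the second reports that the name `Survived` is not defined, matching the error message in the bad output example.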

**Exercise**: See if you can change the instructions to fix this problem.

## More complicated queries

Don't expect queries to work every time; re-run them until they do. The success rate depends on, among other things, which LLM you have chosen.

**Exercise**: Try with different models, just make sure that they are capable of tool-calling.
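
Re-running by hand gets tedious, so a small retry loop can help. The helper below is a hypothetical sketch, not part of LlamaIndex: `run_query` is any zero-argument callable and `looks_ok` is a check you define yourself. A stub that fails twice stands in for a flaky pipeline run.

```python
def run_with_retries(run_query, looks_ok, max_attempts=3):
    """Call run_query() until looks_ok(result) is true or attempts run out."""
    result = None
    for attempt in range(1, max_attempts + 1):
        result = run_query()
        if looks_ok(result):
            return result, attempt
    # Give up and return the last result together with the attempt count.
    return result, max_attempts


# Stub replies standing in for three consecutive pipeline runs.
replies = iter(["error", "error", "The correlation is weak."])

result, attempts = run_with_retries(
    run_query=lambda: next(replies),
    looks_ok=lambda text: "error" not in text,
)
print(attempts, result)
```

With the real pipeline, `run_query` could be something like `lambda: str(query_pipeline.run(query_str=...))`, assuming that checking the response text is a good enough success test for your query.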

In [None]:
response = query_pipeline.run(
    query_str="What is the correlation between survival and age?",
)
print(response.message.content)

In [None]:
response = query_pipeline.run(
    query_str="Generate python code to plot survival rate versus fare using matplotlib. Choose an appropriate binsize. Show plot as well as code.",
)
print(response.message.content)