# Using LangChain's Pandas DataFrame Agent

## Introduction

The `create_pandas_dataframe_agent` function, defined in the `langchain_experimental.agents.agent_toolkits.pandas.base` module, lets a large language model (LLM) query and analyze data stored in Pandas DataFrames. Users describe a task in natural language, and the agent writes and runs the pandas code needed to carry it out. It can handle operations such as filtering, aggregation, and visualization, which makes it useful for data scientists, analysts, and developers who want to speed up exploratory work.

This functionality carries significant security risk. The agent relies on a Python REPL (Read-Eval-Print Loop) tool that executes code generated by the model, so a misleading prompt or malicious data can lead to arbitrary code execution unless the agent runs in a properly sandboxed environment. Users must explicitly opt in by setting the `allow_dangerous_code` parameter to `True`, acknowledging the risk and confirming that appropriate safeguards are in place.
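One way to make the opt-in deliberate is to gate it behind an explicit setting rather than hard-coding `True`. The sketch below is only an illustration: the environment variable name `ALLOW_AGENT_CODE_EXECUTION` and the helper `dangerous_code_allowed` are hypothetical and not part of LangChain.

```python
import os
from typing import Mapping

# Hypothetical variable name chosen for this sketch; not a LangChain setting.
OPT_IN_VARIABLE = "ALLOW_AGENT_CODE_EXECUTION"


def dangerous_code_allowed(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if the operator explicitly opted in via the environment."""
    value = env.get(OPT_IN_VARIABLE, "")
    return value.strip().lower() in {"1", "true", "yes"}


print("Code execution enabled:", dangerous_code_allowed())
```

The result could then be passed as `allow_dangerous_code=dangerous_code_allowed()`, so that the decision is made once per environment rather than in every cell.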

---

## Preparation

### Installing Required Libraries
This section installs the Python libraries needed to work with LangChain, OpenAI and Anthropic chat models, and related utilities:
- `langchain-openai`: Provides integration with OpenAI's models and APIs.
- `langchain-anthropic`: Enables integration with Anthropic's models and APIs.
- `langchain_community`: Contains community-contributed modules and tools for LangChain.
- `langchain_experimental`: Includes experimental features and utilities for LangChain.

In [None]:
!pip install -qU langchain-openai
!pip install -qU langchain-anthropic
!pip install -qU langchain_community
!pip install -qU langchain_experimental

### Initializing OpenAI and Anthropic Chat Models
This section retrieves an API key with Kaggle's `UserSecretsClient`, initializes a chat model, and loads the Titanic dataset. `ChatOpenAI` is the active model; the equivalent `ChatAnthropic` line is left commented out so the two can be swapped.

**Key steps:**
1. **Fetch API Keys**: The API key is read from Kaggle secrets with `UserSecretsClient`, so it never appears in the notebook source.
2. **Initialize a Chat Model**:
   - The `ChatOpenAI` class is initialized with the `gpt-4o-mini` model and the fetched OpenAI API key.
   - Alternatively, the `ChatAnthropic` class can be initialized with the `claude-3-5-sonnet-latest` model and the fetched Anthropic API key.
3. **Load the Data**: The Titanic CSV file is read into the DataFrame `df`.

In [None]:
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from kaggle_secrets import UserSecretsClient

# Client for reading secrets stored with the Kaggle notebook
user_secrets = UserSecretsClient()

# Initialize the LLM (swap the commented line in to use Anthropic instead)
model = ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=user_secrets.get_secret("my-openai-api-key"))
# model = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0, api_key=user_secrets.get_secret("my-anthropic-api-key"))

# Load the Titanic dataset
df = pd.read_csv("/kaggle/input/titanic-dataset/titanic.csv")
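The `kaggle_secrets` module is only available inside Kaggle notebooks. As a hedged sketch of running the same setup elsewhere, the helper below checks an environment variable first and falls back to Kaggle secrets. The function name `get_api_key` is hypothetical, and the environment variable names are whatever you choose to pass in.

```python
import os


def get_api_key(env_var: str, kaggle_secret_name: str) -> str:
    """Return an API key from the environment, falling back to Kaggle secrets."""
    key = os.environ.get(env_var)
    if key:
        return key
    try:
        # Only importable inside a Kaggle notebook.
        from kaggle_secrets import UserSecretsClient
    except ImportError as exc:
        raise RuntimeError(
            f"Set {env_var} or run inside Kaggle with the "
            f"'{kaggle_secret_name}' secret configured."
        ) from exc
    return UserSecretsClient().get_secret(kaggle_secret_name)
```

A call such as `get_api_key("OPENAI_API_KEY", "my-openai-api-key")` could then replace the direct `user_secrets.get_secret(...)` lookup.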

The following cell defines a utility function, `pretty_print_response`, that prints an agent's response in a clean and readable format.

In [None]:
def pretty_print_response(question, response):
    """
    Formats and prints the agent's response in a structured and readable manner.

    This function takes a question and the agent's response, extracts the 'output' key
    from the response, and prints it with clear separators for improved readability.
    It is useful for debugging, logging, or presenting results in a user-friendly format.

    Args:
        question (str): The question or query posed to the agent.
        response (dict): The agent's response, expected to contain an 'output' key
                         with the result of the query.

    Returns:
        None: This function prints the formatted output directly to the console.
    """
    print(f"Question: {question}")
    print("\n" + "=" * 80 + "\n")
    print(response["output"])  # Extract the 'output' key for pretty printing
    print("\n" + "=" * 80 + "\n")
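If a response ever arrives without an `output` key, indexing it directly raises a `KeyError`. A slightly more defensive variant, sketched below under the assumption that the response behaves like a dictionary, builds the report as a string so it can be printed or logged. The name `format_response` is hypothetical.

```python
from typing import Any, Mapping


def format_response(question: str, response: Mapping[str, Any], width: int = 80) -> str:
    """Build the report as a string; tolerate a response without an 'output' key."""
    separator = "=" * width
    answer = response.get("output", "<no 'output' key in response>")
    return f"Question: {question}\n\n{separator}\n\n{answer}\n\n{separator}\n"


print(format_response("How many rows are there?", {"output": "891"}, width=20))
```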

---

## Examples

Each example below demonstrates one or more key parameters of `create_pandas_dataframe_agent`. All of them reuse the `model` and `df` objects created in the Preparation section.

### **Example 1: Basic Usage with Default Parameters**
This example creates a Pandas agent with mostly default parameters. Only `agent_type`, `allow_dangerous_code`, and `verbose` are set; the agent receives a DataFrame and a language model (`model`) and can answer questions about the dataset, such as column names, data types, and basic statistics.

In [None]:
import pandas as pd
from langchain_experimental.agents import create_pandas_dataframe_agent

# Load the Titanic dataset
df = pd.read_csv("/kaggle/input/titanic-dataset/titanic.csv")

# Create the agent; parameters not listed keep their defaults
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",  # Use the modern "tool-calling" agent type
    allow_dangerous_code=True,  # Allow execution of Python code (use with caution)
    verbose=True,               # Enable verbose logging for debugging
)

# Ask the agent a question
response = agent.invoke("What are the columns in the dataset and their data types?")
print(response["output"])

### **Example 2: Custom Prompt Prefix and Suffix**
This example uses the `prefix` and `suffix` parameters to customize the agent's prompt. The `prefix` is text placed at the start of the prompt and the `suffix` is text placed at the end. Here the prefix sets the agent's role and the suffix asks for a structured final answer.

In [None]:
# Create the agent with a custom prefix and suffix
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    prefix="You are a data analyst. Analyze the Titanic dataset and provide concise answers.",
    suffix="Provide the final answer in a clear and structured format.",
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent a question
response = agent.invoke("What is the average age of passengers?")
print(response["output"])

### **Example 3: Including DataFrame Head in the Prompt**
This example includes the first few rows of the DataFrame in the prompt through the `include_df_in_prompt` and `number_of_head_rows` parameters, which helps the agent understand the structure of the data. Showing more rows gives the model more context at the cost of a longer prompt.

In [None]:
# Create the agent with the first 5 rows included in the prompt
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    include_df_in_prompt=True,  # Include the DataFrame head in the prompt
    number_of_head_rows=5,      # Number of rows to include
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent a question
response = agent.invoke("What are the unique values in the 'Pclass' column?")
print(response["output"])

### **Example 4: Adding Extra Tools**
This example shows how to add extra tools to the agent using the `extra_tools` parameter. These tools can extend the agent's capabilities, such as performing custom calculations or interacting with external APIs.

In [None]:
from langchain_core.tools import Tool

# Define a custom tool (placeholder logic that just echoes its input)
def custom_calculation_tool(text: str) -> str:
    return f"Custom calculation result for: {text}"

# Add the custom tool to the agent
extra_tools = [
    Tool(
        name="custom_calculation",
        func=custom_calculation_tool,
        description="A tool for performing custom calculations."
    )
]

# Create the agent with extra tools
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    extra_tools=extra_tools,  # Add custom tools
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent to use the custom tool
response = agent.invoke("Use the custom_calculation tool to process 'example input'.")
print(response["output"])
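The tool above only echoes its input. As a sketch of what a tool function with real logic might look like, the example below parses a single string argument and returns a discounted fare. The function name `discounted_fare_tool` and the `fare,discount_percent` input format are assumptions made for illustration.

```python
def discounted_fare_tool(text: str) -> str:
    """Parse 'fare,discount_percent' and return the discounted fare as text."""
    try:
        fare_text, discount_text = text.split(",")
        fare = float(fare_text)
        discount = float(discount_text)
    except ValueError:
        # Raised for a wrong number of fields or for non-numeric fields.
        return "Error: expected input like '71.28,5' (fare,discount_percent)."
    if not 0 <= discount <= 100:
        return "Error: discount_percent must be between 0 and 100."
    return f"{fare * (1 - discount / 100):.2f}"


print(discounted_fare_tool("100,10"))
print(discounted_fare_tool("not a number"))
```

Returning an error message instead of raising is a design choice: the agent can read the message and retry with corrected input. The function could be wrapped in `Tool(...)` in the same way as `custom_calculation_tool`, with a `description` that spells out the expected input format.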

### **Example 5: Limiting Execution Time and Iterations**
This example demonstrates how to limit the agent's execution time and the number of iterations using the `max_execution_time` and `max_iterations` parameters. This is useful for controlling resource usage.

In [None]:
# Create the agent with execution limits
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    max_execution_time=10,  # Limit execution time to 10 seconds
    max_iterations=5,       # Limit the number of iterations to 5
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent a question
response = agent.invoke("What is the survival rate of passengers?")
print(response["output"])
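To build intuition for how the two limits interact, here is a toy loop that stops at whichever limit is reached first. It is a simplified model written for this notebook, not LangChain's implementation, and `run_with_limits` is a hypothetical name.

```python
import time
from typing import Callable, Optional, Tuple


def run_with_limits(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 5,
    max_execution_time: float = 10.0,
) -> Tuple[str, int]:
    """Call `step` until it returns an answer or a limit is reached.

    Returns (result, iterations_used).
    """
    start = time.monotonic()
    iterations = 0
    while iterations < max_iterations:
        # Check the wall-clock budget before starting another step.
        if time.monotonic() - start >= max_execution_time:
            return "stopped: time limit", iterations
        answer = step(iterations)
        iterations += 1
        if answer is not None:
            return answer, iterations
    return "stopped: iteration limit", iterations


# A step that never produces an answer hits the iteration limit.
print(run_with_limits(lambda i: None, max_iterations=5))
# A step that answers on its third call finishes early.
print(run_with_limits(lambda i: "done" if i == 2 else None))
```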

### **Example 6: Handling Missing Data**
This example demonstrates how to use the agent to identify and handle missing data in the DataFrame.

In [None]:
# Create the agent
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent to identify missing data
response = agent.invoke("Which columns have missing data, and how should we handle them?")
print(response["output"])
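Because the agent's answer comes from model-generated code, it is worth cross-checking against plain pandas. The sketch below uses a tiny made-up sample rather than the real Titanic file, and median imputation is shown as one common option, not necessarily what the agent will recommend.

```python
import pandas as pd

# Tiny made-up sample; not rows from the real Titanic file.
sample = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, "C123"],
        "Fare": [7.25, 71.28, 7.93, 53.10],
    }
)

# Count missing values per column and keep only columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)

# One common treatment for a numeric column: fill gaps with the median.
filled_age = sample["Age"].fillna(sample["Age"].median())
print(filled_age.tolist())
```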

### **Example 7: Customizing Early Stopping**
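This example sets the `early_stopping_method` parameter, which controls what the agent returns when it reaches `max_iterations` or `max_execution_time` without producing a final answer. With `"force"`, which is the default, the agent returns a fixed message saying that it stopped because of the limit. The `AgentExecutor` documentation also describes `"generate"`, which asks the model for one final answer based on the steps so far, though not every agent type supports it.

In [None]:
# Create the agent with an explicit early stopping method
agent = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    early_stopping_method="force",  # On hitting a limit, return a fixed "stopped" message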
    allow_dangerous_code=True,
    verbose=True,
)

# Ask the agent a question
response = agent.invoke("What is the average fare for each passenger class?")
print(response["output"])

---

## Best Practices

### Example 1: Using `create_pandas_dataframe_agent` for Titanic Dataset Analysis
This example demonstrates how to use the `create_pandas_dataframe_agent` to analyze the Titanic dataset. The agent is capable of performing various data analysis tasks, such as data exploration, filtering, aggregation, and visualization, using natural language queries.

#### **Step 1: Load the Dataset and Initialize the Agent**
This code block loads the Titanic dataset and creates an agent with `create_pandas_dataframe_agent`. The agent uses the language model `model`, and `allow_dangerous_code=True` permits it to execute the Python code it generates. The `verbose` flag enables detailed logging of the agent's operations.

In [None]:
# Load the Titanic dataset
df = pd.read_csv("/kaggle/input/titanic-dataset/titanic.csv")

# Create the agent
agent_executor = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    allow_dangerous_code=True,
    verbose=True
)

#### **Step 2: Perform Data Analysis with the Agent**
This code block demonstrates how to use the agent to answer various questions about the Titanic dataset. The questions range from basic exploration to complex analysis, and the agent dynamically processes the data to provide insights. The `pretty_print_response` function is used to format and print the responses in a clean and readable way.

In [None]:
# Question 1: Basic Data Exploration
response_1 = agent_executor.invoke("What are the columns in the dataset and their data types?")
pretty_print_response("1. What are the columns in the dataset and their data types?", response_1)

In [None]:
# Question 2: Filtering Data
response_2 = agent_executor.invoke("Find all passengers who survived.")
pretty_print_response("2. Find all passengers who survived.", response_2)

In [None]:
# Question 3: Aggregation
response_3 = agent_executor.invoke("What is the average age of passengers?")
pretty_print_response("3. What is the average age of passengers?", response_3)

In [None]:
# Question 4: Grouping and Aggregation
response_4 = agent_executor.invoke("What is the average fare for each passenger class?")
pretty_print_response("4. What is the average fare for each passenger class?", response_4)

In [None]:
# Question 5: Visualization
response_5 = agent_executor.invoke("Create a bar chart showing the number of passengers in each class.")
pretty_print_response("5. Create a bar chart showing the number of passengers in each class.", response_5)

In [None]:
# Question 6: Conditional Analysis
response_6 = agent_executor.invoke("What is the survival rate of female passengers?")
pretty_print_response("6. What is the survival rate of female passengers?", response_6)

In [None]:
# Question 7: Handling Missing Data
response_7 = agent_executor.invoke("Which columns have missing data, and how should we handle them?")
pretty_print_response("7. Which columns have missing data, and how should we handle them?", response_7)

In [None]:
# Question 8: Complex Query
response_8 = agent_executor.invoke("Find the names of all passengers who survived, were in first class, and are older than 30.")
pretty_print_response("8. Find the names of all passengers who survived, were in first class, and are older than 30.", response_8)
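The cells above repeat the same invoke-and-print pattern. A small helper, sketched below, can run a list of questions in one pass and keep going if one of them fails. `ask_all` and `EchoAgent` are hypothetical names; `EchoAgent` is a stand-in so the sketch runs without calling an LLM, and it assumes only that the agent has an `invoke` method returning a dictionary with an `output` key.

```python
from typing import Any, Dict, Iterable


def ask_all(agent: Any, questions: Iterable[str]) -> Dict[str, str]:
    """Send each question to `agent.invoke` and collect the 'output' text."""
    answers = {}
    for question in questions:
        try:
            answers[question] = agent.invoke(question)["output"]
        except Exception as exc:  # keep going if one question fails
            answers[question] = f"ERROR: {exc}"
    return answers


class EchoAgent:
    """Stand-in used only to demonstrate the helper without calling an LLM."""

    def invoke(self, question: str) -> Dict[str, str]:
        if not question:
            raise ValueError("empty question")
        return {"input": question, "output": f"echo: {question}"}


print(ask_all(EchoAgent(), ["How many rows are there?", ""]))
```

With a real agent, `ask_all(agent_executor, questions)` would return a dictionary that can be passed item by item to `pretty_print_response` or saved for later review.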

#### **Step 3: Custom Prompt for Advanced Analysis**
This code block demonstrates how to use a custom prompt to guide the agent's behavior. The agent is initialized with a prefix that instructs it to act as a data analyst and provide concise answers. This is useful for tailoring the agent's responses to specific tasks or audiences.

In [None]:
# Question 9: Custom Prompt
agent_executor_custom = create_pandas_dataframe_agent(
    model,
    df,
    agent_type="tool-calling",
    prefix="You are a data analyst. Analyze the Titanic dataset and provide concise answers.",
    allow_dangerous_code=True,
    verbose=True
)
response_9 = agent_executor_custom.invoke("What is the distribution of passengers by gender?")
pretty_print_response("9. What is the distribution of passengers by gender?", response_9)

### Example 2: Using `create_pandas_dataframe_agent` for Multi-DataFrame Analysis

This example demonstrates how to use the `create_pandas_dataframe_agent` to analyze multiple DataFrames without merging them in advance. The agent is capable of performing operations like filtering, aggregation, and conditional analysis across multiple DataFrames dynamically.

#### **Step 1: Create DataFrames**
This code block creates three DataFrames:
1. **`df_titanic`**: Contains passenger details like `PassengerId`, `Name`, `Pclass`, and `Fare`.
2. **`df_fare_category`**: Maps `Pclass` to `FareCategory` (e.g., First Class, Second Class, Economy).
3. **`df_discount`**: Contains `PassengerId` and `Discount` percentages.

In [None]:
import pandas as pd

# Create the first DataFrame: Titanic passenger data
data_titanic = {
    "PassengerId": [1, 2, 3, 4, 5],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath", "Allen, Mr. William Henry"],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
}

df_titanic = pd.DataFrame(data_titanic)

# Create the second DataFrame: Fare category data
data_fare_category = {
    "Pclass": [1, 2, 3],
    "FareCategory": ["First Class", "Second Class", "Economy"],
}

df_fare_category = pd.DataFrame(data_fare_category)

# Create the third DataFrame: Discount data
data_discount = {
    "PassengerId": [1, 2, 3, 4, 5],
    "Discount": [0.0, 5.0, 0.0, 10.0, 0.0],  # Discounts in percentage
}

df_discount = pd.DataFrame(data_discount)

#### **Step 2: Initialize the Agent**
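This code block passes a list of DataFrames to `create_pandas_dataframe_agent`. The agent can then join, filter, and analyze data across them. Note that the agent does not see the notebook's variable names: the library appears to expose the DataFrames to the Python tool as `df1`, `df2`, and `df3` in list order, so queries that mention `df_titanic` rely on the model inferring the right table from its columns.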

In [None]:
from langchain_experimental.agents import create_pandas_dataframe_agent

# Create the agent with multiple DataFrames
agent = create_pandas_dataframe_agent(
    model,
    [df_titanic, df_fare_category, df_discount],  # Pass multiple DataFrames as a list
    agent_type="tool-calling",
    allow_dangerous_code=True,
    verbose=True
)

#### **Step 3: Perform Data Analysis**
This code block demonstrates how to use the agent to answer various questions about the data. The questions range from basic exploration to complex analysis, and the agent dynamically processes the DataFrames to provide insights.

In [None]:
# Question 1: Basic Data Exploration
response_1 = agent.invoke("What are the columns in each DataFrame and their data types?")
pretty_print_response("1. What are the columns in each DataFrame and their data types?", response_1)

In [None]:
# Question 2: Filtering Data
response_2 = agent.invoke("Find all passengers in First Class by joining df_titanic and df_fare_category.")
pretty_print_response("2. Find all passengers in First Class by joining df_titanic and df_fare_category.", response_2)

In [None]:
# Question 3: Aggregation
response_3 = agent.invoke("What is the average fare for each fare category? Use df_titanic and df_fare_category.")
pretty_print_response("3. What is the average fare for each fare category? Use df_titanic and df_fare_category.", response_3)

In [None]:
# Question 4: Conditional Analysis
response_4 = agent.invoke("What is the total fare for passengers who received a discount? Use df_titanic and df_discount.")
pretty_print_response("4. What is the total fare for passengers who received a discount? Use df_titanic and df_discount.", response_4)

In [None]:
# Question 5: Complex Query
response_5 = agent.invoke("Find the names of passengers who paid more than $50 in total fare. Use all DataFrames.")
pretty_print_response("5. Find the names of passengers who paid more than $50 in total fare. Use all DataFrames.", response_5)

In [None]:
# Question 6: Handling Missing Data
response_6 = agent.invoke("Are there any missing values in df_titanic?")
pretty_print_response("6. Are there any missing values in df_titanic?", response_6)

In [None]:
# Question 7: Role instruction given inside the query (no custom prefix is set)
response_7 = agent.invoke("You are a data analyst. Analyze the DataFrames and provide insights about fare distribution.")
pretty_print_response("7. You are a data analyst. Analyze the DataFrames and provide insights about fare distribution.", response_7)
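The agent's multi-DataFrame answers can be cross-checked with explicit merges. The sketch below rebuilds the three small DataFrames (omitting the `Name` column for brevity) and computes the answers to Questions 3 and 4 directly. It shows one reasonable pandas formulation, not necessarily the code the agent generates.

```python
import pandas as pd

df_titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Pclass": [3, 1, 3, 1, 3],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    }
)
df_fare_category = pd.DataFrame(
    {"Pclass": [1, 2, 3], "FareCategory": ["First Class", "Second Class", "Economy"]}
)
df_discount = pd.DataFrame(
    {"PassengerId": [1, 2, 3, 4, 5], "Discount": [0.0, 5.0, 0.0, 10.0, 0.0]}
)

# Question 3: average fare per fare category.
with_category = df_titanic.merge(df_fare_category, on="Pclass", how="left")
avg_fare = with_category.groupby("FareCategory")["Fare"].mean()
print(avg_fare)

# Question 4: total fare of passengers who received a discount.
with_discount = df_titanic.merge(df_discount, on="PassengerId", how="left")
total_discounted = with_discount.loc[with_discount["Discount"] > 0, "Fare"].sum()
print(round(total_discounted, 4))
```

No passenger in the sample travels in second class, so "Second Class" does not appear in the averages.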

---

## Conclusion
The `create_pandas_dataframe_agent` function connects natural language queries to Pandas data analysis, letting users explore DataFrames without writing the pandas code themselves. This makes routine analysis tasks accessible to a broader audience. The same mechanism that makes the agent useful, executing model-generated Python code, is also its main risk: run it only in a safe, sandboxed environment and with a clear understanding of what arbitrary code execution implies. Used responsibly, the agent can speed up exploratory analysis and support data-driven decision-making.