# Building an AI Data Analysis Assistant with LlamaIndex and OpenAI

Data analysis is a crucial skill in today's data-driven world, but not everyone has the technical expertise to write Python code for analyzing datasets. What if we could create an AI assistant that can write and execute code based on natural language requests? In this article, we'll explore how to build such an assistant using [LlamaIndex](https://www.llamaindex.ai/)'s code interpreter tool and OpenAI's powerful language models.

## Setting Up the Environment

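Let's start by installing the necessary packages. We'll need LlamaIndex's OpenAI agent integration, its code interpreter tool, the OpenAI client, and `python-dotenv` for loading the API key from a `.env` file. On PyPI these are published as `llama-index-agent-openai`, `llama-index-tools-code-interpreter`, `openai`, and `python-dotenv`.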

After installing the required packages, we'll import them and set up our OpenAI API key.

In [1]:
import openai
import os
from dotenv import load_dotenv
load_dotenv(override=True)
openai.api_key = os.getenv("OPENAI_API_KEY")
from llama_index.agent.openai import OpenAIAgent
from llama_index.tools.code_interpreter.base import CodeInterpreterToolSpec
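
If the key is missing, `os.getenv` quietly returns `None`, and the problem tends to surface only later, when the agent first tries to reach the API. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of LlamaIndex or the OpenAI client):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Possible usage, after load_dotenv() has run:
# openai.api_key = require_env("OPENAI_API_KEY")
```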

## Creating the Code Interpreter Agent

Now, let's create our code interpreter agent. The `CodeInterpreterToolSpec` is a tool specification in LlamaIndex that allows an agent to write and execute Python code. We'll initialize this spec and create an OpenAI agent with it.

In [2]:
# Import and initialize our tool spec
code_spec = CodeInterpreterToolSpec()

tools = code_spec.to_tool_list()
# Create the Agent with our tools
agent = OpenAIAgent.from_tools(tools, verbose=True)

With `verbose=True`, the agent prints each tool call it makes, along with the tool's output, so we can follow what it is doing as it works on our problems.

## Testing Our Agent

Let's test our agent by asking it to help us write some Python code to pass to the code interpreter tool.

In [3]:
print(
    agent.chat(
        "Can you help me write some python code to pass to the code_interpreter tool"
    )
)

Added user message to memory: Can you help me write some python code to pass to the code_interpreter tool
Of course! What specific task or problem would you like the Python code to solve or address?


This initial interaction helps "prime" the agent, making it aware that we're interested in using it for code interpretation.

## Data Analysis with Natural Language

Now, let's put our agent to work on a real data analysis task. We have a dataset called "world_happiness_2016.csv" in a data directory. First, let's ask our agent to tell us what columns the dataset contains.

In [4]:
print(
    agent.chat(
        """There is a world_happiness_2016.csv file in the `data` directory (relative path).
                 Can you write and execute code to tell me columns does it have?"""
    )
)

Added user message to memory: There is a world_happiness_2016.csv file in the `data` directory (relative path).
                 Can you write and execute code to tell me columns does it have?
=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import pandas as pd\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2016.csv')\n\n# Get the columns of the dataset\ncolumns = data.columns\n\n# Print the columns\nprint(columns)"}
Got output: StdOut:
b"Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',\n       'Lower Confidence Interval', 'Upper Confidence Interval',\n       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',\n       'Freedom', 'Trust (Government Corruption)', 'Generosity',\n       'Dystopia Residual'],\n      dtype='object')\n"
StdErr:
b''

The `world_happiness_2016.csv` file has the following columns:
- Country
- Region
- Happiness Rank
- Happiness Score
- Lower Confidence Interval
- Upper Confidence Inter

Here the agent wrote and executed Python code that reads the CSV file with pandas and prints its column names, then summarized the result in plain language. This is a simple example of how the agent can help with exploratory data analysis.
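
The log shows the generated code as a JSON-escaped string, which is hard to read. A small sketch for turning such an argument string back into readable source, using only the standard library. The helper name `extract_code` is hypothetical, and `logged_args` is a shortened copy of the arguments from the call above:

```python
import json

# Shortened copy of the tool-call arguments from the log (first statements only).
logged_args = r'''{"code":"import pandas as pd\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2016.csv')"}'''


def extract_code(args_json: str) -> str:
    """Decode the JSON arguments of a tool call and return the `code` field."""
    return json.loads(args_json)["code"]


code = extract_code(logged_args)
print(code)  # the generated code, one statement per line
```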

## Analyzing the Top 10 Happiest Countries

Next, let's ask our agent to analyze the data and tell us the top 10 happiest countries.


In [5]:
print(agent.chat("What are the top 10 happiest countries"))

Added user message to memory: What are the top 10 happiest countries
=== Calling Function ===
Calling function: code_interpreter with args: {"code":"# Sort the data by 'Happiness Score' in descending order and get the top 10 happiest countries\ntop_10_happiest_countries = data.sort_values(by='Happiness Score', ascending=False).head(10)\n\n# Print the top 10 happiest countries\nprint(top_10_happiest_countries[['Country', 'Happiness Score']])"}
Got output: StdOut:
b''
StdErr:
b'Traceback (most recent call last):\n  File "<string>", line 2, in <module>\nNameError: name \'data\' is not defined\n'

=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import pandas as pd\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2016.csv')\n\n# Get the top 10 happiest countries\ntop_10_happiest_countries = data.sort_values(by='Happiness Score', ascending=False).head(10)\n\n# Print the top 10 happiest countries\nprint(top_10_happiest_countries[['Country', 'Happ

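The agent writes code to sort the dataset by happiness score and show us the top 10 countries. Notice that its first attempt fails with `NameError: name 'data' is not defined`: the `data` variable from the previous call is no longer available, so the agent retries with code that reloads the CSV first. No need for us to remember how to filter and sort pandas DataFrames!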
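
One plausible explanation for the lost variable is that each snippet runs in a fresh Python process. The sketch below reproduces that behavior with the standard library. It is a simplified stand-in, not LlamaIndex's actual implementation, and `run_snippet` is a hypothetical helper:

```python
import subprocess
import sys


def run_snippet(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a brand-new Python process and capture its output as bytes."""
    return subprocess.run([sys.executable, "-c", code], capture_output=True)


first = run_snippet("data = [1, 2, 3]\nprint(len(data))")
second = run_snippet("print(len(data))")  # `data` was defined in a different process

print(first.stdout)   # bytes, like the b'...' values in the log
print(second.stderr)  # traceback mentioning NameError
```

Under this assumption, every request has to carry its own imports and data loading, which matches the self-contained code the agent falls back to.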

## Visualizing the Data

Data is often easier to understand through visualizations. Let's ask our agent to create a graph of the top 10 happiest countries.

In [6]:
print(agent.chat("Can you make a graph of the top 10 happiest countries"))

Added user message to memory: Can you make a graph of the top 10 happiest countries
=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import matplotlib.pyplot as plt\n\n# Define the data for the top 10 happiest countries\ncountries = top_10_happiest_countries['Country']\nhappiness_scores = top_10_happiest_countries['Happiness Score']\n\n# Create a bar graph\nplt.figure(figsize=(12, 6))\nplt.bar(countries, happiness_scores, color='skyblue')\nplt.xlabel('Country')\nplt.ylabel('Happiness Score')\nplt.title('Top 10 Happiest Countries')\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()"}
Got output: StdOut:
b''
StdErr:
b'Traceback (most recent call last):\n  File "<string>", line 4, in <module>\nNameError: name \'top_10_happiest_countries\' is not defined\n'

=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import pandas as pd\nimport matplotlib.pyplot as plt\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2


The agent generates code for a bar chart, again after a failed first attempt. The call to `plt.show()` produces nothing visible in the notebook, though, most likely because the tool runs the code outside the notebook's own kernel. Let's ask the agent to save the plot to a file instead.

In [7]:
print(
    agent.chat(
        "I cant see the plot - can you save it locally with file name `output.png`?"
    )
)

Added user message to memory: I cant see the plot - can you save it locally with file name `output.png`?
=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import pandas as pd\nimport matplotlib.pyplot as plt\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2016.csv')\n\n# Get the top 10 happiest countries\ntop_10_happiest_countries = data.sort_values(by='Happiness Score', ascending=False).head(10)\n\n# Define the data for the top 10 happiest countries\ncountries = top_10_happiest_countries['Country']\nhappiness_scores = top_10_happiest_countries['Happiness Score']\n\n# Create a bar graph\nplt.figure(figsize=(12, 6))\nplt.bar(countries, happiness_scores, color='skyblue')\nplt.xlabel('Country')\nplt.ylabel('Happiness Score')\nplt.title('Top 10 Happiest Countries')\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.savefig('output.png')"}
Got output: StdOut:
b''
StdErr:
b''

The plot has been saved locally as `output.png`. You can download it us


The agent modifies its code to save the plot as an image file that we can view later.

## Comparing the Happiest and Least Happy Countries

For a more comprehensive analysis, let's also look at the 10 least happy countries.

In [8]:
print(agent.chat("can you also plot the 10 lowest"))

Added user message to memory: can you also plot the 10 lowest
=== Calling Function ===
Calling function: code_interpreter with args: {"code":"import pandas as pd\nimport matplotlib.pyplot as plt\n\n# Load the dataset\ndata = pd.read_csv('data/world_happiness_2016.csv')\n\n# Get the 10 lowest happiest countries\ntop_10_lowest_countries = data.sort_values(by='Happiness Score', ascending=True).head(10)\n\n# Define the data for the 10 lowest happiest countries\ncountries_lowest = top_10_lowest_countries['Country']\nhappiness_scores_lowest = top_10_lowest_countries['Happiness Score']\n\n# Create a bar graph for the 10 lowest happiest countries\nplt.figure(figsize=(12, 6))\nplt.bar(countries_lowest, happiness_scores_lowest, color='salmon')\nplt.xlabel('Country')\nplt.ylabel('Happiness Score')\nplt.title('Top 10 Lowest Happiest Countries')\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.savefig('output_lowest.png')"}
Got output: StdOut:
b''
StdErr:
b''

The plot for the 10 lowest happiest c

Finally, let's ask the agent to create a single visualization that shows both the happiest and least happy countries for comparison.

In [None]:
agent.chat("can you do it in one plot")

Added user message to memory: can you do it in one plot
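
To prepare the same comparison by hand, the selection step might look like the sketch below, assuming pandas is installed. The `extremes` helper and the sample rows are hypothetical placeholders. Only the column names `Country` and `Happiness Score` come from the dataset.

```python
import pandas as pd


def extremes(df: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    """Return the n highest and n lowest rows of `column`, labelled by group."""
    top = df.nlargest(n, column).assign(Group="highest")
    bottom = df.nsmallest(n, column).assign(Group="lowest")
    return pd.concat([top, bottom], ignore_index=True)


# Hypothetical placeholder rows, not values from the real dataset.
sample = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D", "E", "F"],
        "Happiness Score": [7.5, 3.1, 6.8, 2.9, 5.0, 4.2],
    }
)

combined = extremes(sample, "Happiness Score", 2)
print(combined)
```

With matplotlib available, calling `combined.plot.bar(x="Country", y="Happiness Score")` followed by `plt.savefig(...)` would turn this table into a single chart.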


## Benefits of Using a Code Interpreter Agent

Using a code interpreter agent like this offers several advantages:

1. **Accessibility**: People without programming skills can perform data analysis.
2. **Efficiency**: Even for experienced programmers, describing what you want in natural language can be faster than writing code.
3. **Learning**: The generated code can serve as a learning resource for those interested in improving their coding skills.
4. **Flexibility**: The agent can adapt to different datasets and analysis requirements without needing to rewrite code.

## Limitations and Considerations

While this approach is powerful, there are some limitations to keep in mind:

1. **Security**: The code interpreter executes whatever Python code the model generates, with access to your local files, which is risky unless it runs in a properly sandboxed environment.
2. **Complexity**: For very complex analyses, the agent might not generate the most efficient or accurate code.
3. **Interpretability**: The generated code might be harder to interpret or debug than code you would write yourself.
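
As a partial mitigation for the security point, generated code can at least be run with a time limit and in a throwaway working directory. The sketch below assumes a hypothetical helper `run_with_limits`. It is not a real sandbox, since the code can still reach the network and the rest of the filesystem, so container-level or VM-level isolation is still worth considering for untrusted input.

```python
import subprocess
import sys
import tempfile
from typing import Optional, Tuple


def run_with_limits(code: str, timeout_s: float = 5.0) -> Tuple[Optional[int], str, str]:
    """Run `code` in a separate interpreter with a timeout and a temporary working directory.

    Returns (returncode, stdout, stderr); returncode is None if the time limit was hit.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                cwd=workdir,          # files written with relative paths land here and are deleted
                capture_output=True,
                text=True,
                timeout=timeout_s,    # the child process is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return None, "", f"timed out after {timeout_s} s"
    return result.returncode, result.stdout, result.stderr


print(run_with_limits("print(2 + 2)"))
print(run_with_limits("while True: pass", timeout_s=1.0))
```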

## Conclusion

In this article, we've seen how to combine LlamaIndex's code interpreter tool with OpenAI's language models to create an AI assistant that can perform data analysis tasks. This approach democratizes data analysis by allowing anyone to ask questions about data in natural language and get insightful answers.

This is just the beginning of what's possible with code interpreter agents. You could extend this approach to handle more complex analyses, integrate with other data sources, or create specialized agents for specific domains like financial analysis or scientific research.

The future of data analysis might not be about writing code yourself, but rather telling an AI assistant what insights you're looking for and letting it handle the technical details.

---

By implementing this AI data analysis assistant, you're not just building a cool tool - you're part of a movement that's making data analysis more accessible to everyone. Whether you're a data scientist looking to streamline your workflow or a non-technical user who needs insights from data, code interpreter agents like this one can be valuable allies in your data journey.