# Pandas example

This example shows how to connect an LLM to a Pandas DataFrame so that it can answer questions about the data and generate Plotly Express plots from it.

This uses the Titanic survival dataset from HuggingFace (see https://huggingface.co/datasets/julien-c/titanic-survival). To download it, use the HuggingFace `datasets` library. Alternatively, substitute any local dataset of your own for the DataFrame.

In [None]:
from datasets import load_dataset

df = load_dataset("julien-c/titanic-survival")["train"].to_pandas()
df

In [None]:
from langchain import OpenAI, PandasDataFrameChain

In [None]:
llm = OpenAI(temperature=0)

In [None]:
df_chain = PandasDataFrameChain.from_llm(llm=llm, dataframe=df, verbose=True)

## Ask direct questions on the dataset
These are questions where the output is expected to be a single value (e.g. float, string, etc.).

In [None]:
output = df_chain("How many people survived?")
output

The chain returns both the generated code in the `code` field and the Python output from the code execution in the `result` field.
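To make the shape of that output concrete, here is a minimal sketch that mimics it without calling an LLM. The helper `run_snippet`, the toy DataFrame, and the `Survived` column name are assumptions made for illustration; this is not how the chain itself is implemented.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the column names are assumptions.
toy_df = pd.DataFrame({"Survived": [1, 0, 1, 1], "Age": [22.0, 38.0, 26.0, 35.0]})


def run_snippet(code: str, df: pd.DataFrame) -> dict:
    """Hypothetical helper: evaluate a pandas expression against `df` and
    return a dict with the same two fields the chain output exposes."""
    result = eval(code, {"df": df, "pd": pd})  # only safe for trusted code
    return {"code": code, "result": result}


output = run_snippet("df['Survived'].sum()", toy_df)
print(output["code"])    # the expression that was run
print(output["result"])  # the value it produced
```

Reading `output["code"]` alongside `output["result"]` is a quick way to check that the generated expression actually answers the question that was asked.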

In [None]:
output = df_chain("How many people under 30 died?")
output

In [None]:
output = df_chain("What was the average fare in 1st class?")
output

In [None]:
output = df_chain("What's the most common male last name?")
output

The prompt discourages access to non-existent columns or variables, but if the generated code references one anyway, a `NameError` or `KeyError` will be raised.

In [None]:
output = df_chain("Get the sum of the passenger height column and divide by z")
output["result"]
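If you would rather report these failures than let them stop the notebook, you can catch them. The sketch below reproduces both error types with plain pandas; `safe_eval`, the toy DataFrame, and the `Height` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy data with no "Height" column and no variable named `z` in scope.
toy_df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})


def safe_eval(code: str, df: pd.DataFrame):
    """Hypothetical helper: evaluate an expression and turn the two
    expected failure modes into a readable message."""
    try:
        return eval(code, {"df": df})  # only safe for trusted code
    except (NameError, KeyError) as exc:
        return f"{type(exc).__name__}: {exc}"


missing_column = safe_eval("df['Height'].sum()", toy_df)        # unknown column
missing_variable = safe_eval("df['Fare'].sum() / z", toy_df)    # unknown variable
print(missing_column)
print(missing_variable)
```

The same `try`/`except (NameError, KeyError)` pattern can be wrapped around a `df_chain(...)` call.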

## Filter or transform the dataset
These are operations which return a Pandas DataFrame or Series object after applying some filtering or transformation function.

In [None]:
output = df_chain("Remove duplicates")
output

In [None]:
output = df_chain("Average fare by class and gender")
output

In [None]:
output = df_chain("Remove men under the age of 30 and sort by fare")
output
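For comparison, the three requests in this section correspond roughly to the plain pandas operations below. This is a sketch on a toy DataFrame with assumed column names (`Pclass`, `Sex`, `Age`, `Fare`); the code the chain actually generates may differ.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the column names are assumptions.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "male", "female"],
    "Age": [38.0, 54.0, 22.0, 35.0, 27.0],
    "Fare": [71.28, 51.86, 7.25, 8.05, 11.13],
})

# "Remove duplicates": returns a DataFrame without repeated rows.
deduplicated = toy_df.drop_duplicates()

# "Average fare by class and gender": returns a Series with a two-level index.
avg_fare = toy_df.groupby(["Pclass", "Sex"])["Fare"].mean()

# "Remove men under the age of 30 and sort by fare": returns a DataFrame.
men_under_30 = (toy_df["Sex"] == "male") & (toy_df["Age"] < 30)
filtered = toy_df[~men_under_30].sort_values("Fare")

print(avg_fare)
print(filtered)
```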

## Directly generate Plotly figures
If you ask for a plot, the generated `df.plot` code will be automatically translated into the equivalent Plotly Express code.

In [None]:
output = df_chain("Plot the fare of people under 30 versus their age, colored by sex")
output["result"]

In [None]:
output = df_chain("Plot the average fare per class")
output["result"]

You can even specify the plot type you would like!

In [None]:
output = df_chain("Plot the average fare per class (bar)")
output["result"]