# DataFrame Output Parsing

This demo shows how you can extract tabular DataFrames from raw text.

We show this at two levels of complexity, both backed by the OpenAI Function API:
- (more code) How to build an extractor yourself using our `OpenAIPydanticProgram`
- (less code) How to use our out-of-the-box `DFFullOutputParser` and `DFRowsOutputParser` objects


## Build a DF Extractor Yourself (Using OpenAIPydanticProgram)

Our `OpenAIPydanticProgram` is a wrapper around an OpenAI LLM that supports function calling. It returns structured
outputs in the form of a Pydantic object.

We import our `DataFrame` and `DataFrameRowsOnly` objects.

To create an output extractor, you just need to 1) specify the relevant Pydantic object, and 2) add the right prompt.

In [2]:
from llama_index.program import OpenAIPydanticProgram
# from llama_index.program.df_program import DataFrame, DataFrameRow, DataFrameColumn, DataFrameWithColumns
from llama_index.output_parsers.df import DataFrame, DataFrameRowsOnly
from langchain.chat_models import ChatOpenAI


In [3]:
program = OpenAIPydanticProgram.from_defaults(
    output_cls=DataFrame,
    llm=ChatOpenAI(temperature=0, model_name="gpt-4-0613"),
    prompt_template_str=(
        "Please extract the following query into a structured data according to: {input_str}. "
        "Please extract both the set of column names and a set of rows."
    ),
    verbose=True,
)

In [4]:
response_obj = program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)
response_obj

Function call: DataFrame with args: {
  "columns": [
    {
      "column_name": "Name",
      "column_desc": "Name of the person"
    },
    {
      "column_name": "Age",
      "column_desc": "Age of the person"
    },
    {
      "column_name": "City",
      "column_desc": "City where the person lives"
    },
    {
      "column_name": "Hobby",
      "column_desc": "What the person likes to do"
    }
  ],
  "rows": [
    {
      "row_values": ["John", 25, "New York", "play basketball"]
    },
    {
      "row_values": ["Mike", 30, "San Francisco", "play baseball"]
    },
    {
      "row_values": ["Sarah", 20, "Los Angeles", "play tennis"]
    },
    {
      "row_values": ["Mary", 35, "Chicago", "unknown"]
    }
  ]
}


DataFrame(description=None, columns=[DataFrameColumn(column_name='Name', column_desc='Name of the person'), DataFrameColumn(column_name='Age', column_desc='Age of the person'), DataFrameColumn(column_name='City', column_desc='City where the person lives'), DataFrameColumn(column_name='Hobby', column_desc='What the person likes to do')], rows=[DataFrameRow(row_values=['John', 25, 'New York', 'play basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'play baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'play tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', 'unknown'])])
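To make the shape of this output concrete, here is a minimal sketch that turns function-call arguments like the ones printed above into one dictionary per row, using only the standard library. The helper name `args_to_records` is hypothetical and not part of llama_index, and the JSON is a shortened copy of the printed arguments.

```python
import json
from typing import Any, Dict, List

# Shortened copy of the function-call arguments printed above.
FUNCTION_ARGS = """
{
  "columns": [
    {"column_name": "Name", "column_desc": "Name of the person"},
    {"column_name": "Age", "column_desc": "Age of the person"},
    {"column_name": "City", "column_desc": "City where the person lives"},
    {"column_name": "Hobby", "column_desc": "What the person likes to do"}
  ],
  "rows": [
    {"row_values": ["John", 25, "New York", "play basketball"]},
    {"row_values": ["Mary", 35, "Chicago", "unknown"]}
  ]
}
"""


def args_to_records(args_json: str) -> List[Dict[str, Any]]:
    """Pair each row's values with the column names (hypothetical helper)."""
    args = json.loads(args_json)
    names = [col["column_name"] for col in args["columns"]]
    records = []
    for row in args["rows"]:
        values = row["row_values"]
        # Reject rows whose length does not match the extracted schema.
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(values)}")
        records.append(dict(zip(names, values)))
    return records


records = args_to_records(FUNCTION_ARGS)
print(records[0])
```

A list of dictionaries in this form can be passed directly to `pd.DataFrame(records)` if you want a pandas object.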

In [7]:
program = OpenAIPydanticProgram.from_defaults(
    output_cls=DataFrameRowsOnly,
    llm=ChatOpenAI(temperature=0, model_name="gpt-4-0613"),
    prompt_template_str=(
        "Please extract the following text into a structured data: {input_str}. "
        "The column names are the following: ['Name', 'Age', 'City', 'Favorite Sport']. "
        "Do not specify additional parameters that are not in the function schema. "
    ),
    verbose=True,
)

In [8]:
program(
    input_str="""My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)


Function call: DataFrameRowsOnly with args: {
  "rows": [
    {
      "row_values": ["John", 25, "New York", "basketball"]
    },
    {
      "row_values": ["Mike", 30, "San Francisco", "baseball"]
    },
    {
      "row_values": ["Sarah", 20, "Los Angeles", "tennis"]
    },
    {
      "row_values": ["Mary", 35, "Chicago", "tennis"]
    }
  ]
}


DataFrameRowsOnly(rows=[DataFrameRow(row_values=['John', 25, 'New York', 'basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', 'tennis'])])

### Using our `PydanticProgramOutputParser`

This is a simple convenience wrapper around a `BasePydanticProgram` (like the OpenAI one) that allows you to feed in the string without specifying exact kwargs.

In [9]:
from llama_index.output_parsers import PydanticProgramOutputParser

In [11]:
program_parser = PydanticProgramOutputParser(program)
program_parser.parse(
    """My name is John and I am 25 years old. I live in 
    New York and I like to play basketball. His name is 
    Mike and he is 30 years old. He lives in San Francisco 
    and he likes to play baseball. Sarah is 20 years old 
    and she lives in Los Angeles. She likes to play tennis.
    Her name is Mary and she is 35 years old. 
    She lives in Chicago."""
)


Function call: DataFrameRowsOnly with args: {
  "rows": [
    {
      "row_values": ["John", 25, "New York", "basketball"]
    },
    {
      "row_values": ["Mike", 30, "San Francisco", "baseball"]
    },
    {
      "row_values": ["Sarah", 20, "Los Angeles", "tennis"]
    },
    {
      "row_values": ["Mary", 35, "Chicago", ""]
    }
  ]
}


DataFrameRowsOnly(rows=[DataFrameRow(row_values=['John', 25, 'New York', 'basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', ''])])
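The idea behind this kind of wrapper can be pictured with a small sketch. This is not the library's implementation: `SimpleProgramParser` and `fake_program` are hypothetical names, and the stub stands in for an LLM-backed program so the example runs without any API call. The only assumption carried over from the cells above is that the program expects its text under the `input_str` keyword.

```python
from typing import Any, Callable


class SimpleProgramParser:
    """Hypothetical stand-in for a program-wrapping output parser."""

    def __init__(self, program: Callable[..., Any], input_key: str = "input_str") -> None:
        self._program = program
        self._input_key = input_key

    def parse(self, text: str) -> Any:
        # Forward the raw string under the keyword the program's prompt expects,
        # so the caller does not have to name the kwarg.
        return self._program(**{self._input_key: text})


def fake_program(input_str: str) -> dict:
    """Stub standing in for an LLM-backed program; it makes no API call."""
    return {"rows": [{"row_values": [input_str.strip()]}]}


parser = SimpleProgramParser(fake_program)
result = parser.parse("  John, 25  ")
print(result)
```

The design choice is simply to hide the prompt's template variable behind a single `parse(text)` method, which is what lets the caller pass a bare string.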

## Use our DataFrame Output Parsers

We provide `DFFullOutputParser` and `DFRowsOutputParser` as convenience wrappers. They offer a simpler object-creation interface than specifying every detail through the `OpenAIPydanticProgram`.

In [12]:
from llama_index.output_parsers.df import DFRowsOutputParser, DFFullOutputParser
from llama_index.program import OpenAIPydanticProgram
import pandas as pd

# initialize empty df
df = pd.DataFrame({'Name': pd.Series(dtype='str'),
                   'Age': pd.Series(dtype='int'),
                   'City': pd.Series(dtype='str'),
                   'Favorite Sport': pd.Series(dtype='str')})

# initialize parser, using existing df as schema 
df_rows_output_parser = DFRowsOutputParser.from_defaults(
    pydantic_program_cls=OpenAIPydanticProgram,
    df=df
)
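How `DFRowsOutputParser` turns the DataFrame into a prompt is not shown in this notebook. As a rough sketch of what "using an existing df as schema" could mean, the hypothetical helper below builds a column hint in the same wording as the hand-written prompt from the earlier `DataFrameRowsOnly` example. It takes plain column names so that it runs without pandas.

```python
from typing import Iterable


def column_hint(column_names: Iterable[str]) -> str:
    """Build a prompt fragment that lists the expected columns (hypothetical helper)."""
    names = [str(name) for name in column_names]
    return f"The column names are the following: {names}. "


# With pandas, the names could come from list(df.columns).
hint = column_hint(["Name", "Age", "City", "Favorite Sport"])
print(hint)
```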

In [13]:
# parse text, using existing df as schema 
result_obj = df_rows_output_parser.parse(
    """My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)

In [14]:
result_obj.to_df(existing_df=df)


|   | Name | Age | City | Favorite Sport |
|---|------|-----|------|----------------|
| 0 | John | 25 | New York | Basketball |
| 1 | Mike | 30 | San Francisco | Baseball |
| 2 | Sarah | 20 | Los Angeles | Tennis |
| 3 | Mary | 35 | Chicago | |


In [15]:
# initialize parser that can do joint schema extraction and structured data extraction 
df_full_output_parser = DFFullOutputParser.from_defaults(
    pydantic_program_cls=OpenAIPydanticProgram,
)

In [16]:
result_obj = df_full_output_parser.parse(
    """My name is John and I am 25 years old. I live in 
        New York and I like to play basketball. His name is 
        Mike and he is 30 years old. He lives in San Francisco 
        and he likes to play baseball. Sarah is 20 years old 
        and she lives in Los Angeles. She likes to play tennis.
        Her name is Mary and she is 35 years old. 
        She lives in Chicago."""
)

In [4]:
result_obj.to_df()

|   | Name | Age | Location | Hobby |
|---|------|-----|----------|-------|
| 0 | John | 25 | New York | Basketball |
| 1 | Mike | 30 | San Francisco | Baseball |
| 2 | Sarah | 20 | Los Angeles | Tennis |
| 3 | Mary | 35 | Chicago | |
