## Structured Data with OpenAI - Expense Parsing

In this lab, our goals are to:

- Understand OpenAI's Structured Outputs feature
- Learn to define Pydantic models for data extraction
- Practice parsing natural language into structured data
- Handle edge cases and validation

## Part 0: Setup

In [None]:
import os

from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel

load_dotenv()  # Load environment variables from .env file
API_KEY = os.getenv("OPENAI_API_KEY")

# Confirm the key loaded without printing the secret itself
print("API key loaded:", API_KEY is not None)

client = OpenAI(api_key=API_KEY)

## Part 1: A Simple Example

Let's start by extracting some information about a person from a simple sentence that describes them.

To start, we'll define a Pydantic model that represents our `Person` object.

In [None]:
# Define the structured output schema using Pydantic
class Person(BaseModel):
    name: str
    age: int

    def say_hi(self):
        print(f"Hi! I'm {self.name} and I'm {self.age} years old!")
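
Before calling the API, it can help to look at the JSON schema that Pydantic derives from the model, since a schema like this is what describes the expected output shape. The snippet below is a minimal sketch that only uses Pydantic (v2 is assumed) and makes no network calls.

```python
import json

from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


# Pydantic v2 derives a JSON schema from the type annotations
schema = Person.model_json_schema()
print(json.dumps(schema, indent=2))
```

Both fields appear under `required`, and methods such as `say_hi` are not part of the schema, so only the annotated fields describe the data.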

Next, we'll make a call to OpenAI and specify the `text_format` as our `Person`.

<details>
<summary>🤔 Why `input` instead of `messages`? Why `text_format`? And what about `response_format`?</summary>

There are several ways to get structured output, which is why the docs can look inconsistent. Here’s the breakdown:

- `input` vs `messages` → The newer Responses API uses `input` (which can still take role-based messages). The older Chat Completions API uses `messages`.
- `response_format` → That’s the Chat Completions parameter. It still works, but it belongs to the older API.
- `text.format` → The low-level Responses API option for controlling output (plain text, JSON object, or JSON schema).
- `text_format` → A convenience wrapper around `text.format`. With Pydantic models, it automatically generates the schema and parses the response back into your model.
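
To make the relationship between `text_format` and `text.format` more concrete, here is a rough sketch of building the low-level option by hand. It is an approximation under stated assumptions: the schema name `"person"` is made up for illustration, `extra="forbid"` is used so that Pydantic emits `additionalProperties: false`, and the schema the SDK actually generates may differ in its details.

```python
from pydantic import BaseModel, ConfigDict


class Person(BaseModel):
    # Assumption: forbidding extra fields makes Pydantic emit
    # "additionalProperties": false in the generated schema
    model_config = ConfigDict(extra="forbid")

    name: str
    age: int


# Hand-built value for the `text` parameter (illustrative only)
text_option = {
    "format": {
        "type": "json_schema",
        "name": "person",  # hypothetical schema name
        "schema": Person.model_json_schema(),
        "strict": True,
    }
}

print(text_option["format"]["schema"]["required"])
```

With this approach you would pass `text=text_option` to `client.responses.create(...)` and parse the returned JSON yourself, for example with `Person.model_validate_json(response.output_text)`. The `text_format` shortcut handles both steps.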

Since we’re not calling tools _yet_, the recommended path is to stick with `text_format`. It’s simple, ergonomic, and does the parsing for us. If you want to dig deeper, check out [this guide on choosing the right structured output approach](https://platform.openai.com/docs/guides/structured-outputs#function-calling-vs-response-format).

</details>

In [None]:
def extract_user_data(user_description):
    response = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "developer", "content": "Extract the user information."},
            {
                "role": "user",
                "content": user_description,
            },
        ],
        text_format=Person,
    )

    return response.output_parsed


user_description = "Alice is a twenty-five year old."
user = extract_user_data(user_description)

print(user)       # name='Alice' age=25
print(type(user)) # <class '__main__.Person'>
user.say_hi()     # Hi! I'm Alice and I'm 25 years old!

Amazing! We took unstructured input and were given back an object that we can use in our code.
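
One edge case worth planning for: `output_parsed` is typed as optional in the SDK, so it seems safest to assume it can be `None` (for example, if the model declines to answer). The helper below is a hypothetical sketch, not part of the OpenAI library, and it uses stand-in objects instead of real API responses so it can run offline.

```python
from types import SimpleNamespace


def require_parsed(response):
    """Return the parsed object, or raise if the response has none."""
    parsed = response.output_parsed
    if parsed is None:
        raise ValueError("The response contained no parsed output")
    return parsed


# Stand-ins for real responses (assumption: only `output_parsed` matters here)
fake_ok = SimpleNamespace(output_parsed={"name": "Alice", "age": 25})
fake_empty = SimpleNamespace(output_parsed=None)

print(require_parsed(fake_ok))

try:
    require_parsed(fake_empty)
except ValueError as exc:
    print("Handled:", exc)
```

Failing loudly here avoids an `AttributeError` later, when the code would otherwise call a method such as `say_hi()` on `None`.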

Let's test the limits of this data extraction. Write several sentences that are less clear than "Alice is a twenty-five year old." and see if the LLM can still extract the correct data. For example, given the sentence below, the model tends to choose `65` as the age, and `Jenny`, `Jennifer`, or `Jennyfer` as the name:

> I've had a lovely time at my retirement party! The only problem is that they spelt my name wrong on the cake. I can't believe anyone would think to spell it 'Jennyfer'!

Beyond ambiguous inputs, also try inputs that don't contain the required information, and see what happens.

In [None]:
# TODO: Test different inputs:
#    - Inputs where the data to extract is unclear.
#    - Inputs lacking data that's supposed to be extracted.
#    - Inputs with more information than is required.

print("\nUnclear or ambiguous input")
unclear_description = "I've had a lovely time at my retirement party! The only problem is that they spelt my name wrong on the cake. I can't believe anyone would think to spell it 'Jennyfer'!"
user1 = extract_user_data(unclear_description)
print(user1)

print("\nIncomplete data")
incomplete_description = "Alice is here."
user2 = extract_user_data(incomplete_description)
print(user2)

print("\nExcess information")
verbose_description = "Sean and James are here. It's both of their birthdays. James is turning 33 and Sean is five years older. They're going to the pub to celebrate."
user3 = extract_user_data(verbose_description)
print(user3)

### Handling Incomplete Data

In real-world applications, the data we want isn't always present. If you ran the tests above, you probably noticed that the LLM makes up data to fill in fields the input doesn't cover.

To fix this, we need to give the LLM an option to say "I don't know." In Pydantic terms, that means making a field optional.

In [None]:
# TODO: Update your Pydantic model to allow for incomplete or unknown
#       data to prevent hallucinated data in our objects.

from typing import Optional

class Person(BaseModel):
    name: str
    age: Optional[int] = None

person = extract_user_data('Alice is a woman')
print(person)
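
To see what the optional field changes, we can inspect the schema for `age` and validate a payload with a missing age locally. This is a small offline sketch that assumes Pydantic v2 and makes no API call.

```python
from typing import Optional

from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: Optional[int] = None


# The schema for `age` now allows either an integer or null
age_schema = Person.model_json_schema()["properties"]["age"]
print(age_schema["anyOf"])

# A payload with an unknown age validates instead of forcing a guess
person = Person.model_validate({"name": "Alice", "age": None})
print(person.age is None)
```

Note that `name` is still a plain `str`, so an input with no name at all may still produce an invented one.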

## Part 2: Extracting Expense Information

On your own, you'll now define a model and test it using the following requirements:

Create a Pydantic model named `ExpenseItem` that represents a single item in a list of expenses. Your model should include the following fields:

`amount` (required)
  - Type: `float`
  - Description: A numeric value representing the monetary amount spent.

`memo` (required)
  - Type: `str`
  - Description: A short description or note about what the expense was for.

`date` (optional)
  - Type: `date` (from the `datetime` module)
  - Description: The date the expense occurred, in YYYY-MM-DD format. It should be inferred from context where possible (for example, "yesterday") and left empty if the date can't be determined.

`category` (optional)
  - Type: `str`
  - Description: A category label describing the type of expense (e.g., food, transportation, entertainment, etc.).

In [None]:
from typing import Optional
from datetime import date as date_type
from pydantic import BaseModel, Field

# TODO: Define ExpenseItem model to meet the above requirements

class ExpenseItem(BaseModel):
    amount: float = Field(description="The monetary amount spent")
    memo: str = Field(description="Brief description of what the expense was for")
    date: Optional[date_type] = Field(default=None, description="Date of expense in YYYY-MM-DD format, infer from context if possible")
    category: Optional[str] = Field(default=None, description="Category of expense (food, transportation, entertainment, etc.)")
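
You can sanity-check the model without calling the API by validating hand-written JSON. The sketch below assumes Pydantic v2, and the sample payloads are made up for illustration. It shows the date string being converted to a `date` object and a bad amount being rejected.

```python
from datetime import date as date_type
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ExpenseItem(BaseModel):
    amount: float = Field(description="The monetary amount spent")
    memo: str = Field(description="Brief description of what the expense was for")
    date: Optional[date_type] = Field(default=None, description="Date of expense in YYYY-MM-DD format")
    category: Optional[str] = Field(default=None, description="Category of expense")


# Hand-written payload standing in for model output
raw = '{"amount": 23, "memo": "lunch", "date": "2025-01-14", "category": null}'
item = ExpenseItem.model_validate_json(raw)
print(item.amount, item.date, item.category)

# A non-numeric amount fails validation instead of slipping through
rejected = False
try:
    ExpenseItem.model_validate_json('{"amount": "a lot", "memo": "lunch"}')
except ValidationError as exc:
    rejected = True
    print(exc.error_count(), "validation error(s)")
```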

### Test Your Pydantic Model

Next, write a function, `extract_expense_data`, that takes some input and returns an `ExpenseItem` object populated with the data indicated in the input.

Hint: To get the date to work with inputs like "today" and "yesterday", provide today's date in the system message.
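
One way to follow this hint is to build the system message in a small helper. The function name `build_system_prompt` and the exact wording are assumptions made for illustration. Including the weekday is an optional extra that may help with phrases like "last Friday".

```python
from datetime import date


def build_system_prompt(today: date) -> str:
    """Build a system message that anchors relative dates to `today`."""
    return (
        "Extract the expense information from the memo. "
        f"Today is {today.strftime('%A')}, {today.isoformat()}. "
        "Resolve relative dates such as 'today' or 'yesterday' against this date."
    )


# Passing the date in explicitly keeps the helper easy to test
print(build_system_prompt(date(2025, 1, 15)))
```

In the lab you would call it as `build_system_prompt(date.today())`.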

In [None]:
# TODO: Define and test `extract_expense_data`

def extract_expense_data(expense_description):
    response = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "system", "content": f"Extract the expense information from the memo. Today is {date_type.today()}"},
            {
                "role": "user",
                "content": expense_description,
            },
        ],
        text_format=ExpenseItem,
    )

    return response.output_parsed

print(extract_expense_data("I spent 23 dollars on lunch yesterday"))

## Part 3: Handling Multiple Expenses

Sometimes, we'd like to handle user inputs that describe more than one expense at a time. For example:

> I went to the store yesterday and spent 42 dollars on groceries. Then I went out to lunch with mom and paid 33 dollars. I also got a coffee on my way home for 4.33.

We can let the LLM return multiple objects, as long as our Pydantic model can hold a variable number of them.

To do this, create an `ExpenseList` Pydantic model that contains a list of `ExpenseItem` objects. Then, update the `text_format` to use an `ExpenseList`. This way, the LLM can create as few or as many expenses as it deems necessary.

In [None]:
from typing import List

# TODO: Define `ExpenseList` model
# TODO: Define and test `extract_expenses_data` that handles one or more expenses

class ExpenseList(BaseModel):
    expenses: List[ExpenseItem]

def extract_expenses_data(expense_descriptions):
    response = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "system", "content": f"Extract the expense information from the memo. Today is {date_type.today()}"},
            {
                "role": "user",
                "content": expense_descriptions,
            },
        ],
        text_format=ExpenseList,
    )

    return response.output_parsed


expense_list = extract_expenses_data(
    """
    I went to the store yesterday and spent 42 dollars on groceries.
    Then I went out to lunch with mom and paid 33 dollars.
    I also got a coffee on my way home for 4.33.
    """)

# Your object may require a different way to iterate over expenses
for expense in expense_list.expenses:
    print(expense)
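
Once the expenses are ordinary Python objects, regular code can work with them. As a sketch, the hypothetical helper below totals amounts per category. It uses a trimmed-down `ExpenseItem` (no `date` field) and hand-built sample data, so it runs without the API.

```python
from collections import defaultdict
from typing import List, Optional

from pydantic import BaseModel


class ExpenseItem(BaseModel):
    amount: float
    memo: str
    category: Optional[str] = None


class ExpenseList(BaseModel):
    expenses: List[ExpenseItem]


def total_by_category(expense_list: ExpenseList) -> dict:
    """Sum amounts per category, grouping missing categories together."""
    totals = defaultdict(float)
    for expense in expense_list.expenses:
        totals[expense.category or "uncategorized"] += expense.amount
    return dict(totals)


# Hand-built data standing in for a parsed API response
sample = ExpenseList(expenses=[
    ExpenseItem(amount=42, memo="groceries", category="food"),
    ExpenseItem(amount=33, memo="lunch with mom", category="food"),
    ExpenseItem(amount=4.33, memo="coffee"),
])

totals = total_by_category(sample)
print(totals)

# An empty list is valid too, for inputs that mention no expenses
print(total_by_category(ExpenseList(expenses=[])))
```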