# Wrestling with Structured Output

## The Structured Output Challenge

Large language models (LLMs) excel at generating human-like text, but they often struggle to produce output in a structured format consistently. This poses a significant challenge when LLM output must be processed by other systems, such as databases, APIs, or other software applications.

Even with a well-crafted prompt, an LLM may return free-form text when a structured response is expected. The example below asks for JSON and checks whether the reply actually parses.

In [1]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

from openai import OpenAI

client = OpenAI()

In [35]:
# Define the prompt expecting a structured JSON response
MAX_LENGTH = 10000
with open('../data/apple.txt', 'r') as file:
    sec_filing = file.read() 
prompt = f"""
Generate a two-person discussion about the key financial data from the following text in JSON format.
TEXT: {sec_filing[:MAX_LENGTH]}
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}]
)

In [26]:
response_content = response.choices[0].message.content
print(response_content)

Person 1: Wow, Apple Inc. seems to have a lot of different products and services they offer. It's interesting to see the breakdown of their revenue streams in their Form 10-K.

Person 2: Absolutely, they have a diverse portfolio with iPhones, Macs, iPads, wearables, and even services. It's impressive to see how they have capitalized on different technology trends.

Person 1: I noticed that they have a large market value of over $2.6 trillion as of March 29, 2024. That's a huge amount, and it shows the confidence investors have in the company.

Person 2: Definitely, that's a significant figure. It's also good to see that they are complying with all the required SEC regulations and filing their reports in a timely manner.

Person 1: Yes, it's crucial for investors to have access to accurate and up-to-date financial information. It helps in making informed decisions about their investments in the company.

Person 2: Absolutely, transparency and compliance with regulations are key in the f

In [27]:
import json

def is_json(myjson):
    """Return True if the string parses as JSON, False otherwise."""
    try:
        json.loads(myjson)
    except ValueError:
        return False
    return True

is_json(response_content)


False

In this example, despite the prompt explicitly asking for JSON, the LLM returns a plain-text dialogue that fails to parse. This highlights how unpredictable LLMs can be when it comes to producing structured output.

## Problem Statement

Obtaining structured output from LLMs presents several significant challenges:

* **Inconsistency**: LLMs often produce unpredictable results, sometimes generating well-structured output and other times deviating from the expected format.
* **Lack of Type Safety**: LLMs emit plain text rather than typed values, so a field expected to be a number or a date may arrive as an arbitrary string, causing errors when the output is integrated with systems requiring specific data formats.
* **Prompt Engineering Complexity**: Crafting prompts that effectively guide LLMs to produce the correct structured output is complex and requires extensive experimentation.

## Solutions

Several strategies and tools can be employed to address the challenges of structured output from LLMs.

### Strategies

* **Schema Guidance**: Providing the LLM with a clear schema or blueprint of the desired output structure helps to constrain its generation and improve consistency. This can be achieved by using tools like Pydantic to define the expected data structure and then using that definition to guide the LLM's output. 
* **Output Parsing**: When LLMs don't natively support structured output, parsing their text output using techniques like regular expressions or dedicated parsing libraries can extract the desired information. For example, you can use regular expressions to extract specific patterns from the LLM's output, or you can use libraries like Pydantic to parse the output into structured data objects.
* **Type Enforcement**: Using tools that enforce data types, such as Pydantic in Python, can ensure that the LLM output adheres to the required data formats. This can help to prevent errors when integrating the LLM's output with other systems.
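
The output parsing and type enforcement strategies above can be combined in a few lines. The following is a minimal sketch, not a definitive implementation: the `DialogueTurn` schema, the helper names, and the sample reply are all hypothetical, and a standard-library dataclass with manual checks stands in for Pydantic so that the example has no dependencies.

```python
import json
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence that models often wrap around JSON


@dataclass
class DialogueTurn:
    # Hypothetical schema for one turn of a discussion.
    name: str
    statement: str


def extract_json(text: str) -> str:
    """Return the span from the first '{' to the last '}' in the text."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return match.group(0)


def parse_turn(text: str) -> DialogueTurn:
    """Parse model output into a DialogueTurn, enforcing keys and types."""
    data = json.loads(extract_json(text))
    turn = DialogueTurn(**data)  # TypeError on missing or unexpected keys
    for field_name in ("name", "statement"):
        if not isinstance(getattr(turn, field_name), str):
            raise TypeError(f"{field_name} must be a string")
    return turn


# Hypothetical model reply: valid JSON buried in chatter and a code fence.
raw = (
    "Sure! Here is the JSON you asked for:\n"
    f"{FENCE}json\n"
    '{"name": "Alice", "statement": "Revenue grew 20% year over year."}\n'
    f"{FENCE}"
)
turn = parse_turn(raw)
print(turn.name, "-", turn.statement)
```

With Pydantic, the dataclass and the manual type loop would collapse into a single model definition, which is the main reason the strategies above recommend it.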

### Techniques and Tools

#### One-Shot Prompts

In one-shot prompting, you provide a single example of the desired output format within the prompt.

In [31]:
prompt = f"""
Generate a two-person discussion about the key financial data from the following text in JSON format.

<JSON_FORMAT>
{{
   "Person1": {{
     "name": "Alice",
     "statement": "The revenue for Q1 has increased by 20% compared to last year."
   }},
   "Person2": {{
     "name": "Bob",
     "statement": "That's great news! What about the net profit margin?"
   }}
}}
</JSON_FORMAT>

TEXT: {sec_filing[:MAX_LENGTH]}
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}]
)

In [32]:
response_content = response.choices[0].message.content
print(response_content)

{
   "Person1": {
     "name": "Alice",
     "statement": "The revenue for Q1 has increased by 20% compared to last year."
   },
   "Person2": {
     "name": "Bob",
     "statement": "That's great news! What about the net profit margin?"
   }
}


In [33]:
is_json(response_content)


True
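
The prompt above escapes every brace by hand (`{{` and `}}`) because it is an f-string. One way to avoid that, sketched here under the assumption that the example lives in a Python dictionary, is to serialize it with `json.dumps`. The helper name `build_one_shot_prompt` and the placeholder filing text are illustrative only.

```python
import json

# The same example turn used in the prompt above, kept as a Python dict.
EXAMPLE = {
    "Person1": {
        "name": "Alice",
        "statement": "The revenue for Q1 has increased by 20% compared to last year.",
    },
    "Person2": {
        "name": "Bob",
        "statement": "That's great news! What about the net profit margin?",
    },
}


def build_one_shot_prompt(text: str, example: dict) -> str:
    """Build a one-shot prompt; json.dumps guarantees the example is valid JSON."""
    example_json = json.dumps(example, indent=2)
    return (
        "Generate a two-person discussion about the key financial data "
        "from the following text in JSON format.\n\n"
        f"<JSON_FORMAT>\n{example_json}\n</JSON_FORMAT>\n\n"
        f"TEXT: {text}"
    )


# Placeholder input; in the notebook this would be sec_filing[:MAX_LENGTH].
prompt = build_one_shot_prompt("Apple Inc. Form 10-K excerpt", EXAMPLE)
print(prompt)
```

Keeping the example as data also makes it easy to swap in a different example or to reuse it when validating the response.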

#### Structured Output with Provider-Specific APIs

Simple techniques such as one-shot prompting can lead to material improvements, but they may not be sufficient for complex (e.g. nested) structures or when the model's output must be restricted to a specific set of options or types. In the run above, for instance, the reply is valid JSON but merely repeats the example from the prompt verbatim instead of discussing the filing.

Provider-specific APIs offer ways to handle those challenges. For example, OpenAI's API offers features specifically designed for generating JSON and, more generally, schema-conforming structured output.

**JSON Mode**

JSON mode is a feature provided by some LLM APIs, such as OpenAI's, that constrains the model to generate syntactically valid JSON. This is particularly useful when you need structured data as a result, such as when parsing the output programmatically or integrating it with other systems that require JSON input.

When using JSON mode, you must always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don't include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don't forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.
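
Because a missing JSON instruction only surfaces as an API error at request time, it can be convenient to check for it locally first. The sketch below is one possible guard, not part of the OpenAI SDK: the helper names are hypothetical, and the case-insensitive match on "json" is an assumption about what satisfies the API.

```python
def mentions_json(messages: list[dict]) -> bool:
    """True if any message mentions JSON (case-insensitive match is an assumption)."""
    return any("json" in str(m.get("content", "")).lower() for m in messages)


def json_mode_kwargs(model: str, messages: list[dict]) -> dict:
    """Build request arguments for JSON mode, failing fast without an instruction."""
    if not mentions_json(messages):
        raise ValueError("JSON mode requires an explicit instruction to produce JSON")
    return {
        "model": model,
        "messages": messages,
        "response_format": {"type": "json_object"},
    }


kwargs = json_mode_kwargs(
    "gpt-3.5-turbo",
    [
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "Summarize the key financial data."},
    ],
)
# The request itself would then be: client.chat.completions.create(**kwargs)
print(kwargs["response_format"])
```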




In [48]:
prompt = f"""
Generate a two-person discussion about the key financial data from the following text in JSON format.
TEXT: {sec_filing[:MAX_LENGTH]}
"""
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"}
)

In [49]:
response_content = response.choices[0].message.content
print(response_content)

{
  "person1": "I see that Apple Inc. reported a total market value of approximately $2,628,553,000,000 held by non-affiliates as of March 29, 2024. That's a significant amount!",
  "person2": "Yes, it definitely shows the scale and value of the company in the market. It's impressive to see the sheer size of the market value.",
  "person1": "Also, they mentioned having 15,115,823,000 shares of common stock issued and outstanding as of October 18, 2024. That's a large number of shares circulating in the market.",
  "person2": "Absolutely, the number of shares outstanding plays a crucial role in determining the company's market capitalization and investor interest."
}


In [50]:
is_json(response_content)

True

JSON mode does not guarantee that the output matches any specific schema, only that it is valid JSON and parses without errors. To enforce a schema, we can use a newer feature offered by some LLM APIs, called "Structured Outputs", which ensures the output matches a target schema and its types.
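
The JSON mode output shown earlier illustrates this gap: it passes `is_json`, yet the keys `person1` and `person2` each appear twice. Python's `json.loads` accepts such input and silently keeps only the last value for each key, so half of the conversation is lost. The sketch below uses shortened, made-up values and a hypothetical `reject_duplicate_keys` hook to surface the problem instead.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises instead of silently dropping repeated keys."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys in JSON object: {duplicates}")
    return dict(pairs)


# Shortened stand-in with the same shape as the JSON mode output.
raw = '{"person1": "first", "person2": "reply", "person1": "second", "person2": "closing"}'

# Default behavior: parses fine, but only the last value per key survives.
print(json.loads(raw))

try:
    json.loads(raw, object_pairs_hook=reject_duplicate_keys)
except ValueError as err:
    print("rejected:", err)
```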


**Structured Output Mode**

Structured Outputs is a feature that ensures the model will always generate responses that adhere to your supplied JSON Schema, so you don't need to worry about the model omitting a required key, or hallucinating an invalid enum value.

Some benefits of Structured Outputs include:
- **Reliable type-safety**: No need to validate or retry incorrectly formatted responses.
- **Explicit refusals**: Safety-based model refusals are now programmatically detectable.
- **Simpler prompting**: No need for strongly worded prompts to achieve consistent formatting.


Here's a Python example demonstrating how to use the OpenAI API to extract structured data, namely the entities and places mentioned, from an input SEC filing. This example passes a Pydantic model as the `response_format` parameter of the API call. This functionality is supported by GPT-4o models, specifically `gpt-4o-mini-2024-07-18`, `gpt-4o-2024-08-06`, and later versions.

In [68]:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class SECExtraction(BaseModel):
    mentioned_entities: list[str]
    mentioned_places: list[str]

def extract_from_sec_filing(sec_filing_text: str) -> SECExtraction:
    """
    Extracts structured data from an input SEC filing text.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert at structured data extraction. You will be given unstructured text from an SEC filing and should extract the names of mentioned entities and places, converting them into the given structure."
                )
            },
            {"role": "user", "content": sec_filing_text}
        ],
        response_format=SECExtraction
    )
    return completion.choices[0].message.parsed

**Explanation:**

*   **Data Structures:** The code defines a Pydantic model, `SECExtraction`, with two fields, `mentioned_entities` and `mentioned_places`, each a list of strings. The model provides type hints and structure for the extracted data.
*   **API Interaction:** The `extract_from_sec_filing` function uses the OpenAI client to send a chat completion request to the `gpt-4o-mini-2024-07-18` model. The system message instructs the model to extract entities and places from the filing text. The `response_format` is set to `SECExtraction`, ensuring the response conforms to the Pydantic model.
*   **Output Processing:** The returned response is parsed into a `SECExtraction` instance, available as `completion.choices[0].message.parsed`, which the function returns.

In [69]:
extraction = extract_from_sec_filing(sec_filing[:MAX_LENGTH])

In [70]:
print("Extracted entities:", extraction.mentioned_entities)
print("Extracted places:", extraction.mentioned_places)


Extracted entities: ['Apple Inc.', 'The Nasdaq Stock Market LLC']
Extracted places: ['Washington, D.C.', 'California', 'Cupertino, California']


**Benefits**

*   **Structured Output:** The use of Pydantic models and the `response_format` parameter enforces the structure of the model's output, making it more reliable and easier to process.
*   **Schema Adherence:** Structured Outputs in the OpenAI API guarantees that the response adheres to the provided schema.

This structured approach improves the reliability and usability of your application by ensuring consistent, predictable output from the OpenAI API.
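
One of the benefits listed earlier is that safety refusals become programmatically detectable. A minimal sketch of that check follows, assuming the message object exposes `refusal` and `parsed` attributes as the OpenAI Python SDK does for parsed completions. The `ModelRefusal` exception and the `parsed_or_raise` helper are hypothetical names, and `SimpleNamespace` objects stand in for real API responses so the example runs offline.

```python
from types import SimpleNamespace


class ModelRefusal(RuntimeError):
    """Hypothetical exception raised when the model declines to answer."""


def parsed_or_raise(message):
    """Return the parsed object, or raise if the model refused or returned nothing."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise ModelRefusal(refusal)
    if message.parsed is None:
        raise ValueError("response contained no parsed object")
    return message.parsed


# Stand-ins for completion.choices[0].message in the two cases.
ok = SimpleNamespace(refusal=None, parsed={"mentioned_entities": ["Apple Inc."]})
refused = SimpleNamespace(refusal="I'm sorry, I can't help with that.", parsed=None)

print(parsed_or_raise(ok))
try:
    parsed_or_raise(refused)
except ModelRefusal as err:
    print("refused:", err)
```

In `extract_from_sec_filing`, such a helper could wrap `completion.choices[0].message` instead of returning `.parsed` directly.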

### LangChain

LangChain is a framework designed to simplify the development of LLM applications. It offers several tools for parsing structured output, including:

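* **`with_structured_output`**: This method is available on chat models that support structured output (for example through tool calling or JSON mode). It binds a schema to the model so that responses are returned already parsed into that structure, rather than relying on prompt instructions alone.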
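
As a rough sketch of how `with_structured_output` might be used for the same extraction task, the function below assumes the `langchain-openai` and `pydantic` packages are installed and that an OpenAI API key is configured. The function name, the model name `gpt-4o-mini`, and the reuse of the `SECExtraction` fields are illustrative choices, and the imports are deferred into the function only so the snippet can be loaded without those optional dependencies.

```python
def extract_with_langchain(sec_filing_text: str):
    """Extract entities and places from filing text via LangChain (sketch)."""
    # Deferred imports: optional third-party dependencies.
    from langchain_openai import ChatOpenAI
    from pydantic import BaseModel

    class SECExtraction(BaseModel):
        """Entities and places mentioned in an SEC filing."""

        mentioned_entities: list[str]
        mentioned_places: list[str]

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    # Bind the schema to the model; invoke() then returns a SECExtraction instance.
    structured_llm = llm.with_structured_output(SECExtraction)
    return structured_llm.invoke(sec_filing_text)


# Usage (requires network access and an API key):
# result = extract_with_langchain(sec_filing[:MAX_LENGTH])
# print(result.mentioned_entities)
```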

### Outlines

Outlines is a library specifically focused on structured text generation from LLMs.  It provides several powerful features:

* **Multiple Choice Generation**: Restrict the LLM output to a predefined set of options.
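
To make the idea of multiple choice generation concrete, here is a toy illustration in plain Python. It is not the Outlines API: real constrained generation masks tokens step by step during decoding, whereas this sketch simply assumes we already have one score per candidate answer (the scores below are made up) and discards everything outside the allowed set.

```python
def constrained_choice(option_scores: dict[str, float], allowed: list[str]) -> str:
    """Pick the highest-scoring answer among `allowed`, ignoring all others."""
    candidates = {opt: option_scores.get(opt, float("-inf")) for opt in allowed}
    return max(candidates, key=candidates.get)


# Hypothetical scores: unconstrained, the model prefers a free-form answer.
scores = {
    "It depends on the quarter": 2.1,
    "Positive": 1.4,
    "Neutral": 0.9,
    "Negative": 0.3,
}

unconstrained = max(scores, key=scores.get)
label = constrained_choice(scores, ["Positive", "Negative", "Neutral"])
print(unconstrained)  # the free-form answer wins without constraints
print(label)          # the best answer inside the allowed set
```

The point of the constraint is that the result is always one of the predefined options, so downstream code never has to handle an unexpected label.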

### Comparing Solutions

* **Simplicity vs. Control**: One-shot prompts are simple but offer limited control. Provider-specific features such as OpenAI's Structured Outputs, along with LangChain and Outlines, provide greater control but might have a steeper learning curve.
* **Native LLM Support**: `with_structured_output` in LangChain relies on the LLM having built-in support for structured output APIs. Other methods, like parsing or using Outlines, are more broadly applicable.
* **Flexibility**: Outlines and LangChain's `StructuredOutputParser` offer the most flexibility for defining custom output structures.

## Best Practices


* **Clear Schema Definition**: Define the desired output structure clearly, using schemas, types, or grammars as appropriate. This ensures the LLM knows exactly what format is expected.
* **Descriptive Naming**: Use meaningful names for fields and elements in your schema. This makes the output more understandable and easier to work with.
* **Detailed Prompting**: Guide the LLM with well-crafted prompts that include examples and clear instructions.  A well-structured prompt improves the chances of getting the desired output.
* **Error Handling**: Implement mechanisms to handle cases where the LLM deviates from the expected structure. LLMs are not perfect, so having error handling ensures your application remains robust.
* **Testing and Iteration**: Thoroughly test your structured output generation process and refine your prompts and schemas based on the results. Continuous testing and refinement are key to achieving reliable structured output.
* **Structured Response**: If you want to structure the model's output when it responds to the user, consider using a structured `response_format` (e.g. a JSON schema).
* **Function Calling**: If you are connecting the model to tools, functions, or data in your system, use function calling with typed arguments.
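
The error handling practice above is often implemented as a validate-and-retry loop. The following sketch shows one possible shape for it, using only the standard library. The `generate_with_retries` helper is hypothetical, `generate` stands for any zero-argument callable that returns the model's raw text, and the canned replies replace real API calls.

```python
import json
from typing import Callable


def generate_with_retries(
    generate: Callable[[], str],
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call `generate` until it returns a JSON object containing `required_keys`."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Canned replies standing in for successive model calls.
replies = iter([
    "Person 1: Wow, Apple Inc. seems to have a lot of products.",  # not JSON
    '{"Person1": "Revenue is up."}',                               # missing a key
    '{"Person1": "Revenue is up.", "Person2": "And margins?"}',    # valid
])
result = generate_with_retries(lambda: next(replies), {"Person1", "Person2"})
print(result)
```

Capping the number of attempts keeps cost and latency bounded, and chaining the last validation error into the final exception makes failures easier to diagnose.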



## Conclusion

Extracting structured output from LLMs is crucial for integrating them into real-world applications. By understanding the challenges and employing appropriate strategies and tools, developers can improve the reliability and usability of LLM-powered systems, unlocking their potential to automate complex tasks and generate valuable insights. 
