<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [William Mattingly](https://www.wjbmattingly.com) for the 2025 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).
<br />
____

# Improving the Output of LLMs for Analysis

In this notebook, we will use what we have learned to generate better outputs from LLMs. By the end, you will know:


1) How to include other classes for the model and why these can improve outputs
2) How to perform OCR correction with an LLM and, more importantly, some of the problems with this approach
3) How to leverage NER for downstream tasks, such as the creation of knowledge graphs
4) How to use the skills you have learned this week to tackle a real-world problem

In [79]:
!pip install openai pydantic datasets spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Importing the Dataset

First let's go ahead and import the dataset we will be using throughout this notebook, the American Stories dataset. Here, we are using a very small sample of this dataset.

In [1]:
from datasets import load_dataset

dataset = load_dataset("wjbmattingly/american-stories-sample-tap", split="train")

dataset


Dataset({
    features: ['article_id', 'newspaper_name', 'edition', 'date', 'page', 'headline', 'byline', 'article'],
    num_rows: 1000
})

Just like before, we will be working with this article.

In [2]:
print(dataset[2]['article'])

British. Surrender to Nozs Dotes

 Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;

 After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked

 WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1


## Setting up OpenAI API

To follow along with this notebook, remember to set up your OpenAI credentials and use your API key. We will first import the required libraries.

In [5]:
from openai import OpenAI
import os
from pydantic import BaseModel

Now, let's connect to the client.

In [4]:
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    # api_key=""
)
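`os.environ.get` returns `None` when the variable is missing, so it can help to check for the key explicitly and fail with a clear message. The sketch below is one way to do that; `get_api_key` is a hypothetical helper name and is not part of the `openai` library.

```python
import os

def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or notebook "
            "environment before creating the client."
        )
    return key

# Demonstration with a stand-in mapping instead of the real environment.
print(get_api_key({"OPENAI_API_KEY": "sk-example"}))
```

With this helper in place, the client could be created as `OpenAI(api_key=get_api_key())`.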

## Adding other Options for the Model to Consider

In the previous notebook, we saw the benefits of few-shot learning with LLMs and well-crafted prompts. Here, I'd like to demonstrate another approach. We will be using additional classes, namely Person. The idea is that additional classes give the model other options for entities that it might otherwise mislabel. This can sometimes improve the NER output of a model. To begin, we will create two classes: Location and Person. Note that we are using a slightly different NER prompt than we did in the previous notebook.

Finally, we will have a third class: Entities. This will hold the entities found in the text, grouped into a list of locations and a list of people.

In [61]:

ner_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies entities in a text."
    },
    {
        "role": "user",
        "content": f"Find entities in this text: {dataset[2]['article']}"
    }
]

class Location(BaseModel):
    text: str

class Person(BaseModel):
    text: str

class Entities(BaseModel):
    locations: list[Location]
    people: list[Person]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=ner_messages,
    response_format=Entities
)

# Iterating over a Pydantic model yields (field_name, value) pairs
for field in response.choices[0].message.parsed:
    print(field)

('locations', [Location(text='British'), Location(text='Nozs Dotes'), Location(text='ANor'), Location(text='Itoly'), Location(text='Ethiopio'), Location(text='France'), Location(text='Rhnelond'), Location(text='WASHINGTON')])
('people', [Person(text='Hitler'), Person(text='Franco')])


If you run this cell multiple times, you will notice that we (as expected) have rather inconsistent output. One of the main things I have noticed is that this model continues to assign things like `British` to Person and Location unpredictably.

## **EXERCISE 1** (10 minutes)
Your exercise is to solve the above noted problem. Based on what you've learned so far, come up with a solution that will prevent this from happening.

In [58]:

improved = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies entities in a text."
    },
    {
        "role": "user",
        "content": f"Find entities in this text: {dataset[2]['article']}"
    }
]

class Location(BaseModel):
    text: str

class Person(BaseModel):
    text: str

class Entities(BaseModel):
    locations: list[Location]
    people: list[Person]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=improved,
    response_format=Entities
)

print(response.choices[0].message.content)

{"locations":[{"text":"Britain"},{"text":"Nozs Dotes"},{"text":"Hitler"},{"text":"Itoly"},{"text":"Ethiopio"},{"text":"France"},{"text":"Rhnelond"},{"text":"Franco"},{"text":"WASHINGTON"}],"people":[{"text":"Hitler"},{"text":"Franco"}]}
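
The two runs printed so far already disagree: `Hitler` and `Franco` appear under both `locations` and `people` in the second run. One way to make this kind of inconsistency measurable is to tally the labels each entity receives across runs. The sketch below uses only the standard library; the two JSON strings are abbreviated versions of the outputs above (the first rewritten as JSON), and `label_votes` is a hypothetical helper name.

```python
import json
from collections import Counter, defaultdict

# Abbreviated copies of the two runs printed above.
run_1 = (
    '{"locations":[{"text":"British"},{"text":"France"},{"text":"WASHINGTON"}],'
    '"people":[{"text":"Hitler"},{"text":"Franco"}]}'
)
run_2 = (
    '{"locations":[{"text":"Britain"},{"text":"Hitler"},{"text":"France"},'
    '{"text":"Franco"},{"text":"WASHINGTON"}],'
    '"people":[{"text":"Hitler"},{"text":"Franco"}]}'
)

def label_votes(runs):
    """Count, for each entity string, how often each label was assigned."""
    votes = defaultdict(Counter)
    for raw in runs:
        for label, items in json.loads(raw).items():
            for item in items:
                votes[item["text"]][label] += 1
    return votes

votes = label_votes([run_1, run_2])
for text, counts in votes.items():
    if len(counts) > 1:
        print(f"{text!r} received conflicting labels: {dict(counts)}")
```

Running the same prompt several times and collecting each `message.content` into the list would give a rough picture of which entities the model is unsure about.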


## Importance of OCR Correction?

One of the main reasons the model struggles with the text in question is that it is being given a poorly OCRed text. OCR correction is an area of active research, especially with SLMs, or small language models. LLMs and SLMs can perform OCR correction, but one of the main issues is that these are generative models, meaning they can hallucinate and do so in entirely unpredictable ways. They can sometimes convert bad OCR of early-modern Italian into proper modern Italian. It's important to understand that the output of LLMs, if used for OCR correction, should be manually validated.
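
Manual validation does not scale to every article, so a rough triage step can help decide which corrections to look at first. The sketch below flags a correction when its character-level similarity to the OCR source falls below a threshold. The threshold of 0.6 is an assumed starting point, `needs_review` is a hypothetical helper name, and the "heavy rewrite" string is invented for illustration. A check like this only catches large drift; it will not notice a single wrong word or digit.

```python
from difflib import SequenceMatcher

def needs_review(original, corrected, threshold=0.6):
    """Flag a correction whose character-level similarity to the source is low.

    The threshold is an assumed starting point, not a validated value.
    """
    similarity = SequenceMatcher(None, original, corrected).ratio()
    return similarity < threshold

ocr_line = "Itoly. Toke Ethiopio;"
light_fix = "Italy Take Ethiopia;"
# Hypothetical over-eager rewrite, invented for illustration.
heavy_rewrite = "Italy invaded and annexed Ethiopia in 1936, which emboldened Hitler."

print(needs_review(ocr_line, light_fix))      # small edits stay above the threshold
print(needs_review(ocr_line, heavy_rewrite))  # a rewrite drifts far from the source
```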

For the purposes of today, though, let's go ahead and use an LLM for OCR correction. We will craft a slightly different prompt and Pydantic model.

In [32]:

ocr_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that cleans OCR errors in a text."
    },
    {
        "role": "user",
        "content": f"Clean OCR errors in this text: {dataset[2]['article']}"
    }
]

class CorrectedText(BaseModel):
    text: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=ocr_messages,
    response_format=CorrectedText
)

print(response.choices[0].message.content)

{"text":"British Surrender to Nazis Dotes\n\nBack to Three Post-War Decisions: Hitler Saw They Would Not Fight After They Let Italy Take Ethiopia;\n\nAfter France Let Them Take Rhineland; And When Franco Was Not Checked\n\nWASHINGTON, September 9. - Here are two thumbnail sketches of history which should be kept in view."}


Excellent! Let's now grab this as structured data.

In [33]:
print(response.choices[0].message.parsed)

text='British Surrender to Nazis Dotes\n\nBack to Three Post-War Decisions: Hitler Saw They Would Not Fight After They Let Italy Take Ethiopia;\n\nAfter France Let Them Take Rhineland; And When Franco Was Not Checked\n\nWASHINGTON, September 9. - Here are two thumbnail sketches of history which should be kept in view.'


And now let's store it in an object we can call later in our script. We will call this `corrected_text`.

In [34]:
corrected_text = response.choices[0].message.parsed.text
print(corrected_text)

British Surrender to Nazis Dotes

Back to Three Post-War Decisions: Hitler Saw They Would Not Fight After They Let Italy Take Ethiopia;

After France Let Them Take Rhineland; And When Franco Was Not Checked

WASHINGTON, September 9. - Here are two thumbnail sketches of history which should be kept in view.


Let's compare it to our original.

In [35]:
print(dataset[2]['article'])

British. Surrender to Nozs Dotes

 Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;

 After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked

 WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1


In [64]:
from IPython.display import display, HTML

# Create HTML table with better formatting for comparison
html_comparison = f"""
<h2>Original vs Corrected Text</h2>
<table style="width:100%; border-collapse: collapse;">
    <tr>
        <th style="border: 1px solid black; padding: 8px; text-align: left; background-color: #f2f2f2;">Original Text</th>
        <th style="border: 1px solid black; padding: 8px; text-align: left; background-color: #f2f2f2;">Corrected Text</th>
    </tr>
    <tr>
        <td style="border: 1px solid black; padding: 8px; vertical-align: top; white-space: pre-wrap;">{dataset[2]['article']}</td>
        <td style="border: 1px solid black; padding: 8px; vertical-align: top; white-space: pre-wrap;">{corrected_text}</td>
    </tr>
</table>
"""

display(HTML(html_comparison))

| Original Text | Corrected Text |
| --- | --- |
| British. Surrender to Nozs Dotes  Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;  After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked  WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1 | British Surrender to Nazis Dotes Back to Three Post-War Decisions: Hitler Saw They Would Not Fight After They Let Italy Take Ethiopia; After France Let Them Take Rhineland; And When Franco Was Not Checked WASHINGTON, September 9. - Here are two thumbnail sketches of history which should be kept in view. |
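
A side-by-side view is easy to read, but it leaves the reader to spot each change by eye. A word-level diff lists exactly what the model altered, which gives a checklist for manual validation. The sketch below uses `difflib` from the standard library on the two strings shown above; `word_changes` is a hypothetical helper name. Each listed change, such as `g.` becoming `9.`, is a guess by the model that can then be checked against the page image.

```python
from difflib import SequenceMatcher

original = (
    "British. Surrender to Nozs Dotes\n\n"
    " Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight "
    "After They Let Itoly. Toke Ethiopio;\n\n"
    " After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked\n\n"
    " WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history "
    "which should be kept in Av1"
)
corrected = (
    "British Surrender to Nazis Dotes\n\n"
    "Back to Three Post-War Decisions: Hitler Saw They Would Not Fight "
    "After They Let Italy Take Ethiopia;\n\n"
    "After France Let Them Take Rhineland; And When Franco Was Not Checked\n\n"
    "WASHINGTON, September 9. - Here are two thumbnail sketches of history "
    "which should be kept in view."
)

def word_changes(before, after):
    """Return (old, new) pairs for every word span the correction changed."""
    a, b = before.split(), after.split()
    changes = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

for old, new in word_changes(original, corrected):
    print(f"{old!r} -> {new!r}")
```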


It's always a good idea to do manual validation against the original source image. To find that image, we can check the article's metadata.

In [36]:
dataset[2]

{'article_id': '42_1938-09-23_p8_sn82014085_00393347417_1938092301_0313',
 'newspaper_name': 'The Waterbury Democrat.',
 'edition': '01',
 'date': '1938-09-23',
 'page': 'p8',
 'headline': 'Daily fA\n\n Washington',
 'byline': 'Ana ROnERT q ALLElg\n\nRy DREW PEARSON',
 'article': 'British. Surrender to Nozs Dotes\n\n Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;\n\n After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked\n\n WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1'}

As we can see, it's from `The Waterbury Democrat`, edition 01, page 8. If we go to the link below, we can see the page image. The relevant section is shown here.

You can find the full article [here](https://www.loc.gov/resource/sn82014085/1938-09-23/ed-1/?sp=8&r=0.624,0.033,0.482,0.354,0).

![article image](../assets/images/article.png)
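
The page link can be assembled from the metadata fields rather than looked up by hand. The sketch below is inferred from the single record and link shown above, so both the position of the LCCN inside `article_id` and the URL pattern are assumptions that may not hold for every record. `loc_page_url` is a hypothetical helper name.

```python
def loc_page_url(record):
    """Build a Library of Congress page URL from one dataset record.

    Assumes the LCCN is the fourth underscore-separated part of article_id
    and that the URL follows the pattern of the single link shown above.
    """
    lccn = record["article_id"].split("_")[3]
    edition = int(record["edition"])         # '01' -> 1
    page = int(record["page"].lstrip("p"))   # 'p8' -> 8
    return (
        f"https://www.loc.gov/resource/{lccn}/{record['date']}"
        f"/ed-{edition}/?sp={page}"
    )

# Only the fields the function needs, copied from dataset[2].
record = {
    "article_id": "42_1938-09-23_p8_sn82014085_00393347417_1938092301_0313",
    "edition": "01",
    "date": "1938-09-23",
    "page": "p8",
}
print(loc_page_url(record))
```

The result points at the whole page; the `r=` parameter in the link above, which zooms to a region, is not reconstructed here.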

In [39]:
print(corrected_text)

British Surrender to Nazis Dotes

Back to Three Post-War Decisions: Hitler Saw They Would Not Fight After They Let Italy Take Ethiopia;

After France Let Them Take Rhineland; And When Franco Was Not Checked

WASHINGTON, September 9. - Here are two thumbnail sketches of history which should be kept in view.


In our case, we were lucky: the corrected OCR mirrors the original image fairly closely, even though the model never saw it.

## Using the Corrected OCR Output

Now that we have used one LLM call to correct the OCR, let's use another to solve the same NER problem.

In [66]:

corrected_ocr_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies entities in a text."
    },
    {
        "role": "user",
        "content": f"Find entities in this text: {corrected_text}"
    }
]

class Location(BaseModel):
    text: str

class Person(BaseModel):
    text: str

class Nationality(BaseModel):
    text: str


class Entities(BaseModel):
    locations: list[Location]
    people: list[Person]
    nationalities: list[Nationality]

    
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=corrected_ocr_messages,
    response_format=Entities
)

print(response.choices[0].message.content)

{"locations":[{"text":"Ethiopia"},{"text":"Rhineland"},{"text":"Washington"}],"people":[{"text":"Hitler"},{"text":"Franco"}],"nationalities":[{"text":"British"},{"text":"Nazis"},{"text":"French"}]}


While your results will be a bit inconsistent, they should be slightly better. One error I continue to see is "Nazis" being labeled as a nationality. In some contexts, I can understand this, but we may want it classified as "Organization" or maybe even "Political Party". You can always improve upon these classes as you see fit.

## Drawing Relationships between Entities

Extracting entities is useful, but a far more challenging task is understanding *how* those entities relate to one another. Traditionally, humans would manually go through a text and construct a knowledge graph from triples. A triple is a set of three things: subject, relationship, and object. Triples can be unidirectional: Mary (Subject) is the mother of (Relationship) Susan (Object). They can also be bidirectional: Mary (Subject) is related to (Relationship) Susan (Object), and the reverse is also true.
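
To make the difference between directed and symmetric relationships concrete, here is a minimal sketch that stores triples in a plain dictionary. It uses only the standard library; `SimpleTriple` and `build_graph` are hypothetical names, and which relationships count as symmetric is something the caller has to decide.

```python
from collections import defaultdict
from typing import NamedTuple

class SimpleTriple(NamedTuple):
    subject: str
    relationship: str
    object: str

def build_graph(triples, symmetric=()):
    """Map each subject to its outgoing (relationship, object) edges.

    Relationships named in `symmetric` are stored in both directions.
    """
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.relationship, t.object))
        if t.relationship in symmetric:
            graph[t.object].append((t.relationship, t.subject))
    return dict(graph)

triples = [
    SimpleTriple("Mary", "is the mother of", "Susan"),
    SimpleTriple("Mary", "is related to", "Susan"),
]
graph = build_graph(triples, symmetric={"is related to"})
print(graph)
```

Here "is the mother of" produces an edge only from Mary to Susan, while "is related to" produces an edge in each direction.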

When we understand the relationships between the entities in our text, we have a much better understanding of how those entities function within it. Doing this without LLMs can be very challenging and can require a lot of domain expertise. LLMs make this problem a little easier. To see this, let's first grab our entities.

In [44]:
entities = response.choices[0].message.content

Now, let's use these entities to extract a knowledge graph of triples from our text.

In [76]:

triples_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that draws relationships between entities in a text."
    },
    {
        "role": "user",
        "content": f"Draw relationships between entities in this text: {corrected_text}. Here are the entities: {entities}"
    }
]

class Entity(BaseModel):
    text: str

class Relationship(BaseModel):
    relationship: str


class Triple(BaseModel):
    subject: Entity
    relationship: Relationship
    object: Entity

class Triples(BaseModel):
    triples: list[Triple]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=triples_messages,
    response_format=Triples
)

print(response.choices[0].message.content)

{"triples":[{"subject":{"text":"Italy"},"relationship":{"relationship":"took"},"object":{"text":"Ethiopia"}},{"subject":{"text":"France"},"relationship":{"relationship":"let take"},"object":{"text":"Rhineland"}},{"subject":{"text":"Franco"},"relationship":{"relationship":"was not checked"},"object":{"text":""}}]}


Let's make it a little easier to read this.

In [75]:
for triple in response.choices[0].message.parsed.triples:
    print(triple)
    print()

subject=Entity(text='Italy') relationship=Relationship(relationship='took') object=Entity(text='Ethiopia')

subject=Entity(text='France') relationship=Relationship(relationship='let them take') object=Entity(text='Rhineland')

subject=Entity(text='Franco') relationship=Relationship(relationship='was not checked') object=Entity(text='')

subject=Entity(text='Hitler') relationship=Relationship(relationship='saw they would not fight after') object=Entity(text='Italy took Ethiopia')

subject=Entity(text='Hitler') relationship=Relationship(relationship='saw they would not fight after') object=Entity(text='France let them take Rhineland')

subject=Entity(text='Hitler') relationship=Relationship(relationship='saw they would not fight after') object=Entity(text='Franco was not checked')
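
Not every triple the model returns is usable as a graph edge. In the output above, the `Franco` triple has an empty object. The sketch below separates complete triples from incomplete ones before any graph is built. It works on the JSON string printed earlier and uses only the standard library; `split_triples` is a hypothetical helper name.

```python
import json

# The JSON printed by the triples cell above.
raw = (
    '{"triples":['
    '{"subject":{"text":"Italy"},"relationship":{"relationship":"took"},'
    '"object":{"text":"Ethiopia"}},'
    '{"subject":{"text":"France"},"relationship":{"relationship":"let take"},'
    '"object":{"text":"Rhineland"}},'
    '{"subject":{"text":"Franco"},"relationship":{"relationship":"was not checked"},'
    '"object":{"text":""}}]}'
)

def split_triples(raw_json):
    """Return (edges, skipped): complete triples and those missing a part."""
    edges, skipped = [], []
    for t in json.loads(raw_json)["triples"]:
        edge = (
            t["subject"]["text"].strip(),
            t["relationship"]["relationship"].strip(),
            t["object"]["text"].strip(),
        )
        # A triple is only usable if all three parts are non-empty.
        (edges if all(edge) else skipped).append(edge)
    return edges, skipped

edges, skipped = split_triples(raw)
print("usable:", edges)
print("needs review:", skipped)
```

The skipped triples are worth keeping rather than discarding, since they often point at sentences where the relationship has no explicit second entity.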



## Extracting Complex Entities and their Metadata

LLMs also allow us to solve other complex types of tasks that leverage NER. Imagine we needed to identify the women mentioned in the following text. What are some of the challenges you can foresee in this problem?

In [52]:
print(dataset[500]["article"])

MANY prominent persons in Wash
@ ington have consented to act as
patronesses for the benefit perform
ances of "Once Is Enough" Tuesday
evening. sponsored by the Washington
Bryn MaNr Club. The proceeds will
60 to scholarship to the college for
Washington student and the Bry1
Mavr summer school for women work
ers in industry.


Nine. Troyanovsky, wife of the
Soviet Ambassador, has taken a box
and others who will entertain box
parties include Mrs. Carroll Miller, DR..
Ethel Dunham. Mrs. Jacob Simpson
Payton, Mrs. Edward B. Meigs, Mrs
Howell Moorhead and Mrs. Thomas
McAllister.


The patroness list. headed by Mrs.
Franklin Delano Roosevelt, includes
Mrs. Cordell Hull. Lady Lindsay,
Madame Troyanovsky, Sonora de los
Rios, Frau Dieckhoff, Madame Peter
and Madame Pelonyi Also wives of
associate justices of the Supreme
Court. Mrs. Louis D. Brandeis, Mrs.
Pierce Butler, Mrs. Harlan Fiske Stone
and Mrs. Hugo Black.


Mrs. Henry Morgenthau, jr., wife
OF the Secretary of the Treasury; the
Secretar

## **EXERCISE 2** (10 minutes)

Read the text and think through a solution to the following problem. Find and identify the women referenced in this text, while separating them from their spouses. You don't need to write code here. Come up with a conceptual way to approach the problem given what you have already learned.

In [None]:
from typing import Optional

ner_women = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies women in a text. Only return the names of women who are clearly identified by either pronoun or honorifics. If her name is attached to her spouse, separate her spouse's name from hers. This is true for examples like 'Mrs. John Smith' or 'John Smith and his wife, Mrs. John Smith'."
    },
    {
        "role": "user",
        "content": f"Identify women in this text: {dataset[500]['article']}"
    }
]

class Woman(BaseModel):
    text: str

class Spouse(BaseModel):
    text: str

# Named WomanRecord so it does not redefine the Woman class above
class WomanRecord(BaseModel):
    woman_name: Woman
    spouse: Optional[Spouse]

class Women(BaseModel):
    women: list[WomanRecord]

    
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=ner_women,
    response_format=Women
)

for woman in response.choices[0].message.parsed.women:
    print(woman)

woman_name=Woman(text='Nine Troyanovsky') spouse=Spouse(text='Troyanovsky')
woman_name=Woman(text='Mrs. Carroll Miller') spouse=None
woman_name=Woman(text='Ethel Dunham') spouse=None
woman_name=Woman(text='Mrs. Jacob Simpson Payton') spouse=Spouse(text='Jacob Simpson Payton')
woman_name=Woman(text='Mrs. Edward B. Meigs') spouse=Spouse(text='Edward B. Meigs')
woman_name=Woman(text='Mrs. Howell Moorhead') spouse=Spouse(text='Howell Moorhead')
woman_name=Woman(text='Mrs. Thomas McAllister') spouse=Spouse(text='Thomas McAllister')
woman_name=Woman(text='Mrs. Franklin Delano Roosevelt') spouse=Spouse(text='Franklin Delano Roosevelt')
woman_name=Woman(text='Mrs. Cordell Hull') spouse=Spouse(text='Cordell Hull')
woman_name=Woman(text='Lady Lindsay') spouse=None
woman_name=Woman(text='Madame Troyanovsky') spouse=Spouse(text='Troyanovsky')
woman_name=Woman(text='Sonora de los Rios') spouse=None
woman_name=Woman(text='Frau Dieckhoff') spouse=None
woman_name=Woman(text='Madame Peter') spouse=None
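
The output above is not fully consistent: `Mrs. Carroll Miller` has no spouse recorded while similar entries do, and `woman_name` usually still carries the husband's name. Simple rule-based checks can surface records like these for review. The sketch below uses only the standard library and a few pairs copied from the output; `review_flags` is a hypothetical helper name, and the two rules are assumptions about what counts as suspicious.

```python
def review_flags(woman_name, spouse):
    """Return reasons why one extracted record may need a second look."""
    flags = []
    if woman_name.startswith("Mrs") and spouse is None:
        flags.append("'Mrs.' form but no spouse recorded")
    if spouse is not None and spouse in woman_name:
        flags.append("woman_name still contains the spouse's name")
    return flags

# A few (woman_name, spouse) pairs copied from the output above.
records = [
    ("Mrs. Carroll Miller", None),
    ("Ethel Dunham", None),
    ("Mrs. Jacob Simpson Payton", "Jacob Simpson Payton"),
    ("Lady Lindsay", None),
    ("Madame Troyanovsky", "Troyanovsky"),
]

for name, spouse in records:
    for flag in review_flags(name, spouse):
        print(f"{name}: {flag}")
```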

## Exploring the New Models from OpenAI

While this notebook was being created and used (April 2025), several new OpenAI models were released. I encourage you to use one of the newer models and test out the workflows listed above.

In [78]:
for model in client.models.list():
    print(model.id)

gpt-4o-audio-preview-2024-12-17
dall-e-3
text-embedding-3-large
dall-e-2
o4-mini-2025-04-16
gpt-4o-audio-preview-2024-10-01
o4-mini
gpt-4.1-nano
gpt-4.1-nano-2025-04-14
gpt-4o-realtime-preview-2024-10-01
gpt-4o-realtime-preview
babbage-002
tts-1-hd-1106
gpt-4
text-embedding-ada-002
o1-2024-12-17
o1-pro-2025-03-19
o1
tts-1-hd
gpt-4o-mini-audio-preview
o1-pro
gpt-4o-audio-preview
o1-preview-2024-09-12
gpt-4o-mini-realtime-preview
gpt-4.1-mini
gpt-4o-mini-realtime-preview-2024-12-17
gpt-3.5-turbo-instruct-0914
gpt-4o-mini-search-preview
gpt-4.1-mini-2025-04-14
tts-1-1106
chatgpt-4o-latest
davinci-002
gpt-3.5-turbo-1106
gpt-4o-search-preview
gpt-4-turbo
gpt-4o-realtime-preview-2024-12-17
gpt-3.5-turbo-instruct
gpt-3.5-turbo
gpt-4-turbo-preview
gpt-4o-mini-search-preview-2025-03-11
gpt-4-0125-preview
gpt-4o-2024-11-20
whisper-1
gpt-4o-2024-05-13
gpt-4-turbo-2024-04-09
gpt-3.5-turbo-16k
o1-preview
gpt-4-0613
gpt-4.5-preview
gpt-4.5-preview-2025-02-27
gpt-4o-search-preview-2025-03-11
o3-mini
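
The full list is long and mixes dated snapshots, audio, image, and embedding models. A small filter can narrow it down to candidates worth trying. The sketch below keeps undated aliases from a chosen set of families; it works on a few IDs copied from the list above. The prefixes are an assumption about which families count as "newer", and the filter says nothing about whether a given model supports the structured-output calls used in this notebook.

```python
import re

# Matches a trailing date such as -2025-04-14 used for pinned snapshots.
DATED = re.compile(r"-\d{4}-\d{2}-\d{2}$")

# A few IDs copied from the list above.
model_ids = [
    "o4-mini-2025-04-16",
    "o4-mini",
    "gpt-4.1-nano",
    "gpt-4.1-nano-2025-04-14",
    "gpt-4.1-mini",
    "gpt-4.1-mini-2025-04-14",
    "gpt-4o-2024-11-20",
    "dall-e-3",
    "whisper-1",
]

def undated_with_prefix(ids, prefixes=("gpt-4.1", "o4")):
    """Keep undated aliases whose ID starts with one of the given prefixes."""
    return sorted(
        m for m in ids if m.startswith(prefixes) and not DATED.search(m)
    )

print(undated_with_prefix(model_ids))
```

In the notebook, `model_ids` could instead be built with `[m.id for m in client.models.list()]`.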
