<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [William Mattingly](https://www.wjbmattingly.com) for the 2025 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).
<br />
____

# Part 2: Leveraging LLMs for Named Entity Recognition

In this notebook, we will learn how to leverage LLMs to perform a specific task: Named Entity Recognition (NER). We will also learn some tricks for guiding LLMs to do what we want consistently.

Learning Objects

- Understanding NER as a task
- Understanding the different approaches to NER
- Understanding when to use an LLM for NER
- Understanding the Limitations of an LLM
- Understanding Pydantic and Data Models
- Understanding Structured Outputs from LLMs and their Importance


## Install Required Packages

In [102]:
!pip install openai pydantic datasets spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m48.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## Introduction to American Stories Dataset

In [2]:
from datasets import load_dataset

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
dataset = load_dataset("wjbmattingly/american-stories-sample-tap")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article_id', 'newspaper_name', 'edition', 'date', 'page', 'headline', 'byline', 'article'],
        num_rows: 1000
    })
})

In [5]:
dataset = dataset['train']

In [6]:
print(dataset[2]['article'])

British. Surrender to Nozs Dotes

 Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;

 After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked

 WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1


## What is NER

## Different ways to Solve NER

Named Entity Recognition (NER) can be approached through several different methods, each with its own strengths and use cases. One of the most basic approaches is using rule-based systems, where human experts define explicit patterns and rules to identify entities. Rule-based systems are highly precise for specific domains, easy to understand and debug, and don't require training data. However, they can be labor-intensive to create and maintain, and may struggle with novel or ambiguous cases that don't match predefined rules.

Regular expressions (regex) offer another approach to NER by defining text patterns to match entities. Regex is powerful for identifying entities with consistent formats like phone numbers, email addresses, or dates. It's fast, lightweight, and gives precise control over pattern matching. However, like rule-based systems, regex patterns can be complex to maintain and may not handle variations or context well.

Machine learning approaches, particularly deep learning models, have become increasingly popular for NER. These systems learn to identify entities from large amounts of annotated training data, capturing complex patterns and contextual relationships that would be difficult to define manually. ML models can generalize well to new examples and handle ambiguous cases, but they require significant training data and computational resources.

Spacy stands out as a versatile NLP library because it combines all these approaches. You can use Spacy's statistical models for machine learning-based NER, define rule-based patterns with its matcher components, and incorporate regex patterns where needed. This flexibility allows developers to choose the best approach for their specific use case, or even combine multiple approaches for optimal results. Additionally, Spacy's efficient implementation and easy-to-use API make it practical for both small and large-scale applications.

Let's take a look at what NER looks like in Python. First, we will import spacy


In [9]:
import spacy

Now, we will load the small English pipeline that has a machine learning NER component. We downloaded this earlier in the notebook.

In [103]:
nlp = spacy.load("en_core_web_sm")

Finally, we will run it over one of our articles.

In [104]:
doc = nlp(dataset[2]['article'])

I chose this article specifically because it is messy and represents real-world data, espcially humanities data. What are some of the problems you can identify?

In [11]:
print(doc)

British. Surrender to Nozs Dotes

 Back to Three Post-ANor Decisions; Hitler Saw They Would Not Fight After They Let Itoly. Toke Ethiopio;

 After France Let Them Toke Rhnelond;; M w And When Franco Was Not Checked

 WASHINGTON, September g. - Here are two. tl1vb-naN sketches of history which should be kept in Av1


To look at the NER, we can print off the entities and their labels.

In [105]:
for ent in doc.ents:
    print(ent.text, ent.label_)

British NORP
Surrender PERSON
Three CARDINAL
Toke Ethiopio PERSON
France GPE
Franco PERSON
WASHINGTON GPE
two CARDINAL


These results are... not great. This is expected. The reason they are bad is because we are using the small English model. If we used a larger model, things would improve. The main reason we are seeing based results, though, is because of data quality. Our data was OCRed where mistakes in the transcription remain. This is a very common problem. One of the biggest advantages of LLMs is that their robust knowledge of language (even in it's messy forms!) allows them to predict well on these types of problems.

## Using an LLM to Solve NER

Large Language Models (LLMs) offer several compelling advantages for Named Entity Recognition (NER) tasks. Their broad knowledge base and contextual understanding allow them to identify entities even in noisy or informal text where traditional NER models might fail. LLMs can also recognize novel or rare entities that weren't in their training data, and they can adapt to domain-specific terminology without requiring explicit retraining. This flexibility is particularly valuable when working with historical texts, social media content, or specialized technical documents.

However, LLMs come with significant drawbacks that need to be considered. They are computationally expensive and typically require API calls to cloud services, making them costly for large-scale processing. Their responses can also be inconsistent between runs, making them less suitable for applications requiring deterministic results. Additionally, the "black box" nature of LLMs makes it difficult to understand or debug their decision-making process, unlike traditional NER models where the features and weights are more transparent.

When choosing between LLMs and traditional NER approaches, consider your specific use case. LLMs are ideal for projects dealing with messy, inconsistent data or requiring flexible entity recognition across diverse domains. They're also excellent for prototyping or small-scale applications where accuracy is prioritized over speed or cost. However, for large-scale production systems processing millions of documents, or applications requiring consistent, reproducible results, traditional NER models might be more appropriate. The best solution might be a hybrid approach, using LLMs to handle complex cases while relying on traditional NER for straightforward, high-volume processing.

To use an LLM, let's first connect to the OpenAI API. We'll first import our libraries.


In [112]:
from openai import OpenAI
import os

And then connect to the API like we did in the previous notebook.

In [113]:
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    # api_key=""
)

Now, let's craft our first NER Messages list to the model. Here, we define a very clear system prompt and provide a text to exract entities from.

In [106]:
basic_ner_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies named entities in a text."
    },
    {
        "role": "user",
        "content": f"Here is a text: {dataset[2]['article']}"
    }
]

Now, let's use that variable to create a response.

In [15]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=basic_ner_messages
)

print(response.choices[0].message.content)

Here are the named entities identified in the text:

1. British
2. Nozs Dotes
3. Hitler
4. Itoly (likely a misspelling of Italy)
5. Ethiopio (likely a misspelling of Ethiopia)
6. France
7. Rhnelond (possibly a misspelling of Rhineland)
8. Franco
9. WASHINGTON
10. September

Note: There are several misspellings in the text that seem to be intended to refer to prominent entities. If you can provide more context or corrected spellings, it may help in more accurate identification.


While this output looks good (your experience will vary), it has some issues. First, I don't really know what these entities are, so what's a place, person, etc. Second, I don't have a consistent output that I can work with. I don't have structure that allows me to manipulate the data in a specific way. We'll learn how to tackle some of these challenges.

## **EXERCISE 1** (10 minutes)

Use this time to test out this approach. Use the prompting and few-shot approaches from last class to find only locations. Try and make those locations output in a consistent format.

## Introduction to Pydantic

Pydantic is a Python library for data validation and settings management using Python type annotations. It enforces type hints at runtime and provides user-friendly errors when data is invalid.

Key features of Pydantic include:
- Data validation using Python type annotations
- Automatic JSON schema generation
- Customizable validation rules
- Serialization/deserialization to and from JSON
- Integration with FastAPI and other frameworks

We'll use Pydantic to create structured outputs from our LLM responses, ensuring the data follows a consistent schema and format.

## Prompting with Structured Outputs

To begin working with Pydantic, we first need to import a data model from it. For our purposes, we will work with the BaseModel.

In [114]:
from pydantic import BaseModel

The BaseModel class is the foundation of Pydantic's data validation system. It allows you to define data models using Python type hints, which Pydantic then uses to validate data at runtime. When you create a class that inherits from BaseModel, you can define attributes with type annotations, and Pydantic will automatically validate any data assigned to those attributes to ensure it matches the specified types. This makes it easy to create structured, type-safe data models that can be used to validate and serialize/deserialize data.

Let's create a simple example. Imagine we wanted to just find the locations in the same text as above. We can create a BasicLocation data model like this:

In [115]:
class BasicLocation(BaseModel):
    text: str

A good way to think about this model is as a dictionary:

```python
{"text": ""}
```

This is essentially the exact same thing as the data model above. Now, how do we use it? Well let's first create a new prompt that will be a bit more specific to finding locations.

In [117]:

locations_ner_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies locations in a text. Locations are physical places, like cities, states, countries, etc. They are not people, organizations, or other entities. They are not groups of people, like British or American."
    },
    {
        "role": "user",
        "content": f"Find only the locations in this text: {dataset[2]['article']}"
    }
]

Now let's go ahead and use the model. Notice here that we are using a different endpoint, specificially `client.beta.chat.completions.parse`. This allows us to pass an extra argument: `response_format` which can take a Pydantic data model.

In [None]:
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=BasicLocation
)

print(response.choices[0].message.content)

{"text":"Three, ANor, Italy, Ethiopia, France, Rhineland, Washington, Av1"}


The example above has some right answers and some wrong answers. This isn't good. We can rerun this several times and we'll get different (and some better) results. But this is impractical for one main reason: the data is just a list of names provided as a single string. This isn't well-structured data for entities. We want our data to be a list of entities, not a continuous string.

To fix this, we can make another data model that will be a list of our BasicLocations

In [119]:
class LocationList(BaseModel):
    locations: list[BasicLocation]

In [120]:
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=LocationList
)

print(response.choices[0].message.content)

{"locations":[{"text":"Italy"},{"text":"Ethiopia"},{"text":"France"},{"text":"Rhineland"},{"text":"Washington"}]}


Now we have something that is starting to look a bit better. Notice that we have a key of "locations" which points to a list of dictionaries. These dictionaries all have one key, "text" that points to the text of the location. But we have a problem. The data doesn't tell us if this is the corrected OCR or the raw text found inside the article. This can make it challenging to validate any output.

To fix this, we can add extra variables to our Location class, such as "original_text" and "corrected_ocr".

In [121]:
class LocationCorrected(BaseModel):
    original_text: str
    corrected_ocr: str

class LocationListCorrected(BaseModel):
    locations: list[LocationCorrected]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=Locations
)

print(response.choices[0].message.content)

{"locations":[{"original_text":"Italy","corrected_ocr":"Italy","corrected":true,"type":"country"},{"original_text":"Ethiopio","corrected_ocr":"Ethiopia","corrected":true,"type":"country"},{"original_text":"France","corrected_ocr":"France","corrected":false,"type":"country"},{"original_text":"Rhnelond","corrected_ocr":"Rhineland","corrected":true,"type":"other"},{"original_text":"Franco","corrected_ocr":"Franco","corrected":false,"type":"other"},{"original_text":"WASHINGTON","corrected_ocr":"Washington","corrected":true,"type":"other"}]}


We can even modify this further to have a boolean to let us know if the data was in fact corrected.

In [122]:
class LocationCorrected(BaseModel):
    original_text: str
    corrected_ocr: str
    was_corrected: bool

class LocationListCorrected(BaseModel):
    locations: list[LocationCorrected]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=LocationListCorrected
)

print(response.choices[0].message.content)

{"locations":[{"original_text":"British","corrected_ocr":"British","was_corrected":false},{"original_text":"Nozs Dotes","corrected_ocr":"Nazi Germany","was_corrected":true},{"original_text":"Itoly","corrected_ocr":"Italy","was_corrected":true},{"original_text":"Ethiopio","corrected_ocr":"Ethiopia","was_corrected":true},{"original_text":"France","corrected_ocr":"France","was_corrected":false},{"original_text":"Rhnelond","corrected_ocr":"Rhineland","was_corrected":true},{"original_text":"Franco","corrected_ocr":"Spain","was_corrected":true},{"original_text":"WASHINGTON","corrected_ocr":"WASHINGTON","was_corrected":false}]}


Will this always work? Absolutely not. In fact, if you examine the results above, I'm certain you will see mistakes. And, if you don't then simply run the cell above a few more times you will begin to see a few.

Before we move forward, I thought it best to also have you examine the type of data that we see above. This looks like a dictionary, right? But if we examine it, we will find that it is actually a string.

In [123]:
type(response.choices[0].message.content)

str

At this stage, we could use json.loads to conver the data into JSON or we can use the parsed structured data that the OpenAI API provides us:

In [60]:
print(response.choices[0].message.parsed)

locations=[Location(original_text='Italy', corrected_ocr='ltoly', corrected=True), Location(original_text='Ethiopia', corrected_ocr='Ethiopio', corrected=True), Location(original_text='France', corrected_ocr='France', corrected=False), Location(original_text='Rhineland', corrected_ocr='Rhnelond', corrected=True), Location(original_text='Washington', corrected_ocr='WASHINGTON', corrected=True)]


What if we wanted to get more granularity about the data we are extracting. We could add `type`.

In [62]:
class Location(BaseModel):
    original_text: str
    corrected_ocr: str
    corrected: bool
    type: str

class Locations(BaseModel):
    locations: list[Location]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=Locations
)
print(response.choices[0].message.parsed)

locations=[Location(original_text='Italy', corrected_ocr='Itoly', corrected=True, type='country'), Location(original_text='Ethiopia', corrected_ocr='Ethiopio', corrected=True, type='country'), Location(original_text='France', corrected_ocr='France', corrected=False, type='country'), Location(original_text='Rhineland', corrected_ocr='Rhnelond', corrected=True, type='region'), Location(original_text='WASHINGTON', corrected_ocr='WASHINGTON', corrected=False, type='city')]


Unfortunately, `type` isn't very clear. What do we mean by `type`? This is where we can further add clarity in our schema by providing the model with descriptions. To get descriptions, we can use the `Field` class from Pydantic and pass a description to it.

In [127]:
from pydantic import Field

class Location(BaseModel):
    original_text: str = Field(description="The original text of the location")
    corrected_ocr: str = Field(description="The corrected OCR of the location")
    corrected: bool = Field(description="Whether the location was corrected")
    type: str = Field(description="The type of location, either 'city', 'state', 'country', or 'other'")

class Locations(BaseModel):
    locations: list[Location] = Field(description="A list of locations")

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=locations_ner_messages,
    response_format=Locations
)
print(response.choices[0].message.parsed)

locations=[Location(original_text='Ethiopio', corrected_ocr='Ethiopia', corrected=True, type='country'), Location(original_text='France', corrected_ocr='France', corrected=False, type='country'), Location(original_text='Rhnelond', corrected_ocr='Rhineland', corrected=True, type='other'), Location(original_text='WASHINGTON', corrected_ocr='Washington', corrected=False, type='other')]


## **EXERCISE 2** (10 Minutes)

Use the article below. Try and create a Pydantic data model that can find and extract the people in the text. If you succeed at this, try and also capture their professional title and roles.

In [100]:
print(dataset[39]['article'])

BY the Associated Press.


CHICAGO, Nov. 10.-Assistant
Attorney General Thurman Arnold
declared yesterday that dominant
American business" was to blame
for defense production lag.


In an N. B. c. radio address on
the University of Chicago Round
Table Mr.. Arnold said that "for
the first 10 months our defense
effort was hampered by the fear of
expansion of the production Of basic
materials"


Businessmen, he said, 'indulg-
ing in wishful thinking. concealed
shortages by overoptimistic predic-
tons of supply.


Il would still insist that the gen.
eral attitude of dominant American
business. fearing overproduction
after the war, was responsible for
this lag in production"


Leo M. Cherne, director Of the
Research Institute of America. told
the same audience that labor also
was partly responsible and urged
that some sort of legislation is
needed to restrict labor's demands
to the purely legitimate needs re-
lating to hours. wages and condi-
tions of work.


Mr. Cherne estimated that Amer


## Structured Outputs and Few-Shot

We can improve our outputs even more by providing the model with 1 or more examples.

In [128]:
few_shot_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that identifies locations in a text. Locations are physical places, like cities, states, countries, etc. They are not people, organizations, or other entities. They are not groups of people, like British or American."
    },
    {
        "role": "user",
        "content": "Here is a text: The Nazis invaded Paland in 1938. The British were in Britain."},
    {
        "role": "assistant",
        "content": "[{\"original_text\": \"Paland\", \"corrected_ocr\": \"Poland\", \"corrected\": true, \"type\": \"country\"}, {\"original_text\": \"Britain\", \"corrected_ocr\": \"Britain\", \"corrected\": false, \"type\": \"country\"}]"},
    {
        "role": "user",
        "content": f"Find only the locations in this text: {dataset[2]['article']}"
    }
]
class Location(BaseModel):
    original_text: str = Field(description="The original text of the location")
    corrected_ocr: str = Field(description="The corrected OCR of the location")
    corrected: bool = Field(description="Whether the location was corrected")
    type: str = Field(description="The type of location, either 'city', 'state', 'country', or 'other'")

class Locations(BaseModel):
    locations: list[Location] = Field(description="A list of locations")

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=few_shot_messages,
    response_format=Locations
)
print(response.choices[0].message.parsed)

locations=[Location(original_text='Italy', corrected_ocr='Italy', corrected=True, type='country'), Location(original_text='Ethiopio', corrected_ocr='Ethiopia', corrected=True, type='country'), Location(original_text='France', corrected_ocr='France', corrected=False, type='country'), Location(original_text='Rhnelond', corrected_ocr='Rhineland', corrected=True, type='other'), Location(original_text='Washington', corrected_ocr='Washington', corrected=False, type='other')]
