## Named Entity Recognition

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task: identifying named entities (such as people, places, and organizations) in text and classifying them by type. For example, in the sentence "Shakespeare wrote Romeo and Juliet in London", a NER system would identify "Shakespeare" as a person, "Romeo and Juliet" as a work of art, and "London" as a location. NER extracts structured information from unstructured text, which makes it valuable for information retrieval, question answering, and metadata enrichment. In this notebook, we'll perform NER with a Large Language Model rather than a traditional NLP pipeline.

## Rationale

This notebook demonstrates how to use OpenAI's GPT models to perform Named Entity Recognition (NER) by converting input text into annotated markdown. Rather than relying on a traditional NLP library to detect entities, we leverage a Large Language Model's natural language understanding to identify and classify them; spaCy is used only to visualize the result. The notebook takes plain text as input and outputs markdown in which entities are annotated in the format [Entity](TYPE), such as [London](LOCATION). This approach shows how LLMs can be applied to structured information extraction tasks such as cultural heritage metadata enrichment.


## Process Overview

The process consists of the following steps:
1. **Text Input**: We start with plain text that needs entity recognition
2. **LLM Processing**: The text is sent to a GPT model with a prompt that instructs it to identify entities
3. **Entity Annotation**: The LLM marks entities in markdown format: [Entity](TYPE)
4. **Visualization**: The annotated text is displayed with color-coded entity highlighting

This approach leverages the LLM's natural language understanding while producing structured, machine-readable output.
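To make the "machine-readable" part concrete, here is a minimal sketch of reading the `[Entity](TYPE)` format back out of a string with a regular expression. The function name `extract_annotations` and the sample sentence are illustrative assumptions, not part of the notebook's own code.

```python
import re

# Matches [Entity](TYPE): group 1 is the entity text, group 2 is the label.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_annotations(annotated):
    """Return (entity_text, label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(annotated)]


sample = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in [Rome](LOCATION)."
pairs = extract_annotations(sample)
print(pairs)
# [('Tom', 'PERSON'), ('Microsoft', 'ORGANIZATION'), ('Rome', 'LOCATION')]
```

Because the format is plain markdown link syntax, the annotated text stays readable for humans while remaining easy to parse.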

## Necessary Functions for Visualizations

In [9]:
import re
import spacy
from spacy.tokens import Doc, Span


def annotated_text_to_spacy_doc(text, nlp=None):
    """
    Converts annotated text in format [Entity](LABEL) to a spaCy Doc with entity spans.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
    
    Returns:
        spacy.tokens.Doc: spaCy document with entity spans set
        
    Example:
        >>> text = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in 2020 before he lived in [Rome](LOCATION)."
        >>> doc = annotated_text_to_spacy_doc(text)
        >>> spacy.displacy.render(doc, style="ent")
    """
    if nlp is None:
        nlp = spacy.blank("en")
    
    # Pattern to match [text](LABEL) format
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    
    # Parse the text to extract tokens and entity information
    tokens = []
    entity_spans = []  # List of (start_token_idx, end_token_idx, label)
    custom_labels = set()
    
    # Split text by the pattern and process each part
    last_end = 0
    token_idx = 0
    
    for match in re.finditer(pattern, text):
        # Add tokens before the entity
        before_entity = text[last_end:match.start()]
        if before_entity.strip():
            # Tokenize the text before the entity
            before_tokens = before_entity.split()
            tokens.extend(before_tokens)
            token_idx += len(before_tokens)
        
        # Add the entity tokens
        entity_text = match.group(1)
        entity_label = match.group(2)
        custom_labels.add(entity_label)
        
        # Tokenize the entity text
        entity_tokens = entity_text.split()
        start_token_idx = token_idx
        tokens.extend(entity_tokens)
        token_idx += len(entity_tokens)
        end_token_idx = token_idx
        
        # Store entity span information
        entity_spans.append((start_token_idx, end_token_idx, entity_label))
        
        last_end = match.end()
    
    # Add any remaining tokens after the last entity
    remaining = text[last_end:]
    if remaining.strip():
        remaining_tokens = remaining.split()
        tokens.extend(remaining_tokens)
    
    # Add custom labels to the NLP model if they don't exist
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    else:
        ner = nlp.get_pipe("ner")
    
    for label in custom_labels:
        ner.add_label(label)
    
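    # Create spaces array (True for tokens that should have a space after them)
    # Simple heuristic: every token except the last gets a trailing space. Because
    # tokens come from whitespace splitting, text that directly follows an entity
    # (e.g. "'s" or ",") gains an extra space, so character offsets in doc.text
    # can differ from those in the input text.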
    spaces = [True] * len(tokens)
    if tokens:
        spaces[-1] = False
    
    # Create the Doc from tokens
    doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
    
    # Create entity spans
    entities = []
    for start_idx, end_idx, label in entity_spans:
        if start_idx < len(doc) and end_idx <= len(doc):
            span = Span(doc, start_idx, end_idx, label=label)
            entities.append(span)
    
    # Set entities on the document
    doc.ents = entities
    
    return doc


def visualize_annotated_text(text, nlp=None, style="ent", jupyter=True):
    """
    Convenience function to convert annotated text and visualize it with displaCy.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
        style (str): displaCy style ("ent" or "dep")
        jupyter (bool): Whether to render for Jupyter notebook
    
    Returns:
        Rendered visualization (HTML string if not in Jupyter)
    """
    doc = annotated_text_to_spacy_doc(text, nlp)
    
    # spaCy is already imported at module level, so an ImportError guard here
    # could never trigger; import displacy explicitly and render.
    from spacy import displacy
    return displacy.render(doc, style=style, jupyter=jupyter)


## Importing the Required Libraries

In [10]:
from dotenv import load_dotenv
import os
from openai import OpenAI

## Loading our Environment Variables

In [11]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
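
`os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file only surfaces later as an authentication error. One possible guard is sketched below; `require_env` and the `EXAMPLE_TOKEN` variable are hypothetical names introduced for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Demo with a hypothetical variable that is set only for this example.
os.environ["EXAMPLE_TOKEN"] = "abc123"
print(require_env("EXAMPLE_TOKEN"))
```

In the notebook this could be used as `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` after `load_dotenv()`.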

## Connecting to OpenAI Client

In [12]:
client = OpenAI(api_key=OPENAI_API_KEY)

## Main Variables for the Notebook

In [13]:
INPUT_DATA = [{'text_original': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'text_clean': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'language': {'language': 'en', 'score': -868.9007034301758},
  'sentences': [{'id': 0,
    'start': 0,
    'end': 130,
    'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."},
   {'id': 1,
    'start': 131,
    'end': 324,
    'text': 'Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.'}],
  'tokens': [{'id': 0,
    'text': 'This',
    'start': 0,
    'end': 4,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 1,
    'text': 'painting',
    'start': 5,
    'end': 13,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 2,
    'text': 'depicts',
    'start': 14,
    'end': 21,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 3,
    'text': 'Monet',
    'start': 22,
    'end': 27,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 4,
    'text': "'s",
    'start': 27,
    'end': 29,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 5,
    'text': 'first',
    'start': 30,
    'end': 35,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 6,
    'text': 'wife',
    'start': 36,
    'end': 40,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 7,
    'text': ',',
    'start': 40,
    'end': 41,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 8,
    'text': 'Camille',
    'start': 42,
    'end': 49,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 9,
    'text': ',',
    'start': 49,
    'end': 50,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 10,
    'text': 'outside',
    'start': 51,
    'end': 58,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 11,
    'text': 'on',
    'start': 59,
    'end': 61,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 12,
    'text': 'a',
    'start': 62,
    'end': 63,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 13,
    'text': 'snowy',
    'start': 64,
    'end': 69,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 14,
    'text': 'day',
    'start': 70,
    'end': 73,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 15,
    'text': 'passing',
    'start': 74,
    'end': 81,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 16,
    'text': 'by',
    'start': 82,
    'end': 84,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 17,
    'text': 'the',
    'start': 85,
    'end': 88,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 18,
    'text': 'French',
    'start': 89,
    'end': 95,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 19,
    'text': 'doors',
    'start': 96,
    'end': 101,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 20,
    'text': 'of',
    'start': 102,
    'end': 104,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 21,
    'text': 'their',
    'start': 105,
    'end': 110,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 22,
    'text': 'home',
    'start': 111,
    'end': 115,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 23,
    'text': 'at',
    'start': 116,
    'end': 118,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 24,
    'text': 'Argenteuil',
    'start': 119,
    'end': 129,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 25,
    'text': '.',
    'start': 129,
    'end': 130,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 26,
    'text': 'Her',
    'start': 131,
    'end': 134,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 27,
    'text': 'face',
    'start': 135,
    'end': 139,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 28,
    'text': 'is',
    'start': 140,
    'end': 142,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 29,
    'text': 'rendered',
    'start': 143,
    'end': 151,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 30,
    'text': 'in',
    'start': 152,
    'end': 154,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 31,
    'text': 'a',
    'start': 155,
    'end': 156,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 32,
    'text': 'radically',
    'start': 157,
    'end': 166,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 33,
    'text': 'bold',
    'start': 167,
    'end': 171,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 34,
    'text': 'Impressionist',
    'start': 172,
    'end': 185,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 35,
    'text': 'technique',
    'start': 186,
    'end': 195,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 36,
    'text': 'of',
    'start': 196,
    'end': 198,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 37,
    'text': 'mere',
    'start': 199,
    'end': 203,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 38,
    'text': 'daubs',
    'start': 204,
    'end': 209,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 39,
    'text': 'of',
    'start': 210,
    'end': 212,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 40,
    'text': 'paint',
    'start': 213,
    'end': 218,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 41,
    'text': 'quickly',
    'start': 219,
    'end': 226,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 42,
    'text': 'applied',
    'start': 227,
    'end': 234,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 43,
    'text': ',',
    'start': 234,
    'end': 235,
    'ws': True,
    'is_punct': True,
    'sent_id': 1},
   {'id': 44,
    'text': 'just',
    'start': 236,
    'end': 240,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 45,
    'text': 'as',
    'start': 241,
    'end': 243,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 46,
    'text': 'the',
    'start': 244,
    'end': 247,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 47,
    'text': 'snow',
    'start': 248,
    'end': 252,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 48,
    'text': 'and',
    'start': 253,
    'end': 256,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 49,
    'text': 'trees',
    'start': 257,
    'end': 262,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 50,
    'text': 'are',
    'start': 263,
    'end': 266,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 51,
    'text': 'defined',
    'start': 267,
    'end': 274,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 52,
    'text': 'by',
    'start': 275,
    'end': 277,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 53,
    'text': 'broad',
    'start': 278,
    'end': 283,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 54,
    'text': ',',
    'start': 283,
    'end': 284,
    'ws': True,
    'is_punct': True,
    'sent_id': 1},
   {'id': 55,
    'text': 'broken',
    'start': 285,
    'end': 291,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 56,
    'text': 'strokes',
    'start': 292,
    'end': 299,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 57,
    'text': 'of',
    'start': 300,
    'end': 302,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 58,
    'text': 'pure',
    'start': 303,
    'end': 307,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 59,
    'text': 'white',
    'start': 308,
    'end': 313,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 60,
    'text': 'and',
    'start': 314,
    'end': 317,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 61,
    'text': 'green',
    'start': 318,
    'end': 323,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 62,
    'text': '.',
    'start': 323,
    'end': 324,
    'ws': False,
    'is_punct': True,
    'sent_id': 1}],
  'meta': {'source': 'CMA',
   'id': 135382,
   'char_count': 324,
   'token_count': 63,
   'sentence_count': 2}}]

In [14]:
MODEL = "gpt-4o-mini"
LABELS = ["PERSON", "LOCATION", "ORGANIZATION"]
TEXT = INPUT_DATA[0]["text_clean"]
print(TEXT)

This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.


## Creating the Prompt

In [15]:
prompt = f"""
Convert the following text into a structured markdown format, where you annotate the entities in the text in the following format: [Tom](PERSON) went to [New York](LOCATION).

Look for the following entity types:
{LABELS}

Do this for the following text:
{TEXT}

Only return the markdown output, nothing else.
"""

In [16]:
print(prompt)


Convert the following text into a structured markdown format, where you annotate the entities in the text in the following format: [Tom](PERSON) went to [New York](LOCATION).

Look for the following entity types:
['PERSON', 'LOCATION', 'ORGANIZATION']

Do this for the following text:
This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.

Only return the markdown output, nothing else.



## Calling OpenAI

In [17]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}]
)
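
As a variation on the call above, the instructions can be moved into a system message and the sampling temperature lowered. The sketch below only assembles the keyword arguments, so it runs without an API key; `build_ner_request` is a hypothetical helper, and `temperature=0` is an assumption about what suits this task (it reduces variation between runs but does not guarantee identical output).

```python
def build_ner_request(text, labels, model):
    """Assemble keyword arguments for client.chat.completions.create."""
    system = (
        "You annotate named entities in markdown as [Entity](TYPE). "
        "Allowed types: " + ", ".join(labels) + ". "
        "Return the input text unchanged apart from the annotations."
    )
    return {
        "model": model,
        "temperature": 0,  # assumption: less sampling randomness is preferred here
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": text},
        ],
    }


request = build_ner_request(
    "Monet lived at Argenteuil.", ["PERSON", "LOCATION"], "gpt-4o-mini"
)
print(request["messages"][0]["content"])
# With a configured client, the request would be sent as:
# response = client.chat.completions.create(**request)
```

Joining the labels with commas also avoids sending the Python list representation (brackets and quotes) to the model.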

In [18]:
markdown_output = response.choices[0].message.content

In [19]:
print(markdown_output)

This painting depicts [Monet](PERSON)'s first wife, [Camille](PERSON), outside on a snowy day passing by the French doors of their home at [Argenteuil](LOCATION). Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.
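
An LLM can occasionally reword the text or invent a label, so it may be worth checking the output before using it. The following is a minimal sketch under that assumption: `check_annotations` is a hypothetical helper that removes the markup, compares the result with the source text, and flags labels outside the allowed list. The shortened sentences are illustrative.

```python
import re

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def check_annotations(original, annotated, allowed_labels):
    """Return a list of problems; an empty list means the output passed both checks."""
    problems = []
    # Replace each [Entity](TYPE) with Entity and compare with the source text.
    stripped = ANNOTATION.sub(lambda m: m.group(1), annotated).strip()
    if stripped != original.strip():
        problems.append("text differs from the original after removing annotations")
    for m in ANNOTATION.finditer(annotated):
        if m.group(2) not in allowed_labels:
            problems.append(f"unexpected label {m.group(2)!r} on {m.group(1)!r}")
    return problems


original = "Monet's first wife, Camille, at Argenteuil."
good = "[Monet](PERSON)'s first wife, [Camille](PERSON), at [Argenteuil](LOCATION)."
bad = "[Monet](ARTIST)'s first wife, [Camille](PERSON), at [Argenteuil](LOCATION)."
labels = ["PERSON", "LOCATION", "ORGANIZATION"]

print(check_annotations(original, good, labels))  # []
print(check_annotations(original, bad, labels))   # ["unexpected label 'ARTIST' on 'Monet'"]
```

In the notebook, the equivalent call would be `check_annotations(TEXT, markdown_output, LABELS)`.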


## Visualizing the Results

In [20]:
visualize_annotated_text(markdown_output)

In [22]:
doc = annotated_text_to_spacy_doc(markdown_output)
print(doc.ents)

(Monet, Camille, Argenteuil)


In [24]:
entities = []
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
    entities.append({
        "text": ent.text,
        "label": ent.label_,
        "start_char": ent.start_char,
        "end_char": ent.end_char
    })
    

Monet PERSON 22 27
Camille PERSON 43 50
Argenteuil LOCATION 121 131


In [26]:
INPUT_DATA[0]["entities"] = entities

In [None]:
print(INPUT_DATA[0]["entities"])

[{'text': 'Monet', 'label': 'PERSON', 'start_char': 22, 'end_char': 27}, {'text': 'Camille', 'label': 'PERSON', 'start_char': 43, 'end_char': 50}, {'text': 'Argenteuil', 'label': 'LOCATION', 'start_char': 121, 'end_char': 131}]
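
Note that the offsets above are positions in `doc.text`, which the helper rebuilds by joining whitespace-split tokens with single spaces. That is why "Camille" appears at 43-50 here, while the token list in `INPUT_DATA` places it at 42-49 in the original text. If offsets into the original text are needed, one option is to compute them directly while stripping the markup. The sketch below assumes the model returned the text unchanged apart from the annotations; `annotations_to_offsets` is a hypothetical helper and the sentence is shortened for illustration.

```python
import re

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def annotations_to_offsets(annotated):
    """Strip [Entity](TYPE) markup; return (plain_text, entities) with offsets into plain_text."""
    parts = []
    entities = []
    cursor = 0  # current position in the annotated string
    length = 0  # length of the plain text built so far
    for m in ANNOTATION.finditer(annotated):
        before = annotated[cursor:m.start()]
        parts.append(before)
        length += len(before)
        entity = m.group(1)
        entities.append({
            "text": entity,
            "label": m.group(2),
            "start_char": length,
            "end_char": length + len(entity),
        })
        parts.append(entity)
        length += len(entity)
        cursor = m.end()
    parts.append(annotated[cursor:])
    return "".join(parts), entities


annotated = "This painting depicts [Monet](PERSON)'s first wife, [Camille](PERSON), outside."
plain, ents = annotations_to_offsets(annotated)
for e in ents:
    print(e["text"], e["label"], e["start_char"], e["end_char"])
# Monet PERSON 22 27
# Camille PERSON 42 49
```

Each slice `plain[start_char:end_char]` reproduces the entity text, which makes the offsets safe to store alongside `text_clean`.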
