# 👩‍💻👨‍💻 <font color='red'>**Please do not edit this file.** </font> Go to <font color='blue'>*File > Save a copy in Drive*</font>

# DHd 2025 | Workshop: **eScriptorium meets LLMs**
## Hands-On 3: **LLMs for Named Entity Recognition (NER) of OCR results**

This introductory (and experimental) notebook is based on:
* Zhang, R., Li, Y., Ma, Y., Zhou, M., & Zou, L. (2023). LLMaAA: Making Large Language Models as Active Annotators. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13088–13103). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-emnlp.872
  * https://github.com/ridiculouz/LLMaAA/
* Dalfsen, A. van, Karsdorp, F., Bagheri, A., Engelen, T. van, Mentink, D., & Stronks, E. (2024). Direct and Indirect Annotation with Generative AI: A Case Study into Finding Animals and Plants in Historical Text. Proceedings of the Computational Humanities Research Conference, 2024. https://ceur-ws.org/Vol-3834/paper74.pdf
  * https://github.com/trister95/direct-and-indirect-annotation

In [1]:
# Install dependencies
%pip install -U litellm==1.61.7 spacy==3.7.5 json-repair==0.39.1

Collecting litellm==1.61.7
  Downloading litellm-1.61.7-py3-none-any.whl.metadata (37 kB)
Collecting json-repair==0.39.1
  Downloading json_repair-0.39.1-py3-none-any.whl.metadata (11 kB)
Collecting python-dotenv>=0.2.0 (from litellm==1.61.7)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting tiktoken>=0.7.0 (from litellm==1.61.7)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading litellm-1.61.7-py3-none-any.whl (6.8 MB)
Downloading json_repair-0.39.1-py3-none-any.whl (20 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packa

In [2]:
import json
import random
import requests

## Load **labeled data**

We are loading manually annotated NER data as a `.json` file. The labeled data is based on OCR ground truth for the German newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (*German Imperial Gazette and Prussian Official Gazette*), which was published under changing names from 1819 to 1945 (https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/ausgaben).

→ The labeled data will be our source for creating example and test data for LLM usage in later steps.

----

### Data sources

* `OCR` data (license: CC0): https://github.com/UB-Mannheim/reichsanzeiger-gt
* `NER` data (license: CC0): https://github.com/UB-Mannheim/reichsanzeiger-nlp


In [3]:
# URL with labeled data
json_url = 'https://raw.githubusercontent.com/tsmdt/DHd-2025_eScriptorium-meets-LLMs/fca6ede5d1f7bc46bae82519f8afeeb5725d4601/03_LLMs4NER/data/labeled_data.json'

In [4]:
# Load labeled data
labeled_data = requests.get(json_url).json()

# Print the first two items of the json
labeled_data[:2]

[{'text': 'In der großen Kapelle Quirinals, die Pauliniſche genannt, werden alle Vorrungen getroffen, um einen großen und freien Raum zu winnen.',
  'labels': [{'text': 'Kapelle Quirinals, die Pauliniſche',
    'start': 14,
    'end': 48,
    'label': 'LOC'}]},
 {'text': '(Ich rufe den Herrn Chriſtus, der mich richten zum Zeugen an, daß ich denjenigen waͤhle, welchen ich Gottes Willen waͤhlen zu muͤſſen glaube, was ich auch dem Accessus thun werde.)',
  'labels': [{'text': 'Herrn Chriſtus',
    'start': 14,
    'end': 28,
    'label': 'PER'},
   {'text': 'Gottes', 'start': 101, 'end': 107, 'label': 'PER'}]}]

In [5]:
print(f"Labeled sentences in data set: {len(labeled_data)}")

Labeled sentences in data set: 1161
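
The `start` and `end` values in each label appear to be character offsets into `text`, with `end` exclusive, so that `text[start:end]` reproduces the labeled string. The following is a minimal sketch for checking that assumption on the first record printed above; `check_offsets` is a hypothetical helper and not part of the workshop code.

```python
def check_offsets(item: dict) -> list[dict]:
    """Return every label whose offsets do not reproduce its text."""
    return [
        label for label in item["labels"]
        if item["text"][label["start"]:label["end"]] != label["text"]
    ]

# First record of the labeled data, copied from the output above
record = {
    "text": "In der großen Kapelle Quirinals, die Pauliniſche genannt, "
            "werden alle Vorrungen getroffen, um einen großen und freien Raum zu winnen.",
    "labels": [
        {"text": "Kapelle Quirinals, die Pauliniſche", "start": 14, "end": 48, "label": "LOC"},
    ],
}

mismatches = check_offsets(record)
print(f"Labels with inconsistent offsets: {len(mismatches)}")
```

Running the same check over all of `labeled_data` would show whether the convention holds for the whole data set.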


## Create **data pools**

We now randomly select a number of sentences (e.g. 100 sentences) from our `labeled_data` and split this sample into two data pools:

1. `example_data`: a data pool from which we will choose examples for prompting our LLM later
2. `test_data`: a data pool for testing the accuracy of both `spacy` NER models and LLMs

In [6]:
def get_random_samples(
  data: list[dict],
  sent_length: int,
  n: int,
  seed: int | None = 42
) -> list[dict]:
  """
  Filters the data by minimum sentence length and returns a random sample.

  Args:
    data: A list of dictionaries, where each dictionary is expected
          to have a "text" key containing a string.
    sent_length: The minimum sentence length (number of words) for a
                 dictionary to be included in the sample.
    n: The number of random samples to return.
    seed: An optional integer to seed the random number generator for
          reproducible results.

  Returns:
    A list of dictionaries that meet the sentence length criteria,
    randomly sampled from the input data. The size of the returned
    list will be the smaller of `n` and the number of dictionaries
    meeting the sentence length criteria.
  """
  # Filter sentences by number of words
  filtered = [item for item in data if len(item.get("text", "").split()) >= sent_length]
  sample_size = min(n, len(filtered))

  # Set the random seed
  rng = random.Random(seed)

  return rng.sample(filtered, sample_size)

In [7]:
# Create a list of 100 randomly sampled sentences with labels
samples = get_random_samples(data=labeled_data, sent_length=5, n=100, seed=84)
samples[26]

{'text': 'Wohnſitzes in Uedem, und der Notariats⸗Kandidat Ludwig Pfahl zu Bonn zum Notar für den Friedensgerichtsbezirk Waldbröl, im Landgerichtsbezirk Köln, mit Anweiſung ſeines Wohnſitzes in Waldbröl, ernannt worden.',
 'labels': [{'text': 'Uedem', 'start': 14, 'end': 19, 'label': 'LOC'},
  {'text': 'Notariats⸗Kandidat Ludwig Pfahl',
   'start': 29,
   'end': 60,
   'label': 'PER'},
  {'text': 'Bonn', 'start': 64, 'end': 68, 'label': 'LOC'},
  {'text': 'Waldbröl', 'start': 110, 'end': 118, 'label': 'LOC'},
  {'text': 'Köln', 'start': 142, 'end': 146, 'label': 'LOC'},
  {'text': 'Waldbröl', 'start': 183, 'end': 191, 'label': 'LOC'}]}

In [8]:
def split_sample_data(samples: list[dict], perc: float) -> tuple[list[dict], list[dict]]:
  """
  Split sample data into example and test subsets based on specified proportion.

  Args:
      samples: Input data containing dictionaries to split
      perc: Fraction (0.0-1.0) of data to allocate to examples subset

  Returns:
      Tuple containing:
      - example_data: First `perc` portion for LLM prompting
      - test_data: Remaining samples for model testing
  """
  ratio = int(round(len(samples) * perc))
  example_data = samples[:ratio]
  test_data = samples[ratio:]
  return example_data, test_data

In [9]:
# Create 50 samples each for example_data and test_data
example_data, test_data = split_sample_data(samples=samples, perc=0.5)

In [10]:
# First sentence of example_data
example_data[0]

{'text': 'Am 28. April: Dr. Hoffa, Ob. St. Arzt a. D., zuletzt in der ehem. Kurheſſ. Artill.',
 'labels': [{'text': '28. April', 'start': 3, 'end': 12, 'label': 'TIME'},
  {'text': 'Dr. Hoffa, Ob. St. Arzt a. D.',
   'start': 14,
   'end': 43,
   'label': 'PER'},
  {'text': 'ehem. Kurheſſ. Artill.', 'start': 60, 'end': 82, 'label': 'ORG'}]}

In [11]:
# First sentence of test_data
test_data[0]

{'text': 'Lord Palmerſton iſt von hier nach Tiverton in Devonſhire abgereiſt.',
 'labels': [{'text': 'Lord Palmerſton', 'start': 0, 'end': 15, 'label': 'PER'},
  {'text': 'Tiverton', 'start': 34, 'end': 42, 'label': 'LOC'},
  {'text': 'Devonſhire', 'start': 46, 'end': 56, 'label': 'LOC'}]}

## NER with `spacy`

We now test `spacy` NER models on our `test_data` to check how well they perform on our historical domain.

<font color='purple'>**NOTE:** This comparison is not entirely fair as we are testing basic `spacy` models on our domain. There might be much more powerful models available, but our rationale for the workshop is: *what if we are working with data for which NO models are available?*</font>

In [12]:
# Download spacy models for NER testing
!python -m spacy download de_core_news_md
!python -m spacy download de_core_news_lg

Collecting de-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.7.0/de_core_news_md-3.7.0-py3-none-any.whl (44.4 MB)
Installing collected packages: de-core-news-md
Successfully installed de-core-news-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting de-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.7.0/de_core_news_lg-3.7.0-py3-none-any.whl (567.8 MB)

In [13]:
import spacy
import de_core_news_lg, de_core_news_md

We define a sentence from our `test_data` pool for testing first.

In [14]:
# Test sentence
test_sentence = test_data[17]
test_sentence

{'text': 'Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.',
 'labels': [{'text': '24. April', 'start': 3, 'end': 12, 'label': 'TIME'},
  {'text': 'Kayſer, Ob. Lt. a. D.', 'start': 14, 'end': 35, 'label': 'PER'},
  {'text': '8. Art. Brig', 'start': 52, 'end': 64, 'label': 'ORG'}]}

To get a **direct comparison** between the `spacy` labels and the already existing labels, we print the manually annotated labels first:

In [15]:
def print_entities(sentence: dict) -> None:
  """
  Helper function to print existing labels.
  """
  for label in sentence['labels']:
    # LLM output may use either 'text' or 'span' as the key for the entity string
    if label.get('text') is not None:
      print(f"Entity: {label['text']} (Label: {label['label']})")
    elif label.get('span') is not None:
      print(f"Entity: {label['span']} (Label: {label['label']})")
    else:
      print("No entity found")

print_entities(test_sentence)

Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig (Label: ORG)


Now we test `de_core_news_md`, a *medium* sized NER model.

* Documentation: https://spacy.io/models/de#de_core_news_md

In [16]:
# Load medium sized spacy model
spacy.require_cpu()
nlp = spacy.load("de_core_news_md")

# Run NER on our test sentence
doc = nlp(test_sentence['text'])

for ent in doc.ents:
  print(f"Entity: {ent.text} (Label: {ent.label_})")

Entity: Kayſer (Label: MISC)
Entity: Lt (Label: PER)
Entity: Brig (Label: LOC)


As we can see, there is quite a difference between the labels annotated by humans and by `spacy`.

Let's check if the *larger* `spacy` model `de_core_news_lg` performs better on our test sentence.

* Documentation: https://spacy.io/models/de#de_core_news_lg

In [17]:
# Load larger spacy model
spacy.require_cpu()
nlp = spacy.load("de_core_news_lg")

# Run NER on our test sentence
doc = nlp(test_sentence['text'])

for ent in doc.ents:
  print(f"Entity: {ent.text} (Label: {ent.label_})")

Entity: Kayſer (Label: PER)
Entity: Lt (Label: MISC)
Entity: Brig (Label: LOC)


## NER with LLMs

After using `spacy` for NER, we now test the capabilities of LLMs as annotators via the LLM service `Groq`.

* `Groq`: https://console.groq.com/

In [18]:
import os
import re
import requests
import json_repair
import litellm
from google.colab import userdata

We need to provide an API key to use Groq.

<font color='red'>**IMPORTANT**: Make sure to never share your API keys!</font>

In [28]:
# Set API key for Groq
groq_api_key = "" # gsk_12345...
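
Pasting the key into a cell makes it easy to leak when the notebook is shared. One alternative is to read it from an environment variable; the sketch below assumes the variable is called `GROQ_API_KEY`, and `load_api_key` is a hypothetical helper. In Colab, the Secrets panel together with the `userdata` module imported above is another option.

```python
import os

def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Read an API key from the environment; return '' if it is not set."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        print(f"{env_var} is not set; API calls will fail until a key is provided.")
    return key

# Assumed variable name; adjust it to however the key is stored on your system
api_key_from_env = load_api_key()
```

If a key is found this way, it can be assigned to `groq_api_key` so that the cells below work unchanged.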

In [29]:
# Check for available Groq models
url = "https://api.groq.com/openai/v1/models"

headers = {
  "Authorization": f"Bearer {groq_api_key}",
  "Content-Type": "application/json"
}

response = requests.get(url, headers=headers)

response.json()

{'object': 'list',
 'data': [{'id': 'llama3-70b-8192',
   'object': 'model',
   'created': 1693721698,
   'owned_by': 'Meta',
   'active': True,
   'context_window': 8192,
   'public_apps': None},
  {'id': 'gemma2-9b-it',
   'object': 'model',
   'created': 1693721698,
   'owned_by': 'Google',
   'active': True,
   'context_window': 8192,
   'public_apps': None},
  {'id': 'llama-3.2-1b-preview',
   'object': 'model',
   'created': 1727224268,
   'owned_by': 'Meta',
   'active': True,
   'context_window': 8192,
   'public_apps': None},
  {'id': 'deepseek-r1-distill-llama-70b',
   'object': 'model',
   'created': 1737924940,
   'owned_by': 'DeepSeek / Meta',
   'active': True,
   'context_window': 131072,
   'public_apps': None},
  {'id': 'whisper-large-v3',
   'object': 'model',
   'created': 1693721698,
   'owned_by': 'OpenAI',
   'active': True,
   'context_window': 448,
   'public_apps': None},
  {'id': 'llama-3.3-70b-versatile',
   'object': 'model',
   'created': 1733447754,
   'ow

### **0-shot prompting** (no example data)

In [21]:
ZERO_SHOT_TEMPLATE = """You are a highly intelligent and accurate information extraction system for Named Entity Recognition (NER).
I'll provide an <input sentence>, written in German Fraktur. Perform the following task on the <input sentence>:

1. Carefully **recognize** and **annotate** entities of the following types:
- PER: Names of individuals (e.g. "Maria", "Herzogs Boleslaus des Frommen", "Alexander Duval")
- LOC: Names of locations (e.g. "Polen", "Swinemuender Rhede", "Aegypten")
- ORG: Names of organizations (e.g. "Kommiſſion des Pairshofes", "Nicolaiſche Buchhandlung")
- TIME: Dates (e.g. "Auguſt 1834", "1002", "4. Nov. 1038")
- PROD: Names of products (e.g. "Voſſiſche Zeitung", "Journal de Paris")

2. **Output Format**:
- Always preserve all non‐ASCII characters (like 'ſ') exactly as they appear: **do not** convert or escape them.
- Return ONLY a **JSON** in the following structure and **nothing** else:

```JSON
{'text': <input_sentence>,
 'labels': [{'text': 'Hamburger Correſpondent',
   'label': 'PROD'},
   ...
   ]
}
```

<input sentence>
{input_sentence}
"""

We construct the final prompt for querying the LLM from **1)** a prompt template and **2)** an input sentence, using the helper function `build_prompt()`.

In [22]:
def build_prompt(template, **kwargs):
  """
  Helper function to inject kwargs into prompt templates.
  """
  prompt = template
  for key, value in kwargs.items():
    placeholder = f'{{{key}}}'
    prompt = prompt.replace(placeholder, str(value))
  return prompt

# Construct the final prompt with our test sentence as <input_sentence>
prompt_zero_shot = build_prompt(
  template=ZERO_SHOT_TEMPLATE,
  input_sentence=test_sentence['text'].strip()
)

# Print the prompt
print(prompt_zero_shot)

You are a highly intelligent and accurate information extraction system for Named Entity Recognition (NER).
I'll provide an <input sentence>, written in German Fraktur. Perform the following task on the <input sentence>:

1. Carefully **recognize** and **annotate** entities of the following types:
- PER: Names of individuals (e.g. "Maria", "Herzogs Boleslaus des Frommen", "Alexander Duval")
- LOC: Names of locations (e.g. "Polen", "Swinemuender Rhede", "Aegypten")
- ORG: Names of organizations (e.g. "Kommiſſion des Pairshofes", "Nicolaiſche Buchhandlung")
- TIME: Dates (e.g. "Auguſt 1834", "1002", "4. Nov. 1038")
- PROD: Names of products (e.g. "Voſſiſche Zeitung", "Journal de Paris")

2. **Output Format**:
- Always preserve all non‐ASCII characters (like 'ſ') exactly as they appear: **do not** convert or escape them.
- Return ONLY a **JSON** in the following structure and **nothing** else:

```JSON
{'text': <input_sentence>,
 'labels': [{'text': 'Hamburger Correſpondent',
   'label': 
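
`build_prompt()` fills placeholders with `str.replace()` rather than `str.format()`. One plausible reason (an interpretation, not something the notebook states) is that the template contains literal braces in its JSON skeleton, which `str.format()` would try to parse as placeholders. A minimal sketch with a shortened, hypothetical template:

```python
# Shortened stand-in for ZERO_SHOT_TEMPLATE (hypothetical, for illustration only)
template = "Return {'text': <input_sentence>}\n\n<input sentence>\n{input_sentence}"
sentence = "Lord Palmerſton iſt abgereiſt."

# str.format() treats {'text': ...} as a replacement field and fails
try:
    template.format(input_sentence=sentence)
    format_worked = True
except (KeyError, ValueError, IndexError):
    format_worked = False

# Plain replacement only touches the exact placeholder
prompt = template.replace("{input_sentence}", sentence)

print(f"str.format() succeeded: {format_worked}")
print(prompt)
```

With `str.format()`, every literal brace in the template would have to be doubled, which makes prompt templates harder to read and edit.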

#### Calling the **LLM Service**

We now call the LLM with our newly constructed prompt using the `Groq` API. Note that we call a relatively small model with 8 billion parameters, `llama-3.1-8b-instant`, rather than a larger one such as `llama-3.3-70b-versatile` (70 billion parameters) or OpenAI's `gpt-4o` (OpenAI does not disclose parameter counts; the often-quoted figure of about 1.8 trillion is an unofficial estimate).

The idea is to run a model that we could also run **locally** (by using, e.g., Ollama: https://ollama.com/). Inference would be slower, but the advantages of running a small LLM locally could outweigh the disadvantages for us: think about data privacy, API costs and rate limits etc.

---

#### Why do we use **`litellm`**?

We call the `Groq` API using the Python package `litellm`: https://docs.litellm.ai/

`litellm` is a library for calling multiple LLM providers in a **uniform** way: with the following `litellm.completion()` function (wrapped inside the `call_llm()` function) we could call `OpenAI`, `Anthropic`, `Google`, `HuggingFace`, `Ollama` and many others. We would only have to switch the `model` name and the `api_key` according to the service we want to use.

<font color='purple'>**NOTE**: To get `Ollama` running with local LLMs you have to install it first. To get `litellm` to use your `Ollama` instance you need to pass an additional argument to the `litellm.completion()` function: `api_base` which refers to the address of your local Ollama instance (default: `http://localhost:11434/`).

* `Ollama` (with installation instructions): https://github.com/ollama/ollama
* `litellm` Docs: https://docs.litellm.ai/docs/</font>

In [23]:
def call_llm(
    model: str,
    prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    temperature: float = 0,
    api_key: str | None = None
    ) -> str:
  """
  Call an LLM using litellm and return the text of its first completion.

  If no api_key is passed, the current value of groq_api_key is used.
  """
  response = litellm.completion(
    messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": prompt
        }
    ],
    model=model,
    temperature=temperature,
    # Resolve the key at call time so that a later change to groq_api_key takes effect
    api_key=api_key or groq_api_key,
    # api_base='http://localhost:11434/' # Only needed if you want to use Ollama with local LLMs
  )
  return response.choices[0].message.content

In [24]:
response_zero_shot = call_llm(
    model='groq/llama-3.1-8b-instant',
    prompt=prompt_zero_shot,
    )

In [25]:
print(response_zero_shot)

{'text': 'Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.',
 'labels': [{'text': 'Kayſer', 'label': 'PER'},
            {'text': 'April', 'label': 'TIME'},
            {'text': '8. Art. Brig.', 'label': 'ORG'}]}


Our `response_zero_shot` may look like a valid `dict`, but it is just a plain string:

In [None]:
print(type(response_zero_shot))

<class 'str'>


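For further processing we transform this string into a valid `dict` using the `json_repair` library. A repair step is needed because the response above uses single quotes, which is not valid JSON, so `json.loads()` alone would fail.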

* `json_repair`: https://github.com/mangiucugna/json_repair

In [None]:
def clean_llm_result(result: str) -> dict:
  """
  Helper function to repair LLM results and return a valid dict.
  """
  temp_string = json_repair.repair_json(result)
  clean_dict = json.loads(temp_string)
  return clean_dict

# Call the function
result_zero_shot = clean_llm_result(response_zero_shot)

# Print the result
result_zero_shot

{'text': 'Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.',
 'labels': [{'text': 'Kayſer', 'label': 'PER'},
  {'text': 'April', 'label': 'TIME'},
  {'text': '8. Art. Brig.', 'label': 'ORG'}]}

In [None]:
# Now we have a dict
print(type(result_zero_shot))

<class 'dict'>
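
Why a repair step is useful can be seen with the standard library alone. The sketch below uses two shortened, made-up strings: one in the single-quoted style the model returned above, one in strict JSON. `try_parse` is a hypothetical helper.

```python
import json

# Python-style output with single quotes, in the style returned by the model above
single_quoted = "{'text': 'Am 24. April', 'labels': [{'text': 'April', 'label': 'TIME'}]}"

# The same content as strict JSON with double quotes
double_quoted = '{"text": "Am 24. April", "labels": [{"text": "April", "label": "TIME"}]}'

def try_parse(raw: str):
    """Return the parsed object, or None if the string is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(try_parse(single_quoted))  # not valid JSON
print(try_parse(double_quoted))
```

LLM output also frequently arrives wrapped in Markdown code fences or with trailing commentary, which strict parsing rejects in the same way.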


At the moment we are missing `start` and `end` positions for our NER labels. A small helper function can resolve this issue:

In [None]:
# Get the position of the labeled spans and append them to the dict
def annotate_span_positions(data):
  """
  Calculate correct span positions for labeled entities.

  Every whole-token occurrence of an entity string in the text
  yields one label, so repeated entities are labeled repeatedly.
  """
  if isinstance(data, list):
    data = data[0]

  text = data["text"]
  new_labels = []

  for label in data["labels"]:
    span = label.get("span") or label.get("text")

    # Skip labels that carry no entity string
    if not span:
      continue

    # Find all occurrences of the span in text
    for m in re.finditer(re.escape(span), text):
      start, end = m.start(), m.end()

      # Check that the occurrence is not part of a larger alphanumeric token
      if start > 0 and text[start-1].isalnum():
        continue
      if end < len(text) and text[end].isalnum():
        continue
      new_labels.append({
          "text": span,
          "start": start,
          "end": end,
          "label": label["label"]
      })
  data["labels"] = new_labels
  return data

In [None]:
result_zero_shot_spans = annotate_span_positions(result_zero_shot)
result_zero_shot_spans

{'text': 'Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.',
 'labels': [{'text': 'Kayſer', 'start': 14, 'end': 20, 'label': 'PER'},
  {'text': 'April', 'start': 7, 'end': 12, 'label': 'TIME'},
  {'text': '8. Art. Brig.', 'start': 52, 'end': 65, 'label': 'ORG'}]}
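
Two properties of this position lookup are worth keeping in mind: an entity string that occurs several times is labeled at every occurrence, and matches inside a longer word are skipped. The sketch below isolates the matching logic in a hypothetical `find_spans` function; the first example sentence is abridged from the sample data, the second is made up.

```python
import re

def find_spans(text: str, span: str) -> list[tuple[int, int]]:
    """Find all whole-token occurrences of span in text as (start, end) pairs."""
    positions = []
    for match in re.finditer(re.escape(span), text):
        start, end = match.start(), match.end()
        # Skip matches that are glued to a neighbouring letter or digit
        if start > 0 and text[start - 1].isalnum():
            continue
        if end < len(text) and text[end].isalnum():
            continue
        positions.append((start, end))
    return positions

# The place name occurs twice, so two spans are returned
text = "Notar für den Friedensgerichtsbezirk Waldbröl, mit Wohnſitz in Waldbröl, ernannt."
print(find_spans(text, "Waldbröl"))

# "Brig" inside "Brigade" is rejected by the boundary check
print(find_spans("Die Brigade marſchirt.", "Brig"))
```

If the LLM labels an ambiguous string only once, labeling every occurrence may introduce extra annotations, so the result should be reviewed for repeated entities.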

#### **Comparison**

In [None]:
def print_comparisons(test_sentence: dict, llm_result: dict) -> None:
  """
  Helper function to print a comparison of manual, spacy and LLM
  entity labels.
  """
  if test_sentence:
    print(f"Test Sentence:\n{test_sentence['text']}\n")

    print('=== Entities (Human) ===')
    print_entities(test_sentence)

    print("\n=== Entities (Spacy: de_core_news_md) ===")
    spacy.require_cpu()
    nlp = spacy.load("de_core_news_md")
    doc = nlp(test_sentence['text'])

    for ent in doc.ents:
      print(f"Entity: {ent.text} (Label: {ent.label_})")

  if llm_result:
    print('\n=== Entities (LLM) ===')
    print_entities(llm_result)

In [None]:
print_comparisons(test_sentence=test_sentence, llm_result=result_zero_shot_spans)

Test Sentence:
Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.

=== Entities (Human) ===
Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig (Label: ORG)

=== Entities (Spacy: de_core_news_md) ===
Entity: Kayſer (Label: MISC)
Entity: Lt (Label: PER)
Entity: Brig (Label: LOC)

=== Entities (LLM) ===
Entity: Kayſer (Label: PER)
Entity: April (Label: TIME)
Entity: 8. Art. Brig. (Label: ORG)
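
A printed comparison is easy to read for one sentence, but a number is more useful once many sentences are involved. The following is a minimal sketch of strict exact-match scoring over `(entity text, label)` pairs, using the human and zero-shot annotations shown above. `entity_set` and `exact_match_scores` are hypothetical helpers, and exact matching is only one of several possible evaluation schemes.

```python
def entity_set(sentence: dict) -> set[tuple[str, str]]:
    """Collect (entity text, label) pairs from an annotated sentence."""
    return {(label["text"], label["label"]) for label in sentence["labels"]}

def exact_match_scores(gold: dict, predicted: dict) -> dict:
    """Precision, recall and F1 over exactly matching (text, label) pairs."""
    gold_set, pred_set = entity_set(gold), entity_set(predicted)
    hits = len(gold_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Human annotation of the test sentence
gold = {"labels": [
    {"text": "24. April", "label": "TIME"},
    {"text": "Kayſer, Ob. Lt. a. D.", "label": "PER"},
    {"text": "8. Art. Brig", "label": "ORG"},
]}

# Zero-shot LLM annotation from the comparison above
zero_shot = {"labels": [
    {"text": "Kayſer", "label": "PER"},
    {"text": "April", "label": "TIME"},
    {"text": "8. Art. Brig.", "label": "ORG"},
]}

print(exact_match_scores(gold, zero_shot))
```

Under this strict scheme the zero-shot result scores zero even though every prediction overlaps a human label: even the trailing period in `8. Art. Brig.` counts as a miss. A more lenient, overlap-based match would give partial credit. Because sets are used, repeated identical entities in one sentence are counted once.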


### **Exercise**

Test other models on the test sentence.

In [None]:
# Check for available Groq models
url = "https://api.groq.com/openai/v1/models"

headers = {
  "Authorization": f"Bearer {groq_api_key}",
  "Content-Type": "application/json"
}
# response = requests.get(url, headers=headers)
# response.json()

In [None]:
# Calling the Groq API using litellm
response = litellm.completion(
  messages=[{
    "role": "user",
    "content": prompt_zero_shot
  }],
  model='groq/llama-3.2-3b-preview',
  temperature=0,
  api_key=groq_api_key
)
# result = clean_llm_result(response.choices[0].message.content)
# result = annotate_span_positions(result)
# print_comparisons(test_sentence=test_sentence, llm_result=result)

### **Few-shot prompting** (with example data)

In [None]:
FEW_SHOT_TEMPLATE = """You are a highly intelligent and accurate information extraction system for Named Entity Recognition (NER).
I'll provide an <input sentence>, written in German Fraktur. Perform the following task on the <input sentence>:

1. Carefully **recognize** and **annotate** entities of the following types:
- PER: Names of individuals (e.g. "Maria", "Herzogs Boleslaus des Frommen", "Alexander Duval")
- LOC: Names of locations (e.g. "Polen", "Swinemuender Rhede", "Aegypten")
- ORG: Names of organizations (e.g. "Kommiſſion des Pairshofes", "Nicolaiſche Buchhandlung")
- TIME: Dates (e.g. "Auguſt 1834", "1002", "4. Nov. 1038")
- PROD: Names of products (e.g. "Voſſiſche Zeitung", "Journal de Paris")

2. **Output Format**:
- Use the provided <example> for reference.
- Always preserve all non‐ASCII characters (like 'ſ') exactly as they appear: **do not** convert or escape them.
- Return ONLY a **JSON** in the following structure and **nothing** else:

```JSON
{'text': <input sentence>,
 'labels': [{'text': 'Hamburger Correſpondent',
   'label': 'PROD'},
   ...
   ]
}
```

<example>
{example}
</example>

<input sentence>
{input_sentence}
"""

In [None]:
# Example sentence for our few shot prompt
example_data[0]

{'text': 'Am 28. April: Dr. Hoffa, Ob. St. Arzt a. D., zuletzt in der ehem. Kurheſſ. Artill.',
 'labels': [{'text': '28. April', 'start': 3, 'end': 12, 'label': 'TIME'},
  {'text': 'Dr. Hoffa, Ob. St. Arzt a. D.',
   'start': 14,
   'end': 43,
   'label': 'PER'},
  {'text': 'ehem. Kurheſſ. Artill.', 'start': 60, 'end': 82, 'label': 'ORG'}]}

In [None]:
# Construct the few shot prompt with example and test sentence
prompt_few_shot = build_prompt(
    template=FEW_SHOT_TEMPLATE,
    example=example_data[0],
    input_sentence=test_sentence['text']
    )

print(prompt_few_shot)

You are a highly intelligent and accurate information extraction system for Named Entity Recognition (NER).
I'll provide an <input sentence>, written in German Fraktur. Perform the following task on the <input sentence>:

1. Carefully **recognize** and **annotate** entities of the following types:
- PER: Names of individuals (e.g. "Maria", "Herzogs Boleslaus des Frommen", "Alexander Duval")
- LOC: Names of locations (e.g. "Polen", "Swinemuender Rhede", "Aegypten")
- ORG: Names of organizations (e.g. "Kommiſſion des Pairshofes", "Nicolaiſche Buchhandlung")
- TIME: Dates (e.g. "Auguſt 1834", "1002", "4. Nov. 1038")
- PROD: Names of products (e.g. "Voſſiſche Zeitung", "Journal de Paris")

2. **Output Format**:
- Use the provided <example> for reference.
- Always preserve all non‐ASCII characters (like 'ſ') exactly as they appear: **do not** convert or escape them.
- Return ONLY a **JSON** in the following structure and **nothing** else:

```JSON
{'text': <input sentence>,
 'labels': [{'te
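
`build_prompt()` inserts the example via `str()`, so the model sees the Python representation of the dict, including the `start` and `end` keys, although the requested output format contains only `text` and `label`. One possible variation is to reduce the example to those two keys and serialize it as JSON first. `format_example` below is a hypothetical helper; whether this changes the quality of the annotations has not been tested here.

```python
import json

def format_example(example: dict) -> str:
    """Serialize an example with only the keys the prompt asks the model to return."""
    slim = {
        "text": example["text"],
        "labels": [
            {"text": label["text"], "label": label["label"]}
            for label in example["labels"]
        ],
    }
    # ensure_ascii=False keeps characters such as 'ſ' unescaped
    return json.dumps(slim, ensure_ascii=False)

# First item of example_data, copied from the output above
example = {
    "text": "Am 28. April: Dr. Hoffa, Ob. St. Arzt a. D., zuletzt in der ehem. Kurheſſ. Artill.",
    "labels": [
        {"text": "28. April", "start": 3, "end": 12, "label": "TIME"},
        {"text": "Dr. Hoffa, Ob. St. Arzt a. D.", "start": 14, "end": 43, "label": "PER"},
        {"text": "ehem. Kurheſſ. Artill.", "start": 60, "end": 82, "label": "ORG"},
    ],
}

print(format_example(example))
```

The resulting string could be passed as `example=format_example(example_data[0])` to `build_prompt()`.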

In [None]:
# Call the LLM with the few shot prompt
response_few_shot = call_llm(
    model='groq/qwen-2.5-32b',
    prompt=prompt_few_shot,
    )
print(response_few_shot)

```json
{
  "text": "Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.",
  "labels": [
    {
      "text": "24. April",
      "start": 3,
      "end": 12,
      "label": "TIME"
    },
    {
      "text": "Kayſer, Ob. Lt. a. D.",
      "start": 14,
      "end": 33,
      "label": "PER"
    },
    {
      "text": "8. Art. Brig.",
      "start": 53,
      "end": 64,
      "label": "ORG"
    }
  ]
}
```


In [None]:
# Repair the LLM result (type = str) to get a valid dict
result_few_shot = clean_llm_result(response_few_shot)

# Add correct span positions to the annotated data
result_few_shot_spans = annotate_span_positions(result_few_shot)
result_few_shot_spans

{'text': 'Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.',
 'labels': [{'text': '24. April', 'start': 3, 'end': 12, 'label': 'TIME'},
  {'text': 'Kayſer, Ob. Lt. a. D.', 'start': 14, 'end': 35, 'label': 'PER'},
  {'text': '8. Art. Brig.', 'start': 52, 'end': 65, 'label': 'ORG'}]}
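
Comparing the raw response with the recomputed positions shows that the offsets reported by the model itself were not reliable in this run. The sketch below slices the sentence with both sets of offsets; all values are copied from the two outputs above.

```python
text = "Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig."

# (entity text, offsets reported by the LLM, offsets recomputed by the helper)
cases = [
    ("Kayſer, Ob. Lt. a. D.", (14, 33), (14, 35)),
    ("8. Art. Brig.", (53, 64), (52, 65)),
]

for entity, llm_offsets, recomputed in cases:
    llm_slice = text[llm_offsets[0]:llm_offsets[1]]
    fixed_slice = text[recomputed[0]:recomputed[1]]
    print(f"LLM offsets:        {llm_slice!r} (matches: {llm_slice == entity})")
    print(f"Recomputed offsets: {fixed_slice!r} (matches: {fixed_slice == entity})")
```

This is why the workflow takes only the entity strings and labels from the LLM and derives the character positions from the text itself.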

#### **Comparison**

In [None]:
print_comparisons(test_sentence, llm_result=result_few_shot_spans)

Test Sentence:
Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.

=== Entities (Human) ===
Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig (Label: ORG)

=== Entities (Spacy: de_core_news_md) ===
Entity: Kayſer (Label: MISC)
Entity: Lt (Label: PER)
Entity: Brig (Label: LOC)

=== Entities (LLM) ===
Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig. (Label: ORG)


#### **System Prompts**

To align the LLM more closely with our NER task, we next test a dedicated system prompt.

In [None]:
SYSTEM_PROMPT_TEMPLATE = """You are a **precision-driven Named Entity Recognition (NER) engine**. Follow these steps rigorously:
1. Comprehensively analyze the <input_sentence>, **prioritizing contextual understanding** over isolated terms.

2. **Identify** and **classify entities** into these categories (with strict validation):
- PER: Names of individuals, including titles (e.g. "Maria", "Herzogs Boleslaus des Frommen", "Alexander Duval")
- LOC: Physical locations; cities, countries, landmarks (e.g. "Polen", "Swinemuender Rhede", "Aegypten")
- ORG: Companies, organizations, institutions, brands (e.g. "Kommiſſion des Pairshofes", "Nicolaiſche Buchhandlung")
- TIME: Absolute Dates (e.g. "Auguſt 1834", "1002", "4. Nov. 1038")
- PROD: Products, outlets, commodities (e.g. "Voſſiſche Zeitung", "Journal de Paris", "Aspirin")

3. **Resolve ambiguity** by:
- Cross-referencing entities within the <input_sentence> (e.g., pronouns with antecedents).

4. **Output Format**:
- Use the provided <example> for reference.
- Always preserve all non‐ASCII characters (like 'ſ') exactly as they appear: **do not** convert or escape them.
- Return ONLY a **JSON** in the following structure and **nothing** else:

```JSON
{'text': <input sentence>,
 'labels': [{'text': 'Hamburger Correſpondent',
   'label': 'PROD'},
   ...
   ]
}
```

**Prioritize accuracy over speed**. If entity boundaries or types are unclear, return nothing rather than guessing."""


In [None]:
EXAMPLES_PROMPT_TEMPLATE = """<example>
{example}
</example>

<input_sentence>
{input_sentence}
"""

In [None]:
prompt_example_input_only = build_prompt(
    template=EXAMPLES_PROMPT_TEMPLATE,
    example=example_data[0],
    input_sentence=test_sentence['text']
    )

print(prompt_example_input_only)

<example>
{'text': 'Am 28. April: Dr. Hoffa, Ob. St. Arzt a. D., zuletzt in der ehem. Kurheſſ. Artill.', 'labels': [{'text': '28. April', 'start': 3, 'end': 12, 'label': 'TIME'}, {'text': 'Dr. Hoffa, Ob. St. Arzt a. D.', 'start': 14, 'end': 43, 'label': 'PER'}, {'text': 'ehem. Kurheſſ. Artill.', 'start': 60, 'end': 82, 'label': 'ORG'}]}
</example>

<input_sentence>
Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.



In [None]:
# Call the LLM with SYSTEM_PROMPT_TEMPLATE and prompt_example_input_only
response = call_llm(
    model='groq/qwen-2.5-32b',
    system_prompt=SYSTEM_PROMPT_TEMPLATE,
    prompt=prompt_example_input_only
    )

print(response)

```json
{
  "text": "Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.",
  "labels": [
    {
      "text": "24. April",
      "start": 3,
      "end": 12,
      "label": "TIME"
    },
    {
      "text": "Kayſer, Ob. Lt. a. D.",
      "start": 14,
      "end": 33,
      "label": "PER"
    },
    {
      "text": "8. Art. Brig.",
      "start": 53,
      "end": 64,
      "label": "ORG"
    }
  ]
}
```


In [None]:
def process_LLM_response(response) -> dict:
  """
  Helper function to clean the LLM response.

  This function combines the individual steps for processing
  the LLM result (cleaning the str; transforming it into a
  dict etc.)
  """
  result = clean_llm_result(response)
  result = annotate_span_positions(result)
  return result

llm_result = process_LLM_response(response)
print_comparisons(test_sentence, llm_result)

Test Sentence:
Am 24. April: Kayſer, Ob. Lt. a. D., zuletzt in der 8. Art. Brig.

=== Entities (Human) ===
Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig (Label: ORG)

=== Entities (Spacy: de_core_news_md) ===
Entity: Kayſer (Label: MISC)
Entity: Lt (Label: PER)
Entity: Brig (Label: LOC)

=== Entities (LLM) ===
Entity: 24. April (Label: TIME)
Entity: Kayſer, Ob. Lt. a. D. (Label: PER)
Entity: 8. Art. Brig. (Label: ORG)


#### Running LLM annotation on a **list** of sentences

In a production setting we would likely want to annotate a **list of sentences** instead of one sentence at a time. We test this workflow next.
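
When looping over many sentences, a single failed call (rate limit, unparsable response) should not abort the whole run. The following is a minimal sketch of such a loop under that assumption. `annotate_batch` and `fake_annotate` are hypothetical; a real annotator would build the prompt, call the LLM and post-process the response.

```python
import time
from typing import Callable

def annotate_batch(
    sentences: list[str],
    annotate: Callable[[str], dict],
    pause: float = 0.0,
) -> list[dict]:
    """Annotate sentences one by one, recording failures instead of aborting."""
    results = []
    for sentence in sentences:
        try:
            results.append(annotate(sentence))
        except Exception as error:  # e.g. rate limit or unparsable response
            results.append({"text": sentence, "labels": [], "error": str(error)})
        # Optional pause between calls to stay below provider rate limits
        if pause:
            time.sleep(pause)
    return results

# Stand-in annotator for demonstration purposes only
def fake_annotate(sentence: str) -> dict:
    if not sentence.strip():
        raise ValueError("empty sentence")
    return {"text": sentence, "labels": []}

batch = annotate_batch(["Lord Palmerſton iſt abgereiſt.", "  "], fake_annotate)
print(batch)
```

Keeping the failed items in the result list, marked with an `error` key, makes it possible to retry only those sentences later.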

In [None]:
# We define a sample list with 5 sentences
sample_set = test_data[:5]
sample_set

[{'text': 'Lord Palmerſton iſt von hier nach Tiverton in Devonſhire abgereiſt.',
  'labels': [{'text': 'Lord Palmerſton',
    'start': 0,
    'end': 15,
    'label': 'PER'},
   {'text': 'Tiverton', 'start': 34, 'end': 42, 'label': 'LOC'},
   {'text': 'Devonſhire', 'start': 46, 'end': 56, 'label': 'LOC'}]},
 {'text': 'Dümmler, Linden Nr. 19, zu haben: Ueber die Bildung der Steinkohle, [1949] nach Lindley und Hutten mit Rückſicht auf andere darüber aufgeſtellte Anſichten.',
  'labels': [{'text': 'Dümmler', 'start': 0, 'end': 7, 'label': 'PER'},
   {'text': 'Linden Nr. 19', 'start': 9, 'end': 22, 'label': 'LOC'},
   {'text': 'Lindley', 'start': 80, 'end': 87, 'label': 'PER'},
   {'text': 'Hutten', 'start': 92, 'end': 98, 'label': 'PER'}]},
 {'text': 'Am 14. Februar: v. Reiche, Major a. D, zuletzt in ehem. Hannov. Dienſten.',
  'labels': [{'text': '14. Februar', 'start': 3, 'end': 14, 'label': 'TIME'},
   {'text': 'v. Reiche, Major a. D', 'start': 16, 'end': 37, 'label': 'PER'}]},
 {'text'

In [None]:
import numpy as np
import random
import time

def cosine_sim(a, b):
  """
  Compute the cosine similarity between two vectors.
  """
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_kNN_examples(
    embedding_model,
    embedded_sentence: np.ndarray = None,
    embedded_data: list[dict] = None,
    k: int = 3
    ) -> list:
  """
  Retrieve the k most similar examples for an embedded sentence from a
  pool of embedded examples. Each pool entry has this shape, plus an
  'embeddings' key holding its vector:

    {
      'text': 'Berliner Anzeiger am 28. April.',
      'labels': [
        {'text': 'Berliner Anzeiger', 'start': 0, 'end': 17, 'label': 'PROD'},
        {'text': '28. April', 'start': 21, 'end': 30, 'label': 'TIME'}
        ]
    }
  """
  if embedded_sentence is not None and len(embedded_sentence) > 0 and embedded_data:
    # Run cosine similarity between embedded_sentence and embedded_data vectors
    similarities = [cosine_sim(embedded_sentence, data['embeddings']) for data in embedded_data]

    # Retrieve kNN example sentences
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    knn_examples = [embedded_data[i] for i in top_k_indices]

    return knn_examples

  else:
    print('Please provide a valid sentence and data_pool.')
    return None

def run_LLM_annotation(
    sentences: list[dict],
    example_data: list[dict],
    model: str,
    embedding_model: "SentenceTransformer | None" = None,
    temperature: float = 0,
    api_key: str = groq_api_key
  ) -> list[dict]:
  """
  Annotate sentences using an LLM with optional embedding-based
  example selection.

  For each sentence (extracted from the 'text' key), the function
  either retrieves a similar example via kNN (if an embedding_model
  is provided) or selects one randomly from example_data. It then
  builds a prompt using a template, calls the LLM to annotate the
  sentence, processes the response, and returns a list of annotation
  results.
  """
  if not sentences:
    print('Please provide at least one sentence for annotation.')
    return []
  elif isinstance(sentences, dict):
    # If a single sentence dict is passed, wrap it in a list
    sentences = [sentences]

  sentences = [sent['text'] for sent in sentences]

  # If embedding_model provided embed example_data
  if embedding_model:
    for i, sent in enumerate(example_data):
      example_data[i]['embeddings'] = embedding_model.encode(f"{sent['text']}")

  # Run the annotation
  labeled_sentences = []
  for i, sent in enumerate(sentences):
    # Choose an example sentence using kNN retrieval
    if embedding_model:
      sent_embedding = embedding_model.encode(f"{sent}")
      example_sents = get_kNN_examples(
          embedding_model=embedding_model,
          embedded_sentence=sent_embedding,
          embedded_data=example_data,
          k=3
          )
      example_sent = example_sents[0]
    # Choose an example sentence randomly
    else:
      example_sent = random.choice(example_data)

    # Combine prompt using a template, example and input sentence
    prompt = build_prompt(
        template=EXAMPLES_PROMPT_TEMPLATE,
        input_sentence=sent,
        example=example_sent
        )

    # Call the LLM
    response = call_llm(
        model=model,
        system_prompt=SYSTEM_PROMPT_TEMPLATE,
        prompt=prompt,
        temperature=temperature,
        api_key=api_key
    )

    # Process and clean the response
    result = process_LLM_response(response)

    # Append the resulting dict to our list
    labeled_sentences.append(result)

    # Sleep for 5 seconds to avoid rate limit error
    print(f"→ Sentence [{i}] finished: Waiting to avoid hitting rate limit ...")
    time.sleep(5)

  return labeled_sentences
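
The loop above waits a fixed 5 seconds after every call. An alternative is to wait only when a call actually fails and to increase the delay with each retry. The following is a minimal sketch of that idea; `call_with_backoff` is a hypothetical helper, not part of the notebook or of any library, and it catches every exception for brevity. In real use we would catch only the rate-limit exception raised by the client library.

```python
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """
    Call fn() and retry with exponentially growing delays on failure.
    (Hypothetical helper; the `sleep` parameter only exists to make testing easy.)
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # give up after the last retry
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay

# Demo with a stand-in function that fails twice before succeeding
attempts = {"n": 0}
delays = []

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

# Record the delays instead of really sleeping
result = call_with_backoff(flaky_call, base_delay=1.0, sleep=delays.append)
print(result, delays)  # → ok [1.0, 2.0]
```

Inside `run_LLM_annotation` one could wrap the `call_llm(...)` call in a `lambda` and pass it to such a helper, at the cost of a less predictable total runtime.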

Run annotation on the sample set while **randomly** choosing an example sentence.

In [None]:
# Run annotation on sample set
labeled_sentences = run_LLM_annotation(
    sentences=sample_set,
    example_data=example_data, # retrieve examples from example_data
    model='groq/qwen-2.5-32b'
  )

→ Sentence [0] finished: Waiting to avoid hitting rate limit ...
→ Sentence [1] finished: Waiting to avoid hitting rate limit ...
→ Sentence [2] finished: Waiting to avoid hitting rate limit ...
→ Sentence [3] finished: Waiting to avoid hitting rate limit ...
→ Sentence [4] finished: Waiting to avoid hitting rate limit ...


In [None]:
for sample_sent, labeled_sent in zip(sample_set, labeled_sentences):
  print_comparisons(test_sentence=sample_sent, llm_result=labeled_sent)
  print('\n---\n')

Test Sentence:
Lord Palmerſton iſt von hier nach Tiverton in Devonſhire abgereiſt.

=== Entities (Human) ===
Entity: Lord Palmerſton (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire (Label: LOC)

=== Entities (Spacy: de_core_news_md) ===
Entity: Lord Palmerſton iſt (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire abgereiſt (Label: LOC)

=== Entities (LLM) ===
Entity: Lord Palmerſton (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire (Label: LOC)

---

Test Sentence:
Dümmler, Linden Nr. 19, zu haben: Ueber die Bildung der Steinkohle, [1949] nach Lindley und Hutten mit Rückſicht auf andere darüber aufgeſtellte Anſichten.

=== Entities (Human) ===
Entity: Dümmler (Label: PER)
Entity: Linden Nr. 19 (Label: LOC)
Entity: Lindley (Label: PER)
Entity: Hutten (Label: PER)

=== Entities (Spacy: de_core_news_md) ===
Entity: Dümmler (Label: LOC)
Entity: Linden Nr. 19 (Label: MISC)
Entity: Lindley (Label: LOC)
Entity: Hutten (Label: LOC)
Entity: Rückſicht (Label

#### **Few-shot prompting** with **kNN retrieval** on example data

In the previous section we **randomly** chose an example sentence from `example_data` to call our LLM. Because the example was picked without regard to the input sentence, it may have had little in common with it and therefore offered the LLM little guidance.

To improve on this, we go a step further and implement a **[k-Nearest-Neighbor](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) retrieval** to select our example sentence.

Outline of our workflow:
1. We **embed the input sentence** that we want to label using an embedding model from HuggingFace
   * https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
2. We **embed the pool of example sentences** we previously created (our `example_data` list)
3. We retrieve the **top 3 most similar sentences** to our input sentence, ranked by cosine similarity
4. We choose the **most similar example sentence** to our input sentence and use it to query the LLM
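
To make steps 3 and 4 concrete, here is a toy illustration that uses hand-made 3-dimensional vectors instead of real embeddings and plain Python instead of NumPy. The names `cosine`, `top_k`, `toy_pool` and `toy_query` are made up for this sketch; the notebook's own implementation is `get_kNN_examples`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, pool, k=3):
    """Return the k pool entries whose vectors are most similar to the query."""
    ranked = sorted(
        pool,
        key=lambda item: cosine(query, item["embeddings"]),
        reverse=True,
    )
    return ranked[:k]

# Toy pool: same structure as example_data after embedding, with invented vectors
toy_pool = [
    {"text": "example A", "embeddings": [1.0, 0.0, 0.0]},
    {"text": "example B", "embeddings": [0.9, 0.1, 0.0]},
    {"text": "example C", "embeddings": [0.0, 1.0, 0.0]},
    {"text": "example D", "embeddings": [0.0, 0.0, 1.0]},
]
toy_query = [0.8, 0.2, 0.0]

neighbours = top_k(toy_query, toy_pool, k=3)   # step 3: top 3 by similarity
best_example = neighbours[0]                   # step 4: keep the most similar one

print([n["text"] for n in neighbours])  # → ['example B', 'example A', 'example C']
```

Cosine similarity compares only the direction of two vectors, not their length, which is why "example B" ranks above "example A" even though both point almost the same way as the query.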

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
# Load embedding model
embedding_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

In [None]:
# Run annotation on sample set with kNN retrieval
labeled_sentences = run_LLM_annotation(
    sentences=sample_set,
    example_data=example_data, # retrieve examples from example_data
    model='groq/qwen-2.5-32b',
    embedding_model=embedding_model, # pass embedding model
    temperature=0.1
  )

→ Sentence [0] finished: Waiting to avoid hitting rate limit ...
→ Sentence [1] finished: Waiting to avoid hitting rate limit ...
→ Sentence [2] finished: Waiting to avoid hitting rate limit ...
→ Sentence [3] finished: Waiting to avoid hitting rate limit ...
→ Sentence [4] finished: Waiting to avoid hitting rate limit ...


In [None]:
for sample_sent, labeled_sent in zip(sample_set, labeled_sentences):
  print_comparisons(test_sentence=sample_sent, llm_result=labeled_sent)
  print('\n---\n')

Test Sentence:
Lord Palmerſton iſt von hier nach Tiverton in Devonſhire abgereiſt.

=== Entities (Human) ===
Entity: Lord Palmerſton (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire (Label: LOC)

=== Entities (Spacy: de_core_news_md) ===
Entity: Lord Palmerſton iſt (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire abgereiſt (Label: LOC)

=== Entities (LLM) ===
Entity: Lord Palmerſton (Label: PER)
Entity: Tiverton (Label: LOC)
Entity: Devonſhire (Label: LOC)

---

Test Sentence:
Dümmler, Linden Nr. 19, zu haben: Ueber die Bildung der Steinkohle, [1949] nach Lindley und Hutten mit Rückſicht auf andere darüber aufgeſtellte Anſichten.

=== Entities (Human) ===
Entity: Dümmler (Label: PER)
Entity: Linden Nr. 19 (Label: LOC)
Entity: Lindley (Label: PER)
Entity: Hutten (Label: PER)

=== Entities (Spacy: de_core_news_md) ===
Entity: Dümmler (Label: LOC)
Entity: Linden Nr. 19 (Label: MISC)
Entity: Lindley (Label: LOC)
Entity: Hutten (Label: LOC)
Entity: Rückſicht (Label