<a href="https://colab.research.google.com/github/mialondon/llm-lod-recipes/blob/main/notebooks/named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition with spaCy and HIPE Format


## Summary

This recipe demonstrates how to perform Named Entity Recognition (NER) using spaCy and output results in the HIPE (Identifying Historical People, Places and other Entities) standard format. We'll walk through installing spaCy and the hipe-commons package, downloading the English language model, and extracting named entities with comprehensive HIPE-compliant annotations including IOB tagging, entity types, and additional metadata.


## Rationale

Named Entity Recognition is a fundamental NLP task that identifies and classifies named entities (such as persons, organizations, locations, dates, etc.) in text. The HIPE format provides a standardized, comprehensive annotation scheme that includes:

- **Multi-layered annotations**: Literal vs metonymic senses
- **Fine-grained entity types**: Detailed subtype classifications
- **Entity linking**: Connection to knowledge bases (e.g., Wikidata)
- **Nested entity support**: Handling complex entity structures
- **Standardized output**: Consistent format for research and applications

This recipe is useful for:
- Historical text analysis and digital humanities research
- Building knowledge graphs from historical documents
- Standardized NER evaluation and comparison
- Integration with HIPE-compliant datasets and tools


## Key Concepts

**Named Entity Recognition (NER)**: The process of identifying and classifying named entities in text into predefined categories such as PERSON, ORGANIZATION, LOCATION, DATE, etc.

**HIPE Format**: A comprehensive annotation standard with 10 columns per token:
1. **TOKEN**: The actual token text
2. **NE-COARSE-LIT**: Coarse entity type (IOB format) for literal sense
3. **NE-COARSE-METO**: Coarse entity type for metonymic sense
4. **NE-FINE-LIT**: Fine-grained entity type for literal sense
5. **NE-FINE-METO**: Fine-grained entity type for metonymic sense
6. **NE-FINE-COMP**: Component type of the entity
7. **NE-NESTED**: Nested entity type (if any)
8. **NEL-LIT**: Entity linking (Wikidata Q-ID) for literal sense
9. **NEL-METO**: Entity linking for metonymic sense
10. **MISC**: Miscellaneous flags (NoSpaceAfter, EndOfSentence, etc.)
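
As a minimal sketch of the column layout above, the snippet below builds one 10-column row with every unspecified column set to `_`. The helper name `make_hipe_row` and its keyword-argument convention are illustrative assumptions, not part of any HIPE tooling.

```python
# Column names in the order listed above.
HIPE_COLUMNS = [
    "TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT", "NE-FINE-METO",
    "NE-FINE-COMP", "NE-NESTED", "NEL-LIT", "NEL-METO", "MISC",
]


def make_hipe_row(token, **annotations):
    """Return a 10-item list for one token (hypothetical helper).

    Keyword names use underscores in place of hyphens,
    e.g. ne_coarse_lit="B-PERSON". Unspecified columns default to "_".
    """
    row = {column: "_" for column in HIPE_COLUMNS}
    row["TOKEN"] = token
    for key, value in annotations.items():
        column = key.upper().replace("_", "-")
        if column not in row:
            raise KeyError(f"Unknown HIPE column: {column}")
        row[column] = value
    return [row[column] for column in HIPE_COLUMNS]


row = make_hipe_row("Madonna", ne_coarse_lit="B-PERSON")
print("\t".join(row))
```

Keeping the column order in a single list means the header line and every token row are generated from the same source.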

**IOB Tagging**: A labeling scheme where:
- **B-** indicates the beginning of an entity
- **I-** indicates inside/continuation of an entity  
- **O** indicates outside any entity (not part of a named entity)
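
The scheme can be illustrated without any NLP library. The sketch below assigns IOB tags from token-index spans; the function name `spans_to_iob` and the example sentence are made up for illustration.

```python
def spans_to_iob(tokens, spans):
    """Assign IOB tags to tokens.

    tokens: list of token strings.
    spans: list of (start, end, label) token-index triples, end exclusive,
           assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Made-up example: one three-token PERSON entity.
tokens = ["Rembrandt", "van", "Rijn", "made", "this", "etching"]
tags = spans_to_iob(tokens, [(0, 3, "PERSON")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```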

**spaCy**: An industrial-strength natural language processing library that provides pre-trained models for various NLP tasks including NER.


## Process Overview

1. **Install Dependencies**: Set up spaCy and hipe-commons libraries
2. **Download Language Model**: Get the English language model (en_core_web_sm)
3. **Load Model**: Initialize the spaCy pipeline
4. **Process Text**: Apply NER and sentence segmentation
5. **Generate HIPE Format**: Convert spaCy output to HIPE-compliant annotations
6. **Display Results**: Show the HIPE format output with all 10 columns


## Recipe

### Step 1: Install Dependencies

First, install the spaCy library. Run the command in your terminal or uncomment it in the following cell. The hipe-commons package is installed separately in Step 6.


In [None]:
# Install spaCy (uncomment if not already installed)
# !pip install spacy


### Step 2: Download the English Language Model

Download the small English model which includes NER capabilities:


In [None]:
# Download the English language model (uncomment if not already downloaded)
# !python -m spacy download en_core_web_sm


### Step 3: Load spaCy and Initialize the Model


In [None]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

print("spaCy model loaded successfully!")
print(f"Model name: {nlp.meta['name']}")
print(f"Model version: {nlp.meta['version']}")


spaCy model loaded successfully!
Model name: core_web_sm
Model version: 3.8.0


### Step 4: Prepare Sample Text

The sample texts are stored in the `../input/` directory. First list every file found there, then read the first one as the sample text:


In [None]:
import os

input_dir = "../input/"

# Get all files in the input directory
text_files = []
for root, dirs, files in os.walk(input_dir):
    for file in files:
        text_files.append(os.path.join(root, file))

print(f"Found {len(text_files)} files:")
for file in text_files:
    print(f"- {file}")


Found 2 files:
- ../input/text1.txt
- ../input/nested/text2.txt


In [None]:
# Read the first file found as the sample text

with open(text_files[0], 'r', encoding='utf-8') as file:
    sample_text = file.read()

print("Sample text:")
print(sample_text.strip())


Sample text:
Madonna and child; the Virgin seated turned to left and seen three-quarter length, holding the infant Jesus seated on her knee and suckling him, a round composition. c.1641 Etching


### Step 5: Process Text and Extract Sentences with NER


In [None]:
# Process the text with spaCy
doc = nlp(sample_text.strip())

# Extract sentences
sentences = list(doc.sents)

print(f"Found {len(sentences)} sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.text.strip()}")


Found 3 sentences:
1. Madonna and child; the Virgin seated turned to left and seen three-quarter length, holding the infant Jesus seated on her knee and suckling him, a round composition.
2. c.1641
3. Etching


### Step 6: Generate IOB Tags for Each Sentence

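Now we'll create IOB tags for each token in every sentence. The output follows the HIPE 2022 TSV format, which can also represent nested entities. In this recipe only the TOKEN, NE-COARSE-LIT and MISC columns are filled; every other column keeps the placeholder `_`. The format is specified as follows: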

```
Each line consists of 10 columns:
1. TOKEN: the annotated token.
2. NE-COARSE-LIT: the coarse type (IOB-type) of the entity mention token, according to the literal sense.
3. NE-COARSE-METO: the coarse type (IOB-type) of the entity mention token, according to the metonymic sense.
4. NE-FINE-LIT: the fine-grained type (IOB-type.subtype.subtype) of the entity mention token, according to the literal sense.
5. NE-FINE-METO: the fine-grained type (IOB-type.subtype.subtype) of the entity mention token, according to the metonymic sense.
6. NE-FINE-COMP: the component type of the entity mention token.
7. NE-NESTED: the coarse type of the nested entity (if any).
8. NEL-LIT: the Wikidata Q id of the literal sense, or 'NIL' if an entity cannot be linked. Rows without link annotations have value '_'.
9. NEL-METO: the Wikidata Q id of the metonymic sense, or 'NIL'.
10. MISC: a flag which can take the following values:
    - NoSpaceAfter: to indicate the absence of white space after the token.
    - EndOfLine: to indicate the end of a layout line.
    - EndOfSentence: to indicate the end of a sentence.
    - Partial-START:STOP: to indicate the zero-based character on-/offsets of mentions that do not cover the full token (esp. for German compounds). START and STOP follow Python's slicing semantics: "abcd"[1:3] means "bc".
Non-specified values are marked by the underscore character "_".
```
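
To make the MISC column concrete, here is a small sketch that splits a MISC cell into its flags and applies a `Partial-START:STOP` offset with Python slicing. It assumes that multiple flags are joined with `|`, as in the `EndOfLine|EndOfSentence` value written later in this recipe; the function name `parse_misc` and the example token are hypothetical.

```python
def parse_misc(value):
    """Split a MISC cell into simple flags and an optional partial-token slice.

    Assumes several flags are joined with "|" and that "_" means "no flags".
    """
    flags, partial = [], None
    if value != "_":
        for item in value.split("|"):
            if item.startswith("Partial-"):
                start, stop = item[len("Partial-"):].split(":")
                partial = (int(start), int(stop))
            else:
                flags.append(item)
    return flags, partial


flags, partial = parse_misc("NoSpaceAfter|Partial-0:4")
token = "Wienstrasse"  # made-up compound token for illustration
mention = token[partial[0]:partial[1]]
print(flags, partial, mention)
```

Because START and STOP follow slicing semantics, the offsets can be passed to a slice directly without adjusting by one.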

### Installing HIPE PyCommons

```bash
pip install git+https://github.com/hipe-eval/HIPE-pycommons
```

In [None]:
def get_iob_tags(sentence):
    """
    Generate IOB tags for tokens in a sentence following HIPE 2022 TSV format.

    Args:
        sentence: A spaCy Span object representing a sentence

    Returns:
        List of tuples containing the 10 HIPE columns for each token
    """
    iob_tags = []

    for token in sentence:
        # Initialize all columns with default "_" value
        token_data = [
            token.text,  # TOKEN
            "_",        # NE-COARSE-LIT
            "_",        # NE-COARSE-METO
            "_",        # NE-FINE-LIT
            "_",        # NE-FINE-METO
            "_",        # NE-FINE-COMP
            "_",        # NE-NESTED
            "_",        # NEL-LIT
            "_",        # NEL-METO
            "_"         # MISC
        ]

        # Set coarse literal IOB tag if token is part of an entity
        if token.ent_iob_ != "O":
            token_data[1] = f"{token.ent_iob_}-{token.ent_type_}"

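        # Collect MISC flags; a token can carry more than one, joined by "|".
        # (Assigning them one after another would let NoSpaceAfter overwrite
        # EndOfSentence on sentence-final tokens.)
        misc_flags = []
        if not token.whitespace_:
            misc_flags.append("NoSpaceAfter")
        if token.is_sent_end:
            misc_flags.append("EndOfSentence")
        if misc_flags:
            token_data[9] = "|".join(misc_flags)

        iob_tags.append(token_data)

    return iob_tags


# Process each sentence and generate IOB tags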
all_iob_tags = []
for i, sentence in enumerate(sentences, 1):
    print(f"\nSentence {i}: {sentence.text.strip()}")
    print("-" * 50)

    iob_tags = get_iob_tags(sentence)
    all_iob_tags.extend(iob_tags)

    # Display column headers
    headers = ["TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT",
              "NE-FINE-METO", "NE-FINE-COMP", "NE-NESTED", "NEL-LIT",
              "NEL-METO", "MISC"]
    print("\t".join(headers))

    # Display token data
    for token_data in iob_tags:
        print("\t".join(token_data))

# Write IOB tags to TSV file
import os

# Create output directory if it doesn't exist
os.makedirs('../output', exist_ok=True)

# Open file and write headers
with open('../output/sample.tsv', 'w', encoding='utf-8') as f:
    # Write headers
    f.write('\t'.join(headers) + '\n')

    # Write token data for all sentences
    for token_data in all_iob_tags:
        # Mark each sentence end as a line end too, without modifying the stored rows
        flags = [] if token_data[9] == "_" else token_data[9].split("|")
        if "EndOfSentence" in flags and "EndOfLine" not in flags:
            flags.insert(flags.index("EndOfSentence"), "EndOfLine")
        misc = "|".join(flags) if flags else "_"

        f.write('\t'.join(token_data[:9] + [misc]) + '\n')



Sentence 1: Madonna and child; the Virgin seated turned to left and seen three-quarter length, holding the infant Jesus seated on her knee and suckling him, a round composition.
--------------------------------------------------
TOKEN	NE-COARSE-LIT	NE-COARSE-METO	NE-FINE-LIT	NE-FINE-METO	NE-FINE-COMP	NE-NESTED	NEL-LIT	NEL-METO	MISC
Madonna	B-PERSON	_	_	_	_	_	_	_	_
and	_	_	_	_	_	_	_	_	_
child	_	_	_	_	_	_	_	_	NoSpaceAfter
;	_	_	_	_	_	_	_	_	_
the	_	_	_	_	_	_	_	_	_
Virgin	B-PERSON	_	_	_	_	_	_	_	_
seated	_	_	_	_	_	_	_	_	_
turned	_	_	_	_	_	_	_	_	_
to	_	_	_	_	_	_	_	_	_
left	_	_	_	_	_	_	_	_	_
and	_	_	_	_	_	_	_	_	_
seen	_	_	_	_	_	_	_	_	_
three	B-DATE	_	_	_	_	_	_	_	NoSpaceAfter
-	I-DATE	_	_	_	_	_	_	_	NoSpaceAfter
quarter	I-DATE	_	_	_	_	_	_	_	_
length	_	_	_	_	_	_	_	_	NoSpaceAfter
,	_	_	_	_	_	_	_	_	_
holding	_	_	_	_	_	_	_	_	_
the	_	_	_	_	_	_	_	_	_
infant	_	_	_	_	_	_	_	_	_
Jesus	B-PERSON	_	_	_	_	_	_	_	_
seated	_	_	_	_	_	_	_	_	_
on	_	_	_	_	_	_	_	_	_
her	_	_	_	_	_	_	_	_	_
knee	_	_	_	_	_	_	_	_	_
and	
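
Rows in this layout can be folded back into entity mentions by following the B-/I- prefixes in the NE-COARSE-LIT column and re-joining tokens according to the NoSpaceAfter flag. The sketch below is one way to do that; `rows_to_mentions` and the small `row` helper are hypothetical names, and the sample rows copy a few lines from the output above.

```python
def rows_to_mentions(rows):
    """Group consecutive B-/I- tags in column 1 into (text, label) mentions.

    rows: list of 10-item lists in HIPE column order.
    Tokens are glued together when the previous token has NoSpaceAfter.
    """
    mentions = []
    text, label, glue = "", None, False
    for row in rows:
        token, tag, misc = row[0], row[1], row[9]
        starts_entity = tag.startswith("B-") or (tag.startswith("I-") and label is None)
        if starts_entity:
            if label is not None:
                mentions.append((text, label))
            text, label = token, tag[2:]
        elif tag.startswith("I-"):
            text += ("" if glue else " ") + token
        else:
            if label is not None:
                mentions.append((text, label))
            text, label = "", None
        glue = "NoSpaceAfter" in misc.split("|")
    if label is not None:
        mentions.append((text, label))
    return mentions


def row(token, tag="_", misc="_"):
    """Build a 10-column row with only TOKEN, NE-COARSE-LIT and MISC set."""
    return [token, tag, "_", "_", "_", "_", "_", "_", "_", misc]


rows = [
    row("seen"),
    row("three", "B-DATE", "NoSpaceAfter"),
    row("-", "I-DATE", "NoSpaceAfter"),
    row("quarter", "I-DATE"),
    row("length", misc="NoSpaceAfter"),
    row(","),
    row("infant"),
    row("Jesus", "B-PERSON"),
]
mentions = rows_to_mentions(rows)
print(mentions)
```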

### Step 7: Summary of Identified Entities

Let's also create a summary of all the entities found in the text:


In [None]:
# Extract and display all entities
print("\nSummary of Named Entities:")
print("=" * 40)

entities_by_type = {}

for ent in doc.ents:
    entity_type = ent.label_
    entity_text = ent.text

    if entity_type not in entities_by_type:
        entities_by_type[entity_type] = []

    entities_by_type[entity_type].append(entity_text)

# Display entities grouped by type
for entity_type, entities in entities_by_type.items():
    print(f"\n{entity_type}:")
    for entity in entities:
        print(f"  - {entity}")

# Display entity type explanations
print("\nEntity Type Explanations:")
print("-" * 30)
entity_explanations = {
    'ORG': 'Organization (companies, agencies, institutions)',
    'PERSON': 'Person names',
    'GPE': 'Geopolitical entity (countries, cities, states)',
    'DATE': 'Dates or periods',
    'MONEY': 'Monetary values',
    'CARDINAL': 'Numerals that do not fall under another type'
}

for ent_type in entities_by_type.keys():
    explanation = entity_explanations.get(ent_type, 'See spaCy documentation for details')
    print(f"{ent_type}: {explanation}")



Summary of Named Entities:
========================================

PERSON:
  - Madonna
  - Virgin
  - Jesus

DATE:
  - three-quarter

Entity Type Explanations:
------------------------------
PERSON: Person names
DATE: Dates or periods


In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style='ent')

## Variations and Alternatives

### Different spaCy Models
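- **en_core_web_md**: Medium model with word vectors; generally more accurate than the small model
- **en_core_web_lg**: Large model with a bigger word-vector table; typically the most accurate of the CPU-optimized pipelines
- **en_core_web_trf**: Transformer-based model; usually the most accurate overall, but slower and best run on a GPU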

### Alternative Libraries
- **NLTK**: Offers basic NER capabilities with different models
- **Hugging Face Transformers**: Provides pre-trained transformer models for NER
- **Stanza**: Stanford's NLP library with robust NER capabilities

### Custom Entity Recognition
- Train custom spaCy models for domain-specific entities
- Use rule-based matching for specific patterns
- Combine multiple models for improved coverage
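
As a sketch of the rule-based option, spaCy's `entity_ruler` component can tag token patterns without a trained model. The example uses a blank English pipeline so no model download is needed; the pattern and the choice of the PERSON label are illustrative assumptions, not a recommendation for real data.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# The entity ruler assigns labels to spans that match the given patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Hypothetical rule: treat "infant Jesus" (any casing) as one PERSON span.
    {"label": "PERSON", "pattern": [{"LOWER": "infant"}, {"LOWER": "jesus"}]},
])

doc = nlp("holding the infant Jesus seated on her knee")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The same component can also be added to a trained pipeline, where its position relative to the statistical `ner` component decides which annotations take precedence.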

### Output Formats
- **BILOU tagging**: More detailed than IOB (Begin, Inside, Last, Outside, Unit)
- **JSON format**: Structured output for API integration
- **CoNLL format**: Standard format for NLP competitions and research
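
To show how BILOU relates to IOB, here is a minimal conversion sketch. It assumes well-formed IOB input in which tokens outside any entity are tagged `O` (this recipe's TSV output writes `_` instead, so those would need mapping first); the function name `iob_to_bilou` is made up for this example.

```python
def iob_to_bilou(tags):
    """Convert a list of IOB tags to BILOU tags.

    A single-token entity becomes U- (unit); the final token of a longer
    entity becomes L- (last). Assumes well-formed IOB input with "O" for
    tokens outside any entity.
    """
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"   # does the entity go on?
        if prefix == "B":
            bilou.append(f"B-{label}" if continues else f"U-{label}")
        else:
            bilou.append(f"I-{label}" if continues else f"L-{label}")
    return bilou


converted = iob_to_bilou(["B-PERSON", "O", "B-DATE", "I-DATE", "I-DATE", "O"])
print(converted)
```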

### Performance Considerations
- For large texts, process in batches using `nlp.pipe()`
- Disable unused pipeline components to improve speed
- Use GPU acceleration for transformer models
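
The first two points can be sketched as follows. To stay runnable without a model download, the example uses a blank English pipeline with the rule-based `sentencizer` rather than `en_core_web_sm`; the texts are made up, and the batch size is an arbitrary choice.

```python
import spacy

# Blank pipeline plus a rule-based sentence splitter (no trained model needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = ["Madonna and child. A round composition.", "An etching."]

# nlp.pipe streams documents in batches instead of one nlp(text) call per text.
counts = [len(list(doc.sents)) for doc in nlp.pipe(texts, batch_size=2)]
print(counts)

# select_pipes temporarily disables components that are not needed.
with nlp.select_pipes(disable=["sentencizer"]):
    active = list(nlp.pipe_names)
print(active)
```

With a trained pipeline, `spacy.load` also accepts a `disable` argument. Note that in `en_core_web_sm` the sentence boundaries used by `doc.sents` come from the parser by default, so disabling the parser would break the sentence segmentation step of this recipe.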
