# Named-entity Recognition (NER) 

Named entity recognition is a fundamental task in information extraction from textual documents. Named entities originally referred to real-world entities with proper names, but the concept has since been extended to almost any type of information: it is possible to extract chemical molecules, product numbers, amounts, addresses, etc. In this practical assignment, we will apply several named entity extraction libraries to a small French corpus. The objective is not to train the best possible model, but to test the use of each of these libraries.



## The AdminSet dataset
The AdminSet dataset is a corpus of administrative documents in French produced by automatic character recognition and manually annotated with named entities. This corpus is quite difficult because the document recognition process produces noisy text (errors due to layout, recognition, fonts, etc.).

The paper describing the dataset is available [here](https://hal.science/hal-04855066v1/file/AdminSet_et_AdminBERT__version___preprint.pdf).

The corpus is available on HuggingFace: [Adminset-NER](https://huggingface.co/datasets/taln-ls2n/Adminset-NER).

In [63]:
from datasets import load_dataset
ds = load_dataset('taln-ls2n/Adminset-NER')
print(ds)



DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 729
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 85
    })
})


#### Question
> * Compute descriptive statistics on the texts for each split (train, dev)
> * Compute descriptive statistics on the entities for each split (train, dev)
> * Compare with the statistics reported in the paper (Table 2)
> * Display a couple of random texts with their entities

In [65]:
import numpy as np
import pandas as pd

train_df = pd.DataFrame(ds['train'])
train_df.head()

Unnamed: 0,tokens,ner_tags
0,"[fin, Procès-Verbal, Conseil, communautaire, d...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"[Monsieur, MORLET, excuse, Monsieur, Christoph...","[B-PER, I-PER, O, B-PER, I-PER, I-PER, O, O, O..."
2,"[Monsieur, MORLET, annonce, le, décès, de, Mon...","[B-PER, I-PER, O, O, O, O, B-PER, I-PER, I-PER..."
3,"[Commentaires, ,, débat, Constatant, qu'il, n'...","[O, O, O, O, O, O, O, O, O, O, O, B-PER, I-PER..."
4,"[Page, 4, sur, 15, <, page, >, 4, <, /, page, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [66]:
val_df = pd.DataFrame(ds['validation'])
val_df.head()

Unnamed: 0,tokens,ner_tags
0,"[et, L’Office, Communautaire, d’Animations, et...","[O, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG, ..."
1,"[Signé, le, 22, février, 2024, Reçu, au, Contr...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[Reçu, au, Contrôle, de, légalité, le, 12, déc...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Etaient, absents, et, représentés, Mesdames, ...","[O, O, O, O, O, O, O, O, B-PER, I-PER, O, O, B..."
4,"[Commune, d'Ollioules, -, Departement, du, Var...","[B-LOC, I-LOC, O, B-LOC, I-LOC, I-LOC, O, O, O..."


In [67]:
train_df.shape, val_df.shape

((729, 2), (85, 2))

In [68]:
# Compute statistics on the number of tokens per document in train and validation: min, max, mean, std, median

import numpy as np
from collections import Counter
import random

# for train set
train_df["n_tokens"] = train_df["tokens"].apply(len)
train_df.head()


Unnamed: 0,tokens,ner_tags,n_tokens
0,"[fin, Procès-Verbal, Conseil, communautaire, d...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",63
1,"[Monsieur, MORLET, excuse, Monsieur, Christoph...","[B-PER, I-PER, O, B-PER, I-PER, I-PER, O, O, O...",24
2,"[Monsieur, MORLET, annonce, le, décès, de, Mon...","[B-PER, I-PER, O, O, O, O, B-PER, I-PER, I-PER...",31
3,"[Commentaires, ,, débat, Constatant, qu'il, n'...","[O, O, O, O, O, O, O, O, O, O, O, B-PER, I-PER...",18
4,"[Page, 4, sur, 15, <, page, >, 4, <, /, page, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...",41


In [69]:
print("TRAIN TOKEN STATS")
print(train_df["n_tokens"].describe())
print("Median:", train_df["n_tokens"].median())

TRAIN TOKEN STATS
count    729.000000
mean      63.367627
std       52.705843
min       15.000000
25%       30.000000
50%       45.000000
75%       75.000000
max      379.000000
Name: n_tokens, dtype: float64
Median: 45.0


In [70]:
# for validation set
val_df["n_tokens"] = val_df["tokens"].apply(len)

print("\nVALIDATION TOKEN STATS")
print(val_df["n_tokens"].describe())
print("Median:", val_df["n_tokens"].median())



VALIDATION TOKEN STATS
count     85.000000
mean      79.835294
std       68.679006
min       19.000000
25%       35.000000
50%       50.000000
75%       86.000000
max      352.000000
Name: n_tokens, dtype: float64
Median: 50.0


**Table 2 of the paper**

<img src="images/paper_table2.png" width="500">


In [71]:
train_tags = train_df["ner_tags"].explode()

# remove "O"
train_entities = train_tags[train_tags != "O"]

print("\nTRAIN ENTITY STATS (Token-level)")
print("Total entity tokens:", len(train_entities))
print("Number of entity labels:", train_entities.nunique())
print(train_entities.value_counts())



TRAIN ENTITY STATS (Token-level)
Total entity tokens: 4983
Number of entity labels: 6
ner_tags
I-ORG    1476
I-PER    1092
B-ORG     770
B-PER     764
B-LOC     454
I-LOC     427
Name: count, dtype: int64


In [72]:
val_tags = val_df["ner_tags"].explode()

val_entities = val_tags[val_tags != "O"]

print("\nVALIDATION ENTITY STATS (Token-level)")
print("Total entity tokens:", len(val_entities))
print("Number of entity labels:", val_entities.nunique())
print(val_entities.value_counts())



VALIDATION ENTITY STATS (Token-level)
Total entity tokens: 694
Number of entity labels: 6
ner_tags
I-ORG    203
I-PER    138
B-PER    124
B-ORG    123
I-LOC     54
B-LOC     52
Name: count, dtype: int64


In [73]:
train_b_entities = train_entities[train_entities.str.startswith("B-")]

print("\nTRAIN REAL ENTITIES (Span-level)")
print("Total entities:", len(train_b_entities))
print("Entity types:", train_b_entities.str[2:].value_counts())



TRAIN REAL ENTITIES (Span-level)
Total entities: 1988
Entity types: ner_tags
ORG    770
PER    764
LOC    454
Name: count, dtype: int64


In [74]:
val_b_entities = val_entities[val_entities.str.startswith("B-")]

print("\nVALIDATION REAL ENTITIES (Span-level)")
print("Total entities:", len(val_b_entities))
print("Entity types:", val_b_entities.str[2:].value_counts())



VALIDATION REAL ENTITIES (Span-level)
Total entities: 299
Entity types: ner_tags
PER    124
ORG    123
LOC     52
Name: count, dtype: int64


In [75]:
sample = train_df.sample(2)

for i, row in sample.iterrows():
    print("\n" + "="*60)
    print("Text:")
    print(" ".join(row["tokens"]))
    
    print("\nEntities:")
    
    current_entity = []
    current_label = None
    
    for token, tag in zip(row["tokens"], row["ner_tags"]):
        
        if tag.startswith("B-"):
            if current_entity:
                print(" ".join(current_entity), "->", current_label)
            current_entity = [token]
            current_label = tag[2:]
        
        elif tag.startswith("I-"):
            current_entity.append(token)
        
        else:
            if current_entity:
                print(" ".join(current_entity), "->", current_label)
                current_entity = []
                current_label = None
    
    if current_entity:
        print(" ".join(current_entity), "->", current_label)



Text:
Fait à SAINT ETIENNE le 08 juin 2023 ( En deux exemplaires ) Pour le S.A.T.P E 21 olas hors les murs » PLE A S OR s v p» Rue Désiré Claude 4000 anti l. 04 7730 07 secretariatoeadultesprismer Pour LE DEPARTEMENT DE LA LOIRE Le Président M. Georges ZIEGLER S.A.T « Hors les murs » Prisme 21 Loire - 40 rue Désiré Claude 42100 SAINT ETIENNE 2023 / 2024 Page 9 / 10 < page > 9 < / page > LISTE ET COORDONNÉES DES PROFESSIONNELS DU SERVICE HABILITÉS À ACCOMPAGNER M. Y.C.

Entities:

Text:
Ont été équipées les Maisons de la Communauté Côte- Basque-Adour et Sud Pays Basque , la Villa Tarride et le Centre Technique de l’Environnement .

Entities:
Communauté Côte- Basque-Adour -> LOC
Sud Pays Basque -> LOC
Villa Tarride -> LOC
Centre Technique de l’Environnement -> LOC
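
The grouping logic used in the display loop above can be factored into a reusable helper. The following is a minimal sketch, not part of the original notebook: the name `bio_to_spans` and the toy sentence are hypothetical, and a stray `I-` tag with no open entity is treated here as the start of a new entity, which is one possible convention among several.

```python
from collections import Counter


def bio_to_spans(tokens, tags):
    """Group BIO tags into (text, label, start, end) tuples; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: a stray I- tag opens a new entity instead of being dropped.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((" ".join(tokens[start:]), label, start, len(tags)))
    return spans


# Toy example (hypothetical sentence, not taken from the corpus)
tokens = ["Monsieur", "MORLET", "habite", "à", "Saint", "Etienne"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]

spans = bio_to_spans(tokens, tags)
label_counts = Counter(label for _, label, _, _ in spans)
print(spans)
print(label_counts)
```

Applied to every row of a split, the same helper gives span-level counts and entity lengths in tokens without relying on counting `B-` tags only.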


### Creation of the splits

The ```train_test_split()``` method of the Hugging Face ```datasets``` library splits a dataset randomly into two parts: https://huggingface.co/docs/datasets/v4.5.0/process#split

The ```spacy_utils.py``` file contains functions to save a dataset in text format (```save_text```, useful for inspection), BIO format (```save_bio```) and spaCy format (```save_docbin```).
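
For reference, BIO files are commonly written in a CoNLL-style layout: one token and its tag per line, with a blank line between examples. The sketch below shows that layout only; the function name `to_bio_text` and the tab separator are assumptions, and the actual output of ```save_bio``` in ```spacy_utils.py``` may differ.

```python
def to_bio_text(examples):
    """Render (tokens, tags) pairs as 'token<TAB>tag' lines, one blank line between examples."""
    blocks = []
    for tokens, tags in examples:
        if len(tokens) != len(tags):
            raise ValueError("tokens and tags must have the same length")
        blocks.append("\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags)))
    return "\n\n".join(blocks) + "\n"


# Toy examples (hypothetical, not taken from the corpus)
examples = [
    (["Monsieur", "MORLET", "excuse"], ["B-PER", "I-PER", "O"]),
    (["Commune", "d'Ollioules"], ["B-LOC", "I-LOC"]),
]
bio_text = to_bio_text(examples)
print(bio_text)
```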

#### Questions
>* Using the split function, create a train/dev/test split corresponding to the proportions reported in the paper
>* Save the sets in a corpus directory, in text, bio and docbin formats.

In [76]:
from spacy_utils import save_bio, save_text, save_docbin
from datasets import concatenate_datasets

full_ds = concatenate_datasets([ds["train"], ds["validation"]])
full_ds.shape

(814, 2)

In [77]:
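# Sequential (non-random) split: rows 0-728 come from the original train set,
# so the original validation set (last 85 rows) becomes the test set.
train_ds = full_ds.select(range(0, 583))
dev_ds   = full_ds.select(range(583, 583 + 146))
test_ds  = full_ds.select(range(583 + 146, 814))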

train_ds.shape, dev_ds.shape, test_ds.shape

((583, 2), (146, 2), (85, 2))

In [78]:
train_ds

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 583
})

In [79]:
dev_ds

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 146
})

In [80]:
test_ds

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 85
})

In [81]:
from spacy_utils import save_bio, save_text, save_docbin

# save the datasets in different formats
save_text(train_ds, "corpus/train.txt")
save_text(dev_ds, "corpus/dev.txt")
save_text(test_ds, "corpus/test.txt")

save_bio(train_ds, "corpus/train.bio")
save_bio(dev_ds, "corpus/dev.bio")
save_bio(test_ds, "corpus/test.bio")

save_docbin(train_ds, "corpus/train.spacy")
save_docbin(dev_ds, "corpus/dev.spacy")
save_docbin(test_ds, "corpus/test.spacy")


Saving text to corpus/train.txt...


100%|██████████| 583/583 [00:00<00:00, 9283.59it/s]


Saved to corpus/train.txt
Saving text to corpus/dev.txt...


100%|██████████| 146/146 [00:00<00:00, 12646.23it/s]


Saved to corpus/dev.txt
Saving text to corpus/test.txt...


100%|██████████| 85/85 [00:00<00:00, 9055.06it/s]


Saved to corpus/test.txt
Saving BIO text to corpus/train.bio...


100%|██████████| 583/583 [00:00<00:00, 16618.61it/s]


Saved to corpus/train.bio
Saving BIO text to corpus/dev.bio...


100%|██████████| 146/146 [00:00<00:00, 18605.67it/s]


Saved to corpus/dev.bio
Saving BIO text to corpus/test.bio...


100%|██████████| 85/85 [00:00<00:00, 11457.27it/s]


Saved to corpus/test.bio
Creating corpus/train.spacy with 583 examples...


100%|██████████| 583/583 [00:00<00:00, 5377.82it/s]


Saved to corpus/train.spacy
Creating corpus/dev.spacy with 146 examples...


100%|██████████| 146/146 [00:00<00:00, 4530.66it/s]


Saved to corpus/dev.spacy
Creating corpus/test.spacy with 85 examples...


100%|██████████| 85/85 [00:00<00:00, 3595.11it/s]

Saved to corpus/test.spacy





### Testing spaCy pre-trained NER models

spaCy comes with several pretrained models for many languages. For French, 4 models are provided: https://spacy.io/models/fr

To apply a pretrained model to a dataset, use:
- ```nlp = spacy.load(MODEL_NAME)``` to load the model. You need to download it first with ```python -m spacy download MODEL_NAME```
- ```DocBin().from_disk()``` to load a dataset in spaCy format from the disk
- ```doc_bin.get_docs(nlp.vocab)``` to deserialize the binary dataset into spaCy ```Doc``` objects
- ```nlp(doc.text)``` to apply the NER model to a text

To evaluate the prediction, you can use the spaCy [Scorer](https://spacy.io/api/scorer)
- ```scorer.score(examples)``` where examples is a list of spaCy ```Example(prediction, reference)```
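
The scoring step can be tried without downloading any model. The sketch below builds one reference document and one hypothetical prediction by hand, then computes span-level scores with the static method ```Scorer.score_spans```, used here as a lightweight alternative to ```scorer.score(examples)```. The sentence and the predicted entities are invented for the illustration.

```python
import spacy
from spacy.scorer import Scorer
from spacy.tokens import Doc, Span
from spacy.training import Example

nlp = spacy.blank("fr")  # tokenizer-only pipeline, no model download needed
words = ["Monsieur", "MORLET", "habite", "à", "Lyon"]

# Reference (gold) annotation: one person and one location.
reference = Doc(nlp.vocab, words=words)
reference.ents = [Span(reference, 0, 2, label="PER"), Span(reference, 4, 5, label="LOC")]

# Hypothetical prediction: the person is found, the location is missed.
predicted = Doc(nlp.vocab, words=words)
predicted.ents = [Span(predicted, 0, 2, label="PER")]

examples = [Example(predicted, reference)]
scores = Scorer.score_spans(examples, "ents")

print("precision:", scores["ents_p"])
print("recall:", scores["ents_r"])
print("f-score:", scores["ents_f"])
print("per type:", scores["ents_per_type"])
```

With a real model, ```predicted``` would simply be ```nlp(reference.text)```. Labels that the model predicts but that do not exist in the reference annotation count as false positives, so it may be worth checking the model's label set before comparing scores.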

#### Question

>* Using a spaCy pretrained model for French, evaluate its performance for NER prediction on the train, dev and test sets
>* Compare this model to results reported in the paper

In [84]:
import sys
print(sys.executable)
print(sys.version)


/Users/nysarakpy/github/NLP-Course-Ensae/.venv/bin/python
3.12.6 (v3.12.6:a4a2d2b0d85, Sep  6 2024, 16:08:03) [Clang 13.0.0 (clang-1300.0.29.30)]


In [87]:
import spacy
# The model must be installed first: python -m spacy download fr_core_news_md
nlp = spacy.load("fr_core_news_md")
print("Model loaded successfully")


OSError: [E050] Can't find model 'fr_core_news_md'. It doesn't seem to be a Python package or a valid path to a data directory.

In [None]:
import spacy
from spacy.tokens import DocBin
from spacy.scorer import Scorer
from spacy.training import Example
from tqdm import tqdm
from prettytable import PrettyTable

nlp = spacy.load("fr_core_news_md")


OSError: [E050] Can't find model 'fr_core_news_md'. It doesn't seem to be a Python package or a valid path to a data directory.

### Training a custom spaCy model

A custom spaCy NER model can be trained either with the command line interface (CLI) or from a Python script. Using the CLI is usually more optimized. The whole training setup is defined in a configuration file, which is good practice for documentation, tracing and reproducibility.

The configuration file can be generated online using the [Quickstart](https://spacy.io/usage/training#quickstart)

<img src="images/spacy_quickstart.jpg" width="600" >

You can run the training process as a script using the train function (https://spacy.io/usage/training#api-train), specifying the configuration file and the directory in which to save the model as parameters. Once the training is complete, the best and last models are saved in the directory.
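
A minimal sketch of that call is given below. The file name ```config.cfg```, the output directory ```models/ner_fr``` and the helper `build_overrides` are assumptions made for the illustration; the ```paths.train``` and ```paths.dev``` keys correspond to the ```[paths]``` section of a Quickstart-generated configuration. The training call is guarded so that the sketch does nothing when the configuration file is missing.

```python
from pathlib import Path


def build_overrides(train_path, dev_path):
    """Map the corpus files onto the config's [paths] section (dotted keys)."""
    return {"paths.train": str(train_path), "paths.dev": str(dev_path)}


config_path = Path("config.cfg")    # hypothetical: config exported from the Quickstart
output_dir = Path("models/ner_fr")  # hypothetical output directory
overrides = build_overrides("corpus/train.spacy", "corpus/dev.spacy")

if config_path.exists():
    # Imported here so that spaCy is only loaded when training actually runs.
    from spacy.cli.train import train

    # Saves model-best and model-last under output_dir.
    train(config_path, output_dir, overrides=overrides)
else:
    print(f"{config_path} not found: generate it with the Quickstart first")
```

Passing the corpus paths as overrides keeps the configuration file itself free of machine-specific paths.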

#### Question
> * Generate a training configuration file for a French NER model
> * Add the correct paths to the training and dev sets generated previously
> * Train a NER model
> * Evaluate the model on the train, dev and test sets. Compare to the results reported in the paper.

In [None]:
# train the model
from spacy.cli.train import train


In [None]:
# evaluate the model

### Zero-shot NER prediction with GLiNER


[GLiNER](https://github.com/fastino-ai/GLiNER2/tree/main) is a library that provides models for zero-shot named entity recognition. This means that the entity types to extract are given at prediction time, without training the model on annotated examples of those types. It also supports [structured information extraction](https://github.com/fastino-ai/GLiNER2/blob/main/tutorial/3-json_extraction.md), which means that the extracted information can be organised in a structured JSON format. GLiNER does not provide the location of entities in the text by default, but you can configure the model to output this information (```include_spans=True```). Finally, GLiNER allows entities to overlap and to be nested, which is not supported by the spaCy scorer. The spaCy [filter_spans](https://spacy.io/api/top-level#util.filter_spans) function can be used to remove overlapping entities for evaluation.
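
The effect of ```filter_spans``` can be seen on a small hand-built example. The sentence and the candidate entities below are hypothetical and stand in for a model output containing a nested entity; when spans overlap, ```filter_spans``` keeps the longest one.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

nlp = spacy.blank("fr")
words = ["Le", "Conseil", "départemental", "de", "la", "Loire", "siège", "à", "Saint-Etienne"]
doc = Doc(nlp.vocab, words=words)

# Hypothetical model output with a nested entity: "Loire" lies inside the ORG span.
candidates = [
    Span(doc, 1, 6, label="ORG"),  # Conseil départemental de la Loire
    Span(doc, 5, 6, label="LOC"),  # Loire
    Span(doc, 8, 9, label="LOC"),  # Saint-Etienne
]

# doc.ents rejects overlapping spans, so they are filtered first.
kept = filter_spans(candidates)
doc.ents = kept

result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

Note that this choice discards the nested ```LOC``` entity, so a nested prediction that matches the reference annotation would be lost before scoring.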

#### Question
> * Define the entities to extract from the text.
> * Apply GLiNER on the dev and test sets
> * Evaluate the models on the dev and test sets and compare to the results reported in the paper.

In [None]:
import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import filter_spans
from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
nlp = spacy.blank("fr")  # tokenizer only

doc_bin = DocBin().from_disk("REPLACE")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

label_map = {
    # Define the entities here
}

gliner_labels = list(label_map.values())
reverse_map = {v: k for k, v in label_map.items()}

examples = []

for gold_doc in gold_docs:
    text = gold_doc.text
    
    predictions = extractor.extract_entities(text, gliner_labels, include_spans=True)
    pred_doc = nlp.make_doc(text)

    spans = []
    for gliner_label, entities in predictions["entities"].items():
        spacy_label = reverse_map.get(gliner_label)
        if not spacy_label:
            continue
        for ent in entities:
            start = ent["start"]
            end = ent["end"]

            span = pred_doc.char_span(start, end, label=spacy_label)
            if span:
                spans.append(span)

    spans = filter_spans(spans)
    pred_doc.ents = spans
    examples.append(Example(pred_doc, gold_doc))

