# Named-Entity Recognition (NER)

Named entity recognition (NER) is a fundamental task in information extraction from text. Named entities originally referred to real-world entities that have a name, but the concept has been extended to any type of information: it is possible to extract chemical molecules, product numbers, amounts, addresses, etc. In this practical assignment, we will apply several named entity recognition libraries to a small French corpus. The objective is not to train the best possible model, but to try out each of these libraries.



## The AdminSet dataset
The AdminSet dataset is a corpus of French administrative documents produced by optical character recognition (OCR) and manually annotated with named entities. The corpus is quite difficult because the recognition process produces noisy text (errors due to layout, character recognition, fonts, etc.).

The paper describing the dataset is available [here](https://hal.science/hal-04855066v1/file/AdminSet_et_AdminBERT__version___preprint.pdf).

The corpus is available on HuggingFace: [Adminset-NER](https://huggingface.co/datasets/taln-ls2n/Adminset-NER).

In [1]:
from datasets import load_dataset
ds = load_dataset('taln-ls2n/Adminset-NER')
print(ds)


DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 729
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 85
    })
})





#### Question
> * Compute descriptive statistics on the texts for each split (train, dev)
> * Compute descriptive statistics on the entities for each split (train, dev)
> * Compare with the statistics reported in the paper (Table 2)
> * Display a couple of random texts with their entities
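
As a starting point for these statistics, the sketch below computes token-length statistics and per-type entity counts for one split in plain Python. It assumes each example is a dict with `tokens` and `ner_tags` keys and that tags are BIO strings, as in the dataset printed above; the function name `describe_split` and the toy examples are illustrative only. Counting only the `B-` tags gives the number of entity mentions, which is usually the figure to compare with a paper's table.

```python
from collections import Counter
from statistics import mean, median, pstdev

def describe_split(examples):
    """Token-length statistics and entity-mention counts for one split.

    `examples` is a list (or Hugging Face Dataset) of dicts with 'tokens'
    and 'ner_tags' keys, where tags are BIO strings such as 'B-PER' or 'O'.
    """
    lengths = [len(ex["tokens"]) for ex in examples]
    mentions = Counter()
    for ex in examples:
        for tag in ex["ner_tags"]:
            # Every mention starts with exactly one B- tag, so counting B- tags
            # counts mentions rather than entity tokens.
            if tag.startswith("B-"):
                mentions[tag[2:]] += 1
    return {
        "n_texts": len(lengths),
        "n_tokens": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "std": pstdev(lengths),  # population std, like np.std
        "median": median(lengths),
        "mentions": dict(mentions),
    }

# Toy examples (hypothetical, for illustration only)
toy = [
    {"tokens": ["Monsieur", "Vincent", "MORLET", ",", "Président"],
     "ner_tags": ["B-PER", "I-PER", "I-PER", "O", "O"]},
    {"tokens": ["réuni", "à", "Coucy", "le", "Château"],
     "ner_tags": ["O", "O", "B-LOC", "I-LOC", "I-LOC"]},
    {"tokens": ["le", "conseil", "communautaire"],
     "ner_tags": ["O", "O", "O"]},
]
stats = describe_split(toy)
print(stats)
```

The same function can be called on `ds["train"]` and `ds["validation"]` to get one row of statistics per split.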

In [2]:
# Compute statistics on the number of tokens in train and validation: min, max, mean, std, median
import numpy as np
from collections import Counter
import random
import pandas as pd

In [3]:
temp = ds["train"].map(lambda x: {"length": len(x["tokens"])})
temp["length"]



Column([63, 24, 31, 18, 41, ...])

In [4]:
t = temp["length"]
print(min(t), max(t), np.mean(t), np.std(t), np.median(t))

15 379 63.3676268861454 52.66968106175962 45.0


In [5]:
np.sum(temp["length"])

np.int64(46195)

In [6]:
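# NB: B- and I- tags are summed, so these are counts of entity *tokens* per text;
# counting only the B- tags would give the number of entity mentions.
temp = ds["train"].map(lambda x: {
    "length": len(x["ner_tags"]),
    "nb_ORG": x["ner_tags"].count("B-ORG")+x["ner_tags"].count("I-ORG"),
    "nb_PER": x["ner_tags"].count("B-PER")+x["ner_tags"].count("I-PER"),
    "nb_LOC": x["ner_tags"].count("B-LOC")+x["ner_tags"].count("I-LOC")
})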
temp["nb_PER"]



Column([3, 5, 5, 2, 2, ...])

In [7]:
t = temp["nb_ORG"]
print(min(t), max(t), np.mean(t), np.std(t), np.median(t), np.sum(t))

0 35 3.0809327846364885 4.816555948106844 1.0 2246


In [8]:
list(zip(ds["train"][0]["tokens"],ds["train"][0]["ner_tags"]))

[('fin', 'O'),
 ('Procès-Verbal', 'O'),
 ('Conseil', 'O'),
 ('communautaire', 'O'),
 ('du', 'O'),
 ('lundi', 'O'),
 ('19', 'O'),
 ('juin', 'O'),
 ('2023', 'O'),
 ("L'an", 'O'),
 ('deux', 'O'),
 ('mille', 'O'),
 ('vingt-trois', 'O'),
 (',', 'O'),
 ('le', 'O'),
 ('lundi', 'O'),
 ('19', 'O'),
 ('juin', 'O'),
 (',', 'O'),
 ('à', 'O'),
 ('18', 'O'),
 ('heures', 'O'),
 ('30', 'O'),
 (',', 'O'),
 ('le', 'O'),
 ('conseil', 'O'),
 ('communautaire', 'O'),
 ("s'est", 'O'),
 ('réuni', 'O'),
 ('à', 'O'),
 ('Coucy', 'B-LOC'),
 ('le', 'I-LOC'),
 ('Château', 'I-LOC'),
 ('conformément', 'O'),
 ('à', 'O'),
 ("l'article", 'O'),
 ('2122-17', 'O'),
 ('du', 'O'),
 ('Code', 'O'),
 ('général', 'O'),
 ('des', 'O'),
 ('Collectivités', 'O'),
 ('Territoriales', 'O'),
 ('sur', 'O'),
 ('la', 'O'),
 ('convocation', 'O'),
 ('de', 'O'),
 ('Monsieur', 'B-PER'),
 ('Vincent', 'I-PER'),
 ('MORLET', 'I-PER'),
 (',', 'O'),
 ('Président', 'O'),
 (',', 'O'),
 ('adressée', 'O'),
 ('aux', 'O'),
 ('délégués', 'O')

In [32]:
ds_train.num_rows

583

In [None]:
# for item in ds_train

### Creation of the splits

The ```train_test_split()``` method from the Hugging Face ```datasets``` library randomly splits a dataset into two parts: https://huggingface.co/docs/datasets/v4.5.0/process#split

The ```spacy_utils.py``` file contains functions to save a dataset in text format (```save_text```, useful for inspection), BIO format (```save_bio```) and spaCy format (```save_docbin```).
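
To give an idea of what a BIO file typically looks like, here is a minimal sketch of a writer in plain Python. It is not the implementation of ```save_bio``` in ```spacy_utils.py```, whose exact output format is not shown here; the layout (one `token<TAB>tag` pair per line, a blank line between texts) is an assumption based on the common CoNLL convention, and `write_bio` is a hypothetical helper name.

```python
import io

def write_bio(examples, handle):
    """Write one 'token<TAB>tag' pair per line, with a blank line after each text."""
    for ex in examples:
        for token, tag in zip(ex["tokens"], ex["ner_tags"]):
            handle.write(f"{token}\t{tag}\n")
        handle.write("\n")  # blank line separates texts

# Toy examples (hypothetical, for illustration only)
toy = [
    {"tokens": ["à", "Coucy", "le", "Château"],
     "ner_tags": ["O", "B-LOC", "I-LOC", "I-LOC"]},
    {"tokens": ["Vincent", "MORLET"],
     "ner_tags": ["B-PER", "I-PER"]},
]

buffer = io.StringIO()  # an open file handle would work the same way
write_bio(toy, buffer)
bio_text = buffer.getvalue()
print(bio_text)
```

A format like this is convenient for eyeballing annotations, since each token sits next to its tag.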

#### Questions
>* Using the split function, create a train/dev/test split corresponding to the proportions reported in the paper
>* Save the sets in a corpus directory, in text, bio and docbin formats.

In [23]:
from spacy_utils import save_bio, save_text, save_docbin
from spacy.tokens import Doc, DocBin
import spacy

In [28]:
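# The hub dataset only ships train/validation splits: hold out 20% of train as dev
# and keep the original validation split as test. Passing seed=... to
# train_test_split would make the random split reproducible.
temp = ds["train"].train_test_split(test_size=0.2)
ds_train = temp["train"]
ds_dev = temp["test"]
ds_test = ds["validation"]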

In [35]:
import spacy_utils
def aux(d,output_path):
    save_text(d,"corpus/"+output_path+".txt")
    save_bio(d,"corpus/"+output_path+".bio")
    save_docbin(d,"corpus/"+output_path+".docbin")
aux(ds_train,"train")
aux(ds_dev,"dev")
aux(ds_test,"test")

Saving text to corpus/train.txt...
Saved to corpus/train.txt
Saving BIO text to corpus/train.bio...
Saved to corpus/train.bio
Creating corpus/train.docbin with 583 examples...
Saved to corpus/train.docbin
Saving text to corpus/dev.txt...
Saved to corpus/dev.txt
Saving BIO text to corpus/dev.bio...
Saved to corpus/dev.bio
Creating corpus/dev.docbin with 146 examples...
Saved to corpus/dev.docbin
Saving text to corpus/test.txt...
Saved to corpus/test.txt
Saving BIO text to corpus/test.bio...
Saved to corpus/test.bio
Creating corpus/test.docbin with 85 examples...
Saved to corpus/test.docbin





### Testing spaCy pre-trained NER models

spaCy comes with several pretrained models for many languages. For French, 4 models are provided: https://spacy.io/models/fr

To apply a pretrained model to a dataset, use:
- ```nlp = spacy.load(MODEL_NAME)``` to load the model. You need to download it first with ```python -m spacy download MODEL_NAME```
- ```DocBin().from_disk(PATH)``` to load a dataset in spaCy format from the disk
- ```doc_bin.get_docs(nlp.vocab)``` to rebuild the annotated ```Doc``` objects from the binary format
- ```nlp(doc.text)``` to apply the NER model to a text

To evaluate the predictions, you can use the spaCy [Scorer](https://spacy.io/api/scorer):
- ```scorer.score(examples)``` where ```examples``` is a list of spaCy ```Example(prediction, reference)``` objects
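
To make the entity-level scores (`ents_p`, `ents_r`, `ents_f`) more concrete, the sketch below recomputes precision, recall and F1 for a single text from two BIO tag sequences. It is a plain-Python illustration of exact-match scoring (a predicted entity counts only if its boundaries and label both match a gold entity), not the spaCy implementation; the helper names and the toy sequences are hypothetical.

```python
def bio_to_spans(tags):
    """Return the set of (start, end, label) spans encoded by a BIO tag sequence."""
    spans, start, label = set(), None, None
    # A final "O" sentinel closes a span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An I- tag with no open span of the same type is ignored.
    return spans

def entity_prf(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)  # same boundaries and same label
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy sequences: the last LOC entity is predicted with the wrong boundaries
gold = ["B-ORG", "I-ORG", "O", "B-LOC", "O", "B-LOC", "I-LOC"]
pred = ["B-ORG", "I-ORG", "O", "B-LOC", "O", "O", "B-LOC"]
p, r, f = entity_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Note that a partially correct entity is penalised twice: once as a false positive and once as a false negative. On a whole corpus, the counts are usually summed over all texts before computing the ratios.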

#### Question

>* Using a spaCy pretrained model for French, evaluate its performance for NER prediction on the train, dev and test sets
>* Compare this model to results reported in the paper

In [37]:
import spacy
from spacy.tokens import DocBin
from spacy.scorer import Scorer
from spacy.training import Example
from tqdm import tqdm
from prettytable import PrettyTable # optional but nice

In [51]:
# python -m spacy download fr_core_news_sm
nlp = spacy.load('fr_core_news_sm')
doc_bin = DocBin().from_disk("corpus/train.docbin")
doc_list = list(doc_bin.get_docs(nlp.vocab))

In [84]:
type(doc_list[0].text)

str

In [85]:
toto = nlp(doc_list[0].text)

In [86]:
for token in toto:
    print(token.text, token.lemma_, token.ent_type_)

Il il 
développe développer 
des un 
actions action 
et et 
des un 
projets projet 
autour autour 
de de 
l' le 
image image 
, , 
du de 
développement développement 
durable durable 
, , 
de de 
l' le 
aménagement aménagement 
urbain urbain 
, , 
du de 
patrimoine patrimoine 
, , 
etc. etc. 
Le le ORG
Comité comité ORG
engage engager 
notamment notamment 
des un 
actions action 
de de 
coopération coopération 
décentralisée décentraliser 
entre entrer 
Angoulême angoulême LOC
et et 
Ségou Ségou LOC
Mali Mali LOC
, , 
contribuant contribuer 
ainsi ainsi 
au au 
rayonnement rayonnement 
international international 
de de 
la le LOC
Ville ville LOC
. . 


In [87]:
for (pred,ref) in zip(toto,doc_list[0]):
    print(pred.text, pred.ent_type_, ref.ent_type_, sep="\t")

Il		
développe		
des		
actions		
et		
des		
projets		
autour		
de		
l'		
image		
,		
du		
développement		
durable		
,		
de		
l'		
aménagement		
urbain		
,		
du		
patrimoine		
,		ORG
etc.		ORG
Le	ORG	
Comité	ORG	
engage		
notamment		
des		
actions		
de		
coopération		
décentralisée		ORG
entre		
Angoulême	LOC	ORG
et		ORG
Ségou	LOC	
Mali	LOC	
,		
contribuant		
ainsi		
au		
rayonnement		
international		LOC
de		LOC
la	LOC	


In [90]:
predictions = list(map(lambda d: nlp(d.text), doc_list))

In [68]:
for token in predictions[0]:
    print(token.lemma_, token.ent_type_)

il 
développer 
un 
action 
et 
un 
projet 
autour 
de 
l'image 
, 
de 
développement 
durable 
, 
de 
l'aménagement 
urbain 
, 
de 
patrimoine 
, 
etc. 
le ORG
comité ORG
engager 
notamment 
un 
action 
de 
coopération 
décentraliser 
entrer 
angoulême ORG
et 
Ségou ORG
Mali ORG
, 
contribuer 
ainsi 
au 
rayonnement 
international 
de 
le LOC
ville LOC
. 


In [91]:
type(predictions[0])

spacy.tokens.doc.Doc

In [92]:
scorer = Scorer(nlp)

In [93]:
examples = list(map(Example, predictions,doc_list))

In [94]:
scorer.score(examples)

{'token_acc': None,
 'token_p': None,
 'token_r': None,
 'token_f': None,
 'pos_acc': 0.9411811367151866,
 'morph_acc': 0.9413118831529648,
 'morph_micro_p': 0.9524083734164352,
 'morph_micro_r': 0.9696854738584633,
 'morph_micro_f': 0.9609692744112923,
 'morph_per_feat': {'Gender': {'p': 0.9605792437650845,
   'r': 0.968931138418734,
   'f': 0.964737115484504},
  'Number': {'p': 0.9851409389346892,
   'r': 0.9802053146028783,
   'f': 0.98266692928895},
  'Person': {'p': 0.904603068712475,
   'r': 0.900398406374502,
   'f': 0.9024958402662229},
  'Mood': {'p': 0.9162348877374784,
   'r': 0.9107296137339056,
   'f': 0.9134739560912613},
  'Tense': {'p': 0.9436201780415431,
   'r': 0.9228855721393034,
   'f': 0.9331377069796688},
  'VerbForm': {'p': 0.9548599670510708,
   'r': 0.9330328396651641,
   'f': 0.943820224719101},
  'Definite': {'p': 0.8876337693222355,
   'r': 0.992686170212766,
   'f': 0.9372253609541745},
  'PronType': {'p': 0.893644617380026,
   'r': 0.9891304347826086,
   

### Training a custom spaCy model

A custom spaCy NER model can be trained either with the command line interface (CLI) or from a Python script. Using the CLI is usually more optimized. The whole training setup is defined in a configuration file, which is good practice for documentation, traceability and reproducibility.

The configuration file can be generated online using the [Quickstart](https://spacy.io/usage/training#quickstart)

<img src="images/spacy_quickstart.jpg" width="600" >

You can run the training process as a script using the train function (https://spacy.io/usage/training#api-train), specifying the configuration file and the directory in which to save the model as parameters. Once the training is complete, the best and last models are saved in the directory.

#### Question
> * Generate a training configuration file for a French NER model
> * Add the correct paths to the training and dev sets generated previously
> * Train a NER model
> * Evaluate the model on the train, dev and test sets. Compare to the results reported in the paper.

In [103]:
# train the model
from spacy.cli.train import train
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
# NB: files need to have .spacy extension
# NB: do not forget to set output_path, otherwise output is not saved
train("./config.cfg", output_path = "./spacy_model",
      overrides={"paths.train": "./corpus/train.spacy", 
                 "paths.dev": "./corpus/dev.spacy"})

✔ Created output directory: spacy_model
ℹ Saving to output directory: spacy_model
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     74.36    0.45    0.30    0.84    0.00
  0     200        898.63   2311.27   22.45   36.49   16.21    0.22
  1     400        258.34   1485.19   39.12   56.57   29.89    0.39
  1     600        655.84   1298.27   48.66   61.61   40.21    0.49
  2     800        223.71   1189.86   50.39   53.02   48.00    0.50
  3    1000        211.89   1026.38   49.94   62.74   41.47    0.50
  5    1200        270.11    876.30   54.01   58.29   50.32    0.54
  6    1400        366.54    877.53   53.38   62.61   46.53    0.53
  8    1600        330.96    747.05   51.51 

In [106]:
# evaluate the model
nlp_trained = spacy.load('spacy_model/model-best')

In [107]:
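# NB: doc_list is whatever was last loaded with DocBin (corpus/train.docbin in the cells shown above)
scorer = Scorer(nlp_trained)
predictions = list(map(lambda d: nlp_trained(d.text), doc_list))
# Example(predicted, reference) pairs each prediction with its gold document
examples = list(map(Example, predictions,doc_list))
scorer.score(examples)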

{'token_acc': None,
 'token_p': None,
 'token_r': None,
 'token_f': None,
 'ents_p': 0.7972889710412816,
 'ents_r': 0.3760534728276664,
 'ents_f': 0.5110584518167457,
 'ents_per_type': {'ORG': {'p': 0.676056338028169,
   'r': 0.4678362573099415,
   'f': 0.5529953917050691},
  'LOC': {'p': 0.8058823529411765,
   'r': 0.3499361430395913,
   'f': 0.48797862867319675},
  'PER': {'p': 0.9424083769633508,
   'r': 0.6545454545454545,
   'f': 0.7725321888412017},
  'MISC': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}

### Zero-shot NER prediction with GLiNER


[GLiNER](https://github.com/fastino-ai/GLiNER2/tree/main) is a library that provides models for zero-shot named entity recognition. This means that the entity types to extract are described at inference time, without training the model on the target corpus. The library also supports [structured information extraction](https://github.com/fastino-ai/GLiNER2/blob/main/tutorial/3-json_extraction.md), which means that the extracted information can be organised in a structured JSON format. GLiNER does not provide the location of entities in the text by default, but you can configure the model to output this information (```include_spans=True```). Finally, GLiNER allows entities to overlap or be nested, which is not supported by the spaCy scorer. The spaCy [filter_spans](https://spacy.io/api/top-level#util.filter_spans) function can be used to remove overlapping entities before evaluation.
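
To illustrate what removing overlapping entities involves, here is a small plain-Python sketch of the strategy documented for ```filter_spans```: longer spans are preferred, and among spans of equal length the one that starts first wins. It works on `(start, end, label)` tuples rather than spaCy `Span` objects, and `filter_overlapping` is a hypothetical name, so it should be read as an approximation of the idea rather than as spaCy's code.

```python
def filter_overlapping(spans):
    """Keep a non-overlapping subset of (start, end, label) spans, longest first."""
    # Longest spans first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, taken = [], set()
    for start, end, label in ordered:
        # Keep the span only if none of its positions is already covered.
        if not taken.intersection(range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)  # back to document order

# Toy predictions (hypothetical offsets): one nested and one overlapping pair
spans = [(0, 21, "PER"), (7, 14, "PER"), (30, 35, "LOC"), (33, 40, "ORG")]
filtered = filter_overlapping(spans)
print(filtered)
```

The choice of which span to keep affects the evaluation: a nested prediction that matches the gold annotation is lost whenever a longer, incorrect span covers it.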

#### Question
> * Define the entities to extract from the text.
> * Apply GLiNER on the dev and test sets
> * Evaluate the models on the dev and test sets and compare to the results reported in the paper.

In [109]:
from gliner2 import GLiNER2
extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
from spacy.util import filter_spans
nlp = spacy.blank("fr")  # tokenizer only



You are using a model of type extractor to instantiate a model of type . This is not supported for all configurations of models and can yield errors.


🧠 Model Configuration
Encoder model      : microsoft/deberta-v3-base
Counting layer     : count_lstm_v2
Token pooling      : first


FileNotFoundError: [Errno 2] No such file or directory: 'REPLACE'

In [156]:
doc_bin = DocBin().from_disk("./corpus/test.spacy")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

label_map = {
    # Define the entities here
    "ORG": "Business organizations and corporations",
    "LOC": "Geographical places including cities",
    "PER": "Names of individuals including executives"
}

gliner_labels = list(label_map.values())
reverse_map = {v: k for k, v in label_map.items()}


In [254]:
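examples=[]
for gold_doc in gold_docs:
    text = gold_doc.text

    # Zero-shot prediction; include_spans=True adds character offsets
    predictions = extractor.extract_entities(text, gliner_labels, include_spans=True)
    # Tokenize only: the predicted entities are attached to this Doc below
    pred_doc = nlp.make_doc(text)

    spans = []
    for gliner_label, entities in predictions["entities"].items():
        # Map the GLiNER label description back to the corpus label (ORG, LOC, PER)
        spacy_label = reverse_map.get(gliner_label)
        if not spacy_label:
            continue
        for ent in entities:
            start = ent["start"]
            end = ent["end"]

            # char_span returns None when the offsets do not fall on token
            # boundaries, so such predictions are silently dropped
            span = pred_doc.char_span(start, end, label=spacy_label)
            if span:
                spans.append(span)

    # Remove overlapping spans: doc.ents cannot contain overlaps
    spans = filter_spans(spans)
    pred_doc.ents = spans
    examples.append(Example(pred_doc, gold_doc))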


In [255]:
text = gold_docs[0].text
extractor.extract_entities(text, gliner_labels, include_spans=True)

{'entities': {'Business organizations and corporations': [],
  'Geographical places including cities': [],
  'Names of individuals including executives': [{'text': 'Madame Mélanie SAVARY',
    'start': 83,
    'end': 104}]}}

In [259]:
len(examples)

85

In [260]:
x = examples[0].x
y = examples[0].y
for a,b in zip(x,y):
    print(a,b, a.ent_type_, b.ent_type_, sep="|\t\t")

et|		et|		|		
L’|		L’Office|		|		ORG
Office|		Communautaire|		|		ORG
Communautaire|		d’Animations|		|		ORG
d’|		et|		|		ORG
Animations|		de|		|		ORG
et|		Loisirs|		|		ORG
de|		,|		|		
Loisirs|		«|		|		
,|		L’OCAL|		|		ORG
«|		»|		|		
L’|		,|		|		
OCAL|		représenté|		|		
»|		par|		|		
,|		Madame|		|		PER
représenté|		Mélanie|		|		PER
par|		SAVARY|		|		PER
Madame|		,|		PER|		
Mélanie|		présidente|		PER|		
SAVARY|		,|		PER|		
,|		il|		|		
présidente|		est|		|		
,|		convenu|		|		
il|		ce|		|		
est|		qui|		|		
convenu|		suit|		|		
ce|		.|		|		


In [261]:
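# nlp_trained is only used to tell the scorer which components to score (here: ner);
# the predictions stored in `examples` come from GLiNER, not from the spaCy model
scorer = Scorer(nlp_trained)
scorer.score(examples)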

{'token_acc': None,
 'token_p': None,
 'token_r': None,
 'token_f': None,
 'ents_p': 0.27181208053691275,
 'ents_r': 0.2709030100334448,
 'ents_f': 0.271356783919598,
 'ents_per_type': {'PER': {'p': 0.4326923076923077,
   'r': 0.3629032258064516,
   'f': 0.39473684210526316},
  'ORG': {'p': 0.3050847457627119,
   'r': 0.14634146341463414,
   'f': 0.1978021978021978},
  'LOC': {'p': 0.13333333333333333,
   'r': 0.34615384615384615,
   'f': 0.1925133689839572}}}

In [263]:
list(gold_docs[0])

[et,
 L’Office,
 Communautaire,
 d’Animations,
 et,
 de,
 Loisirs,
 ,,
 «,
 L’OCAL,
 »,
 ,,
 représenté,
 par,
 Madame,
 Mélanie,
 SAVARY,
 ,,
 présidente,
 ,,
 il,
 est,
 convenu,
 ce,
 qui,
 suit,
 .]

In [265]:
entities = extractor.extract_entities(list(gold_docs[0]), ["ORG","PER"])

AttributeError: 'list' object has no attribute 'endswith'

In [None]:
from gliner2 import GLiNER2

# Load the GLiNER2 model (same checkpoint as in the cells above)
extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

# Your pre-tokenized input
tokens = ["John", "lives", "in", "New", "York", "."]
# Build the text and track character offsets
text = ""
offsets = []
pos = 0
for tok in tokens:
    if text:  # add space before token except first
        text += " "
        pos += 1
    start = pos
    text += tok
    end = pos + len(tok)
    offsets.append((start, end))
    pos = end

# Run GLiNER2 inference: extract_entities expects a string (not a list of tokens)
# and a list of labels; the label names below are illustrative
labels = ["person", "location"]
predictions = extractor.extract_entities(text, labels, include_spans=True)

# Align predictions to your tokens
def align_predictions(predictions, tokens, offsets):
    aligned = [[] for _ in tokens]
    # Output layout as displayed above: {"entities": {label: [{"text", "start", "end"}, ...]}}
    for label, entities in predictions["entities"].items():
        for ent in entities:
            for i, (tok_start, tok_end) in enumerate(offsets):
                # Token is inside the predicted span
                if tok_start >= ent["start"] and tok_end <= ent["end"]:
                    aligned[i].append(label)
    return aligned

aligned_labels = align_predictions(predictions, tokens, offsets)

# Show results
for token, labels in zip(tokens, aligned_labels):
    print(f"{token}: {labels}")
