# Named-entity Recognition (NER) 

Named entity recognition (NER) is a fundamental task in information extraction from text. Named entities originally referred to real-world entities that have names, such as people, organisations and places. The concept has since been extended to any type of information: it is possible to extract chemical molecules, product numbers, amounts, addresses, etc. In this practical assignment, we will apply several named entity recognition libraries to a small French corpus. The objective is not to train the best possible model, but to try out each of these libraries.



## The AdminSet dataset
The AdminSet dataset is a corpus of French administrative documents produced by optical character recognition (OCR) and manually annotated with named entities. This corpus is difficult to process because the recognition step produces noisy text (errors due to layout, recognition, fonts, etc.).

The paper describing the dataset is available [here](https://hal.science/hal-04855066v1/file/AdminSet_et_AdminBERT__version___preprint.pdf).

The corpus is available on HuggingFace: [Adminset-NER](https://huggingface.co/datasets/taln-ls2n/Adminset-NER).

In [None]:
from datasets import load_dataset
ds = load_dataset('taln-ls2n/Adminset-NER')
print(ds)

#### Question
> * Compute descriptive statistics on the texts for each split (train, dev)
> * Compute descriptive statistics on the entities for each split (train, dev)
> * Compare with the statistics reported in the paper (Table 2)
> * Display a couple of random texts with their entities

In [None]:
# Compute statistics on the number of tokens in train and validation: min, max, mean, std, median

import numpy as np
from collections import Counter
import random
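
As a starting point, the statistics listed in the comment above can be sketched on toy data. The field names `tokens` and `tags` and the three toy documents are assumptions made for illustration; the actual column names should be checked with `print(ds)` before adapting this.

```python
import numpy as np
from collections import Counter

# Hypothetical toy documents: a token list with one BIO tag per token.
# The real dataset may use different field names and a different tag encoding.
toy_docs = [
    {"tokens": ["Mairie", "de", "Nantes"],
     "tags": ["B-ORG", "I-ORG", "I-ORG"]},
    {"tokens": ["Le", "maire", "Jean", "Martin", "signe"],
     "tags": ["O", "O", "B-PER", "I-PER", "O"]},
    {"tokens": ["Fait", "à", "Paris", "le", "3", "mai", "2021"],
     "tags": ["O", "O", "B-LOC", "O", "O", "O", "O"]},
]


def length_stats(docs):
    """Return min, max, mean, std and median of the number of tokens per document."""
    lengths = np.array([len(doc["tokens"]) for doc in docs])
    return {
        "min": int(lengths.min()),
        "max": int(lengths.max()),
        "mean": float(lengths.mean()),
        "std": float(lengths.std()),
        "median": float(np.median(lengths)),
    }


def entity_counts(docs):
    """Count entities per type: each B- tag opens exactly one entity."""
    return Counter(
        tag[2:] for doc in docs for tag in doc["tags"] if tag.startswith("B-")
    )


stats = length_stats(toy_docs)
counts = entity_counts(toy_docs)
print(stats)
print(counts)
```

The same two functions can then be called once per split to fill a comparison table.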




### Creation of the splits

The ```train_test_split()``` function from Hugging Face ```datasets``` randomly splits a dataset into 2 parts: https://huggingface.co/docs/datasets/v4.5.0/process#split

The ```spacy_utils.py``` file contains functions to save a dataset in text format (```save_text```, useful for inspection), BIO format (```save_bio```) and spaCy format (```save_docbin```).
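
For readers unfamiliar with the BIO format, the sketch below shows the tagging scheme on one sentence. The helper `spans_to_bio` is hypothetical and is not the implementation found in ```spacy_utils.py```; it only illustrates how entity spans map to one tag per token.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans is a list of (start_token, end_token_exclusive, label) tuples
    that are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # following tokens of the same entity
    return tags


tokens = ["Jean", "Martin", "habite", "à", "Nantes"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
bio_tags = spans_to_bio(tokens, spans)

# One token and its tag per line, as in typical BIO files.
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```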

#### Questions
>* Using the split function, create a train/dev/test split corresponding to the proportions reported in the paper
>* Save the sets in a corpus directory, in text, bio and docbin formats.

In [None]:
from spacy_utils import save_bio, save_text, save_docbin



### Testing spaCy pre-trained NER models

spaCy comes with several pretrained models for many languages. For French, 4 models are provided: https://spacy.io/models/fr

To apply a pretrained model to a dataset, use:
- ```nlp = spacy.load(MODEL_NAME)``` to load the model. You need to download it first with ```spacy download MODEL_NAME```
- ```DocBin().from_disk()``` to load a dataset in spaCy format from the disk
- ```doc_bin.get_docs(nlp.vocab)``` to convert the binary dataset into spaCy ```Doc``` objects
- ```nlp(doc.text)``` to apply the NER model to a text
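
The loading steps above can be sketched as a round trip on a toy document. To keep the snippet runnable without a download, a blank French pipeline stands in for the pretrained model, so it predicts no entities; the sentence, the labels and the file name `toy.spacy` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Stand-in for spacy.load(MODEL_NAME): a blank pipeline has a tokenizer but no NER.
nlp = spacy.blank("fr")

# Build a toy gold document with two annotated entities.
gold = nlp.make_doc("Jean Martin habite à Nantes.")
gold.ents = [
    gold.char_span(0, 11, label="PER"),
    gold.char_span(21, 27, label="LOC"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.spacy"
    DocBin(docs=[gold]).to_disk(path)                  # save in spaCy binary format
    doc_bin = DocBin().from_disk(path)                 # load it back
    loaded_docs = list(doc_bin.get_docs(nlp.vocab))    # binary -> Doc objects

for doc in loaded_docs:
    predicted = nlp(doc.text)                          # apply the pipeline to the raw text
    print("gold:", [(ent.text, ent.label_) for ent in doc.ents])
    print("pred:", [(ent.text, ent.label_) for ent in predicted.ents])
```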

To evaluate the predictions, you can use the spaCy [Scorer](https://spacy.io/api/scorer):
- ```scorer.score(examples)``` where examples is a list of spaCy ```Example(prediction, reference)```
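
A minimal scoring sketch on one hand-built example follows. It calls the static helper `Scorer.score_spans(examples, "ents")` instead of `scorer.score(examples)`, because the blank pipeline used here has no NER component to score; with a loaded pretrained model, `Scorer(nlp).score(examples)` should report the same `ents_*` keys. The sentence and the labels are hypothetical.

```python
import spacy
from spacy.scorer import Scorer
from spacy.training import Example

nlp = spacy.blank("fr")  # tokenizer only, no model download needed
text = "Marie Dupont habite à Nantes."

# Reference annotation: two entities.
reference = nlp.make_doc(text)
reference.ents = [
    reference.char_span(0, 12, label="PER"),
    reference.char_span(22, 28, label="LOC"),
]

# Simulated prediction: the person is found, the location is missed.
prediction = nlp.make_doc(text)
prediction.ents = [prediction.char_span(0, 12, label="PER")]

examples = [Example(prediction, reference)]
scores = Scorer.score_spans(examples, "ents")

print("precision:", scores["ents_p"])
print("recall:", scores["ents_r"])
print("f-score:", scores["ents_f"])
print("per type:", scores["ents_per_type"])
```

The `ents_per_type` entry gives the same three measures for each label, which is convenient when comparing with per-entity results.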

#### Question

>* Using a spaCy pretrained model for French, evaluate its performance for NER prediction on the train, dev and test sets
>* Compare its results to those reported in the paper

In [None]:
import spacy
from spacy.tokens import DocBin
from spacy.scorer import Scorer
from spacy.training import Example
from tqdm import tqdm
from prettytable import PrettyTable # optional but nice


    

### Training a custom spaCy model

A custom spaCy NER model can be trained either with the command line interface (CLI) or from a Python script. Using the CLI is usually more optimized. The whole training setup is defined in a configuration file, which is good practice for documentation, tracing and reproducibility.

The configuration file can be generated online using the [Quickstart](https://spacy.io/usage/training#quickstart).

<img src="images/spacy_quickstart.jpg" width="600" >

You can run the training process as a script using the train function (https://spacy.io/usage/training#api-train), specifying the configuration file and the directory in which to save the model as parameters. Once the training is complete, the best and last models are saved in the directory.
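
A possible way to call the train function from Python is sketched below. The file names `config.cfg`, `corpus/train.spacy`, `corpus/dev.spacy` and the `output` directory are assumptions; they depend on how the configuration and the corpus were saved. The helpers `path_overrides` and `run_training` are hypothetical wrappers written for this example.

```python
from pathlib import Path

from spacy.cli.train import train


def path_overrides(corpus_dir):
    """Build config overrides pointing to the training and dev sets (assumed file names)."""
    corpus_dir = Path(corpus_dir)
    return {
        "paths.train": str(corpus_dir / "train.spacy"),
        "paths.dev": str(corpus_dir / "dev.spacy"),
    }


def run_training(config_path="config.cfg", output_dir="output", corpus_dir="corpus"):
    """Train with the given configuration and return the saved model directories."""
    train(config_path, output_dir, overrides=path_overrides(corpus_dir))
    output_dir = Path(output_dir)
    # spaCy saves the best and the last checkpoints under these names.
    return output_dir / "model-best", output_dir / "model-last"


overrides = path_overrides("corpus")
print(overrides)

# Uncomment once config.cfg and the corpus files exist:
# best_model_dir, last_model_dir = run_training()
```

Passing the paths as overrides avoids editing the configuration file by hand; writing them directly in the `[paths]` section of the file works as well.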

#### Question
> * Generate a training configuration file for a NER in French
> * Add the correct path to the training and dev sets generated previously
> * Train a NER model
> * Evaluate the model on the train, dev and test sets. Compare to the results reported in the paper.

In [None]:
# train the model
from spacy.cli.train import train


In [None]:
# evaluate the model

### Zero-shot NER prediction with GLiNER


[GLiNER](https://github.com/fastino-ai/GLiNER2/tree/main) is a library that provides models for zero-shot named entity recognition. This means that the entity types to extract are given as labels at prediction time, without training the model on an annotated corpus. GLiNER also supports [structured information extraction](https://github.com/fastino-ai/GLiNER2/blob/main/tutorial/3-json_extraction.md), which means that the extracted information can be organised in a structured JSON format. GLiNER does not provide the location of entities in the text by default, but you can configure the model to output this information (```include_spans=True```). Finally, GLiNER allows entities to overlap and to be nested, which is not supported by the spaCy scorer. The spaCy [filter_spans](https://spacy.io/api/top-level#util.filter_spans) function can be used to remove overlapping entities for evaluation.
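
The overlap removal can be sketched on a toy nested entity, without running GLiNER. The phrase and the labels are hypothetical. When spans overlap, `filter_spans` keeps the longest one (the first one when lengths are equal).

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("fr")  # tokenizer only
doc = nlp.make_doc("Université de Nantes")

# Two nested candidate entities: the whole phrase and the city inside it.
nested = [
    Span(doc, 0, 3, label="ORG"),   # "Université de Nantes"
    Span(doc, 2, 3, label="LOC"),   # "Nantes"
]

# doc.ents does not accept overlapping spans, so they are filtered first.
kept = filter_spans(nested)
doc.ents = kept
print([(span.text, span.label_) for span in kept])
```

Note that this choice affects the evaluation: a nested entity that was correct but shorter than its enclosing span is discarded before scoring.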

#### Question
> * Define the entities to extract from the text.
> * Apply GLiNER on the dev and test sets
> * Evaluate the models on the dev and test sets and compare to the results reported in the paper.

In [None]:
import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import filter_spans
from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
nlp = spacy.blank("fr")  # tokenizer only

doc_bin = DocBin().from_disk("REPLACE")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

label_map = {
    # Define the entities here
}

gliner_labels = list(label_map.values())
reverse_map = {v: k for k, v in label_map.items()}

examples = []

for gold_doc in gold_docs:
    text = gold_doc.text
    
    predictions = extractor.extract_entities(text, gliner_labels, include_spans=True)
    pred_doc = nlp.make_doc(text)

    spans = []
    for gliner_label, entities in predictions["entities"].items():
        spacy_label = reverse_map.get(gliner_label)
        if not spacy_label:
            continue
        for ent in entities:
            start = ent["start"]
            end = ent["end"]

            # char_span returns None when the offsets do not align with token boundaries
            span = pred_doc.char_span(start, end, label=spacy_label)
            if span is not None:
                spans.append(span)

    spans = filter_spans(spans)
    pred_doc.ents = spans
    examples.append(Example(pred_doc, gold_doc))

