## Evaluate Presidio Analyzer using the Presidio Evaluator framework

This notebook demonstrates how to evaluate a Presidio instance using the presidio-evaluator framework.

Steps:
1. Load dataset from file
2. Simple dataset statistics
3. Define the AnalyzerEngine object (and its parameters)
4. Align the dataset's entities to Presidio's entities
5. Set up the Evaluator object
6. Run experiment
7. Evaluate results
8. Error analysis

For an example with a custom Presidio instance, see [notebook 5](5_Evaluate_Custom_Presidio_Analyzer.ipynb).

In [1]:
# install presidio evaluator via pip if not yet installed

#!pip install presidio-evaluator

In [2]:
from pathlib import Path
from pprint import pprint
from collections import Counter
from typing import Dict, List
import json

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError, Plotter
from presidio_evaluator.experiment_tracking import get_experiment_tracker

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

%reload_ext autoreload
%autoreload 2
%matplotlib inline

stanza and spacy_stanza are not installed
Flair is not installed by default


## 1. Load dataset from file

In [3]:
dataset_name = "generated_size_500_date_August_04_2025.json"
dataset = InputSample.read_dataset_json(Path(Path.cwd().parent, "data", dataset_name))
print(len(dataset))

tokenizing input:   0%|          | 0/153 [00:00<?, ?it/s]

loading model en_core_web_sm


tokenizing input: 100%|██████████| 153/153 [00:01<00:00, 108.85it/s]

153





This dataset was auto-generated. For more information, see [Synthetic data generation](1_Generate_data.ipynb).

In [4]:
def get_entity_counts(dataset: List[InputSample]) -> Counter:
    """Return a Counter with the number of tokens tagged with each entity type."""
    entity_counter = Counter()
    for sample in dataset:
        for tag in sample.tags:
            entity_counter[tag] += 1
    return entity_counter


## 2. Simple dataset statistics

In [5]:
entity_counts = get_entity_counts(dataset)
print("Count per entity:")
pprint(entity_counts.most_common(), compact=True)

token_counts = [len(sample.tokens) for sample in dataset]
print("\nMin and max number of tokens in dataset: "
      f"Min: {min(token_counts)}, Max: {max(token_counts)}")

text_lengths = [len(sample.full_text) for sample in dataset]
print("Min and max sentence length in dataset: "
      f"Min: {min(text_lengths)}, Max: {max(text_lengths)}")

print("\nExample InputSample:")
print(dataset[0])

Count per entity:
[('O', 1356), ('PERSON', 433), ('HOSPITAL_NAME', 303), ('DATE_TIME', 295),
 ('DRUG', 49), ('FREQUENCY', 37), ('TIME', 36), ('DOSE', 23), ('DURATION', 13),
 ('SYMPTOM', 5), ('PHONE_NUMBER', 5), ('LAB_RESULT', 4),
 ('MEDICAL_CONDITION', 4), ('PROCEDURE', 3), ('LOCATION', 2), ('PATIENT_ID', 1),
 ('INSURANCE_NUMBER', 1), ('BLOOD_PRESSURE', 1)]

Min and max number of tokens in dataset: Min: 7, Max: 27
Min and max sentence length in dataset: Min: 30, Max: 138

Example InputSample:
Full text: Dr. John Doe recommended MRI for further evaluation of Headache.
Spans: [Span(type: SYMPTOM, value: Headache, char_span: [55: 63]), Span(type: PROCEDURE, value: MRI, char_span: [25: 28]), Span(type: PERSON, value: John Doe, char_span: [4: 12])]



In [6]:
print("A few example sentences containing each entity:\n")
for entity in entity_counts.keys():
    samples = [sample for sample in dataset if entity in set(sample.tags)]
    if len(samples) > 1 and entity != "O":
        print(f"Entity: <{entity}> two example sentences:\n"
              f"\n1) {samples[0].full_text}"
              f"\n2) {samples[1].full_text}"
              f"\n------------------------------------\n")

A few example sentences containing each entity:

Entity: <PERSON> two example sentences:

1) Dr. John Doe recommended MRI for further evaluation of Headache.
2) Appointment for John Doe confirmed at St. Luke's Cornwall Hospital on 2025-08-01 at 10:00.
------------------------------------

Entity: <PROCEDURE> two example sentences:

1) Dr. John Doe recommended MRI for further evaluation of Headache.
2) John Doe was admitted for Hypertension and underwent MRI on 2025-08-01.
------------------------------------

Entity: <SYMPTOM> two example sentences:

1) Dr. John Doe recommended MRI for further evaluation of Headache.
2) John Doe was advised to monitor Headache and return if symptoms worsen.
------------------------------------

Entity: <HOSPITAL_NAME> two example sentences:

1) Appointment for John Doe confirmed at St. Luke's Cornwall Hospital on 2025-08-01 at 10:00.
2) Reminder: John Doe has an appointment with Dr. Ryan Thomas at Pine Rest Christian Mental Health Services on 2025-08-

## 3. Define the AnalyzerEngine object
This example uses Presidio with default parameters, which is not recommended in practice and is done here only for simplicity. For an example of customization, see [notebook 5](5_Evaluate_Custom_Presidio_Analyzer.ipynb).

In [7]:
#import medspacy
#nlp = medspacy.load("en_core_web_sm")
#print(nlp.pipe_names)  # Should include 'medspacy_ner'

In [8]:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.medspacy_recognizer import MedspacyRecognizer

# Load the vanilla AnalyzerEngine with the default NER model.
analyzer_engine = AnalyzerEngine(default_score_threshold=0.4)

# Register the MedspacyRecognizer alongside the default recognizers.
medspacy_recognizer = MedspacyRecognizer()
analyzer_engine.registry.add_recognizer(medspacy_recognizer)

print("Supported entities for English:")
pprint(analyzer_engine.get_supported_entities("en"), compact=True)

print("\nLoaded recognizers for English:")
pprint([rec.name for rec in analyzer_engine.registry.get_recognizers("en", all_fields=True)], compact=True)

print("\nLoaded NER models:")
pprint(analyzer_engine.nlp_engine.models)

Supported entities for English:
['MEDICAL_CONDITION', 'DURATION', 'DOSAGE', 'LAB_TEST', 'VACCINE', 'CRYPTO',
 'MEDICAL_LICENSE', 'LOCATION', 'EMAIL_ADDRESS', 'DRUG', 'US_BANK_NUMBER',
 'US_PASSPORT', 'NRP', 'IBAN_CODE', 'US_DRIVER_LICENSE', 'ALLERGY', 'PROCEDURE',
 'FREQUENCY', 'PERSON', 'DATE_TIME', 'ANATOMY', 'HOSPITAL', 'US_SSN', 'SYMPTOM',
 'UK_NHS', 'URL', 'PHONE_NUMBER', 'US_ITIN', 'CREDIT_CARD', 'MEDICAL_DEVICE',
 'IP_ADDRESS']

Loaded recognizers for English:
['CreditCardRecognizer', 'UsBankRecognizer', 'UsLicenseRecognizer',
 'UsItinRecognizer', 'UsPassportRecognizer', 'UsSsnRecognizer', 'NhsRecognizer',
 'CryptoRecognizer', 'DateRecognizer', 'EmailRecognizer', 'IbanRecognizer',
 'IpRecognizer', 'MedicalLicenseRecognizer', 'PhoneRecognizer', 'UrlRecognizer',
 'SpacyRecognizer', 'MedspacyRecognizer']

Loaded NER models:
[{'lang_code': 'en', 'model_name': 'en_core_web_lg'}]


## 4. Align the dataset's entities to Presidio's entities

The entity names used in the dataset may differ from the names of the entities Presidio can detect.
For example, a dataset might label a name as `PER` while Presidio returns `PERSON`. To compare predicted values with the annotated ones and gather metrics, the entity names must be aligned. Consider changing the mapping if your dataset and/or Presidio instance support different entity types.
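
As a minimal illustration of what alignment means, the sketch below renames token tags with a plain dictionary. The helper `align_tags`, its `allow_missing` flag, and the sample mapping are hypothetical and are not part of presidio-evaluator; `Evaluator.align_entity_types` works on `InputSample` objects and may handle unmapped entities differently. In this sketch, unmapped tags are assumed to be left unchanged when `allow_missing` is true.

```python
from typing import Dict, List


def align_tags(tags: List[str], mapping: Dict[str, str], allow_missing: bool = True) -> List[str]:
    """Rename each tag using `mapping` (hypothetical helper, not the library API)."""
    aligned = []
    for tag in tags:
        if tag in mapping:
            aligned.append(mapping[tag])
        elif allow_missing:
            # Assumption of this sketch: keep unmapped tags unchanged.
            aligned.append(tag)
        else:
            raise KeyError(f"No mapping defined for tag {tag!r}")
    return aligned


# Illustrative mapping and tags, not taken from the dataset above.
sample_mapping = {"PER": "PERSON", "HOSP": "ORGANIZATION", "O": "O"}
sample_tags = ["O", "PER", "PER", "O", "HOSP", "DRUG"]

aligned_tags = align_tags(sample_tags, sample_mapping)
print(aligned_tags)
```

Here `PER` and `HOSP` are renamed, while `DRUG` has no mapping and is passed through unchanged.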

In [9]:
from presidio_evaluator.models import PresidioAnalyzerWrapper

entities_mapping = PresidioAnalyzerWrapper.presidio_entities_map  # default mapping

print("Using this mapping between the dataset and Presidio's entities:")
pprint(entities_mapping, compact=True)


dataset = Evaluator.align_entity_types(
    dataset, 
    entities_mapping=entities_mapping, 
    allow_missing_mappings=True
)
new_entity_counts = get_entity_counts(dataset)
print("\nCount per entity after alignment:")
pprint(new_entity_counts.most_common(), compact=True)

dataset_entities = list(new_entity_counts.keys())


Using this mapping between the dataset and Presidio's entities:
{'ADDRESS': 'LOCATION',
 'AGE': 'AGE',
 'BIRTHDAY': 'DATE_TIME',
 'CITY': 'LOCATION',
 'CREDIT_CARD': 'CREDIT_CARD',
 'CREDIT_CARD_NUMBER': 'CREDIT_CARD',
 'DATE': 'DATE_TIME',
 'DATE_OF_BIRTH': 'DATE_TIME',
 'DATE_TIME': 'DATE_TIME',
 'DOB': 'DATE_TIME',
 'DOMAIN': 'URL',
 'DOMAIN_NAME': 'URL',
 'EMAIL': 'EMAIL_ADDRESS',
 'EMAIL_ADDRESS': 'EMAIL_ADDRESS',
 'FACILITY': 'LOCATION',
 'FIRST_NAME': 'PERSON',
 'GPE': 'LOCATION',
 'HCW': 'PERSON',
 'HOSP': 'ORGANIZATION',
 'HOSPITAL': 'ORGANIZATION',
 'IBAN': 'IBAN_CODE',
 'IBAN_CODE': 'IBAN_CODE',
 'ID': 'ID',
 'IP_ADDRESS': 'IP_ADDRESS',
 'LAST_NAME': 'PERSON',
 'LOC': 'LOCATION',
 'LOCATION': 'LOCATION',
 'NAME': 'PERSON',
 'NATIONALITY': 'NRP',
 'NORP': 'NRP',
 'NRP': 'NRP',
 'O': 'O',
 'ORG': 'ORGANIZATION',
 'ORGANIZATION': 'ORGANIZATION',
 'PATIENT': 'PERSON',
 'PATORG': 'ORGANIZATION',
 'PER': 'PERSON',
 'PERSON': 'PERSON',
 'PHONE': 'PHONE_NUMBER',
 'PHONE_NUMBER': 'PH

## 5. Set up the Evaluator object

In [10]:
# Set up the experiment tracker to log the experiment for reproducibility
experiment = get_experiment_tracker()

# Create the evaluator object
evaluator = Evaluator(model=analyzer_engine)


# Track model and dataset params
params = {"dataset_name": dataset_name, 
          "model_name": evaluator.model.name}
params.update(evaluator.model.to_log())
experiment.log_parameters(params)
experiment.log_dataset_hash(dataset)
experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

--------
Entities supported by this Presidio Analyzer instance:
MEDICAL_CONDITION, DURATION, DOSAGE, LAB_TEST, VACCINE, CRYPTO, MEDICAL_LICENSE, LOCATION, EMAIL_ADDRESS, DRUG, US_BANK_NUMBER, US_PASSPORT, NRP, IBAN_CODE, US_DRIVER_LICENSE, ALLERGY, PROCEDURE, FREQUENCY, PERSON, DATE_TIME, ANATOMY, HOSPITAL, US_SSN, SYMPTOM, UK_NHS, URL, PHONE_NUMBER, US_ITIN, CREDIT_CARD, MEDICAL_DEVICE, IP_ADDRESS


## 6. Run experiment

In [11]:
%%time

## Run experiment

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

# Track experiment results
experiment.log_metrics(results.to_log())
entities, confmatrix = results.to_confusion_matrix()
experiment.log_confusion_matrix(matrix=confmatrix, 
                                labels=entities)

# end experiment
experiment.end()

Running model PresidioAnalyzerWrapper on dataset...
Finished running model on dataset
saving experiment data to c:\projects\presidio\presidio-research\notebooks\experiment_20250804-134853.json
CPU times: total: 1.77 s
Wall time: 1.82 s


## 7. Evaluate results

In [12]:
# Plot output
plotter = Plotter(results=results,
                  model_name=evaluator.model.name,
                  save_as="png",
                  beta=2)

# Path of the directory to save the plots
output_folder = Path(Path.cwd().parent, "plotter_output")
plotter.plot_scores(output_folder)

ValueError: 
Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
    $ pip install -U kaleido


In [None]:
pprint({"PII F": results.pii_f, "PII recall": results.pii_recall, "PII precision": results.pii_precision})
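
The plotter above is configured with `beta = 2`. Assuming the reported F score follows the standard F-beta definition (the notebook does not spell out the formula, so this is an assumption), a beta above 1 weights recall more heavily than precision. The function `f_beta` below is a hypothetical stand-alone sketch, not the evaluator's implementation, and the precision/recall values are made up for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        # Avoid division by zero when nothing was predicted correctly.
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Two hypothetical models with mirrored precision and recall.
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)

print(f"High-recall model:    {high_recall:.3f}")
print(f"High-precision model: {high_precision:.3f}")
```

With `beta = 2`, the high-recall model scores higher than the high-precision one, which suits PII detection when a missed entity is considered costlier than a false alarm.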

## 8. Error analysis

Now let's look into the results to understand what's behind the metrics we're getting.
Note that evaluation is never perfect. Some things to consider:
1. There's often a mismatch between the annotated span and the predicted span, which isn't necessarily a mistake. For example: `<Southern France>` compared with `Southern <France>`. In the second case, the word `Southern` was not annotated/predicted as part of the entity, but that's not necessarily an error.
2. Token-based evaluation (which is used here) counts the number of true positive / false positive / false negative tokens. Some entities are split into more tokens than others. For example, the phone number `222-444-1234` could be split into five tokens, whereas `Krishna` would be a single token, so phone numbers end up having more influence on the metrics than names.
3. The synthetic dataset used here isn't representative of a real dataset. Consider using more realistic datasets for evaluation.
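
To make the second point concrete, the sketch below counts token-level outcomes for a missed phone number and a missed name. The function `count_token_outcomes` is hypothetical and deliberately simplified: it treats every non-`O` tag as PII and ignores entity-type mismatches, which is not necessarily how presidio-evaluator scores tokens. The five-token split of the phone number is the one assumed in the text above.

```python
from collections import Counter
from typing import List


def count_token_outcomes(annotated: List[str], predicted: List[str]) -> Counter:
    """Count token-level TP/FP/FN, treating any non-'O' tag as PII (simplified)."""
    outcomes = Counter()
    for gold, pred in zip(annotated, predicted):
        gold_is_pii = gold != "O"
        pred_is_pii = pred != "O"
        if gold_is_pii and pred_is_pii:
            outcomes["tp"] += 1
        elif pred_is_pii:
            outcomes["fp"] += 1
        elif gold_is_pii:
            outcomes["fn"] += 1
        # Tokens tagged 'O' on both sides are true negatives and are not counted.
    return outcomes


# `222-444-1234` assumed to be split as: 222 / - / 444 / - / 1234
missed_phone = count_token_outcomes(["PHONE_NUMBER"] * 5, ["O"] * 5)
# `Krishna` is a single token.
missed_name = count_token_outcomes(["PERSON"], ["O"])

print("False negative tokens for one missed phone number:", missed_phone["fn"])
print("False negative tokens for one missed name:", missed_name["fn"])
```

A single missed phone number contributes five false negative tokens, while a single missed name contributes one, so one missed entity of each kind affects token-level recall very differently.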

In [None]:
plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix, output_folder=output_folder)

In [None]:
plotter.plot_most_common_tokens(output_folder)

### 8a. False positives (FP)
#### Most common false positive tokens:

In [None]:
ModelError.most_common_fp_tokens(results.model_errors)

#### More FP analysis

In [None]:
fps_df = ModelError.get_fps_dataframe(results.model_errors, entity=["PERSON"])
fps_df[["full_text", "token", "annotation", "prediction"]].head(20)

### 8b. False negatives (FN)

#### Most common false negative examples + a few samples with FN

In [None]:
ModelError.most_common_fn_tokens(results.model_errors, n=15)

#### More FN analysis

In [None]:
fns_df = ModelError.get_fns_dataframe(results.model_errors, entity=["PHONE_NUMBER"])

In [None]:
fns_df[["full_text", "token", "annotation", "prediction"]].head(20)