
GLiNER Recognizer Truncates Long Text, Leading to Poor Redaction Results #1569

Open
@rafaela00castro

Description

Describe the bug
When using Presidio Image Redactor with the GLiNER Recognizer, the model issues the following warning:

"UserWarning: Sentence of length 20415 has been truncated to 384
warnings.warn(f"Sentence of length {len(tokens)} has been truncated to {max_len}")

Due to this truncation, redaction performance is significantly degraded: any PII that falls in the truncated-away portion of the text goes undetected.

To Reproduce
Steps to reproduce the behavior:

  1. Use a single-page image-based PDF and convert it to a PIL Image (e.g., using pymupdf or pdf2image)
  2. Instantiate an ImageRedactorEngine
  3. Pass the PIL Image from step 1 to the ImageRedactorEngine.redact method
  4. Since the GLiNERRecognizer is used, its analyze method processes the full OCR-generated text, causing the GLiNER model to truncate the input
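A rough reproduction sketch of the steps above. The GLiNERRecognizer constructor arguments and the model name are assumptions for illustration and may differ from the exact API in a given Presidio version:

```python
from pdf2image import convert_from_path

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import GLiNERRecognizer
from presidio_image_redactor import ImageAnalyzerEngine, ImageRedactorEngine

# 1. Convert a single-page, image-based PDF to a PIL Image (requires poppler).
page_image = convert_from_path("scanned_document.pdf")[0]

# 2. Register the GLiNER recognizer on a text analyzer
#    (model name and constructor arguments are illustrative assumptions).
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(
    GLiNERRecognizer(model_name="urchade/gliner_multi_pii-v1")
)

# 3. Wrap the text analyzer in an image analyzer and a redactor engine.
image_analyzer = ImageAnalyzerEngine(analyzer_engine=analyzer)
redactor = ImageRedactorEngine(image_analyzer_engine=image_analyzer)

# 4. Redact: the full OCR text of the page is passed to
#    GLiNERRecognizer.analyze in one string, which the GLiNER model truncates.
redacted_image = redactor.redact(page_image, fill=(0, 0, 0))
redacted_image.save("redacted.png")
```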

Expected behavior
The analyze method should chunk the text before passing it to the GLiNER model. Possible solutions include using one of LangChain's text splitters, for example as sketched below.
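
A minimal sketch of that idea, assuming the langchain-text-splitters package; `long_ocr_text` is a hypothetical variable standing in for the OCR output that GLiNERRecognizer.analyze currently receives in one piece:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_ocr_text = "..."  # full page text produced by OCR (placeholder)

# Split into overlapping chunks small enough for the GLiNER model.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(long_ocr_text)

# Each chunk could then be passed to the GLiNER model separately instead of
# sending the whole text at once.
```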

Additional context
The gliner-spacy wrapper (which integrates GLiNER into spaCy pipelines) already implements a chunking strategy before feeding text to the model (see implementation). This is considered a best practice when working with transformer-based models and should be adopted in Presidio's GLiNERRecognizer.
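
To make the suggestion concrete, here is a minimal chunking sketch, not Presidio's or gliner-spacy's implementation: split the text into overlapping character windows, run GLiNER on each window, and shift the detected spans back to offsets in the original text. The `predict_entities` call and the result keys ("start", "end", "label", "score") are assumptions based on the gliner package's documented usage, and `long_ocr_text` is again a placeholder for the OCR output:

```python
from gliner import GLiNER


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100):
    """Yield (offset, chunk) pairs of overlapping character windows."""
    step = max_chars - overlap
    for start in range(0, len(text), step):
        yield start, text[start : start + max_chars]


def analyze_in_chunks(model, text, labels, threshold=0.5):
    """Run GLiNER on each chunk and map entity spans back to the full text."""
    entities = []
    for offset, chunk in chunk_text(text):
        for ent in model.predict_entities(chunk, labels, threshold=threshold):
            entities.append(
                {**ent, "start": ent["start"] + offset, "end": ent["end"] + offset}
            )
    # Entities detected twice inside an overlap region: keep the higher score.
    best = {}
    for ent in entities:
        key = (ent["start"], ent["end"], ent["label"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return sorted(best.values(), key=lambda e: e["start"])


model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")
long_ocr_text = "..."  # OCR output (placeholder)
results = analyze_in_chunks(model, long_ocr_text, ["person", "email", "phone number"])
```

Plain character windows can still cut an entity at a chunk boundary; the overlap mitigates this, and a sentence-aware splitter (such as the LangChain splitter above, or the sentence-based chunking used by gliner-spacy) would handle it more robustly.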
