Description
Describe the bug
When using Presidio Image Redactor with the GLiNER Recognizer, the model issues the following warning:
UserWarning: Sentence of length 20415 has been truncated to 384
warnings.warn(f"Sentence of length {len(tokens)} has been truncated to {max_len}")
Because of this truncation, redaction quality degrades significantly: only 384 of the 20,415 tokens (about 2%) reach the model, so PII in the truncated portion of the text goes undetected.
To Reproduce
Steps to reproduce the behavior:
- Use a single-page image-based PDF and convert it to a PIL Image (e.g., using pymupdf or pdf2image)
- Instantiate an ImageRedactorEngine
- Pass the PIL Image from step 1 to the ImageRedactorEngine.redact method
- Since the GLiNERRecognizer is used, its analyze method processes the full OCR-generated text, causing the GLiNER model to truncate the input
Expected behavior
The analyze method should split the text into chunks before passing it to the GLiNER model, then map the entities found in each chunk back to their offsets in the full text. One possible solution is to use a text splitter from LangChain.
Additional context
The gliner-spacy wrapper (which integrates GLiNER into spaCy pipelines) already implements a chunking strategy before feeding text to the model (see implementation). This is considered a best practice when working with transformer-based models and should be adopted in Presidio's GLiNERRecognizer.
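
To make the request under "Expected behavior" concrete, here is a minimal, dependency-free sketch of the kind of chunking I have in mind. It is not Presidio's or gliner-spacy's implementation. The names chunk_spans and analyze_in_chunks are hypothetical, and predict stands in for any callable that returns entity dicts with start, end, label and score keys relative to the chunk it was given (I am assuming the GLiNER model's prediction output can be wrapped into that shape). Words are counted by whitespace, which may undercount the model's own tokens, so the default window is kept well below 384.

```python
import re
from typing import Callable, Dict, List, Tuple


def chunk_spans(text: str, max_words: int = 300, overlap: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of overlapping word windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    # Character offsets of every whitespace-delimited word.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    spans = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        spans.append((window[0][0], window[-1][1]))
        if i + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return spans


def analyze_in_chunks(
    text: str,
    predict: Callable[[str], List[Dict]],  # hypothetical per-chunk predictor
    max_words: int = 300,
    overlap: int = 30,
) -> List[Dict]:
    """Run predict on each chunk and shift entity offsets back to the full text."""
    found: Dict[Tuple[int, int, str], Dict] = {}
    for start, _end in chunk_spans(text, max_words, overlap):
        for ent in predict(text[start:_end]):
            key = (ent["start"] + start, ent["end"] + start, ent["label"])
            best = found.get(key)
            # Overlapping windows can report the same entity twice;
            # keep the higher-scoring copy.
            if best is None or ent["score"] > best["score"]:
                found[key] = {**ent, "start": key[0], "end": key[1]}
    return sorted(found.values(), key=lambda e: (e["start"], e["end"]))
```

The overlap is there so that an entity straddling a window boundary still appears whole in at least one chunk, provided it is shorter than the overlap. A sentence-aware splitter would probably give the model better context than fixed word windows, but the offset bookkeeping would stay the same.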