GLiNER Recognizer Truncates Long Text, Leading to Poor Redaction Results

**Describe the bug**
When using **Presidio Image Redactor** with the **GLiNER Recognizer**, the model issues the following warning:
```
UserWarning: Sentence of length 20415 has been truncated to 384
warnings.warn(f"Sentence of length {len(tokens)} has been truncated to {max_len}")
```
Because of this truncation, redaction quality degrades significantly: the model only sees the first 384 tokens, so any PII in the truncated remainder of the text goes undetected.

**To Reproduce**
Steps to reproduce the behavior:
1. Use a single-page image-based PDF and convert it to a PIL Image (e.g., using `pymupdf` or `pdf2image`)
2. Instantiate an **`ImageRedactorEngine`**
3. Pass the PIL Image from step 1 to the [`ImageRedactorEngine.redact`](https://github.com/microsoft/presidio/blob/main/presidio-image-redactor/presidio_image_redactor/image_redactor_engine.py#L27-L34) method
4. Since the **GLiNERRecognizer** is used, its [`analyze`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/gliner_recognizer.py#L125-L131) method processes the full OCR-generated text, causing the GLiNER model to truncate the input

**Expected behavior**
The [`analyze`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/gliner_recognizer.py#L125-L131) method should implement **text chunking** before passing the input to the GLiNER model. Possible solutions include using text splitters from LangChain, such as:
- [spaCy Text Splitter](https://python.langchain.com/docs/how_to/split_by_token/#spacy)
- [Sentence Transformers Splitter](https://python.langchain.com/docs/how_to/split_by_token/#sentencetransformers)
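For illustration, here is a minimal, dependency-free sketch of the kind of chunking I have in mind. It is not Presidio's or GLiNER's code: `chunk_text`, `analyze_in_chunks`, `Span`, `max_words` and `overlap` are hypothetical names, and `predict` is a stand-in for whatever calls the model on one chunk. I am assuming the model returns dicts with `start`, `end`, `label` and `score` keys relative to the chunk, and that a whitespace word count kept well below 384 is a safe proxy for the model's own token count.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    """A detected entity with offsets relative to the full text."""
    start: int
    end: int
    label: str
    score: float


def chunk_text(text: str, max_words: int = 300, overlap: int = 50) -> List[Tuple[int, str]]:
    """Split text into overlapping word windows.

    Returns (char_offset, chunk) pairs, where char_offset is the position
    of the chunk's first character in the original text.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    # Character boundaries of every whitespace-delimited word.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    chunks: List[Tuple[int, str]] = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        chunks.append((start, text[start:end]))
        if i + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def analyze_in_chunks(
    text: str,
    predict: Callable[[str], List[Dict]],  # stand-in for the model call (assumption)
    max_words: int = 300,
    overlap: int = 50,
) -> List[Span]:
    """Run predict on each chunk and map results back to full-text offsets."""
    found: Dict[Tuple[int, int, str], Span] = {}
    for offset, chunk in chunk_text(text, max_words, overlap):
        for ent in predict(chunk):
            span = Span(ent["start"] + offset, ent["end"] + offset,
                        ent["label"], ent["score"])
            key = (span.start, span.end, span.label)
            # Overlapping windows can report the same entity twice;
            # keep the higher-scoring copy.
            if key not in found or span.score > found[key].score:
                found[key] = span
    return sorted(found.values(), key=lambda s: s.start)
```

The overlap is there so that an entity sitting on a chunk boundary still appears whole in at least one window. In the recognizer, `predict` could presumably wrap the GLiNER model's per-text prediction call, and the remapped offsets would then feed the existing result-building logic unchanged.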

**Additional context**
The [`gliner-spacy`](https://github.com/theirstory/gliner-spacy) wrapper (which integrates GLiNER into spaCy pipelines) already implements a **chunking strategy** before feeding text to the model ([see implementation](https://github.com/theirstory/gliner-spacy/blob/main/gliner_spacy/pipeline.py#L60-L96)). Chunking is common practice for transformer-based models with a fixed maximum input length, and Presidio's GLiNERRecognizer should adopt it as well.


GLiNER Recognizer Truncates Long Text, Leading to Poor Redaction Results #1569
