
[Feature] Support is_split_into_words in the TokenClassificationPipeline. #38818


Open

yushi2006 wants to merge 28 commits into main from feature/supporting_split_into_words

Conversation

@yushi2006 (Contributor) commented Jun 13, 2025

What does this PR do?

Fixes #30757

This PR adds support for already-tokenized (pre-split) input in the TokenClassificationPipeline by adding a tokenizer parameter called is_split_into_words. The aggregation code is still needed because some tokenizers are subword tokenizers. The PR also fixes the problem that start and end were not assigned correctly when using the pipeline this way: the sentence is reconstructed, a map storing the character position of each word is created and passed along with the word ids to postprocess, and in gather_pre_entities the map and word ids are used to produce the final positions.

from transformers import pipeline
from datasets import load_dataset

# A few pre-tokenized examples from CoNLL-2003 (each example provides a list of words)
dataset = load_dataset("conll2003", split="test[:5]")

ner_pipeline = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="first")

for example in dataset:
    tokens = example['tokens']

    # Pass the pre-split words directly instead of a raw string
    predictions = ner_pipeline(tokens, is_split_into_words=True)

    for entity in predictions:
        print(entity)

Running this example, which does inference over a few pre-tokenized dataset examples, gives the following output:

[{'entity_group': 'MISC', 'score': np.float32(0.48014355), 'word': 'JAPAN', 'start': 9, 'end': 14}, {'entity_group': 'PER', 'score': np.float32(0.43156916), 'word': 'LUCKY', 'start': 19, 'end': 24}, {'entity_group': 'ORG', 'score': np.float32(0.64400136), 'word': 'CHINA', 'start': 31, 'end': 36}]
[{'entity_group': 'PER', 'score': np.float32(0.9979421), 'word': 'Nadim Ladki', 'start': 0, 'end': 11}]
[{'entity_group': 'LOC', 'score': np.float32(0.9851431), 'word': 'AL - AIN', 'start': 0, 'end': 6}, {'entity_group': 'LOC', 'score': np.float32(0.99941224), 'word': 'United Arab Emirates', 'start': 9, 'end': 29}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998572), 'word': 'Japan', 'start': 0, 'end': 5}, {'entity_group': 'MISC', 'score': np.float32(0.99814206), 'word': 'Asian Cup', 'start': 33, 'end': 42}, {'entity_group': 'LOC', 'score': np.float32(0.9997607), 'word': 'Syria', 'start': 78, 'end': 83}, {'entity_group': 'MISC', 'score': np.float32(0.77703416), 'word': 'Group C', 'start': 89, 'end': 96}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998673), 'word': 'China', 'start': 4, 'end': 9}, {'entity_group': 'LOC', 'score': np.float32(0.9997266), 'word': 'Uzbekistan', 'start': 119, 'end': 129}]
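
For illustration, here is a minimal standalone sketch of the word-to-character mapping described above (the helper name build_word_char_map is hypothetical, not this PR's internal code): the pre-split words are rejoined with single spaces, each word gets a character span in the reconstructed sentence, and the tokenizer's word ids tie every token back to a span.

from transformers import AutoTokenizer

def build_word_char_map(words):
    # Hypothetical helper: (start, end) character span of each word in the
    # sentence reconstructed by joining with single spaces.
    spans, position = [], 0
    for word in words:
        spans.append((position, position + len(word)))
        position += len(word) + 1  # +1 for the joining space
    return " ".join(words), spans

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
words = ["Nadim", "Ladki"]
sentence, word_spans = build_word_char_map(words)

encoding = tokenizer(words, is_split_into_words=True)
for token, word_id in zip(encoding.tokens(), encoding.word_ids()):
    if word_id is None:  # special tokens such as [CLS] and [SEP]
        continue
    start, end = word_spans[word_id]
    print(f"{token:12} -> word {word_id} spans chars ({start}, {end})")

The start and end values in the pipeline output above correspond to character spans of this kind in the reconstructed sentence.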

Who can review?

@Rocketknight1 @ArthurZucker

@yushi2006 force-pushed the feature/supporting_split_into_words branch from 59c7e4d to e3cc55a on June 13, 2025 at 17:50
@yushi2006 marked this pull request as ready for review on June 17, 2025 at 14:17
@yushi2006 force-pushed the feature/supporting_split_into_words branch from f179225 to 625702b on June 18, 2025 at 17:42
@yushi2006 force-pushed the feature/supporting_split_into_words branch from d69b116 to 552ceb3 on June 19, 2025 at 15:16
@yushi2006 requested a review from Rocketknight1 on June 19, 2025 at 15:25
@Rocketknight1 (Member) left a comment

Overall looks good, but made a couple of comments!

Comment on lines 269 to 271
@overload
def preprocess(
    self, sentence: str, is_split_into_words: bool = False, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...

@overload
def preprocess(
    self, sentence: list[str], is_split_into_words: bool = True, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...
@Rocketknight1 (Member):

The return type is the same in both cases so I'm not sure why we need the overload here!

@yushi2006 (Contributor, Author):

You're totally right, that overload was unnecessary. Not sure what I was thinking there 😅
The return type doesn’t change at all, so I’ve removed it from the code now.

Comment on lines 288 to 283
sentence = " ".join(words) # Recreate the sentence string for later display and slicing
# This map will allow us to convert back word => char indices
@Rocketknight1 (Member):

Dumb question, but really is_split_into_words means that the input has been split into tokens rather than words, right? If any of the tokens aren't full words, this display will be wrong.

I'm not sure if that's fixable, because different tokenizers use different ways of indicating word completions; I'm just not sure if this will always work perfectly!

@yushi2006 (Contributor, Author) commented Jun 20, 2025:

Hey! Not a dumb question at all; actually, a very insightful one.

You're right to question the assumptions around is_split_into_words, especially with the diversity in how tokenizers handle word boundaries. Luckily, this specific implementation accounts for that by checking whether a token is a subword using a simple but effective heuristic:

is_subword = len(word) != len(word_ref)

This allows us to detect subword tokens even when tokenizers split words in unpredictable ways. The aggregate_word function then groups these subword tokens together correctly, leveraging the fact that tokenizers provide enough information to reconstruct the original word groupings.
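
As a rough, self-contained illustration of that heuristic (this is not the pipeline's internal code, and the reconstructed-sentence setup is simplified here), comparing each token string against the text span it covers flags WordPiece continuation pieces as subwords:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

words = ["Nadim", "Ladki", "reports"]
sentence = " ".join(words)  # reconstructed sentence, as in this PR
encoding = tokenizer(sentence, return_offsets_mapping=True)

for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    if start == end:  # special tokens such as [CLS] and [SEP]
        continue
    word_ref = sentence[start:end]            # the text this token covers
    is_subword = len(token) != len(word_ref)  # e.g. "##xyz" (5 chars) vs "xyz" (3 chars)
    print(f"{token:12} {word_ref:12} subword={is_subword}")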

That said, your point did make me realize there's a deeper edge case here: languages like Japanese or Chinese that don't use whitespace as word boundaries.

So while the current code handles most whitespace-separated languages well, you're absolutely right that a more general, tokenizer-agnostic approach would be much trickier and would probably need deeper integration with smarter word-alignment logic.

One idea I had for this is to expose a preprocessing parameter — something like a delimiter — so the user can explicitly define how their original words were separated. For whitespace-based languages, it could default to ' ', but for others, users could pass an empty string or custom logic. For example:

sentence = delimiter.join(words)
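
A quick sketch of that idea (the delimiter parameter is only a suggestion here, not something this PR already exposes): the join and the word-to-character spans are driven by the same delimiter, so delimiter="" would cover languages without whitespace word boundaries.

def join_with_spans(words, delimiter=" "):
    # Hypothetical helper: reconstruct the sentence with an arbitrary
    # delimiter and record each word's (start, end) character span.
    spans, position = [], 0
    for word in words:
        spans.append((position, position + len(word)))
        position += len(word) + len(delimiter)
    return delimiter.join(words), spans

print(join_with_spans(["United", "Arab", "Emirates"]))   # whitespace-delimited language
print(join_with_spans(["東京", "都"], delimiter=""))       # no whitespace between words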

Thanks for the question; it helped me think more critically about generalization.

Let me know if you have any ideas or pointers on making this more robust; I'd love to explore better ways to handle those cases.

@yushi2006 force-pushed the feature/supporting_split_into_words branch from 00b0ffd to a3734de on June 20, 2025 at 17:00
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 7664914 to 3a2eb59 on June 20, 2025 at 18:01
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 285f409 to 6eb533e on June 20, 2025 at 18:03
@yushi2006 force-pushed the feature/supporting_split_into_words branch from a33707c to fbb897a on June 20, 2025 at 18:06
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 699c0ba to 21d97a6 on June 20, 2025 at 18:15

Successfully merging this pull request may close this issue: TokenClassificationPipeline support is_split_into_words tokeniser parameter

2 participants