[Feature] Support `is_split_into_words` in the `TokenClassificationPipeline`. #38818

yushi2006 · 2025-06-13T17:42:37Z

What does this PR do?

This PR adds support for already tokenized (pre-split) input in the TokenClassificationPipeline through a tokenizer parameter called is_split_into_words. I think we still need the aggregation code, because some tokenizers are subword tokenizers. I have also fixed the problem that start and end were not assigned correctly for this kind of input: the pipeline reconstructs the sentence, builds a map that stores the position of each word, and passes it along with words_id to postprocess. In gather_pre_entities, the map and words_id are used to produce the final positions.

from transformers import pipeline
from datasets import load_dataset

dataset = load_dataset("conll2003", split="test[:5]")

ner_pipeline = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="first")

for example in dataset:
    # Each example already carries its word-level tokens.
    tokens = example['tokens']

    predictions = ner_pipeline(tokens, is_split_into_words=True)

    # Print the full list of aggregated entities for this example.
    print(predictions)

This example runs inference on a batch from a preprocessed dataset. It gives the following output:

[{'entity_group': 'MISC', 'score': np.float32(0.48014355), 'word': 'JAPAN', 'start': 9, 'end': 14}, {'entity_group': 'PER', 'score': np.float32(0.43156916), 'word': 'LUCKY', 'start': 19, 'end': 24}, {'entity_group': 'ORG', 'score': np.float32(0.64400136), 'word': 'CHINA', 'start': 31, 'end': 36}]
[{'entity_group': 'PER', 'score': np.float32(0.9979421), 'word': 'Nadim Ladki', 'start': 0, 'end': 11}]
[{'entity_group': 'LOC', 'score': np.float32(0.9851431), 'word': 'AL - AIN', 'start': 0, 'end': 6}, {'entity_group': 'LOC', 'score': np.float32(0.99941224), 'word': 'United Arab Emirates', 'start': 9, 'end': 29}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998572), 'word': 'Japan', 'start': 0, 'end': 5}, {'entity_group': 'MISC', 'score': np.float32(0.99814206), 'word': 'Asian Cup', 'start': 33, 'end': 42}, {'entity_group': 'LOC', 'score': np.float32(0.9997607), 'word': 'Syria', 'start': 78, 'end': 83}, {'entity_group': 'MISC', 'score': np.float32(0.77703416), 'word': 'Group C', 'start': 89, 'end': 96}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998673), 'word': 'China', 'start': 4, 'end': 9}, {'entity_group': 'LOC', 'score': np.float32(0.9997266), 'word': 'Uzbekistan', 'start': 119, 'end': 129}]
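
The start and end values above are consistent with offsets measured in a sentence rebuilt by joining the words with single spaces. Below is a minimal sketch of that idea, not the code merged in this PR: the helper names build_word_offsets and token_spans are hypothetical, the single-space separator is an assumption, and the word_ids list is hand-written to resemble what a fast tokenizer's word_ids() method returns (None for special tokens).

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def build_word_offsets(words: List[str], sep: str = " ") -> Tuple[str, List[Span]]:
    """Join pre-split words and record each word's (start, end) in the joined text.

    Hypothetical helper; assumes words are separated by `sep`.
    """
    offsets: List[Span] = []
    cursor = 0
    for word in words:
        start = cursor
        end = start + len(word)
        offsets.append((start, end))
        # The next word begins after the separator.
        cursor = end + len(sep)
    return sep.join(words), offsets


def token_spans(word_ids: List[Optional[int]], offsets: List[Span]) -> List[Optional[Span]]:
    """Give each subword token the span of the word it came from.

    Special tokens (word id None) get no span.
    """
    return [None if wid is None else offsets[wid] for wid in word_ids]


# Illustrative input, chosen to resemble the third example above.
words = ["AL-AIN", ",", "United", "Arab", "Emirates"]
sentence, offsets = build_word_offsets(words)

# Hand-written word ids (an assumption): "AL-AIN" split into three subword tokens,
# every other word kept whole, with one special token at each end.
word_ids = [None, 0, 0, 0, 1, 2, 3, 4, None]
spans = token_spans(word_ids, offsets)

# An entity covering words 2..4 spans from the first word's start to the last word's end.
entity_start, entity_end = offsets[2][0], offsets[4][1]
print(sentence[entity_start:entity_end], entity_start, entity_end)
```

Because every subword token inherits the span of its source word, aggregating tokens into an entity only needs the start of the first word and the end of the last one, which is why the positions stay aligned with the reconstructed sentence rather than with the subword pieces.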

Who can review?

@Rocketknight1 @ArthurZucker

src/transformers/pipelines/token_classification.py

Rocketknight1

Overall looks good, but made a couple of comments!

src/transformers/pipelines/token_classification.py

Rocketknight1

Took another look and LGTM! In particular, it seems like all the behaviour when is_split_into_words is False, the default, is unchanged, so this shouldn't cause any backward compatibility problems, which makes it much easier to approve. Thank you for the PR!

HuggingFaceDocBuilderDev · 2025-06-23T15:32:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

some fixes

e3cc55a

yushi2006 force-pushed the feature/supporting_split_into_words branch from 59c7e4d to e3cc55a Compare June 13, 2025 17:50

yushi2006 added 17 commits June 13, 2025 20:51

some fixes

42529c8

now the pipeline can take list of tokens as input and is_split_into_words argument

988f1d4

now the pipeline can take list of tokens as input and is_split_into_words argument

f443f93

now the pipeline can take list of tokens as input and is_split_into_words argument and we can handle batches of tokenized input

0caf1a5

now the pipeline can take list of tokens as input and is_split_into_words argument and we can handle batches of tokenized input

6f43dde

solving test problems

f230f30

some fixes

49589b8

some fixes

4efd61d

Merge remote-tracking branch 'origin/main' into feature/supporting_split_into_words

ade82f6

modify tests

e51f704

aligning start and end correctly

086ac74

adding tests

53fc492

some formatting

4e17c2c

some formatting

e6a25d3

some fixes

d976a54

some fixes

0ddfe17

some fixes

4f89bd5

yushi2006 marked this pull request as ready for review June 17, 2025 14:17

github-actions bot requested review from ArthurZucker and Rocketknight1 June 17, 2025 14:18

resolve conflicts

625702b

yushi2006 force-pushed the feature/supporting_split_into_words branch from f179225 to 625702b Compare June 18, 2025 17:42

resolve conflicts

3fb7995

Rocketknight1 reviewed Jun 19, 2025

src/transformers/pipelines/token_classification.py (resolved)

removing unimportant lines

552ceb3

yushi2006 force-pushed the feature/supporting_split_into_words branch from d69b116 to 552ceb3 Compare June 19, 2025 15:16

yushi2006 requested a review from Rocketknight1 June 19, 2025 15:25

Rocketknight1 approved these changes Jun 20, 2025

src/transformers/pipelines/token_classification.py (outdated, resolved)

src/transformers/pipelines/token_classification.py (outdated, resolved)

removing unimportant lines

a3734de

yushi2006 force-pushed the feature/supporting_split_into_words branch from 00b0ffd to a3734de Compare June 20, 2025 17:00

generalize to other languages

3a2eb59

yushi2006 force-pushed the feature/supporting_split_into_words branch from 7664914 to 3a2eb59 Compare June 20, 2025 18:01

generalize to other languages

6eb533e

yushi2006 force-pushed the feature/supporting_split_into_words branch from 285f409 to 6eb533e Compare June 20, 2025 18:03

generalize to other languages

fbb897a

yushi2006 force-pushed the feature/supporting_split_into_words branch from a33707c to fbb897a Compare June 20, 2025 18:06

generalize to other languages

21d97a6

yushi2006 force-pushed the feature/supporting_split_into_words branch from 699c0ba to 21d97a6 Compare June 20, 2025 18:15

Merge branch 'main' into feature/supporting_split_into_words

28a20e1

yushi2006 requested review from Rocketknight1 June 20, 2025 18:31

Merge branch 'main' into feature/supporting_split_into_words

fb913a6

Rocketknight1 approved these changes Jun 23, 2025

Rocketknight1 enabled auto-merge (squash) June 23, 2025 15:27

Rocketknight1 merged commit 9eac19e into huggingface:main Jun 23, 2025
20 checks passed

yushi2006 deleted the feature/supporting_split_into_words branch June 23, 2025 15:37
