[Feature] Support is_split_into_words in the TokenClassificationPipeline. #38818

Merged

yushi2006
Contributor

@yushi2006 yushi2006 commented Jun 13, 2025

What does this PR do?

Fixes #30757

This PR adds support for already tokenized input in the TokenClassificationPipeline through a new tokenizer parameter called is_split_into_words. I think we still need the aggregation code, because some tokenizers are subword tokenizers and a single input word can map to several tokens. The PR also fixes start and end being assigned incorrectly for pre-split input: the pipeline reconstructs the sentence, builds a map that stores the position of each word, and passes that map along with words_id to postprocess. In gather_pre_entities, the map and words_id are then used to produce the final positions.

from transformers import pipeline
from datasets import load_dataset

dataset = load_dataset("conll2003", split="test[:5]")

ner_pipeline = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="first")

for example in dataset:
    tokens = example['tokens']

    predictions = ner_pipeline(tokens, is_split_into_words=True)

    for entity in predictions:
        print(entity)

The pipeline example above runs inference on the first five pre-tokenized examples of the dataset and gives this output:

[{'entity_group': 'MISC', 'score': np.float32(0.48014355), 'word': 'JAPAN', 'start': 9, 'end': 14}, {'entity_group': 'PER', 'score': np.float32(0.43156916), 'word': 'LUCKY', 'start': 19, 'end': 24}, {'entity_group': 'ORG', 'score': np.float32(0.64400136), 'word': 'CHINA', 'start': 31, 'end': 36}]
[{'entity_group': 'PER', 'score': np.float32(0.9979421), 'word': 'Nadim Ladki', 'start': 0, 'end': 11}]
[{'entity_group': 'LOC', 'score': np.float32(0.9851431), 'word': 'AL - AIN', 'start': 0, 'end': 6}, {'entity_group': 'LOC', 'score': np.float32(0.99941224), 'word': 'United Arab Emirates', 'start': 9, 'end': 29}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998572), 'word': 'Japan', 'start': 0, 'end': 5}, {'entity_group': 'MISC', 'score': np.float32(0.99814206), 'word': 'Asian Cup', 'start': 33, 'end': 42}, {'entity_group': 'LOC', 'score': np.float32(0.9997607), 'word': 'Syria', 'start': 78, 'end': 83}, {'entity_group': 'MISC', 'score': np.float32(0.77703416), 'word': 'Group C', 'start': 89, 'end': 96}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998673), 'word': 'China', 'start': 4, 'end': 9}, {'entity_group': 'LOC', 'score': np.float32(0.9997266), 'word': 'Uzbekistan', 'start': 119, 'end': 129}]
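
To make the position mapping described above more concrete, here is a minimal standalone sketch of the idea. It is not the code from this PR: the helper names `build_word_spans` and `token_span` are hypothetical, the single-space delimiter is an assumption, and the word ids are written by hand for an assumed subword split rather than produced by a real tokenizer.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def build_word_spans(words: List[str], delimiter: str = " ") -> Tuple[str, List[Span]]:
    """Join pre-split words into one sentence and record each word's (start, end) span.

    Hypothetical helper for illustration; the delimiter is an assumption.
    """
    spans: List[Span] = []
    cursor = 0
    for word in words:
        start = cursor
        end = start + len(word)
        spans.append((start, end))
        # The next word begins after the delimiter that follows this one.
        cursor = end + len(delimiter)
    return delimiter.join(words), spans


def token_span(word_id: Optional[int], spans: List[Span]) -> Optional[Span]:
    """Return the character span of the word a subword token belongs to."""
    if word_id is None:  # special tokens such as [CLS] or [SEP] have no word
        return None
    return spans[word_id]


words = ["Nadim", "Ladki"]
sentence, spans = build_word_spans(words)

# Hand-written word ids for an assumed subword split such as
# [CLS] Na ##di ##m La ##d ##ki [SEP]; None marks special tokens.
word_ids = [None, 0, 0, 0, 1, 1, 1, None]

# Every subword token inherits the span of its word, so an entity that covers
# all of these tokens runs from the first word's start to the last word's end.
covered = [s for s in (token_span(w, spans) for w in word_ids) if s is not None]
entity_start, entity_end = covered[0][0], covered[-1][1]

print(sentence)                    # Nadim Ladki
print(spans)                       # [(0, 5), (6, 11)]
print(entity_start, entity_end)    # 0 11
```

Under these assumptions the result agrees with the 'Nadim Ladki' entry in the output above, where start is 0 and end is 11.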

Who can review?

@Rocketknight1 @ArthurZucker

@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from 59c7e4d to e3cc55a Compare June 13, 2025 17:50
@yushi2006 yushi2006 marked this pull request as ready for review June 17, 2025 14:17
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from f179225 to 625702b Compare June 18, 2025 17:42
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from d69b116 to 552ceb3 Compare June 19, 2025 15:16
@yushi2006 yushi2006 requested a review from Rocketknight1 June 19, 2025 15:25
Member

@Rocketknight1 Rocketknight1 left a comment

Overall looks good, but made a couple of comments!

@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from 00b0ffd to a3734de Compare June 20, 2025 17:00
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from 7664914 to 3a2eb59 Compare June 20, 2025 18:01
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from 285f409 to 6eb533e Compare June 20, 2025 18:03
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from a33707c to fbb897a Compare June 20, 2025 18:06
@yushi2006 yushi2006 force-pushed the feature/supporting_split_into_words branch from 699c0ba to 21d97a6 Compare June 20, 2025 18:15
Member

@Rocketknight1 Rocketknight1 left a comment

Took another look and LGTM! In particular, it seems like all the behaviour when is_split_into_words is False, the default, is unchanged, so this shouldn't cause any backward compatibility problems, which makes it much easier to approve. Thank you for the PR!

@Rocketknight1 Rocketknight1 enabled auto-merge (squash) June 23, 2025 15:27
@Rocketknight1 Rocketknight1 merged commit 9eac19e into huggingface:main Jun 23, 2025
20 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yushi2006 yushi2006 deleted the feature/supporting_split_into_words branch June 23, 2025 15:37
Development

Successfully merging this pull request may close these issues.

TokenClassificationPipeline support is_split_into_words tokeniser parameter
3 participants