
[Feature] Support is_split_into_words in the TokenClassificationPipeline. #38818


Open

yushi2006 wants to merge 28 commits into main from feature/supporting_split_into_words

Conversation

@yushi2006 (Contributor) commented Jun 13, 2025

What does this PR do?

Fixes #30757

This PR adds support for already-tokenized (pre-split) input in the TokenClassificationPipeline by adding a tokenizer parameter called is_split_into_words. The aggregation code is still needed because some tokenizers are subword tokenizers. The PR also fixes the problem that start and end were not assigned correctly when using the pipeline this way: the sentence is reconstructed, a map storing the character position of each word is created and passed along with the word ids to postprocess, and in gather_pre_entities the map and word ids are used to produce the final positions.

from transformers import pipeline
from datasets import load_dataset

# A few pre-tokenized examples from CoNLL-2003 (each example provides a list of words)
dataset = load_dataset("conll2003", split="test[:5]")

ner_pipeline = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="first")

for example in dataset:
    tokens = example['tokens']

    # Pass the pre-split words directly instead of a raw string
    predictions = ner_pipeline(tokens, is_split_into_words=True)

    for entity in predictions:
        print(entity)

Running this example, which does inference over a few pre-tokenized dataset examples, gives the following output:

[{'entity_group': 'MISC', 'score': np.float32(0.48014355), 'word': 'JAPAN', 'start': 9, 'end': 14}, {'entity_group': 'PER', 'score': np.float32(0.43156916), 'word': 'LUCKY', 'start': 19, 'end': 24}, {'entity_group': 'ORG', 'score': np.float32(0.64400136), 'word': 'CHINA', 'start': 31, 'end': 36}]
[{'entity_group': 'PER', 'score': np.float32(0.9979421), 'word': 'Nadim Ladki', 'start': 0, 'end': 11}]
[{'entity_group': 'LOC', 'score': np.float32(0.9851431), 'word': 'AL - AIN', 'start': 0, 'end': 6}, {'entity_group': 'LOC', 'score': np.float32(0.99941224), 'word': 'United Arab Emirates', 'start': 9, 'end': 29}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998572), 'word': 'Japan', 'start': 0, 'end': 5}, {'entity_group': 'MISC', 'score': np.float32(0.99814206), 'word': 'Asian Cup', 'start': 33, 'end': 42}, {'entity_group': 'LOC', 'score': np.float32(0.9997607), 'word': 'Syria', 'start': 78, 'end': 83}, {'entity_group': 'MISC', 'score': np.float32(0.77703416), 'word': 'Group C', 'start': 89, 'end': 96}]
[{'entity_group': 'LOC', 'score': np.float32(0.9998673), 'word': 'China', 'start': 4, 'end': 9}, {'entity_group': 'LOC', 'score': np.float32(0.9997266), 'word': 'Uzbekistan', 'start': 119, 'end': 129}]
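
For illustration, here is a minimal standalone sketch of the word-to-character mapping described above (the helper name build_word_char_map is hypothetical, not this PR's internal code): the pre-split words are rejoined with single spaces, each word gets a character span in the reconstructed sentence, and the tokenizer's word ids tie every token back to a span.

from transformers import AutoTokenizer

def build_word_char_map(words):
    # Hypothetical helper: (start, end) character span of each word in the
    # sentence reconstructed by joining with single spaces.
    spans, position = [], 0
    for word in words:
        spans.append((position, position + len(word)))
        position += len(word) + 1  # +1 for the joining space
    return " ".join(words), spans

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
words = ["Nadim", "Ladki"]
sentence, word_spans = build_word_char_map(words)

encoding = tokenizer(words, is_split_into_words=True)
for token, word_id in zip(encoding.tokens(), encoding.word_ids()):
    if word_id is None:  # special tokens such as [CLS] and [SEP]
        continue
    start, end = word_spans[word_id]
    print(f"{token:12} -> word {word_id} spans chars ({start}, {end})")

The start and end values in the pipeline output above correspond to character spans of this kind in the reconstructed sentence.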

Who can review?

@Rocketknight1 @ArthurZucker

@yushi2006 force-pushed the feature/supporting_split_into_words branch from 59c7e4d to e3cc55a on June 13, 2025 at 17:50
@yushi2006 marked this pull request as ready for review on June 17, 2025 at 14:17
@yushi2006 force-pushed the feature/supporting_split_into_words branch from f179225 to 625702b on June 18, 2025 at 17:42
@yushi2006 force-pushed the feature/supporting_split_into_words branch from d69b116 to 552ceb3 on June 19, 2025 at 15:16
@yushi2006 requested a review from Rocketknight1 on June 19, 2025 at 15:25
@Rocketknight1 (Member) left a comment

Overall looks good, but made a couple of comments!

Comment on lines 269 to 271
@overload
def preprocess(
    self, sentence: str, is_split_into_words: bool = False, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...

@overload
def preprocess(
    self, sentence: list[str], is_split_into_words: bool = True, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...
@Rocketknight1 (Member):

The return type is the same in both cases so I'm not sure why we need the overload here!

@yushi2006 (Contributor, Author):

You're totally right, that overload was unnecessary. Not sure what I was thinking there 😅
The return type doesn’t change at all, so I’ve removed it from the code now.

Comment on lines 288 to 283
sentence = " ".join(words) # Recreate the sentence string for later display and slicing
# This map will allow us to convert back word => char indices
@Rocketknight1 (Member):

Dumb question, but really is_split_into_words means that the input has been split into tokens rather than words, right? If any of the tokens aren't full words, this display will be wrong.

I'm not sure if that's fixable, because different tokenizers use different ways of indicating word completions; I'm just not sure if this will always work perfectly!

@yushi2006 (Contributor, Author) commented Jun 20, 2025:

Hey! Not a dumb question at all; actually, a very insightful one.

You're right to question the assumptions around is_split_into_words, especially with the diversity in how tokenizers handle word boundaries. Luckily, this specific implementation accounts for that by checking whether a token is a subword using a simple but effective heuristic:

is_subword = len(word) != len(word_ref)

This allows us to detect subword tokens even when tokenizers split words in unpredictable ways. The aggregate_word function then groups these subword tokens together correctly, leveraging the fact that tokenizers provide enough information to reconstruct the original word groupings.
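
As a rough, self-contained illustration of that heuristic (this is not the pipeline's internal code, and the reconstructed-sentence setup is simplified here), comparing each token string against the text span it covers flags WordPiece continuation pieces as subwords:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

words = ["Nadim", "Ladki", "reports"]
sentence = " ".join(words)  # reconstructed sentence, as in this PR
encoding = tokenizer(sentence, return_offsets_mapping=True)

for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    if start == end:  # special tokens such as [CLS] and [SEP]
        continue
    word_ref = sentence[start:end]            # the text this token covers
    is_subword = len(token) != len(word_ref)  # e.g. "##xyz" (5 chars) vs "xyz" (3 chars)
    print(f"{token:12} {word_ref:12} subword={is_subword}")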

That said, your point did make me realize there's a deeper edge case here: languages like Japanese or Chinese that don't use whitespace as word boundaries.

So while the current code handles most whitespace-separated languages well, you're absolutely right that a more general, tokenizer-agnostic approach would be much trickier and would probably need deeper integration with smarter word-alignment logic.

One idea I had for this is to expose a preprocessing parameter — something like a delimiter — so the user can explicitly define how their original words were separated. For whitespace-based languages, it could default to ' ', but for others, users could pass an empty string or custom logic. For example:

sentence = delimiter.join(words)
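
A quick sketch of that idea (the delimiter parameter is only a suggestion here, not something this PR already exposes): the join and the word-to-character spans are driven by the same delimiter, so delimiter="" would cover languages without whitespace word boundaries.

def join_with_spans(words, delimiter=" "):
    # Hypothetical helper: reconstruct the sentence with an arbitrary
    # delimiter and record each word's (start, end) character span.
    spans, position = [], 0
    for word in words:
        spans.append((position, position + len(word)))
        position += len(word) + len(delimiter)
    return delimiter.join(words), spans

print(join_with_spans(["United", "Arab", "Emirates"]))   # whitespace-delimited language
print(join_with_spans(["東京", "都"], delimiter=""))       # no whitespace between words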

Thanks for the question; it helped me think more critically about generalization.

Let me know if you have any ideas or pointers on making this more robust; I'd love to explore better ways to handle those cases.

@yushi2006 force-pushed the feature/supporting_split_into_words branch from 00b0ffd to a3734de on June 20, 2025 at 17:00
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 7664914 to 3a2eb59 on June 20, 2025 at 18:01
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 285f409 to 6eb533e on June 20, 2025 at 18:03
@yushi2006 force-pushed the feature/supporting_split_into_words branch from a33707c to fbb897a on June 20, 2025 at 18:06
@yushi2006 force-pushed the feature/supporting_split_into_words branch from 699c0ba to 21d97a6 on June 20, 2025 at 18:15

Successfully merging this pull request may close this issue: TokenClassificationPipeline support is_split_into_words tokeniser parameter

2 participants