[Feature] Support is_split_into_words in the TokenClassificationPipeline #38818
Conversation
Overall looks good, but made a couple of comments!
```python
@overload
def preprocess(
    self, sentence: str, is_split_into_words: bool = False, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...

@overload
def preprocess(
    self, sentence: list[str], is_split_into_words: bool = True, offset_mapping=None, **preprocess_params
) -> Iterator[dict]: ...
```
The return type is the same in both cases so I'm not sure why we need the overload here!
You're totally right, that overload was unnecessary. Not sure what I was thinking there 😅
The return type doesn’t change at all, so I’ve removed it from the code now.
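For reference, a minimal sketch of what the single signature could look like after removing the overloads (the Union annotation is my own guess, not necessarily the final code):

```python
from typing import Iterator, Union


def preprocess(
    self,
    sentence: Union[str, list[str]],
    is_split_into_words: bool = False,
    offset_mapping=None,
    **preprocess_params,
) -> Iterator[dict]:
    # One signature covers both raw strings and pre-split word lists,
    # so the @overload pair is unnecessary when the return type is identical.
    ...
```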
sentence = " ".join(words) # Recreate the sentence string for later display and slicing | ||
# This map will allows to convert back word => char indices |
Dumb question, but is_split_into_words really means that the input has been split into tokens rather than words, right? If any of the tokens aren't full words, this display will be wrong.
I'm not sure if that's fixable, because different tokenizers use different ways of indicating word completions; I'm just not sure this will always work perfectly!
Hey! Not a dumb question at all, actually a very insightful one.
You're right to question the assumptions around is_split_into_words, especially given the diversity in how tokenizers handle word boundaries. Luckily, this specific implementation accounts for that by checking whether a token is a subword using a simple but effective heuristic:
```python
is_subword = len(word) != len(word_ref)
```
This allows us to detect subword tokens even when tokenizers split words in unpredictable ways. The aggregate_word function then groups these subword tokens together correctly, leveraging the fact that tokenizers provide enough information to reconstruct the original word groupings.
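To make that concrete, here is a rough, self-contained sketch of the heuristic in action (the checkpoint and the example sentence are my own assumptions; the real logic lives in gather_pre_entities):

```python
# Illustrative sketch only: assumes a fast tokenizer so offset mappings are available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # example checkpoint, an assumption

words = ["HuggingFace", "is", "based", "in", "Brooklyn"]
sentence = " ".join(words)

encoding = tokenizer(sentence, return_offsets_mapping=True, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])

for word, (start_ind, end_ind) in zip(tokens, encoding["offset_mapping"]):
    word_ref = sentence[start_ind:end_ind]   # the text span the token covers in the sentence
    is_subword = len(word) != len(word_ref)  # "##" continuation tokens are longer than the span they cover
    print(f"{word!r:>12} covers {word_ref!r:<12} is_subword={is_subword}")
```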
That said, your point did make me realize there's a deeper edge case here: languages like Japanese or Chinese that don't use whitespace as word boundaries.
So while the current code handles most whitespace-separated languages well, you're absolutely right that a more general, tokenizer-agnostic approach would be much trickier and would probably need deeper integration with smarter word-alignment logic.
One idea I had for this is to expose a preprocessing parameter — something like a delimiter — so the user can explicitly define how their original words were separated. For whitespace-based languages, it could default to ' ', but for others, users could pass an empty string or custom logic. For example:
sentence = delimiter.join(words)
Thanks for the question, it helped me think more critically about generalization.
Let me know if you have any ideas or pointers on making this more robust; I'd love to explore better ways to handle those cases.
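For what it's worth, here is a small sketch of how a delimiter-aware word => char-index map could be built during sentence reconstruction (build_word_char_map and its signature are hypothetical, just to illustrate the idea):

```python
# Hypothetical sketch of the delimiter idea: build a word => (char_start, char_end) map
# while reconstructing the sentence, so entity offsets can be mapped back to words later.
def build_word_char_map(words: list[str], delimiter: str = " ") -> tuple[str, list[tuple[int, int]]]:
    word_to_chars = []
    cursor = 0
    for i, word in enumerate(words):
        start = cursor
        end = start + len(word)
        word_to_chars.append((start, end))
        cursor = end + (len(delimiter) if i < len(words) - 1 else 0)
    sentence = delimiter.join(words)
    return sentence, word_to_chars


sentence, word_to_chars = build_word_char_map(["My", "name", "is", "Wolfgang"])
assert sentence[slice(*word_to_chars[3])] == "Wolfgang"
```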
What does this PR do?
Fixes #30757
This PR adds support for already tokenized (pre-split) input in the TokenClassificationPipeline by adding a tokenizer parameter called is_split_into_words. I think we still need the aggregation code because some tokenizers are subword tokenizers. The problem that start and end were not correctly assigned when using the pipeline is solved by reconstructing the sentence, creating a map that stores the position of each word, and passing it along with words_id to postprocess; in gather_pre_entities, we use the map and words_id to produce the final positions. The example below runs inference on a batch of a pre-tokenized dataset and shows the resulting output.
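A minimal sketch of the kind of call this enables (the checkpoint, the batch contents, and passing is_split_into_words at call time are my own assumptions for illustration):

```python
# Hedged sketch: assumes the pipeline forwards is_split_into_words to the tokenizer,
# as introduced in this PR; the model checkpoint is only an example.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")

# A batch of pre-tokenized (already split) sentences.
batch = [
    ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin"],
    ["Hugging", "Face", "is", "based", "in", "New", "York"],
]

results = ner(batch, is_split_into_words=True)
for entities in results:
    print(entities)
```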
Who can review?
@Rocketknight1 @ArthurZucker