[Feature] Support `is_split_into_words` in the TokenClassificationPipeline #38818
Conversation
Overall looks good, but I made a couple of comments!
Took another look and LGTM! In particular, it seems like all the behaviour when `is_split_into_words` is `False`, the default, is unchanged, so this shouldn't cause any backward compatibility problems, which makes it much easier to approve. Thank you for the PR!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Fixes #30757
This PR solves the problem of using already-tokenized input with the `TokenClassificationPipeline` by adding a tokenizer parameter called `is_split_into_words`. The aggregation code is still needed because some tokenizers are subword tokenizers. Previously, `start` and `end` were not assigned correctly when using the pipeline with pre-tokenized input. This is fixed by reconstructing the sentence and building a map that stores the character position of each word; the map is passed along with `words_id` to `postprocess`, and in `gather_pre_entities` the map and `words_id` are used to produce the final positions. Running the example below, which does inference on a batch of a pre-tokenized dataset, now gives this output.
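To illustrate the mapping idea described above, here is a minimal, hypothetical sketch (not the actual PR code): given pre-tokenized input, reconstruct the sentence by joining words with single spaces and record each word's character span, so that a token's word id can later be resolved to `start`/`end` offsets. The helper name `build_word_offsets` is an assumption for illustration only.

```python
# Hypothetical helper sketching the word-position map described in the PR:
# reconstruct the sentence from pre-split words and record each word's
# (start, end) character span, keyed by its word id.

def build_word_offsets(words):
    """Return the reconstructed sentence and a word_id -> (start, end) map."""
    offsets = {}
    parts = []
    pos = 0
    for word_id, word in enumerate(words):
        if parts:
            pos += 1  # account for the single space used to join words
        offsets[word_id] = (pos, pos + len(word))
        parts.append(word)
        pos += len(word)
    return " ".join(parts), offsets

sentence, offsets = build_word_offsets(["Hello", "New", "York"])
# sentence == "Hello New York"; offsets[2] == (10, 14)
```

With such a map, `gather_pre_entities` can look up each subword token's word id and emit character offsets relative to the reconstructed sentence instead of leaving `start`/`end` unset.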
Who can review?
@Rocketknight1 @ArthurZucker