
Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing #23

Closed
tomaarsen opened this issue Aug 7, 2023 · 3 comments · Fixed by #28

Comments

@tomaarsen
Owner

tomaarsen commented Aug 7, 2023

Hello!

This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:

```python
# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])
```

This is a consequence of the RoBERTa tokenizer distinguishing `,` and ` ,` (a comma preceded by a space) as different tokens, and the SpanMarker model is only familiar with the space-preceded variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!

The (m)BERT-based SpanMarker models do not require this preprocessing.
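To illustrate the kind of preprocessing meant above, here is a minimal, naive sketch. The `separate_punctuation` helper is hypothetical and not part of SpanMarker; a real word tokenizer (segtok, spaCy, NLTK) handles far more edge cases than this regex does:

```python
import re

def separate_punctuation(text: str) -> str:
    """Naively split clause/sentence punctuation off the preceding word.

    The lookbehind requires two word characters before the punctuation,
    so single-letter abbreviations such as "J." are left intact.
    """
    return re.sub(r"(?<=\w{2})([.,!?;:])(?=\s|$)", r" \1", text)

print(separate_punctuation("He plays J. Robert Oppenheimer, an American theoretical physicist."))
# → He plays J. Robert Oppenheimer , an American theoretical physicist .
```

This is only meant to show the target format; in practice a maintained tokenizer is the safer choice.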

@stefan-it

stefan-it commented Aug 11, 2023

Hey @tomaarsen ,

I may have a possible alternative solution: in Flair we can also construct and predict sentences that are given by the user. For this tokenization problem we use the v1 version of segtok. You could split the input into sentences, but for just tokenizing, the word_tokenizer can be used:

https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L210

I think this could easily be added in the Model Hub Inference logic:

https://github.com/huggingface/api-inference-community/blob/main/docker_images/span_marker/app/pipelines/token_classification.py#L35

So inputs could first be tokenized by word_tokenizer. I think segtok would be a great alternative, and it is more lightweight than spaCy.

Another alternative: instead of implementing it only on the Model Hub side, maybe it could be implemented in model.predict directly 🤔

@tomaarsen
Owner Author

I'll certainly consider this approach, whether with segtok, spaCy or NLTK. The spaCy version is already implemented.

By default, perhaps I can apply the tokenization only if `Hello, there.` tokenizes differently than `Hello , there .`?
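That detection idea could be sketched like this. `needs_preprocessing` is a hypothetical helper, not existing SpanMarker API, and `tokenize` stands in for the model's tokenizer callable:

```python
def needs_preprocessing(tokenize) -> bool:
    """Heuristic: return True if the tokenizer treats attached and
    pre-separated punctuation differently, i.e. input text would
    need punctuation to be split off before prediction."""
    return tokenize("Hello, there.") != tokenize("Hello , there .")

# A plain whitespace splitter distinguishes the two forms, roughly like
# RoBERTa's space-sensitive BPE does:
print(needs_preprocessing(str.split))  # → True
```

A tokenizer that already normalizes punctuation away would return the same tokens for both probes, and the check would report that no preprocessing is needed.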

@tomaarsen tomaarsen linked a pull request Sep 28, 2023 that will close this issue
@tomaarsen
Owner Author

I've discovered that the issue only persisted for XLM-RoBERTa, and I've been able to tackle it in f2edd06!

@tomaarsen tomaarsen unpinned this issue Sep 29, 2023