[TokenClassificationRecord] Alignment issues when token is a white space #1264
Comments
cc @frascuchon
The problem you mention here is not about annotations with whitespace tokens, but about annotations STARTING or ENDING with whitespace tokens. The mechanism we use to validate the annotation is not sophisticated enough to disambiguate these spaces. In your example, it is impossible, or at least quite difficult, to determine whether the whitespace token refers to the first or second white space in the text. The tokenization in your example is just a spaCy convention, where a whitespace following another whitespace is represented as a whitespace token, but this may not be a universal convention. The following example, however, works perfectly:

```python
from rubrix.server.tasks.token_classification.api.model import CreationTokenClassificationRecord
from spacy import load

nlp = load("en_core_web_sm")
text = "every four  (4)"  # note the double space
doc = nlp(text)
record = {
    "text": text,
    "tokens": list(map(str, doc)),  # ['every', 'four', ' ', '(', '4', ')']
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": len(text), "label": "mock"}]},
}
CreationTokenClassificationRecord.parse_obj(record)
```

Note that the token list also contains a whitespace token, but the span is defined over real text information. I think it is good practice not to define spans that start or end in a whitespace (at char level), even if those whitespaces are part of the tokenization, because they do not really carry relevant information. What's the informational difference between the mention spans "Isabel D." and "Isabel D. "? For me there is none. Moreover, I imagine the trend is for models and annotators to generate annotations closer to the first one. But that's my point of view; maybe @dvsrepo can contribute more info to clarify this edge case.
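As a side note, trimming spans at char level so they never start or end on whitespace can be sketched in a few lines. The `trim_span` helper below is hypothetical (not part of Rubrix), just an illustration of the practice described above:

```python
def trim_span(text: str, start: int, end: int) -> tuple:
    """Shrink a char span so it neither starts nor ends on whitespace.

    Hypothetical helper, not part of the Rubrix API.
    """
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Isabel D.  lives here"
# A span covering "Isabel D. " (with trailing space) gets trimmed to "Isabel D."
start, end = trim_span(text, 0, 10)
print(text[start:end])
```

Applied before record validation, a step like this would silently resolve the trailing-whitespace case instead of raising an alignment error.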
Anyway, maybe we can improve the error message, putting the focus on irrelevant whitespaces in the mention.
@frascuchon Thanks for the clarification! The truth is I just assumed it would affect every white space token; I did not test it properly. You are right, annotations/predictions with trailing (double) white spaces do not make much sense, and it is good practice to avoid them. But they do exist in the wild jungle of human offset annotations and of models trained on them. From my experience with the veganuary and the metrics blog post, I see that alignment issues can be a real obstacle for less experienced users, so anything that mitigates them would lower the entry barrier.
Since we cannot resolve span alignment when whitespaces are included as tokens and the span definition also contains whitespaces, at least not without making some assumptions about the text, I vote for trying to clean the provided record info, which means:
For example, the result for your record with this approach will be:

```python
{
    'text': 'every four  (4)',
    'tokens': ['every', 'four', '(', '4', ')'],
    'prediction': {
        'agent': 'mock',
        'score': None,
        'entities': [
            {'start': 0, 'end': 15, 'label': 'mock', 'score': 1.0}
        ],
    }
}
```

Does it make sense, @dcfidalgo?
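For illustration, such a cleanup step could look roughly like the sketch below. The `clean_record` helper is hypothetical, not Rubrix code; it assumes the record layout shown above, drops whitespace-only tokens, and trims entity spans so they do not start or end on whitespace:

```python
def clean_record(record: dict) -> dict:
    """Hypothetical cleanup: drop whitespace-only tokens and trim entity spans."""
    text = record["text"]
    cleaned = dict(record)
    # Remove tokens that consist only of whitespace.
    cleaned["tokens"] = [t for t in record["tokens"] if t.strip()]
    if "prediction" in record:
        entities = []
        for ent in record["prediction"].get("entities", []):
            start, end = ent["start"], ent["end"]
            # Shrink the span until it neither starts nor ends on whitespace.
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            entities.append({**ent, "start": start, "end": end})
        cleaned["prediction"] = {**record["prediction"], "entities": entities}
    return cleaned


record = {
    "text": "every four  (4)",
    "tokens": ["every", "four", " ", "(", "4", ")"],
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": 15, "label": "mock"}]},
}
cleaned = clean_record(record)
print(cleaned["tokens"])  # the whitespace token is gone
```

In this example the span already ends on real text, so only the token list changes; a span like `"Isabel D. "` would additionally have its `end` trimmed.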
Describe the bug
Not sure if this is really a bug, but it would be nice to support cases where the tokens are white spaces. We are encountering a lot of these cases in the med7 dataset for the metrics blog post.
To Reproduce
Expected behavior
Accept entity