
[TokenClassificationRecord] Alignment issues when token is a white space #1264

Closed · dcfidalgo opened this issue Mar 17, 2022 · 5 comments · Fixed by #1591
Labels: type: bug (Indicates an unexpected problem or unintended behavior)

@dcfidalgo (Contributor)
Describe the bug
I'm not sure if this is really a bug, but it would be nice to support cases where tokens are white spaces. We are encountering a lot of these cases in the med7 dataset for the metrics blog post.

To Reproduce

from rubrix.server.tasks.token_classification.api.model import CreationTokenClassificationRecord
from spacy import load

nlp = load("en_core_web_sm")
text = "every four (4)  "
doc = nlp(text)

record = {
    "text": text,
    "tokens": list(map(str, doc)),  # ['every', 'four', '(', '4', ')', ' ']
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": 16, "label": "mock"}]},
}

CreationTokenClassificationRecord.parse_obj(record)

# AssertionError: Provided entity span every four (4)   is not aligned with provided tokens.

Expected behavior
The entity should be accepted.


dcfidalgo added the type: bug label on Mar 17, 2022
dcfidalgo changed the title from "[TokenClassificationRecord] Alignment issues" to "[TokenClassificationRecord] Alignment issues when token is a white space" on Mar 17, 2022
dcfidalgo self-assigned this on Mar 17, 2022
@dcfidalgo (Contributor, Author)

cc @frascuchon

@frascuchon (Member)

The problem you mention here is not about annotations with whitespace tokens, but about annotations STARTING or ENDING with whitespace tokens.

The mechanism we have to validate annotations is not sophisticated enough to disambiguate these spaces. In your example, it is quite difficult, if not impossible, to determine whether the whitespace token refers to the first or the second white space in the text.

The tokenization in your example is just a spaCy convention, where a whitespace following another whitespace is represented as a whitespace token; it is not a universal convention.
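
To see the ambiguity concretely, here is a minimal sketch (plain Python, independent of Rubrix) of a naive left-to-right aligner. A plain string search matches the whitespace token at position 14, while the spaCy convention described above produced it for the space at position 15, so the validator cannot know which space the token stands for:

text = "every four (4)  "
tokens = ['every', 'four', '(', '4', ')', ' ']

# Naive alignment: search for each token left to right and record its offset.
offset = 0
for token in tokens:
    start = text.find(token, offset)
    print(repr(token), "found at", start)
    offset = start + len(token)

# The ' ' token is found at 14, but spaCy produced it for the space at 15;
# a span ending at 16 therefore cannot be reliably aligned with the tokens.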

The following example, however, works perfectly.

from rubrix.server.tasks.token_classification.api.model import CreationTokenClassificationRecord
from spacy import load

nlp = load("en_core_web_sm")
text = "every four  (4)"
doc = nlp(text)

record = {
    "text": text,
    "tokens": list(map(str, doc)),  # ['every', 'four', ' ', '(', '4', ')']
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": len(text), "label": "mock"}]},
}

CreationTokenClassificationRecord.parse_obj(record)

Note that the token list also contains a whitespace token, but the span is defined over actual text content.

I think it is good practice not to define spans that start or end in a whitespace (at the character level), even if such whitespace is part of the tokenization, because it does not carry relevant information.

What is the informational difference between the mention spans "Isabel D." and "Isabel D. "? For me, there is none. Moreover, I imagine the trend is for models and annotators to generate annotations closer to the first one.

But that is just my point of view. Maybe @dvsrepo can contribute more info to clarify this edge case.

@frascuchon (Member)

Anyway, maybe we can improve the error message by focusing on the irrelevant whitespaces in the mention.

@dcfidalgo (Contributor, Author)

@frascuchon Thanks for the clarification. The truth is I just assumed it would affect every whitespace token; I did not test it properly. You are right: annotations/predictions with trailing (double) white spaces do not make much sense, and it is good practice to avoid them. But they do exist in the wild jungle of human offset annotations and of models trained on them. From my experience with the veganuary and metrics blog posts, alignment issues can be a real obstacle for less experienced users, so anything that mitigates them would lower the entrance barrier.
My two ideas to improve on that would be:

  • Provide some tooling to correct common issues, either automatically or manually (see the sketch after this list)
  • Move the alignment validation to the client model (we already have this on the roadmap, I think)
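
As a rough illustration of the first idea, a hypothetical client-side helper (not part of the Rubrix API) could trim leading and trailing whitespace from entity spans before the record is created:

def trim_entity_spans(text, entities):
    """Hypothetical helper: shrink (start, end, label) spans so they neither
    start nor end on whitespace, dropping spans that become empty."""
    cleaned = []
    for start, end, label in entities:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            cleaned.append((start, end, label))
    return cleaned

print(trim_entity_spans("every four (4)  ", [(0, 16, "mock")]))
# [(0, 14, 'mock')]  -- the failing span from the reproduction, now aligned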

@frascuchon (Member)

frascuchon commented Jun 29, 2022

Since we cannot resolve span alignment when spaces are included as tokens and the span definition also contains spaces (without making assumptions about the text), I vote for trying to clean the provided record info, which means:

  • Remove all empty tokens from the record's token list
  • Adjust the span definition when it starts or ends with spaces (instead of raising an error)

For example, the result for your record with this approach would be:

{
  'text': 'every four  (4)',
  'tokens': ['every', 'four', '(', '4', ')'],
  'prediction': {
    'agent': 'mock',
    'score': None,
    'entities': [
      {'start': 0, 'end': 15, 'label': 'mock', 'score': 1.0}
    ]
  }
}
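
A rough sketch of what this cleaning step could look like (hypothetical; the actual fix landed in #1591 and may differ):

def clean_record(record):
    """Hypothetical sketch: drop whitespace-only tokens and trim entity
    spans that start or end on whitespace, instead of raising an error."""
    text = record["text"]
    tokens = [tok for tok in record["tokens"] if tok.strip()]
    entities = []
    for ent in record["prediction"]["entities"]:
        start, end = ent["start"], min(ent["end"], len(text))
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        entities.append({**ent, "start": start, "end": end})
    return {**record, "tokens": tokens,
            "prediction": {**record["prediction"], "entities": entities}}

Applied to the failing record from the reproduction, this drops the ' ' token and trims the span to end=14.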

Does it make sense, @dcfidalgo?
