[TokenClassificationRecord] Alignment issues when token is a white space #1264
Comments
cc @frascuchon
The problem you mention here is not about annotations with whitespace tokens, but about annotations STARTING or ENDING with whitespace tokens. The mechanism we use to validate the annotation is not sophisticated enough to disambiguate these spaces. In your example, it is impossible, or at least quite difficult, to determine whether the whitespace token refers to the first or second white space in the text. The tokenization in your example is just a spaCy convention, where a whitespace following another whitespace is represented as a whitespace token, but this may not be a universal convention. The following example, however, works perfectly:

```python
from rubrix.server.tasks.token_classification.api.model import CreationTokenClassificationRecord
from spacy import load

nlp = load("en_core_web_sm")
text = "every four  (4)"  # note the double space
doc = nlp(text)
record = {
    "text": text,
    "tokens": list(map(str, doc)),  # ['every', 'four', ' ', '(', '4', ')']
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": len(text), "label": "mock"}]},
}
CreationTokenClassificationRecord.parse_obj(record)
```

Note that the token list also contains a whitespace token, but the span is defined over real text information. I think it is good practice not to define spans that start or end in a whitespace (at char level), even if those whitespaces are part of the tokenization, because they do not really carry relevant information. What's the informational difference between the mention spans "Isabel D." and "Isabel D. "? For me there is none. Moreover, I imagine the trend is for models and annotators to generate annotations closer to the first one. But that's my point of view; maybe @dvsrepo can contribute more info to clarify this edge case.
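As a side note, trimming spans at char level so they never start or end on whitespace can be sketched in a few lines. The `trim_span` helper below is hypothetical (not part of Rubrix), just an illustration of the practice described above:

```python
def trim_span(text: str, start: int, end: int) -> tuple:
    """Shrink a char span so it neither starts nor ends on whitespace.

    Hypothetical helper, not part of the Rubrix API.
    """
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Isabel D.  lives here"
# A span covering "Isabel D. " (with trailing space) gets trimmed to "Isabel D."
start, end = trim_span(text, 0, 10)
print(text[start:end])
```

Applied before record validation, a step like this would silently resolve the trailing-whitespace case instead of raising an alignment error.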
Anyway, maybe we can improve the error message, putting the focus on irrelevant whitespaces in the mention.
@frascuchon Thanks for the clarification! The truth is I just assumed it would affect every white space token; I did not test it properly. You are right, annotations/predictions with trailing (double) white spaces do not make much sense, and it is good practice to avoid them. But they do exist in the wild jungle of human offset annotations and of models trained on them. From my experience with the veganuary and the metrics blog post, I see that alignment issues can be a real obstacle for less experienced users, so anything that mitigates them would lower the entry barrier.
Since we cannot resolve span alignment when whitespaces are included as tokens and the span definition also contains whitespaces, at least not without making some assumptions about the text, I vote for trying to clean the provided record info, which means:
For example, the result for your record with this approach will be:

```python
{
    'text': 'every four  (4)',
    'tokens': ['every', 'four', '(', '4', ')'],
    'prediction': {
        'agent': 'mock',
        'score': None,
        'entities': [
            {'start': 0, 'end': 15, 'label': 'mock', 'score': 1.0}
        ],
    }
}
```

Does it make sense, @dcfidalgo?
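For illustration, such a cleanup step could look roughly like the sketch below. The `clean_record` helper is hypothetical, not Rubrix code; it assumes the record layout shown above, drops whitespace-only tokens, and trims entity spans so they do not start or end on whitespace:

```python
def clean_record(record: dict) -> dict:
    """Hypothetical cleanup: drop whitespace-only tokens and trim entity spans."""
    text = record["text"]
    cleaned = dict(record)
    # Remove tokens that consist only of whitespace.
    cleaned["tokens"] = [t for t in record["tokens"] if t.strip()]
    if "prediction" in record:
        entities = []
        for ent in record["prediction"].get("entities", []):
            start, end = ent["start"], ent["end"]
            # Shrink the span until it neither starts nor ends on whitespace.
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            entities.append({**ent, "start": start, "end": end})
        cleaned["prediction"] = {**record["prediction"], "entities": entities}
    return cleaned


record = {
    "text": "every four  (4)",
    "tokens": ["every", "four", " ", "(", "4", ")"],
    "prediction": {"agent": "mock", "entities": [{"start": 0, "end": 15, "label": "mock"}]},
}
cleaned = clean_record(record)
print(cleaned["tokens"])  # the whitespace token is gone
```

In this example the span already ends on real text, so only the token list changes; a span like `"Isabel D. "` would additionally have its `end` trimmed.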
Describe the bug
Not sure if this is really a bug, but it would be nice to support cases where the tokens are white spaces. We are encountering a lot of these cases in the med7 dataset for the metrics blog post.
To Reproduce
Expected behavior
Accept entity