Corpus incorrectly aligning spans from Flair 0.11 #3034

Guust-Franssens · 2022-12-19T08:06:44Z

Hello,

First of all, thanks for making this library publicly available!

Description of the bug
When using Flair 0.11 or higher, I noticed that in some documents the spans were not aligned with how they are represented in the data. Downgrading to Flair 0.10 or lower seemed to fix the issue.

How to Reproduce
I created an example text file in which the issue occurs. The following example and code snippet should allow you to reproduce it.

Create some fake ner data:

example_txt = "George B-NAME\n"
example_txt += "Washington I-NAME\n"
example_txt += "went O\n"
example_txt += "\t O\n"
example_txt += "Washington B-CITY\n"
example_txt += "and O\n"
example_txt += "enjoyed O\n"
example_txt += "some O\n"
example_txt += "coffee B-BEVERAGE\n"
with open("data/example.txt", "w", encoding="utf-8") as file_out:
    file_out.write(example_txt)

This creates the following file:

Load in the generated data.

from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
columns: dict = {0: "text", 1: "ner"}
corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt")

sentence: Sentence = corpus.train[0]
for span in sentence.get_spans("ner"):
    print(span)

>>>Span[0:2]: "George Washington" → NAME (1.0)
>>>Span[5:6]: "and" → CITY (1.0)

"and" incorrectly received the CITY span, and "coffee" is not listed as an entity.
I assume this is because the \t token is being matched as a column separator.
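The suspected cause can be illustrated with Python's `re` module alone. This is a minimal sketch that assumes the default delimiter behaves like the whitespace regular expression `\s+`; that assumption is not confirmed here, and the variable names are illustrative only.

```python
import re

# Fourth row of the example file, with the trailing newline removed.
line = "\t O"

# Assumed default behaviour: split on any run of whitespace.
# The tab token is swallowed by the delimiter, leaving an empty first field.
whitespace_fields = re.split(r"\s+", line)
print(whitespace_fields)  # ['', 'O']

# Explicit single-space delimiter: the tab survives as the token text.
space_fields = re.split(" ", line)
print(space_fields)  # ['\t', 'O']
```

If the loader then drops or mishandles the empty token, every label after that row would shift relative to its token, which would be consistent with the misaligned output above.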

Expected behavior
Using the exact same code snippet in Flair 0.10 or lower I get the following (which is also what I expect to have).

for span in sentence.get_spans("ner"):
    print(span)

>>>[<NAME-span (1,2): "George Washington">,
>>> <CITY-span (5): "Washington">,
>>> <BEVERAGE-span (9): "coffee">]

Environment

Flair >= 0.11

How I managed to load in my data correctly
This issue is fixable by specifying the column delimiter.

from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
columns: dict = {0: "text", 1: "ner"}
corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt", column_delimiter=" ")

sentence: Sentence = corpus.train[0]
for span in sentence.get_spans("ner"):
    print(span)

>>>Span[0:2]: "George Washington" → NAME (1.0)
>>>Span[4:5]: "Washington" → CITY (1.0)
>>>Span[8:9]: "coffee" → BEVERAGE (1.0)
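To cross-check loaded spans against the raw tag column without relying on Flair, the BIO tags can be decoded directly. The sketch below is a hypothetical helper (`bio_to_spans` is not a Flair function) and assumes a two-column file separated by a single space, as in the example above.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it.
        if label is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Stray I- tag without a preceding B-: treat it as a span start.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


example_txt = (
    "George B-NAME\nWashington I-NAME\nwent O\n\t O\n"
    "Washington B-CITY\nand O\nenjoyed O\nsome O\ncoffee B-BEVERAGE\n"
)

# Split each row on the LAST single space so a tab token is kept as text.
rows = [line.rsplit(" ", 1) for line in example_txt.split("\n") if line]
tokens = [row[0] for row in rows]
tags = [row[1] for row in rows]

spans = bio_to_spans(tags)
for start, end, label in spans:
    print(start, end, repr(" ".join(tokens[start:end])), label)
```

The decoded offsets (0 to 2, 4 to 5, 8 to 9) agree with the spans reported when `column_delimiter=" "` is set, so a mismatch between this decode and the loaded corpus is a quick signal that rows were misparsed.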

Why am I listing it as a bug if it's fixable?
I spent over a week debugging this issue. The code runs, the model "seems" to learn, loss decreases but F1 stays 0, the data seems correct. It's very difficult to spot this almost "invisible" issue that I believe others will overlook as well. Moreover, Flair <= 0.10 gracefully resolves this issue (I assume there is a failsafe check somewhere).
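Because the problem is nearly invisible when reading the file, a small pre-flight check can flag affected rows before training. This is a sketch under the same assumption as above (whitespace-regex splitting); `find_suspicious_rows` and its parameters are hypothetical names, not part of Flair.

```python
import re
from typing import List, Tuple


def find_suspicious_rows(text: str, expected_columns: int = 2) -> List[Tuple[int, str]]:
    """Return (line_number, raw_line) for rows that whitespace splitting would misparse."""
    suspicious = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank line: sentence boundary in column formats
        fields = re.split(r"\s+", line)
        # An empty field or an unexpected column count means a token was
        # itself whitespace (or the row has stray leading/trailing whitespace).
        if len(fields) != expected_columns or any(field == "" for field in fields):
            suspicious.append((number, line))
    return suspicious


example_txt = (
    "George B-NAME\nWashington I-NAME\nwent O\n\t O\n"
    "Washington B-CITY\nand O\nenjoyed O\nsome O\ncoffee B-BEVERAGE\n"
)

for number, line in find_suspicious_rows(example_txt):
    print(f"line {number}: {line!r}")
```

On the example data this reports only the fourth row, the one whose token is a tab character.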

If you have any questions, please shoot and I will get back to you quickly.

Kind regards,
Guust

helpmefindaname · 2023-01-16T13:53:39Z

Hi @Guust-Franssens, sorry for responding so late.
I just created #3052. Can you verify whether that fixes the problem?

Guust-Franssens · 2023-01-16T20:06:26Z

Hey @helpmefindaname, thanks for addressing the issue.

Unfortunately, the machine on which I train the models does not allow me to install from a specific GitHub branch, only through pip/conda, which the proxy manages.

I took a look at your code and saw that you included the example I gave here as a test case. So if the code you wrote fixes that case, it should solve my issue as well. Thanks!

Guust-Franssens added the "bug" label Dec 19, 2022

Guust-Franssens mentioned this issue Dec 22, 2022

Custom training data #3022

Closed

Guust-Franssens closed this as completed Jan 16, 2023

alanakbik added a commit that referenced this issue Jan 23, 2023

Merge pull request #3052 from flairNLP/gh-3034/fix-misaligned_spans

c777f45

fix label alignment if the sentence contains invalid tokens
