Corpus incorrectly aligning spans from Flair 0.11 #3034

Closed
Guust-Franssens opened this issue Dec 19, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@Guust-Franssens

Guust-Franssens commented Dec 19, 2022

Hello,

First of all, thanks for making this library publicly available!

Description of the bug
When using Flair 0.11 or higher, I noticed that in some documents the spans are not aligned with how they are annotated in the data. Downgrading to Flair 0.10 or lower seemed to fix the issue.

How to Reproduce
I created an example txt file that exhibits the issue. The following example and code snippet should allow you to reproduce it.

Create some fake NER data:

example_txt = "George B-NAME\n"
example_txt += "Washington I-NAME\n"
example_txt += "went O\n"
example_txt += "\t O\n"
example_txt += "Washington B-CITY\n"
example_txt += "and O\n"
example_txt += "enjoyed O\n"
example_txt += "some O\n"
example_txt += "coffee B-BEVERAGE\n"
# Write into the folder that ColumnCorpus reads from below (data_folder="data").
with open("data/example.txt", "w", encoding="utf-8") as file_out:
    file_out.write(example_txt)

This creates the following file: [screenshot of example.txt]

Load in the generated data.

from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
columns: dict = {0: "text", 1: "ner"}
corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt")

sentence: Sentence = corpus.train[0]
for span in sentence.get_spans("ner"):
    print(span)

>>>Span[0:2]: "George Washington"NAME (1.0)
>>>Span[5:6]: "and"CITY (1.0)

"and" incorrectly received the CITY label, and "coffee" is not listed as an entity at all.
I assume this is due to the \t being matched as a column separator.
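
A minimal sketch of how this kind of misalignment could arise, assuming a parser that splits each line on a whitespace regex and drops tokens whose text comes out empty while still keeping their tags. `naive_parse` is a hypothetical helper written for illustration only; it is not Flair's actual implementation, and the real cause may differ:

```python
import re

example_txt = (
    "George B-NAME\n"
    "Washington I-NAME\n"
    "went O\n"
    "\t O\n"
    "Washington B-CITY\n"
    "and O\n"
    "enjoyed O\n"
    "some O\n"
    "coffee B-BEVERAGE\n"
)


def naive_parse(text, delimiter_pattern):
    """Hypothetical column parser (NOT Flair's code), used to illustrate the shift."""
    tokens, tags = [], []
    for line in text.splitlines():
        fields = re.split(delimiter_pattern, line)
        tags.append(fields[-1])      # the tag column is always kept
        if fields[0]:                # assumption: empty token text is silently dropped
            tokens.append(fields[0])
    return tokens, tags


# Whitespace regex: "\t O" splits into ["", "O"], so the token disappears
# but its tag stays, and every later tag slides onto the previous token.
tokens, tags = naive_parse(example_txt, r"\s+")
misaligned = list(zip(tokens, tags))
print(len(tokens), len(tags))  # 8 9
print(misaligned[4])           # ('and', 'B-CITY')

# Single-space delimiter: "\t O" splits into ["\t", "O"], so nothing shifts.
tokens_ok, tags_ok = naive_parse(example_txt, " ")
aligned = list(zip(tokens_ok, tags_ok))
print(aligned[4])              # ('Washington', 'B-CITY')
```

Under these assumptions "and" ends up with the CITY tag and the BEVERAGE tag falls off the end, which matches the symptom described above.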

Expected behavior
Using the exact same code snippet in Flair 0.10 or lower, I get the following (which is also what I expect).

for span in sentence.get_spans("ner"):
    print(span)

>>>[<NAME-span (1,2): "George Washington">,
>>> <CITY-span (5): "Washington">,
>>> <BEVERAGE-span (9): "Coffee">]

Environment

  • Flair >= 0.11

How I managed to load in my data correctly
The issue can be worked around by explicitly specifying the column delimiter.

from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
columns: dict = {0: "text", 1: "ner"}
corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt", column_delimiter=" ")

sentence: Sentence = corpus.train[0]
for span in sentence.get_spans("ner"):
    print(span)

>>>Span[0:2]: "George Washington"NAME (1.0)
>>>Span[4:5]: "Washington"CITY (1.0)
>>>Span[8:9]: "coffee"BEVERAGE (1.0)

Why am I listing it as a bug if it's fixable?
I spent over a week debugging this issue. The code runs and the model "seems" to learn: the loss decreases, but F1 stays at 0 even though the data looks correct. This almost "invisible" issue is very difficult to spot, and I believe others will overlook it as well. Moreover, Flair <= 0.10 handles this case gracefully (I assume there is a failsafe check somewhere).
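
Since the problem is hard to see by eye, a quick pre-check of the raw column file can help. The following is a rough sketch, assuming the loader splits columns on a whitespace regex; `find_empty_text_lines` is a hypothetical helper and not part of Flair. It flags lines whose first column is empty after splitting, such as a token that consists only of a tab:

```python
import re


def find_empty_text_lines(lines, delimiter_pattern=r"\s+"):
    """Return (line_number, raw_line) pairs whose first column is empty after splitting."""
    suspicious = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sentences in column files
        if re.split(delimiter_pattern, line)[0] == "":
            suspicious.append((line_number, line))
    return suspicious


sample = ["George B-NAME\n", "went O\n", "\t O\n", "coffee B-BEVERAGE\n"]
print(find_empty_text_lines(sample))  # [(3, '\t O')]
```

For a real file, an open file handle can be passed in place of `sample`, since iterating over it yields the lines one by one.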

If you have any questions, please ask and I will get back to you quickly.

Kind regards,
Guust

@Guust-Franssens Guust-Franssens added the bug Something isn't working label Dec 19, 2022
@helpmefindaname
Collaborator

Hi @Guust-Franssens, sorry for responding so late.
I just created #3052. Can you verify whether that fixes the problem?

@Guust-Franssens
Author

Hey @helpmefindaname, thanks for addressing the issue.

Unfortunately, the machine on which I train the models does not allow me to install from a specific GitHub branch; I can only install through pip/conda, which the proxy manages.

I took a look at your code and saw that you included the example I gave here as a test case. If your code passes that test, it should solve my issue as well. Thanks!

alanakbik added a commit that referenced this issue Jan 23, 2023
fix label alignment if the sentence contains invalid tokens