Assertion error with CONLL03 #34

Closed
Riroaki opened this issue Dec 9, 2020 · 12 comments

Comments

@Riroaki

Riroaki commented Dec 9, 2020

Hi, I ran into another problem when using LUKE on the CoNLL-2003 NER dataset...
When creating features from examples, the variable entity_labels is empty for some examples, such as train-945:

guid=train-945
words=['SOCCER', '-', 'ENGLISH', 'SOCCER', 'RESULTS', '.', 'LONDON', '1996-08-30', 'Results', 'of', 'English', 'league', 'matches', 'on', 'Friday', ':', 'Division', 'two', 'Plymouth', '2', 'Preston', '1', 'Division', 'three', 'Swansea', '1', 'Lincoln', '2']
labels=['O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O']

and the code here throws an AssertionError:

assert not entity_labels

Do you have any idea what's wrong with these examples?

@ikuyamada
Member

Hi @Riroaki,
The error may be due to the difference in the format of the input dataset.
The code currently supports the CoNLL-2003 dataset in IOB1 format, consisting of the eng.train, eng.testa, and eng.testb files.
These files can be found online, e.g., on the following repository:
https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003

@Riroaki
Author

Riroaki commented Dec 10, 2020

Thanks! It works fine with the dataset in your link.
BTW, I think the format of my dataset is IOB2 right? I'm new to NER lol...

@ikuyamada
Member

It's great to hear that👍
Also, your dataset does look like it is in IOB2 format.
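
For readers who are unsure which scheme a file uses, here is a minimal sketch of a check. It is not part of LUKE; the function name `guess_iob_scheme` is hypothetical. It relies only on the definitions of the two schemes: IOB2 puts B- on the first token of every entity, while IOB1 uses B- only to separate two adjacent entities of the same type.

```python
def guess_iob_scheme(tags):
    """Guess whether a tag sequence is IOB1 or IOB2.

    Returns "IOB1", "IOB2", or "ambiguous" when the sequence is
    consistent with both schemes (or with neither).
    """
    could_be_iob1 = True
    could_be_iob2 = True
    prev = "O"
    for tag in tags:
        if tag != "O":
            prefix, _, entity_type = tag.partition("-")
            prev_type = prev.partition("-")[2] if prev != "O" else None
            # True when this token does not continue or directly follow
            # an entity of the same type.
            fresh_start = prev_type != entity_type
            if prefix == "I" and fresh_start:
                # IOB2 requires B- at every entity start.
                could_be_iob2 = False
            if prefix == "B" and fresh_start:
                # IOB1 uses B- only between adjacent same-type entities.
                could_be_iob1 = False
        prev = tag
    if could_be_iob1 and not could_be_iob2:
        return "IOB1"
    if could_be_iob2 and not could_be_iob1:
        return "IOB2"
    return "ambiguous"


# First tokens of the train-945 labels from this issue.
print(guess_iob_scheme(["O", "O", "B-MISC", "O", "O", "O", "B-LOC", "O"]))
```

A sequence with no entities, or one where every entity directly follows another of the same type, fits both schemes, so it helps to run the check over a whole file rather than a single sentence.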

@Riroaki
Author

Riroaki commented Dec 11, 2020

Thanks, and do you know any handy script that could easily convert IOB2 format to IOB1?

@Riroaki
Author

Riroaki commented Dec 11, 2020

Another error occurs during validation:
TypeError: Found input variables without list of list.
  in seqeval/metrics/v1.py, line 97, in check_consistent_length
    raise TypeError('Found input variables without list of list.')
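
The message suggests that this seqeval version expects the true and predicted labels as a list of per-sentence tag lists rather than one flat list of tags. The sketch below only illustrates that shape difference. The helper `as_list_of_lists` is hypothetical, it does not call seqeval, and it is not the maintainers' fix.

```python
def as_list_of_lists(tags):
    """Wrap a flat list of tag strings into a one-sentence nested list.

    Input that is already a list of lists is returned unchanged.
    """
    if all(isinstance(item, list) for item in tags):
        return tags
    if all(isinstance(item, str) for item in tags):
        return [list(tags)]
    raise TypeError("expected a flat list of tags or a list of tag lists")


flat = ["O", "I-PER", "I-PER", "O"]
nested = as_list_of_lists(flat)  # one "sentence" holding all four tags
```

Wrapping a whole corpus as a single sentence is a simplification: entities that touch across a sentence boundary would be merged, so keeping the real sentence boundaries is preferable when they are available.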

@ikuyamada
Member

ikuyamada commented Dec 11, 2020

Since I used only the IOB1 format in my experiments, I do not know of such a script.

> Here another error in validation occurs:
> TypeError: Found input variables without list of list.
> in seqeval/metrics/v1.py, line 97, in check_consistent_length raise TypeError('Found input variables without list of list.')

Did you use the CoNLL dataset in IOB1 format?

@Riroaki
Author

Riroaki commented Dec 11, 2020

Yes, and the training went smoothly, with the weights dumped into a bin file...
I checked the output and found that dev_prediction.txt has the same 51362 lines, with labels, so validation appears to be fine and the error likely occurs during testing.

@ikuyamada
Member

Hi,
It seems that the error happens with recent versions of the seqeval library.
I will investigate it later, but it should go away if you use seqeval==0.0.12, which is also the version specified in the poetry.lock file.
https://github.com/studio-ousia/luke/blob/master/poetry.lock#L792
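
To confirm which seqeval version an environment actually has, something like the following sketch can help. It uses the standard-library `importlib.metadata` (Python 3.8+). The helper name `check_pin` is hypothetical, and the pinned value 0.0.12 is taken from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, pinned):
    """Return (ok, installed); ok is True only if `package` is installed at exactly `pinned`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        installed = None  # package is not installed at all
    return installed == pinned, installed


ok, installed = check_pin("seqeval", "0.0.12")
if not ok:
    print(f"seqeval {installed} found; the lock file pins 0.0.12")
```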

@Riroaki
Author

Riroaki commented Dec 11, 2020

Thanks for being so nice and helpful!
I noticed the error occurred in seqeval library, and I should have mentioned it in my descriptions lol

@Riroaki Riroaki closed this as completed Dec 11, 2020
@Riroaki
Author

Riroaki commented Dec 11, 2020

Oh well, I would like to mention that I installed a different version of the seqeval library only because I couldn't install the dependencies the normal way with poetry install (downloads time out because of the GFW in China lol...).

Also, I found that the transformers version required in the lock file is 2.11.0, which may cause bugs when tokenizing empty strings with the RoBERTa tokenizer (huggingface/transformers#3809; it actually happened to me when doing RE on the TACRED dataset). I updated it to 3.0.0, which solved the problem. I wonder whether the pinned version should be updated (if the code works fine in your environment, just ignore this).

@Riroaki
Author

Riroaki commented Dec 11, 2020

I found a script that converts IOB1 to IOB2 format, and I revised it so that it can also convert IOB2 to IOB1...
It's amazing to see that the two functions are perfectly symmetric: just swapping the characters B and I in one function gives the reverse converter!
The original script I found is here:
https://gist.github.com/allanj/b9bd448dc9b70d71eb7c2b6dd33fe4ef

I rewrote the functions as below:

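# Amazing!
# Convert IOB2 tags to IOB1 in place: a B- tag survives only when it
# directly follows an entity of the same type; otherwise it becomes I-.
def iob2to1(tags):
    for i, tag in enumerate(tags):
        if tag in {'O', '-X-'} or tag[0] == 'I':
            continue
        # Entity start: first token, after 'O', or after another entity type.
        if i == 0 or tags[i - 1] == 'O' or tags[i - 1][1:] != tag[1:]:
            tags[i] = 'I' + tag[1:]
    return tags


# Convert IOB1 tags to IOB2 in place: an I- tag that starts an entity
# becomes B-; I- tags that continue an entity are left unchanged.
def iob1to2(tags):
    for i, tag in enumerate(tags):
        if tag in {'O', '-X-'} or tag[0] == 'B':
            continue
        # Entity start: first token, after 'O', or after another entity type.
        if i == 0 or tags[i - 1] == 'O' or tags[i - 1][1:] != tag[1:]:
            tags[i] = 'B' + tag[1:]
    return tags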

Lol I just want to share some interesting (to me) facts, and I wonder if you have any idea about why this happens? ^_^
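
One way to see the symmetry: IOB1 and IOB2 agree on every token except the first token of an entity that does not directly follow another entity of the same type, where IOB1 writes I- and IOB2 writes B-. The sketch below folds both directions into one non-mutating function. The name `retag` and its `target` parameter are hypothetical and not part of the original gist.

```python
def retag(tags, target):
    """Return a copy of `tags` in IOB2 (target="B") or IOB1 (target="I")."""
    source = "I" if target == "B" else "B"
    out = list(tags)
    for i, tag in enumerate(out):
        if tag in {"O", "-X-"} or tag[0] != source:
            continue
        # Entity start: first token, after 'O', or after another entity type.
        if i == 0 or out[i - 1] == "O" or out[i - 1][1:] != tag[1:]:
            out[i] = target + tag[1:]
    return out


# First eight labels of train-945 from this issue (IOB2).
iob2 = ["O", "O", "B-MISC", "O", "O", "O", "B-LOC", "O"]
iob1 = retag(iob2, "I")
assert retag(iob1, "B") == iob2  # the round trip restores the input
```

Because the function works on a copy, the caller's list is left untouched, unlike the in-place versions above.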

@ikuyamada
Member

Thank you very much for reporting the issue with transformers==2.11.0!
Because we actually ran the experiments using transformers==2.4.1 and updated it after the experiments (we used the versions in the poetry_emnlp2020.lock file), it is possible that there is an issue when using 2.11.0. I will investigate it further and fix it if necessary.

Also, thank you for sharing the code to convert from IOB2 to IOB1!😍
