
Sequence Labeling Issue #1

Closed
qhd1996 opened this issue Jul 30, 2020 · 1 comment
qhd1996 commented Jul 30, 2020

How should words that the BERT vocab does not contain be labeled?
This situation happens in English corpora. For example, the word 'jony' is tokenized into 'jon', '##y'. If 'jony' is labeled B-PER in the original corpus, how should the corresponding sequence labels be modified for the tokens produced by BERT in your code?
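
A minimal sketch, assuming the Hugging Face transformers package (which may differ from the tokenizer setup in this repo), that reproduces this sub-word split:

```python
# Minimal sketch, assuming the `transformers` package is installed;
# the exact split depends on the vocabulary of the chosen checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("jony is a boy"))
# e.g. ['jon', '##y', 'is', 'a', 'boy'] since 'jony' is not in the WordPiece vocab
```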

@weizhepei (Owner)

In my understanding, there is no need to modify the corresponding sequence labels.
For example, given the sequence ['Jony', 'is', 'a', 'boy'] with labels ['B-PER', 'O', 'O', 'O'],
after tokenization you get ['Jon', '##y', 'is', 'a', 'boy'].

According to the BERT paper:

we use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

In this setting, the classifier effectively sees ['Jon', 'is', 'a', 'boy'] as the representative of the original sequence ['Jony', 'is', 'a', 'boy']. Since only the first sub-token of each word is used for classification, the number of classified positions equals the number of original words, so you don't need to change the original sequence labels either. Hope this helps.
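
To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers tokenizer; an illustration of the first-sub-token convention, not the exact code in this repo) of how the labels stay aligned:

```python
# Minimal sketch of the first-sub-token convention, assuming Hugging Face
# `transformers`; not the exact code used in this repository.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

words = ["Jony", "is", "a", "boy"]
labels = ["B-PER", "O", "O", "O"]

tokens, first_subtoken_mask = [], []
for word in words:
    pieces = tokenizer.tokenize(word)  # e.g. 'Jony' -> ['Jon', '##y']
    tokens.extend(pieces)
    # Only the first piece of each word feeds the token-level classifier.
    first_subtoken_mask.extend([1] + [0] * (len(pieces) - 1))

print(tokens)               # ['Jon', '##y', 'is', 'a', 'boy']
print(first_subtoken_mask)  # [1, 0, 1, 1, 1]
```

Positions where the mask is 1 line up one-to-one with the original labels, so the label sequence itself never needs to be rewritten; trailing pieces like '##y' are simply excluded from the loss.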
