How to label words that the BERT vocab does not contain?
This situation happens with English corpora. For example, the word 'jony' is tokenized into 'jon', '##y'. If 'jony' is labeled B-PER in the original corpus, how should the corresponding sequence labels be modified for the tokens produced by BERT in your code?
In my understanding, there is no need to modify the corresponding sequence labels.
For example, take the sequence ['Jony', 'is', 'a', 'boy'] with labels ['B-PER', 'O', 'O', 'O'].
After tokenization, you will get ['Jon', '##y', 'is', 'a', 'boy'].
We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.
In this setting, ['Jon', 'is', 'a', 'boy'] serves as the representative of the original sequence ['Jony', 'is', 'a', 'boy']: each word is represented by its first sub-token, and continuation pieces such as '##y' are not classified. Since the sequence length is unchanged, you don't need to change the original sequence labels either. I hope this discussion is helpful to you.
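The first-sub-token idea described above can be sketched in a few lines. This is only an illustration, not the code from this repository: `TOY_VOCAB`, `wordpiece`, and `tokenize_with_first_index` are hypothetical names, and the tiny vocabulary is an assumption chosen so that 'Jony' splits into 'Jon' and '##y'. A real setup would use BERT's own tokenizer and vocabulary instead.

```python
from typing import List, Set, Tuple

# Hypothetical toy vocabulary, just large enough for the example sentence.
TOY_VOCAB: Set[str] = {"Jon", "##y", "is", "a", "boy", "[UNK]"}


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Continuation pieces carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize_with_first_index(
    words: List[str], vocab: Set[str]
) -> Tuple[List[str], List[int]]:
    """Tokenize each word and record where its first sub-token lands."""
    tokens: List[str] = []
    first_idx: List[int] = []
    for word in words:
        first_idx.append(len(tokens))
        tokens.extend(wordpiece(word, vocab))
    return tokens, first_idx


words = ["Jony", "is", "a", "boy"]
labels = ["B-PER", "O", "O", "O"]

tokens, first_idx = tokenize_with_first_index(words, TOY_VOCAB)

# One representative sub-token per word, so the labels line up unchanged.
representatives = [tokens[i] for i in first_idx]
for token, label in zip(representatives, labels):
    print(token, label)
```

In a model, `first_idx` would be used to pick the hidden states at those positions before the classifier, so the number of predictions equals the number of original labels.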