
Sequence Labeling Issue #1

Closed
qhd1996 opened this issue Jul 30, 2020 · 1 comment
qhd1996 commented Jul 30, 2020

How should words that the BERT vocab does not contain be labeled?
This situation happens in English corpora. For example, the word 'jony' is tokenized into 'jon', '##y'. If 'jony' is labeled B-PER in the original corpus, how should the corresponding sequence labels be modified for the tokens produced by BERT in your code?
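
A minimal sketch, assuming the Hugging Face transformers package (which may differ from the tokenizer setup in this repo), that reproduces this sub-word split:

```python
# Minimal sketch, assuming the `transformers` package is installed;
# the exact split depends on the vocabulary of the chosen checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("jony is a boy"))
# e.g. ['jon', '##y', 'is', 'a', 'boy'] since 'jony' is not in the WordPiece vocab
```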

@weizhepei (Owner)

In my understanding, there is no need to modify the corresponding sequence labels.
For example, given the sequence ['Jony', 'is', 'a', 'boy'] with labels ['B-PER', 'O', 'O', 'O'],
after tokenization you get ['Jon', '##y', 'is', 'a', 'boy'].

According to the BERT paper:

we use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

In this setting, the classifier effectively sees ['Jon', 'is', 'a', 'boy'] as the representative of the original sequence ['Jony', 'is', 'a', 'boy']. Since only the first sub-token of each word is used for classification, the number of classified positions equals the number of original words, so you don't need to change the original sequence labels either. Hope this helps.
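
To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers tokenizer; an illustration of the first-sub-token convention, not the exact code in this repo) of how the labels stay aligned:

```python
# Minimal sketch of the first-sub-token convention, assuming Hugging Face
# `transformers`; not the exact code used in this repository.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

words = ["Jony", "is", "a", "boy"]
labels = ["B-PER", "O", "O", "O"]

tokens, first_subtoken_mask = [], []
for word in words:
    pieces = tokenizer.tokenize(word)  # e.g. 'Jony' -> ['Jon', '##y']
    tokens.extend(pieces)
    # Only the first piece of each word feeds the token-level classifier.
    first_subtoken_mask.extend([1] + [0] * (len(pieces) - 1))

print(tokens)               # ['Jon', '##y', 'is', 'a', 'boy']
print(first_subtoken_mask)  # [1, 0, 1, 1, 1]
```

Positions where the mask is 1 line up one-to-one with the original labels, so the label sequence itself never needs to be rewritten; trailing pieces like '##y' are simply excluded from the loss.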
