Tokenizer issue with special tokens such as [SEP] and [CLS] #11

haven-jeon · 2019-12-17T09:07:08Z

> tokenizer('[CLS] 감사합니다. [SEP]')
['▁[', 'C', 'LS', ']', '▁감사', '합니다', '.', '▁[', 'S', 'E', 'P', ']']

For now, this has to be worked around as follows:

> ['[CLS]', ] + tokenizer('감사합니다. ') + ['[SEP]', ]
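The workaround above can be wrapped in a small helper so that the special tokens never pass through the subword tokenizer. This is only a minimal sketch: `with_special_tokens` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a whitespace splitter standing in for the real tokenizer.

```python
from typing import Callable, List


def with_special_tokens(
    tokenize: Callable[[str], List[str]],
    text: str,
    cls_token: str = "[CLS]",
    sep_token: str = "[SEP]",
) -> List[str]:
    """Tokenize only the plain text, then attach the special tokens as whole pieces."""
    return [cls_token] + tokenize(text) + [sep_token]


def toy_tokenize(text: str) -> List[str]:
    # Stand-in for the real subword tokenizer (assumption: any callable str -> list of str).
    return text.split()


tokens = with_special_tokens(toy_tokenize, "감사합니다.")
```

Because the special tokens are added after tokenization, they cannot be split into pieces such as '▁[', 'C', 'LS', ']'.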

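The proper fix is to edit the existing tokenizer model through its Google protobuf definition, as described in the issues below, and then re-register the modified model: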

google/sentencepiece#426
google/sentencepiece#306
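A rough sketch of what editing the model protobuf could look like is given below. It assumes the `sentencepiece` Python package (with `protobuf` installed) exposes `sentencepiece_model_pb2`, and that the special tokens already exist as pieces in the model. Neither assumption is confirmed by this thread, and this is not necessarily the exact change made in this repository. `mark_user_defined` and `rewrite_model_file` are hypothetical helper names.

```python
from typing import Iterable, Set


def mark_user_defined(pieces: Iterable, special_tokens: Set[str], user_defined_type: int) -> Set[str]:
    """Set the type of every piece whose text is a special token; return the tokens found."""
    found = set()
    for p in pieces:
        if p.piece in special_tokens:
            p.type = user_defined_type
            found.add(p.piece)
    return found


def rewrite_model_file(src_path: str, dst_path: str, special_tokens=("[CLS]", "[SEP]")) -> Set[str]:
    # Imported lazily so mark_user_defined can be used without sentencepiece installed.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(src_path, "rb") as f:
        proto.ParseFromString(f.read())

    user_defined = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    found = mark_user_defined(proto.pieces, set(special_tokens), user_defined)

    # Write the modified model to a new file; the original is left untouched.
    with open(dst_path, "wb") as f:
        f.write(proto.SerializeToString())
    return found
```

The returned set shows which special tokens were actually present as pieces. Any token missing from it would need to be added to the model in some other way.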

robinsongh381 · 2019-12-18T08:03:00Z

What should I do if I need the embedding value for the [CLS] token?

haven-jeon · 2019-12-18T13:00:33Z

As you can see from the paper, the [CLS] token is already in the vocab and has been trained as a token embedding inside the network. So you can simply load the model and extract the embedding value.
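Conceptually, extracting that value is a row lookup in the token embedding table. The sketch below uses plain Python containers as stand-ins. `vocab` and `embedding_table` are hypothetical placeholders for the loaded vocabulary and the model's token embedding weights, not the actual API of this repository.

```python
from typing import Dict, List, Sequence


def token_embedding(
    token: str,
    vocab: Dict[str, int],
    embedding_table: Sequence[Sequence[float]],
) -> List[float]:
    """Return the embedding row stored for `token`."""
    idx = vocab[token]  # raises KeyError if the token is not in the vocab
    return list(embedding_table[idx])


# Toy data for illustration only: a 3-token vocab with 2-dimensional embeddings.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
embedding_table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
cls_vector = token_embedding("[CLS]", vocab, embedding_table)
```

Note that this gives the static input embedding of the token, not the contextual output at the [CLS] position after a forward pass.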

For reference, this question does not seem to be related to this issue...


haven-jeon · 2019-12-19T07:47:44Z

not fixed.

In [4]: tokenizer('[CLS] 감사합니다. [SEP]')
Out[4]: ['▁', '[CLS]', '▁감사', '합니다', '.', '▁', '[SEP]']
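With this output the special tokens stay whole, but a standalone '▁' piece appears before each of them. If those stray pieces are the unwanted part (an assumption, since the thread does not say so), one possible post-processing step is to drop a lone '▁' that directly precedes a special token. `drop_marker_before_specials` is a hypothetical helper, not part of the tokenizer.

```python
from typing import List, Sequence


def drop_marker_before_specials(
    tokens: Sequence[str],
    specials: Sequence[str] = ("[CLS]", "[SEP]"),
    marker: str = "▁",
) -> List[str]:
    """Remove a standalone whitespace marker that directly precedes a special token."""
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == marker and nxt in specials:
            continue  # skip the lone marker; the special token itself is kept
        out.append(tok)
    return out


cleaned = drop_marker_before_specials(
    ["▁", "[CLS]", "▁감사", "합니다", ".", "▁", "[SEP]"]
)
```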

haven-jeon · 2020-01-22T10:45:25Z

Closing this issue for the time being, since there has been no follow-up.

haven-jeon added a commit that referenced this issue Dec 19, 2019

Update tokenizer

6130046

#11

haven-jeon added a commit that referenced this issue Dec 19, 2019

update chksum

2cb8198

#11

haven-jeon closed this as completed Dec 19, 2019

haven-jeon reopened this Dec 19, 2019

haven-jeon closed this as completed Jan 22, 2020

haven-jeon mentioned this issue Oct 29, 2020

About the vocab SKT-AI/KoGPT2#27

Closed
