
How does a Chinese pretrained model that uses SentencePiece tokenization handle token-level tasks (e.g., extractive reading comprehension on CMRC 2018, or NER)? #11

Closed
ewrfcas opened this issue Aug 20, 2019 · 4 comments
Labels
good first issue Good for newcomers

Comments


ewrfcas commented Aug 20, 2019

Take CMRC 2018 as an example: SentencePiece often splits the answer at odd places, which lowers the EM score.
For instance, the original sentence is "1994年被任命为XXX" ("appointed as XXX in 1994") and the gold answer is "1994年" ("the year 1994"). But because "年被" occurs very frequently, SentencePiece merges "年被" into a single piece, so the extracted span becomes "1994年被". This happens at both training and test time. How do you handle it? Many thanks!
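The mismatch described above can be reproduced with a minimal pure-Python sketch (no SentencePiece dependency; the token list and function names below are hypothetical, not part of any library API). Given the pieces produced by the tokenizer, the smallest token span that covers the answer's character span is forced to include the merged piece "年被":

```python
def char_spans(tokens):
    # Map each token to its (start, end) character span in the original text,
    # assuming the tokens concatenate back to the text exactly.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def snap_answer(tokens, ans_start, ans_end):
    # Return the smallest token span [start_tok, end_tok] covering the
    # character span [ans_start, ans_end) of the gold answer.
    spans = char_spans(tokens)
    start_tok = next(i for i, (s, e) in enumerate(spans) if e > ans_start)
    end_tok = next(i for i, (s, e) in enumerate(spans) if e >= ans_end)
    return start_tok, end_tok

# The example from this issue: "1994年被任命为XXX", gold answer "1994年" (chars 0-5),
# but the tokenizer has merged "年被" into one piece.
tokens = ["1994", "年被", "任命", "为", "XXX"]
start, end = snap_answer(tokens, 0, 5)
recovered = "".join(tokens[start:end + 1])
# recovered is "1994年被", not "1994年" -- the extra character costs an EM point.
```

Because the token boundary falls inside the answer, no choice of token span can reproduce the gold string exactly, which is why EM drops even when the model points at the right region.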


ewrfcas commented Aug 20, 2019

To add: English presumably avoids this problem because whitespace is a natural delimiter, so you can search forward and backward for spaces (represented by the "▁" underscore in SentencePiece) to align the answer. For Chinese, though, SentencePiece seems to have no easy way around this.
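The whitespace trick mentioned above can be sketched as follows (a hypothetical pure-Python illustration, not a SentencePiece API): pieces carry the "▁" meta symbol at word starts, so grouping pieces at each "▁" recovers whitespace-delimited word boundaries, which English answer spans always align to:

```python
def group_words(sp_tokens):
    # Regroup SentencePiece pieces into whitespace-delimited words:
    # a piece beginning with the meta symbol "▁" starts a new word.
    words, cur = [], []
    for tok in sp_tokens:
        if tok.startswith("▁") and cur:
            words.append("".join(cur))
            cur = []
        cur.append(tok.lstrip("▁"))
    if cur:
        words.append("".join(cur))
    return words

# Even if the year is split oddly, the "▁" markers let us recover "1994" whole.
pieces = ["▁appointed", "▁in", "▁19", "94"]
words = group_words(pieces)  # ["appointed", "in", "1994"]
```

Chinese text has no spaces, so no "▁" markers appear inside a sentence and this regrouping gives no usable boundaries, which is exactly the difficulty raised here.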

@ewrfcas ewrfcas changed the title from "How do you resolve the gap in extractive reading comprehension when tokenizing with SentencePiece?" to "How does a Chinese pretrained model that uses SentencePiece tokenization handle token-level tasks (e.g., extractive reading comprehension on CMRC 2018, or NER)?" Aug 20, 2019

ymcui commented Aug 20, 2019

Good catch.
This is exactly why the EM scores on the two reading-comprehension datasets are relatively low.
We may train a WordPiece-based model later, but there is no concrete plan yet.

@ymcui ymcui added the good first issue label Aug 20, 2019

ewrfcas commented Aug 20, 2019

If you achieved these results without any such preprocessing, the model is already very good. @ymcui


ymcui commented Aug 23, 2019

Re-open if necessary.
