
How does a Chinese pretrained model that uses SentencePiece tokenization handle token-level tasks (e.g., extractive reading comprehension on CMRC 2018, or NER)? #11

Closed
ewrfcas opened this issue Aug 20, 2019 · 4 comments
Labels
good first issue Good for newcomers

Comments


ewrfcas commented Aug 20, 2019

Take CMRC 2018 as an example: SentencePiece often splits the answer at odd places, which lowers the EM score.
For instance, the original sentence is "1994年被任命为XXX" ("appointed as XXX in 1994") and the gold answer is "1994年" ("the year 1994"). But because "年被" occurs very frequently, SentencePiece merges "年被" into a single piece, so the extracted span becomes "1994年被". This happens at both training and test time. How do you handle it? Many thanks!
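The mismatch described above can be reproduced with a minimal pure-Python sketch (no SentencePiece dependency; the token list and function names below are hypothetical, not part of any library API). Given the pieces produced by the tokenizer, the smallest token span that covers the answer's character span is forced to include the merged piece "年被":

```python
def char_spans(tokens):
    # Map each token to its (start, end) character span in the original text,
    # assuming the tokens concatenate back to the text exactly.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def snap_answer(tokens, ans_start, ans_end):
    # Return the smallest token span [start_tok, end_tok] covering the
    # character span [ans_start, ans_end) of the gold answer.
    spans = char_spans(tokens)
    start_tok = next(i for i, (s, e) in enumerate(spans) if e > ans_start)
    end_tok = next(i for i, (s, e) in enumerate(spans) if e >= ans_end)
    return start_tok, end_tok

# The example from this issue: "1994年被任命为XXX", gold answer "1994年" (chars 0-5),
# but the tokenizer has merged "年被" into one piece.
tokens = ["1994", "年被", "任命", "为", "XXX"]
start, end = snap_answer(tokens, 0, 5)
recovered = "".join(tokens[start:end + 1])
# recovered is "1994年被", not "1994年" -- the extra character costs an EM point.
```

Because the token boundary falls inside the answer, no choice of token span can reproduce the gold string exactly, which is why EM drops even when the model points at the right region.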


ewrfcas commented Aug 20, 2019

To add: English presumably avoids this problem because whitespace is a natural delimiter, so you can search forward and backward for spaces (represented by the "▁" underscore in SentencePiece) to align the answer. For Chinese, though, SentencePiece seems to have no easy way around this.
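The whitespace trick mentioned above can be sketched as follows (a hypothetical pure-Python illustration, not a SentencePiece API): pieces carry the "▁" meta symbol at word starts, so grouping pieces at each "▁" recovers whitespace-delimited word boundaries, which English answer spans always align to:

```python
def group_words(sp_tokens):
    # Regroup SentencePiece pieces into whitespace-delimited words:
    # a piece beginning with the meta symbol "▁" starts a new word.
    words, cur = [], []
    for tok in sp_tokens:
        if tok.startswith("▁") and cur:
            words.append("".join(cur))
            cur = []
        cur.append(tok.lstrip("▁"))
    if cur:
        words.append("".join(cur))
    return words

# Even if the year is split oddly, the "▁" markers let us recover "1994" whole.
pieces = ["▁appointed", "▁in", "▁19", "94"]
words = group_words(pieces)  # ["appointed", "in", "1994"]
```

Chinese text has no spaces, so no "▁" markers appear inside a sentence and this regrouping gives no usable boundaries, which is exactly the difficulty raised here.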

@ewrfcas ewrfcas changed the title from "How do you resolve the gap in extractive reading comprehension when tokenizing with SentencePiece?" to "How does a Chinese pretrained model that uses SentencePiece tokenization handle token-level tasks (e.g., extractive reading comprehension on CMRC 2018, or NER)?" Aug 20, 2019

ymcui commented Aug 20, 2019

Good catch.
This is exactly why the EM scores on the two reading-comprehension datasets are relatively low.
We may train a WordPiece-based model later, but there is no concrete plan yet.

@ymcui ymcui added the good first issue label Aug 20, 2019

ewrfcas commented Aug 20, 2019

If you achieved these results without any such preprocessing, the model is already very good. @ymcui


ymcui commented Aug 23, 2019

Re-open if necessary.
