For example on cmrc2018, SentencePiece often splits the answer at odd boundaries, which lowers the EM score. Concretely: the original sentence is "1994年被任命为XXX" and the answer is "1994年". But because the bigram "年被" is very frequent, SentencePiece merges "年被" into a single token, so the extractable span becomes "1994年被". This happens in both training and evaluation. How is this handled? Thanks a lot!
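The mismatch described above can be illustrated with a toy greedy longest-match segmenter and a small hypothetical vocabulary containing the frequent bigram "年被" (this is a sketch of the failure mode, not the actual SentencePiece algorithm or the CMRC 2018 pipeline):

```python
# Hypothetical toy vocab; "年被" is included because it is a frequent bigram.
VOCAB = {"1994", "年被", "年", "被", "任命", "为", "X"}

def greedy_tokenize(text, vocab, max_len=4):
    """Left-to-right longest-match segmentation over `vocab`.

    Unknown single characters fall back to themselves.
    """
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if piece in vocab or l == 1:
                tokens.append(piece)
                i += l
                break
    return tokens

tokens = greedy_tokenize("1994年被任命为XXX", VOCAB)
print(tokens)  # ['1994', '年被', '任命', '为', 'X', 'X', 'X']

# Token boundaries fall at character offsets 4, 6, 8, ... — but the gold
# answer "1994年" ends at offset 5, inside the token "年被". Extracting
# whole tokens yields "1994年被", so exact match (EM) fails.
```

Because the answer's end offset never coincides with a token boundary, no choice of start/end token can reproduce the gold string exactly.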
One more note: English presumably avoids this problem because spaces are natural delimiters, so you can search for the surrounding spaces (represented as the `_` underscore in SentencePiece) to align the answer. For Chinese SentencePiece, though, this seems hard to solve.
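A common workaround for this alignment problem is to map the gold answer's character span to the smallest enclosing token span using per-token character offsets, and accept that the recovered text may be slightly wider than the gold answer. The helper below is a hypothetical sketch, not code from this repository:

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to token indices.

    offsets: list of (start, end) character offsets, one pair per token.
    Returns (token_start, token_end): the smallest token span whose
    characters cover the answer span, or (None, None) on no overlap.
    """
    token_start = token_end = None
    for idx, (s, e) in enumerate(offsets):
        if token_start is None and s <= start_char < e:
            token_start = idx
        if s < end_char <= e:
            token_end = idx
    return token_start, token_end

# Suppose "1994年被任命为XXX" is tokenized as
# ['1994', '年被', '任命', '为', 'XXX'] with these character offsets:
offsets = [(0, 4), (4, 6), (6, 8), (8, 9), (9, 12)]

# Gold answer "1994年" spans chars [0, 5); its end falls inside token 1
# ("年被"), so the covering token span is (0, 1) and the recovered text
# is "1994年被" — wider than the gold answer, which is exactly why EM drops.
print(char_span_to_token_span(offsets, 0, 5))  # (0, 1)
```

During training this gives usable start/end labels; at evaluation time the widened span still fails strict EM, which matches the behavior reported in this issue.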
Good catch. This is exactly why the EM scores on the two reading-comprehension datasets are relatively low. We may train a WordPiece-based model later, but there is no concrete plan for now.
If these results were achieved without any such preprocessing, that is already very good. @ymcui
Re-open if necessary.