Training-set data appears in the validation and test sets #26

Closed
JamyDon opened this issue Nov 8, 2023 · 3 comments
Assignees
Labels
corpus issue Done The issue is fixed

JamyDon commented Nov 8, 2023

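Hello, I ran some counts. In the 2,000-sentence validation set, 37 sentences have an original erroneous sentence (before correction) that appears in the training set, and 170 sentences have a corrected answer that appears in the training set. In the 3,000-sentence test set, 48 sentences have an original erroneous sentence that appears in the training set (the test-set answers are not public, so the number of sentences whose corrected answer appears in the training set is unknown). This overlap may make few-shot evaluation results inaccurate.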

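Could you provide a filter set containing every sentence that should be removed from the training set because it appears in the validation or test set (including homologous sentences, i.e. sentences derived from the same source), so that a cleaner training set can be obtained? Thank you very much!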
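For anyone who wants to reproduce counts like these, here is a minimal sketch of an exact-match overlap check. It is not the script used for the numbers above. It assumes each example has already been loaded as a `(source, corrected)` pair of strings, and that "appears in the training set" means matching any training sentence, erroneous or corrected. The names `Pair` and `count_overlap` are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed representation: (erroneous source sentence, corrected sentence).
Pair = Tuple[str, str]


def count_overlap(train: Iterable[Pair], heldout: Iterable[Pair]) -> Tuple[int, int]:
    """Count held-out examples whose source or corrected sentence
    occurs verbatim anywhere in the training set."""
    # Pool every training sentence (both sides) for O(1) membership tests.
    train_sentences = set()
    for src, tgt in train:
        train_sentences.add(src)
        train_sentences.add(tgt)

    src_hits = 0  # held-out sources seen in training
    tgt_hits = 0  # held-out corrections seen in training
    for src, tgt in heldout:
        if src in train_sentences:
            src_hits += 1
        if tgt in train_sentences:
            tgt_hits += 1
    return src_hits, tgt_hits
```

For the test set, where the answers are unknown, only the first number is meaningful. Exact matching will also miss homologous sentences that differ by a few characters.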

Owner

xlxwalex commented Nov 8, 2023

Hello,

Thanks for the feedback. I will look into this today and reply again once I have confirmed and fixed it!

xlxwalex added a commit that referenced this issue Nov 8, 2023
Deduplicate homologous sentences
xlxwalex added a commit that referenced this issue Nov 8, 2023
xlxwalex self-assigned this Nov 8, 2023
xlxwalex added the Done (The issue is fixed) label Nov 8, 2023
Owner

xlxwalex commented Nov 8, 2023

Hello, FCGEC_train_filtered.json is the updated training set, with similar/homologous sentences filtered out. Thanks again for your feedback!
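The thread does not describe how the filtering was done. As an illustration only, a similarity-based filter could look like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The function name `filter_train`, the `(source, corrected)` pair layout, and the `0.9` threshold are all assumptions, not details of the actual FCGEC script.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Sequence, Tuple

# Assumed representation: (erroneous source sentence, corrected sentence).
Pair = Tuple[str, str]


def filter_train(
    train: Iterable[Pair],
    heldout_sentences: Sequence[str],
    threshold: float = 0.9,  # hypothetical cutoff, tune on real data
) -> List[Pair]:
    """Keep only training pairs whose sentences are not near-duplicates
    of any held-out (validation/test) sentence."""
    kept = []
    for src, tgt in train:
        # A pair leaks if either side is similar enough to a held-out sentence.
        leaked = any(
            SequenceMatcher(None, cand, ref).ratio() >= threshold
            for cand in (src, tgt)
            for ref in heldout_sentences
        )
        if not leaked:
            kept.append((src, tgt))
    return kept
```

This comparison is quadratic in the number of sentences, so on a full corpus it would likely need a cheaper pre-filter (for example, exact-match or n-gram lookup) before the pairwise similarity pass.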

Author

JamyDon commented Nov 9, 2023

Thank you very much!

JamyDon closed this as completed Nov 9, 2023

2 participants