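Looking at the training code, it seems that all spaces are removed. Does this mean that if my test set contains cases with spaces, those cases will always be judged as wrong when the model runs predict, and so affect the model's training results?
I see this MR appears to have fixed a space-related bug once, but in the current correction code a case that contains spaces is not corrected at all. As the code below shows, once the text has been split, the spaces in the original text disappear, so the corrected text and the original text end up with different lengths. I'm not sure whether this is intentional or a bug.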
    for id, (logit_tensor, sentence) in enumerate(zip(outputs.logits, batch)):
        decode_tokens_new = self.tokenizer.decode(
            torch.argmax(logit_tensor, dim=-1), skip_special_tokens=True).split(' ')
        decode_tokens_new = decode_tokens_new[:len(sentence)]
        if len(decode_tokens_new) == len(sentence):
            probs = torch.max(torch.softmax(logit_tensor, dim=-1), dim=-1)[0].cpu().numpy()
            decode_str = ''
            for i in range(len(sentence)):
                if probs[i + 1] >= threshold:
                    decode_str += decode_tokens_new[i]
                else:
                    decode_str += sentence[i]
            corrected_text = decode_str
        else:
            corrected_text = sentence
        corrected_sents.append(corrected_text)
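To make the length mismatch concrete, here is a minimal sketch that reproduces only the length check from the snippet above, without loading any model. `simulate_decode` is a hypothetical stand-in for `self.tokenizer.decode(...)`: it assumes a character-level tokenizer that drops whitespace and joins the remaining tokens with single spaces, which is my reading of the behaviour described here rather than something verified against the library.

```python
def simulate_decode(sentence: str) -> str:
    """Hypothetical stand-in for tokenizer.decode(): one token per
    non-whitespace character, joined by single spaces (assumption)."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    return " ".join(tokens)


def passes_length_check(sentence: str) -> bool:
    """Mirror the length comparison used in the correction loop."""
    decode_tokens_new = simulate_decode(sentence).split(" ")
    decode_tokens_new = decode_tokens_new[:len(sentence)]
    return len(decode_tokens_new) == len(sentence)


if __name__ == "__main__":
    for text in ["你好世界", "你好 世界"]:
        print(repr(text), passes_length_check(text))
```

Under that assumption, a sentence with a space yields fewer decoded tokens than it has characters, so the check fails and the loop falls back to `corrected_text = sentence`.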
#192
Spaces are filtered out at prediction time; they are skipped and not corrected.
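One way to picture "filter the spaces, skip them, leave them uncorrected" is the sketch below: record where the whitespace sits, correct the space-free text, then put the whitespace back. This is only an illustration of the idea, not the project's actual implementation. `strip_spaces`, `restore_spaces`, `correct_with_spaces` and the `correct_fn` callback are all hypothetical names, and the sketch assumes the corrector returns text of the same length as its input.

```python
from typing import Callable, List, Tuple


def strip_spaces(text: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Remove whitespace and remember each (original index, character)."""
    kept: List[str] = []
    spaces: List[Tuple[int, str]] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            spaces.append((i, ch))
        else:
            kept.append(ch)
    return "".join(kept), spaces


def restore_spaces(corrected: str, spaces: List[Tuple[int, str]]) -> str:
    """Re-insert whitespace at its original indices (ascending order)."""
    chars = list(corrected)
    for idx, ch in spaces:
        chars.insert(idx, ch)
    return "".join(chars)


def correct_with_spaces(text: str, correct_fn: Callable[[str], str]) -> str:
    """Correct the non-space characters only; whitespace is left untouched.

    correct_fn is a hypothetical corrector assumed to preserve length.
    """
    stripped, spaces = strip_spaces(text)
    corrected = correct_fn(stripped)
    if len(corrected) != len(stripped):
        return text  # fall back to the input if lengths diverge
    return restore_spaces(corrected, spaces)


if __name__ == "__main__":
    # Toy corrector used purely for demonstration.
    toy = lambda s: s.replace("因该", "应该")
    print(correct_with_spaces("我 因该 去", toy))
```

Because the whitespace is removed before correction and restored afterwards, the output keeps the same length and spacing as the input, while the characters between the spaces can still be corrected.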
So in other words, my training data can in fact contain cases with spaces, is that right?
Thank you for your answer!