Question about spaces in the text #476

Closed
1120475708 opened this issue Feb 28, 2024 · 2 comments
Labels
question Further information is requested

Comments

@1120475708

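Looking at the training code, it seems that all spaces are removed. Does this mean that if my test set contains cases with spaces, those cases will always be judged as wrong at predict time, which would in turn hurt the model's training results?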

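I see that this MR appears to have fixed a space-related bug once before,
but in the current correction code, a case that contains spaces is not corrected at all.
As the code below shows, after the split the original spaces in the text are lost, so the corrected text and the original text end up with different lengths.
I'm not sure whether this is intentional or a bug.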

            for id, (logit_tensor, sentence) in enumerate(zip(outputs.logits, batch)):
                decode_tokens_new = self.tokenizer.decode(
                    torch.argmax(logit_tensor, dim=-1), skip_special_tokens=True).split(' ')
                decode_tokens_new = decode_tokens_new[:len(sentence)]
                if len(decode_tokens_new) == len(sentence):
                    probs = torch.max(torch.softmax(logit_tensor, dim=-1), dim=-1)[0].cpu().numpy()
                    decode_str = ''
                    for i in range(len(sentence)):
                        if probs[i + 1] >= threshold:
                            decode_str += decode_tokens_new[i]
                        else:
                            decode_str += sentence[i]
                    corrected_text = decode_str
                else:
                    corrected_text = sentence
                corrected_sents.append(corrected_text)
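
To make the length mismatch concrete, here is a minimal sketch that does not use the real tokenizer or model. `toy_decode` is a hypothetical stand-in, assuming a character-level tokenizer that drops whitespace and a `decode` that joins the remaining tokens with single spaces; `is_aligned` only mirrors the `len(decode_tokens_new) == len(sentence)` check from the snippet above.

```python
def toy_decode(text):
    """Hypothetical stand-in for tokenizer.decode (an assumption, not the real API):
    whitespace is dropped and the remaining characters are joined with spaces."""
    return ' '.join(ch for ch in text if not ch.isspace())


def is_aligned(sentence):
    """Mirror the length check from the quoted snippet."""
    decode_tokens_new = toy_decode(sentence).split(' ')
    decode_tokens_new = decode_tokens_new[:len(sentence)]
    return len(decode_tokens_new) == len(sentence)


print(is_aligned("今天天气"))   # 4 tokens vs. 4 characters
print(is_aligned("今天 天气"))  # 4 tokens vs. 5 characters (the space counts)
```

Under these assumptions, a sentence containing a space fails the length check, so the `else` branch returns it unchanged.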

#192

1120475708 added the question label Feb 28, 2024
@shibing624
Owner

shibing624 commented Feb 28, 2024

Spaces are filtered out at prediction time; they are skipped and not corrected.
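
One way to read "filtered out and skipped" is sketched below. This is only an illustration under assumptions, not the project's actual code: `correct_skipping_spaces` and `correct_fn` are hypothetical names, and `correct_fn` stands for any corrector that returns a string of the same length as its space-free input.

```python
def correct_skipping_spaces(sentence, correct_fn):
    """Correct only the non-space characters and keep spaces where they were.

    correct_fn is a hypothetical callable: space-free text in, corrected text out.
    """
    kept = [ch for ch in sentence if not ch.isspace()]
    corrected = correct_fn(''.join(kept))
    # If the corrector changed the length, alignment is lost; return the input as-is.
    if len(corrected) != len(kept):
        return sentence
    corrected_chars = iter(corrected)
    # Walk the original sentence: spaces pass through, other characters are replaced.
    return ''.join(ch if ch.isspace() else next(corrected_chars) for ch in sentence)


# Toy corrector used only for this example.
toy_corrector = lambda s: s.replace('因该', '应该')
print(correct_skipping_spaces('我因该 去', toy_corrector))
```

With this approach the output always has the same length as the input, and each space stays at its original position.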

@1120475708
Author

So that means my training data can in fact contain cases with spaces, right?

Thanks for your answer!
