Implementation logic problem with ConfusionCorrector in the new macbert4csc version #494

Closed
yongzhuo opened this issue May 16, 2024 · 2 comments
Labels
question Further information is requested

Comments

@yongzhuo

Describe the Question

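There is a problem with how ConfusionCorrector is implemented in the new macbert4csc version. It iterates over every entry in the suspected-error (confusion) dictionary and runs a separate `re` regex search for each one, so it becomes very slow when the confusion dictionary is large. I suggest switching to a prefix tree (trie) or a similar structure.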

def correct(self, sentence: str):
        """
        Correct text based on the confusion set
        :param sentence: str, the text to correct
        :return: dict, {'source': 'src', 'target': 'trg', 'errors': [(error_word, correct_word, position), ...]}
        """
        corrected_sentence = sentence
        details = []
        # Apply the custom confusion set as the suspected-error dictionary
        for err, truth in self.custom_confusion.items():
            for i in re.finditer(err, sentence):
                start, end = i.span()
                corrected_sentence = corrected_sentence[:start] + truth + corrected_sentence[end:]
                details.append((err, truth, start))
        return {'source': sentence, 'target': corrected_sentence, 'errors': details}

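In my tests, with a confusion dictionary of 10,000 entries, ConfusionCorrector takes 200-300 ms per sentence, whereas macbert4csc inference on a single sentence takes only a few milliseconds to a few tens of milliseconds.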
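
For illustration, the prefix-tree idea could look roughly like the sketch below. It is not code from the repository: `build_trie`, `correct_with_trie`, and the `_END` sentinel are hypothetical names chosen for this example. It scans the sentence once and takes the longest confusion entry at each position, so the cost depends on the sentence length and the longest entry rather than on the number of entries. Note that this leftmost-longest, non-overlapping policy is an assumption and differs from the per-entry regex loop, which applies every entry independently.

```python
from typing import Dict, List, Tuple

# Hypothetical sentinel key; it is longer than one character, so it can
# never collide with a single-character edge label in the trie.
_END = "__end__"


def build_trie(confusion: Dict[str, str]) -> dict:
    """Build a nested-dict prefix tree from {wrong: right} pairs."""
    root: dict = {}
    for err, truth in confusion.items():
        if not err:
            continue  # skip empty keys, they would match everywhere
        node = root
        for ch in err:
            node = node.setdefault(ch, {})
        node[_END] = (err, truth)
    return root


def correct_with_trie(sentence: str, trie: dict) -> dict:
    """Scan once, replacing the longest confusion entry found at each position."""
    pieces: List[str] = []
    details: List[Tuple[str, str, int]] = []
    i = 0
    while i < len(sentence):
        node = trie
        match = None
        j = i
        # Walk the trie as far as the sentence allows, remembering the
        # longest complete entry seen on the way.
        while j < len(sentence) and sentence[j] in node:
            node = node[sentence[j]]
            j += 1
            if _END in node:
                match = node[_END]
        if match is None:
            pieces.append(sentence[i])
            i += 1
        else:
            err, truth = match
            pieces.append(truth)
            details.append((err, truth, i))  # position in the original sentence
            i += len(err)
    return {"source": sentence, "target": "".join(pieces), "errors": details}
```

The trie only needs to be built once (for example when the confusion dictionary is loaded) and can then be reused for every sentence, e.g. `correct_with_trie("abcdab", build_trie({"ab": "X", "abc": "YY", "d": "Z"}))`.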

@yongzhuo yongzhuo added the question Further information is requested label May 16, 2024
@shibing624
Owner

fixed, use ahocorasick
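
For readers unfamiliar with the technique named in this reply, here is a minimal pure-Python sketch of Aho-Corasick multi-pattern matching applied to a confusion dictionary. It is not the maintainer's fix and does not use any third-party package; the class `AhoCorasick`, the function `correct_with_automaton`, and the leftmost-longest overlap policy are all assumptions made for this example.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


class AhoCorasick:
    """Minimal Aho-Corasick automaton over a fixed set of patterns."""

    def __init__(self, patterns: List[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # per-state transitions
        self.fail: List[int] = [0]              # failure links
        self.out: List[List[str]] = [[]]        # patterns ending at each state
        for pat in patterns:
            if pat:
                self._add(pat)
        self._build_failure_links()

    def _add(self, pat: str) -> None:
        state = 0
        for ch in pat:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(pat)

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def finditer(self, text: str) -> Iterator[Tuple[int, str]]:
        """Yield (start, pattern) for every occurrence, ordered by end position."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                yield i - len(pat) + 1, pat


def correct_with_automaton(sentence: str, confusion: Dict[str, str],
                           automaton: AhoCorasick) -> dict:
    """Apply non-overlapping replacements, leftmost first, longest on ties."""
    matches = sorted(automaton.finditer(sentence), key=lambda m: (m[0], -len(m[1])))
    pieces: List[str] = []
    details: List[Tuple[str, str, int]] = []
    cursor = 0
    for start, err in matches:
        if start < cursor:
            continue  # overlaps a replacement that was already applied
        pieces.append(sentence[cursor:start])
        pieces.append(confusion[err])
        details.append((err, confusion[err], start))
        cursor = start + len(err)
    pieces.append(sentence[cursor:])
    return {"source": sentence, "target": "".join(pieces), "errors": details}
```

The automaton is built once from the dictionary keys, e.g. `AhoCorasick(list(confusion))`, and each sentence is then scanned in a single pass regardless of how many entries the dictionary holds.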

@yongzhuo
Author

get
