
The code here is wrong: it drops the text at the most important position, corpus_id itself. #35

Closed
1148270327 opened this issue Aug 15, 2024 · 7 comments

Comments

@1148270327

At line 425 of chatpdf.py, when expanding the retrieved content with the surrounding chunks, the hit chunk itself gets dropped:
expanded_reference += self.sim_model.corpus.get(corpus_id + i + 1, '')

@shibing624
Owner

What do you mean? How should it be changed?

@1148270327
Author

1148270327 commented Aug 15, 2024

On my side I'm using a FAISS vector store and a different embedding model, so I can't pull out the exact code, but roughly the change would look like this:

if self.num_expand_context_chunk > 0:
    new_reference_results = []
    for corpus_id, hit_chunk in hit_chunk_dict.items():
        expanded_reference = self.sim_model.corpus.get(corpus_id - 1, '') + hit_chunk
        for i in range(0, self.num_expand_context_chunk + 1, 1):
            expanded_reference += self.sim_model.corpus.get(corpus_id + i, '')
        new_reference_results.append(expanded_reference)
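
For illustration, here is a minimal standalone sketch of the expansion idea that keeps the chunk at corpus_id exactly once (the toy corpus, hit dict, and values are made up for this comment, not the repository's actual data):

corpus = {0: 'A. ', 1: 'B. ', 2: 'C. ', 3: 'D. ', 4: 'E. '}   # toy {corpus_id: chunk_text}
hit_chunk_dict = {2: corpus[2]}                               # chunk returned by retrieval
num_expand_context_chunk = 1                                  # how many following chunks to append

new_reference_results = []
for corpus_id, hit_chunk in hit_chunk_dict.items():
    # one chunk before, the hit chunk itself, then the following chunk(s)
    expanded_reference = corpus.get(corpus_id - 1, '') + hit_chunk
    for i in range(1, num_expand_context_chunk + 1):
        expanded_reference += corpus.get(corpus_id + i, '')
    new_reference_results.append(expanded_reference)

print(new_reference_results)  # ['B. C. D. ']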

@1148270327
Author

Also, when the document is short, this mechanism makes the prompt highly repetitive. For example, when pulling the surrounding context for corpus_id=3, the neighbouring chunks pulled in may already be in the list of relevant chunks, so the reference text assembled into the prompt ends up containing many duplicated chunks.
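
One way to avoid that, sketched below under the same toy-corpus assumption (the function name expand_without_duplicates is hypothetical, not from the project), is to track which corpus_ids have already been emitted:

def expand_without_duplicates(corpus, hit_ids, num_expand=1):
    # Expand each hit with its neighbours, but skip corpus_ids that were
    # already emitted, so a short document never repeats a chunk in the prompt.
    seen = set()
    results = []
    for corpus_id in hit_ids:
        parts = []
        for cid in range(corpus_id - 1, corpus_id + num_expand + 1):
            if cid in corpus and cid not in seen:
                seen.add(cid)
                parts.append(corpus[cid])
        if parts:
            results.append(''.join(parts))
    return results

corpus = {0: 'A. ', 1: 'B. ', 2: 'C. ', 3: 'D. '}
print(expand_without_duplicates(corpus, [1, 2]))  # ['A. B. C. ', 'D. ']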

@shibing624
Owner

OK, I'll fix it.

@qiufb

qiufb commented Sep 4, 2024

Awesome!

@1148270327
Author

[screenshot of the text-splitting code in chatpdf.py]
The code at this spot also has problems, and I suggest refactoring the splitting logic. Chinese and English punctuation are hard to separate cleanly. Another issue: the condition highlighted in the screenshot is wrong, since not current_chunk is always true there. Also, sentences that come out longer than chunk_size are still returned as-is, which breaks the downstream embedding step because the input exceeds the model's maximum length.
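
As a rough illustration of that refactor, here is a minimal sketch of a splitter that handles both Chinese and English sentence punctuation and never returns a piece longer than chunk_size (written from scratch for this comment, not the repository's implementation):

import re

def split_into_chunks(text, chunk_size=250):
    # Split after Chinese or English sentence-ending punctuation.
    sentences = [s for s in re.split(r'(?<=[。!?!?.])', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        # Hard-split any single sentence that already exceeds chunk_size,
        # so nothing longer than chunk_size is ever returned.
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ''
        current += sentence
    if current:
        chunks.append(current)
    return chunks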

@shibing624
Copy link
Owner

fixed.
