
The code here is wrong: it drops the text at the most important position, corpus_id itself. #35

Closed
1148270327 opened this issue Aug 15, 2024 · 7 comments

Comments

@1148270327

At line 425 of chatpdf.py, when expanding the retrieved content with the surrounding chunks, the hit chunk itself gets dropped:
expanded_reference += self.sim_model.corpus.get(corpus_id + i + 1, '')

@shibing624
Owner

What do you mean? How should it be changed?

@1148270327
Author

1148270327 commented Aug 15, 2024

On my side I'm using a FAISS vector store and a different embedding model, so I can't pull out the exact code, but roughly the change would look like this:

if self.num_expand_context_chunk > 0:
    new_reference_results = []
    for corpus_id, hit_chunk in hit_chunk_dict.items():
        expanded_reference = self.sim_model.corpus.get(corpus_id - 1, '') + hit_chunk
        for i in range(0, self.num_expand_context_chunk + 1, 1):
            expanded_reference += self.sim_model.corpus.get(corpus_id + i, '')
        new_reference_results.append(expanded_reference)
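
For illustration, here is a minimal standalone sketch of the expansion idea that keeps the chunk at corpus_id exactly once (the toy corpus, hit dict, and values are made up for this comment, not the repository's actual data):

corpus = {0: 'A. ', 1: 'B. ', 2: 'C. ', 3: 'D. ', 4: 'E. '}   # toy {corpus_id: chunk_text}
hit_chunk_dict = {2: corpus[2]}                               # chunk returned by retrieval
num_expand_context_chunk = 1                                  # how many following chunks to append

new_reference_results = []
for corpus_id, hit_chunk in hit_chunk_dict.items():
    # one chunk before, the hit chunk itself, then the following chunk(s)
    expanded_reference = corpus.get(corpus_id - 1, '') + hit_chunk
    for i in range(1, num_expand_context_chunk + 1):
        expanded_reference += corpus.get(corpus_id + i, '')
    new_reference_results.append(expanded_reference)

print(new_reference_results)  # ['B. C. D. ']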

@1148270327
Author

Also, when the document is short, this mechanism makes the prompt highly repetitive. For example, when pulling the surrounding context for corpus_id=3, the neighbouring chunks pulled in may already be in the list of relevant chunks, so the reference text assembled into the prompt ends up containing many duplicated chunks.
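
One way to avoid that, sketched below under the same toy-corpus assumption (the function name expand_without_duplicates is hypothetical, not from the project), is to track which corpus_ids have already been emitted:

def expand_without_duplicates(corpus, hit_ids, num_expand=1):
    # Expand each hit with its neighbours, but skip corpus_ids that were
    # already emitted, so a short document never repeats a chunk in the prompt.
    seen = set()
    results = []
    for corpus_id in hit_ids:
        parts = []
        for cid in range(corpus_id - 1, corpus_id + num_expand + 1):
            if cid in corpus and cid not in seen:
                seen.add(cid)
                parts.append(corpus[cid])
        if parts:
            results.append(''.join(parts))
    return results

corpus = {0: 'A. ', 1: 'B. ', 2: 'C. ', 3: 'D. '}
print(expand_without_duplicates(corpus, [1, 2]))  # ['A. B. C. ', 'D. ']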

@shibing624
Owner

OK, I'll fix it.

@qiufb

qiufb commented Sep 4, 2024

Awesome!

@1148270327
Author

[screenshot of the text-splitting code in chatpdf.py]
The code at this spot also has problems, and I suggest refactoring the splitting logic. Chinese and English punctuation are hard to separate cleanly. Another issue: the condition highlighted in the screenshot is wrong, since not current_chunk is always true there. Also, sentences that come out longer than chunk_size are still returned as-is, which breaks the downstream embedding step because the input exceeds the model's maximum length.
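
As a rough illustration of that refactor, here is a minimal sketch of a splitter that handles both Chinese and English sentence punctuation and never returns a piece longer than chunk_size (written from scratch for this comment, not the repository's implementation):

import re

def split_into_chunks(text, chunk_size=250):
    # Split after Chinese or English sentence-ending punctuation.
    sentences = [s for s in re.split(r'(?<=[。!?!?.])', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        # Hard-split any single sentence that already exceeds chunk_size,
        # so nothing longer than chunk_size is ever returned.
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ''
        current += sentence
    if current:
        chunks.append(current)
    return chunks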

@shibing624
Copy link
Owner

fixed.
