GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748
Comments
Hello there,

It seems that the issue you are facing is related to the structure of the documents retrieved from Elasticsearch. The `split_text_with_overlaps` method is used by the `GPTSimpleVectorIndex` to split the document text into chunks of a certain size for processing with the language model. The method takes any extra tokens in the document, such as metadata, into account to ensure that the chunks stay within the desired size.

In your case, it appears that the `extra_info_str` field contains additional metadata that is not being accounted for when chunking the text, leading to a negative value for `effective_chunk_size`. This can be addressed either by fixing the structure of the documents in the `ElasticsearchReader` to remove the `extra_info_str` field, or by modifying the `split_text_with_overlaps` method to handle such cases.

To detect such cases, you could modify the `GPTSimpleVectorIndex` to check for the presence of the `extra_info_str` field and adjust the chunk size accordingly: for example, measure the token length of the `extra_info_str` field and subtract it from the chunk size to get the effective chunk size, as in the sketch below. Alternatively, you could use regular expressions to identify and remove the metadata from the text before chunking it.

However, I recommend fixing the structure of the documents in the `ElasticsearchReader` so that they conform to the expected format. This will prevent chunking issues and improve the overall performance and accuracy of the indexing process.
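The following is a minimal sketch of that adjustment, not the library's actual implementation; the function name and the whitespace tokenizer are assumptions made for illustration:

```python
from typing import Optional

def effective_chunk_size(chunk_size: int, extra_info_str: Optional[str]) -> int:
    """Return the token budget left for document text in each chunk."""
    if not extra_info_str:
        return chunk_size
    # Crude token count: a whitespace split stands in for a real tokenizer.
    num_extra_tokens = len(extra_info_str.split())
    effective = chunk_size - num_extra_tokens
    if effective <= 0:
        # Mirrors the error in the issue title: the metadata alone exceeds
        # the chunk size, leaving no room for the text itself.
        raise ValueError(
            "Effective chunk size is non positive after considering extra_info"
        )
    return effective

# A short metadata header fits comfortably within a 3900-token budget.
print(effective_chunk_size(3900, "id: 168613 groupId: 10719"))  # 3896
```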
Hi, @jan-tosovsky-cz! I'm here to help the LlamaIndex team manage their backlog, and I wanted to let you know that we are marking this issue as stale. From what I understand, you reported an issue with `GPTSimpleVectorIndex` failing because the effective chunk size becomes non-positive after considering `extra_info`, and brunoramirez12 suggested fixing the structure of the documents in the `ElasticsearchReader` to remove the `extra_info_str` field. Before we close this issue, we wanted to check with you whether it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your contribution, and we appreciate your understanding!
When trying to index documents obtained from ElasticSearch, it fails inside the `split_text_with_overlaps` method. See the full stack trace at the bottom.

I think this is caused by an incorrectly constructed `Document` object. The desired content is stored in the `text` field, but the original JSON content is flattened and stored inside the `extra_info_str` field:

```
text: my content...
extra_info_str: id: 168613\ngroupId: 10719\npublishDate: 19700101000000\nlanguage: English\ncontent: my content...
```
It is clear the latter field contains more characters than the text itself, so when the text is chunked, this equation returns a negative value:
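The snippet originally quoted here did not survive extraction; a plausible reconstruction of the failing arithmetic, with variable names assumed for illustration rather than taken verbatim from the library:

```python
# Assumed shape of the failing computation: when the tokenized
# extra_info_str is longer than the chunk size itself, the budget
# remaining for the text goes negative.
chunk_size = 3900        # token budget per chunk (illustrative value)
num_extra_tokens = 5000  # tokens in the flattened extra_info_str
effective_chunk_size = chunk_size - num_extra_tokens
print(effective_chunk_size)  # -1100, hence the "non positive" error
```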
If the `Document` structure is really incorrect, I assume it should be fixed in the `ElasticsearchReader` first, but I can imagine having some detection for such cases here as well. A possible workaround is sketched below.
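A minimal workaround sketch: it assumes the llama_index API of this era, where readers return `Document` objects carrying a `text` field plus optional metadata and the index is built directly from a document list; verify the imports and constructor against your installed version:

```python
# Workaround sketch (API assumed; check against your llama_index version):
# rebuild each document with its text only, discarding the oversized
# metadata so the effective chunk size stays positive.
from llama_index import Document, GPTSimpleVectorIndex

def strip_extra_info(docs):
    """Return copies of the documents that keep only the text field."""
    return [Document(text=doc.text) for doc in docs]

# docs = ElasticsearchReader(...).load_data(...)  # documents as in this issue
# index = GPTSimpleVectorIndex(strip_extra_info(docs))
```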
Stack trace