GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

Closed
jan-tosovsky-cz opened this issue Mar 15, 2023 · 2 comments


@jan-tosovsky-cz

When trying to index documents obtained from Elasticsearch, it fails inside the split_text_with_overlaps method. See the full stack trace at the bottom.

from llama_index import GPTSimpleVectorIndex, download_loader

ElasticsearchReader = download_loader("ElasticsearchReader")
reader = ElasticsearchReader("http://localhost:9201", "my-index")
documents = reader.load_data("content", query=query_dict)  # query_dict built earlier
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)  # llm_predictor built earlier

I think this is caused by an incorrectly constructed Document object. The desired content is stored in the text field, but the original JSON content is also flattened and stored in the extra_info_str field.

text: my content...
extra_info_str: id: 168613\ngroupId: 10719\npublishDate: 19700101000000\nlanguage: English\ncontent: my content...

It is clear that the latter field contains more characters than the former, so when the text is chunked, this expression returns a negative value:

effective_chunk_size = self._chunk_size - num_extra_tokens

For example, if self._chunk_size is 1000 tokens and the flattened metadata occupies 1200 tokens, the effective chunk size is -200, which triggers the ValueError below.

If the Document structure is really incorrect, I assume it should be fixed in the ElasticsearchReader first, but I can imagine having some detection for such cases here as well. A caller-side workaround is sketched below.
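
As a minimal sketch of such a workaround, assuming the Document objects returned by the reader expose a writable extra_info attribute (the field backing extra_info_str in llama_index releases of this era) and that dropping the flattened metadata is acceptable:

from llama_index import GPTSimpleVectorIndex, download_loader

ElasticsearchReader = download_loader("ElasticsearchReader")
reader = ElasticsearchReader("http://localhost:9201", "my-index")
documents = reader.load_data("content", query=query_dict)

# Clear the flattened JSON metadata so it no longer counts against the chunk size.
for doc in documents:
    doc.extra_info = None

index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)

With extra_info cleared, num_extra_tokens is zero and the effective chunk size stays positive.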

Stack trace
Exception has occurred: ValueError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Effective chunk size is non positive after considering extra_info
  File "C:\Python310\Lib\site-packages\llama_index\langchain_helpers\text_splitter.py", line 136, in split_text_with_overlaps
    raise ValueError(
  File "C:\Python310\Lib\site-packages\llama_index\indices\node_utils.py", line 28, in get_text_splits_from_document
    text_splits = text_splitter.split_text_with_overlaps(
  File "C:\Python310\Lib\site-packages\llama_index\indices\node_utils.py", line 48, in get_nodes_from_document
    text_splits = get_text_splits_from_document(
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 263, in _get_nodes_from_document
    return get_nodes_from_document(
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 181, in _add_document_to_index
    nodes = self._get_nodes_from_document(document)
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 206, in _build_index_from_documents
    self._add_document_to_index(index_struct, d)
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 281, in build_index_from_documents
    return self._build_index_from_documents(documents)
  File "C:\Python310\Lib\site-packages\llama_index\token_counter\token_counter.py", line 84, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 109, in __init__
    self._index_struct = self.build_index_from_documents(documents)
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 63, in __init__
    super().__init__(
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\vector_indices.py", line 84, in __init__
    super().__init__(
  File "C:\llama\llama.py", line 40, in <module>
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
  File "C:\Python310\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Lib\runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
ValueError: Effective chunk size is non positive after considering extra_info
@brunoramirez12

Hello there,

It seems that the issue you are facing is related to the structure of the documents retrieved from Elasticsearch. The split_text_with_overlaps method is used by the GPTSimpleVectorIndex to split the document text into chunks of a certain size for processing with the language model. The method takes into account any extra tokens in the document, such as metadata, to ensure that the chunks are of the desired size.

In your case, the extra_info_str field contains so much flattened metadata that its token count alone exceeds the chunk size, which drives effective_chunk_size negative. This can be addressed either by fixing the structure of the documents in the ElasticsearchReader so the metadata is not duplicated into extra_info_str, or by modifying the split_text_with_overlaps method to handle such cases gracefully.

To detect such cases, you could modify GPTSimpleVectorIndex to check for the presence of the extra_info_str field and adjust the chunk size accordingly. For example, you could count the tokens in extra_info_str and subtract them from the chunk size to get the effective chunk size, as sketched below. Alternatively, you could use regular expressions to identify and remove the metadata from the text before chunking it.
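
A hedged sketch of that pre-check, run on the caller's side before building the index rather than inside GPTSimpleVectorIndex; the tiktoken gpt2 encoding and the 1000-token budget are illustrative assumptions, and doc_id/extra_info_str follow the field names used in this issue:

import tiktoken

def find_oversized_extra_info(documents, chunk_size=1000):
    # Count the tokens each document's flattened metadata would consume
    # and flag documents whose metadata leaves no room for the text itself.
    enc = tiktoken.get_encoding("gpt2")
    oversized = []
    for doc in documents:
        num_extra_tokens = len(enc.encode(doc.extra_info_str or ""))
        if num_extra_tokens >= chunk_size:
            oversized.append((doc.doc_id, num_extra_tokens))
    return oversized

# Report offending documents instead of failing deep inside the splitter:
for doc_id, tokens in find_oversized_extra_info(documents):
    print(f"{doc_id}: extra_info uses {tokens} tokens (>= chunk size)")

Running this before constructing the index surfaces the problem documents with a readable report instead of the ValueError above.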

However, I recommend fixing the structure of the documents in the ElasticsearchReader to ensure that they conform to the expected format. This will prevent chunking issues and improve the overall performance and accuracy of the indexing process.

@dosubot bot commented Aug 18, 2023

Hi, @jan-tosovsky-cz! I'm here to help the LlamaIndex team manage their backlog and I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue with the GPTSimpleVectorIndex failing when indexing documents from ElasticSearch due to an incorrectly constructed Document object. The content is stored in the text field, but the original JSON content is flattened and stored in the extra_info_str field, causing the equation for calculating the effective chunk size to return a negative value.

brunoramirez12 suggested fixing the structure of the documents in the ElasticsearchReader to remove the extra_info_str field or modifying the split_text_with_overlaps method to handle such cases. They recommend checking for the presence of the extra_info_str field and adjusting the chunk size accordingly.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and we appreciate your understanding!

@dosubot added the "stale" label ("Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed") on Aug 18, 2023
@dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 10, 2023
@dosubot removed the "stale" label on Sep 10, 2023