GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

jan-tosovsky-cz · 2023-03-15T17:37:26Z

When trying to index documents obtained from ElasticSearch, it fails inside split_text_with_overlaps method. See the full stack trace at the bottom.

ElasticsearchReader = download_loader("ElasticsearchReader")
reader = ElasticsearchReader("http://localhost:9201", "my-index")
documents = reader.load_data("content", query=query_dict)
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)

I think this is caused by the incorrectly constructed Document object. The desired content is stored in the text field, but the original JSON content is flattened and stored inside extra_info_str field.

text: my content...
extra_info_str: id: 168613\ngroupId: 10719\npublishDate: 19700101000000\nlanguage: English\ncontent: my content...

It is clear the latter field contains more chars, so when they are chunked, this equation returns negative value:

effective_chunk_size = self._chunk_size - num_extra_tokens

If the Document structure is really incorrect, I assume it should be fixed in that ElasticsearchReader first, but I can imagine having some detection for such cases here as well.

Stack trace

Exception has occurred: ValueError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Effective chunk size is non positive after considering extra_info
  File "C:\Python310\Lib\site-packages\llama_index\langchain_helpers\text_splitter.py", line 136, in split_text_with_overlaps
    raise ValueError(
  File "C:\Python310\Lib\site-packages\llama_index\indices\node_utils.py", line 28, in get_text_splits_from_document
    text_splits = text_splitter.split_text_with_overlaps(
  File "C:\Python310\Lib\site-packages\llama_index\indices\node_utils.py", line 48, in get_nodes_from_document
    text_splits = get_text_splits_from_document(
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 263, in _get_nodes_from_document
    return get_nodes_from_document(
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 181, in _add_document_to_index
    nodes = self._get_nodes_from_document(document)
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 206, in _build_index_from_documents
    self._add_document_to_index(index_struct, d)
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 281, in build_index_from_documents
    return self._build_index_from_documents(documents)
  File "C:\Python310\Lib\site-packages\llama_index\token_counter\token_counter.py", line 84, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "C:\Python310\Lib\site-packages\llama_index\indices\base.py", line 109, in __init__
    self._index_struct = self.build_index_from_documents(documents)
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\base.py", line 63, in __init__
    super().__init__(
  File "C:\Python310\Lib\site-packages\llama_index\indices\vector_store\vector_indices.py", line 84, in __init__
    super().__init__(
  File "C:\llama\llama.py", line 40, in <module>
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
  File "C:\Python310\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Lib\runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
ValueError: Effective chunk size is non positive after considering extra_info

The text was updated successfully, but these errors were encountered:

brunoramirez12 · 2023-03-15T17:42:45Z

Hello there,

It seems that the issue you are facing is related to the structure of the documents retrieved from Elasticsearch. The split_text_with_overlaps method is used by the GPTSimpleVectorIndex to split the document text into chunks of a certain size for processing with the language model. The method takes into account any extra tokens in the document, such as metadata, to ensure that the chunks are of the desired size.

For you, it appears that the extra_info_str field contains additional metadata that is not being taken into account when chunking the text, leading to a negative value for effective_chunk_size. This can be addressed by either fixing the structure of the documents in the ElasticsearchReader to remove the extra_info_str field or modifying the split_text_with_overlaps method to handle such cases.

To detect such cases, you can modify the GPTSimpleVectorIndex to check for the presence of the extra_info_str field and adjust the chunk size accordingly. For example, you could check the length of the extra_info_str field and subtract it from the chunk size to get the effective chunk size. Alternatively, you could use regular expressions to identify and remove the metadata from the text before chunking it.

However, I recommended to fix the structure of the documents in the ElasticsearchReader to ensure that they conform to the expected format. This will prevent any issues with chunking and improve the overall performance and accuracy of the indexing process.

dosubot · 2023-08-18T16:16:30Z

Hi, @jan-tosovsky-cz! I'm here to help the LlamaIndex team manage their backlog and I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue with the GPTSimpleVectorIndex failing when indexing documents from ElasticSearch due to an incorrectly constructed Document object. The content is stored in the text field, but the original JSON content is flattened and stored in the extra_info_str field, causing the equation for calculating the effective chunk size to return a negative value.

brunoramirez12 suggested fixing the structure of the documents in the ElasticsearchReader to remove the extra_info_str field or modifying the split_text_with_overlaps method to handle such cases. They recommend checking for the presence of the extra_info_str field and adjusting the chunk size accordingly.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and we appreciate your understanding!

jan-tosovsky-cz mentioned this issue Mar 15, 2023

Elasticsearch: Current extra_info content breaks document indexing run-llama/llama-hub#117

Open

dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Aug 18, 2023

dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 10, 2023

dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

jan-tosovsky-cz commented Mar 15, 2023

brunoramirez12 commented Mar 15, 2023

dosubot bot commented Aug 18, 2023

GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

GPTSimpleVectorIndex: Effective chunk size is non positive after considering extra_info #748

Comments

jan-tosovsky-cz commented Mar 15, 2023

brunoramirez12 commented Mar 15, 2023

dosubot bot commented Aug 18, 2023