[Bug]: PapaCSVReader concatRows=true fails for some .csv files #836
Comments
After further investigation, this seems to be an issue with loading certain .csv files using PapaCSVReader. For example, the .csv used in the LlamaIndexTS example documentation loads as expected with the code referenced above (this is the file: ). However, other valid .csv files I tested are not parsed as expected (see the previously attached MOCK_DATA.csv). There is a workaround that circumvents this and allows other .csv files to be loaded.
@reidperyam the error says that the text is too long to generate an embedding.
and the issue remains: I pushed these changes up to the GitHub repro repo @marcusschiesser
Sorry, but I cannot reproduce this bug:
~/Code/practical-star git:[master]
npm run generate
> nextjs@0.1.0 generate
> tsx app/generate.ts
Using 'openai' model provider
CHUNK_SIZE 512
CHUNK_OVERLAP 20
EMBEDDING_DIM 1024
Generating generateDatasource...
STORAGE_CACHE_DIR ./cache
Generating serviceContextFromDefaults...
Generating storageContextFromDefaults...
No valid data found at path: cache/index_store.json starting new store.
No valid data found at path: ./cache/vector_store.json starting new store.
getting docutments...
document ac67672e-d880-4808-8c22-97071ee2f947 loaded
Generating VectorStoreIndex.fromDocuments...
Storage context successfully generated in 0.006s.
Finished generating storage.
@himself65 But does it actually create a vector_store.json and an index_store.json? I could reproduce the bug with the tokens, and it only created a doc_store.json.
Will check this. I think there are some bugs in the node parser.
I think we shouldn't modify the sentence splitter, since there is no grammar for a CSV result. So I think it's better to split the CSV into separate documents.
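A minimal sketch of the "split the CSV into separate documents" idea, assuming the rows have already been parsed into arrays of strings. The `rowsToChunks` helper and its parameters are hypothetical and not part of LlamaIndexTS; each returned string could then be wrapped in its own document so that no single text grows too long to embed.

```typescript
// Hypothetical helper (not a LlamaIndexTS API): groups parsed CSV rows into
// small text chunks, one chunk per `rowsPerChunk` rows.
type CsvRow = string[];

function rowsToChunks(
  rows: CsvRow[],
  rowsPerChunk = 1,
  separator = ", ",
): string[] {
  if (rowsPerChunk < 1) {
    throw new Error("rowsPerChunk must be at least 1");
  }
  const chunks: string[] = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    const group = rows.slice(i, i + rowsPerChunk);
    // Join cells within a row, then join the rows of this group with newlines.
    chunks.push(group.map((row) => row.join(separator)).join("\n"));
  }
  return chunks;
}

// Made-up sample rows, for illustration only.
const exampleRows: CsvRow[] = [
  ["1", "Alice", "alice@example.com"],
  ["2", "Bob", "bob@example.com"],
  ["3", "Carol", "carol@example.com"],
];

const exampleChunks = rowsToChunks(exampleRows, 2);
console.log(exampleChunks.length); // 2
```

Grouping several rows per chunk trades fewer documents against longer texts; one row per chunk is the safest choice when row length varies a lot.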
@himself65 Additional information: if you triple the size of titanic_train.csv by just copy-pasting the content twice, it works as it should and creates an index_store.json and a vector_store.json that contain the same results three times, even though it has more rows and columns. I could reproduce the error with multiple mock files from different file generators. But both titanic_train.csv and movie_reviews.csv worked even when increasing the size. So is it something about the underlying CSV structure?
@himself65 isolated the issue down to the If we extend it to include one or more whitespaces instead of just one whitespace, it fixes the issue:
I don't really know enough about document structures to judge whether that change would break a lot of stuff or not.
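To illustrate the general difference between matching exactly one whitespace character and matching a run of them, here is a small sketch. Both patterns and the sample string are assumptions made up for illustration; they are not the regex used by the library's sentence splitter.

```typescript
// Illustrative patterns only (assumed, not taken from LlamaIndexTS).
const singleWhitespace = /\s/;
const whitespaceRun = /\s+/;

// Made-up sample with two consecutive spaces between the fields.
const sample = "id,name  1,Alice";

// Splitting on a single whitespace character leaves an empty piece
// between the two consecutive spaces.
const bySingle = sample.split(singleWhitespace);
console.log(bySingle); // [ 'id,name', '', '1,Alice' ]

// Splitting on one or more whitespace characters treats the run as one separator.
const byRun = sample.split(whitespaceRun);
console.log(byRun); // [ 'id,name', '1,Alice' ]
```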
After further testing, the above "fix" only works with the mock file and still struggles with other files. So true, there is no proper grammar in CSV, so it's probably quite hard to find the right regex. So either:
Bug Description
The following code fails to generate a VectorStoreIndex from a simple .csv file (see attached MOCK_DATA.csv).
Version
0.2.10
Steps to Reproduce
clone the following github repo:
https://github.com/reidperyam/practical-star/tree/master
see README.md for repro which I will copy here:
First, populate .env with OPENAI_API_KEY
install the dependencies:
npm install
verify that all contents of /cache directory are removed!
Second, generate the embeddings of the documents in the ./data directory:
npm run generate
EXPECTED RESULT:
generated cache/doc_store.json
generated cache/index_store.json
generated cache/vector_store.json
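One way to check which of the expected store files a run actually produced is sketched below. The file names come from the list above; the `missingStoreFiles` helper is hypothetical, and the `./cache` path is assumed from the log output in this thread.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// File names taken from the EXPECTED RESULT list.
const EXPECTED_STORE_FILES = [
  "doc_store.json",
  "index_store.json",
  "vector_store.json",
];

// Hypothetical helper: returns the expected files that are absent from cacheDir.
function missingStoreFiles(cacheDir: string): string[] {
  return EXPECTED_STORE_FILES.filter(
    (name) => !fs.existsSync(path.join(cacheDir, name)),
  );
}

// "./cache" is an assumption based on the STORAGE_CACHE_DIR value in the logs.
const missing = missingStoreFiles("./cache");
console.log(
  missing.length === 0
    ? "all store files present"
    : `missing: ${missing.join(", ")}`,
);
```

Run it after `npm run generate`; a result listing index_store.json and vector_store.json as missing would match the failure described in this issue.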
ACTUAL RESULT:
Error generating text embeddings (see Relevant Logs/Tracebacks)
Relevant Logs/Tracebacks