
Missing documents on Vespa Cloud #396

Closed
neo-anderson opened this issue Nov 10, 2022 · 4 comments


@neo-anderson

Hi!
I noticed that some batches didn't complete successfully during batch feed to Vespa Cloud.

Successful documents fed: 986/1000.
Batch progress: 7/1094.

I was ingesting a dataset of about 1M documents, but Vespa Cloud only shows 735k.
Assuming it's because of the failed batches, I reingested the dataset. However, I still saw failures (but not the same batches that failed earlier) and the document count is still stuck at 735k on the dashboard as well as the query response json.
Disk usage is around 2% on Vespa Cloud dashboard.
What should I do? Thanks!

@thigm85
Contributor

thigm85 commented Nov 11, 2022

There are likely schema issues with the documents that are failing. You can inspect the results and check the error messages, for example:

results = app.feed_batch(schema="coso", batch=data)  # same arguments as your feed call
failed_docs = [x for x in results if x.status_code != 200]
failed_docs[0].json  # inspect the error message
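As a minimal, self-contained sketch of that filtering step (using stand-in response objects rather than a live Vespa app — the status code and error message below are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class FeedResponse:
    """Stand-in for a pyvespa feed response: the real objects also
    expose a status_code and a json payload with the error message."""
    status_code: int
    json: dict


# Simulated batch results: one success, one failure.
results = [
    FeedResponse(200, {"id": "id:coso:coso::0"}),
    FeedResponse(507, {"id": "id:coso:coso::1", "message": "example error"}),
]

# Keep only the responses that did not succeed.
failed_docs = [x for x in results if x.status_code != 200]
print(len(failed_docs))     # 1
print(failed_docs[0].json)  # inspect the error message
```

Against a real app, the same list comprehension over the `feed_batch` return value tells you which documents to re-feed.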

@neo-anderson
Author

I tried on a more powerful machine.

In [9]: failed_docs = [x for x in response if x.status_code != 200]
In [10]: len(failed_docs)
Out[10]: 0

No failed feeds this time.
The number of documents in the data store is still 735.1k instead of 1M+.

For context,

# command used to ingest data
response = app.feed_batch(schema="coso", batch=data, batch_size=1000, total_timeout=200, asynchronous=True, connections=100)
len(data)
# 1093460

I tried ingesting data into prod and see the same number of documents in the prod store as well - 735.1k.

What else can I do to check what went wrong?

@thigm85
Contributor

thigm85 commented Nov 12, 2022

You need to specify a unique id for each document, e.g.:

batch_feed = [
    {
        "id": idx,
        "fields": sentence
    }
    for idx, sentence in enumerate(sentence_data)
]

Maybe you have duplicate `id`s?
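A quick way to check for that (a generic sketch with a hypothetical three-document batch, not specific to pyvespa) is to compare the number of distinct ids against the batch size:

```python
from collections import Counter

# Hypothetical batch where two documents share id "1": feeding a
# document whose id already exists replaces the earlier document
# instead of adding a new one.
batch = [
    {"id": "0", "fields": {"text": "first"}},
    {"id": "1", "fields": {"text": "second"}},
    {"id": "1", "fields": {"text": "third"}},
]

ids = [doc["id"] for doc in batch]
duplicates = [i for i, n in Counter(ids).items() if n > 1]
print(len(ids) - len(set(ids)))  # documents lost to overwrites: 1
print(duplicates)                # ['1']
```

If `len(ids) - len(set(ids))` roughly matches the gap between the batch size and the document count on the dashboard, duplicate ids are the cause.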

@neo-anderson
Author

That's it! Generated ids were not unique and resulted in documents getting overwritten in the index. Thanks for the pointer.
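For reference, a generic sketch of an id scheme that cannot collide: deriving ids from the batch index (as in the `enumerate` example above) rather than from document content.

```python
docs = ["first sentence", "second sentence", "first sentence"]

# enumerate yields a distinct index per document, so ids stay unique
# even when the content repeats; a content-derived id (e.g. a hash of
# the text) would collide on the duplicate and silently overwrite it.
batch = [{"id": str(idx), "fields": {"text": text}}
         for idx, text in enumerate(docs)]

ids = [d["id"] for d in batch]
print(len(ids) == len(set(ids)))  # True: every feed creates a new document
```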
