
Missing documents on Vespa Cloud #396

Closed
neo-anderson opened this issue Nov 10, 2022 · 4 comments


@neo-anderson

Hi!
I noticed that some batches didn't complete successfully during batch feed to Vespa Cloud.

Successful documents fed: 986/1000.
Batch progress: 7/1094.

I was ingesting a dataset of about 1M documents, but Vespa Cloud only shows 735k.
Assuming it's because of the failed batches, I reingested the dataset. However, I still saw failures (but not the same batches that failed earlier) and the document count is still stuck at 735k on the dashboard as well as the query response json.
Disk usage is around 2% on Vespa Cloud dashboard.
What should I do? Thanks!

@thigm85
Contributor

thigm85 commented Nov 11, 2022

There are likely schema issues with the documents that are failing. You can inspect the results and check the error messages, for example:

results = app.feed_batch(schema="coso", batch=data)  # same arguments as your feed call
failed_docs = [x for x in results if x.status_code != 200]
failed_docs[0].json  # inspect the error message
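As a minimal, self-contained sketch of that filtering step (using stand-in response objects rather than a live Vespa app — the status code and error message below are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class FeedResponse:
    """Stand-in for a pyvespa feed response: the real objects also
    expose a status_code and a json payload with the error message."""
    status_code: int
    json: dict


# Simulated batch results: one success, one failure.
results = [
    FeedResponse(200, {"id": "id:coso:coso::0"}),
    FeedResponse(507, {"id": "id:coso:coso::1", "message": "example error"}),
]

# Keep only the responses that did not succeed.
failed_docs = [x for x in results if x.status_code != 200]
print(len(failed_docs))     # 1
print(failed_docs[0].json)  # inspect the error message
```

Against a real app, the same list comprehension over the `feed_batch` return value tells you which documents to re-feed.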

@neo-anderson
Author

I tried on a more powerful machine.

In [9]: failed_docs = [x for x in response if x.status_code != 200]
In [10]: len(failed_docs)
Out[10]: 0

No failed feeds this time.
The number of documents in the data store is still 735.1k instead of 1M+.

For context,

# command used to ingest data
response = app.feed_batch(schema="coso", batch=data, batch_size=1000, total_timeout=200, asynchronous=True, connections=100)
len(data)
# 1093460

I tried ingesting data into prod and see the same number of documents in the prod store as well - 735.1k.

What else can I do to check what went wrong?

@thigm85
Contributor

thigm85 commented Nov 12, 2022

You need to specify a unique id for each document, e.g.:

batch_feed = [
    {
        "id": idx,
        "fields": sentence
    }
    for idx, sentence in enumerate(sentence_data)
]

Maybe you have duplicate `id`s?
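A quick way to check for that (a generic sketch with a hypothetical three-document batch, not specific to pyvespa) is to compare the number of distinct ids against the batch size:

```python
from collections import Counter

# Hypothetical batch where two documents share id "1": feeding a
# document whose id already exists replaces the earlier document
# instead of adding a new one.
batch = [
    {"id": "0", "fields": {"text": "first"}},
    {"id": "1", "fields": {"text": "second"}},
    {"id": "1", "fields": {"text": "third"}},
]

ids = [doc["id"] for doc in batch]
duplicates = [i for i, n in Counter(ids).items() if n > 1]
print(len(ids) - len(set(ids)))  # documents lost to overwrites: 1
print(duplicates)                # ['1']
```

If `len(ids) - len(set(ids))` roughly matches the gap between the batch size and the document count on the dashboard, duplicate ids are the cause.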

@neo-anderson
Author

That's it! Generated ids were not unique and resulted in documents getting overwritten in the index. Thanks for the pointer.
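For reference, a generic sketch of an id scheme that cannot collide: deriving ids from the batch index (as in the `enumerate` example above) rather than from document content.

```python
docs = ["first sentence", "second sentence", "first sentence"]

# enumerate yields a distinct index per document, so ids stay unique
# even when the content repeats; a content-derived id (e.g. a hash of
# the text) would collide on the duplicate and silently overwrite it.
batch = [{"id": str(idx), "fields": {"text": text}}
         for idx, text in enumerate(docs)]

ids = [d["id"] for d in batch]
print(len(ids) == len(set(ids)))  # True: every feed creates a new document
```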
