
Batch mode for document APIs #26579

Closed

eostis opened this issue Mar 24, 2023 · 7 comments

Comments

@eostis commented Mar 24, 2023

Related to #16713, closed a long time ago.

@jobergum I agree, a batch mode to index data would be more than useful. Calling the API tens of thousands of times is really not good.

@frodelu Personally, I would happily accept a single error code for the whole batch, for instance that of the first failing document. In my client code, I simply do not commit if any error is detected in a batch call, to prevent inconsistencies.

@bratseth (Member)

That ticket was "while we wait for HTTP/2". Now we have HTTP/2, so there is no reason to add batching.

@eostis (Author) commented Mar 24, 2023

Are there examples of sending a "batch" of document updates with HTTP/2?

Does it depend on the client language?

How do we handle errors?

@bratseth (Member)

You don't need batching with HTTP/2 since it is fully asynchronous.

The client language needs an HTTP client that supports HTTP/2, but most have one by now, I think.

Each request may still fail with its own error, even though it is asynchronous.
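
For illustration, a minimal sketch of one such asynchronous request with per-request error handling, in PHP with Guzzle; the URL and payload here are placeholder assumptions, not a definitive recipe:

use GuzzleHttp\Client;

// Guzzle uses HTTP/2 when curl is built with HTTP/2 support.
$client = new Client( [ 'version' => 2.0 ] );

// Placeholder /document/v1/ URL and payload, for illustration only.
$promise = $client->postAsync(
	'http://localhost:8080/document/v1/mynamespace/mydoctype/docid/1',
	[ 'json' => [ 'fields' => [ 'title' => 'example' ] ] ]
)->then(
	function ( $response ) {
		// Per-document success.
	},
	function ( $reason ) {
		// Per-document error: log it, retry, or skip the commit.
	}
);

$promise->wait();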

@eostis (Author) commented Mar 24, 2023

I found https://github.com/vespa-engine/pyvespa/blob/master/vespa/application.py#L1082, which uses async, semaphores, and coroutines to loop over documents for async batch updates.

It looks to me like batching with Vespa is currently the responsibility of clients: depending on the programming language, one has to build the asynchronous looped calls oneself. And we know this is a difficult problem.

Also, I doubt that batches of hundreds of documents can be implemented in the front-end (my WooCommerce customers often index 500-1000 products per batch).

Wouldn't it be better to implement batching in the backend, rather than in the front-end?

@bratseth (Member)

Short answer: No :-)

The backend here is a distributed system where async messaging is used to let individual document operations flow to where they should go for intermediate processing and final storage, and where different streams of updates (to the same or different documents) may be ongoing indefinitely and in parallel. Replacing this with batching is only truly possible in a small subset of cases, and even then it is suboptimal. If clients need to turn the stream into chunks, interleaved by waiting for completion, that should happen in the client. I believe this has become commonly supported in most languages now that we have HTTP/2 clients. I don't see a problem with doing this in chunks of 1k, by the way.
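
As an illustration of such client-side chunking, here is a rough sketch in PHP with Guzzle over HTTP/2; $documents and vespaUrlFor() are hypothetical stand-ins for your own data and URL builder:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client( [ 'version' => 2.0 ] );

// $documents and vespaUrlFor() are hypothetical, for illustration only.
foreach ( array_chunk( $documents, 1000 ) as $chunk ) {
	$promises = [];
	foreach ( $chunk as $document ) {
		$promises[] = $client->postAsync(
			vespaUrlFor( $document ),
			[ 'json' => [ 'fields' => $document ] ]
		);
	}

	// Wait for the whole chunk to complete before starting the next one.
	// settle() collects every outcome instead of throwing on the first failure.
	foreach ( Utils::settle( $promises )->wait() as $outcome ) {
		if ( $outcome['state'] !== 'fulfilled' ) {
			// Handle the per-document error, e.g. skip the commit for this chunk.
		}
	}
}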

@eostis (Author) commented Mar 27, 2023

Thanks for the detailed answer. I'll manage the chunks!

eostis closed this as completed Mar 27, 2023
@eostis (Author) commented Mar 27, 2023

For those interested in PHP async HTTP/2 batching (schematic: $documents and vespaDocumentUrl() are placeholders for your own data and URL builder):


use GuzzleHttp\Client;
use GuzzleHttp\Promise;

// 'version' => 2.0 asks Guzzle to use HTTP/2 (requires curl built with HTTP/2 support).
$client = new Client( [
	'version' => 2.0,
	// 'debug' => true,
] );

// One async request per document; nothing is sent serially or blocked here.
$promises = [];
foreach ( $documents as $document ) {
	$promises[] = $client->postAsync(
		vespaDocumentUrl( $document ),
		[ 'json' => [ 'fields' => $document ] ]
	);
}

// Waits for all requests to complete; throws on the first rejected promise.
$result = Promise\Utils::unwrap( $promises );
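
Note that Promise\Utils::unwrap() throws on the first rejected promise, which fits the "do not commit if any error is detected" approach above; if you want the outcome of every request instead, Promise\Utils::settle() collects all results without throwing.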
