
Batch mode for document APIs #26579

Closed

eostis opened this issue Mar 24, 2023 · 7 comments

Comments

@eostis commented Mar 24, 2023

Related to #16713, closed a long time ago.

@jobergum I agree, a batch mode to index data would be more than useful. Calling the API tens of thousands of times is really not good.

@frodelu Personally, I would happily accept a single error code for the whole batch, for instance that of the first failing document. In my client code, I simply do not commit if any error is detected in a batch call, to prevent inconsistencies.

@bratseth (Member)

That ticket was "while we wait for HTTP/2". Now we have HTTP/2, so there is no reason to add batching.

@eostis (Author) commented Mar 24, 2023

Are there examples of sending a "batch" of document updates with HTTP/2?

Does it depend on the client language?

How do we handle errors?

@bratseth (Member)

You don't need batching with HTTP/2 since it is fully asynchronous.

The client language needs an HTTP client that supports HTTP/2, but most have one by now, I think.

Each request may still fail with its own error, even though it is asynchronous.
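
For illustration, a minimal sketch of one such asynchronous request with per-request error handling, in PHP with Guzzle; the URL and payload here are placeholder assumptions, not a definitive recipe:

use GuzzleHttp\Client;

// Guzzle uses HTTP/2 when curl is built with HTTP/2 support.
$client = new Client( [ 'version' => 2.0 ] );

// Placeholder /document/v1/ URL and payload, for illustration only.
$promise = $client->postAsync(
	'http://localhost:8080/document/v1/mynamespace/mydoctype/docid/1',
	[ 'json' => [ 'fields' => [ 'title' => 'example' ] ] ]
)->then(
	function ( $response ) {
		// Per-document success.
	},
	function ( $reason ) {
		// Per-document error: log it, retry, or skip the commit.
	}
);

$promise->wait();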

@eostis (Author) commented Mar 24, 2023

I found https://github.com/vespa-engine/pyvespa/blob/master/vespa/application.py#L1082, which uses async, semaphores, and coroutines to loop over documents for async batch updates.

It looks to me like batching with Vespa is currently the responsibility of clients: depending on the programming language, one has to build the asynchronous looped calls oneself. And we know this is a difficult problem.

Also, I doubt that batches of hundreds of documents can be implemented in the front-end (my WooCommerce customers often index 500-1000 products per batch).

Wouldn't it be better to implement batching in the backend, rather than in the front-end?

@bratseth (Member)

Short answer: No :-)

The backend here is a distributed system where async messaging is used to let individual document operations flow to where they should go for intermediate processing and final storage, and where different streams of updates (to the same or different documents) may be ongoing indefinitely and in parallel. Replacing this with batching is only truly possible in a small subset of cases, and even then it is suboptimal. If clients need to turn the stream into chunks, interleaved by waiting for completion, that should happen in the client. I believe this has become commonly supported in most languages now that we have HTTP/2 clients. I don't see a problem with doing this in chunks of 1k, by the way.
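
As an illustration of such client-side chunking, here is a rough sketch in PHP with Guzzle over HTTP/2; $documents and vespaUrlFor() are hypothetical stand-ins for your own data and URL builder:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client( [ 'version' => 2.0 ] );

// $documents and vespaUrlFor() are hypothetical, for illustration only.
foreach ( array_chunk( $documents, 1000 ) as $chunk ) {
	$promises = [];
	foreach ( $chunk as $document ) {
		$promises[] = $client->postAsync(
			vespaUrlFor( $document ),
			[ 'json' => [ 'fields' => $document ] ]
		);
	}

	// Wait for the whole chunk to complete before starting the next one.
	// settle() collects every outcome instead of throwing on the first failure.
	foreach ( Utils::settle( $promises )->wait() as $outcome ) {
		if ( $outcome['state'] !== 'fulfilled' ) {
			// Handle the per-document error, e.g. skip the commit for this chunk.
		}
	}
}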

@eostis (Author) commented Mar 27, 2023

Thanks for the detailed answer. I'll manage the chunks!

eostis closed this as completed Mar 27, 2023
@eostis (Author) commented Mar 27, 2023

For those interested in PHP async HTTP/2 batching (schematic: $documents and vespaDocumentUrl() are placeholders for your own data and URL builder):


use GuzzleHttp\Client;
use GuzzleHttp\Promise;

// 'version' => 2.0 asks Guzzle to use HTTP/2 (requires curl built with HTTP/2 support).
$client = new Client( [
	'version' => 2.0,
	// 'debug' => true,
] );

// One async request per document; nothing is sent serially or blocked here.
$promises = [];
foreach ( $documents as $document ) {
	$promises[] = $client->postAsync(
		vespaDocumentUrl( $document ),
		[ 'json' => [ 'fields' => $document ] ]
	);
}

// Waits for all requests to complete; throws on the first rejected promise.
$result = Promise\Utils::unwrap( $promises );
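
Note that Promise\Utils::unwrap() throws on the first rejected promise, which fits the "do not commit if any error is detected" approach above; if you want the outcome of every request instead, Promise\Utils::settle() collects all results without throwing.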
