2.25.2.0-b222

@spolitov spolitov tagged this 25 Mar 04:36
Summary:
The process of vector index backfill is initiated and executed entirely on the tserver.
During this time, postgres simply waits for the process to complete before proceeding further.

Given this workflow, it would be feasible to apply the same underlying logic for both concurrent and non-concurrent index creation.
Since the non-concurrent approach involves simpler logic at the higher layers compared to the concurrent approach, it would be more efficient and straightforward to always use the non-concurrent mode when creating a vector index.
This would streamline the process and eliminate the need to handle separate logic paths for concurrent and non-concurrent scenarios.

During the backfill process, DocDB ensures that the data being indexed is consistent with the state of the database at the time the index build started.
This is achieved by reading a snapshot of the data using MVCC (Multi-Version Concurrency Control), so the index reflects a consistent view of the data.
Because of that, there is no difference between concurrent and non-concurrent backfill.
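
As an illustration only, the snapshot read described above can be sketched with a toy MVCC store. This is not DocDB code: VersionedStore, read_at and backfill are hypothetical names invented for this sketch, and a simple integer counter stands in for the read time.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy MVCC store (hypothetical): each key keeps a list of (write_time, value)."""
    versions: dict = field(default_factory=dict)
    clock: int = 0

    def write(self, key, value):
        # Every write gets a new, strictly increasing timestamp.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read_at(self, key, read_time):
        # Return the newest version written at or before read_time, if any.
        visible = [v for t, v in self.versions.get(key, []) if t <= read_time]
        return visible[-1] if visible else None


def backfill(store, read_time):
    """Build index entries from the snapshot taken at read_time."""
    entries = {}
    for key in store.versions:
        value = store.read_at(key, read_time)
        if value is not None:
            entries[key] = value
    return entries


store = VersionedStore()
store.write("row1", [0.1, 0.2])
store.write("row2", [0.3, 0.4])
backfill_time = store.clock       # snapshot taken when the index build starts

# Writes that arrive while the backfill is running.
store.write("row1", [9.9, 9.9])
store.write("row3", [0.5, 0.6])

index = backfill(store, backfill_time)
```

In this sketch the backfill sees only the two rows as they were at backfill_time. The later writes are not part of the snapshot, so the backfill result is the same whether or not DMLs run alongside it.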

Semantically, users expect a "concurrent" backfill not to cause DMLs to fail, but we do get schema version mismatch errors.
Concurrent backfill is generally risky in the presence of multiple nodes.
In the vector case, however, indexes are local to each shard rather than global, so the only concern is the backfill itself, which the non-concurrent path already handles properly.

Jira: DB-15764

Test Plan: PgIndexBackfillTest.VectorIndex

Reviewers: jason

Reviewed By: jason

Subscribers: slingam, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D42230