This design document explains how online backfill of indexes in YugabyteDB works. Upon adding new indexes to a table that already has data, this feature would enable building these indexes in an online manner, while continuing to serve other traffic. Note that this feature should work across both YSQL and YCQL APIs.
- Online builds: Support building the indexes without locking out reads or writes on the table. The index build itself will occur asynchronously.
- Correctness: After the index builds are completed, they should be consistent with the data in the primary table.
- Constraint violations: If a problem arises while scanning the table, such as a unique constraint violation in a unique index, the
CREATE INDEX
command should abort and result in a failure. An aborted index will be cleaned up and deleted. Details (such as which constraints were violated) will be found in the logs. - Efficient for large datasets: Index build should occur in a distributed manner (utilizing multiple/all nodes in the cluster) to efficiently handle large datasets.
- Resilience: The index build should be resilient to failures. The entire build process should not need to restart on a node failure in the cluster.
Note: Online index backfill relies on the online schema change framework. This design doc assumes the reader is familiar with how online schema changes are handled in YugabyteDB.
Before a schema change is initiated, the currently active copy of the schema is stored in the YB-Master, and cached on all the nodes (YB-TServers). The process of schema change is initiated by sending an RPC call to the YB-Master. The overall protocol for safely creating and backfilling an index (in other words, the protocol for performing online schema changes) relies on transitioning through various intermediate states explained below.
Let us say that we have a table MyTable
with pre-existing data and we are adding an index MyIndex
to this table.
Once the updates are made, the YB-Master leader then creates the desired number of new tablets for the index table MyIndex
and sends asynchronous HandleAlterTable()
requests to each tablet leader of the table. Typically, until the backfill process is complete, the newly created index will not be available for any reads. However, incoming write operations that are concurrent with the backfill process may need to update the index.
The backfill process moves through the following 4 states (after the MyIndex
index table has already been created). The currently active state of the MyIndex
index is persisted by the YB-Master and replicated across all the YB-TServers as a part of the system catalog / metadata. The IndexPermissions
state entry for MyIndex
is used to determine what kind of index updates/access will be allowed against the index at any point in time.
-
DELETE_ONLY
: In this state, whenever a row inMyTable
is updated, the delete operation on the index (corresponding to the old value) is applied to theMyIndex
index table. However, writes to the index (corresponding to the new value) are prohibited. All the queries/updates continue againstMyTable
(and the existing indexes if any).For example, in a typical update operation consisting of the following steps, only the
DELETE
operation is applied to the index:BEGIN INTERNAL TRANSACTION UPDATE the primary MyTable table DELETE the old value in the MyIndex index table INSERT the new value into the MyIndex index table COMMIT INTERNAL TRANSACTION
-
WRITE_AND_DELETE
: In this state, whenever a row is updated, the the following operations are applied to theMyIndex
index table:- delete the old value
- update the new value
The update to the index is performed using the current hybrid timestamp, which is the same hybrid time used to insert the values into the primary table. This ensures that once the update completes, the hybrid timestamp of the entry in primary table and the index table are the same. The index can still not be used to perform read operations.
-
DB_REORG
: This is the state where the YB-TServers actually perform the backfill. This state could take a long time to complete, this depends on the dataset size. In terms of operations applied to theMyIndex
table, this state is similar toWRITE_AND_DELETE
- all inserts, updates and deletes are applied to theMyIndex
index table. The index cannot be used to perform read operations, which still are satisfied by the primary tableMyTable
(and any existing indexes if any).Note: The details of the backfill process are covered in a dedicated section below.
-
READ_WRITE_AND_DELETE
: This is the final state, where theMyIndex
index can be used to serve reads.
The steps below outline the index backfill algorithm.
The CREATE INDEX
statement (approximate syntax shown below) can be sent to any node of the cluster. Note that the exact details of this statement depends on the exact API being used (YSQL or YCQL).
CREATE INDEX MyIndex on MyTable (...);
This statement is parsed and executed by the YB-TServer, which results in an AlterTable()
RPC call to the YB-Master leader. This RPC call which kicks off the multi-stage online schema change, of which the the index build is one stage. The CREATE INDEX
command is asynchronous, it does not wait for the index backfill to complete. It would be possible to query and determine the status (IN-PROGRESS
, SUCCESS
, FAILED
) of the asynchronous job.
Upon receiving the AlterTable()
RPC call, the YB-Master first performs the requisite updates to the system catalog / metadata in a transactional manner. The updates essentially do the following:
- Adds a reference from the list of indexes in the primary table
MyTable
to the new indexMyIndex
- Stores the information about the index
MyIndex
- Creates an entry in the
IndexPermissions
table for thisMyIndex
index, which determines what can be done with the index. - Sets the state of the
MyIndex
index toDELETE_ONLY
Note: The exact set of updates to the system catalog vary based on the API, meaning the set of updates performed in the case of YSQL would differ from YCQL since the metadata organization is different between the two APIs.
After setting the MyIndex
index state to the DELETE_ONLY
, the YB-Master leader sends the HandleAlterTable()
RPC calls to the various YB-TServers. The HandleAlterTable()
RPC call initiates a schema change on all the nodes in the cluster. Once the HandleAlterTable()
call completes on all the tablets of the table, the YB-Master performs checks to see if another schema change is required (for example, in the case of adding multiple indexes to a table and building all of them at the same time). If another change is required, this results in another round of schema changes across all the tablets.
Once all the schema changes are propagated to all the nodes, the index state is updated from DELETE_ONLY
to WRITE_AND_DELETE
. Once all the schema changes converge, the index state finally gets set to DB_REORG
.
After the index state is updated to DB_REORG
, the YB-Master orchestrates the backfill process by issuing BackfillIndex()
RPC calls to each tablet. This starts building the index across all the tablets of the table MyTable
. The YB-Master keeps track of how many tablets have completed the build. At this point, the YB-Master needs to wait for the backfill to complete on all the tablets before updating the table to the READ_WRITE_AND_DELETE
state.
Note: Details of how the index backfill works on any tablet is covered in detail in the next section.
Once the index build is successfully completed on all the tablets of the table, the table state is updated to READ_WRITE_AND_DELETE
, at which point the index is completely rebuilt.
The backfill process is a background job that runs on each of the tablets of the MyTable
table. Since each row belongs to exactly one tablet, the backfill process on any tablet can proceed independent of the others. The index build process is made efficient by running the build process for multiple tablets in parallel.
For YCQL, index backfill for a tablet is handled in the tserver of that tablet (leader). For YSQL, since we currently don't have sufficient context in tservers, a connection is opened up from tserver to postgres. Postgres then takes the responsibility of reading the tablet and writing to the index table.
The following apply to both YCQL and YSQL:
-
The index build requires a scan of the entire tablet data. However, there could be new updates happening on the dataset which would affect the values read by this scan. In order to prevent this, the scan is performed at a fixed timestamp. This hybrid logical timestamp
t_read
is picked by the YB-Master and sent to all the tablets. The data is scanned using this timestampt_read
as the read point so that subsequent writes do not affect the values read by this scan. -
The data is then scanned to generate the writes that need to be applied to the index table. These generated writes are batched and a batched write is performed to update the index table.
-
It is important that the generated write entries being applied to the index table are written with a hybrid timestamp that is in the past, so that it is older than the hybrid timestamp of the new update operations that are running concurrent with the backfill process. These entries can either be written with one of the following hybrid logical timestamps (HTS):
- The update time of the row being read
- The timestamp
t_read
with which we are performing the scan
-
Note that compactions for the Index table would not reclaim the delete markers until the backfill process is complete, i.e. until the index is in READ_WRITE_AND_DELETE state.
A unique index will accept the writes only if both the following conditions hold true
-
Scan backwards in time and either:
- there is no previous entry.
- there is an entry and the immediately previous entry is a delete.
- there is an entry and the immediately previous entry value matches the value being written.
-
Scan forward in time and either:
- there is no next entry for that key.
- there is an entry and the immediately next entry is a delete.
- there is an entry and the immediately next entry value matches the value being written.
Requirement 1) is similar to what a unique index would do anyways. Condition 2) is require to detect cases where a concurrent insert/update - that violates uniqueness - may have been accepted; because the conflicting row was not backfilled. Having this criteria will help detect the conflict when the backfilled entry arrives after the concurrent write.
The rate at which the backfill should proceed can be specified by the desired number of rows of the primary table MyTable
to process per second. In order to enforce this rate, the index backfill process keeps track of the number of rows being processed per second from the primary table MyTable
. Note that this counter is maintained per backfill task.
Additionally, the maximum number of backfill operations happening on any YB-TServer across tablets can also be specified in order to rate-limit backfilling.
So far, the discussion has made the assumption that all the concurrent updates (happening at the same time as the index build) on the MyTable
table are instantaneous operations. Specifically, this implies that:
- All the concurrent updates finish at a point in time
- The resulting updates on the index table (
MyIndex
in our example) can be handled based on its state as of that time (as determined by the value set inIndexPermissions
).
However, this may not hold true for “transactions” where the write/index-permission checking is done at “apply” time. However the backfill algorithm, that may kick in later, will only see the “commit” time.
This means that if a write was “applied” before getting to update the index (wrt deleting the old value), and commits “after” the backfill timestamp is chosen, then neither operations may be updating the “index” to delete the overwritten value.
To guard against this case, the GetSafeTime()
operation will wait for all “pending transactions” to finish (i.e. commit or abort) before determining the timestamp at which the scan is to be performed for backfill.
- Note that this strategy causes the backfill to “wait on” pending transactions, and if somebody has a terminal open with an ongoing txn that never commits. Backfill may be stalled indefinitely unless the very long running transaction is forcibly aborted.
- Add a timeout mechanism after which, pending txns (that started before getting to
WRITE_AND_DELETE
state) will be aborted. - For user-enforced txns, there is no way for the tablet to know if a txn is in progress or not. The tablet will thus just “wait for a specific duration” which is a system flag
--index_backfill_upperbound_for_user_enforced_txn_duration_ms
and can be controlled by the user. In order to be safe, this value should be set to at least the RPC timeout (or a greater value).
The YB-TServers, as a part of handling each of these RPC calls, will backfill only a portion of the tablet’s key range, and respond with a checkpointing info representing how far it managed to successfully build (much like a paging state). The YB-Master persists the checkpointing info in TabletInfo and issues futher RPCs as necessary for completing the entire key range for the tablet.
The backfill process itself does not require that the job be restarted if there is a leadership change. The peer that has already started doing the backfill may be allowed to complete.
If a user creates multiple indices, the backfill for the different indices should be batched together so that only one scan is done. This would require the following:
- The user can create multiple indexes without the build immediately starting
- Backfill needs to be kicked off by the user using an explicit command -- say something along the lines of:
BUILD/BACKFILL index <indexed_table>