[YSQL] Resume backfill on master leader failover #6218

jaki · 2020-10-30T07:39:49Z

Jira Link: DB-2052
An upcoming fix for YSQL index backfill will make it more stable but consequently cause backfill to not get retried on failures like master leader failover. Once backfill is stabilized, bring this feature back.

Summary: To improve YSQL index backfill failure handling, switch over to a new design: - Make `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information) - For the backfill step, add a postgres-master RPC `BackfillIndex` Now, `CREATE INDEX` with backfill enabled looks like 1. postgres: send create index request to master 1. master: create index table 1. master: alter index table to have index info at `WRITE_AND_DELETE` perm - don't set fully applied schema - set schema to `WRITE_AND_DELETE` - set `ALTERING` state 1. tservers: apply the alter to the indexed table 1. master: _don't_ move on to the next permission - don't set fully applied schema - keep schema at `WRITE_AND_DELETE` 1. postgres: create postgres index at `indislive` permission 1. postgres: add `indisready` permission 1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions 1. master: move on to backfilling 1. master: get safe time for read 1. tservers: get safe time 1. master: send backfill requests to tservers 1. tservers: send backfill requests to postgres 1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`) - clear fully applied schema - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING` - clear `ALTERING` state 1. postgres: finish waiting and, on success, add `indisvalid` permission If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index. Add some gflags: - master: - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info--issue #6234) - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver backfill tablet rpc calls (useful to handle large tables because we currently don't respect the deadline and try to backfill the entire tablet at once--issue #5326) - tserver: - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing `pg_index` state flag change (currently helpful, but not foolproof for consistency, when, in our current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes--issue #6238) - `ysql_wait_until_index_permissions_timeout_ms`: timeout for `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables--issue #5326) Add some helper functions: - `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused) - `yb::pgwrapper::GetBool` for pgwrapper tests Adjust some tests to the new backfill model: - `PgLibPqTest.BackfillPermissions` - `PgLibPqTest.BackfillReadTime` Fix `BackfillIndexesForYsql` to better handle libpq connections and results. Other issues: - `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call - An in-progress master to tserver backfill RPC will not notice a DROP INDEX removes index tablets - master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218) Close: #5325 Test Plan: - `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter` - `PgLibPqTest.BackfillWaitBackfillTimeout` - `PgLibPqTest.BackfillDropAfterFail` - `PgLibPqTest.BackfillMasterLeaderStepdown` - `PgLibPqTest.BackfillDropWhileBackfilling` - `PgLibPqTest.BackfillAuth` Reviewers: amitanand, mihnea Reviewed By: mihnea Subscribers: yql, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D9682

Summary: To improve YSQL index backfill failure handling, switch over to a new design: - Make `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information) - For the backfill step, add a postgres-master RPC `BackfillIndex` Now, `CREATE INDEX` with backfill enabled looks like 1. postgres: send create index request to master 1. master: create index table 1. master: alter index table to have index info at `WRITE_AND_DELETE` perm - don't set fully applied schema - set schema to `WRITE_AND_DELETE` - set `ALTERING` state 1. tservers: apply the alter to the indexed table 1. master: _don't_ move on to the next permission - don't set fully applied schema - keep schema at `WRITE_AND_DELETE` 1. postgres: create postgres index at `indislive` permission 1. postgres: add `indisready` permission 1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions 1. master: move on to backfilling 1. master: get safe time for read 1. tservers: get safe time 1. master: send backfill requests to tservers 1. tservers: send backfill requests to postgres 1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`) - clear fully applied schema - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING` - clear `ALTERING` state 1. postgres: finish waiting and, on success, add `indisvalid` permission If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index. Add some gflags: - master: - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info--issue #6234) - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver backfill tablet rpc calls (useful to handle large tables because we currently don't respect the deadline and try to backfill the entire tablet at once--issue #5326) - tserver: - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing `pg_index` state flag change (currently helpful, but not foolproof for consistency, when, in our current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes--issue #6238) - `ysql_wait_until_index_permissions_timeout_ms`: timeout for `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables--issue #5326) Add some helper functions: - `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused) - `yb::pgwrapper::GetBool` for pgwrapper tests Adjust some tests to the new backfill model: - `PgLibPqTest.BackfillPermissions` - `PgLibPqTest.BackfillReadTime` Fix `BackfillIndexesForYsql` to better handle libpq connections and results. Other issues: - `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call - An in-progress master to tserver backfill RPC will not notice a DROP INDEX removes index tablets - master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218) Depends on D9803 Test Plan: Jenkins: rebase: 2.3 - `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter` - `PgLibPqTest.BackfillWaitBackfillTimeout` - `PgLibPqTest.BackfillDropAfterFail` - `PgLibPqTest.BackfillMasterLeaderStepdown` - `PgLibPqTest.BackfillDropWhileBackfilling` - `PgLibPqTest.BackfillAuth` Reviewers: mihnea Reviewed By: mihnea Subscribers: yql, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D9804

jaki added kind/enhancement This is an enhancement of an existing feature area/ysql Yugabyte SQL (YSQL) labels Oct 30, 2020

jaki self-assigned this Oct 30, 2020

jaki added this to To do/Current-Sprint in Index backfill via automation Oct 30, 2020

jaki mentioned this issue Nov 5, 2020

[YSQL] Clear table ALTERING state whenever there's no alter #6284

Closed

jaki mentioned this issue Apr 5, 2021

Node goes down while CREATE INDEX is running, Index does not continue running after node comes back up #7923

Open

yugabyte-ci added the priority/medium Medium priority issue label Jun 9, 2022

tverona1 added lifecycle/open and removed lifecycle/open labels Jun 9, 2022

yugabyte-ci unassigned jaki Jul 9, 2022

amitanandaiyer self-assigned this Sep 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YSQL] Resume backfill on master leader failover #6218

[YSQL] Resume backfill on master leader failover #6218

jaki commented Oct 30, 2020 •

edited by yugabyte-ci

[YSQL] Resume backfill on master leader failover #6218

[YSQL] Resume backfill on master leader failover #6218

Comments

jaki commented Oct 30, 2020 • edited by yugabyte-ci

jaki commented Oct 30, 2020 •

edited by yugabyte-ci