New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Resume backfill on master leader failover #6218
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Projects
Comments
jaki
added
kind/enhancement
This is an enhancement of an existing feature
area/ysql
Yugabyte SQL (YSQL)
labels
Oct 30, 2020
jaki
added a commit
that referenced
this issue
Oct 31, 2020
Summary: To improve YSQL index backfill failure handling, switch over to a new design: - Make `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information) - For the backfill step, add a postgres-master RPC `BackfillIndex` Now, `CREATE INDEX` with backfill enabled looks like 1. postgres: send create index request to master 1. master: create index table 1. master: alter index table to have index info at `WRITE_AND_DELETE` perm - don't set fully applied schema - set schema to `WRITE_AND_DELETE` - set `ALTERING` state 1. tservers: apply the alter to the indexed table 1. master: _don't_ move on to the next permission - don't set fully applied schema - keep schema at `WRITE_AND_DELETE` 1. postgres: create postgres index at `indislive` permission 1. postgres: add `indisready` permission 1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions 1. master: move on to backfilling 1. master: get safe time for read 1. tservers: get safe time 1. master: send backfill requests to tservers 1. tservers: send backfill requests to postgres 1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`) - clear fully applied schema - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING` - clear `ALTERING` state 1. postgres: finish waiting and, on success, add `indisvalid` permission If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index. Add some gflags: - master: - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info--issue #6234) - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver backfill tablet rpc calls (useful to handle large tables because we currently don't respect the deadline and try to backfill the entire tablet at once--issue #5326) - tserver: - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing `pg_index` state flag change (currently helpful, but not foolproof for consistency, when, in our current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes--issue #6238) - `ysql_wait_until_index_permissions_timeout_ms`: timeout for `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables--issue #5326) Add some helper functions: - `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused) - `yb::pgwrapper::GetBool` for pgwrapper tests Adjust some tests to the new backfill model: - `PgLibPqTest.BackfillPermissions` - `PgLibPqTest.BackfillReadTime` Fix `BackfillIndexesForYsql` to better handle libpq connections and results. Other issues: - `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call - An in-progress master to tserver backfill RPC will not notice a DROP INDEX removes index tablets - master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218) Close: #5325 Test Plan: - `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter` - `PgLibPqTest.BackfillWaitBackfillTimeout` - `PgLibPqTest.BackfillDropAfterFail` - `PgLibPqTest.BackfillMasterLeaderStepdown` - `PgLibPqTest.BackfillDropWhileBackfilling` - `PgLibPqTest.BackfillAuth` Reviewers: amitanand, mihnea Reviewed By: mihnea Subscribers: yql, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D9682
jaki
added a commit
that referenced
this issue
Oct 31, 2020
Summary: To improve YSQL index backfill failure handling, switch over to a new design: - Make `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information) - For the backfill step, add a postgres-master RPC `BackfillIndex` Now, `CREATE INDEX` with backfill enabled looks like 1. postgres: send create index request to master 1. master: create index table 1. master: alter index table to have index info at `WRITE_AND_DELETE` perm - don't set fully applied schema - set schema to `WRITE_AND_DELETE` - set `ALTERING` state 1. tservers: apply the alter to the indexed table 1. master: _don't_ move on to the next permission - don't set fully applied schema - keep schema at `WRITE_AND_DELETE` 1. postgres: create postgres index at `indislive` permission 1. postgres: add `indisready` permission 1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions 1. master: move on to backfilling 1. master: get safe time for read 1. tservers: get safe time 1. master: send backfill requests to tservers 1. tservers: send backfill requests to postgres 1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`) - clear fully applied schema - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING` - clear `ALTERING` state 1. postgres: finish waiting and, on success, add `indisvalid` permission If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index. Add some gflags: - master: - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info--issue #6234) - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver backfill tablet rpc calls (useful to handle large tables because we currently don't respect the deadline and try to backfill the entire tablet at once--issue #5326) - tserver: - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing `pg_index` state flag change (currently helpful, but not foolproof for consistency, when, in our current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes--issue #6238) - `ysql_wait_until_index_permissions_timeout_ms`: timeout for `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables--issue #5326) Add some helper functions: - `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused) - `yb::pgwrapper::GetBool` for pgwrapper tests Adjust some tests to the new backfill model: - `PgLibPqTest.BackfillPermissions` - `PgLibPqTest.BackfillReadTime` Fix `BackfillIndexesForYsql` to better handle libpq connections and results. Other issues: - `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call - An in-progress master to tserver backfill RPC will not notice a DROP INDEX removes index tablets - master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218) Depends on D9803 Test Plan: Jenkins: rebase: 2.3 - `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter` - `PgLibPqTest.BackfillWaitBackfillTimeout` - `PgLibPqTest.BackfillDropAfterFail` - `PgLibPqTest.BackfillMasterLeaderStepdown` - `PgLibPqTest.BackfillDropWhileBackfilling` - `PgLibPqTest.BackfillAuth` Reviewers: mihnea Reviewed By: mihnea Subscribers: yql, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D9804
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Jira Link: DB-2052
An upcoming fix for YSQL index backfill will make it more stable but consequently cause backfill to not get retried on failures like master leader failover. Once backfill is stabilized, bring this feature back.
The text was updated successfully, but these errors were encountered: