Skip to content

Commit

Permalink
[backport 2.3][YSQL] improve index backfill failure handling (#5325)
Browse files Browse the repository at this point in the history
Summary:
To improve YSQL index backfill failure handling, switch over to a new
design:

- Make `pg_index` permissions `indislive`, `indisready`, `indisvalid`
  the authority, and don't really use the DocDB permissions (only use
  three of them for persisting state information)
- For the backfill step, add a postgres-master RPC `BackfillIndex`

Now, `CREATE INDEX` with backfill enabled looks like

1. postgres: send create index request to master
1. master: create index table
1. master: alter index table to have index info at
   `WRITE_AND_DELETE` perm
   - don't set fully applied schema
   - set schema to `WRITE_AND_DELETE`
   - set `ALTERING` state
1. tservers: apply the alter to the indexed table
1. master: _don't_ move on to the next permission
   - don't set fully applied schema
   - keep schema at `WRITE_AND_DELETE`
1. postgres: create postgres index at `indislive` permission
1. postgres: add `indisready` permission
1. postgres: send backfill request and wait for at least
   `READ_WRITE_AND_DELETE` permissions
1. master: move on to backfilling
1. master: get safe time for read
1. tservers: get safe time
1. master: send backfill requests to tservers
1. tservers: send backfill requests to postgres
1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort
   (`WRITE_AND_DELETE_WHILE_REMOVING`)
   - clear fully applied schema
   - set schema to `READ_WRITE_AND_DELETE` or
     `WRITE_AND_DELETE_WHILE_REMOVING`
   - clear `ALTERING` state
1. postgres: finish waiting and, on success, add `indisvalid` permission

If postgres dies before backfill, master isn't stuck, and a separate
request can be made to drop the half-created index.  If postgres dies
during or after backfill, we can still drop the index, but you may need
to kill any backfills in progress.

If master fails to backfill, postgres will just stop at that point and
not set the index public, so a separate request can be made to drop the
index.

Add some gflags:

- master:
  - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding
    index info in table for YSQL `IsCreateTableDone` (currently helpful,
    but not foolproof, for preventing backfill from happening while a
    tserver doesn't have the index info--issue #6234)
  - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver
    backfill tablet rpc calls (useful to handle large tables because we
    currently don't respect the deadline and try to backfill the entire
    tablet at once--issue #5326)
- tserver:
  - `TEST_ysql_index_state_flags_update_delay_ms`: delay after
    committing `pg_index` state flag change (currently helpful, but not
    foolproof for consistency, when, in our current design, commits
    aren't guaranteed to have been seen by all tservers by the time the
    commit finishes--issue #6238)
  - `ysql_wait_until_index_permissions_timeout_ms`: timeout for
    `WaitUntilIndexPermissionsAtLeast` client function (again, useful
    for handling large tables--issue #5326)

Add some helper functions:

- `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster
  tests (currently unused)
- `yb::pgwrapper::GetBool` for pgwrapper tests

Adjust some tests to the new backfill model:

- `PgLibPqTest.BackfillPermissions`
- `PgLibPqTest.BackfillReadTime`

Fix `BackfillIndexesForYsql` to better handle libpq connections and
results.

Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the
  long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master to tserver backfill RPC will not notice a DROP
  INDEX removes index tablets
- master leader failover during backfill will not cause backfill to
  resume, unlike YCQL (issue #6218)

Depends on D9803

Test Plan:
Jenkins: rebase: 2.3

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9804
  • Loading branch information
jaki committed Oct 31, 2020
1 parent 9a0b8a0 commit 7b0d02f
Show file tree
Hide file tree
Showing 28 changed files with 747 additions and 179 deletions.
84 changes: 33 additions & 51 deletions src/postgres/src/backend/commands/indexcmds.c
Expand Up @@ -1258,86 +1258,68 @@ DefineIndex(Oid relationId,

PopActiveSnapshot();

elog(LOG, "committing pg_index tuple with indislive=true");
/*
* TODO(jason): retry backfill or revert schema changes instead of failing
* through HandleYBStatus.
*/
elog(LOG, "waiting for YB_INDEX_PERM_DELETE_ONLY");
HandleYBStatus(YBCPgWaitUntilIndexPermissionsAtLeast(MyDatabaseId,
relationId,
indexRelationId,
YB_INDEX_PERM_DELETE_ONLY,
&actual_index_permissions));
/*
* TODO(jason): handle bad actual_index_permissions.
* No need to break (abort) ongoing txns since this is an online schema
* change.
* TODO(jason): handle nested CREATE INDEX (this assumes we're at nest
* level 1).
*/

elog(LOG, "committing pg_index tuple with indislive=true");
/* TODO(jason): handle nested CREATE INDEX (this assumes we're at nest
* level 1). */
/* No need to break (abort) ongoing txns since this is an online schema change */
YBDecrementDdlNestingLevel(true /* success */,
YBDecrementDdlNestingLevel(true /* success */,
true /* is_catalog_version_increment */,
false /* is_breaking_catalog_change */);
CommitTransactionCommand();

StartTransactionCommand();
YBIncrementDdlNestingLevel();

/*
* TODO(jason): retry backfill or revert schema changes instead of failing
* through HandleYBStatus.
*/
HandleYBStatus(YBCPgAsyncUpdateIndexPermissions(MyDatabaseId, relationId));
elog(LOG, "waiting for YB_INDEX_PERM_WRITE_AND_DELETE");
HandleYBStatus(YBCPgWaitUntilIndexPermissionsAtLeast(MyDatabaseId,
relationId,
indexRelationId,
YB_INDEX_PERM_WRITE_AND_DELETE,
&actual_index_permissions));
/*
* TODO(jason): handle bad actual_index_permissions.
* Delay after committing pg_index update. Although it is controlled by a
* test flag, it currently helps (but does not guarantee) correctness
* because commits may not have propagated to all tservers by this time.
*/
pg_usleep(YBCGetTestIndexStateFlagsUpdateDelayMs() * 1000);

StartTransactionCommand();
YBIncrementDdlNestingLevel();

/*
* Update the pg_index row to mark the index as ready for inserts. This
* allows writes, but Yugabyte would only accept deletes until the index
* permission changes to INDEX_PERM_WRITE_AND_DELETE.
* Update the pg_index row to mark the index as ready for inserts.
*/
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);

elog(LOG, "committing pg_index tuple with indisready=true");
/*
* Commit this transaction to make the indisready update visible.
* No need to break (abort) ongoing txns since this is an online schema
* change.
* TODO(jason): handle nested CREATE INDEX (this assumes we're at nest
* level 1).
*/
elog(LOG, "committing pg_index tuple with indisready=true");
/* TODO(jason): handle nested CREATE INDEX (this assumes we're at nest
* level 1). */
/* No need to break (abort) ongoing txns since this is an online schema change */
YBDecrementDdlNestingLevel(true /* success */,
YBDecrementDdlNestingLevel(true /* success */,
true /* is_catalog_version_increment */,
false /* is_breaking_catalog_change */);
CommitTransactionCommand();

/*
* Delay after committing pg_index update. Although it is controlled by a
* test flag, it currently helps (but does not guarantee) correctness
* because commits may not have propagated to all tservers by this time.
*/
pg_usleep(YBCGetTestIndexStateFlagsUpdateDelayMs() * 1000);

StartTransactionCommand();
YBIncrementDdlNestingLevel();

/* TODO(jason): handle exclusion constraints, possibly not here. */

/*
* TODO(jason): retry backfill or revert schema changes instead of failing
* through HandleYBStatus.
*/
HandleYBStatus(YBCPgAsyncUpdateIndexPermissions(MyDatabaseId, relationId));
elog(LOG, "waiting for Yugabyte index read permission");
/* Do backfill. */
HandleYBStatus(YBCPgBackfillIndex(MyDatabaseId, indexRelationId));
HandleYBStatus(YBCPgWaitUntilIndexPermissionsAtLeast(MyDatabaseId,
relationId,
indexRelationId,
YB_INDEX_PERM_READ_WRITE_AND_DELETE,
&actual_index_permissions));
/*
* TODO(jason): handle bad actual_index_permissions, like
* YB_INDEX_PERM_WRITE_AND_DELETE_WHILE_REMOVING.
*/
if (actual_index_permissions != YB_INDEX_PERM_READ_WRITE_AND_DELETE)
ereport(ERROR,
(errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
errmsg("index backfill failed")));

/*
* Index can now be marked valid -- update its pg_index entry
Expand Down
30 changes: 30 additions & 0 deletions src/yb/client/client-internal.cc
Expand Up @@ -301,6 +301,7 @@ YB_CLIENT_SPECIALIZE_SIMPLE(IsAlterTableDone);
YB_CLIENT_SPECIALIZE_SIMPLE(IsFlushTablesDone);
YB_CLIENT_SPECIALIZE_SIMPLE(IsCreateTableDone);
YB_CLIENT_SPECIALIZE_SIMPLE(IsTruncateTableDone);
YB_CLIENT_SPECIALIZE_SIMPLE(BackfillIndex);
YB_CLIENT_SPECIALIZE_SIMPLE(IsDeleteTableDone);
YB_CLIENT_SPECIALIZE_SIMPLE(IsLoadBalanced);
YB_CLIENT_SPECIALIZE_SIMPLE(IsLoadBalancerIdle);
Expand Down Expand Up @@ -761,6 +762,35 @@ Status YBClient::Data::AlterNamespace(YBClient* client,
return Status::OK();
}

Status YBClient::Data::BackfillIndex(YBClient* client,
const YBTableName& table_name,
const TableId& table_id,
CoarseTimePoint deadline) {
BackfillIndexRequestPB req;
BackfillIndexResponsePB resp;

if (table_name.has_table()) {
table_name.SetIntoTableIdentifierPB(req.mutable_table_identifier());
}
if (!table_id.empty()) {
req.mutable_table_identifier()->set_table_id(table_id);
}

RETURN_NOT_OK((SyncLeaderMasterRpc<BackfillIndexRequestPB, BackfillIndexResponsePB>(
deadline,
req,
&resp,
nullptr /* num_attempts */,
"BackfillIndex",
&MasterServiceProxy::BackfillIndex)));
if (resp.has_error()) {
return StatusFromPB(resp.error().status());
}

LOG(INFO) << "Initiated backfill index " << req.table_identifier().ShortDebugString();
return Status::OK();
}

Status YBClient::Data::IsCreateNamespaceInProgress(
YBClient* client,
const std::string& namespace_name,
Expand Down
5 changes: 5 additions & 0 deletions src/yb/client/client-internal.h
Expand Up @@ -165,6 +165,11 @@ class YBClient::Data {
const std::string& table_id,
CoarseTimePoint deadline);

CHECKED_STATUS BackfillIndex(YBClient* client,
const YBTableName& table_name,
const TableId& table_id,
CoarseTimePoint deadline);

CHECKED_STATUS AlterTable(YBClient* client,
const master::AlterTableRequestPB& req,
CoarseTimePoint deadline);
Expand Down
5 changes: 5 additions & 0 deletions src/yb/client/client.cc
Expand Up @@ -496,6 +496,11 @@ Status YBClient::TruncateTables(const vector<string>& table_ids, bool wait) {
return data_->TruncateTables(this, table_ids, deadline, wait);
}

Status YBClient::BackfillIndex(const TableId& table_id) {
auto deadline = CoarseMonoClock::Now() + default_admin_operation_timeout();
return data_->BackfillIndex(this, YBTableName(), table_id, deadline);
}

Status YBClient::DeleteTable(const YBTableName& table_name, bool wait) {
auto deadline = CoarseMonoClock::Now() + default_admin_operation_timeout();
return data_->DeleteTable(this,
Expand Down
3 changes: 3 additions & 0 deletions src/yb/client/client.h
Expand Up @@ -285,6 +285,9 @@ class YBClient {
CHECKED_STATUS TruncateTable(const std::string& table_id, bool wait = true);
CHECKED_STATUS TruncateTables(const std::vector<std::string>& table_ids, bool wait = true);

// Backfill the specified index table. This is only supported for YSQL at the moment.
CHECKED_STATUS BackfillIndex(const TableId& table_id);

// Delete the specified table.
// Set 'wait' to true if the call must wait for the table to be fully deleted before returning.
CHECKED_STATUS DeleteTable(const YBTableName& table_name, bool wait = true);
Expand Down
7 changes: 7 additions & 0 deletions src/yb/integration-tests/external_mini_cluster.cc
Expand Up @@ -1461,6 +1461,13 @@ Status ExternalMiniCluster::SetFlag(ExternalDaemon* daemon,
return Status::OK();
}

Status ExternalMiniCluster::SetFlagOnMasters(const string& flag, const string& value) {
for (const auto& master : masters_) {
RETURN_NOT_OK(SetFlag(master.get(), flag, value));
}
return Status::OK();
}

Status ExternalMiniCluster::SetFlagOnTServers(const string& flag, const string& value) {
for (const auto& tablet_server : tablet_servers_) {
RETURN_NOT_OK(SetFlag(tablet_server.get(), flag, value));
Expand Down
2 changes: 2 additions & 0 deletions src/yb/integration-tests/external_mini_cluster.h
Expand Up @@ -373,6 +373,8 @@ class ExternalMiniCluster : public MiniClusterBase {
const std::string& flag,
const std::string& value);

// Sets the given flag on all masters.
CHECKED_STATUS SetFlagOnMasters(const std::string& flag, const std::string& value);
// Sets the given flag on all tablet servers.
CHECKED_STATUS SetFlagOnTServers(const std::string& flag, const std::string& value);

Expand Down
58 changes: 51 additions & 7 deletions src/yb/master/backfill_index.cc
Expand Up @@ -113,8 +113,14 @@

#include "yb/tserver/remote_bootstrap_client.h"

DEFINE_int32(ysql_index_backfill_rpc_timeout_ms, 30 * 60 * 1000, // 30 min.
"Timeout used by the master when attempting to backfill a YSQL tablet during index "
"creation.");
TAG_FLAG(ysql_index_backfill_rpc_timeout_ms, advanced);
TAG_FLAG(ysql_index_backfill_rpc_timeout_ms, runtime);

DEFINE_int32(index_backfill_rpc_timeout_ms, 1 * 30 * 1000, // 30 sec.
"Timeout used by the master when attempting to backfilll a tablet "
"Timeout used by the master when attempting to backfill a tablet "
"during index creation.");
TAG_FLAG(index_backfill_rpc_timeout_ms, advanced);
TAG_FLAG(index_backfill_rpc_timeout_ms, runtime);
Expand Down Expand Up @@ -364,7 +370,7 @@ Status MultiStageAlterTable::StartBackfillingData(
const scoped_refptr<TableInfo> &indexed_table, const IndexInfoPB index_pb) {
if (indexed_table->IsBackfilling()) {
LOG(WARNING) << __func__ << " Not starting backfill for "
<< indexed_table->ToString() << " one is already in progress ";
<< indexed_table->ToString() << ": one is already in progress";
return STATUS(AlreadyPresent, "Backfill already in progress");
}

Expand Down Expand Up @@ -394,7 +400,7 @@ Status MultiStageAlterTable::StartBackfillingData(
ns_identifier.set_id(indexed_table->namespace_id());
RETURN_NOT_OK_PREPEND(
catalog_manager->FindNamespace(ns_identifier, &ns_info),
"Getting namespace info for backfill");
"Unable to get namespace info for backfill");
}
auto backfill_table = std::make_shared<BackfillTable>(
catalog_manager->master_, catalog_manager->AsyncTaskPool(),
Expand Down Expand Up @@ -440,6 +446,8 @@ Status MultiStageAlterTable::LaunchNextTableInfoVersionIfNecessary(
uint32_t current_version) {
DVLOG(3) << __PRETTY_FUNCTION__ << yb::ToString(*indexed_table);

const bool is_ysql_table = (indexed_table->GetTableType() == TableType::PGSQL_TABLE_TYPE);

std::unordered_map<TableId, IndexPermissions> indexes_to_update;
vector<IndexInfoPB> indexes_to_backfill;
vector<IndexInfoPB> indexes_to_delete;
Expand All @@ -462,7 +470,10 @@ Status MultiStageAlterTable::LaunchNextTableInfoVersionIfNecessary(
indexes_to_backfill.emplace_back(idx_pb);
} else if (idx_pb.index_permissions() == INDEX_PERM_INDEX_UNUSED) {
indexes_to_delete.emplace_back(idx_pb);
} else if (idx_pb.index_permissions() != INDEX_PERM_READ_WRITE_AND_DELETE) {
// For YSQL, INDEX_PERM_WRITE_AND_DELETE_WHILE_REMOVING is considered a terminal state.
} else if (idx_pb.index_permissions() != INDEX_PERM_READ_WRITE_AND_DELETE &&
!(is_ysql_table &&
idx_pb.index_permissions() == INDEX_PERM_WRITE_AND_DELETE_WHILE_REMOVING)) {
indexes_to_update.emplace(idx_pb.table_id(), NextPermission(idx_pb.index_permissions()));
}
}
Expand All @@ -471,12 +482,40 @@ Status MultiStageAlterTable::LaunchNextTableInfoVersionIfNecessary(
if (indexes_to_update.empty() &&
indexes_to_delete.empty() &&
indexes_to_backfill.empty()) {

TRACE("Not necessary to launch next version");
VLOG(1) << "Not necessary to launch next version";
return ClearAlteringState(catalog_manager, indexed_table, current_version);
}

// For YSQL online schema migration of indexes, instead of master driving the schema changes,
// postgres will drive it. Postgres will use three of the DocDB index permissions:
//
// - INDEX_PERM_WRITE_AND_DELETE (set from the start)
// - INDEX_PERM_READ_WRITE_AND_DELETE (set by master)
// - INDEX_PERM_WRITE_AND_DELETE_WHILE_REMOVING (set by master)
//
// This changes how we treat indexes_to_foo:
//
// - indexes_to_update shouldn't cause moving on to the next permission because we don't want
// master to control that. Do nothing when nonempty.
// - indexes_to_delete is impossible to be nonempty, and, in the future, when we do use
// INDEX_PERM_INDEX_UNUSED, we want to use some other delete trigger that makes sure no
// transactions are left using the index. Prepare for that by doing nothing when nonempty.
// - indexes_to_backfill is impossible to be nonempty, but, in the future, we want to set
// INDEX_PERM_DO_BACKFILL so that backfill resumes on master leader changes. Prepare for that
// by handling indexes_to_backfill like for YCQL.
//
// We still need to know indexes_to_foo.empty for ClearAlteringState logic since we want to keep
// the table as ALTERING if the index is in a transient state.
// TODO(jason): when using INDEX_PERM_DO_BACKFILL, update this comment (issue #6218).
if (is_ysql_table) {
if (indexes_to_backfill.empty()) {
return Status::OK();
}
indexes_to_update.clear();
indexes_to_delete.clear();
}

if (!indexes_to_update.empty()) {
Result<bool> permissions_updated =
VERIFY_RESULT(UpdateIndexPermission(catalog_manager, indexed_table, indexes_to_update,
Expand Down Expand Up @@ -1065,7 +1104,12 @@ void BackfillChunk::Launch() {

MonoTime BackfillChunk::ComputeDeadline() {
MonoTime timeout = MonoTime::Now();
timeout.AddDelta(MonoDelta::FromMilliseconds(FLAGS_index_backfill_rpc_timeout_ms));
if (GetTableType() == TableType::PGSQL_TABLE_TYPE) {
timeout.AddDelta(MonoDelta::FromMilliseconds(FLAGS_ysql_index_backfill_rpc_timeout_ms));
} else {
DCHECK(GetTableType() == TableType::YQL_TABLE_TYPE);
timeout.AddDelta(MonoDelta::FromMilliseconds(FLAGS_index_backfill_rpc_timeout_ms));
}
return MonoTime::Earliest(timeout, deadline_);
}

Expand All @@ -1085,7 +1129,7 @@ bool BackfillChunk::SendRequest(int attempt) {
req.set_read_at_hybrid_time(backfill_tablet_->read_time_for_backfill().ToUint64());
req.set_schema_version(backfill_tablet_->schema_version());
req.set_start_key(start_key_);
if (backfill_tablet_->tablet()->table()->GetTableType() == TableType::PGSQL_TABLE_TYPE) {
if (GetTableType() == TableType::PGSQL_TABLE_TYPE) {
req.set_namespace_name(backfill_tablet_->GetNamespaceName());
}
for (const IndexInfoPB& idx_info : backfill_tablet_->indexes()) {
Expand Down
4 changes: 4 additions & 0 deletions src/yb/master/backfill_index.h
Expand Up @@ -346,6 +346,10 @@ class BackfillChunk : public RetryingTSRpcTask {
int num_max_retries() override;
int max_delay_ms() override;

TableType GetTableType() const {
return backfill_tablet_->tablet()->table()->GetTableType();
}

tserver::BackfillIndexResponsePB resp_;
std::shared_ptr<BackfillTablet> backfill_tablet_;
std::string start_key_;
Expand Down

0 comments on commit 7b0d02f

Please sign in to comment.