
[YSQL] Propagate catalog changes more reliably #6238

Open
jaki opened this issue Oct 31, 2020 · 8 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Comments

jaki commented Oct 31, 2020

Jira Link: DB-1446
This is a broad issue. Hopefully, I'm not creating a duplicate here.

The model of having a single catalog version has its own flaws; here, let's focus on the fact that tservers pull catalog version changes from the master via heartbeats.

Heartbeats take time and can be lost, so it's easy for the YSQL catalog version cached on tservers to lag behind the master's.

A YSQL DDL transaction commits its changes, which writes to the system tables on the master's sys catalog tablet, and then waits one heartbeat interval (according to internal docs; I didn't verify). If a tserver doesn't pull in the catalog version bump caused by those system table writes, it will be behind! The commit reached the master but possibly not all of the tservers.

One example of where this can be a problem is the new index backfill design, which relies more heavily on pg_index system table commits to determine the schema. We do not want nodes to be more than two schema versions apart, and that may be possible if commits aren't reliably propagated to all of the tservers.
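
To make the heartbeat model concrete, here is a minimal sketch (assumed shape, not the actual YugabyteDB code) of a tserver-side catalog version cache that only advances when a heartbeat response arrives; any DDL that bumps the version on the master stays invisible to the tserver until the next successfully processed heartbeat:

#include <atomic>
#include <cstdint>

struct HeartbeatResponse {
  uint64_t ysql_catalog_version;  // piggybacked on every master heartbeat response
};

class TserverCatalogVersionCache {
 public:
  // Called when a heartbeat response comes back from the master.
  void ProcessHeartbeatResponse(const HeartbeatResponse& resp) {
    // Monotonically advance the locally cached version.
    uint64_t seen = local_version_.load();
    while (seen < resp.ysql_catalog_version &&
           !local_version_.compare_exchange_weak(seen, resp.ysql_catalog_version)) {
    }
  }

  // A DDL committed on the master at version N stays invisible here until a
  // heartbeat carrying N (or later) has been processed.
  uint64_t local_version() const { return local_version_.load(); }

 private:
  std::atomic<uint64_t> local_version_{0};
};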

@jaki jaki added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Oct 31, 2020
@jaki jaki assigned m-iancu and jaki Oct 31, 2020
@jaki jaki added this to Backlog in YSQL via automation Oct 31, 2020
jaki added a commit that referenced this issue Oct 31, 2020
Summary:

To improve YSQL index backfill failure handling, switch over to a new
design:

- Make `pg_index` permissions `indislive`, `indisready`, `indisvalid`
  the authority, and don't really use the DocDB permissions (only use
  three of them for persisting state information)
- For the backfill step, add a postgres-master RPC `BackfillIndex`

Now, `CREATE INDEX` with backfill enabled looks like

1. postgres: send create index request to master
1. master: create index table
1. master: alter index table to have index info at
   `WRITE_AND_DELETE` perm
   - don't set fully applied schema
   - set schema to `WRITE_AND_DELETE`
   - set `ALTERING` state
1. tservers: apply the alter to the indexed table
1. master: _don't_ move on to the next permission
   - don't set fully applied schema
   - keep schema at `WRITE_AND_DELETE`
1. postgres: create postgres index at `indislive` permission
1. postgres: add `indisready` permission
1. postgres: send backfill request and wait for at least
   `READ_WRITE_AND_DELETE` permissions
1. master: move on to backfilling
1. master: get safe time for read
1. tservers: get safe time
1. master: send backfill requests to tservers
1. tservers: send backfill requests to postgres
1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort
   (`WRITE_AND_DELETE_WHILE_REMOVING`)
   - clear fully applied schema
   - set schema to `READ_WRITE_AND_DELETE` or
     `WRITE_AND_DELETE_WHILE_REMOVING`
   - clear `ALTERING` state
1. postgres: finish waiting and, on success, add `indisvalid` permission

If postgres dies before backfill, master isn't stuck, and a separate
request can be made to drop the half-created index.  If postgres dies
during or after backfill, we can still drop the index, but you may need
to kill any backfills in progress.

If master fails to backfill, postgres will just stop at that point and
not set the index public, so a separate request can be made to drop the
index.

Add some gflags:

- master:
  - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding
    index info in table for YSQL `IsCreateTableDone` (currently helpful,
    but not foolproof, for preventing backfill from happening while a
    tserver doesn't have the index info--issue #6234)
  - `ysql_index_backfill_rpc_timeout_ms`: deadline for master to tserver
    backfill tablet rpc calls (useful to handle large tables because we
    currently don't respect the deadline and try to backfill the entire
    tablet at once--issue #5326)
- tserver:
  - `TEST_ysql_index_state_flags_update_delay_ms`: delay after
    committing `pg_index` state flag change (currently helpful, but not
    foolproof for consistency, when, in our current design, commits
    aren't guaranteed to have been seen by all tservers by the time the
    commit finishes--issue #6238)
  - `ysql_wait_until_index_permissions_timeout_ms`: timeout for
    `WaitUntilIndexPermissionsAtLeast` client function (again, useful
    for handling large tables--issue #5326)

Add some helper functions:

- `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster
  tests (currently unused)
- `yb::pgwrapper::GetBool` for pgwrapper tests

Adjust some tests to the new backfill model:

- `PgLibPqTest.BackfillPermissions`
- `PgLibPqTest.BackfillReadTime`

Fix `BackfillIndexesForYsql` to better handle libpq connections and
results.

Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the
  long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master-to-tserver backfill RPC will not notice that a DROP
  INDEX has removed the index tablets
- master leader failover during backfill will not cause backfill to
  resume, unlike YCQL (issue #6218)

Close: #5325

Test Plan:

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: amitanand, mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9682
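
For illustration only, here is a rough sketch of the "postgres: send backfill request and wait for at least READ_WRITE_AND_DELETE permissions" step from the flow above (hypothetical names; not the actual YBCPgWaitUntilIndexPermissionsAtLeast implementation): postgres polls the index permissions reported by the master until they reach the target or a timeout expires.

#include <chrono>
#include <functional>
#include <thread>

// Ordered so that a later enumerator means "at least" the earlier ones.
enum class IndexPermissions {
  kWriteAndDelete = 0,
  kReadWriteAndDelete = 1,
};

// Poll the master-reported index permissions (supplied by get_permissions)
// until they reach target or timeout expires.
bool WaitForIndexPermissionsAtLeast(
    const std::function<IndexPermissions()>& get_permissions,
    IndexPermissions target,
    std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (get_permissions() >= target) {
      return true;   // success: the caller can go on to set indisvalid
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  return false;      // timed out: the index stays non-public and can be dropped
}
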
jaki added a commit that referenced this issue Oct 31, 2020
Summary: (identical to the summary of the previous commit)

Depends on D9803

Test Plan:
Jenkins: rebase: 2.3
(otherwise identical to the test plan of the previous commit)

Reviewers: mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9804
jaki commented Dec 15, 2020

I believe that the breaking catalog change system is flawed. Here's an example of what can happen:

  1. ts-1 is at catalog version 2; ts-2 is at catalog version 1
  2. postgres-2 sees catalog version 1 and sends a request to ts-2
  3. ts-2 heartbeats and gets catalog version 2
  4. master goes to catalog version 3
  5. ts-1 heartbeats and gets catalog version 3
  6. postgres-1 sees catalog version 3 and sends a request to ts-1
  7. ts-1 receives the request, and it goes through
  8. ts-2 receives the request and sees that the request's catalog version 1 != its own version 2, but (say) that was not a breaking catalog change, so the request goes through

This is unlikely in practice because table schema versions catch it. However, for index backfill, which tries to do away with DocDB table schema changes, this issue is probably possible.
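
A minimal sketch (assumed semantics, not the actual tserver code) of the kind of check step 8 relies on: a request stamped with an old catalog version is still accepted as long as no breaking change has happened since that version, so being merely behind the current version is tolerated:

#include <cstdint>

struct LocalCatalogVersionInfo {
  uint64_t current_version;        // e.g. 2 on ts-2 in the example above
  uint64_t last_breaking_version;  // highest version that was a breaking change
};

// Returns true if a request stamped with request_version would be accepted.
bool AcceptRequest(const LocalCatalogVersionInfo& local, uint64_t request_version) {
  // Only requests older than the last breaking change are rejected; the
  // mismatch 1 != 2 alone does not reject the request.
  return request_version >= local.last_breaking_version;
}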

m-iancu commented Dec 15, 2020

@jaki Can you give a more concrete example, or elaborate on the current one? I'm not sure I follow.
Is the problem that 2->3 could be a breaking change and it would be allowed by step 8 in your example?
In that case, how does the 1->2 change even matter? Also, we know there is a delay in version propagation that we need to account for, and that can cause such issues. Or is this a different, more subtle issue?

jaki commented Dec 15, 2020

@m-iancu ,

First, postgres only checks the catalog version on either

  • retry loop (each query)
  • version mismatch error

If postgres spends a long time on a query, it won't notice catalog version changes.

Second, writes that are not part of a transaction get a write time assigned on the tserver side (see MvccManager::AddPending). If the network is really bad and packets take a while to reach the tserver, we can end up with a delayed write.

Using those two, here's an example I can think of. I have not proven it or tried to produce it.

  1. all are at catalog version 1
  2. pg-2: delete B in progress...
  3. all are at catalog version 2 (breaking version 1) for CREATE INDEX indislive true
  4. ts-1 and pg-1 go to catalog version 3 (breaking version 2) for CREATE INDEX indisready true
  5. pg-1: insert B goes to ts-2 (fine because v3 > v1 and TabletServiceImpl::Write only checks whether the request is behind, not ahead), and it also writes to the index because indisready is true
  6. pg-2: ...the delayed delete B finally goes through to ts-2 (fine because v1 = v1), but it doesn't write to the index because it isn't aware of the index
  7. the index is now inconsistent: it has a row that doesn't exist in the indexed table

I fact-checked only some of these things, so please correct me if I'm wrong on any point.
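
On the second point, a rough sketch (assumed shape, not MvccManager itself) of why a network-delayed write matters: the write's hybrid time comes from the tserver's clock at the moment the operation is added on the leader, not from when the client issued it, so a request that sat on the wire is stamped later than operations submitted after it from other nodes:

#include <chrono>
#include <cstdint>

using HybridTimeMicros = uint64_t;

// Stand-in for clock_->Now(): the tserver's current time in microseconds.
HybridTimeMicros Now() {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

struct PendingOp {
  uint64_t op_id;
  HybridTimeMicros write_time;
};

// Analogue of AddLeaderPending: the write time is chosen here, on arrival at
// the leader tserver, regardless of how long the request spent in transit.
PendingOp AddLeaderPending(uint64_t op_id) {
  return PendingOp{op_id, Now()};
}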

jaki commented Dec 16, 2020

The above concrete example is actually covered by table schema version mismatch: pg-2's table schema doesn't have the index while ts-2's table schema does. The example can probably be shifted around to become a problem for DROP INDEX (when we start doing it online).

jaki commented Nov 30, 2021

To add to the write time derivation (I previously only mentioned MvccManager::AddPending), here is the call chain and the relevant code:

(gdb) bt
#0  yb::tablet::MvccManager::AddPending (this=this@entry=0x42b8790, ht=..., op_id=..., is_follower_side=is_follower_side@entry=false) at ../../src/yb/tablet/mvcc.cc:335
#1  0x00007f5087710b49 in yb::tablet::MvccManager::AddLeaderPending (this=0x42b8790, op_id=...) at ../../src/yb/tablet/mvcc.cc:308
#2  0x00007f50876f2cc6 in yb::tablet::Operation::AddedToLeader (this=0x431d080, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation.cc:115
#3  0x00007f50876f6efb in yb::tablet::OperationDriver::AddedToLeader (this=0x1b2c480, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation_driver.cc:203
#4  0x00007f508732488b in yb::consensus::RaftConsensus::AppendNewRoundsToQueueUnlocked (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1259
#5  0x00007f508731fc76 in yb::consensus::RaftConsensus::DoReplicateBatch (this=this@entry=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=processed_rounds@entry=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1166
#6  0x00007f508731fdf4 in yb::consensus::RaftConsensus::ReplicateBatch (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}) at ../../src/yb/consensus/raft_consensus.cc:1134
#7  0x00007f508772aceb in yb::tablet::PreparerImpl::ReplicateSubBatch (this=this@entry=0x4267140, batch_begin=..., batch_begin@entry=0x1b2c480, batch_end=..., batch_end@entry=0x0) at ../../src/yb/tablet/preparer.cc:332
#8  0x00007f508772b274 in yb::tablet::PreparerImpl::ProcessAndClearLeaderSideBatch (this=this@entry=0x4267140) at ../../src/yb/tablet/preparer.cc:298
#9  0x00007f508772b710 in yb::tablet::PreparerImpl::Run (this=0x4267140) at ../../src/yb/tablet/preparer.cc:190
#10 0x00007f507ed55a94 in yb::ThreadPool::DispatchThread (this=0x1d95200, permanent=true) at ../../src/yb/util/threadpool.cc:611
#11 0x00007f507ed50b65 in operator() (this=0x1a3a058) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#12 yb::Thread::SuperviseThread (arg=0x1a3a000) at ../../src/yb/util/thread.cc:774
#13 0x00007f5079206694 in start_thread (arg=0x7f50764e5700) at pthread_create.c:333
#14 0x00007f5078f4841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
HybridTime MvccManager::AddLeaderPending(const OpId& op_id) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto ht = clock_->Now();
  AtomicFlagSleepMs(&FLAGS_TEST_inject_mvcc_delay_add_leader_pending_ms);
  VLOG_WITH_PREFIX(1) << __func__ << "(" << op_id << "), time: " << ht;
  AddPending(ht, op_id, /* is_follower_side= */ false);
  ...
  return ht;
}

void Operation::AddedToLeader(const OpId& op_id, const OpId& committed_op_id) {
  HybridTime hybrid_time;
  if (use_mvcc()) {
    hybrid_time = tablet_->mvcc_manager()->AddLeaderPending(op_id);
  } else {
    hybrid_time = tablet_->clock()->Now();
  }

  {
    std::lock_guard<simple_spinlock> l(mutex_);
    hybrid_time_ = hybrid_time;
    ...

HybridTime hybrid_time() const {
  std::lock_guard<simple_spinlock> l(mutex_);
  DCHECK(hybrid_time_.is_valid());
  return hybrid_time_;
}

HybridTime Operation::WriteHybridTime() const {
  return hybrid_time();
}

HybridTime WriteOperation::WriteHybridTime() const {
  if (request()->has_external_hybrid_time()) {
    return HybridTime(request()->external_hybrid_time());
  }
  return Operation::WriteHybridTime();
}

void TabletServiceImpl::Write(const WriteRequestPB* req,
                              WriteResponsePB* resp,
                              rpc::RpcContext context) {
  ...
  auto operation = std::make_unique<WriteOperation>(
      tablet.leader_term, context.GetClientDeadline(), tablet.peer.get(),
      tablet.peer->tablet(), resp);
  ...

Status Tablet::ApplyOperation(
    const Operation& operation, int64_t batch_idx,
    const docdb::KeyValueWriteBatchPB& write_batch,
    AlreadyAppliedToRegularDB already_applied_to_regular_db) {
  auto hybrid_time = operation.WriteHybridTime();
  ...

Given tablets

  • table: 98f2baf9d47647f4afd344d02c7b15da
  • index: 7a0cc88740784b4a98efb3ffdbedf1bd

On insert, tserver logs show

I1129 21:08:06.547746  6229 mvcc.cc:307] T 7a0cc88740784b4a98efb3ffdbedf1bd P 2305d5364caa44b4aea3dca1a7286f8d: AddLeaderPending(1.4), time: { physical: 1638248886547743 }
I1129 21:08:06.547755  7149 mvcc.cc:307] T 98f2baf9d47647f4afd344d02c7b15da P 2305d5364caa44b4aea3dca1a7286f8d: AddLeaderPending(1.5), time: { physical: 1638248886547751 }
...
I1129 21:08:06.551976  6229 mvcc.cc:307] T 59fb1a59b1b943c19442b79902427fcd P 2305d5364caa44b4aea3dca1a7286f8d: AddLeaderPending(1.10), time: { physical: 1638248886551975 }
...
I1129 21:08:06.553159  7149 mvcc.cc:307] T 98f2baf9d47647f4afd344d02c7b15da P 2305d5364caa44b4aea3dca1a7286f8d: AddLeaderPending(1.6), time: { physical: 1638248886553158 }
I1129 21:08:06.553160  6229 mvcc.cc:307] T 7a0cc88740784b4a98efb3ffdbedf1bd P 2305d5364caa44b4aea3dca1a7286f8d: AddLeaderPending(1.5), time: { physical: 1638248886553158 logical: 1 }

sst_dump shows

SubDocKey(DocKey(0xa39b, ["J\xce\x8dP\xfeaH\"\xa3\xf2\xca\xd4y\xd5q\xb8"], []), [SystemColumnId(0); HT{ physical: 1638248886551975 }]) -> null; intent doc ht: HT{ physical: 1638248886547751 }
SubDocKey(DocKey(0xa39b, ["J\xce\x8dP\xfeaH\"\xa3\xf2\xca\xd4y\xd5q\xb8"], []), [ColumnId(1); HT{ physical: 1638248886551975 w: 1 }]) -> 2; intent doc ht: HT{ physical: 1638248886547751 w: 1 }

and

SubDocKey(DocKey(0xc0c4, [2], [EncodedSubDocKey(DocKey(0xa39b, ["J\xce\x8dP\xfeaH\"\xa3\xf2\xca\xd4y\xd5q\xb8"], []), [])]), [SystemColumnId(0); HT{ physical: 1638248886551975 }]) -> null; intent doc ht: HT{ physical: 1638248886547743 }

So it appears that ht = clock_->Now() is the time used for the inserted records, and this "Now" time is taken after index permissions are checked in execIndexing.c.

jaki commented Jan 22, 2022

Don't forget that the postgres catalog caches can be behind even if the tservers aren't. As far as I'm aware, the postgres catalog version is only checked

  • between statements
  • when RPCs hit master or tserver

A mismatch in the latter case will raise a catalog version mismatch error and (if possible) trigger a single transparent retry, but if the retry also fails, it's a bad user experience.
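
As an illustration of that single transparent retry (sketch only; the names and structure here are hypothetical, not the actual postgres/YSQL retry code): refresh the catalog caches once and re-run the statement, and a second mismatch is what the user ends up seeing:

#include <functional>
#include <string>

struct Status {
  bool ok = true;
  bool catalog_version_mismatch = false;
  std::string message;
};

// run executes the statement; refresh reloads the postgres catalog caches.
// Both are supplied by the caller in this sketch.
Status ExecuteWithSingleRetry(const std::string& sql,
                              const std::function<Status(const std::string&)>& run,
                              const std::function<void()>& refresh) {
  Status s = run(sql);
  if (s.ok || !s.catalog_version_mismatch) {
    return s;
  }
  refresh();        // reload caches after the first mismatch
  return run(sql);  // a second mismatch surfaces as an error to the user
}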

mbautin commented Jan 22, 2022

To the point of de-synchronization between postgres and tserver processes, this is one more reason to consider implementing #9008 in the long term.

@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Jun 8, 2022
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Sep 18, 2022