[xCluster] Support Atomic and Ordered multi-shard Transactions #10976

Status: Closed
rahuldesirazu opened this issue Jan 3, 2022 · 1 comment
Labels: area/docdb (YugabyteDB core features), kind/new-feature (This is a request for a completely new feature), priority/medium (Medium priority issue), xCluster (Label for xCluster related issues/improvements)

rahuldesirazu (Contributor) commented Jan 3, 2022

Jira Link: [DB-310](https://yugabyte.atlassian.net/browse/DB-310)

Product-Level Description and Requirements for the First Iteration

The Use Cases Supported

For the use case where only one cluster is taking writes at a given time, we will support the following transaction semantics on the target cluster:

  1. Multi-shard atomic reads: A transaction spanning tablets A and B will either be fully readable or not readable at all.
  2. Ordered multi-shard reads within a single session: If timestamp(txn1) < timestamp(txn2) on the source universe, then the user will never be able to read txn2 before txn1 in a single session.
  3. Consistent cutover on disaster recovery: When the target is promoted, it will have a clean cut of the system and preserve atomic and ordered reads.

For the above, multi-shard can refer to single-table, multi-table, or table-index operations. This encapsulates guarantees for transactional updates to a main table and its index, as well as for two separate transactions to a parent and then a child foreign-key table.
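A minimal sketch of the visibility rule the above guarantees reduce to, assuming the target picks a single consistent read timestamp for the whole universe. The types and names below are illustrative assumptions, not YugabyteDB code:

```cpp
#include <cstdint>
#include <string>

// Hypothetical types for illustration only.
using HybridTime = uint64_t;

struct ReplicatedWrite {
  std::string tablet_id;   // e.g. tablet A or tablet B touched by the same txn
  HybridTime commit_ht;    // commit hybrid time assigned on the source
  std::string key, value;
};

// Atomicity: every write of a transaction carries the same commit_ht, so
// either all of them satisfy commit_ht <= read_ht or none do, and a
// transaction spanning tablets A and B is read all-or-nothing.
// Ordering (single session): if commit_ht(txn1) < commit_ht(txn2), any
// read_ht that exposes txn2 also exposes txn1.
bool IsVisible(const ReplicatedWrite& w, HybridTime read_ht) {
  return w.commit_ht <= read_ht;
}
```

Because all writes of a transaction share one commit hybrid time, this single comparison yields both the atomic and the single-session ordering guarantees above.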

Product-level Changes and Semantics

  1. Reads on the target will occur at a timestamp bounded by the laggiest tablet in the system. If the lag is 1s, the read timestamp for any read will be at least 1s in the past (see the sketch after this list).
  2. This read time will be applied on all tables on the target, whether or not they are under replication.
  3. Each server will periodically get an updated version of the read time from the master leader. So if the master process is down or there is a master-tserver partition, the read time will not advance and reads will become increasingly stale.
  4. In the case of DR cutover, all tablets will cut over to the timestamp of the laggiest tablet. In other words, if the laggiest tablet is behind by 1s, every tablet will lose 1s worth of data.
  5. A toggle will be introduced between reading at a consistent but past timestamp and reading at the current time. The tradeoff is the atomic and ordered guarantees of an older timestamp versus the lower latency, and smaller data loss on cutover, of a real-time read.
  6. When target-side read staleness exceeds the history retention cutoff (currently 15m), the target will reject reads to avoid serving incorrect data.
  7. Transaction status WAL retention will need to be increased on the source, which will increase space utilization.
  8. The user will need to do a full backup/restore of all tables in case the transaction status replication stream can’t catch up. This is because the transaction status table has no B/R mechanism, and therefore the only way to resolve transactions on the target for user tables is to do a B/R for those tables.
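A minimal sketch of how a consistent read time bounded by the laggiest tablet (item 1) could be derived. The structure and names are assumptions for illustration, not the actual xCluster implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

using HybridTime = uint64_t;  // hypothetical alias for illustration

// Last hybrid time up to which each target tablet has applied replicated
// records from the source. The minimum across tablets is the laggiest
// tablet's time, so reading at it is safe for every replicated tablet.
std::optional<HybridTime> ComputeConsistentReadTime(
    const std::map<std::string, HybridTime>& applied_ht_by_tablet) {
  if (applied_ht_by_tablet.empty()) return std::nullopt;
  HybridTime read_ht = applied_ht_by_tablet.begin()->second;
  for (const auto& [tablet_id, applied_ht] : applied_ht_by_tablet) {
    read_ht = std::min(read_ht, applied_ht);
  }
  return read_ht;
}
```

Per item 3, the master leader would periodically recompute such a value and push it to the tservers; if that propagation stops, the cached read time stops advancing and reads grow increasingly stale.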

Situations not supported

  1. Active-active use cases (both sides taking independent writes) will not preserve transactional guarantees. For this case, we will continue to use last-writer-wins semantics.
  2. Ordered multi-shard reads across sessions: If timestamp(txn1) < timestamp(txn2) on the source universe, then the user may still be able to read txn2 before txn1 across multiple sessions.
  3. 1 : N, N : 1, or triangle formations
  4. Local transaction status tables on the source cluster

Subtasks

Open Problems

  1. What do we do when we need to bootstrap the txn status table? Since there is no rocksdb for this table and all committed txns are eventually gc'ed, we may need to bootstrap the entire cluster over again to resolve lost transactions.
  2. How do we ensure that two versions of ht_read on two different servers are bounded? On a single cluster, we use clock skew to bound the read time on any two servers. However, ht_read doesn't have any bound between two consecutive versions.
  3. Do we need to change history retention at all? Given we're reading at a timestamp in the past, if ht_read falls below history retention, we may no longer be able to serve reads.
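For open problem 3 (and item 6 above), a minimal sketch of the staleness check, assuming the target tracks its history retention cutoff as a hybrid time; names are hypothetical:

```cpp
#include <cstdint>

using HybridTime = uint64_t;  // hypothetical alias for illustration

// Once the consistent read time falls behind the history retention cutoff,
// older row versions may already have been compacted away, so serving the
// read could return incorrect data; rejecting it is the safe choice (item 6).
bool CanServeConsistentRead(HybridTime read_ht, HybridTime history_cutoff_ht) {
  return read_ht >= history_cutoff_ht;
}
```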
bmatican (Contributor) commented:

Removing priority label as this is a higher level tracking task.

@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 11, 2022
@rahuldesirazu rahuldesirazu self-assigned this May 11, 2022
@lingamsandeep lingamsandeep added the xCluster Label for xCluster related issues/improvements label Jun 30, 2022
@yugabyte-ci yugabyte-ci assigned hari90 and unassigned rahuldesirazu Jun 30, 2022
@yugabyte-ci yugabyte-ci added kind/new-feature This is a request for a completely new feature and removed kind/bug This issue is a bug labels Jul 9, 2022
@hari90 hari90 closed this as completed Sep 26, 2023
xCluster replication automation moved this from To do to Done Sep 26, 2023