Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add document about single region 1PC #57

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
59 changes: 59 additions & 0 deletions design/async-commit/single-region-1pc.md
@@ -0,0 +1,59 @@
# Single Region 1PC

For transactions that affect only one region, or more strictly speaking, that can be prewritten with only one prewrite request, can be committed directly while the prewrite request is being handled, so the commit phase can be totally removed. Therefore we can get less latency and more throughput. For TiDB, indices and rows are unsally not in a same region, **so this optimization only works under a limited amount of scenarios, like sysbench oltp_update_non_index**. But in a suitable scenario, it gains significantly better performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For transactions that affect only one region, or more strictly speaking, that can be prewritten with only one prewrite request, can be committed directly while the prewrite request is being handled, so the commit phase can be totally removed. Therefore we can get less latency and more throughput. For TiDB, indices and rows are unsally not in a same region, **so this optimization only works under a limited amount of scenarios, like sysbench oltp_update_non_index**. But in a suitable scenario, it gains significantly better performance.
For transactions that affect only one region, or more strictly speaking, that can be prewritten with only one prewrite request, can be committed directly while the prewrite request is being handled, so the commit phase can be totally removed. Therefore we can get less latency and more throughput. For TiDB, indices and rows are usually not in a same region, **so this optimization only works under a limited amount of scenarios, like sysbench oltp_update_non_index**. But in a suitable scenario, it gains significantly better performance.


We used to think about supporting 1PC before but stopped after meeting many difficulties. Now since async commit, which meets and solved mostly the same problem as 1PC, is implemented, we can continue supporting 1PC with much less effort to make.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @cfzjywxk :

Maybe we could extend it to 1PC for single-store transactions with some placement usages in the future.


* Problems already solved by supporting async commit:
* Commit ts calculation
* Non-unique commit ts and rollback record overlapping
* Memory lock
* Replica read problem
* Problems that we do not care anymore:
* Binlog incompatibility

## Basic Design

Since async commit is implemented, it's not difficult to implement a working 1PC. However, it's still hard to make it perfectly correct and compatible with other components.

* When committing a transaction, if TiDB finds that the prewrite phase can be done with only one single request, the transaction is allowed to be committed with 1PC protocol. A field named `try_one_pc` in the prewrite request will be set to let TiKV know that 1PC is available for this transaction.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this be set or not set? If and only if the transaction only touches one region/store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be set iff the prewrite can be done with only one request. Typically it means the transaction affects only one region. But if there's too many affected keys in a region, it may need to be prewritten with more than one requests, in which case 1PC is not applicable.

* When TiKV receives a request with `try_one_pc` set, it first handle it just like how it handles normal prewrite requests.But after generating write buffer and before writing them down to RocksDB, it will additionally check if the prewrite is fully successful, and convert the locks into commit records if so. And finally write them down to RocksDB. The `commit_ts` is `max(max_ts, start_ts, for_update_ts) + 1`. It fetches the `max_ts` while acquiring the memory lock, and the memory lock is released after applying, just like how async commit does. The final `commit_ts` will be sent back to TiDB via prewrite response.
* 1PC and async commit can be independent. When TiKV rejects to commit a transaction with 1PC, the transaction can then fallback to normal transactions, and it may become a normal 2PC transaction or an async commit transaction, according to if the async commit flag is set.

## Problems need to be solved

### Schema version checking problem

[This problem exists in async commit too](https://github.com/tikv/sig-transaction/blob/master/design/async-commit/parallel-commit-known-issues-and-solutions.md#schema-version-checking). But it's even harder to solve for 1PC, because if it's committed in TiKV, it will have no chance to check if the schema version has changed between the `start_ts` and `commit_ts`, while for async commit it can be checked after prewrite finishing.

Possible solution: When trying committing a transaction with 1PC, find a ts `one_pc_max_commit_ts` before which we can guarantee that the schema version can't change, and send it to TiKV. TiKV will reject committing if the calculated `commit_ts` exceeds the `one_pc_max_commit_ts`.

### CDC compatibility problem

CDC syncs data from TiKV by observing applying events. Prewrites and commits are distinguished, and CDC will use these events to compose complete transactions and then send them to the downstream. So when 1PC is enabled, there need to be some way for CDC to distinguish if a apply event is caused by 1PC committing, otherwise CDC will expect that a commit event must has a corresponding prewrite event to compose a complete transaction.

Possible solutions:
Copy link

@coocood coocood Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, we should have a way that disables 1PC when CDC is connected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will 1PC be shipped with the problem unsolved? Otherwise I think we can allow enabling 1PC only after resolving these problems, then we won't need to automatically disable 1PC when CDC is connected.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can enable 1PC by default on the master branch.
It makes POC much easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if enabling it by default is a good idea unless we've done enough tests and it's stable enough.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable it by default can make the feature mature more quickly.

1. Passing a `is_1pc` flag from txn layer to apply, just like how `TxnExtra` (which is used to support CDC outputting old value) was written previously. It would be ugly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the least ugly solution :-) 2 is very fragile, 3 seems wasteful and make write records more complex.

2. When an apply batch puts write records but without deleting lockcf, we conclude that it must be a 1PC committing. This sounds like a simple way, but this may leave troubles to future. For example, if we want to introduce some new mechanisms that may amend persisted write records, we will then also need a new mechanism to distinguish the amending and 1PC committing. There's even one more problem: if the transaction being committed is a pessimistic transaction, the pessimistic lock need to be removed while writing the commit record
3. Add a field to write record that marks it's committed by 1PC. This approach makes the write record more complicated may waste some disk space.

## POC Test

We verified the improvement on a draft implementation with a brief test with following configurations, on a cluster with 3 TiKV nodes(16c 90G each) and 1 TiDB node(16c 64G):

```
sysbench oltp_update_non_index run --rand-type=uniform --db-driver=mysql --tables=8 --table-size=10000000 --time=270 --percentile=99 --report-interval=10 --threads=128
```

Here's the result:

| test | qps | lat avg (ms) | lat .99 (ms) | lat max (ms)|
|----------------------------|---------|--------------|--------------|-------------|
|update_non_index<br/>1pc | 16603.37| 7.71| 15.00| 81.50|
|update_non_index<br/>async_commit|13278.29| 9.64| 18.61| 110.50|
|update_non_index<br/>2pc | 8892.85| 14.39| 24.83| 118.57|


## Plan

TBD