
Integrate util/groupcommit package with the coordinator table commit #1728

Merged (30 commits) on May 28, 2024

Conversation

@komamitsu (Contributor) commented May 14, 2024

Description

We've already added the util/groupcommit package, but it isn't used yet. This PR integrates the group commit with the coordinator table writes.

Related issues and/or PRs

Changes made

  • Add CoordinatorGroupCommitter, which extends GroupCommitter, and CommitHandlerWithGroupCommit, which extends CommitHandler
    • They write a record to the coordinator.state table via Emitter.emit()
    • There are two types of coordinator.state table records (a sketch illustrating both shapes follows this list):
      • In the case of a normal group commit, the columns of a record are as follows:
        • tx_id: $parent_id (the key of a normal group that contains multiple transactions)
        • tx_child_ids: $child_id_1,$child_id_2,...,$child_id_n (the values of a normal group that contains multiple transactions)
      • In the case of a delayed group commit, which contains a single transaction isolated from a normal group because it was delayed, the columns of a record are as follows:
        • tx_id: $parent_id:$child_id (the key and value of a delayed group that contains a single transaction)
        • tx_child_ids: empty
  • Modify ConsensusCommitConfig, ConsensusCommit, ConsensusCommitManager, TwoPhaseConsensusCommit, and TwoPhaseConsensusCommitManager to use CommitHandlerWithGroupCommit
  • Modify the util/groupcommit package to fix some issues that occurred during this integration
    • Sorry, this also includes an improvement that changes oldGroupAbortTimeoutSeconds to oldGroupAbortTimeoutMillis, which could have been handled in another PR...
  • Add some unit tests and modify some existing tests
  • Modify some existing integration tests
    • ConsensusCommitJdbcEnv is introduced to inject the group-commit-related properties.
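
For illustration, here is a minimal sketch of the two record shapes described above, assuming hypothetical class and method names (this is not the actual CoordinatorGroupCommitter/Emitter code in this PR):

```java
import java.util.List;

// Sketch only: the two coordinator.state record shapes written by the group commit emitter.
// The class and factory method names here are hypothetical.
final class CoordinatorStateRecordSketch {
  final String txId;        // coordinator.state.tx_id
  final String txChildIds;  // coordinator.state.tx_child_ids (comma-separated; may be empty)

  private CoordinatorStateRecordSketch(String txId, String txChildIds) {
    this.txId = txId;
    this.txChildIds = txChildIds;
  }

  // Normal group commit: tx_id = $parent_id, tx_child_ids = $child_id_1,$child_id_2,...,$child_id_n.
  static CoordinatorStateRecordSketch forNormalGroup(String parentId, List<String> childIds) {
    return new CoordinatorStateRecordSketch(parentId, String.join(",", childIds));
  }

  // Delayed group commit: tx_id = $parent_id:$child_id, tx_child_ids is empty.
  static CoordinatorStateRecordSketch forDelayedGroup(String parentId, String childId) {
    return new CoordinatorStateRecordSketch(parentId + ":" + childId, "");
  }
}
```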

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Transaction ID conventions

  • Parent ID
    • A key of a normal group that contains multiple transactions
    • coordinator.state.tx_id will be set to this ID when storing transactions in a normal group
    • This consists of 24 base-62 characters [0-9A-Za-z]
  • Child ID
    • A key to identify each transaction within a group
    • coordinator.state.tx_child_ids will contain this type of ID when storing transactions in a normal group
    • This will be a UUID v4 when automatically generated on the ScalarDB side, or an arbitrary string passed by users via begin(String txId)
  • Full ID
    • A concatenated string of the Parent ID and Child ID for a transaction
    • coordinator.state.tx_id will be set to this ID when storing transactions in a delayed group that contains only a single transaction (coordinator.state.tx_child_ids will be empty in this case)
    • The format of this ID is $ParentID:$ChildID, whose total length is 61 (24 + 36 + 1), so we don't need to extend the maximum length of coordinator.state.tx_id
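
As a rough illustration of these conventions, here is a self-contained sketch; the class and method names are hypothetical and this is not the actual KeyManipulator implementation:

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper illustrating the ID conventions above.
final class TxIdConventionSketch {
  private static final String CHARS =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
  private static final int PARENT_ID_LENGTH = 24;
  private static final char DELIMITER = ':';

  // Parent ID: 24 base-62 characters [0-9A-Za-z].
  static String generateParentId() {
    StringBuilder sb = new StringBuilder(PARENT_ID_LENGTH);
    for (int i = 0; i < PARENT_ID_LENGTH; i++) {
      sb.append(CHARS.charAt(ThreadLocalRandom.current().nextInt(CHARS.length())));
    }
    return sb.toString();
  }

  // Child ID: a UUID v4 (36 characters) when generated on the ScalarDB side.
  static String generateChildId() {
    return UUID.randomUUID().toString();
  }

  // Full ID: $ParentID:$ChildID, 24 + 1 + 36 = 61 characters in total.
  static String fullId(String parentId, String childId) {
    return parentId + DELIMITER + childId;
  }

  static boolean isFullId(String id) {
    return id.indexOf(DELIMITER) >= 0;
  }

  static String parentIdOf(String fullId) {
    return fullId.substring(0, fullId.indexOf(DELIMITER));
  }
}
```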

Limitations

(Figure: limitations)

Release notes

Added Group Commit feature for Coordinator Table

@@ -8,6 +8,6 @@ public class ConsensusCommitCrossPartitionScanIntegrationTestWithJdbcDatabase

@Override
protected Properties getProps(String testName) {
- return JdbcEnv.getProperties(testName);
+ return ConsensusCommitJdbcEnv.getProperties(testName);
komamitsu (author) commented:

ConsensusCommitJdbcEnv is introduced to inject GroupCommit related properties.

Collaborator commented:

Sorry, we might have already discussed this, but we add integration tests for GroupCommit only for JDBC, right?

@komamitsu (author) commented May 21, 2024:

IIRC, this is a trade-off between cost and completeness, based on my memory of the discussion we had before. The reason I picked JDBC is that the current main target of the group commit feature is YugabyteDB.

But, on second thought, it might be better to add the group commit integration tests for other storages as well. Maybe a scheduled weekly integration test with all the storages is a good trade-off? If it's okay, I'll take care of it in another PR later. What do you think?

Collaborator commented:

Yeah, I also think it is better to add the group commit integration tests with other storages. But I'm wondering if we should add them to the regular integration tests. You assume they will impact the integration test time, right?

komamitsu (author) commented:

Yes. In the case of https://github.com/scalar-labs/scalardb/actions/runs/9202687096, the group commit integration test using PostgreSQL, which executed only the ConsensusCommit cases, took 12 minutes, while the full integration test using PostgreSQL without the group commit feature took only 2 minutes. (That was basically because of frequently opening/closing CoordinatorGroupCommitter and garbage ongoing transactions that were not explicitly rolled back, though.)

So, this is a very rough estimate, but it may take about 6 times more GitHub Actions compute in total. I think a good combination is 1) running the group commit integration test only with PostgreSQL in the regular workflow, and 2) testing all the storages with the group commit feature enabled in a weekly or daily workflow. Item 1 is already implemented in this PR, and item 2 will be added in another PR later. What do you think?

Collaborator commented:

Okay, that sounds good. I thought we should run CI for the group commit feature with other storages as much as possible before pushing changes, because we should detect bugs early. However, it seems like this significantly impacts the regular CI time.

Let's add item 2 in another PR. Also, maybe we should add the workflow_dispatch trigger to the workflow so that we can manually run the CI for the group commit feature with other storages when we make changes related to the group commit feature.

}

commitState(snapshot);
commitRecords(snapshot);
}

protected void handleCommitConflict(Snapshot snapshot, Exception cause)
komamitsu (author) commented:

Extracted this from commitState() so that CommitHandlerWithGroupCommit can reuse the impl
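
For context, here is a minimal sketch of what this extraction enables. The placeholder types stand in for the real Snapshot and CommitException, and the bodies and signatures are simplified, not the actual code:

```java
// Placeholder types so the sketch is self-contained; the real ones live in ScalarDB.
class Snapshot {}

class CommitException extends Exception {
  CommitException(String message, Throwable cause) {
    super(message, cause);
  }
}

// Sketch: handleCommitConflict() extracted from commitState() so a subclass can reuse it.
class CommitHandlerSketch {
  public void commitState(Snapshot snapshot) throws CommitException {
    try {
      // ... write the coordinator state for a single transaction ...
    } catch (Exception e) {
      handleCommitConflict(snapshot, e); // shared conflict handling
    }
  }

  protected void handleCommitConflict(Snapshot snapshot, Exception cause) throws CommitException {
    // ... check the actual coordinator state and decide whether to roll back or rethrow ...
    throw new CommitException("committing state failed", cause);
  }
}

class CommitHandlerWithGroupCommitSketch extends CommitHandlerSketch {
  void commitStateViaGroupCommit(Snapshot snapshot) throws CommitException {
    try {
      // ... register the snapshot with the group committer and wait for the group to be emitted ...
    } catch (Exception e) {
      handleCommitConflict(snapshot, e); // reuses the extracted implementation
    }
  }
}
```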

@@ -17,7 +17,7 @@ class DelayedGroup<PARENT_KEY, CHILD_KEY, FULL_KEY, EMIT_KEY, V>
FULL_KEY fullKey,
Emittable<EMIT_KEY, V> emitter,
KeyManipulator<PARENT_KEY, CHILD_KEY, FULL_KEY, EMIT_KEY> keyManipulator) {
- super(emitter, keyManipulator, 1, config.oldGroupAbortTimeoutSeconds());
+ super(emitter, keyManipulator, 1, config.oldGroupAbortTimeoutMillis());
komamitsu (author) commented:

Strictly speaking, this refactoring isn't directly related to this PR.

@Mock private Coordinator coordinator;
@Mock private TransactionTableMetadataManager tableMetadataManager;
@Mock private ConsensusCommitConfig config;
@Mock protected DistributedStorage storage;
komamitsu (author) commented:

This class is inherited by CommitHandlerWithGroupCommitTest and these variables are used by it.

@komamitsu marked this pull request as ready for review May 15, 2024 10:31
@komamitsu self-assigned this May 15, 2024
@komamitsu added the enhancement (New feature or request) label May 15, 2024
@josh-wong (Member) left a comment:

Looking good! I've added some comments and suggestions. PTAL!

Resolved suggestions on docs/configurations.md and docs/api-guide.md.
komamitsu and others added 2 commits May 16, 2024 18:03
Co-authored-by: Josh Wong <joshua.wong@scalar-labs.com>
Co-authored-by: Josh Wong <joshua.wong@scalar-labs.com>
@josh-wong (Member) left a comment:

I've left some minor suggestions. Other than that, LGTM. Thank you!🙇🏻‍♂️

Resolved suggestions on docs/api-guide.md.
@brfrn169 (Collaborator) left a comment:

Overall LGTM, but I'll review this again later. Left several comments. Please take a look when you have time!


@@ -137,19 +157,34 @@ TwoPhaseCommitTransaction join(String txId, Isolation isolation, SerializableStr
return resume(txId);
}

return createNewTransaction(txId, isolation, strategy);
// Participant services don't use the group commit feature even if it's enabled. They simply use
@brfrn169 (Collaborator) commented May 20, 2024:

Could you please elaborate on why participants don't use the group commit feature even if it's enabled?

komamitsu (author) commented:

Sure thing!


How GroupCommitter places transactions into groups depends on various timings, including some types of timeouts (e.g., moving delayed transactions to a DelayedGroup) and events (e.g., GroupCommitter.ready(Snapshot) being called after the prepare and validate phases are done). So, it's unlikely that the Coordinator service and all the Participant services will have the same groups with the same transaction distributions in memory. If the Participant services used the group commit feature, they might insert transaction groups different from those of other nodes, including the Coordinator service.

See also https://github.com/scalar-labs/scalardb/blob/ebbb0274bb01fd52dc553a333aff5183434c44b6/docs/api-guide.md#commit-in-two-phase-commit-interface.

Collaborator commented:

Understood. Thank you for your explanation!

So the way to use the two-phase interface varies depending on whether group commit is enabled or not, which might be confusing for users. One option is to always disable group commit when using the two-phase interface. What do you think?

komamitsu (author) commented:

> So the way to use the two-phase interface varies depending on whether group commit is enabled or not, which might be confusing for users.

Yes. Users need to change the commit order in their applications as a part of the limitation, and it may be confusing.

> One option is to always disable group commit when using the two-phase interface

You mean to throw an exception or something if TwoPhaseCommitTransactionManager is used with the group commit feature enabled, right? Indeed, the idea sounds simpler and less confusing. On the other hand, I'm wondering what if a user wants to use the two-phase commit interface on a high-latency underlying storage where the coordinator table is placed 🤔

Collaborator commented:

> You mean to throw an exception or something if TwoPhaseCommitTransactionManager is used with the group commit feature enabled, right?

Yes, I meant that.

> On the other hand, I'm wondering what if a user wants to use the two-phase commit interface on a high-latency underlying storage where the coordinator table is placed 🤔

Correct. But, actually, there are also some other optimizations and features we have to give up for the two-phase commit interface.

komamitsu (author) commented:

I see. That sounds like a reasonable concern.

I guess it's related to the product specification. @feeblefakie Do you think we can/should prohibit the group commit feature in the two-phase commit interface?

@komamitsu (author) commented May 27, 2024:

@brfrn169 I had a chance to talk with @feeblefakie about that and he agreed with you. I'll revise this PR (or create another PR to disable that?).

This PR already contains many changes and comments, and got some approvals. So, I think it would be great if I can disable the combination of the group commit and two-phase commit interface in a separate PR.

> Users need to change the commit order in their applications as a part of the limitation

BTW, this is a kind of memo to myself about future development: when we enable the group commit with the two-phase commit interface, we should add a check of whether the transaction is committed in the coordinator table before committing records on participant services.

@feeblefakie (Contributor) left a comment:

Overall, looking good!
Left some comments and suggestions. PTAL!

}

public Optional<Coordinator.State> getState(String id) throws CoordinatorException {
if (keyManipulator.isFullKey(id)) {
// The following read operation order is important. Reading a coordinator state is likely to
Contributor commented:

These sentences are a bit hard to understand, especially the "if a transaction scanned by another transaction executing lazy recovery is delayed" part.
Let's chat about it and rephrase them.

komamitsu (author) commented:

Thanks for the feedback! I agree with you. I moved this logic for the group commit to getStateForGroupCommit() and revised the comment. What do you think?

Of course, chatting on the comment is welcome!

// transaction ID. Therefore, looking up with the full transaction ID should be tried first
// to minimize read operations as much as possible.

// Scan with the full ID for a delayed group that contains only a single transaction.
Contributor commented:

I feel this method is also used in normal cases without group commit, so the wording is probably too specific to group commit. So, should we describe two cases: one for the normal case without group commit and the other for the group commit case?

Also, regarding "Scan with the full ID", should it be something like "Regard the ID as a full ID and scan with the ID", since we don't know if the given ID is a full ID or not?

komamitsu (author) commented:

I moved this logic for the group commit to getStateForGroupCommit() to make things clearer.

> Regarding "Scan with the full ID", should it be something like "Regard the ID as a full ID and scan with the ID", since we don't know if the given ID is a full ID or not?

Ah, whether the ID is a full ID is already checked. But I changed the argument name to a clearer one in the commit.

return state;
}

// Scan with the parent ID for a normal group that contains multiple transactions with
Contributor commented:

Ditto.

komamitsu (author) commented:

A parent ID is explicitly extracted from a full ID here. So can we go with the current comment as is?
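
To make the lookup order discussed in this thread concrete, here is a minimal sketch. The store interface and names are hypothetical simplifications; the real logic presumably lives in the getStateForGroupCommit() method mentioned above:

```java
import java.util.Optional;

// Sketch of the two-step lookup: try the full ID first (delayed group), then the
// parent ID (normal group). The names are stand-ins for the real Coordinator code.
final class GroupCommitStateLookupSketch {
  interface StateStore {
    Optional<String> readByTxId(String txId); // reads coordinator.state by its tx_id key
  }

  static Optional<String> getStateForGroupCommit(StateStore store, String fullId) {
    // 1. Delayed group: the record's tx_id is the full ID itself.
    Optional<String> delayed = store.readByTxId(fullId);
    if (delayed.isPresent()) {
      return delayed;
    }
    // 2. Normal group: the record's tx_id is the parent ID; the child ID must also be
    //    listed in tx_child_ids (that check is omitted in this sketch).
    String parentId = fullId.substring(0, fullId.indexOf(':'));
    return store.readByTxId(parentId);
  }
}
```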

.withValue(Attribute.toStateValue(state.getState()))
Put put = new Put(new Key(Attribute.toIdValue(state.getId())));
if (!state.getChildIds().isEmpty()) {
put.withValue(Attribute.toChildIdsValue(Joiner.on(',').join(state.getChildIds())));
Contributor commented:

"" is hard-coded, so should CHILD_IDS_DELIMITER be defined at the top level and used here, or should a helper method be defined in the State inner class?

komamitsu (author) commented:

Good point! Yeah, I overlooked it. The State class already knows how to handle the column, including what the delimiter is, so I took care of it in the State class.
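
A small sketch of the direction this took. The field and method names below are illustrative only; the actual change was made in the State class as described above:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: the State class owning the child-ID delimiter, so callers don't
// hard-code "," when building the Put.
final class StateSketch {
  static final String CHILD_IDS_DELIMITER = ",";

  private final List<String> childIds;

  StateSketch(List<String> childIds) {
    this.childIds = childIds;
  }

  // Used when writing coordinator.state.tx_child_ids.
  String childIdsAsString() {
    return String.join(CHILD_IDS_DELIMITER, childIds);
  }

  // Used when reading coordinator.state.tx_child_ids back.
  static List<String> parseChildIds(String column) {
    return column.isEmpty()
        ? Collections.emptyList()
        : Arrays.asList(column.split(CHILD_IDS_DELIMITER));
  }
}
```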

private final Random random = new Random();

@Override
public String generateParentKey() {
Contributor commented:

Random string generation might affect the performance, so it would be great if you could do a simple benchmark with multiple threads.

I previously encountered crazily slow random string generation when using an Apache Commons library, so I used a method taken from the benchbase one instead.
https://github.com/scalar-labs/scalardb-benchmarks/blob/master/src/main/java/com/scalar/db/benchmarks/ycsb/YcsbCommon.java#L109

So, it would be great if you could compare this with the one above.

@komamitsu (author) commented May 22, 2024:

Thanks for sharing the concern. That makes sense.

I benchmarked the following combinations with this commit.

a. Simple impl (the original generateParentKey() impl) + Random (same as the current impl)
b. Faster impl (ported from scalardb-benchmark) + Random
c. Simple impl (the original generateParentKey() impl) + ThreadLocalRandom
d. Faster impl (ported from scalardb-benchmark) + ThreadLocalRandom

[komamitsu@pop-os ~]$ lscpu | grep '^CPU(s)'
CPU(s):                             16
[komamitsu@pop-os ~]$ uname -a
Linux pop-os 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2 SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux
(Figure: benchmark results)

What I noticed from the results:

  • The implementation ported from scalardb-benchmark is about 10-15% faster than the original simple generateParentKey() impl when using ThreadLocalRandom
  • Whether Random or ThreadLocalRandom is used significantly affects the performance along with the number of threads. The average duration was 0.015 ms in case a with 16 threads, which is the same as the number of CPUs on the Linux machine. The duration would increase on a machine with more CPUs

I think we should use ThreadLocalRandom just in case. When I benchmarked the group commit feature with ThreadLocalRandom before, I observed a lot of parent key conflicts at the beginning of benchmarking. I'll come up with a way to mitigate the issue and update this PR.

@komamitsu (author) commented May 23, 2024:

@feeblefakie BTW, I checked the distribution of the random character values when using the implementation ported from scalardb-benchmark. It seems skewed compared to the original simple impl. I also observed a similar distribution when directly testing https://github.com/scalar-labs/scalardb-benchmarks/blob/master/src/main/java/com/scalar/db/benchmarks/ycsb/YcsbCommon.java#L109.

I think it's okay to generate just a random value, but the simple one would be better when generating unique identifiers.

  • The implementation ported from scalardb-benchmark, where the number in square brackets is the index into the valid char sequence 0,1,2,...,A,B,...,Z,a,b,...,z and the next number indicates how many times the target character was used
[000] 12170
[001] 21654
[002] 13412
[003] 23733
[004] 15577
[005] 26786
[006] 15973
[007] 56956
[008] 14013
[009] 26212
[010] 14705
[011] 107740
[012] 13730
[013] 28530
[014] 19633
[015] 106851
[016] 10670
[017] 29521
[018] 19581
[019] 61491
[020] 17591
[021] 46623
[022] 22265
[023] 48223
[024] 16174
[025] 58574
[026] 17782
[027] 44353
[028] 16499
[029] 69603
[030] 18314
[031] 40725
[032] 8898
[033] 73426
[034] 17933
[035] 47698
[036] 12817
[037] 81226
[038] 18393
[039] 30052
[040] 12386
[041] 64607
[042] 14837
[043] 53717
[044] 15901
[045] 73947
[046] 17029
[047] 30590
[048] 11894
[049] 34586
[050] 22686
[051] 52105
[052] 16454
[053] 43558
[054] 22372
[055] 33789
[056] 15510
[057] 24684
[058] 17335
[059] 28752
[060] 18036
[061] 29118
  • The original simple implementation
[000] 39054
[001] 38329
[002] 38803
[003] 39094
[004] 38654
[005] 38724
[006] 38692
[007] 38661
[008] 38667
[009] 38810
[010] 38764
[011] 38945
[012] 38641
[013] 38578
[014] 38603
[015] 38749
[016] 38753
[017] 38709
[018] 38565
[019] 38706
[020] 38555
[021] 38993
[022] 38613
[023] 38477
[024] 38939
[025] 38521
[026] 38813
[027] 38752
[028] 38829
[029] 38728
[030] 38606
[031] 38984
[032] 38803
[033] 38611
[034] 39012
[035] 38577
[036] 38516
[037] 38485
[038] 38580
[039] 38982
[040] 38378
[041] 38650
[042] 38782
[043] 38494
[044] 38518
[045] 38574
[046] 38892
[047] 38815
[048] 38335
[049] 38681
[050] 38510
[051] 38572
[052] 38581
[053] 38473
[054] 38734
[055] 38822
[056] 39052
[057] 39031
[058] 38996
[059] 38902
[060] 38711
[061] 38620

komamitsu (author) commented:

> When I benchmarked the group commit feature with ThreadLocalRandom before, I observed a lot of parent key conflicts at the beginning of benchmarking

This was my silly mistake. I stored the return value from ThreadLocalRandom.current(), but I shouldn't have done that...

> Usages of this class should typically be of the form: ThreadLocalRandom.current().nextX(...) (where X is Int, Long, etc). When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadLocalRandom.html

Fixed it in 755b911
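
To illustrate the pitfall and the fix in general terms (sketch only; the actual change is in the referenced commit):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the ThreadLocalRandom pitfall described above. Storing the instance
// returned by ThreadLocalRandom.current() in a field can leak it across threads;
// calling current() at each use site keeps it per-thread.
final class ParentKeyRandomSketch {
  private static final String CHARS =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

  // Problematic pattern: the field may be initialized on one thread and then used
  // by others, defeating the purpose of ThreadLocalRandom.
  // private final ThreadLocalRandom random = ThreadLocalRandom.current();

  // Safer pattern: fetch the per-thread instance on every call.
  String generateParentKey(int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append(CHARS.charAt(ThreadLocalRandom.current().nextInt(CHARS.length())));
    }
    return sb.toString();
  }
}
```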

@@ -84,7 +84,10 @@ public void commitState(Snapshot snapshot)
commitStateViaGroupCommit(snapshot);
}

private void groupCommitState(String parentId, List<Snapshot> snapshots)
// TODO: Emitter interface should have `emitNormalGroup()` and `emitDelayedGroup()` separately and
komamitsu (author) commented:

Found there was room to improve this method and related ones to make it clearer which type of ID is passed, so I added this TODO.
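
A rough sketch of the direction the TODO suggests (hypothetical interface, not part of this PR):

```java
import java.util.List;

// Sketch only: splitting the emit API so callers can't pass the wrong kind of ID.
// The interface name, generics, and exception type are illustrative.
interface GroupCommitEmitterSketch<V> {
  // Normal group: keyed by the parent ID, emitting all values buffered in the group.
  void emitNormalGroup(String parentId, List<V> values) throws Exception;

  // Delayed group: keyed by the full ID ($parent_id:$child_id), emitting a single value.
  void emitDelayedGroup(String fullId, V value) throws Exception;
}
```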

@Torch3333 (Contributor) left a comment:

LGTM, thank you!

Resolved suggestion on docs/configurations.md.
@feeblefakie (Contributor) left a comment:

LGTM! Thank you!

@josh-wong (Member) left a comment:

I've left some minor suggestions. Other than that, LGTM. Thank you!🙇🏻‍♂️

Resolved suggestions on docs/api-guide.md and docs/configurations.md.
Co-authored-by: Josh Wong <joshua.wong@scalar-labs.com>
@brfrn169 (Collaborator) left a comment:

LGTM! Thank you!

Labels: enhancement (New feature or request)
5 participants