
Refactor Accounts and Transfers to u128 amounts #1157

Merged: 2 commits into main, Sep 15, 2023

Conversation

@batiati (Contributor) commented Aug 23, 2023

This PR:

  • Account balance fields are now u128.

  • Transfer amounts are now u128.

  • Timeouts are u32, expressed in seconds instead of nanoseconds.

  • Three user_data fields: u128, u64, and u32.

  • AccountMutable and AccountImmutable are now a single groove, Account.

  • Reordered the Transfer and Account fields to keep @alignOf(T) == 16.

Direct changes:

Docs: All changes have been reflected, and new examples have been added for the user_data_{128|64|32} fields.

Benchmarks, Demos, Samples, Tests, and Repl: The code has been refactored.

Clients: Besides the naming and type refactors, these are the specific changes for each client:

  • Node client: No extensive changes were needed, as BigInt was already being used for u64 values.

  • Java client: The API now defaults all get accessors to byte[] for IDs and user data, and to BigInteger for balances and amounts. Additionally, overloaded accessors allow the application to get/set values as a pair of primitive longs. This allows integration with third-party numerical libraries without allocating temporary objects such as arrays or BigIntegers.

  • C# client: .NET 7.0 ships with support for UInt128. Users can easily opt in to the new UInt128 or use a replacement type we distribute for .NET Standard 2.1 (which is supported by other runtimes such as .NET Core, Xamarin, Unity, and .NET up to 6.0). There are also extension methods for converting to/from BigInteger.

  • Go client: We expose the type Uint128 with a set of conversion functions, including math/big.Int. Users thus have the option to use big.Int or any other implementation to handle or format 128-bit values.
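The two-halves representation common to these clients can be sketched as follows (type and function names here are illustrative, not necessarily the client's exact API):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/big"
)

// Uint128 models a 128-bit value as two 64-bit halves, mirroring the general
// approach described above (illustrative, not the actual client type).
type Uint128 struct {
	Lo, Hi uint64
}

// BigInt converts the 128-bit value to a math/big.Int.
func (u Uint128) BigInt() *big.Int {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[0:8], u.Hi)
	binary.BigEndian.PutUint64(buf[8:16], u.Lo)
	return new(big.Int).SetBytes(buf[:])
}

// FromBigInt converts a big.Int (must satisfy 0 <= v < 2^128) back into halves.
func FromBigInt(v *big.Int) Uint128 {
	var buf [16]byte
	v.FillBytes(buf[:]) // big-endian, zero-padded to 16 bytes
	return Uint128{
		Hi: binary.BigEndian.Uint64(buf[0:8]),
		Lo: binary.BigEndian.Uint64(buf[8:16]),
	}
}

func main() {
	u := Uint128{Lo: 1, Hi: 1} // 2^64 + 1
	fmt.Println(u.BigInt())    // 18446744073709551617
	fmt.Println(FromBigInt(u.BigInt()) == u)
}
```

The same pattern underlies the Java pair-of-longs overloads and the C# BigInteger extension methods: two machine words in, one arbitrary-precision value out, with no intermediate allocations beyond the big integer itself.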

Other changes:

  • Increased block_size to 128K.
    Without this, we can't fit everything in a single block due to the size of each object and lsm_batch_multiple:
src/lsm/schema.zig:180:51: error: overflow of integer type 'u32' with value '-1048'
        const padding_size = constants.block_size - padding_offset;
  • Argument --cache-grid is now validated against the minimum size.
    Currently, it's --cache-grid=256MB instead of 128MB due to the block_size changes.

  • Repl parser now accepts identifiers containing numbers (but not starting with numbers).
    Example: user_data_128 would be rejected by the previous version.
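The accepted identifier rule above can be sketched as follows (an illustration of the rule, not the repl's actual tokenizer):

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdentifier accepts identifiers that may contain digits but must not
// start with one, matching the rule described in the bullet above.
func isIdentifier(s string) bool {
	for i, r := range s {
		switch {
		case r == '_' || unicode.IsLetter(r):
			// letters and underscores are allowed anywhere
		case unicode.IsDigit(r):
			if i == 0 {
				return false // must not start with a digit
			}
		default:
			return false
		}
	}
	return s != ""
}

func main() {
	fmt.Println(isIdentifier("user_data_128")) // true
	fmt.Println(isIdentifier("128_data"))      // false
}
```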

Benchmarks

Before:

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
steps [2/4] zig build-exe benchmark ReleaseSafe native-linux... LLVM Emit Object... info: Benchmark running against { 127.0.0.1:43705 }
info(message_bus): connected to replica 0
1225 batches in 73.57 s
load offered = 1000000 tx/s
load accepted = 135926 tx/s
batch latency p00 = 0 ms
batch latency p10 = 18 ms
batch latency p20 = 22 ms
batch latency p30 = 25 ms
batch latency p40 = 26 ms
batch latency p50 = 28 ms
batch latency p60 = 29 ms
batch latency p70 = 29 ms
batch latency p80 = 30 ms
batch latency p90 = 31 ms
batch latency p100 = 3236 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3435 ms
transfer latency p20 = 7277 ms
transfer latency p30 = 12325 ms
transfer latency p40 = 18875 ms
transfer latency p50 = 25889 ms
transfer latency p60 = 33611 ms
transfer latency p70 = 41591 ms
transfer latency p80 = 49480 ms
transfer latency p90 = 56775 ms
transfer latency p100 = 63583 ms

After:

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
info: Benchmark running against { 127.0.0.1:38517 }
info(message_bus): connected to replica 0
1226 batches in 66.92 s
load offered = 1000000 tx/s
load accepted = 149436 tx/s
batch latency p00 = 0 ms
batch latency p10 = 17 ms
batch latency p20 = 19 ms
batch latency p30 = 23 ms
batch latency p40 = 24 ms
batch latency p50 = 25 ms
batch latency p60 = 26 ms
batch latency p70 = 28 ms
batch latency p80 = 30 ms
batch latency p90 = 31 ms
batch latency p100 = 2754 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3181 ms
transfer latency p20 = 6575 ms
transfer latency p30 = 11852 ms
transfer latency p40 = 17253 ms
transfer latency p50 = 23811 ms
transfer latency p60 = 30330 ms
transfer latency p70 = 37195 ms
transfer latency p80 = 43995 ms
transfer latency p90 = 50735 ms
transfer latency p100 = 56930 ms

@cb22 (Contributor) commented Aug 23, 2023

Interesting benchmark results, too. I remember when testing a pure block size increase, that bumped performance too due to the situation wrt compaction pacing (think the sweet spot was 256K).

Would be interesting to run those again (eg, compare main @ 128K).

@batiati (Contributor, Author) commented Aug 24, 2023

Interesting benchmark results, too. I remember when testing a pure block size increase, that bumped performance too due to the situation wrt compaction pacing (think the sweet spot was 256K).

The combination of block_size = 256k + lsm_batch_multiple = 128 gives the best TPS. IMO, mostly because it uses more NVMe bandwidth. I think it may change once the compaction pacing is in place, so it's not worth tuning it yet.

Would be interesting to run those again (eg, compare main @ 128K).

The main branch with 128K blocks showed the same ~150K TPS we measured with this PR's changes.
The difference is that we now write the full 128 bytes of each account during balance updates (no more split between AccountMutable and AccountImmutable). I observed about 2 GB more writes during the same benchmark, but the NVMe could absorb the extra IO, and we still gained by not having to do two lookups for the same object.

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
steps [2/4] zig build-exe benchmark ReleaseSafe native-linux... LLVM Emit Object... info: Benchmark running against { 127.0.0.1:33091 }
info(message_bus): connected to replica 0
1224 batches in 67.04 s
load offered = 1000000 tx/s
load accepted = 149174 tx/s
batch latency p00 = 0 ms
batch latency p10 = 19 ms
batch latency p20 = 22 ms
batch latency p30 = 25 ms
batch latency p40 = 27 ms
batch latency p50 = 29 ms
batch latency p60 = 30 ms
batch latency p70 = 31 ms
batch latency p80 = 33 ms
batch latency p90 = 35 ms
batch latency p100 = 2569 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3714 ms
transfer latency p20 = 7157 ms
transfer latency p30 = 12014 ms
transfer latency p40 = 17586 ms
transfer latency p50 = 23668 ms
transfer latency p60 = 29589 ms
transfer latency p70 = 36722 ms
transfer latency p80 = 43803 ms
transfer latency p90 = 50881 ms
transfer latency p100 = 57049 ms

@batiati batiati changed the title [RFC] Refactor Account and Transfers [WIP] Refactor Account and Transfers to u128 amounts Sep 1, 2023
@batiati batiati force-pushed the batiati-u128 branch 2 times, most recently from ade000a to 24cfb47 Compare September 12, 2023 13:21
@batiati batiati changed the title [WIP] Refactor Account and Transfers to u128 amounts Refactor Accounts and Transfers to u128 amounts Sep 12, 2023
@batiati batiati marked this pull request as ready for review September 12, 2023 13:21
docs/design/data-modeling.md (outdated; resolved)
docs/reference/transfers.md (outdated; resolved)
Comment on lines 272 to 273
* Must not overflow a 64-bit unsigned integer when converted to nanoseconds
and summed with the transfer's timestamp (`error.overflows_timeout`)
Member commented:

I guess this warning is less important now that the user passes in seconds, which we convert to nanoseconds -- unless the TB cluster's clock is 400 years in the future! 😉
May as well leave it in for completeness, though.

Contributor Author commented:

Agreed!
I removed it from here; instead, it's now mentioned in create_transfers.md#overflows_timeout
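For completeness, the overflow condition under discussion can be sketched as follows. Since timeouts are now u32 seconds, the conversion to nanoseconds always fits in u64; only the addition to the transfer's timestamp can overflow. (This is an illustrative sketch of the documented condition, not the actual state machine code; the function name is hypothetical.)

```go
package main

import "fmt"

// expiresAt converts a u32 timeout in seconds to nanoseconds and adds it to a
// u64 nanosecond timestamp, reporting overflow (the overflows_timeout case).
func expiresAt(timestampNs uint64, timeoutSec uint32) (uint64, bool) {
	// The multiplication itself cannot overflow: max u32 * 1e9 ~= 4.3e18 < 2^64.
	timeoutNs := uint64(timeoutSec) * 1_000_000_000
	sum := timestampNs + timeoutNs
	if sum < timestampNs {
		return 0, false // overflows_timeout
	}
	return sum, true
}

func main() {
	if _, ok := expiresAt(^uint64(0)-1, 2); !ok {
		fmt.Println("timestamp near 2^64: would overflow")
	}
	t, _ := expiresAt(1_000, 1)
	fmt.Println(t) // 1000001000
}
```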

src/clients/go/pkg/types/main_test.go (outdated; resolved)
src/tigerbeetle.zig (outdated; resolved)
src/tigerbeetle/repl.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/tigerbeetle/main.zig (outdated; resolved)
src/tigerbeetle/main.zig (outdated; resolved)
.user_data_32 = 8,
.ledger = 9,
.code = 10,
.timestamp = 11,
Member commented:

🤔 Despite the name "timestamp" here, what this actually corresponds to is the groove's object tree. IMO that should take precedence over the field order, and this tree's id should be listed/assigned before id. (Likewise for transfers.)

Member commented:

what this actually corresponds to is the groove's object tree

However, our usage of timestamp as internal identifier for the groove's object tree is a secondary implementation detail. The primary usage of timestamp is external, as a totally ordered hybrid logical clock. Beyond that, the reason it's last, is because it's the only field that's "filled in"—in the sense that it's appended to the existing data.

Member commented:

Also, ahead of timestamp is id, which is the external identifier and which comes first (it's the "key", followed by the value). Users provide this much of the key/value, consecutively, and then the time is "stamped" on after.

Member commented:

Users provide this much of the key/value, consecutively, and then the time is "stamped" on after.

This code is defining/assigning the LSM tree ids though, it doesn't relate to anything that users see.
From the LSM's perspective, the object tree is the "primary" tree -- object and id trees are special cases in the groove, separate from all of the secondary index trees.

@batiati (Contributor Author) commented Sep 15, 2023:

IMO, there's not much benefit that justifies not following the object's declaration order here.
We already handle timestamp as a special case in the LSM, which is very clear; there's no need to reinforce it.

Contributor Author commented:

Anyway, those fields are intended to be stable IDs and may become unordered in the future if we ever change our schema. Either we will follow the declaration order or the numerical order. So, I don't have strong feelings about it.

src/state_machine.zig Outdated Show resolved Hide resolved
Comment on lines +1838 to +1844
\\ transfer T1 A1 A3 -0 _ U1 U1 U1 _ L1 C2 _ _ _ _ _ _ _ _ exists_with_different_flags
\\ transfer T1 A3 A1 -0 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_debit_account_id
\\ transfer T1 A1 A4 -0 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_credit_account_id
\\ transfer T1 A1 A3 -0 _ U1 U1 U1 1 L1 C1 _ PEN _ _ _ _ _ _ exists_with_different_amount
\\ transfer T1 A1 A3 123 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_128
\\ transfer T1 A1 A3 123 _ _ U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_64
\\ transfer T1 A1 A3 123 _ _ _ U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_32
Member commented:

Perfect!

Member commented:

These test vectors are always so nice and concise.

src/tigerbeetle.zig (resolved)
@batiati (Contributor Author) commented Sep 15, 2023

Note: .NET tests are failing on the macOS self-hosted runner.
Please ignore them for this PR; I suspect cache issues, since the self-hosted runner has no isolation when building .NET 7.0 for this PR and .NET 6.0 for everything else.

@batiati batiati added this pull request to the merge queue Sep 15, 2023
Merged via the queue into main with commit c5d2d4b Sep 15, 2023
142 of 156 checks passed
@batiati batiati deleted the batiati-u128 branch September 15, 2023 21:48
sentientwaffle added a commit that referenced this pull request Sep 21, 2023
@sentientwaffle sentientwaffle mentioned this pull request Jul 26, 2023
92 tasks
* @throws IllegalStateException if not at a {@link #isValidPosition valid position}.
* @see <a href="https://docs.tigerbeetle.com/reference/accounts/#user_data_64">user_data_64</a>
*/
public long getUserData64() {
Member commented:

🤔 It feels like this API makes it too easy to ignore the high-order bits. Should we perhaps throw an exception here if this is not the entirety of the user data?

Member commented:

And this is also inconsistent with how, e.g., getDebitsPosted works (that API makes more sense to me personally).

@batiati it feels like maybe we have an incomplete refactor here?
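The reviewer's suggestion of failing loudly on truncation could look roughly like this (a Go sketch of the idea, not the Java client's actual API; the function name is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// userData64 returns the low 64 bits of a 128-bit user-data value, but only
// when the high half is zero, so truncation can't pass silently. This sketches
// the reviewer's "checked getter" suggestion; it is not the client's API.
func userData64(lo, hi uint64) (uint64, error) {
	if hi != 0 {
		return 0, errors.New("user data does not fit in 64 bits")
	}
	return lo, nil
}

func main() {
	v, err := userData64(42, 0)
	fmt.Println(v, err) // 42 <nil>
	_, err = userData64(42, 1)
	fmt.Println(err != nil) // true
}
```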
