
Refactor Accounts and Transfers to u128 amounts #1157

Merged: 2 commits into main, Sep 15, 2023

Conversation

@batiati (Contributor) commented Aug 23, 2023

This PR:

  • Account balance fields are now u128.

  • Transfer amounts are now u128.

  • Timeouts are u32, expressed in seconds instead of nanoseconds.

  • Three user_data fields: u128, u64, and u32.

  • AccountMutable and AccountImmutable are now a single groove, Account.

  • Reordered the Transfer and Account fields to keep @alignOf(T) == 16.

Direct changes:

Docs: All changes have been reflected, and new examples have been added for the user_data_{128|64|32} fields.

Benchmarks, Demos, Samples, Tests, and Repl: The code has been refactored.

Clients: Besides the naming and type refactors, these are the specific changes for each client:

  • Node client: No extensive changes were needed, as BigInt was already being used for u64 values.

  • Java client: The API now defaults all get accessors to byte[] for IDs and user data, and to BigInteger for balances and amounts. Additionally, overloaded accessors allow the application to get/set values as a pair of primitive longs. This allows integration with third-party numerical libraries without allocating temporary objects such as arrays or BigIntegers.

  • C# client: .NET 7.0 ships with support for UInt128. Users can easily opt in to the new UInt128 or use a replacement type we distribute for .NET Standard 2.1 (which is supported by other runtimes such as .NET Core, Xamarin, Unity, and .NET up to 6.0). There are also extension methods for converting to/from BigInteger.

  • Go client: We expose the type Uint128 with a set of conversion functions, including math/big.Int. Users thus have the option to use big.Int or any other implementation to handle or format 128-bit values.
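The two-halves representation common to these clients can be sketched as follows (type and function names here are illustrative, not necessarily the client's exact API):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/big"
)

// Uint128 models a 128-bit value as two 64-bit halves, mirroring the general
// approach described above (illustrative, not the actual client type).
type Uint128 struct {
	Lo, Hi uint64
}

// BigInt converts the 128-bit value to a math/big.Int.
func (u Uint128) BigInt() *big.Int {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[0:8], u.Hi)
	binary.BigEndian.PutUint64(buf[8:16], u.Lo)
	return new(big.Int).SetBytes(buf[:])
}

// FromBigInt converts a big.Int (must satisfy 0 <= v < 2^128) back into halves.
func FromBigInt(v *big.Int) Uint128 {
	var buf [16]byte
	v.FillBytes(buf[:]) // big-endian, zero-padded to 16 bytes
	return Uint128{
		Hi: binary.BigEndian.Uint64(buf[0:8]),
		Lo: binary.BigEndian.Uint64(buf[8:16]),
	}
}

func main() {
	u := Uint128{Lo: 1, Hi: 1} // 2^64 + 1
	fmt.Println(u.BigInt())    // 18446744073709551617
	fmt.Println(FromBigInt(u.BigInt()) == u)
}
```

The same pattern underlies the Java pair-of-longs overloads and the C# BigInteger extension methods: two machine words in, one arbitrary-precision value out, with no intermediate allocations beyond the big integer itself.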

Other changes:

  • Increased block_size to 128K.
    Without this, we can't fit everything in a single block due to the size of each object and lsm_batch_multiple:
src/lsm/schema.zig:180:51: error: overflow of integer type 'u32' with value '-1048'
        const padding_size = constants.block_size - padding_offset;
  • Argument --cache-grid is now validated against the minimum size.
    Currently, it's --cache-grid=256MB instead of 128MB due to the block_size changes.

  • Repl parser now accepts identifiers containing numbers (but not starting with numbers).
    Example: user_data_128 would be rejected by the previous version.
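The accepted identifier rule above can be sketched as follows (an illustration of the rule, not the repl's actual tokenizer):

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdentifier accepts identifiers that may contain digits but must not
// start with one, matching the rule described in the bullet above.
func isIdentifier(s string) bool {
	for i, r := range s {
		switch {
		case r == '_' || unicode.IsLetter(r):
			// letters and underscores are allowed anywhere
		case unicode.IsDigit(r):
			if i == 0 {
				return false // must not start with a digit
			}
		default:
			return false
		}
	}
	return s != ""
}

func main() {
	fmt.Println(isIdentifier("user_data_128")) // true
	fmt.Println(isIdentifier("128_data"))      // false
}
```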

Benchmarks

Before:

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
steps [2/4] zig build-exe benchmark ReleaseSafe native-linux... LLVM Emit Object... info: Benchmark running against { 127.0.0.1:43705 }
info(message_bus): connected to replica 0
1225 batches in 73.57 s
load offered = 1000000 tx/s
load accepted = 135926 tx/s
batch latency p00 = 0 ms
batch latency p10 = 18 ms
batch latency p20 = 22 ms
batch latency p30 = 25 ms
batch latency p40 = 26 ms
batch latency p50 = 28 ms
batch latency p60 = 29 ms
batch latency p70 = 29 ms
batch latency p80 = 30 ms
batch latency p90 = 31 ms
batch latency p100 = 3236 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3435 ms
transfer latency p20 = 7277 ms
transfer latency p30 = 12325 ms
transfer latency p40 = 18875 ms
transfer latency p50 = 25889 ms
transfer latency p60 = 33611 ms
transfer latency p70 = 41591 ms
transfer latency p80 = 49480 ms
transfer latency p90 = 56775 ms
transfer latency p100 = 63583 ms

After:

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
info: Benchmark running against { 127.0.0.1:38517 }
info(message_bus): connected to replica 0
1226 batches in 66.92 s
load offered = 1000000 tx/s
load accepted = 149436 tx/s
batch latency p00 = 0 ms
batch latency p10 = 17 ms
batch latency p20 = 19 ms
batch latency p30 = 23 ms
batch latency p40 = 24 ms
batch latency p50 = 25 ms
batch latency p60 = 26 ms
batch latency p70 = 28 ms
batch latency p80 = 30 ms
batch latency p90 = 31 ms
batch latency p100 = 2754 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3181 ms
transfer latency p20 = 6575 ms
transfer latency p30 = 11852 ms
transfer latency p40 = 17253 ms
transfer latency p50 = 23811 ms
transfer latency p60 = 30330 ms
transfer latency p70 = 37195 ms
transfer latency p80 = 43995 ms
transfer latency p90 = 50735 ms
transfer latency p100 = 56930 ms

@cb22 (Contributor) commented Aug 23, 2023

Interesting benchmark results, too. I remember when testing a pure block size increase, that bumped performance too due to the situation wrt compaction pacing (think the sweet spot was 256K).

Would be interesting to run those again (eg, compare main @ 128K).

@batiati (Contributor, Author) commented Aug 24, 2023

Interesting benchmark results, too. I remember when testing a pure block size increase, that bumped performance too due to the situation wrt compaction pacing (think the sweet spot was 256K).

The combination of block_size = 256k + lsm_batch_multiple = 128 gives the best TPS. IMO, mostly because it uses more NVMe bandwidth. I think it may change once the compaction pacing is in place, so it's not worth tuning it yet.

Would be interesting to run those again (eg, compare main @ 128K).

The main branch with 128K blocks showed the same ~150K TPS we measured with this PR's changes.
The difference is that we now write the full 128 bytes of each account during balance updates (no more split between AccountMutable and AccountImmutable). I observed about 2 GB more writes during the same benchmark, but the NVMe could absorb the extra IO, and we still gained by not having to do two lookups for the same object.

❯ ./scripts/benchmark.sh
Formatting replica 0...
Starting replica 0...

Benchmarking...
steps [2/4] zig build-exe benchmark ReleaseSafe native-linux... LLVM Emit Object... info: Benchmark running against { 127.0.0.1:33091 }
info(message_bus): connected to replica 0
1224 batches in 67.04 s
load offered = 1000000 tx/s
load accepted = 149174 tx/s
batch latency p00 = 0 ms
batch latency p10 = 19 ms
batch latency p20 = 22 ms
batch latency p30 = 25 ms
batch latency p40 = 27 ms
batch latency p50 = 29 ms
batch latency p60 = 30 ms
batch latency p70 = 31 ms
batch latency p80 = 33 ms
batch latency p90 = 35 ms
batch latency p100 = 2569 ms
transfer latency p00 = 0 ms
transfer latency p10 = 3714 ms
transfer latency p20 = 7157 ms
transfer latency p30 = 12014 ms
transfer latency p40 = 17586 ms
transfer latency p50 = 23668 ms
transfer latency p60 = 29589 ms
transfer latency p70 = 36722 ms
transfer latency p80 = 43803 ms
transfer latency p90 = 50881 ms
transfer latency p100 = 57049 ms

@batiati batiati changed the title [RFC] Refactor Account and Transfers [WIP] Refactor Account and Transfers to u128 amounts Sep 1, 2023
@batiati batiati force-pushed the batiati-u128 branch 2 times, most recently from ade000a to 24cfb47 Compare September 12, 2023 13:21
@batiati batiati changed the title [WIP] Refactor Account and Transfers to u128 amounts Refactor Accounts and Transfers to u128 amounts Sep 12, 2023
@batiati batiati marked this pull request as ready for review September 12, 2023 13:21
docs/design/data-modeling.md (outdated; resolved)
docs/reference/transfers.md (outdated; resolved)
Comment on lines 272 to 273
* Must not overflow a 64-bit unsigned integer when converted to nanoseconds
and summed with the transfer's timestamp (`error.overflows_timeout`)
Member commented:

I guess this warning is less important now that the user passes in seconds, which we convert to nanoseconds -- unless the TB cluster's clock is 400 years in the future! 😉
May as well leave it in for completeness, though.

Contributor Author commented:

Agreed!
I removed it from here; instead, it's now mentioned in create_transfers.md#overflows_timeout
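For completeness, the overflow condition under discussion can be sketched as follows. Since timeouts are now u32 seconds, the conversion to nanoseconds always fits in u64; only the addition to the transfer's timestamp can overflow. (This is an illustrative sketch of the documented condition, not the actual state machine code; the function name is hypothetical.)

```go
package main

import "fmt"

// expiresAt converts a u32 timeout in seconds to nanoseconds and adds it to a
// u64 nanosecond timestamp, reporting overflow (the overflows_timeout case).
func expiresAt(timestampNs uint64, timeoutSec uint32) (uint64, bool) {
	// The multiplication itself cannot overflow: max u32 * 1e9 ~= 4.3e18 < 2^64.
	timeoutNs := uint64(timeoutSec) * 1_000_000_000
	sum := timestampNs + timeoutNs
	if sum < timestampNs {
		return 0, false // overflows_timeout
	}
	return sum, true
}

func main() {
	if _, ok := expiresAt(^uint64(0)-1, 2); !ok {
		fmt.Println("timestamp near 2^64: would overflow")
	}
	t, _ := expiresAt(1_000, 1)
	fmt.Println(t) // 1000001000
}
```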

src/clients/go/pkg/types/main_test.go (outdated; resolved)
src/tigerbeetle.zig (outdated; resolved)
src/tigerbeetle/repl.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/state_machine.zig (outdated; resolved)
src/tigerbeetle/main.zig (outdated; resolved)
src/tigerbeetle/main.zig (outdated; resolved)
.user_data_32 = 8,
.ledger = 9,
.code = 10,
.timestamp = 11,
Member commented:

🤔 Despite the name "timestamp" here, what this actually corresponds to is the groove's object tree. IMO that should take precedence over the field order, and this tree's id should be listed/assigned before id. (Likewise for transfers.)

Member commented:

what this actually corresponds to is the groove's object tree

However, our usage of timestamp as internal identifier for the groove's object tree is a secondary implementation detail. The primary usage of timestamp is external, as a totally ordered hybrid logical clock. Beyond that, the reason it's last, is because it's the only field that's "filled in"—in the sense that it's appended to the existing data.

Member commented:

Also, ahead of timestamp is id, which is the external identifier and which comes first (it's the "key", followed by the value). Users provide this much of the key/value, consecutively, and then the time is "stamped" on after.

Member commented:

Users provide this much of the key/value, consecutively, and then the time is "stamped" on after.

This code is defining/assigning the LSM tree ids though, it doesn't relate to anything that users see.
From the LSM's perspective, the object tree is the "primary" tree -- object and id trees are special cases in the groove, separate from all of the secondary index trees.

@batiati (Contributor Author) commented Sep 15, 2023:

IMO, there's not much benefit that justifies not following the object's declaration order here.
We already handle timestamp as a special case in the LSM, which is very clear; there's no need to reinforce it.

Contributor Author commented:

Anyway, those fields are intended to be stable IDs and may become unordered in the future if we ever change our schema. Either we will follow the declaration order or the numerical order. So, I don't have strong feelings about it.

src/state_machine.zig Outdated Show resolved Hide resolved
Comment on lines +1838 to +1844
\\ transfer T1 A1 A3 -0 _ U1 U1 U1 _ L1 C2 _ _ _ _ _ _ _ _ exists_with_different_flags
\\ transfer T1 A3 A1 -0 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_debit_account_id
\\ transfer T1 A1 A4 -0 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_credit_account_id
\\ transfer T1 A1 A3 -0 _ U1 U1 U1 1 L1 C1 _ PEN _ _ _ _ _ _ exists_with_different_amount
\\ transfer T1 A1 A3 123 _ U1 U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_128
\\ transfer T1 A1 A3 123 _ _ U1 U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_64
\\ transfer T1 A1 A3 123 _ _ _ U1 1 L1 C2 _ PEN _ _ _ _ _ _ exists_with_different_user_data_32
Member commented:

Perfect!

Member commented:

These test vectors are always so nice and concise.

src/tigerbeetle.zig (resolved)
@batiati (Contributor Author) commented Sep 15, 2023

Note: .NET tests are failing on the macOS self-hosted runner.
Please ignore them for this PR; I suspect cache issues, since the self-hosted runner has no isolation when building .NET 7.0 for this PR and .NET 6.0 for everything else.

@batiati batiati added this pull request to the merge queue Sep 15, 2023
Merged via the queue into main with commit c5d2d4b Sep 15, 2023
142 of 156 checks passed
@batiati batiati deleted the batiati-u128 branch September 15, 2023 21:48
sentientwaffle added a commit that referenced this pull request Sep 21, 2023
@sentientwaffle sentientwaffle mentioned this pull request Jul 26, 2023
92 tasks
* @throws IllegalStateException if not at a {@link #isValidPosition valid position}.
* @see <a href="https://docs.tigerbeetle.com/reference/accounts/#user_data_64">user_data_64</a>
*/
public long getUserData64() {
Member commented:

🤔 It feels like this API makes it too easy to ignore the high-order bits. Should we perhaps throw an exception here if this is not the entirety of the user data?

Member commented:

And this is also inconsistent with how, e.g., getDebitsPosted works (that API makes more sense to me personally).

@batiati it feels like maybe we have an incomplete refactor here?
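The reviewer's suggestion of failing loudly on truncation could look roughly like this (a Go sketch of the idea, not the Java client's actual API; the function name is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// userData64 returns the low 64 bits of a 128-bit user-data value, but only
// when the high half is zero, so truncation can't pass silently. This sketches
// the reviewer's "checked getter" suggestion; it is not the client's API.
func userData64(lo, hi uint64) (uint64, error) {
	if hi != 0 {
		return 0, errors.New("user data does not fit in 64 bits")
	}
	return lo, nil
}

func main() {
	v, err := userData64(42, 0)
	fmt.Println(v, err) // 42 <nil>
	_, err = userData64(42, 1)
	fmt.Println(err != nil) // true
}
```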
