refactor(experimental): pre-build the byte array before encoding with codecs #1865

Merged: 2 commits into master on Dec 1, 2023
Conversation

@lorisleiva (Collaborator) commented Nov 18, 2023

Note: this is a draft PR focusing on codecs-core to gather feedback; I will continue to update the other libraries afterwards.

This PR updates the codecs API so that both the encoding and decoding functions have access to the entire byte array. Let’s first look at the changes this PR introduces and then see why they are valuable.

API Changes

  • Encode: The Encoder type contains a new function, write. Unlike the encode function, which creates a new Uint8Array and returns it directly, the write function updates the provided bytes argument at the provided offset. It then returns the next offset that should be written to.

    // Before
    type Encoder<T> = {
      encode: (value: T) => Uint8Array;
      // ...
    };
    
    // After
    type Encoder<T> = {
      encode: (value: T) => Uint8Array;
      write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
      // ...
    };

    A new createEncoder function is provided to derive the encode function from the write function automatically.

    const myU8Encoder = createEncoder({
      fixedSize: 1,
      write: (value: number, bytes: Uint8Array, offset: Offset) => {
        bytes.set([value], offset); // `set` expects an array-like value.
        return offset + 1;
      },
    });
  • Decode: The decode function already followed a similar approach by using offsets. The newly added read function takes over this responsibility. The only difference is that the offset is now a mandatory argument, to stay consistent with the write function. The decode function becomes syntactic sugar for accessing the decoded value directly.

    // Before
    type Decoder<T> = {
      decode: (bytes: Uint8Array, offset?: Offset) => [T, Offset];
      // ...
    };
    
    // After
    type Decoder<T> = {
      decode: (bytes: Uint8Array, offset?: Offset) => T;
      read: (bytes: Uint8Array, offset: Offset) => [T, Offset];
      // ...
    };

    As with the Encoder changes, a new createDecoder function is provided to fill in the decode function using the read function.

    const myU8Decoder = createDecoder({
      fixedSize: 1,
      read: (bytes: Uint8Array, offset: Offset) => {
        return [bytes[offset], offset + 1];
      },
    });
  • Sizes: Because we now need to pre-build the entire byte array before encoding, we need a way to find the variable size of a given value. We introduce a new variableSize function and narrow the types so that it can only be provided when fixedSize is null (see the sketch after this list).

    // Before
    type Encoder<T> = {
      fixedSize: number | null;
      maxSize: number | null;
    }
    
    // After
    type Encoder<T> = { ... } & (
      | { fixedSize: number; }
      | { fixedSize: null; variableSize: (value: T) => number; maxSize?: number }
    )

    We do something similar for the Decoder, except that it doesn’t need to know about the variable size (this would make no sense, as the type parameter T of a decoder refers to the decoded type, not the type to encode).

    // Before
    type Decoder<T> = {
      fixedSize: number | null;
      maxSize: number | null;
    }
    
    // After
    type Decoder<T> = { ... } & (
      | { fixedSize: number; }
      | { fixedSize: null; maxSize?: number }
    )
  • Description: This PR takes the opportunity of this refactoring to remove the description attribute from the codecs API, which brought little value to the end-user.
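
To make the new shape concrete, here is a minimal sketch of a variable-size encoder under this API. The u32-length-prefixed string layout is purely illustrative and not a codec added by this PR:

const myStringEncoder = createEncoder({
  fixedSize: null,
  // Total size = 4-byte length prefix + one byte per UTF-8 byte of the value.
  variableSize: (value: string) => 4 + new TextEncoder().encode(value).length,
  write: (value: string, bytes: Uint8Array, offset: Offset) => {
    const utf8 = new TextEncoder().encode(value);
    // Write a little-endian u32 length prefix, then the UTF-8 bytes.
    new DataView(bytes.buffer, bytes.byteOffset).setUint32(offset, utf8.length, true);
    bytes.set(utf8, offset + 4);
    return offset + 4 + utf8.length;
  },
});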

Why?

  • Consistent API: The implementations of the encode/decode and write/read functions are now consistent with each other. Before, one used offsets to navigate an entire byte array while the other returned and merged byte arrays together. Now they both use offsets to navigate the byte array being encoded or decoded.

  • Performant API: By pre-building the byte array once, we avoid creating multiple byte array instances and merging them together.

  • Non-linear serialisation: The main reason it’s important for the encode method to have access to the entire encoded byte array is that it allows us to offer more complex codec primitives that can jump back and forth within the buffer. Without it, we lock ourselves into supporting only serialisation strategies that read linearly, which isn’t always the case. For instance, imagine the size of an array is stored at the very beginning of the account whereas the items themselves are stored at the end. Because we now have the full byte array when encoding, we can write the size at the beginning whilst inserting the items at the requested offset. We could even offer a getOffsetCodec (or similar) that shifts the offset forward or backward to compose more complex, non-linear data structures (see the sketch below). This would simply be impossible with the previous format of the encode function.
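
As a purely hypothetical sketch of that getOffsetCodec idea (nothing like it ships in this PR; the name and shape are assumptions, and a fixed-size inner encoder is assumed for brevity):

function offsetEncoder<T>(encoder: Encoder<T>, relativeOffset: number): Encoder<T> {
  return createEncoder({
    fixedSize: encoder.fixedSize,
    write: (value: T, bytes: Uint8Array, offset: Offset) =>
      // Write at the shifted position, then shift the returned offset back
      // so that subsequent writers continue from where they expect.
      encoder.write(value, bytes, offset + relativeOffset) - relativeOffset,
  });
}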

@lorisleiva lorisleiva self-assigned this Nov 18, 2023
@mergify mergify bot added the community label Nov 18, 2023
@mergify mergify bot requested a review from a team November 18, 2023 00:32
Base automatically changed from loris/rename-codec-options to master November 20, 2023 12:07
@mcintyre94 (Collaborator) left a comment

Just a clarifying question: when you say "offer more complex codec primitives that are able to jump back and forth within the buffer", do we have a use case for that more complex implementation now? Or is it just something you've recognised as a limitation of the current ones?

Also, am I right in thinking the performance improvement would increase based on how much encoders/decoders are composed? I think that's potentially a really important motivation too, because things like the transaction decoder are pretty complex.

/**
 * Use the provided Encoder to encode the given value to a `Uint8Array`.
 */
export function encode<T>(value: T, encoder: Encoder<T>): Uint8Array {
@mcintyre94 (Collaborator) commented:
I think my only concern would be that encode and decode are really generic terms for us to be exporting into people's apps, whereas something like getTransactionEncoder is less likely to clash with anything they're doing. They're definitely the right names to use for this though, so not really suggesting any change there!

@lorisleiva (Collaborator, Author) replied:
Thanks for the review Callum!

Regarding the encode and decode functions: I agree they are pretty generic. I’d love to keep the codec.encode and codec.decode API, which we could. It just comes with its own set of challenges, as codecs would then need to provide four functions, two of which just delegate to the other two. On the plus side, we’d no longer have that annoying [0] suffix on the main API.

Regarding offering more complex primitives: The main issue here is that the current encode and decode logic are not symmetrical, meaning there are things we can do with decode that we cannot do with encode because we do not have access to the entire buffer. I do have some examples of that. For instance, imagine an offsetCodec(codec, relativeOffset) helper that, given a codec, shifts its position on the buffer by the given relative offset. We can do that right now with Decoders, but it’s impossible with Encoders because they use a different mechanism that relies on linear serialisation only. An example of why we would need something like an offsetCodec helper is any array whose size does not immediately prefix its items.

Regarding performance improvements: The decoding logic should stay pretty much the same because it already uses a single buffer, but the encoding logic should use far fewer Uint8Array instances, which I believe will be a big performance boost.

@buffalojoec (Collaborator) commented Nov 22, 2023

> Another way to achieve the same thing but keep the previous encoder.encode(value) API is to introduce a new set of low-level methods such as write and read on the Encoder and Decoder types respectively. However, this means every single Encoder and Decoder needs to implement the same encode and decode helper method over and over again. It also makes it easy for people to misuse it and use the encode method instead of the write method when chaining encoders together.

I think this sounds like the better approach to me. This API aligns with borsh in Rust (not that we necessarily care to align with borsh), so it's not unfamiliar, and I would really hate to see any partial destruction of the simple TS/JS syntax that is:

const bytes: Uint8Array = getStructCodec(..).encode(myObject);

If the intention is to provide better low-level functionality, better to do that with write and let people still get the slick API of encode.

I wouldn't even hate just adding offset to the return type of encode, but that's tripped me up with decode before, too. I'd probably rather see the offset removed from decode to match encode, and instead used in read.

> • Performant API: By pre-building the byte array once, we avoid creating multiple byte array instances and merging them together.

I understand you're also going for performance here, and you're right that the current implementation degrades performance and memory; however, I think having a simple, user-friendly API alongside a lower-level performant one is more than enough.

@lorisleiva (Collaborator, Author) replied:

Thanks Joe! I agree, it would make me sad losing the getStructCodec(..).encode(myObject) API.

I also like the alternative API; it's just a bit more of a pain to write codecs with it. That being said, I can offer helper functions like createEncoder and createDecoder that accept an Encoder or Decoder with write and read functions and provide the encode and decode functions for you.

So, for instance, if I wanted to create a number encoder, instead of this:

const myEncoder: Encoder<number> = {
  fixedSize: 1,
  encode(value, bytes, offset): Offset {
    bytes.set([value], offset); // `set` expects an array-like value.
    return offset + 1;
  },
};

I'd be doing this:

const myEncoder: Encoder<number> = createEncoder({
  fixedSize: 1,
  write(value, bytes, offset): Offset {
    bytes.set([value], offset);
    return offset + 1;
  },
});

// `createEncoder` basically just adds the `encode` function for us:
myEncoder.encode(42); // <- Uint8Array.

What do you think?

@buffalojoec (Collaborator) replied:

> Thanks Joe! I agree, it would make me sad losing the getStructCodec(..).encode(myObject) API. […] What do you think?

Yep, I think that's great! Lgtm 💪🏼

@mcintyre94 (Collaborator) commented:

Agreed, that looks like a really nice API to me. Only question is: how easy is it to compose e.g. a struct encoder using that? Would you internally use e.g. the write function of the nested fields, with the encode function just being the external API?

@lorisleiva (Collaborator, Author) replied:

> Agreed, that looks like a really nice API to me. Only question is: how easy is it to compose e.g. a struct encoder using that? […]

That's right: codecs that are composed from other codecs should only use the write and read functions internally. The encode and decode functions are syntactic sugar for the end-user.
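
For illustration, a composed encoder could look something like this rough sketch (not the actual getStructCodec implementation; both fields are assumed to be fixed-size):

const getPointEncoder = (
  x: Encoder<number>,
  y: Encoder<number>,
): Encoder<{ x: number; y: number }> =>
  createEncoder({
    // Two fixed-size fields make the struct fixed-size too.
    fixedSize: (x.fixedSize ?? 0) + (y.fixedSize ?? 0),
    write: (value, bytes, offset) => {
      offset = x.write(value.x, bytes, offset); // Each child writes in place...
      return y.write(value.y, bytes, offset); // ...and returns the next offset.
    },
  });

// Only the outer `encode` allocates a buffer; the children never do.
getPointEncoder(myU8Encoder, myU8Encoder).encode({ x: 1, y: 2 }); // Uint8Array [1, 2]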

@lorisleiva force-pushed the loris/write-read-codecs branch 2 times, most recently from 878c7e8 to ec7a4e1 on November 27, 2023 13:19
@lorisleiva (Collaborator, Author) commented:

I've updated the PR and its description.

@steveluscher (Collaborator) left a comment: [giphy.gif]

packages/codecs-core/src/codec.ts
Comment on lines +7 to 11

  write: (value: string, bytes, offset) => {
      const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(new Uint8Array(charCodes));
-     return bytes;
+     bytes.set(charCodes, offset);
+     return offset + 32;
  },
@steveluscher (Collaborator) commented:
Is there any reason to make this context-y with this?

Suggested change

- write: (value: string, bytes, offset) => {
-     const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(charCodes, offset);
-     return offset + 32;
- },
+ write(value: string, bytes, offset) {
+     const length = this.fixedSize;
+     const charCodes = [...value.slice(0, length)].map(char => Math.min(char.charCodeAt(0), 255));
+     bytes.set(charCodes, offset);
+     return offset + length;
+ },

/**
 * Writes the encoded value into the provided byte array at the given offset.
 * Returns the offset of the next byte after the encoded value.
 */
readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
@steveluscher (Collaborator) commented:
You could achieve the context-y suggestion above by doing something like this:

Suggested change

- readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
+ readonly write: (
+     this: Readonly<{ fixedSize: number | null }>,
+     value: T,
+     bytes: Uint8Array,
+     offset: Offset,
+ ) => Offset;

Then something like this later:

const context = {fixedSize};
return {
  // ...
  write(...args) { return encoder.write.apply(context, args); },
}

@lorisleiva (Collaborator, Author) replied:
Hmm, I don't know that I like this, to be honest. It's making me implement codecs thinking: "it's possible that a parent codec is going to inject a different size and therefore mess up this implementation". 🤔

@buffalojoec (Collaborator) left a comment:

Looking slick! One question from me popped up.

packages/codecs-core/src/codec.ts
@mcintyre94 (Collaborator) left a comment:

This looks good to me now! I think keeping encode/decode alongside the new createEncoder and createDecoder helpers is much nicer too.

@lorisleiva (Collaborator, Author) commented Dec 1, 2023

Merge activity

@lorisleiva lorisleiva merged commit 7800e3b into master Dec 1, 2023
5 of 7 checks passed
@lorisleiva lorisleiva deleted the loris/write-read-codecs branch December 1, 2023 15:17
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…1884)

This PR continues to refactor the codecs libraries as described in PR #1865. This one focuses on `@solana/codecs-numbers`.
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…1885)

This PR continues to refactor the codecs libraries as described in PR #1865. This one focuses on `@solana/codecs-strings`.

Here we can see the downside of having to compute the size beforehand for base-X strings: unless we add some caching, we end up encoding twice, once to get the size of the buffer and once to fill it.
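
A sketch of that cost, using a hypothetical base58ToBytes helper to stand in for the real string codec internals:

const base58Encoder = createEncoder({
  fixedSize: null,
  variableSize: (value: string) => base58ToBytes(value).length, // Encodes once to size the buffer...
  write: (value: string, bytes: Uint8Array, offset: Offset) => {
    const encoded = base58ToBytes(value); // ...and encodes again to fill it.
    bytes.set(encoded, offset);
    return offset + encoded.length;
  },
});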
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…ze codecs (#1903)

As suggested [here](#1865 (comment)), this PR removes the `{ fixedSize: null }` attribute from `VariableSize*` types. This makes it easier and less confusing to create and use variable-size codecs.

Since it's now slightly less convenient to find out whether a codec is fixed-size, this PR also offers type guards to make this easier.
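
A sketch of what such a type guard could look like (the isFixedSize name and exact signature are assumptions based on this description):

function isFixedSize<T>(encoder: Encoder<T>): encoder is Encoder<T> & { fixedSize: number } {
  // After #1903, only fixed-size codecs carry a numeric `fixedSize`.
  return 'fixedSize' in encoder && typeof encoder.fixedSize === 'number';
}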
@github-actions (bot) commented:
Because there has been no activity on this PR for 14 days since it was merged, it has been automatically locked. Please open a new issue if it requires a follow up.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2023