refactor(experimental): pre-build the byte array before encoding with codecs #1865

Merged: 2 commits into master on Dec 1, 2023
Conversation

@lorisleiva (Collaborator) commented Nov 18, 2023

Note: this is a draft PR focusing on codecs-core to gather feedback; I will continue to update the other libraries afterwards.

This PR updates the codecs API so that both the encoding and decoding functions have access to the entire byte array. Let’s first look at the changes this PR introduces and then see why they are valuable.

API Changes

  • Encode: The Encoder type contains a new function, write. Unlike the encode function, which creates a new Uint8Array and returns it directly, the write function updates the provided bytes argument at the provided offset. It then returns the next offset that should be written to.

    // Before
    type Encoder<T> = {
      encode: (value: T) => Uint8Array;
      // ...
    };
    
    // After
    type Encoder<T> = {
      encode: (value: T) => Uint8Array;
      write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
      // ...
    };

    A new createEncoder function is provided to derive the encode function from the write function automatically.

    const myU8Encoder = createEncoder({
      fixedSize: 1,
      write: (value: number, bytes: Uint8Array, offset: Offset) => {
        bytes.set([value], offset); // `set` expects an array-like value.
        return offset + 1;
      },
    });
  • Decode: The decode function already followed a similar approach by using offsets. The newly added read function takes over this responsibility. The only difference is that the offset is now a mandatory argument, to stay consistent with the write function. The decode function becomes syntactic sugar for accessing the decoded value directly.

    // Before
    type Decoder<T> = {
      decode: (bytes: Uint8Array, offset?: Offset) => [T, Offset];
      // ...
    };
    
    // After
    type Decoder<T> = {
      decode: (bytes: Uint8Array, offset?: Offset) => T;
      read: (bytes: Uint8Array, offset: Offset) => [T, Offset];
      // ...
    };

    As with the Encoder changes, a new createDecoder function is provided to fill in the decode function using the read function.

    const myU8Decoder = createDecoder({
      fixedSize: 1,
      read: (bytes: Uint8Array, offset: Offset) => {
        return [bytes[offset], offset + 1];
      },
    });
  • Sizes: Because we now need to pre-build the entire byte array before encoding, we need a way to find the variable size of a given value. We introduce a new variableSize function and narrow the types so that it can only be provided when fixedSize is null (see the sketch after this list).

    // Before
    type Encoder<T> = {
      fixedSize: number | null;
      maxSize: number | null;
    }
    
    // After
    type Encoder<T> = { ... } & (
      | { fixedSize: number; }
      | { fixedSize: null; variableSize: (value: T) => number; maxSize?: number }
    )

    We do something similar for the Decoder, except that it doesn’t need to know about the variable size (this would make no sense, as the type parameter T of a decoder refers to the decoded type, not the type to encode).

    // Before
    type Decoder<T> = {
      fixedSize: number | null;
      maxSize: number | null;
    }
    
    // After
    type Decoder<T> = { ... } & (
      | { fixedSize: number; }
      | { fixedSize: null; maxSize?: number }
    )
  • Description: This PR takes the opportunity of this refactoring to remove the description attribute from the codecs API, which brought little value to the end-user.
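
To make the new shape concrete, here is a minimal sketch of a variable-size encoder under this API. The u32-length-prefixed string layout is purely illustrative and not a codec added by this PR:

const myStringEncoder = createEncoder({
  fixedSize: null,
  // Total size = 4-byte length prefix + one byte per UTF-8 byte of the value.
  variableSize: (value: string) => 4 + new TextEncoder().encode(value).length,
  write: (value: string, bytes: Uint8Array, offset: Offset) => {
    const utf8 = new TextEncoder().encode(value);
    // Write a little-endian u32 length prefix, then the UTF-8 bytes.
    new DataView(bytes.buffer, bytes.byteOffset).setUint32(offset, utf8.length, true);
    bytes.set(utf8, offset + 4);
    return offset + 4 + utf8.length;
  },
});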

Why?

  • Consistent API: The implementations of the encode/decode and write/read functions are now consistent with each other. Before, one used offsets to navigate an entire byte array while the other returned and merged byte arrays together. Now they both use offsets to navigate the byte array being encoded or decoded.

  • Performant API: By pre-building the byte array once, we avoid creating multiple byte array instances and merging them together.

  • Non-linear serialisation: The main reason it’s important for the encode method to have access to the entire encoded byte array is that it allows us to offer more complex codec primitives that can jump back and forth within the buffer. Without it, we lock ourselves into supporting only serialisation strategies that read linearly, which isn’t always the case. For instance, imagine the size of an array is stored at the very beginning of the account whereas the items themselves are stored at the end. Because we now have the full byte array when encoding, we can write the size at the beginning whilst inserting the items at the requested offset. We could even offer a getOffsetCodec (or similar) that shifts the offset forward or backward to compose more complex, non-linear data structures (see the sketch below). This would simply be impossible with the previous format of the encode function.
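
As a purely hypothetical sketch of that getOffsetCodec idea (nothing like it ships in this PR; the name and shape are assumptions, and a fixed-size inner encoder is assumed for brevity):

function offsetEncoder<T>(encoder: Encoder<T>, relativeOffset: number): Encoder<T> {
  return createEncoder({
    fixedSize: encoder.fixedSize,
    write: (value: T, bytes: Uint8Array, offset: Offset) =>
      // Write at the shifted position, then shift the returned offset back
      // so that subsequent writers continue from where they expect.
      encoder.write(value, bytes, offset + relativeOffset) - relativeOffset,
  });
}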

@lorisleiva lorisleiva self-assigned this Nov 18, 2023
@mergify mergify bot added the community label Nov 18, 2023
@mergify mergify bot requested a review from a team November 18, 2023 00:32
Base automatically changed from loris/rename-codec-options to master November 20, 2023 12:07
@mcintyre94 (Collaborator) left a comment

Just a clarifying question: when you say "offer more complex codec primitives that are able to jump back and forth within the buffer", do we have a use case for that more complex implementation now? Or is it just something you've recognised as a limitation of the current ones?

Also, am I right in thinking the performance improvement would increase based on how much encoders/decoders are composed? I think that's potentially a really important motivation too, because things like the transaction decoder are pretty complex.

/**
 * Use the provided Encoder to encode the given value to a `Uint8Array`.
 */
export function encode<T>(value: T, encoder: Encoder<T>): Uint8Array {
@mcintyre94 (Collaborator) commented:
I think my only concern would be that encode and decode are really generic terms for us to be exporting into people's apps, whereas something like getTransactionEncoder is less likely to clash with anything they're doing. They're definitely the right names to use for this though, so not really suggesting any change there!

@lorisleiva (Collaborator, Author) replied:
Thanks for the review Callum!

Regarding the encode and decode functions: I agree they are pretty generic. I’d love to keep the codec.encode and codec.decode API, which we could. It just comes with its own set of challenges, as codecs would then need to provide four functions, two of which just delegate to the other two. On the plus side, we’d no longer have that annoying [0] suffix on the main API.

Regarding offering more complex primitives: The main issue here is that the current encode and decode logic are not symmetrical, meaning there are things we can do with decode that we cannot do with encode because we do not have access to the entire buffer. I do have some examples of that. For instance, imagine an offsetCodec(codec, relativeOffset) helper that, given a codec, shifts its position on the buffer by the given relative offset. We can do that right now with Decoders, but it’s impossible with Encoders because they use a different mechanism that relies on linear serialisation only. An example of why we would need something like an offsetCodec helper is any array whose size does not immediately prefix its items.

Regarding performance improvements: The decoding logic should stay pretty much the same because it already uses a single buffer, but the encoding logic should use far fewer Uint8Array instances, which I believe will be a big performance boost.

@buffalojoec (Collaborator) commented Nov 22, 2023

> Another way to achieve the same thing but keep the previous encoder.encode(value) API is to introduce a new set of low-level methods such as write and read on the Encoder and Decoder types respectively. However, this means every single Encoder and Decoder needs to implement the same encode and decode helper method over and over again. It also makes it easy for people to misuse it and use the encode method instead of the write method when chaining encoders together.

I think this sounds like the better approach to me. This API aligns with borsh in Rust (not that we necessarily care to align with borsh), so it's not unfamiliar, and I would really hate to see any partial destruction of the simple TS/JS syntax that is:

const bytes: Uint8Array = getStructCodec(..).encode(myObject);

If the intention is to provide better low-level functionality, better to do that with write and let people still get the slick API of encode.

I wouldn't even hate just adding offset to the return type of encode, but that's tripped me up with decode before, too. I'd probably rather see the offset removed from decode to match encode, and instead used in read.

> • Performant API: By pre-building the byte array once, we avoid creating multiple byte array instances and merging them together.

I understand you're also going for performance here, and you're right that the current implementation degrades performance and memory; however, I think having a simple, user-friendly API alongside a lower-level performant one is more than enough.

@lorisleiva (Collaborator, Author) replied:

Thanks Joe! I agree, it would make me sad losing the getStructCodec(..).encode(myObject) API.

I also like the alternative API; it's just a bit more of a pain to write codecs with it. That being said, I can offer helper functions like createEncoder and createDecoder that accept an Encoder or Decoder with write and read functions and provide the encode and decode functions for you.

So, for instance, if I wanted to create a number encoder, instead of this:

const myEncoder: Encoder<number> = {
  fixedSize: 1,
  encode(value, bytes, offset): Offset {
    bytes.set([value], offset); // `set` expects an array-like value.
    return offset + 1;
  },
};

I'd be doing this:

const myEncoder: Encoder<number> = createEncoder({
  fixedSize: 1,
  write(value, bytes, offset): Offset {
    bytes.set([value], offset);
    return offset + 1;
  },
});

// `createEncoder` basically just adds the `encode` function for us:
myEncoder.encode(42); // <- Uint8Array.

What do you think?

@buffalojoec (Collaborator) replied:

> Thanks Joe! I agree, it would make me sad losing the getStructCodec(..).encode(myObject) API. […] What do you think?

Yep, I think that's great! Lgtm 💪🏼

@mcintyre94 (Collaborator) commented:

Agreed, that looks like a really nice API to me. Only question is: how easy is it to compose e.g. a struct encoder using that? Would you internally use e.g. the write function of the nested fields, with the encode function just being the external API?

@lorisleiva (Collaborator, Author) replied:

> Agreed, that looks like a really nice API to me. Only question is: how easy is it to compose e.g. a struct encoder using that? […]

That's right: codecs that are composed from other codecs should only use the write and read functions internally. The encode and decode functions are syntactic sugar for the end-user.
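
For illustration, a composed encoder could look something like this rough sketch (not the actual getStructCodec implementation; both fields are assumed to be fixed-size):

const getPointEncoder = (
  x: Encoder<number>,
  y: Encoder<number>,
): Encoder<{ x: number; y: number }> =>
  createEncoder({
    // Two fixed-size fields make the struct fixed-size too.
    fixedSize: (x.fixedSize ?? 0) + (y.fixedSize ?? 0),
    write: (value, bytes, offset) => {
      offset = x.write(value.x, bytes, offset); // Each child writes in place...
      return y.write(value.y, bytes, offset); // ...and returns the next offset.
    },
  });

// Only the outer `encode` allocates a buffer; the children never do.
getPointEncoder(myU8Encoder, myU8Encoder).encode({ x: 1, y: 2 }); // Uint8Array [1, 2]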

@lorisleiva force-pushed the loris/write-read-codecs branch 2 times, most recently from 878c7e8 to ec7a4e1 on November 27, 2023 13:19
@lorisleiva (Collaborator, Author) commented:

I've updated the PR and its description.

@steveluscher (Collaborator) left a comment: [giphy.gif]

packages/codecs-core/src/codec.ts
Comment on lines +7 to 11

  write: (value: string, bytes, offset) => {
      const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(new Uint8Array(charCodes));
-     return bytes;
+     bytes.set(charCodes, offset);
+     return offset + 32;
  },
@steveluscher (Collaborator) commented:
Is there any reason to make this context-y with this?

Suggested change

- write: (value: string, bytes, offset) => {
-     const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(charCodes, offset);
-     return offset + 32;
- },
+ write(value: string, bytes, offset) {
+     const length = this.fixedSize;
+     const charCodes = [...value.slice(0, length)].map(char => Math.min(char.charCodeAt(0), 255));
+     bytes.set(charCodes, offset);
+     return offset + length;
+ },

/**
 * Writes the encoded value into the provided byte array at the given offset.
 * Returns the offset of the next byte after the encoded value.
 */
readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
@steveluscher (Collaborator) commented:
You could achieve the context-y suggestion above by doing something like this:

Suggested change

- readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
+ readonly write: (
+     this: Readonly<{ fixedSize: number | null }>,
+     value: T,
+     bytes: Uint8Array,
+     offset: Offset,
+ ) => Offset;

Then something like this later:

const context = {fixedSize};
return {
  // ...
  write(...args) { return encoder.write.apply(context, args); },
}

@lorisleiva (Collaborator, Author) replied:
Hmm, I don't know that I like this, to be honest. It's making me implement codecs thinking: "it's possible that a parent codec is going to inject a different size and therefore mess up this implementation". 🤔

@buffalojoec (Collaborator) left a comment:

Looking slick! One question from me popped up.

packages/codecs-core/src/codec.ts
@mcintyre94 (Collaborator) left a comment:

This looks good to me now! I think keeping encode/decode alongside the new createEncoder and createDecoder helpers is much nicer too.

@lorisleiva (Collaborator, Author) commented Dec 1, 2023

Merge activity

@lorisleiva lorisleiva merged commit 7800e3b into master Dec 1, 2023
5 of 7 checks passed
@lorisleiva lorisleiva deleted the loris/write-read-codecs branch December 1, 2023 15:17
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…1884)

This PR continues to refactor the codecs libraries as described in PR #1865. This one focuses on `@solana/codecs-numbers`.
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…1885)

This PR continues to refactor the codecs libraries as described in PR #1865. This one focuses on `@solana/codecs-strings`.

Here we can see the downside of having to compute the size beforehand for base-X strings: unless we add some caching, we end up encoding twice, once to get the size of the buffer and once to fill it.
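
A sketch of that cost, using a hypothetical base58ToBytes helper to stand in for the real string codec internals:

const base58Encoder = createEncoder({
  fixedSize: null,
  variableSize: (value: string) => base58ToBytes(value).length, // Encodes once to size the buffer...
  write: (value: string, bytes: Uint8Array, offset: Offset) => {
    const encoded = base58ToBytes(value); // ...and encodes again to fill it.
    bytes.set(encoded, offset);
    return offset + encoded.length;
  },
});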
lorisleiva added a commit that referenced this pull request Dec 1, 2023
…ze codecs (#1903)

As suggested [here](#1865 (comment)), this PR removes the `{ fixedSize: null }` attribute from `VariableSize*` types. This makes it easier and less confusing to create and use variable-size codecs.

Since it's now slightly less convenient to find out whether a codec is fixed-size, this PR also offers type guards to make this easier.
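
A sketch of what such a type guard could look like (the isFixedSize name and exact signature are assumptions based on this description):

function isFixedSize<T>(encoder: Encoder<T>): encoder is Encoder<T> & { fixedSize: number } {
  // After #1903, only fixed-size codecs carry a numeric `fixedSize`.
  return 'fixedSize' in encoder && typeof encoder.fixedSize === 'number';
}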
@github-actions (bot) commented:
Because there has been no activity on this PR for 14 days since it was merged, it has been automatically locked. Please open a new issue if it requires a follow up.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2023