refactor(experimental): pre-build the byte array before encoding with codecs #1865
Conversation
Just a clarifying question, when you say "offer more complex codec primitives that are able to jump back and forth within the buffer" do we have a use case for that more complex implementation now? Or is it just something you've recognised as a limitation of the current ones?
Also am I right in thinking the performance improvement would increase based on how much encoders/decoders are composed? I think that's potentially a really important motivation too because things like the transaction decoder are pretty complex
packages/codecs-core/src/codec.ts
Outdated
```ts
/**
 * Use the provided Encoder to encode the given value to a `Uint8Array`.
 */
export function encode<T>(value: T, encoder: Encoder<T>): Uint8Array {
```
I think my only concern would be that `encode` and `decode` are really generic terms for us to be exporting into people's apps, where stuff like `getTransactionEncoder` is less likely to clash with anything they're doing
They're definitely the right names to use for this though, so not really suggesting any change there!
Thanks for the review Callum!

Regarding the `encode` and `decode` functions: I agree they are pretty generic. I'd love to keep the `codec.encode` and `codec.decode` API, which we could. It just comes with its own set of challenges, as codecs will now need to provide 4 functions such that two of them are just delegating to the other two. On the plus side, we no longer have that annoying `[0]` suffix on the main API.

Regarding offering more complex primitives: The main issue here is that the current `encode` and `decode` logic are not symmetrical, meaning there are things we can do with `decode` that we cannot do with `encode` because we do not have access to the entire buffer. I do have some examples of that. For instance, imagine an `offsetCodec(codec, relativeOffset)` helper that, given a codec, shifts its position on the buffer by the given relative offset. We can do that right now with `Decoder`s, but it's impossible to do with `Encoder`s because they use a different mechanism that relies on linear serialisation only. An example of why we would need something like an `offsetCodec` helper is any array whose size does not immediately prefix the items (see the sketch below).
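For illustration only, the `Decoder` shape and the `offsetDecoder` helper below are hypothetical sketches of that idea, not something this PR adds:

```ts
// Hypothetical decoder shape: reads a value from `bytes` starting at `offset`
// and returns the decoded value together with the next offset.
type Decoder<T> = {
    decode: (bytes: Uint8Array, offset: number) => [T, number];
};

// Shifts the wrapped decoder's position by a relative offset. This is easy on
// the decoding side because decoders already receive the whole byte array;
// the current encoders cannot express it because they only append linearly.
function offsetDecoder<T>(decoder: Decoder<T>, relativeOffset: number): Decoder<T> {
    return {
        decode: (bytes, offset) => decoder.decode(bytes, offset + relativeOffset),
    };
}
```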
Regarding performance improvements: The decoding logic should stay pretty much the same because it already uses a single buffer, but the encoding logic should use drastically fewer instances of `Uint8Array`, which I believe will be a big performance boost.
I think this sounds like a better approach to me. This API aligns with `const bytes: Uint8Array = getStructCodec(..).encode(myObject);`.
If the intention is to provide better low-level functionality, better to do that with
I wouldn't even hate just adding offset to the return type of
I understand you're also going for performance here, and you're right that the current implementation does degrade performance/memory; however, I think having a simple, user-friendly API alongside a lower-level performant one is more than enough.
Thanks Joe! I agree, it would make me sad losing the
I also like the alternative API, it's just a little bit more of a pain to write codecs with this API. That being said, I can offer helper methods like `createEncoder`.

So for instance, if I wanted to create a number encoder, instead of this:

```ts
const myEncoder: Encoder<number> = {
    fixedSize: 1,
    encode(value, bytes, offset): Offset {
        bytes.set([value], offset);
        return offset + 1;
    },
};
```

I'd be doing this:

```ts
const myEncoder: Encoder<number> = createEncoder({
    fixedSize: 1,
    write(value, bytes, offset): Offset {
        bytes.set([value], offset);
        return offset + 1;
    },
});

// `createEncoder` basically just adds the `encode` function for us:
myEncoder.encode(42); // <- Uint8Array.
```

What do you think?
Yep, I think that's great! Lgtm 💪🏼
Agreed, that looks like a really nice API to me. Only question is how easy is it to compose e.g. a struct encoder using that? Would you be internally using e.g. the `write` function?
That's right, codecs that are composed from other codecs should only use the `write` function of their inner codecs (see the sketch below).
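To make that concrete, here is a rough sketch of such a composition. The `Encoder` shape and `createEncoder` below follow the example earlier in this thread and are assumptions, not the final API:

```ts
type Offset = number;

type Encoder<T> = {
    fixedSize: number;
    write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
    encode: (value: T) => Uint8Array;
};

// Minimal fixed-size-only version of the `createEncoder` helper sketched above:
// it derives `encode` by pre-allocating the byte array and delegating to `write`.
function createEncoder<T>(encoder: Omit<Encoder<T>, 'encode'>): Encoder<T> {
    return {
        ...encoder,
        encode(value) {
            const bytes = new Uint8Array(encoder.fixedSize);
            encoder.write(value, bytes, 0);
            return bytes;
        },
    };
}

const u8Encoder: Encoder<number> = createEncoder({
    fixedSize: 1,
    write: (value, bytes, offset) => {
        bytes[offset] = value;
        return offset + 1;
    },
});

// A struct encoder composed from field encoders: every field writes into the
// same pre-allocated buffer, so no intermediate Uint8Array instances are created.
function getPointEncoder(): Encoder<{ x: number; y: number }> {
    return createEncoder({
        fixedSize: u8Encoder.fixedSize * 2,
        write: (point, bytes, offset) => {
            offset = u8Encoder.write(point.x, bytes, offset);
            offset = u8Encoder.write(point.y, bytes, offset);
            return offset;
        },
    });
}

getPointEncoder().encode({ x: 1, y: 2 }); // Uint8Array [1, 2]
```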
Force-pushed from 878c7e8 to ec7a4e1.
I've updated the PR and its description.
```diff
  write: (value: string, bytes, offset) => {
      const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(new Uint8Array(charCodes));
-     return bytes;
+     bytes.set(charCodes, offset);
+     return offset + 32;
  },
```
Is there any reason to make this context-y with `this`?
Suggested change:

```diff
- write: (value: string, bytes, offset) => {
-     const charCodes = [...value.slice(0, 32)].map(char => Math.min(char.charCodeAt(0), 255));
-     bytes.set(charCodes, offset);
-     return offset + 32;
- },
+ write(value: string, bytes, offset) {
+     const length = this.fixedSize;
+     const charCodes = [...value.slice(0, length)].map(char => Math.min(char.charCodeAt(0), 255));
+     bytes.set(charCodes, offset);
+     return offset + length;
+ },
```
```ts
 * Writes the encoded value into the provided byte array at the given offset.
 * Returns the offset of the next byte after the encoded value.
 */
readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
```
You could achieve the context-y suggestion above by doing something like this:
```diff
- readonly write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
+ readonly write: (
+     this: Readonly<{ fixedSize: number | null }>,
+     value: T,
+     bytes: Uint8Array, offset: Offset,
+ ) => Offset;
```
Then something like this later:

```ts
const context = { fixedSize };
return {
    // ...
    write(...args) { return encoder.write.apply(context, args); },
};
```
Hmm I don't know that I like this tbh. It's making me implement codecs thinking: "it's possible that a parent codec is going to inject a different size and therefore mess up with this implementation". 🤔
Force-pushed from ec7a4e1 to 5ceedf0.
Looking slick! One question from me popped up.
Force-pushed from 5ceedf0 to 53197a9.
Force-pushed from 53197a9 to deec5fb.
This looks good to me now! I think the `createEncoder` and `createDecoder`, and keeping `encode`/`decode`, is much nicer too.
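For the decoder side, a minimal sketch of what a `createDecoder`-style helper could look like. The `Decoder` shape and names mirror the encoder sketch above and are assumptions, not the final API:

```ts
type Offset = number;

type Decoder<T> = {
    fixedSize: number;
    // Reads the value from `bytes` at `offset`, returning it with the next offset.
    read: (bytes: Uint8Array, offset: Offset) => [T, Offset];
    // Convenience wrapper that returns the decoded value directly.
    decode: (bytes: Uint8Array, offset?: Offset) => T;
};

// Derives `decode` from `read`, so implementers only provide `read`.
function createDecoder<T>(decoder: Omit<Decoder<T>, 'decode'>): Decoder<T> {
    return {
        ...decoder,
        decode: (bytes, offset = 0) => decoder.read(bytes, offset)[0],
    };
}

const u8Decoder = createDecoder<number>({
    fixedSize: 1,
    read: (bytes, offset) => [bytes[offset], offset + 1],
});

u8Decoder.decode(new Uint8Array([42])); // 42
```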
Force-pushed from 1ac1910 to 7d1c2f4.
Force-pushed from 7d1c2f4 to 369a605.
Merge activity
…1885) This PR continues to refactor the codecs libraries as described in PR #1865. This one focuses on `@solana/codecs-strings`. Here, we can see the downside of having to compute the size beforehand for base X since, unless we add some caching, we end up having to encode twice: once to get the size of the buffer and once to fill it.
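To illustrate that downside, here is a rough sketch of a variable-size string encoder whose size can only be known by doing the conversion work. The `VariableSizeEncoder` shape and the `variableSize` name follow the PR description below and are assumptions, not the shipped API:

```ts
type VariableSizeEncoder<T> = {
    fixedSize: null;
    // Returns the encoded size of `value`; needed to pre-allocate the byte array.
    variableSize: (value: T) => number;
    write: (value: T, bytes: Uint8Array, offset: number) => number;
};

// A UTF-8 string encoder: like base-X encoders, the byte length depends on the
// value itself, so without caching the conversion effectively happens twice:
// once to measure the buffer and once to fill it.
const utf8Encoder: VariableSizeEncoder<string> = {
    fixedSize: null,
    variableSize: (value) => new TextEncoder().encode(value).length, // first pass
    write: (value, bytes, offset) => {
        const encoded = new TextEncoder().encode(value); // second pass
        bytes.set(encoded, offset);
        return offset + encoded.length;
    },
};

const size = utf8Encoder.variableSize('héllo'); // 6
const bytes = new Uint8Array(size);
utf8Encoder.write('héllo', bytes, 0);
```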
…ze codecs (#1903) As suggested [here](#1865 (comment)), this PR removes the `{ fixedSize: null }` attribute from `VariableSize*` types. This makes it easier and less confusing to create and use variable-size codecs. Since it's now slightly less convenient to find out if a codec is fixed-size or not, this PR also offers type guards to make this easier (see the sketch below).
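As an illustration of what such a type guard could look like (the names and shapes here are assumptions based on this thread, not necessarily the ones shipped in #1903):

```ts
type FixedSizeCodec = { fixedSize: number };
type VariableSizeCodec = { variableSize: (value: unknown) => number };
type AnyCodec = FixedSizeCodec | VariableSizeCodec;

// Narrows a codec to its fixed-size variant so callers can read `fixedSize` safely.
function isFixedSize(codec: AnyCodec): codec is FixedSizeCodec {
    return 'fixedSize' in codec;
}

const u32Codec: AnyCodec = { fixedSize: 4 };
const size = isFixedSize(u32Codec) ? u32Codec.fixedSize : null; // 4
```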
Because there has been no activity on this PR for 14 days since it was merged, it has been automatically locked. Please open a new issue if it requires a follow-up.
Note this is a draft PR focusing on `codecs-core` to gather feedback; I will continue to update the other libraries afterwards.

This PR updates the codecs API such that both the encoding and decoding functions have access to the entire byte array. Let's first see the changes this PR introduces and then see why this is a valuable change.

API Changes

Encode: The `Encoder` type contains a new function `write`. Contrary to the `encode` function, which creates a new `Uint8Array` and returns it directly, the `write` function updates the provided `bytes` argument at the provided `offset`. It then returns the next offset that should be written to. A new `createEncoder` function was provided to automatically fill the `encode` function from the `write` function.

Decode: The `decode` function was already following a similar approach by using offsets. The newly added function `read` takes over this responsibility. The only difference is that we now make the offset a mandatory argument to stay consistent with the `write` function. The `decode` function now becomes syntactic sugar for accessing the value directly. Similarly to the `Encoder` changes, a new `createDecoder` function is provided to fill the `decode` function using the `read` function.

Sizes: Because we now need to pre-build the entire byte array that will be encoded, we need a way to find the variable size of a given value. We introduce a new `variableSize` function and narrow the types so that it can only be provided when `fixedSize` is `null`. We do something similar for the `Decoder`, except that this one doesn't need to know about the variable size (it would make no sense, as the type parameter `T` for decoders refers to the decoded type and not the type to encode).

Description: This PR takes this refactoring opportunity to remove the `description` attribute from the codecs API, which brings little value to the end-user.

Why?

Consistent API: The implementations of the `encode`/`decode` and `write`/`read` functions are now consistent with each other. Before, one was using offsets to navigate through an entire byte array and the other was returning and merging byte arrays together. Now they both use offsets to navigate the byte array to encode or decode.

Performant API: By pre-building the byte array once, we avoid creating multiple instances of byte arrays and merging them together.

Non-linear serialisation: The main reason why it's important for the `encode` method to have access to the entire encoded byte array is that it allows us to offer more complex codec primitives that are able to jump back and forth within the buffer. Without it, we are locking ourselves into only supporting serialisation strategies that are read linearly, which isn't always the case. For instance, imagine we have the size of an array being stored at the very beginning of the account whereas the items themselves are being stored at the end. Because we now have the full byte array when encoding, we can push the size at the beginning whilst inserting all items at the requested offset. We could even offer a `getOffsetCodec` (or similar) that allows us to shift the offset forward or backward to compose more complex, non-linear data structures. This would be simply impossible with the current format of the `encode` function.
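To illustrate that non-linear case, here is a rough sketch (hypothetical layout and helper names, built on the `Encoder`/`write` shape sketched earlier in this thread) of an array encoder that writes its length at the start of the buffer while its items live at a fixed offset further in:

```ts
type Offset = number;

type Encoder<T> = {
    fixedSize: number;
    write: (value: T, bytes: Uint8Array, offset: Offset) => Offset;
};

// Hypothetical layout: a u8 length prefix at offset 0, then 7 reserved bytes,
// then up to 8 u8 items starting at offset 8. The length and the items are not
// adjacent, so the encoder must jump around the buffer, which is only possible
// because `write` receives the whole byte array.
const u8: Encoder<number> = {
    fixedSize: 1,
    write: (value, bytes, offset) => {
        bytes[offset] = value;
        return offset + 1;
    },
};

const itemsAtOffset8: Encoder<number[]> = {
    fixedSize: 16,
    write: (items, bytes, offset) => {
        u8.write(items.length, bytes, offset); // length at the very beginning
        let itemOffset = offset + 8; // items start later in the buffer
        for (const item of items) {
            itemOffset = u8.write(item, bytes, itemOffset);
        }
        return offset + 16;
    },
};

const bytes = new Uint8Array(16);
itemsAtOffset8.write([1, 2, 3], bytes, 0);
// bytes = [3, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0]
```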