Replies: 3 comments 26 replies
-
I have not read this particular guide, but I have changed {start_offset, end_offset} to {start_offset, length}, since all our APIs use this format and length is usually a small value. I then combined that with storing every value that fits in 15 bits as a short, with the highest value reserved as a marker that says the value should be read as a full int (I have also played with 6-bit byte values, short values, and long, but that is barely larger than just short+int). This reduced the size of the serialized file by 50% or so. In my case the speed got a tiny bit faster, so reducing the size was mildly beneficial to speed rather than a cost. It always works at byte boundaries, which I think makes it a simple scheme. I will read that document and see if I can grok it :) It looks complicated, but a lot of thought has been put into protobufs, so I am optimistic. We should try it and measure it, but I definitely think some scheme can be used without hurting perf.
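For anyone following along, here is a rough Python sketch of the short-plus-escape idea described above. It is not the actual serializer code: the marker value `0x7FFF`, the little-endian layout, and the names `encode_offset` / `decode_offset` are all assumptions made for illustration.

```python
import struct

# Assumed escape value: the highest 15-bit value means "a full int follows".
MARKER = 0x7FFF


def encode_offset(value):
    """Encode a non-negative int as 2 bytes, or as marker + 4-byte int."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value out of range for this sketch")
    if value < MARKER:
        return struct.pack("<H", value)       # small value: one short
    return struct.pack("<HI", MARKER, value)  # escape: marker, then full int


def decode_offset(buf, pos=0):
    """Return (value, next_pos) for the value stored at buf[pos:]."""
    (short,) = struct.unpack_from("<H", buf, pos)
    if short != MARKER:
        return short, pos + 2
    (full,) = struct.unpack_from("<I", buf, pos + 2)
    return full, pos + 6
```

Everything is read at byte boundaries with one comparison per value, which is where the simplicity comes from. The cost is that any value at or above the marker takes 6 bytes instead of 4.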
-
Using a single high bit per byte for continuation, versus what I am doing, seems like it would help on space. I suppose the space I waste by using a single continuation byte is offset by not requiring the math that base-128 varints need, but simple int math is pretty cheap. The reduction in bytes to process is at odds with the math to reassemble the value. Tough to know without trying whether we would notice that overhead or not.
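To make the space side of that trade-off concrete, here is a small Python sketch that counts bytes per value under each scheme. The helper names and the `0x7FFF` marker are assumptions for illustration, and it says nothing about speed, which still has to be measured.

```python
def varint_size(value):
    """Bytes needed for a base-128 varint: 7 payload bits per byte."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def short_escape_size(value, marker=0x7FFF):
    """Bytes needed when small values are a short and the rest are marker + int."""
    return 2 if value < marker else 6


if __name__ == "__main__":
    # Print value, varint bytes, short+escape bytes for a few boundary values.
    for v in (0, 127, 128, 16383, 16384, 32766, 32767, 2_000_000):
        print(v, varint_size(v), short_escape_size(v))
```

Under these assumptions the varint wins for values below 128 and for values at or above the marker, ties from 128 up to 16383, and loses by one byte from 16384 up to 32766.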
-
One note: while this can decrease the serialized size, it won't change anything about the deserialized node size, e.g. for Java or Ruby nodes. Still, it seems well worth it if we can make the serialized size smaller without hurting performance too much. I think we should also look at msgpack's varint encoding, which may be simpler. The protobuf encoding seems less efficient because you need to check the continuation bit of every byte to know how long the value is, versus knowing the length from the first byte read.
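As a rough illustration of the "length known from the first byte" approach, here is a Python sketch of the unsigned integer formats from the MessagePack spec (positive fixint, then `0xcc`/`0xcd`/`0xce`/`0xcf` prefixes for 8/16/32/64-bit big-endian values). The function names are made up for this example, and a real serializer would not have to copy these exact tag bytes.

```python
import struct

# Tag byte -> struct format of the big-endian payload that follows it.
_FORMATS = {0xCC: ">B", 0xCD: ">H", 0xCE: ">I", 0xCF: ">Q"}


def pack_uint(value):
    """Encode a non-negative int using MessagePack's unsigned int formats."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    if value <= 0x7F:
        return bytes([value])                      # positive fixint, 1 byte
    if value <= 0xFF:
        return b"\xcc" + struct.pack(">B", value)  # uint 8
    if value <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", value)  # uint 16
    if value <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", value)  # uint 32
    if value <= 0xFFFFFFFFFFFFFFFF:
        return b"\xcf" + struct.pack(">Q", value)  # uint 64
    raise ValueError("value does not fit in 64 bits")


def unpack_uint(buf, pos=0):
    """Return (value, next_pos); the first byte alone determines the length."""
    first = buf[pos]
    if first <= 0x7F:
        return first, pos + 1
    fmt = _FORMATS.get(first)
    if fmt is None:
        raise ValueError("not an unsigned int tag: 0x%02x" % first)
    (value,) = struct.unpack_from(fmt, buf, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)
```

The decoder branches once on the first byte and then does a fixed-width read, at the price of a 3-byte encoding for values from 256 to 65535 where a base-128 varint needs only 2 bytes below 16384.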
-
Most of the integers that we store in serialization are related to offsets. Every integer currently takes 4 bytes, which is more space than most values need.
One thing we could do instead is use variable-length integers. Protobuf describes them pretty well: https://protobuf.dev/programming-guides/encoding/#varints. This could save a fair amount of space, because I would imagine the offsets in most files would stay below 2^14 (2 bytes as a varint), and almost all of the rest would stay below 2^21 (3 bytes).
@enebo, @eregon thoughts? Would this kind of deserialization slow it down too much so that it wouldn't be worth it?
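For reference, a minimal Python sketch of the base-128 varint encoding from the protobuf guide linked above: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are illustrative only, and this is not a proposal for the actual serializer code.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf, pos=0):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: done
            return result, pos
        shift += 7
```

The per-byte work when decoding is a mask, a shift, and a branch on the continuation bit, which is the overhead the question above is asking about.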