Make encoding and decoding roundtrip correctly #22

kurtmckee · 2023-01-25T14:33:04Z

THIS IS A BREAKING CHANGE

Previously, leading null bytes were lost when encoding. For example, b"\x00\x01" would roundtrip as b"\x01". As a side effect, it was impossible for an encoded string to start with "0", so it was safe to look for -- and strip -- a leading "0z" prefix. It was also safe to left-pad an encoded string with zeros to maintain a minimum length.

This PR makes encoding and decoding of bytes round-trip correctly.

This is achieved by interpreting leading null bytes and leading zeros as significant, and prepending the encoded value with a leading zero followed by a quantity character, and prepending decoded output with leading null bytes.

For example, b"\x00" is encoded as "01", and b"\x00" * 62 is encoded as "0z01".

However, in doing so, it is no longer safe to strip leading "0z" prefixes, nor is it safe to pad outputs to minimum lengths. Therefore:

Applications that rely on automatic stripping of "0z" must now perform this stripping themselves based on their knowledge of their problem domain.
Applications that rely on the minlen parameter must now pad the output themselves based on their knowledge of their problem domain, and must remove references to the minlen parameter.
Applications that rely on leading null bytes being stripped must now strip leading null bytes themselves.

This makes it possible to rely on base62 for encoding and decoding arbitrary byte strings.

If this PR is merged, I recommend releasing this as a new major version ("1.0.0") and introducing a CHANGELOG to document changes in the package.

Fixes #18
Closes #12

THIS IS A BREAKING CHANGE Leading nulls are now encoded as "0", followed by a count. For example, `b"\x00"` becomes "01". For example, `b"\x00" * 62` becomes "0z01". Applications that rely on automatic stripping of "0z" must now perform this stripping themselves based on their knowledge of their problem domain. Applications that rely on the `minlen` parameter must now pad the output themselves based on their knowledge of their problem domain, and must remove references to the `minlen` parameter. Applications that rely on leading null bytes being stripped must now strip leading null bytes themselves. Closes suminb#12 Fixes suminb#18

kurtmckee · 2023-01-26T14:38:03Z

I've updated the changes. Now, instead of padding the output 1-to-1 with null bytes or leading zeros, the leading zeros are interpreted to have a count following them. So, for example, "01" will indicate that there is a single null byte at the beginning of the decoded string, and "0z01" will indicate that there are 62 null bytes at the beginning of the decoded string.

suminb · 2023-02-25T02:56:31Z

Thanks for your PR. Here are some thoughts.

As this is a breaking change, I see it as a good opportunity to get rid of the 0z prefix. The prefix has been carried on for a historical reason, and I don't see any value other than keeping backward compatibility. The optional, arbitrary 0z prefix makes it difficult to keep encoding/decoding roundtrip simple and consistent.
This might be a fault on me for not reading your proposal carefully, but I don't quite understand why b"\x00" is encoded as "01". Shouldn't it be "0"?

I see that b"\x00" makes the roundtrip successfully.

>>> base62.decodebytes(base62.encodebytes(b'\x00'))
b'\x00'

However, it appears there are some inconsistencies in the system.

>>> base62.encodebytes(base62.decodebytes('0'))
''
>>> base62.encodebytes(base62.decodebytes('01'))
'01'
>>> base62.encodebytes(base62.decodebytes('001'))
'1'
>>> base62.encodebytes(base62.decodebytes('0001'))
'01'

Sorry for the late response though. To give you a lousy excuse, it seems like life keeps finding its way to throw all kinds of distractions 😁

Thanks again for your work, and please let me know your thoughts on this.

kurtmckee · 2023-02-26T16:18:50Z

Regarding b"\x00" converting to "01", this is because leading null characters needed to be represented somehow.

My initial implementation converted leading null characters to leading zeroes in a one-to-one manner. That is:

b"\x00" would convert to "0"
b"\x00" * 9 would convert to "0" * 9

However, I noticed that it would be more efficient to encode leading null characters as a "0" followed by a character representing the count of null characters. So:

b"\x00" converts to "01" (meaning there is one null character)
b"\x00" * 9 converts to "09" (meaning there are nine null characters)

During decoding, this two-character pattern is removed, and the remaining string is checked again to see if it starts with "0" (which means that additional leading null characters are present).

Regarding the roundtrip examples you provided, I don't think that you've discovered an inconsistency in the system. I think of it this way: all inputs to an encoder like gzip are valid, but not all inputs to a decoder like gunzip are valid. The roundtrip examples above don't match because the inputs are non-canonical representations of the same values (with the exception of "0", which is invalid).

"0" is invalid input. Leading zeros must be followed by a count character.
"001" is interpreted in two parts. "00" means that there are zero leading null characters, and "1" means b"\x01". When encoded, the result is "1", because the encoder doesn't reintroduce the trivial "00" sequence. The meaning is the same, but only "1" is canonical.
"0001" is interpreted in two parts. "00" again means that there are zero leading null characters, and "01" means that there is one null character (b"\x00"). When encoded, the result is "01".

In general, I don't think it's valid to pass arbitrary input to the decoder and expect the same result back from the encoder. The only valid way to create roundtrip tests is to pass arbitrary inputs to the encoder and expect identical outputs from the decoder:

import random

import base62

assert base62.decodebytes(base62.encodebytes(b"")) == b""

for _ in range(1_000_000):
    sequence = bytes(random.randint(0, 255) for _ in range(random.randint(1, 5)))
    assert base62.decodebytes(base62.encodebytes(sequence)) == sequence

suminb · 2023-03-01T11:41:30Z

I see. That makes sense. I'll go ahead and release this as v1.0.0. Thanks!

kurtmckee marked this pull request as draft January 26, 2023 13:45

kurtmckee force-pushed the fix-leading-zeros-issue-18 branch from 09d5006 to 420841b Compare January 26, 2023 14:05

kurtmckee marked this pull request as ready for review January 26, 2023 14:06

kurtmckee force-pushed the fix-leading-zeros-issue-18 branch from 420841b to 58642de Compare January 26, 2023 14:33

kurtmckee added 2 commits January 26, 2023 08:35

Demonstrate issue suminb#18

afe2287

kurtmckee force-pushed the fix-leading-zeros-issue-18 branch from 58642de to 7436c61 Compare January 26, 2023 14:35

suminb force-pushed the develop branch from 95ffa8a to 53b87a6 Compare February 5, 2023 02:20

suminb merged commit 1361dba into suminb:develop Mar 1, 2023

kurtmckee deleted the fix-leading-zeros-issue-18 branch March 1, 2023 12:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make encoding and decoding roundtrip correctly #22

Make encoding and decoding roundtrip correctly #22

kurtmckee commented Jan 25, 2023 •

edited

Loading

kurtmckee commented Jan 26, 2023

suminb commented Feb 25, 2023

kurtmckee commented Feb 26, 2023

suminb commented Mar 1, 2023

Make encoding and decoding roundtrip correctly #22

Make encoding and decoding roundtrip correctly #22

Conversation

kurtmckee commented Jan 25, 2023 • edited Loading

THIS IS A BREAKING CHANGE

This PR makes encoding and decoding of bytes round-trip correctly.

kurtmckee commented Jan 26, 2023

suminb commented Feb 25, 2023

kurtmckee commented Feb 26, 2023

suminb commented Mar 1, 2023

kurtmckee commented Jan 25, 2023 •

edited

Loading