Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make encoding and decoding roundtrip correctly #22

Merged
merged 2 commits into from
Mar 1, 2023

Conversation

kurtmckee
Copy link
Contributor

@kurtmckee kurtmckee commented Jan 25, 2023

THIS IS A BREAKING CHANGE

Previously, leading null bytes were lost when encoding. For example, b"\x00\x01" would roundtrip as b"\x01". As a side effect, it was impossible for an encoded string to start with "0", so it was safe to look for -- and strip -- a leading "0z" prefix. It was also safe to left-pad an encoded string with zeros to maintain a minimum length.

This PR makes encoding and decoding of bytes round-trip correctly.

This is achieved by interpreting leading null bytes and leading zeros as significant, and prepending the encoded value with a leading zero followed by a quantity character, and prepending decoded output with leading null bytes.

For example, b"\x00" is encoded as "01", and b"\x00" * 62 is encoded as "0z01".

However, in doing so, it is no longer safe to strip leading "0z" prefixes, nor is it safe to pad outputs to minimum lengths. Therefore:

  • Applications that rely on automatic stripping of "0z" must now perform this stripping themselves based on their knowledge of their problem domain.
  • Applications that rely on the minlen parameter must now pad the output themselves based on their knowledge of their problem domain, and must remove references to the minlen parameter.
  • Applications that rely on leading null bytes being stripped must now strip leading null bytes themselves.

This makes it possible to rely on base62 for encoding and decoding arbitrary byte strings.

If this PR is merged, I recommend releasing this as a new major version ("1.0.0") and introducing a CHANGELOG to document changes in the package.

Fixes #18
Closes #12

@kurtmckee kurtmckee marked this pull request as draft January 26, 2023 13:45
@kurtmckee kurtmckee marked this pull request as ready for review January 26, 2023 14:06
THIS IS A BREAKING CHANGE

Leading nulls are now encoded as "0", followed by a count.
For example, `b"\x00"` becomes "01".
For example, `b"\x00" * 62` becomes "0z01".

Applications that rely on automatic stripping of "0z"
must now perform this stripping themselves
based on their knowledge of their problem domain.

Applications that rely on the `minlen` parameter
must now pad the output themselves
based on their knowledge of their problem domain,
and must remove references to the `minlen` parameter.

Applications that rely on leading null bytes being stripped
must now strip leading null bytes themselves.

Closes suminb#12
Fixes suminb#18
@kurtmckee
Copy link
Contributor Author

I've updated the changes. Now, instead of padding the output 1-to-1 with null bytes or leading zeros, the leading zeros are interpreted to have a count following them. So, for example, "01" will indicate that there is a single null byte at the beginning of the decoded string, and "0z01" will indicate that there are 62 null bytes at the beginning of the decoded string.

@suminb
Copy link
Owner

suminb commented Feb 25, 2023

Thanks for your PR. Here are some thoughts.

  1. As this is a breaking change, I see it as a good opportunity to get rid of the 0z prefix. The prefix has been carried on for a historical reason, and I don't see any value other than keeping backward compatibility. The optional, arbitrary 0z prefix makes it difficult to keep encoding/decoding roundtrip simple and consistent.

  2. This might be a fault on me for not reading your proposal carefully, but I don't quite understand why b"\x00" is encoded as "01". Shouldn't it be "0"?

I see that b"\x00" makes the roundtrip successfully.

>>> base62.decodebytes(base62.encodebytes(b'\x00'))
b'\x00'

However, it appears there are some inconsistencies in the system.

>>> base62.encodebytes(base62.decodebytes('0'))
''
>>> base62.encodebytes(base62.decodebytes('01'))
'01'
>>> base62.encodebytes(base62.decodebytes('001'))
'1'
>>> base62.encodebytes(base62.decodebytes('0001'))
'01'

Sorry for the late response though. To give you a lousy excuse, it seems like life keeps finding its way to throw all kinds of distractions 😁

Thanks again for your work, and please let me know your thoughts on this.

@kurtmckee
Copy link
Contributor Author

Regarding b"\x00" converting to "01", this is because leading null characters needed to be represented somehow.

My initial implementation converted leading null characters to leading zeroes in a one-to-one manner. That is:

  • b"\x00" would convert to "0"
  • b"\x00" * 9 would convert to "0" * 9

However, I noticed that it would be more efficient to encode leading null characters as a "0" followed by a character representing the count of null characters. So:

  • b"\x00" converts to "01" (meaning there is one null character)
  • b"\x00" * 9 converts to "09" (meaning there are nine null characters)

During decoding, this two-character pattern is removed, and the remaining string is checked again to see if it starts with "0" (which means that additional leading null characters are present).

Regarding the roundtrip examples you provided, I don't think that you've discovered an inconsistency in the system. I think of it this way: all inputs to an encoder like gzip are valid, but not all inputs to a decoder like gunzip are valid. The roundtrip examples above don't match because the inputs are non-canonical representations of the same values (with the exception of "0", which is invalid).

  • "0" is invalid input. Leading zeros must be followed by a count character.
  • "001" is interpreted in two parts. "00" means that there are zero leading null characters, and "1" means b"\x01". When encoded, the result is "1", because the encoder doesn't reintroduce the trivial "00" sequence. The meaning is the same, but only "1" is canonical.
  • "0001" is interpreted in two parts. "00" again means that there are zero leading null characters, and "01" means that there is one null character (b"\x00"). When encoded, the result is "01".

In general, I don't think it's valid to pass arbitrary input to the decoder and expect the same result back from the encoder. The only valid way to create roundtrip tests is to pass arbitrary inputs to the encoder and expect identical outputs from the decoder:

import random

import base62

assert base62.decodebytes(base62.encodebytes(b"")) == b""

for _ in range(1_000_000):
    sequence = bytes(random.randint(0, 255) for _ in range(random.randint(1, 5)))
    assert base62.decodebytes(base62.encodebytes(sequence)) == sequence

@suminb
Copy link
Owner

suminb commented Mar 1, 2023

I see. That makes sense. I'll go ahead and release this as v1.0.0. Thanks!

@suminb suminb merged commit 1361dba into suminb:develop Mar 1, 2023
@kurtmckee kurtmckee deleted the fix-leading-zeros-issue-18 branch March 1, 2023 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignoring leading zero bytes Question with regard to 0z
2 participants