Make encoding and decoding roundtrip correctly #22
THIS IS A BREAKING CHANGE

Leading nulls are now encoded as "0", followed by a count. For example, `b"\x00"` becomes "01", and `b"\x00" * 62` becomes "0z01".

Applications that rely on automatic stripping of "0z" must now perform this stripping themselves based on their knowledge of their problem domain.

Applications that rely on the `minlen` parameter must now pad the output themselves based on their knowledge of their problem domain, and must remove references to the `minlen` parameter.

Applications that rely on leading null bytes being stripped must now strip leading null bytes themselves.

Closes suminb#12
Fixes suminb#18
I've updated the changes. Instead of padding the output one-to-one with null bytes or leading zeros, each leading zero is now followed by a count character. For example, "01" indicates a single null byte at the beginning of the decoded string, and "0z01" indicates 62 null bytes at the beginning of the decoded string.
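To make the count scheme concrete, here is a minimal sketch of how the prefix for a run of leading null bytes could be built. It is not the code in this PR: `null_prefix` is a hypothetical helper, and the alphabet (digits, then uppercase, then lowercase) is an assumption inferred from "z" being the largest count in the "0z01" example.

```python
# Assumed alphabet: "z" is the last character, so it stands for 61.
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def null_prefix(count: int) -> str:
    """Hypothetical helper: prefix that represents `count` leading null bytes."""
    max_per_pair = len(CHARSET) - 1  # one "0" + count pair covers at most 61 nulls
    full_pairs, remainder = divmod(count, max_per_pair)
    prefix = ("0" + CHARSET[max_per_pair]) * full_pairs
    if remainder:
        prefix += "0" + CHARSET[remainder]
    return prefix


print(null_prefix(1))   # one null byte
print(null_prefix(62))  # 61 nulls in the first pair, 1 in the second
```

Under these assumptions, a run longer than 61 nulls simply spills into additional two-character pairs.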
Thanks for your PR. Here are some thoughts.

I see that

    >>> base62.decodebytes(base62.encodebytes(b'\x00'))
    b'\x00'

However, it appears there are some inconsistencies in the system.

    >>> base62.encodebytes(base62.decodebytes('0'))
    ''
    >>> base62.encodebytes(base62.decodebytes('01'))
    '01'
    >>> base62.encodebytes(base62.decodebytes('001'))
    '1'
    >>> base62.encodebytes(base62.decodebytes('0001'))
    '01'

Sorry for the late response though. To give you a lousy excuse, it seems like life keeps finding its way to throw all kinds of distractions 😁 Thanks again for your work, and please let me know your thoughts on this.
Regarding My initial implementation converted leading null characters to leading zeroes in a one-to-one manner. That is:
However, I noticed that it would be more efficient to encode leading null characters as a
During decoding, this two-character pattern is removed, and the remaining string is checked again to see if it starts with

Regarding the roundtrip examples you provided, I don't think that you've discovered an inconsistency in the system. I think of it this way: all inputs to an encoder like gzip are valid, but not all inputs to a decoder like gunzip are valid. The roundtrip examples above don't match because the inputs are non-canonical representations of the same values (with the exception of

In general, I don't think it's valid to pass arbitrary input to the decoder and expect the same result back from the encoder. The only valid way to create roundtrip tests is to pass arbitrary inputs to the encoder and expect identical outputs from the decoder:

    import random
    import base62

    assert base62.decodebytes(base62.encodebytes(b"")) == b""
    for _ in range(1_000_000):
        sequence = bytes(random.randint(0, 255) for _ in range(random.randint(1, 5)))
        assert base62.decodebytes(base62.encodebytes(sequence)) == sequence
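To illustrate the canonical versus non-canonical point without depending on the installed package, here is a self-contained sketch of the scheme as described in this thread. `encodebytes_sketch` and `decodebytes_sketch` are hypothetical stand-ins, not the library functions, and the alphabet order is an assumption. Under those assumptions the sketch gives the same results as the examples quoted above.

```python
import random

# Assumed alphabet: digits, then uppercase, then lowercase ("z" == 61).
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(CHARSET)


def encodebytes_sketch(data: bytes) -> str:
    """Hypothetical encoder: null-count prefix, then base-62 digits."""
    stripped = data.lstrip(b"\x00")
    nulls = len(data) - len(stripped)
    # Leading nulls become "0" + count pairs, at most 61 nulls per pair.
    full_pairs, remainder = divmod(nulls, BASE - 1)
    prefix = ("0" + CHARSET[-1]) * full_pairs
    if remainder:
        prefix += "0" + CHARSET[remainder]
    # The remaining bytes are treated as one big-endian integer.
    n = int.from_bytes(stripped, "big")
    digits = ""
    while n > 0:
        n, r = divmod(n, BASE)
        digits = CHARSET[r] + digits
    return prefix + digits


def decodebytes_sketch(encoded: str) -> bytes:
    """Hypothetical decoder: consume "0" + count pairs, then base-62 digits."""
    nulls = 0
    while encoded.startswith("0"):
        count_char = encoded[1:2]  # empty when a lone "0" is the last character
        nulls += CHARSET.index(count_char) if count_char else 0
        encoded = encoded[2:]
    n = 0
    for ch in encoded:
        n = n * BASE + CHARSET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * nulls + body


# Encoder-first roundtrips hold for arbitrary byte strings.
random.seed(0)
assert decodebytes_sketch(encodebytes_sketch(b"")) == b""
for _ in range(10_000):
    sequence = bytes(random.randint(0, 255) for _ in range(random.randint(1, 5)))
    assert decodebytes_sketch(encodebytes_sketch(sequence)) == sequence

# Decoder-first roundtrips normalize non-canonical inputs.
for text in ("0", "01", "001", "0001"):
    print(repr(text), "->", repr(encodebytes_sketch(decodebytes_sketch(text))))
```

In this sketch, "00" is a pair with a count of zero, so "001" decodes like "1" and "0001" decodes like "01". The encoder never emits a zero-count pair, which is why those inputs do not come back unchanged.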
I see. That makes sense. I'll go ahead and release this as v1.0.0. Thanks!
THIS IS A BREAKING CHANGE

Previously, leading null bytes were lost when encoding. For example, `b"\x00\x01"` would roundtrip as `b"\x01"`. As a side effect, it was impossible for an encoded string to start with `"0"`, so it was safe to look for and strip a leading `"0z"` prefix. It was also safe to left-pad an encoded string with zeros to maintain a minimum length.

This PR makes encoding and decoding of bytes round-trip correctly.
This is achieved by interpreting leading null bytes and leading zeros as significant, and prepending the encoded value with a leading zero followed by a quantity character, and prepending decoded output with leading null bytes.
For example, `b"\x00"` is encoded as `"01"`, and `b"\x00" * 62` is encoded as `"0z01"`.

However, in doing so, it is no longer safe to strip leading `"0z"` prefixes, nor is it safe to pad outputs to minimum lengths. Therefore:

- Applications that rely on the `minlen` parameter must now pad the output themselves based on their knowledge of their problem domain, and must remove references to the `minlen` parameter.

This makes it possible to rely on base62 for encoding and decoding arbitrary byte strings.
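For applications that need to migrate, the following is a rough sketch of what doing the stripping and padding on the caller's side might look like. Both helpers are hypothetical and are not part of the package. They are only appropriate where the application knows that a leading "0" in its strings is padding or a marker rather than a null-byte count, for example when its values come from integers it encoded itself.

```python
def strip_0z_prefix(encoded: str) -> str:
    """Hypothetical helper: drop one leading "0z" marker, if present."""
    return encoded[2:] if encoded.startswith("0z") else encoded


def pad_to_minlen(encoded: str, minlen: int) -> str:
    """Hypothetical helper: left-pad with "0" up to `minlen` characters."""
    return encoded.rjust(minlen, "0")


print(strip_0z_prefix("0zAb"))   # marker removed by the caller
print(pad_to_minlen("Ab", 5))    # padded by the caller for display or storage
```

Strings padded this way should have the padding removed again before they are passed to a bytes decoder, since under the new scheme a leading "0" is significant.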
If this PR is merged, I recommend releasing this as a new major version ("1.0.0") and introducing a CHANGELOG to document changes in the package.
Fixes #18
Closes #12