-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base85 encoding #61818
Comments
Base85 encoding (see e.g. http://en.wikipedia.org/wiki/Ascii85 ) allows a tighter encoding than Base64: it has a 5/4 expansion ratio, rather than 4/3. It would be nice to have a Base85 implementation in either the binascii or base64 modules. (unfortunately the Mercurial implementation is GPL'ed, although if we want to copy it we might simply ask them for a relicensing) |
Is anyone working on this? I'd like to include this in a CPython sprint @mit on 4/13. |
Antoine is talking to Mercurial about relicensing, and I believe at this point it is just a matter of working out the mechanical details (that is, he has an agreement-in-principal from them). |
The Mercurial authors have given their informal agreement for a relicensing. OTOH, they must still sign a contributor's agreement. The relicensing would allow us to use their pure Python implementation (in mercurial/pure/base85.py). OTOH, the C implementation (in mercurial/base85.c) is a ripoff of the git one, so we'd better rewrite our own. My current plan would be the following:
OTOH, if we don't get the Mercurial authors' contributor agreement, we can also re-write our own pure Python implementation. In any case, our implementation should IMHO be compatible with Mercurial's (i.e. produce identical outputs for the same inputs). |
I wrote an implementation from scratch (based on the wikipedia article; I've not looked at any existing implementations) in pure Python in the attached diff. It includes tests. Feel free to use it as the pure Python implementation if desired, though I won't be offended if you just end up using the Mercurial one. :-) |
(sorry for spam) Forgot to mention, I included an optional keyword argument to support the 'btoa' shortcut for sequences of space characters as described in the Wikipedia article. However, I'm unsure if any other implementation supports this, so might not be worth including it in our implementation. A better feature might be to support full btoa output, but the Wikipedia article doesn't have a complete enough specification, and I couldn't find (didn't really look for) one elsewhere. If no one uses it though, again. probably not worth including it. |
In this issue I would really like to aim for Mercurial/git-like (by the way, it seems "btoa" has been dead for a long time, I don't |
Ok, having now looked at mercurial's implementation... it looks like they implemented the RFC1924 version, whereas my implementation is the Ascii85 version (and I verified it against, amongst others: http://www.tools4noobs.com/online_tools/ascii85_encode/ ). The Ascii85 version is what is used with PDF, and the default output format for the equivalent Ruby library, so seems useful to have. So I guess what might be desirable is to have both in the codebase? |
Yes, it could be useful to have both. |
Ok, I'm not even sure that Mercurial follows RFC1924! That RFC is specifically for encoding IPv6 addresses, and mandates that the calculations be performed on a 128bit integer. The Mercurial implementation seems to follow the Ascii85 policy of taking each 4 bytes separately and doing 32bit arithmetic, but uses the lookup table from RFC1924, and is less lenient about spacing, and has no compression for sequences of zeroes. It therefore looks like Mercurial (and I guess Git) have their own, non-standard base64 encoding. The Web at large mostly has "standard" Ascii85 encoding/decoding described. RFC1924 itself has a Python implementation on Github: https://github.com/drkjam/netaddr/blob/rel-0.7.x/netaddr/ip/rfc1924.py So I'm not sure what you want to do. I would suggest a standard Ascii85 encoder is definitely useful, and provides feature parity with Ruby. If we want the standard library to be able to read/write Mercurial/Git base64 encoded files, then I guess that can be added too. If we think RFC1924 is useful/used, then the implementation in the netaddr lib looks right. |
Agreed for both the Ascii85 encoder and the hg/git brand of base85 (I think using "ascii85" and "base85" for those encodings, respectively, |
For the record, Mads and Brendan have submitted a contributor's agreement, so we can now take what we want from Mercurial's base85.py (which you can find at http://selenic.com/hg/file/4e1ae55e63ef/mercurial/pure/base85.py). |
Ok, great. I'll update the patch to include both encoding schemes. |
Updated patch that includes both my original implementation of Ascii85, as well as the Mercurial implementation of base85. A few notes/questions:
|
I want to see both algorithms to be similar so far as it is possible. It might be worth extract and reuse a common code. Mercurial's code looks far more optimal (for example a85encode has a quadratic complexity in result accumulating). |
I've updated the Ascii85 algorithms to remove the quadratic complexity, and use a single struct.pack/unpack. They should now be much quicker for large input strings. It's difficult to factor out commonality with b85* because the encodings and rules differ. This is especially true for decode (where Ascii85 allows arbitrary whitespace, so it either has to be stepped through as I've implemented it, or it would have to first be sanitised with .replace() or similar, which is expensive for large inputs). For encode, the special cases supported by Ascii85 make it impossible to *just* use a lookup table, and the simplified algorithm for encoding means it isn't necessary to use one at all. I also wanted to keep the Mercurial code intact as much as possible, so it can be kept in sync in future if necessary. My notes from the previous diff also still apply if anyone has thoughts on those. |
Hi and thanks for the patch!
I agree, it's better like this.
Yes, I think it's ok. I was thinking about binascii in the context of making a C version, but we can refactor things later anyway. Now about the patch: I haven't read it in detail, but it seems to lack tests for b85decode and b85encode. It should be easy enough to get some test values by calling Mercurial's version. |
After a more careful look of the b85encode code I say that it's implementation is not optimal. For the sake of simplicity the entire volume of data is copied several times. This can affect the processing of a large volume of data. On other hand, this dumb copying can be faster then more smart processing in a85encode. Only benchmarks will show the truth. Using a trick with struct.unpack() has very unpleasant side effect. It might be a few speed up encoding, but creates the Struct object with the size is many times larger than the size of the processed data. Worse, this object is cached and continues to consume memory. Since the size of the data most likely will be unique, almost every call of b85encode creates a new object. This will lead to memory leaks. |
After searching a lot of other implementations of this encoding I conclude that there are at least three different variants.
Some implementations combine (1) and (2) (optionally enclose an output in <~ and ~>, optionally wrap an output into several lines, optionally pad last 4 incomplete bytes). |
Yes. The current proposal is to include both the Adobe version ("ascii85") |
At least half of ascii85 encoders in wild implement this variant. I think we can provide a universal solution compatible (with some pre/postprocessing) with both variants. Enclose encoded data in <~ and ~> or not, and at which column wrap an encoded data. Padding can be easy implemented as preprocessing (data + (-len(data)) % 4 * b'\0'). As for Git/Mercurial's base85, what other applications use this encoding? |
Le mercredi 17 avril 2013 à 18:14 +0000, Serhiy Storchaka a écrit :
That's ok with me. It's just more work for whoever does it :-)
I don't know, but they use it to produce binary diffs ("diff" chunks of (and I also like that the Mercurial/Git variant is the simpler of all |
Can you elaborate on this? What leakage is there? I assume this is some
As I mentioned in one of my previous comments, I was trying very hard In my solution (a85(en|de)code) I wrote it from scratch in what I felt I'm less convinced it's sensible to merge the ascii85 implementations (we can do all this independently of the function names, which I think
I actually prefer the Ascii85 one for the simplicity of the encoding |
|
Thanks for the pointer. I will rework the patch for the encoder/decoders |
Serhiy, Martin, perhaps one of you could report the potential memory leak on the Mercurial bug tracker: http://bz.selenic.com/ |
New diff. Changes from the last one:
|
Raised http://bz.selenic.com/show_bug.cgi?id=3894 against Mercurial for them to workaround bpo-14596. |
Attached a minor tweak over the last diff - I'd forgotten to fix the Struct handling inside the Mercurial implementation as well. All other comments still apply to this diff. |
There are some bugs in ascii85 end base85 implementations (see in Rietveld for details). Besides, ascii85 implementation was too slow. I've prepared a patch that corrects errors and speeds up encoding and decoding. Microbenchmarks: ./python -m timeit -r 1 -n 1 -s "from base64 import a85encode as encode; data = open('python', 'rb').read(1000001)" "encode(data)"
a85encode 8.4 sec 1.13 sec |
As for interface, I think 'adobe' flag should be false by default. It makes encoder simpler. ascii85 encoder in Go's standard library doesn't wrap nor add Adobe's brackets. btoa/atob functions looks redundant as we can just use a85encode/a85decoder with appropriate options. Perhaps we should get rid from 'adobe' flag in a85decode and autodetect it. And perhaps to do the same with other a85decode's options. |
On 21 Apr 2013, at 17:38, Serhiy Storchaka <report@bugs.python.org> wrote:
The problem with autodetecting is that it makes it impossible for an application to use this library to verify that something is encoded in a specific way. Explicit is better than implicit. Otherwise, your changes look good to me. |
Agreed. Also, you generally known what format your data is in. Otherwise, how do you know that it is base85 rather than, say, base64 or uuencode? |
Serhiy, Martin, is one of you still working on this? |
What issues are there with the implementation as it stands? I am happy to contribute (as I need to code a base36 implementation myself, and it's basically the same work) but it looks like the existing implementation is fine, except possibly some people don't like "adobe" being implemented by default? |
I'm not very interesting in working on this (but analyzing and optimizing made fun to me). You Antoine as originator definitely are interested. So make decision about interface which you need and finish the work using proposed patches as a basis. I would made a review. I'm a little doubt about appropriateness base85 codec in the base64 module ("This module provides data encoding and decoding as specified in RFC 3548."). Base85 is not standard. But I don't see better place for it. At least the description of the base64 module should be corrected. I suggest first resolve bpo-16995. Perhaps it will get suggestions about base85 interface. |
Well, I think the following comments (Serhiy's) should be implemented: """As for interface, I think 'adobe' flag should be false by default. It makes encoder simpler. ascii85 encoder in Go's standard library doesn't wrap nor add Adobe's brackets. btoa/atob functions looks redundant as we can just use a85encode/a85decoder with appropriate options.""" |
Updated patch with suggested API changes, + docs. |
Updated patch incorporating Serhiy's self-review from 6 months ago (grr). |
I added more comments on Rietveld. |
Did you forget to publish them? |
Grr. |
Updated patch after Serhiy's comments. |
Yet one nitpick and the patch LGTM. |
New changeset 42366e293b7b by Antoine Pitrou in branch 'default': |
Now committed, thanks for the reviews and the code! |
New changeset 1853679c6f71 by R David Murray in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: