-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reproducible pyc: FLAG_REF is not stable. #78274
Comments
PR-8226 makes marshal two-pass. It may have small overhead. In case of compiling module, marshal performance is negligible. And should we backport this to Python 3.7? |
marshal: Mean +- std dev: [master] 123 us +- 7 us -> [patched] 173 us +- 2 us: 1.41x slower (+41%) |
Look also at alternate patches for bpo-20416. Some of them can solve this problem for simple types. If they have better performance, using them for simple types could save a time. But this will complicate a code, and I'm not sure it is worth. |
According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible: |
What is the time spent in marshal.dumps() at Python startup when Python has to create all .pyc files? For example "./python -c pass" in the master branch with no external dependency? My question is if the PR makes Python startup 5% slower or less than 1% slower. |
PYTHONHASHSEED can be used to stable frozenset order. On the other hand, refcnt based approach is more unstable. |
When startup, Python does more than compile()+marshal.dumps(). Additionally, it happens only once if pyc can be writable. |
I know. But Python does more than compile()+dumps() at the first run. I'm curious if it is feasible to measure this cost. But it may be hard to get reliable benchmarks, since I expect that the difference will be very small, and I know very well that measuring Python startup is hard since it depends a lot of on the filesystem which is hard to measure. |
Why must this become slower? To my knowledge, many projects prefer marshal over pickle Would it not be easy to add a named optional keyword |
My pull request did it. But for now, I get hint on ML and overwrote my PR with another way: Use FLAG_REF for all interned strings. |
I created bpo-37596 "Reproducible pyc: frozenset is not serialized in a deterministic order" to track this issue. |
FYI, I unknowingly created a duplicate of this issue a few days ago, bpo-45186, and created a PR for it: #28379. Interestingly, while I did that PR independently, it has a lot in common with Inada-san's second PR. My interest here is in how frozen modules can be affected by this problem, particularly between debug and non-debug builds. See bpo-45020, where I'm working on freezing all the stdlib modules imported during startup. |
It turns out that I don't need this after all (once I merged #72578 and bpo-45188 was resolved). That impacts how much time I have to spend on this, so I might not be able to pursue this further. That said, I think it is worth doing and the PR I have up mostly does everything we need here. So I'll see if I can follow this through. :) |
FWIW, I found a faster solution than calling Currently the logic for w_ref() (used for each "complex" object) looks like this:
The faster solution looks like this:
While this is faster, there are two downsides: extra memory usage and it isn't practical when writing to a file. However, I don't think either is a significant problem. For the former, it can be mostly mitigated by using the negative values in WFILE.hashtable to store the type byte position. For the latter, "marshal.dump()" is already a light wrapper around "marshal.dump()" and for PyMarshal_WriteObjectToFile() we simply stick with the current unstable approach (or change it to do what "marshal.dump()" does). FYI, I mostly have that implemented in a branch, but am not sure when I'll get back to it. |
Since this is no longer needed by CPython, I recommend closing the issue (and PR). I'll close next week if there are no objections. |
Do you mean that the initial issue was fixed? PYC are now reproducible? |
No. I mean that they don't need to be reproducible. See Guido's comment: #28379 (comment) |
Ok, but this issue is about making PYC files reproducible to reproducible builds. @methane: What do you think? |
Wheel doesn't contain pyc files. So Python ecosystem around PyPI doesn't need reproducible pyc.
I think its fine too. |
Wheels might not use pyc but other downstream build artifacts might want to include pyc files for performance reasons while remaining reproducible themselves. Specifically, the python-for-android project packages pyc files inside Another example is .AppImage on desktop linux. If the appimage bundles .pyc files, it makes the program launch faster. It makes sense to do the pyc compilation at build-time and package it inside the binary, instead of doing the pyc compilation on the enduser's machine on every app launch. To have the .AppImage reproducibly built, the pyc files need to be reproducible. |
@SomberNight We know it. What we talked about is 3rd party tool can implement reproducible pyc build. So marshal don't need to implement it. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: