ceval: use Wordcode, 16-bit bytecode #70834
Originally started at https://github.com/abarnert/cpython/tree/wpy. This patch is based off of https://github.com/serprex/cpython/tree/wpy. It omits importlib.h & importlib_external.h, as those are generated, and it omits https://github.com/serprex/cpython/blob/wpy/Python/wordcode.md.

I got around to benchmarking against a build of master rather than using my repo's packaged version; it's currently a 1% speed improvement (every bit counts). I'm testing on an Intel Atom 330 with Linux. Besides the minor perf increase, it generates smaller bytecode & is simpler (the peephole optimizer now handles EXTENDED_ARG since it isn't too hard to track, & while loops become for loops in dis).

Previous discussion: https://mail.python.org/pipermail/python-dev/2016-February/143357.html

pdb works without changes. coverage.py doesn't seem to rely on anything this changes. I modified byteplay to target this change, mostly over the course of half an hour before work: https://github.com/serprex/byteplay/blob/master/wbyteplay.py

I'd be interested to hear whether this encoding simplifies things for FAT Python & the recent work to cache attribute/global lookups.

Remaining code issues: the peepholer could allocate half the space it does now for basic block tracking, and compile.c & peephole.c repeat themselves on computing instruction size given an argument & on how to emit an instruction given an argument.

Breaking change in dis: I've removed HAVE_ARGUMENT. This is to help code fail fast. It could be replaced with IGNORES_ARGUMENT or, as abarnert suggested, a range(90, 256) named 'hasarg' after the other hasXXXs.
Also missing from this patch is modification of the bytecode magic number.
Sorry, I don't have the context. Can you please explain your change? What did you do? What is the rationale? Do you expect better performance? If so, please run the Python benchmark suite and post the results here. What is the new format of the bytecode? Etc.
I'll dig up benchmark results when I get home, but I'd be interested to get results on a less wannabe-RISC CPU.

The change is to have all instructions take an argument. This removes the per-instruction branch on whether to load oparg. It also aligns instructions to always be 2 bytes rather than 1 or 3, by having arguments only take up 1 byte. In the case that an argument to an instruction is greater than 255, it can chain EXTENDED_ARG up to 3 times. In practice this rarely occurs, mostly only for jumps, & abarnert measured the stdlib to be ~5% smaller.

The rationale is that this offers 3 benefits: smaller code size, simpler instruction iteration/indexing (one may now scan backwards, as peephole.c does in this patch), and between the two a small perf gain (the only way for perf to be negatively impacted is by an increase in EXTENDED_ARGs; when I post benchmarking I'll also post a count of how many more EXTENDED_ARGs are emitted).

This also means that if I want to create something like a tracer that tracks some information for each instruction, I can allocate an array of codesize/2 bytes, then index off half the instruction offset. This isn't currently done in peephole.c, nor does this patch include halving jump opargs.

I've looked up the 'recent work to cache attribute/global lookup' issue I mentioned: http://bugs.python.org/issue26219
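The encoding described above (fixed 2-byte instructions, with oversized arguments carried by EXTENDED_ARG prefixes) can be sketched in Python. This is an illustrative decoder, not code from the patch:

```python
import dis

def decode_wordcode(co_code):
    """Decode fixed-width 2-byte instructions, folding EXTENDED_ARG
    prefixes into the argument of the instruction they extend."""
    extended = 0
    for offset in range(0, len(co_code), 2):
        opcode = co_code[offset]
        oparg = co_code[offset + 1] | extended
        if opcode == dis.opmap["EXTENDED_ARG"]:
            extended = oparg << 8   # carry into the next instruction
            continue
        extended = 0
        yield offset, dis.opname[opcode], oparg

# A synthetic two-word sequence: EXTENDED_ARG 1 followed by
# JUMP_FORWARD 2 decodes to one JUMP_FORWARD with arg (1 << 8) | 2.
words = bytes([dis.opmap["EXTENDED_ARG"], 1, dis.opmap["JUMP_FORWARD"], 2])
print(list(decode_wordcode(words)))  # [(2, 'JUMP_FORWARD', 258)]
```

Because every instruction is exactly one 16-bit word, scanning backwards or allocating a side table of codesize/2 entries (as the tracer idea above suggests) needs no variable-length decoding.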
There is something called "inline caching": put the cache between instructions, in the same memory block. An example of a paper on CPython: "Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009).

Yury's approach is a standard lookup table: offset => cache. In the issue bpo-26219, he even used two tables: co->co_opt_opcodemap is an array mapping an instruction offset to an offset in the cache, and then the second offset is used to retrieve cache data from a second array. You have 3 structures (co_code, co_opt_opcodemap, co_opt), whereas inline caching proposes using only one flat structure (a single array).

The paper promises "improved data locality and instruction decoding effciency", but: "The new combined data-structure requires significantly more space—two native machine words for each instruction byte. To compensate for the additional space requirements, we use a profiling infrastructure to decide when to switch to this new instruction encoding at run time."

Memory footprint and detection of hot code are handled in the issue bpo-26219.
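The two-table lookup described above can be sketched as follows. The names co_opt_opcodemap and co_opt follow the comment, but the exact layout here is hypothetical, not the actual bpo-26219 structures:

```python
def lookup_cache(co_opt_opcodemap, co_opt, instr_offset):
    """First indirection: instruction offset -> cache slot
    (0 meaning 'no cache for this instruction').
    Second indirection: cache slot -> cache entry."""
    slot = co_opt_opcodemap[instr_offset]
    if slot == 0:
        return None
    return co_opt[slot - 1]

# Instructions at offsets 1 and 3 are cached; offset 0 is not:
opcodemap = [0, 1, 0, 2]
caches = [{"version": 7}, {"version": 9}]
print(lookup_cache(opcodemap, caches, 3))  # {'version': 9}
```

Inline caching, by contrast, interleaves the cache entries with the instructions themselves, so a single flat array serves both roles at the cost of the extra space the paper mentions.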
Oh ok, I like that :-) I had the same idea.

Your patch contains unrelated changes; you should revert them to make the change simpler to review.

Removing HAVE_ARGUMENT from opcode.h/dis.py doesn't seem like a good idea. IMHO it's still useful for dis to show a more compact bytecode. For example, I expect "DUP_TOP", not "DUP_TOP 0", or worse "DUP_TOP 5". For backward compatibility, I also suggest keeping HAS_ARG(), even if it must not be used to decode instructions anymore.

The obvious follow-up change is to use a pointer aligned to 16 bits for co_code, to be able to fetch one 16-bit instruction rather than two 8-bit reads to retrieve the opcode and then the argument in ceval.c (see the issue bpo-25823). I suggest implementing that later to keep this change as simple as possible.
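The 16-bit fetch proposed here can be simulated in Python with struct (the real ceval.c change would be a C-level aligned load, and its byte order would be the platform's, not necessarily little-endian as assumed below):

```python
import struct

def fetch16(co_code):
    """Read each instruction as one little-endian 16-bit word, then
    split it into opcode (low byte) and oparg (high byte), instead of
    performing two separate 8-bit loads."""
    for (word,) in struct.iter_unpack("<H", co_code):
        yield word & 0xFF, word >> 8

# Illustrative byte pairs only; e.g. 100 was LOAD_CONST and 83 was
# RETURN_VALUE in the opcode numbering of that era.
print(list(fetch16(bytes([100, 1, 83, 0]))))  # [(100, 1), (83, 0)]
```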
I've attached some benchmarking results as requested. There is 1 failing test which doesn't fail in master: test_trace's unit test for bpo-9936.
To clarify the format of the extended-arg listings: the 1st column is the number of instances of EXTENDED_ARG being emitted, the 2nd column is the length of the bytecode, followed by the filename.

The previous numbers were for modules, which generally are run-once and list many constants. I've attached a modification where instead I iterated over the code objects inside co_consts from compiling the *.py file. Trunk currently only emits EXTENDED_ARGs for classes (Pdb, & then the rest in Lib/typing.py), so I've omitted it.
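A counting pass like the one behind these listings can be sketched in a few lines, assuming a Python 3.6+ interpreter where co_code is already wordcode (this is an illustration of the measurement, not the script actually used):

```python
import dis
import types

def count_extended_args(code):
    """Count EXTENDED_ARG opcodes in a code object and, recursively,
    in the code objects nested inside its co_consts."""
    ext = dis.opmap["EXTENDED_ARG"]
    # Every instruction is 2 bytes, so opcodes sit at even offsets.
    n = sum(1 for i in range(0, len(code.co_code), 2)
            if code.co_code[i] == ext)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            n += count_extended_args(const)
    return n

# A module with several hundred names forces name indexes past 255,
# so at least one EXTENDED_ARG must be emitted:
src = "\n".join("v%d = %d" % (i, i) for i in range(400))
print(count_extended_args(compile(src, "<sketch>", "exec")) > 0)  # True
```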
Thanks for the benchmark results, Demur, but I think the benchmarks Victor was talking about are at hg.python.org/benchmarks
Demur, I think you're on the right track here. It will be nice to be rid of the HAVE_ARGUMENT tests and to not have to decode the arguments one byte at a time. Overall, the patch looks good (although it includes several small unrelated changes). Besides the speed benefit, the code looks cleaner than before.

I was surprised to see that the peephole optimizer grew larger, but the handling of extended arguments is likely worth it even though it adds several new wordy chunks of code. When it comes to benchmarks, expect a certain amount of noise (especially from those that use I/O or that exercise the C API more than the pure-Python bytecode).
FWIW, I'm seeing about a 7% improvement to pystone.
While it's good to know benchmarking in core Python goes beyond the microbenchmarks included in the distribution, I'm having some trouble with hg.python.org/benchmarks due to my system only having 256 MB of RAM. I've attached results for 2 benchmarks: 2to3 & regex.
Report on Darwin Raymonds-2013-MacBook-Pro.local 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64 i386

### 2to3 ###
### chameleon_v2 ###
### django_v3 ###
### nbody ###

The following not significant results are hidden, use -v to show them:
Thanks to Demur and Raymond for running the benchmarks. All of these numbers look good, and with Raymond saying the code looks cleaner and everyone so far -- including me -- liking the overall idea, this definitely seems worth continuing. Thanks for starting this, Demur, and I hope you feel up for continuing the work!
Added back HAVE_ARGUMENT & HAS_ARG. As a result, printing now omits the unused arguments. Removed some code which was labelled unrelated. This does _not_ include having f_lasti be -1 instead of -2.
Addressed feedback from josh.rosenberg besides reintroducing FOURTH/SET_FOURTH |
I reviewed wpy3.patch. I concur with Raymond: it's really nice to have a regular structure for the bytecode.

--

Serhiy proposed to *reduce* the size of bytecode by adding new specialized opcodes which include the argument, for example (LOAD_CONST, 0) => LOAD_CONST_0. I would like to hear his opinion on this change.

Data+code loaded by import is the #1 memory consumer in basic scripts according to tracemalloc. I don't know the ratio between data and code, but here we are only talking about the co_code fields of code objects; I guess that the file size of the .pyc is a good estimation. I don't think that the memory footprint of bytecode (the co_code fields of code objects) really matters on computers (and smartphones?) of 2016. *If* I have to choose between CPU performance and memory footprint, I choose the CPU!

--
IMHO it's ok to break the C API, but I would prefer to keep the backward compatibility for the Python API (replace any negative number with -1 for the Python API). |
Got f_lasti working as -1. Applied PEP 7. Unrelated: fixed a misnamed variable in test_grammar because it ran into a peephole bug (the const -> jump_if_false erasure didn't work when EXTENDED_ARGs were involved). dis has argval/arg set to None instead of the unused argument value.

Things are seeming more brittle with f_lasti as -1. But maybe it's all in my head.
[12:36] <serprex> Could I get a code review for wordcode's 4th patchset? http://bugs.python.org/review/26647/#ps16875 |
module_finder.patch: cleanup (optimization?) of modulefinder.ModuleFinder.scan_opcodes_25(): use an index rather than creating a lot of substrings. It's unrelated to wordcode; I just noticed the inefficient code while reviewing the whole patch.
New changeset 7bf08a11d4c9 by Victor Stinner in branch 'default': New changeset 423e2a96189e by Victor Stinner in branch 'default': New changeset f8398dba48fb by Victor Stinner in branch '3.5': |
Demur Rumed: can you please rebase your patch? And can you please generate a patch without the git format? |
Made the changes from code review, and did a little extra fixing up type consistency. Not sure if this is exactly the patch format you wanted; I tried. I've updated modulefinder with haypo's index patch, except in the context of wordcode.
Updated wpy5.patch to use a more standard diff format (patch generated with Mercurial, hg diff > patch). |
Crap, I forgot Python/wordcode_helpers.h. I uploaded a fixed wpy6.patch.
Demur Rumed: Is the peephole optimizer able to emit *two* EXTENDED_ARGs for jumps larger than 16 bits? Currently, it only has to retry once to add EXTENDED_ARG if a jump is larger than 16 bits (to use a 32-bit jump offset).
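The emission side of EXTENDED_ARG chaining (the counterpart of the retry Victor asks about) can be sketched like this; it's an illustrative assembler helper, not the compile.c/peephole.c code:

```python
import dis

def emit_with_extended_args(opcode, oparg):
    """Prefix up to three EXTENDED_ARGs so an oparg wider than one
    byte still fits the fixed 2-byte instruction format."""
    assert 0 <= oparg <= 0xFFFFFFFF
    out = []
    for shift in (24, 16, 8):
        byte = (oparg >> shift) & 0xFF
        if byte or out:  # once started, emit every remaining byte
            out += [dis.opmap["EXTENDED_ARG"], byte]
    out += [opcode, oparg & 0xFF]
    return bytes(out)

jump = dis.opmap["JUMP_FORWARD"]
# A jump offset of 0x10002 (> 16 bits) needs two EXTENDED_ARGs,
# so the encoded instruction is 3 words = 6 bytes:
print(len(emit_with_extended_args(jump, 0x10002)))  # 6
```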
I ran the Python benchmark suite on wpy6.patch.
It looks like we get more faster benchmarks than slower benchmarks. The best is up to 11% faster, whereas the worst slowdown is only 4%. The overall results look good to me. Slower:
Faster:
Full output:

Original python: ../wordcode/python
Patched python: ../wordcode/python
INFO:root:Automatically selected timer: perf_counter

Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64

### call_method ###
### call_method_slots ###
### call_method_unknown ###
### call_simple ###
### chaos ###
### django_v3 ###
### etree_iterparse ###
### etree_parse ###
### fannkuch ###
### float ###
### mako_v2 ###
### meteor_contest ###
### nbody ###
### pickle_dict ###
### richards ###
### silent_logging ###
### simple_logging ###
### telco ###
### unpack_sequence ###

The following not significant results are hidden, use -v to show them:
Sorry for the nuisance of uploading another patch so soon. wpyB modifies test_ctypes now that __phello__ is smaller, fixes a typo in a comment I made, & removes a blank line I had added when adding in the if(0) logic.
Warnings still emitted in debug build. |
I have verified that wpyC does not produce signed/unsigned warnings with make DEBUG=1 |
Removes the 0 <= unsigned assertion & fixes the j < 0 check to avoid an overflow bug.
LGTM. If no one has more comments, I'm going to commit the patch. |
wpyD.patch LGTM, go ahead! We can still polish it later and discuss how to implement the 16-bit fetch ;-) It would be nice to add a short comment in Python/wordcode_helpers.h explaining that it contains code shared by the compiler and the peephole optimizer. It can be done later. |
New changeset 3a57eafd8401 by Serhiy Storchaka in branch 'default': |
Oh, I forgot to add a note in What's New. And the documentation of the dis module should be updated (EXTENDED_ARG etc). Could anyone do this? |
Yeah, congrats Demur! |
I join in the congratulations, Demur! Thank you for your contributions. I left this issue open for updating the documentation and other polishing. |
A documentation touch up for EXTENDED_ARG is included in bpo-27095 |
Chatting to Russell Keith-Magee, I realised the bytecode section in the devguide's description of the code generation pipeline may need some tweaks to account for the differences between 3.6 and earlier versions: https://docs.python.org/devguide/compiler.html#ast-to-cfg-to-bytecode |
I switched the target component to Documentation to reflect that, as far as we know, this is feature-complete from a functional perspective, but there hasn't yet been a review of the docs for bytecode references, nor a decision on whether or not we want to systematically switch to using "wordcode" instead.
Hi, I ran the CPython benchmark suite (my fork, modified to be more stable) on ed4eec682199 (patched) vs 7a7f54fe0698 (base). The patched version contains wordcode (issue bpo-26647) + the 16-bit fetch for opcode and oparg (issue bpo-27097). The speedup is quite nice. The attached default-May26-03-05-10.log contains the full output.

Faster (27):
Slower (1):
Not significant (14):
|
I think we should make a few more related changes:
These changes break compatibility (already broken by switching to 16-bit bytecode). The first one breaks compatibility with compiled bytecode and needs incrementing the magic number. That is why I think we should do this in this issue. Which is better: one large patch, or separate simpler patches for every stage?
Here is the large patch (not including the generated Python/importlib.h and Python/importlib_external.h).
Serhiy: please open a new issue for your change. While it's related, it's
By the way, please don't include the generated importlib .h file in your
New changeset 303cedfb9e7a by Victor Stinner in branch '3.6': |
This issue is done: see issue bpo-27129 for the next step. |
I left this issue open for documenting the wordcode. I've now opened the separate bpo-28810 for this.