PROFILE_TASK for PGO build is not a good workload #80225
I was doing some 'perf' profiling runs of Python. I decided to try running PROFILE_TASK to see what the profile looks like. I was surprised that gc_collect() dominated runtime (the perf report with its Children/Self/Symbol columns is omitted here). Part of the problem is that we run a full cyclic GC for every test: cleanup_test_droppings() calls gc.collect(). Maybe we could make these calls conditional on the --pgo flag of regrtest. Or maybe we need to re-evaluate whether running the unit test suite is the best way to generate PGO trace data. Based on a tiny bit of further investigation, it looks like gc_collect() is getting called quite a lot of times, in addition to the calls from cleanup_test_droppings(). Maybe there is some low-hanging fruit here for optimization; a full GC is pretty expensive. |
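A rough sketch of how such a measurement can be reproduced (the output file name is arbitrary, and this assumes a Linux machine where perf is available and permitted):

    # Record the profile task with call graphs so the "Children" column is meaningful.
    perf record -g -o profile-task.data -- ./python -m test.regrtest --pgo

    # Summarise where the time went; per the report described above, gc_collect()
    # and its callers account for a large share.
    perf report -i profile-task.data --sort=symbol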
Also, the test suite exercises a lot of branches (like passing incorrect types to functions to check error handling) that will hurt the branch prediction that PGO generates. |
In my experience, people overthink what needs to go into a CPython profiling run. Sure, our default PROFILE_TASK is rather unfortunate because it takes a very long time, including runs of super slow tests that won't meaningfully contribute profile data (multiprocessing, subprocess, concurrent_futures, heavy I/O, etc.). But despite somewhat popular belief, it is not actually a problem that the suite exercises the other sides of branches: by and large, just by executing Python code at all and exercising C extension module code, it still acts like most any Python program and spends most of its time on the common critical paths, regardless of tests that trigger some number of executions of other paths. Those executions pale in comparison to the ordinary ones anywhere critical. I don't recommend making any claim about something "harming" the profile without reliable data to prove it.

Feel free to tune what test.regrtest --pgo enables or disables by default, but try to do it in a scientific, data-based manner rather than guessing. Decreasing the total wall time for a default --enable-optimizations build would be a good thing for everyone, provided the resulting interpreter remains "effectively similar" in speed. If you somehow manage to find something that actually speeds up the resulting interpreter, amazing! https://github.com/python/pyperformance is your friend for real-world performance measurement. Patience is key; the builds and benchmark runs are slow.

One thing --enable-optimizations does _not_ do today is enable link time optimization by default. Various toolchain+platform versions were having problems successfully generating a working interpreter with LTO enabled. If you want a potentially large speedup in the interpreter, figuring out how to get link time optimization working reliably on top of the existing PGO, detecting the toolchain+platform combos where it will be reliable, and enabling it by default on such systems is _likely_ the largest possible gain still to be had. |
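For reference, a sketch of the kind of measurement loop described above; the result file names are made up, and pyperformance's exact command-line interface may differ between versions:

    # Baseline: plain build of the branch under test.
    ./configure && make -j"$(nproc)"
    pyperformance run --python=./python -o baseline.json

    # Candidate: PGO, plus LTO if the toolchain can produce a working interpreter with it.
    make distclean
    ./configure --enable-optimizations --with-lto && make -j"$(nproc)"
    pyperformance run --python=./python -o optimized.json

    # Compare the two result files.
    pyperformance compare baseline.json optimized.json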
I spent quite a lot of time making different PGO builds and comparing with pyperformance. The current PGO task is *really* slow: just running the PROFILE_TASK takes 24 minutes on my decently fast PC. Using this set of tests seems to work pretty well: PROFILE_TASK=-m test.regrtest --pgo [...]. Instead of 24 minutes, the above task takes one and a half minutes, and the pyperformance results seem largely unchanged (comparison below). Tuning the tests to get the best pyperformance result is a bit dangerous, and perhaps running the whole test suite is safer (i.e. we are not optimizing for specific benchmarks). I didn't tweak the list too much; I added test_int, test_long, test_struct and test_itertools as a result of my pyperformance runs, and it is not too surprising that those are important modules. I think the set of tests above should do a pretty good job of covering the hot code paths in most Python programs, so maybe it is good enough given the massive speedup in build time. [pyperformance comparison table omitted] Not significant (15): deltablue; django_template; hexiom; html5lib; logging_simple; regex_compile; scimark_fft; scimark_lu; scimark_monte_carlo; scimark_sor; sympy_str; unpack_sequence; unpickle; xml_etree_parse; xml_etree_iterparse |
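The actual test list from the comment above is not preserved in this thread, so the invocation below is only a hypothetical sketch of what such a reduced PROFILE_TASK override looks like; the test names are illustrative, not the list that was measured:

    ./configure --enable-optimizations
    # make accepts command-line variable overrides, so this replaces the Makefile's
    # default PROFILE_TASK for the PGO instrumentation run.
    make PROFILE_TASK="-m test.regrtest --pgo test_collections test_dict test_itertools test_int test_long test_struct test_unicode"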
Yeah, I pulled a similar-looking list of tests for the PROFILE_TASK to suggest to the Docker Python image maintainers - see docker-library/python#404 and docker-library/python#160. I haven't run comparisons of that vs the default overly long profile task, but a comparison against a non-PGO-enabled CPython 3.7 build shows the types of speedups I'd expect from a decent PGO build. Doing proper comparisons against a full build would take a lot more time than I have right now. It'd be reasonable to change the default task to something that is faster to build so long as we don't appear to be making major sacrifices in speed somewhere. (Your benchmark table's pickle result changes are inconsistent and odd, but that benchmark may not be very meaningful as written; I haven't looked at it.) |
I tweaked the list of unit tests a little more, trying to incorporate some from your Docker build settings. Not sure what's going on with the pickle results. Below are new pyperformance runs, comparing my PR to the "master" version it is based on. I added the --with-profile-task configure option; if someone really wants to wait for a long build to gain a tiny bit of extra speed, that makes it pretty easy to do. BTW, I tried making --pgo disable most of the gc_collect() calls. It doesn't actually make much difference in profile generation time, so I discarded those changes. |
[pyperformance comparison table omitted] Not significant (17): chaos; deltablue; go; json_dumps; logging_silent; mako; meteor_contest; pickle; pickle_dict; pickle_pure_python; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_integrate; telco; xml_etree_iterparse |
Looking at PR 14702, I would much rather update what --pgo runs than maintain a separate hard-coded list of tests. Back when we added the --pgo flag, the intent was always to move toward an equivalently good profile, but for practical reasons we settled at the time for "skip the tests that break PGO". I know that will turn the change into far more skips throughout the test suite, so perhaps we want to change how --pgo works completely, but it would certainly be better to make --pgo do what it advertises than to have multiple lists of PGO profiles. |
As far as where to put the lists of tests, you're probably right: putting it within test.regrtest itself under the --pgo banner makes sense (though we should keep logic to accept an explicit list of tests to add or exclude if the user has provided one). |
Thanks for the feedback. I agree that putting the tests in regrtest is better. I've made the following changes: [list of changes omitted] |
I think this is abusing the --with-profile-task configure option. How about letting PROFILE_TASK be overridden directly on the configure command line instead? |
I changed configure.in to use AC_ARG_VAR instead. So, you can override it as you suggest: ./configure [..] PROFILE_TASK="-m test --pgo-extended" |
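A sketch of how that override is used in practice; --pgo-extended is the regrtest flag quoted above, while the rest of the invocation is an assumption about a typical optimized build:

    # With AC_ARG_VAR, variables given on the configure command line are recorded
    # and substituted into the generated Makefile.
    ./configure --enable-optimizations PROFILE_TASK="-m test --pgo-extended"
    make -j"$(nproc)"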
Travis only runs 40 tests after this change (https://travis-ci.org/python/cpython/jobs/562362678). |
tracking that as a release blocker in https://bugs.python.org/issue37667 |
I just ran this on Windows and noticed that there is no data for _msi.pyd, winsound.pyd or _sqlite3.pyd. The first two don't matter - we should just suppress PGO on those projects to avoid the warnings (add a <SupportPGO>false</SupportPGO> property). That's a note to myself, unless someone else gets there first. However, I think we should get coverage of the sqlite3 module, so it may be worth adding its tests into this set. I'm not sure if they're great coverage, but it'll be better than nothing. |
I also added test_bz2 and test_lzma into the PGO profile, as the extension modules for those on Windows were barely being covered (they get imported, apparently, but that seems to be it). |
@steve.dower |
I doubt test_lzma, test_bz2 and test_sqlite matter. What matters there is compiling the underlying lzma, bz2, and sqlite3 libraries with PGO, and that isn't done as part of our build system. Regardless, those tests are fast enough, so I've approved the PR to add them. |
On Windows they are linked in as source files, which is why it matters for us. On platforms where we rely on system libraries for this it doesn't matter. We currently don't PGO OpenSSL on Windows, which would probably be nice but is much harder to set up as we use their build scripts. Libffi is built in our build though, so adding ctypes would get that covered, but I'm not so worried about it. |
Thanks for approving my PR, Greg! |
I think expanding the list of tests used by --pgo is fine, as long as we put a little thought into each one we add. The ones added by Steve look fine to me. It seems useful to run a profiler and see if there are any unusually expensive tests being added. I used cProfile and found a few candidates. See bug bpo-37707 and PR 15009. |
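A sketch of that kind of check; the output file name is arbitrary, and this assumes cProfile's -m option (available since Python 3.7) for profiling regrtest as a module:

    # Profile the reduced PGO test run and dump raw stats.
    ./python -m cProfile -o pgo-profile.out -m test.regrtest --pgo

    # Sort by cumulative time to spot unusually expensive tests.
    ./python -c "import pstats; pstats.Stats('pgo-profile.out').sort_stats('cumulative').print_stats(30)"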
Closing as I think PR 14702 mostly resolves this. The gc_collect() issue has not been resolved, but based on some investigation, I think it is not worth the effort to try to reduce the time spent in that function. Reducing the number of unit tests being run had a huge benefit and did not significantly impact the optimization. |