-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Micro optimization for range.index if step is 1 #86068
Comments
if we declare range without setting step, we don't have to process divide operation. here is the benchmark result both setting step and without step, With my patch, there is no performance degrade with range.index when the step is not one and showing 19% enhancement when the step is the default value (1) . Mean +- std dev: [range_master] 1.11 us +- 0.01 us -> [range_opt] 896 ns +- 23 ns: 1.24x faster (-19%) Mean +- std dev: [range_step_master] 1.12 us +- 0.02 us -> [range_step_opt] 1.11 us +- 0.01 us: 1.01x faster (-1%) |
For optimization case, >>> a = range(0, 11111)
>>> a.index(3)
3 |
PR 22479 avoids calling PyNumber_FloorDivide(a, b) if b == 1 (if b == _PyLong_One): it makes range.index(a, 1) call 214 ns faster. I'm surprised that PyNumber_FloorDivide(a, 1) takes 214 ns. Does it spend most time in binary_op1()? Or PyNumber_FloorDivide()? long_div(a, 1) is quite simple:
if (Py_ABS(Py_SIZE(a)) == 1 && Py_ABS(Py_SIZE(b)) == 1) {
return fast_floor_div((PyLongObject*)a, (PyLongObject*)b); with: static PyObject *
fast_floor_div(PyLongObject *a, PyLongObject *b)
{
sdigit left = a->ob_digit[0];
sdigit right = b->ob_digit[0];
sdigit div;
if (Py_SIZE(a) == Py_SIZE(b)) {
div = left / right;
}
else {
div = -1 - (left - 1) / right;
}
return PyLong_FromLong(div);
} Do we need another fast-path in long_div(a, b) when b == _PyLong_One? Just return a in this case. |
For practical use case
|
About your benchmark, do you build Python with PGO+LTO optimizations? What is your OS and CPU? Do you use CPU isolation on Linux? It would be good to use PGO+LTO optimizations. |
Nope, but I will try to run the benchmark
macOS, intel i9 with 8 cores |
For the record with optimization Mean +- std dev: [range_master] 112 ns +- 3 ns -> [range_opt] 99.3 ns +- 1.8 ns: 1.13x faster (-12%) |
I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code. |
Mark: "I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code." In this case, I suggest to add a comment in long_div() to explain why there is no fast-path in long_div(). Otherwise, I am likely to suggest the exact same optimization in 6 months :-D I'm thinking at something similar to my comment in ceval.c: case TARGET(BINARY_ADD): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
/* NOTE(haypo): Please don't try to micro-optimize int+int on
CPython using bytecode, it is simply worthless.
See http://bugs.python.org/issue21955 and
http://bugs.python.org/issue10044 for the discussion. In short,
no patch shown any impact on a realistic benchmark, only a minor
speedup on microbenchmarks. */ This comment is the outcome of 2 years of hard work by many developers :-D See bpo-21955. |
Out of curiosity, what is the motivation for this optimization? Is there reason to believe that range.index is performed often enough that it's worth the extra code and maintenance cost here? |
There are at least 3 reasons
|
I found another optimal case and this looks more practical usecase than range.index |
Mean +- std dev: [master-compute] 317 ns +- 3 ns -> [bpo-41902-compute] 287 ns +- 6 ns: 1.11x faster (-10%) |
On my local machine (without cpu isolation), import pyperf
runner = pyperf.Runner()
runner.timeit(name="bench long divide",
stmt="""
for i in range(1, 256):
a = 10000 // i
""") but I think that my benchmark does not cover the worst case. I need to set up the CPU isolation environment but my resource is limited. PR 22479 and PR 22480 has two sides of assumption. PR 22479: PyNumber_FloorDivide is a heavy operation if the user thinks that this is an unnecessary operation it should be avoided. In conclusion, I'd like to +1 on mark's decision.
|
I would need to carefully look at the PRs to estimate the maintenance cost, but it seems to me initially that all these operations happen very rarely in regular code and probably will have no impact on macro benchmarks. In general, I dislike branches because they quickly can become the source of slight different semantics. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: