[ENH] Improve performance of isotonic regression #2944
ajtulloch wants to merge 10 commits into scikit-learn:master
Conversation
This supersedes scikit-learn#2940. PAVA runs in linear time, but the existing algorithm was scaling approximately quadratically, because the array pop in the inner loop is O(N). I implemented the O(N) version of PAVA using the decreasing-subsequences trick, and it is significantly faster in benchmarking. The benchmarks in this diff (adapted from one of the existing unit tests) show substantial speedups: on the order of ~7x for problems of size ~1,000, ~30x for size 10,000, and ~250x for size 100,000. On correctness: the unit tests cover the isotonic regression code fairly well, and all pass before and after the change. It's a fairly well-known algorithm with many existing implementations, so I believe this is correct. While coding the algorithm I made some mistakes that the unit tests caught, which makes me more confident in its correctness now. Still, the performance improvements are surprisingly large.
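The idea behind the linear-time version can be sketched in plain Python. This is a minimal illustration, not scikit-learn's actual Cython implementation (the function name and block representation here are hypothetical): blocks are only ever merged with their immediate predecessor at the end of a stack, so each element is pushed once and popped at most once, giving amortized O(N) instead of popping from the middle of an array.

```python
def pava(y, w=None):
    """Isotonic (non-decreasing) least-squares fit of y, optionally weighted.

    A sketch of the pool-adjacent-violators algorithm using the
    decreasing-subsequences trick: maintain a stack of blocks with
    non-decreasing means, merging the top block into its predecessor
    whenever monotonicity is violated.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block is [mean, total_weight, size].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, s2 = blocks.pop()
            m1, w1, s1 = blocks[-1]
            blocks[-1] = [(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, s1 + s2]
    # Expand the blocks back into a per-element solution.
    out = []
    for m, _, s in blocks:
        out.extend([m] * s)
    return out
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean, giving `[1, 2.5, 2.5, 4]`. The quadratic behavior of the old code came from popping violators out of the middle of a growing array inside the loop; here the pop is always from the end of the stack.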
|
Thanks very much for the fix @ajtulloch. It looks good to me. One cosmetic note: please replace @fabianp @NelleV you might be interested in having a look at this. |
benchmarks/bench_isotonic.py
Outdated
Please make it work under Python 3 by adding parens to print statements.
|
I didn't run the code but it looks awesome :-) |
|
To replicate/test for yourself, run:

    cython sklearn/_isotonic.pyx
    make inplace
    nosetests sklearn.tests.test_isotonic
    python benchmarks/bench_isotonic.py 2 8 1 | gnuplot -e "set terminal dumb; set logscale xy; plot '<cat' title 'performance'"

|
|
It's funny, because I originally implemented PAVA, and it was slower than @fabianp's active set implementation. Can you show us how you generated the benchmarks? |
|
My mistake: the benchmarks are in the PR… |
|
@ajtulloch could you please commit the result of the cythonization of the edited pyx file in this PR? Please use the latest stable version of Cython (run |
If you want to link to your home page, you must give the URL at the end of the file. Otherwise, just remove the link markup.
|
Hi @ogrisel, The updated I'll update the link in @NelleV, I suspect the performance differences are down to a few things:
|
|
My bad I somehow did not see the C file even though I thought I had clicked on the list of updated files. |
|
+1 for merging. |
|
The benchmark script is undocumented and uncommented. It takes three numbers and produces two other numbers, without a hint as to how to interpret them. |
|
For example usage, run:

```
python benchmarks/bench_isotonic.py --iterations 10 --log_min_problem_size 2 --log_max_problem_size 8 --dataset logistic
```
|
@larsmans - I've updated the benchmark script with more usage details, a proper parameter parser, and added comments where appropriate. Thanks for your comment. |
|
Ok, merged as 3753563. Thanks! |
|
Great, thanks again for your comments. |