[ENH] Improve performance of isotonic regression #2944


@ajtulloch
Contributor

This supersedes #2940.

PAVA runs in linear time, but the existing algorithm was scaling
approximately quadratically, due to the array pop in the inner loop
being O(N).

I implemented the O(N) version of PAVA using the decreasing subsequences
trick, and the performance is significantly faster in benchmarking.
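As a minimal illustration of the quadratic behaviour described above (this is not the PR's Cython code, just the asymptotic effect): popping from the front of a Python list shifts every remaining element, so an O(N) pop inside a loop turns a linear pass into a quadratic one, while popping from the back stays O(1) amortized.

```python
import timeit

def drain_front(n):
    """Pop from the front: each pop shifts the rest, O(n) per pop, O(n^2) total."""
    xs = list(range(n))
    count = 0
    while xs:
        xs.pop(0)
        count += 1
    return count

def drain_back(n):
    """Pop from the back: O(1) amortized per pop, O(n) total."""
    xs = list(range(n))
    count = 0
    while xs:
        xs.pop()
        count += 1
    return count

if __name__ == "__main__":
    for n in (1000, 10000):
        t_front = timeit.timeit(lambda: drain_front(n), number=3)
        t_back = timeit.timeit(lambda: drain_back(n), number=3)
        # The ratio grows roughly linearly with n.
        print(n, round(t_front / t_back, 1))
```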

The benchmarks in this diff (adapted from one of the existing unit
tests) show significant performance improvements

| Problem size | Relative speedup of PAVA |
| ---: | ---: |
| 10 | 2x |
| 100 | 3x |
| 1,000 | 14x |
| 10,000 | 57x |
| 100,000 | 561x |
| 1,000,000 | 4645x |


On correctness - unit tests cover the isotonic regression code fairly
well, and all pass before and after the change. It's a fairly well known
algorithm with a bunch of implementations, so I think this is correct.
In coding up this algorithm I made some mistakes and the unit tests
caught the failures, which makes me more confident in the correctness
now. Still, the performance improvements seem suspiciously large.
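For reference, the pool-adjacent-violators idea behind the decreasing-subsequences trick can be sketched in pure Python (a hypothetical, unweighted re-implementation for illustration, not the Cython code in this PR): each point starts as its own block, and adjacent blocks are merged by weighted mean while they violate monotonicity. Since every element is merged at most once, the whole pass is O(N) amortized.

```python
def pava(y):
    """Linear-time pool-adjacent-violators for increasing isotonic regression."""
    means, weights = [], []
    for v in y:
        means.append(float(v))
        weights.append(1)
        # Merge the last two blocks while they violate monotonicity.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            means[-1] = (weights[-1] * means[-1] + w2 * m2) / (weights[-1] + w2)
            weights[-1] += w2
    # Expand each block back to per-point fitted values.
    fitted = []
    for m, w in zip(means, weights):
        fitted.extend([m] * w)
    return fitted
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean, giving `[1.0, 2.5, 2.5, 4.0]`.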

@ajtulloch ajtulloch ENH - Improve performance of isotonic regression
8e25dbc
@coveralls

Coverage Status

Coverage remained the same when pulling 8e25dbc on ajtulloch:isotonic-regression-in-linear-time into 4437d56 on scikit-learn:master.

@ogrisel
Member
ogrisel commented Mar 7, 2014

Thanks very much for the fix @ajtulloch. It looks good to me. One cosmetic note: please replace `k+1` with `k + 1` to be consistent with the project style. Also please add an entry to the `doc/whats_new.rst` changelog.

@fabianp @NelleV you might be interested in having a look at this.

@ogrisel ogrisel commented on an outdated diff Mar 7, 2014
benchmarks/bench_isotonic.py
```diff
+ delta = (datetime.now() - tstart)
+ return total_seconds(delta)
+
+if __name__ == '__main__':
+ min_exp, max_exp, iters = map(int, sys.argv[1:])
+ timings = []
+ for i in range(min_exp, max_exp):
+ n = 10 ** i
+ X = np.arange(n)
+ Y = np.random.randint(-50, 50, size=(n,)) \
+ + 50. * np.log(1 + np.arange(n))
+ times = [bench_isotonic_regression(X, Y) for i in range(iters)]
+ timing = (n, np.mean(times))
+ timings.append(timing)
+ xs, ys = zip(*timings)
+ print json.dumps(timings)
```
@ogrisel
ogrisel Mar 7, 2014 Member

Please make it work under Python 3 by adding parens to print statements.

@ogrisel

There are still some `k+1`s lying around, such as this one :)

@fabianp
Member
fabianp commented Mar 7, 2014

I didn't run the code but it looks awesome :-)

@ajtulloch
Contributor

To replicate/test for yourself, run

```
cython sklearn/_isotonic.pyx
make inplace
nosetests sklearn.tests.test_isotonic
python benchmarks/bench_isotonic.py 2 8 1 | gnuplot -e "set terminal dumb; set logscale xy; plot '<cat' title 'performance'"
```
@NelleV
Member
NelleV commented Mar 7, 2014

It's funny, because I originally implemented PAVA, and it was slower than @fabianp's active set implementation. Can you show us how you generated the benchmarks?

@NelleV
Member
NelleV commented Mar 7, 2014

My mistake: the benchmarks are in the PR…

@ogrisel
Member
ogrisel commented Mar 7, 2014

@ajtulloch could you please commit the result of the cythonization of the edited pyx file in this PR? Please use the latest stable version of Cython (run `pip install -U cython` to be sure).

@ogrisel ogrisel commented on the diff Mar 7, 2014
doc/whats_new.rst
```diff
@@ -154,6 +154,10 @@ Changelog
 :func:`feature_selection.f_regression` when variates are not centered.
 By `VirgileFritsch`_.
+ - Significant performance improvements (more than 100x speedup for
+ large problems) in :class:`isotonic.IsotonicRegression` by
+ `Andrew Tulloch`_.
```
@ogrisel
ogrisel Mar 7, 2014 Member

If you want to link to your home page you must give the URL at the end of the file. Otherwise, just remove the link markup.

@ajtulloch
Contributor

Hi @ogrisel,

The updated `sklearn/_isotonic.c` is attached (using version 0.20.1), see https://github.com/ajtulloch/scikit-learn/blob/isotonic-regression-in-linear-time/sklearn/_isotonic.c. GitHub doesn't show it in the diff due to the line `/sklearn/_isotonic.c -diff` in `.gitattributes`.

I'll update the link in whats_new.rst, thanks!

@NelleV, I suspect the performance differences are down to a few things:

  • this is a somewhat tuned implementation of PAVA for Cython - entirely inlined, all type information given, so it's very easy for GCC/Clang to generate efficient code.
  • the performance of the algorithms depends on the given dataset, so if you have different example problems we should benchmark those as well.


@ogrisel
Member
ogrisel commented Mar 7, 2014

My bad, I somehow did not see the C file even though I thought I had clicked through the list of updated files.

@ogrisel
Member
ogrisel commented Mar 7, 2014

+1 for merging.

@ajtulloch
Contributor

I reran some benchmarks with the symmetrically perturbed log1p dataset from the unit tests and a simulated logistic regression dataset; results below.

Does anyone have other suggestions for example problems to test perf on?
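For anyone wanting to try other problems, the two datasets described above could be generated along these lines. This is a sketch: the log1p construction follows the benchmark diff earlier in this thread, while the logistic construction is my guess at what "simulated logistic regression" means, not necessarily what the benchmark script does.

```python
import numpy as np

def perturbed_log1p(n, seed=0):
    """Unit-test-style target: a 50*log1p trend plus symmetric integer noise."""
    rng = np.random.RandomState(seed)
    return 50.0 * np.log1p(np.arange(n)) + rng.randint(-50, 50, size=n)

def simulated_logistic(n, seed=0):
    """Hypothetical logistic dataset: 0/1 draws whose success probability
    increases with x, so the targets are isotonic in expectation."""
    rng = np.random.RandomState(seed)
    x = np.sort(rng.normal(size=n))
    p = 1.0 / (1.0 + np.exp(-x))
    return rng.binomial(1, p).astype(float)
```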

[benchmark plots: perturbed log1p and simulated logistic regression datasets]

@larsmans
Member
larsmans commented Mar 8, 2014

The benchmark script is undocumented and uncommented. It takes three numbers and produces two other numbers, without a hint as to how to interpret them.

@ajtulloch ajtulloch Clean up benchmark tool to be more self-documenting and understandable.

For an example usage, run:

```
python benchmarks/bench_isotonic.py --iterations 10 --log_min_problem_size 2 --log_max_problem_size 8 --dataset logistic
```
6827311
@ajtulloch
Contributor

@larsmans - I've updated the benchmark script with more usage details, a proper parameter parser, and added comments where appropriate. Thanks for your comment.

@coveralls

Coverage Status

Coverage remained the same when pulling a3dba48 on ajtulloch:isotonic-regression-in-linear-time into 4437d56 on scikit-learn:master.

@larsmans
Member
larsmans commented Mar 9, 2014

Ok, merged as 3753563. Thanks!

@larsmans larsmans closed this Mar 9, 2014
@ajtulloch
Contributor

Great, thanks again for your comments.

@ajtulloch ajtulloch referenced this pull request in JuliaLang/METADATA.jl Mar 9, 2014
Merged

Register Isotonic package for isotonic regression #685

@ajtulloch ajtulloch deleted the ajtulloch:isotonic-regression-in-linear-time branch Mar 10, 2014