
[ENH] Improve performance of isotonic regression #2944

Closed · ajtulloch wants to merge 10 commits into scikit-learn:master from ajtulloch:isotonic-regression-in-linear-time

Conversation (6 participants)
@ajtulloch
Contributor

ajtulloch commented Mar 7, 2014

This supersedes #2940.

PAVA runs in linear time, but the existing algorithm was scaling
approximately quadratically, due to the array pop in the inner loop
being O(N).

I implemented the O(N) version of PAVA using the decreasing-subsequences trick, which is significantly faster in benchmarks.
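
For readers unfamiliar with PAVA, here is a minimal pure-Python sketch of the pooling idea behind the linear-time algorithm; it is not the Cython code in this PR, and the function name and weight handling are only illustrative:

```python
import numpy as np

def pava(y, w=None):
    """Weighted least-squares projection of y onto non-decreasing sequences.

    Amortized O(n): each element is merged into a pooled block at most once.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    means, weights, sizes = [], [], []   # one entry per pooled block
    for i in range(n):
        means.append(y[i]); weights.append(w[i]); sizes.append(1)
        # Pool adjacent blocks while they violate the monotonicity constraint.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, s1 = means.pop(), weights.pop(), sizes.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt)
            sizes.append(s1 + s2)
    # Expand the block means back out to a full-length solution.
    return np.repeat(means, sizes)
```

For example, pava([3, 1, 2, 4]) returns [2., 2., 2., 4.]: the violating prefix 3, 1 is pooled to its mean.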

The benchmarks in this diff (adapted from one of the existing unit tests) show significant performance improvements:

| Problem size | Relative speedup of PAVA |
|-------------:|-------------------------:|
| 10           | 2x                       |
| 100          | 3x                       |
| 1,000        | 14x                      |
| 10,000       | 57x                      |
| 100,000      | 561x                     |
| 1,000,000    | 4645x                    |

[benchmark plot: relative speedup vs. problem size]

On correctness: the unit tests cover the isotonic regression code fairly well, and all pass before and after the change. It's a fairly well-known algorithm with a number of existing implementations, so I think this is correct. In coding up this algorithm I made some mistakes and the unit tests caught the failures, which makes me more confident in the correctness now. Still, the performance improvements seem suspiciously large.

ENH - Improve performance of isotonic regression

@coveralls

coveralls commented Mar 7, 2014

Coverage remained the same when pulling 8e25dbc on ajtulloch:isotonic-regression-in-linear-time into 4437d56 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Mar 7, 2014

Thanks very much for the fix @ajtulloch. It looks good to me. One cosmetic note: please replace k+1 with k + 1 to be consistent with the project style. Also please add an entry to the doc/whats_new.rst changelog.

@fabianp @NelleV you might be interested in having a look at this.

benchmarks/bench_isotonic.py
timing = (n, np.mean(times))
timings.append(timing)
xs, ys = zip(*timings)
print json.dumps(timings)

@ogrisel
ogrisel Mar 7, 2014
Member

Please make it work under Python 3 by adding parens to print statements.
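
For illustration, a minimal Python 2/3-compatible version of that print line; the timings value here is a made-up example, not the benchmark's actual data:

```python
from __future__ import print_function  # no-op on Python 3, enables print() on Python 2

import json

timings = [(10, 0.001), (100, 0.012)]  # hypothetical (problem_size, seconds) pairs
print(json.dumps(timings))
```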

@ogrisel
ogrisel Mar 7, 2014

There are still some k+1's lying around, such as this one :)

@fabianp
Member

fabianp commented Mar 7, 2014

I didn't run the code but it looks awesome :-)

@ajtulloch
Contributor

ajtulloch commented Mar 7, 2014

To replicate/test this yourself, run:

cython sklearn/_isotonic.pyx
make inplace
nosetests sklearn.tests.test_isotonic
python benchmarks/bench_isotonic.py 2 8 1 | gnuplot -e "set terminal dumb; set logscale xy; plot '<cat' title 'performance'"
@NelleV
Member

NelleV commented Mar 7, 2014

It's funny, because I originally implemented PAVA, and it was slower than @fabianp's active set implementation. Can you show us how you generated the benchmarks?

@NelleV
Member

NelleV commented Mar 7, 2014

My mistake: the benchmarks are in the PR…

@ogrisel
Member

ogrisel commented Mar 7, 2014

@ajtulloch could you please commit the result of the cythonization of the edited pyx file in this PR? Please use the latest stable version of cython (run pip install -U cython to be sure).

@@ -154,6 +154,10 @@ Changelog
:func:`feature_selection.f_regression` when variates are not centered.
By `VirgileFritsch`_.
- Significant performance improvements (more than 100x speedup for
large problems) in :class:`isotonic.IsotonicRegression` by
`Andrew Tulloch`_.

@ogrisel
ogrisel Mar 7, 2014
Member

If you want to link to your home page you must give the URL at the end of the file. Otherwise, just remove the link markup.

@ajtulloch
Contributor

ajtulloch commented Mar 7, 2014

Hi @ogrisel,

The updated sklearn/_isotonic.c is attached (generated with Cython 0.20.1); see https://github.com/ajtulloch/scikit-learn/blob/isotonic-regression-in-linear-time/sklearn/_isotonic.c. GitHub doesn't show it in the diff due to the line /sklearn/_isotonic.c -diff in .gitattributes.

I'll update the link in whats_new.rst, thanks!

@NelleV, I suspect the performance differences come down to a few things:

  • this is a somewhat tuned implementation of PAVA for Cython - entirely inlined, with all type information given, so it's easy for GCC/Clang to generate efficient code.
  • the performance of the algorithms depends on the given dataset, so if you have different example problems we should benchmark those as well.

[screenshot, 2014-03-07]

@ogrisel
Member

ogrisel commented Mar 7, 2014

My bad, I somehow did not see the C file even though I thought I had clicked on the list of updated files.

@ogrisel
Member

ogrisel commented Mar 7, 2014

+1 for merging.

@ajtulloch
Contributor

ajtulloch commented Mar 7, 2014

I reran some benchmarks - with the symmetrically perturbed log1p dataset from the unit tests, and a simulated logistic regression dataset - results below.

Does anyone have other suggestions for example problems to test performance on?

[benchmark plots: perturbed log1p and simulated logistic regression datasets]
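
For context, a rough sketch of how benchmark inputs like these could be generated; the function names, noise scale, and parameterization are assumptions for illustration, not the exact code in benchmarks/bench_isotonic.py:

```python
import numpy as np

rng = np.random.RandomState(0)

def perturbed_log1p(n):
    # A monotone log1p curve with symmetric uniform noise, similar in spirit
    # to the perturbed-logarithm dataset used in the unit tests.
    x = np.arange(n)
    return np.log1p(x) + rng.uniform(-1.0, 1.0, size=n)

def simulated_logistic(n):
    # Bernoulli draws whose success probability increases with x: a simple
    # stand-in for a "simulated logistic regression" problem.
    x = np.linspace(-5.0, 5.0, n)
    p = 1.0 / (1.0 + np.exp(-x))
    return rng.binomial(1, p).astype(float)
```

Either sequence can then be passed to sklearn.isotonic.isotonic_regression and timed across problem sizes.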

@larsmans
Member

larsmans commented Mar 8, 2014

The benchmark script is undocumented and uncommented. It takes three numbers and produces two other numbers, without a hint as to how to interpret them.

Clean up the benchmark tool to be more self-documenting and understandable.

For an example usage, run:

```
python benchmarks/bench_isotonic.py --iterations 10 --log_min_problem_size 2 --log_max_problem_size 8 --dataset logistic
```
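
A minimal sketch of the kind of argparse interface the usage line above implies; the defaults, help strings, and the extra dataset choice are assumptions, only the flag names are taken from the example:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Benchmark isotonic regression across a range of problem "
                "sizes and print timing results.")
parser.add_argument("--iterations", type=int, default=10,
                    help="number of repetitions per problem size")
parser.add_argument("--log_min_problem_size", type=int, default=2,
                    help="base-10 log of the smallest problem size")
parser.add_argument("--log_max_problem_size", type=int, default=8,
                    help="base-10 log of the largest problem size")
parser.add_argument("--dataset", default="logistic",
                    choices=["logistic", "perturbed_logarithm"],
                    help="which synthetic dataset to benchmark on")
args = parser.parse_args()
```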
@ajtulloch
Contributor

ajtulloch commented Mar 8, 2014

@larsmans - I've updated the benchmark script with more usage details and a proper parameter parser, and added comments where appropriate. Thanks for your comment.

@coveralls

coveralls commented Mar 8, 2014

Coverage remained the same when pulling a3dba48 on ajtulloch:isotonic-regression-in-linear-time into 4437d56 on scikit-learn:master.

@larsmans
Member

larsmans commented Mar 9, 2014

Ok, merged as 3753563. Thanks!

@larsmans larsmans closed this Mar 9, 2014

@ajtulloch
Contributor

ajtulloch commented Mar 9, 2014

Great, thanks again for your comments.

@ajtulloch ajtulloch deleted the ajtulloch:isotonic-regression-in-linear-time branch Mar 10, 2014
