Add GPU Cholesky Primitive #1059

SteveBronder · 2018-10-30T05:06:27Z

Summary

Adds the GPU Cholesky decomposition primitive and its associated kernel. This also includes two small bug fixes for the inverse Cholesky.

When users define STAN_OPENCL during compilation, the CPU-based Cholesky is replaced by the GPU version.

m = cholesky_decompose(x);

So users' code stays the same but gains access to the GPU routines. We chose this over adding a separate cholesky_decompose_gpu method: if users want fine-grained control we can always introduce it later, whereas removing it once it exists would be harder.
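As a rough illustration of the compile-time switch described above, here is a minimal sketch. All names ending in _sketch are hypothetical and are not the Stan Math API; the GPU branch is only a placeholder that reuses the CPU routine, where a real build would transfer the matrix to the device and launch OpenCL kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix type used only for this sketch.
using Matrix = std::vector<std::vector<double>>;

// CPU reference: unblocked lower-triangular Cholesky factor of a
// symmetric positive definite matrix.
Matrix cholesky_cpu_sketch(const Matrix& a) {
  const std::size_t n = a.size();
  Matrix l(n, std::vector<double>(n, 0.0));
  for (std::size_t j = 0; j < n; ++j) {
    assert(a[j].size() == n);  // the input must be square
    double diag = a[j][j];
    for (std::size_t k = 0; k < j; ++k) diag -= l[j][k] * l[j][k];
    l[j][j] = std::sqrt(diag);
    for (std::size_t i = j + 1; i < n; ++i) {
      double s = a[i][j];
      for (std::size_t k = 0; k < j; ++k) s -= l[i][k] * l[j][k];
      l[i][j] = s / l[j][j];
    }
  }
  return l;
}

#ifdef STAN_OPENCL
// Placeholder only (assumption): stands in for the OpenCL implementation.
Matrix cholesky_gpu_sketch(const Matrix& a) { return cholesky_cpu_sketch(a); }
#endif

// The caller always uses the same name; the macro picks the backend.
Matrix cholesky_decompose_sketch(const Matrix& a) {
#ifdef STAN_OPENCL
  return cholesky_gpu_sketch(a);
#else
  return cholesky_cpu_sketch(a);
#endif
}
```

Calling cholesky_decompose_sketch({{4.0, 2.0}, {2.0, 3.0}}) goes through the same call site whether or not STAN_OPENCL is defined, which is the property the design above is after.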

The GPU Cholesky algorithm is recursive. The parameters block, divider, and min_block act as tuning parameters for the recursive step of the GPU-based Cholesky decomposition. The matrix is subset by the block size, and if the block size is less than min_block then the Cholesky decomposition on the GPU is computed using that submatrix. If block is greater than min_block then cholesky_decompose is run again with block equal to block/divider. Once the Cholesky decomposition of the submatrix is computed, the full Cholesky is created by propagating it forward as given in the reference report below.

https://github.com/SteveBronder/stancon2018/blob/master/report.pdf
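To make the recursion easier to picture, here is a CPU-only sketch of a recursive blocked Cholesky that uses block, divider, and min_block in roughly the way described above. It is an assumption-laden illustration, not the kernel code in this PR: the function names are hypothetical, the matrix is a flat row-major array, and the exact rule for shrinking the block is my own simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Entry (i, j) of the sub-block that starts at (off, off) in a row-major
// matrix with leading dimension ld.
static double& at(std::vector<double>& a, int ld, int off, int i, int j) {
  return a[(off + i) * ld + (off + j)];
}

// Plain in-place Cholesky of the n x n diagonal block starting at off.
static void chol_unblocked(std::vector<double>& a, int ld, int off, int n) {
  for (int j = 0; j < n; ++j) {
    double d = at(a, ld, off, j, j);
    for (int k = 0; k < j; ++k) d -= at(a, ld, off, j, k) * at(a, ld, off, j, k);
    at(a, ld, off, j, j) = std::sqrt(d);
    for (int i = j + 1; i < n; ++i) {
      double s = at(a, ld, off, i, j);
      for (int k = 0; k < j; ++k) s -= at(a, ld, off, i, k) * at(a, ld, off, j, k);
      at(a, ld, off, i, j) = s / at(a, ld, off, j, j);
    }
  }
}

static void chol_recursive(std::vector<double>& a, int ld, int off, int n,
                           int block, int divider, int min_block) {
  if (n <= min_block) {  // small enough: factor directly
    chol_unblocked(a, ld, off, n);
    return;
  }
  const int b = std::max(1, std::min(block, n / divider));
  // 1. Factor the leading b x b block, recursing with a smaller block size.
  chol_recursive(a, ld, off, b, std::max(1, b / divider), divider, min_block);
  // 2. Propagate forward: L21 = A21 * inv(L11)^T (triangular solve).
  for (int i = b; i < n; ++i) {
    for (int j = 0; j < b; ++j) {
      double s = at(a, ld, off, i, j);
      for (int k = 0; k < j; ++k) s -= at(a, ld, off, i, k) * at(a, ld, off, j, k);
      at(a, ld, off, i, j) = s / at(a, ld, off, j, j);
    }
  }
  // 3. Update the trailing block: A22 -= L21 * L21^T (lower triangle only).
  for (int i = b; i < n; ++i)
    for (int j = b; j <= i; ++j)
      for (int k = 0; k < b; ++k)
        at(a, ld, off, i, j) -= at(a, ld, off, i, k) * at(a, ld, off, j, k);
  // 4. Factor the updated trailing block.
  chol_recursive(a, ld, off + b, n - b, block, divider, min_block);
}

// Returns the lower-triangular factor of the n x n row-major matrix a.
std::vector<double> cholesky_blocked_sketch(std::vector<double> a, int n,
                                            int block, int divider,
                                            int min_block) {
  assert(static_cast<int>(a.size()) == n * n);
  assert(block >= 1 && divider >= 2 && min_block >= 1);
  chol_recursive(a, n, 0, n, block, divider, min_block);
  for (int i = 0; i < n; ++i)
    for (int j = i + 1; j < n; ++j) a[i * n + j] = 0.0;  // clear upper part
  return a;
}
```

Setting min_block at or above the matrix size makes the sketch fall back to the unblocked routine, which is a handy way to check that the blocked path gives the same factor.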

Tests

Assuming OpenCL is set up and the preferred device has an index of 0, tests can be run with

echo STAN_OPENCL=true >> make/local
echo OPENCL_PLATFORM_ID=0 >> make/local
echo OPENCL_DEVICE_ID=0 >> make/local
./runTests.py ./test/unit/math/gpu/cholesky_decompose_test.cpp

cholesky_decompose_cpu_vs_gpu

Tests whether cpu and gpu outputs are close to each other

cholesky_decompose_small

Runs the cholesky_decompose_test tester on small input matrices

cholesky_decompose_big

Runs the cholesky_decompose_test tester on input matrices of size 500, 1000, 2000

The cholesky_decompose_test tester checks whether the GPU implementation's max error for any cell in the cholesky decomposition relative to Eigen's LLT method is less than 1E-8.
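The check described above boils down to a maximum elementwise error compared against a tolerance. A minimal sketch, assuming both factors are stored as flat arrays of equal length and that the error is absolute; the helper names are hypothetical and the real tester compares against Eigen's LLT output rather than a hand-supplied reference:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between corresponding cells.
double max_abs_error(const std::vector<double>& result,
                     const std::vector<double>& reference) {
  assert(result.size() == reference.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < result.size(); ++i) {
    worst = std::max(worst, std::fabs(result[i] - reference[i]));
  }
  return worst;
}

// True when every cell is within tol of the reference (1e-8 as in the PR).
bool close_to_reference(const std::vector<double>& result,
                        const std::vector<double>& reference,
                        double tol = 1e-8) {
  return max_abs_error(result, reference) < tol;
}
```

Using the maximum over all cells rather than an average means a single badly wrong entry fails the test.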

Side Effects

None

Checklist

Math issue Add GPU Cholesky Decomposition #1058
Copyright holder: (fill in copyright holder information)
Rok Češnovar and Erik Štrumbelj (Faculty of Computer and Information Science, University of Ljubljana)
Steve Bronder
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

rok-cesnovar · 2018-10-30T09:04:43Z

One thing to discuss is whether we want to do CPU computation for small input sizes even if STAN_OPENCL is set.

test/unit/math/gpu/cholesky_decompose_test.cpp

SteveBronder · 2018-10-30T16:38:31Z

One thing to discuss is whether we want to do CPU computation for small input sizes even if STAN_OPENCL is set.

Yes I see two things to remove the WIP

Run performance tests to see at what size this should be moved over to the GPU

EDIT: I don't think we need tests for the square and symmetric checks. Those are called in the main prim Cholesky decompose, so those would really just be testing the square and symmetric tests.

SteveBronder · 2018-11-09T04:11:46Z

@seantalts we are still waiting for Rok's tuning tests but I think the code is ready for review.

SteveBronder · 2018-11-09T19:06:22Z

I'm getting an "‘ec2-spot-c5d18x’ is offline" error from Jenkins

seantalts · 2018-11-09T19:18:16Z

Hmm, I think I fixed it - for some reason the other Linux Jenkins node wasn't listed as having MPI.

seantalts · 2018-11-10T15:11:50Z

Hey, I messed around with the tests and got them to a point where I now suspect something might actually be ... weird? with the new code in this PR... Any ideas?

[edit] Seems like GCC is crashing :(

SteveBronder · 2018-11-10T15:41:16Z

Oh that's v odd? Thanks for the heads up I'll take a look at this today.

seantalts · 2019-02-11T17:26:28Z

woops, is this back in my court now?

SteveBronder · 2019-02-11T17:34:28Z

Yep! Ready for review

SteveBronder · 2019-02-11T17:35:45Z

test/unit/math/gpu/cholesky_decompose_test.cpp

+        cholesky_decompose_test(1300);
+      }
+    }
+  }


@seantalts for the tests here should I just do two for each? Otherwise this takes a minute and not sure if all the combinations are needed

Yeah, we can reduce some of these. A minute isn't too bad though - the distribution tests take 12 hours, so these aren't really a bottleneck yet :P
What about:

TEST(MathMatrix, cholesky_decompose_big_tuning_opts) {
  std::vector<int> size_transfer({256, 512, 1300});
  std::vector<int> cholesky_min_size({64, 256});
  std::vector<int> cholesky_part({2, 4});
  // Sweep the tuning options on one large matrix.
  for (auto&& size_t_ : size_transfer) {
    for (auto&& min_size_ : cholesky_min_size) {
      for (auto&& part_ : cholesky_part) {
        stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer = size_t_;
        stan::math::opencl_context.tuning_opts().cholesky_min_L11_size = min_size_;
        stan::math::opencl_context.tuning_opts().cholesky_partition = part_;
        cholesky_decompose_test(1300);
      }
    }
  }
  // Fixed options, with sizes chosen around the block boundaries.
  stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer = 128;
  stan::math::opencl_context.tuning_opts().cholesky_min_L11_size = 64;
  stan::math::opencl_context.tuning_opts().cholesky_partition = 4;
  cholesky_decompose_test(128);
  cholesky_decompose_test(65);
  cholesky_decompose_test(130);
  cholesky_decompose_test(128 * 4);
  cholesky_decompose_test(128 * 4 - 1);
}

The ones at the bottom are doing more traditional testing where you try to look for edge cases and boundaries that you know the implementation will have to deal with.

Yeah sure! I'm at work but if you can copy/paste the above over the current test now to let it run that would be rad. Else I can do it when I get home from work

Done and pushed! I also had to fix the runTests.py script - the filtering thing wasn't working right. if you specified it twice as the help suggested, only the latest one would stick.

seantalts

Looks good! Just tweaks to the test and then we're good to go. I can actually just add them in there if they look good to you and you want me to do it.

seantalts · 2019-02-11T18:08:46Z

test/unit/math/gpu/cholesky_decompose_test.cpp

SteveBronder · 2019-02-12T03:07:42Z

Nice! Good to merge?

seantalts · 2019-02-12T12:58:01Z

🙌 🎉 Great work you two! Looking forward to rev and the really dope research we'll do on auto tuning 😎 😎 😎

rok-cesnovar · 2019-02-12T13:50:01Z

Thank you Sean for your help and patience :)

seantalts · 2019-02-13T18:35:48Z

Hey, looks like this didn't pass on develop for some reason:
http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/develop/214/pipeline

Looks like maybe when we added overloads for cholesky_decompose they were triggered but not compatible with forward mode autodiff. @syclik do we have a policy for this situation? It's only breaking forward mode unit tests with STAN_OPENCL enabled. Should we revert for now or try to roll forward? I could see it going either way.

syclik · 2019-02-13T19:24:40Z

Yes. Revert the PR. Then open a new PR that reverts the reversion which should also have a fix. develop shouldn't be broken. If it is, it's high priority to get it back to a state that's not broken. Do you know how this got through?

SteveBronder · 2019-02-13T19:28:48Z

Oh yikes! My apologies. Are there tests that happen just for develop that do not happen for branches?

I'm really not sure how fvar works, so we can talk about how to handle this on the GPU in the Stan meeting. My initial thought is just to not have the GPU stuff happen for fvar. But that's only because I don't know a lot about forward mode

SteveBronder · 2019-02-13T19:35:41Z

I think a quick fix for this would be to add & std::is_same<T, double>::value to the if statement in cholesky so it only goes off for type double. Though if there's a reasonable fix to make this work for fvar that would be cool

seantalts · 2019-02-13T20:00:52Z

Do you know how this got through?

Yes, we don't run the full unit test suite with STAN_OPENCL defined until merge to develop, running only the GPU-related tests on PRs. This PR is overloading cholesky_decompose and I allowed the PR through just on the basis of prim implementation and tests because when I asked for mixed and reverse mode, we decided we could do those with the rev PR (as we're all aiming to keep PRs narrower and more focused). I should have put 2+2 together and realized the overload could interact in non-prim modes with tests. That said, at least we won't get test failures on other PRs or for non-GPU folks.

Had some tests failures on develop. This reverts commit a933f65, reversing changes made to 70edefd.

…prim"" This reverts commit 0caabb0.

seantalts · 2019-02-13T20:03:53Z

Reverted develop and pushed a branch with the revert on it (though in the future I think it's probably fine to let contributors revert the revert): feature/issue-1058-gpu-chol-prim

syclik · 2019-02-13T20:07:39Z

Got it. Thanks. I think we determined it was worth the risk as long as we revert on failure, which we did.

SteveBronder · 2019-02-13T20:14:28Z

Thanks for pulling this back so quickly, apologies again for the goof. Now that I know some of the tests do not go off for the GPU code on Jenkins, I'll make sure I run the full tests at home before we do a merge

seantalts · 2019-02-13T20:19:48Z

No worries, I think it was probably on me as the reviewer to think about this interaction.

To run all the tests that call cholesky_decompose (which would be a good idea), you can use the following shell command:

find test -name '*_test.cpp' | xargs grep -l cholesky_decompose | xargs ./runTests.py -j2

though this runs slightly fewer tests than

./runTests.py -j2 test/unit -f chol

rok-cesnovar · 2019-02-13T20:27:20Z

This is maybe more a question for the discourse but what would be the best resource that would help me understand forward mode a bit better?

syclik · 2019-02-13T20:28:17Z

Let's go to discourse.

SteveBronder · 2019-02-14T05:03:27Z

Am I allowed to use C++17? This is actually a nice use case for if constexpr. If not then I think I have to have two cholesky_decompose functions. One that works for non-double types and one that works for double types

seantalts · 2019-02-14T16:05:01Z

Nope

seantalts · 2019-02-15T13:42:35Z

One thing we can do to fix this super simply is just rename the function to avoid overloading cholesky_decompose until the rev (and fwd? or something blocking fwd) mode versions go in.

syclik · 2019-02-15T15:09:51Z

Am I allowed to use C++17? This is actually a nice use case for if constexpr. If not then I think I have to have two cholesky_decompose functions. One that works for non-double types and one that works for double types

Can't you accomplish the same using std::enable_if? Doc: cppreference

SteveBronder · 2019-02-15T15:22:50Z

^yeah I'm going to use enable_if. Apologies for the delay I'll have this up on the weekend
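For reference, the std::enable_if approach can be sketched in C++11 roughly as follows. This is only an illustration of the dispatch idea: cholesky_backend and fvar_like are hypothetical names, and the functions just report which path would be taken instead of computing anything.

```cpp
#include <string>
#include <type_traits>

// Selected only when T is exactly double: the path that may use the GPU.
template <typename T>
typename std::enable_if<std::is_same<T, double>::value, std::string>::type
cholesky_backend(const T&) {
  return "gpu";
}

// Selected for every other scalar type (for example autodiff types):
// the existing CPU path.
template <typename T>
typename std::enable_if<!std::is_same<T, double>::value, std::string>::type
cholesky_backend(const T&) {
  return "cpu";
}

// Hypothetical stand-in for a forward-mode autodiff scalar.
struct fvar_like {
  double val_;
  double d_;
};
```

Because the two conditions are mutually exclusive, exactly one overload is viable for any T, so cholesky_backend(1.0) picks the first and cholesky_backend(fvar_like{1.0, 0.0}) picks the second. Unlike a runtime if on std::is_same, the GPU-only code is never instantiated for non-double types.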

rok-cesnovar and others added 14 commits October 28, 2018 09:33

revised cholesky prim

12e5d8e

added comments & minor stuff

ceee137

inverse fixes and added function to /prim

7528d50

removed files

8284c14

now passing

834ff4a

Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim

4c87e21

Adds to docs, cleans up some code, use auto and const where possible

27beacc

Fixes docs and changes name of test against cpu for fixed matrix

87a67df

[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…

be110d2

…gs/RELEASE_500/final)

forgot to include algorithm, we need to fix that lint check

287ae57

Merge branch 'gpu_cholesky_prim' of https://github.com/bstatcomp/math …

898ebca

…into gpu_cholesky_prim

include algorithm again

4fdb965

remove auto on return type

36bea8b

[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…

c126ceb

…gs/RELEASE_500/final)

rok-cesnovar reviewed Oct 30, 2018

test/unit/math/gpu/cholesky_decompose_test.cpp

SteveBronder added feature gpu labels Oct 30, 2018

SteveBronder assigned SteveBronder and rok-cesnovar Oct 30, 2018

move check for square and symmetric to top of cholesky decompose prim

6c0daff

SteveBronder changed the title ~~[WIP] gpu Cholesky primitive~~ Add GPU Cholesky Primitive Nov 9, 2018

SteveBronder requested a review from seantalts November 9, 2018 04:10

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…

2d00cdf

…stable/2017-11-14)

SteveBronder commented Feb 11, 2019

seantalts suggested changes Feb 11, 2019

Reduce and broaden chol test

a3f8881

seantalts force-pushed the gpu_cholesky_prim branch from 3353c7f to a3f8881 Compare February 11, 2019 18:33

Fix runTests.py argument parsing

96118c8

seantalts approved these changes Feb 12, 2019

seantalts merged commit a933f65 into stan-dev:develop Feb 12, 2019

seantalts added a commit that referenced this pull request Feb 13, 2019

Revert "Merge pull request #1059 from bstatcomp/gpu_cholesky_prim"

0caabb0

Had some tests failures on develop. This reverts commit a933f65, reversing changes made to 70edefd.

seantalts added a commit that referenced this pull request Feb 13, 2019

Revert "Revert "Merge pull request #1059 from bstatcomp/gpu_cholesky_…

45851f6

…prim"" This reverts commit 0caabb0.
