New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GPU Cholesky Primitive #1059
Conversation
…gs/RELEASE_500/final)
…into gpu_cholesky_prim
…gs/RELEASE_500/final)
One thing to discuss is whether we want to do CPU computation for small input sizes even if STAN_OPENCL is set. |
Yes I see two things to remove the WIP
EDIT: idt we need tests for the square and symmetric. Those are called in the main prim Cholesky decompose so those would really just be testing the square and symmetric tests. |
…stable/2017-11-14)
@seantalts we are still waiting for Rok's tuning tests but I think the code is ready for review. |
I'm getting a |
Hmm, I think I fixed it - for some reason the other Linux Jenkins node
wasn't listed as having MPI.
…On Fri, Nov 9, 2018 at 2:06 PM Steve Bronder ***@***.***> wrote:
I'm getting a ‘ec2-spot-c5d18x’ is offline from Jenkins
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1059 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAxJ7Caj11JrAi9Yrq4jC3Vh4hg7GmXyks5utdIvgaJpZM4YBML0>
.
|
Hey, I messed around with the tests and got them to a point where I now suspect something might actually be ... weird? with the new code in this PR... Any ideas? [edit] Seems like GCC is crashing :( |
Oh that's v odd? Thanks for the heads up I'll take a look at this today. |
woops, is this back in my court now? |
Yep! Ready for review |
cholesky_decompose_test(1300); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seantalts for the tests here should I just do two for each? Otherwise this takes a minute and not sure if all the combinations are needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can reduce some of these. A minute isn't too bad though - the distribution tests take 12 hours, so these aren't really a bottleneck yet :P
What about:
TEST(MathMatrix, cholesky_decompose_big_tuning_opts) {
std::vector<int> size_transfer({256, 512, 1300});
std::vector<int> cholesky_min_size({64, 256});
std::vector<int> cholesky_part({2, 4});
for (auto&& size_t_ : size_transfer) {
for (auto&& min_size_ : cholesky_min_size) {
for (auto&& part_ : cholesky_part) {
stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
= size_t_;
stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
= min_size_;
stan::math::opencl_context.tuning_opts().cholesky_partition = part_;
cholesky_decompose_test(1300);
}
}
stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
= 128;
stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
= 64;
stan::math::opencl_context.tuning_opts().cholesky_partition = 4;
cholesky_decompose_test(128);
cholesky_decompose_test(65);
cholesky_decompose_test(130);
cholesky_decompose_test(128*4);
cholesky_decompose_test(128*4-1);
}
the ones at the bottom are doing more traditional testing where you try to look for edge cases and boundaries that you know the implementation will have to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure! I'm at work but if you can copy/paste the above over the current test now to let it run that would be rad. Else I can do it when I get home from work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed! I also had to fix the runTests.py script - the filtering thing wasn't working right. if you specified it twice as the help suggested, only the latest one would stick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just tweaks to the test and then we're good to go. I can actually just add them in there if they look good to you and you want me to do it.
cholesky_decompose_test(1300); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can reduce some of these. A minute isn't too bad though - the distribution tests take 12 hours, so these aren't really a bottleneck yet :P
What about:
TEST(MathMatrix, cholesky_decompose_big_tuning_opts) {
std::vector<int> size_transfer({256, 512, 1300});
std::vector<int> cholesky_min_size({64, 256});
std::vector<int> cholesky_part({2, 4});
for (auto&& size_t_ : size_transfer) {
for (auto&& min_size_ : cholesky_min_size) {
for (auto&& part_ : cholesky_part) {
stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
= size_t_;
stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
= min_size_;
stan::math::opencl_context.tuning_opts().cholesky_partition = part_;
cholesky_decompose_test(1300);
}
}
stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
= 128;
stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
= 64;
stan::math::opencl_context.tuning_opts().cholesky_partition = 4;
cholesky_decompose_test(128);
cholesky_decompose_test(65);
cholesky_decompose_test(130);
cholesky_decompose_test(128*4);
cholesky_decompose_test(128*4-1);
}
the ones at the bottom are doing more traditional testing where you try to look for edge cases and boundaries that you know the implementation will have to deal with.
3353c7f
to
a3f8881
Compare
Nice! Good to merge? |
🙌 🎉 Great work you two! Looking forward to rev and the really dope research we'll do on auto tuning 😎 😎 😎 |
Thank you Sean for your help and patience :) |
Hey, looks like this didn't pass on develop for some reason: Looks like maybe when we added overloads for cholesky_decompose they were triggered but not compatible with forward mode autodiff. @syclik do we have a policy for this situation? It's only breaking forward mode unit tests with STAN_OPENCL enabled. Should we revert for now or try to roll forward? I could see it going either way. |
Yes. Revert the PR.
Then open a new PR that reverts the reversion which should also have a fix.
develop shouldn't be broken. If it is, it's high priority to get it back to
a state that's not broken.
Do you know how this got through?
…On Wed, Feb 13, 2019 at 1:35 PM seantalts ***@***.***> wrote:
Hey, looks like this didn't pass on develop for some reason:
http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/develop/214/pipeline
Looks like maybe when we added overloads for cholesky_decompose they were
triggered but not compatible with forward mode autodiff. @syclik
<https://github.com/syclik> do we have a policy for this situation? It's
only breaking forward mode unit tests with STAN_OPENCL enabled. Should we
revert for now or try to roll forward? I could see it going either way.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1059 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_FyHmtM2HRLre73NVrbtY58c-6akLks5vNFsFgaJpZM4YBML0>
.
|
Oh yikes! My apologies. Are there tests that happen just for develop that do not happen for branches? I'm really not sure how fvar works so we can talk about how to handle this on the gpu in the stan meeting. My initial thought is just to not have the GPU stuff happen for |
I think a quick fix for this would be to add |
Yes, we don't run the full unit test suite with |
…prim"" This reverts commit 0caabb0.
Reverted develop and pushed a branch with the revert on it (though in the future I think it's probably fine to let contributors revert the revert): feature/issue-1058-gpu-chol-prim |
Got it. Thanks. I think we determined it was worth the risk as long as we
revert on failure, which we did.
…On Wed, Feb 13, 2019 at 3:06 PM seantalts ***@***.***> wrote:
Do you know how this got through?
Yes, we don't run the full unit test suite with STAN_OPENCL defined until
merge to develop, running only the GPU-related tests on PRs. This PR is
overloading cholesky_decompose and I allowed the PR through just on the
basis of prim implementation and tests because when I asked for mixed and
reverse mode, we decided we could do those with the rev PR (as we're all
aiming to keep PRs narrower and more focused). I should have put 2+2
together and realized the overload could interact in non-prim modes with
tests. That said, at least we won't get test failures on other PRs or for
non-GPU folks.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1059 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F81HmKr93C2RRQ3cnPQKhFwuyQUJks5vNG71gaJpZM4YBML0>
.
|
Thanks for pulling this back so quick, apologies again for the goof. Now that I know some of the tests do not go off for the GPU code on jenkins I'll make sure I run the full tests at home before we do a merge |
No worries, I think it was probably on me as the reviewer to think about this interaction. To run all the tests that call cholesky_decompose (which would be a good idea), you can use the following shell command:
though this is slightly less tests than
|
This is maybe more a question for the discourse but what would be the best resource that would help me understand forward mode a bit better? |
Let's go to discourse.
…On Wed, Feb 13, 2019 at 3:27 PM Rok Češnovar ***@***.***> wrote:
This is maybe more a question for the discourse but what would be the best
resource that would help me understand forward mode a bit better?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1059 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F6KjZ3WHdHhz8BuY1PZJyJ-_kA5Pks5vNHUpgaJpZM4YBML0>
.
|
Am I allowed to use C++17? This is actually a nice use case for |
One thing we can do to fix this super simply is just rename the function to avoid overloading |
Can't you accomplish the same using |
^yeah I'm going to use |
Summary
Adds the gpu cholesky decomposition primitive and it's associated kernel. This also includes two small bug fixes for the inverse cholesky.
When users define
STAN_OPENCL
during compilation the cpu based cholesky will be replaced by the gpu version of the cholesky.So users code will stay the same, but will then have access to the gpu routines. This was decided in favor of including a
cholesky_decompose_gpu
method since if users want fine grain decision making we can always introduce it later, but it will be harder to remove it.The Cholesky's gpu algorithm is recursive, where The parameters
block
,divider
, andmin_block
act as tuning parameters for the recursive step of the GPU based Cholesky decompostion. The matrix is subset by theblock
size, and if theblock
size is less thanmin_block
then the Cholesky decomposition on the GPU is computed using that submatrix. Ifblock
is greater thanblock_size
thencholesky_decompose
is run again withblock
equal toblock/divider
. Once the Cholesky decomposition is computed, the full Cholesky is created by propagating the Cholesky forward as given in the reference report below.https://github.com/SteveBronder/stancon2018/blob/master/report.pdf
Tests
Assuming OpenCL is setup and the preferred device has an index of 0, tests can be run with
cholesky_decompose_cpu_vs_gpu
cholesky_decompose_small
cholesky_decompose_test
tester on small input matricescholesky_decompose_big
cholesky_decompose_test
tester on input matrices of size 500, 1000, 2000The
cholesky_decompose_test
tester checks whether the GPU implementation's max error for any cell in the cholesky decomposition relative to Eigen's LLT method is less than 1E-8.Side Effects
None
Checklist
Math issue Add GPU Cholesky Decomposition #1058
Copyright holder: (fill in copyright holder information)
Rok Češnovar and Erik Štrumbelj (Faculty of Computer and Information Science, University of Ljubljana)
Steve Bronder
the basic tests are passing
./runTests.py test/unit
)make test-headers
)make doxygen
)make cpplint
)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested