Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU Cholesky Primitive #1059

Merged
merged 77 commits into from Feb 12, 2019
Merged

Conversation

SteveBronder
Copy link
Collaborator

Summary

Adds the gpu cholesky decomposition primitive and it's associated kernel. This also includes two small bug fixes for the inverse cholesky.

When users define STAN_OPENCL during compilation the cpu based cholesky will be replaced by the gpu version of the cholesky.

m = cholesky_decompose(x);

So users code will stay the same, but will then have access to the gpu routines. This was decided in favor of including a cholesky_decompose_gpu method since if users want fine grain decision making we can always introduce it later, but it will be harder to remove it.

The Cholesky's gpu algorithm is recursive, where The parameters block, divider, and min_block act as tuning parameters for the recursive step of the GPU based Cholesky decompostion. The matrix is subset by the block size, and if the block size is less than min_block then the Cholesky decomposition on the GPU is computed using that submatrix. If block is greater than block_size then cholesky_decompose is run again with block equal to block/divider. Once the Cholesky decomposition is computed, the full Cholesky is created by propagating the Cholesky forward as given in the reference report below.

https://github.com/SteveBronder/stancon2018/blob/master/report.pdf

Tests

Assuming OpenCL is setup and the preferred device has an index of 0, tests can be run with

echo STAN_OPENCL=true>> make/local
echo OPENCL_PLATFORM_ID=0>> make/local
echo OPENCL_DEVICE_ID=0>> make/local
./runTests.py ./test/unit/math/gpu/cholesky_decompose_test.cpp 

cholesky_decompose_cpu_vs_gpu

  • Tests whether cpu and gpu outputs are close to each other

cholesky_decompose_small

  • Runs the cholesky_decompose_test tester on small input matrices

cholesky_decompose_big

  • Runs the cholesky_decompose_test tester on input matrices of size 500, 1000, 2000

The cholesky_decompose_test tester checks whether the GPU implementation's max error for any cell in the cholesky decomposition relative to Eigen's LLT method is less than 1E-8.

Side Effects

None

Checklist

  • Math issue Add GPU Cholesky Decomposition #1058

  • Copyright holder: (fill in copyright holder information)

  • Rok Češnovar and Erik Štrumbelj (Faculty of Computer and Information Science, University of Ljubljana)

  • Steve Bronder

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@rok-cesnovar
Copy link
Member

One thing to discuss is whether we want to do CPU computation for small input sizes even if STAN_OPENCL is set.

@SteveBronder
Copy link
Collaborator Author

SteveBronder commented Oct 30, 2018

One thing to discuss is whether we want to do CPU computation for small input sizes even if STAN_OPENCL is set.

Yes I see two things to remove the WIP

  • Run performance tests to see at what size this should be moved over to the GPU

EDIT: idt we need tests for the square and symmetric. Those are called in the main prim Cholesky decompose so those would really just be testing the square and symmetric tests.

@SteveBronder SteveBronder changed the title [WIP] gpu Cholesky primitive Add GPU Cholesky Primitive Nov 9, 2018
@SteveBronder
Copy link
Collaborator Author

@seantalts we are still waiting for Rok's tuning tests but I think the code is ready for review.

@SteveBronder
Copy link
Collaborator Author

I'm getting a ‘ec2-spot-c5d18x’ is offline from Jenkins

@seantalts
Copy link
Member

seantalts commented Nov 9, 2018 via email

@seantalts
Copy link
Member

seantalts commented Nov 10, 2018

Hey, I messed around with the tests and got them to a point where I now suspect something might actually be ... weird? with the new code in this PR... Any ideas?

[edit] Seems like GCC is crashing :(

@SteveBronder
Copy link
Collaborator Author

Oh that's v odd? Thanks for the heads up I'll take a look at this today.

@seantalts
Copy link
Member

woops, is this back in my court now?

@SteveBronder
Copy link
Collaborator Author

Yep! Ready for review

cholesky_decompose_test(1300);
}
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seantalts for the tests here should I just do two for each? Otherwise this takes a minute and not sure if all the combinations are needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can reduce some of these. A minute isn't too bad though - the distribution tests take 12 hours, so these aren't really a bottleneck yet :P
What about:

TEST(MathMatrix, cholesky_decompose_big_tuning_opts) {
  std::vector<int> size_transfer({256, 512, 1300});
  std::vector<int> cholesky_min_size({64, 256});
  std::vector<int> cholesky_part({2, 4});
  for (auto&& size_t_ : size_transfer) {
    for (auto&& min_size_ : cholesky_min_size) {
      for (auto&& part_ : cholesky_part) {
        stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
            = size_t_;
        stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
            = min_size_;
        stan::math::opencl_context.tuning_opts().cholesky_partition = part_;
        cholesky_decompose_test(1300);
      }
    }
        stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
            = 128;
        stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
            = 64;
        stan::math::opencl_context.tuning_opts().cholesky_partition = 4;
        cholesky_decompose_test(128);
        cholesky_decompose_test(65);
        cholesky_decompose_test(130);
        cholesky_decompose_test(128*4);
        cholesky_decompose_test(128*4-1);
}
  

the ones at the bottom are doing more traditional testing where you try to look for edge cases and boundaries that you know the implementation will have to deal with.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure! I'm at work but if you can copy/paste the above over the current test now to let it run that would be rad. Else I can do it when I get home from work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed! I also had to fix the runTests.py script - the filtering thing wasn't working right. if you specified it twice as the help suggested, only the latest one would stick.

Copy link
Member

@seantalts seantalts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just tweaks to the test and then we're good to go. I can actually just add them in there if they look good to you and you want me to do it.

cholesky_decompose_test(1300);
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can reduce some of these. A minute isn't too bad though - the distribution tests take 12 hours, so these aren't really a bottleneck yet :P
What about:

TEST(MathMatrix, cholesky_decompose_big_tuning_opts) {
  std::vector<int> size_transfer({256, 512, 1300});
  std::vector<int> cholesky_min_size({64, 256});
  std::vector<int> cholesky_part({2, 4});
  for (auto&& size_t_ : size_transfer) {
    for (auto&& min_size_ : cholesky_min_size) {
      for (auto&& part_ : cholesky_part) {
        stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
            = size_t_;
        stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
            = min_size_;
        stan::math::opencl_context.tuning_opts().cholesky_partition = part_;
        cholesky_decompose_test(1300);
      }
    }
        stan::math::opencl_context.tuning_opts().cholesky_size_worth_transfer
            = 128;
        stan::math::opencl_context.tuning_opts().cholesky_min_L11_size
            = 64;
        stan::math::opencl_context.tuning_opts().cholesky_partition = 4;
        cholesky_decompose_test(128);
        cholesky_decompose_test(65);
        cholesky_decompose_test(130);
        cholesky_decompose_test(128*4);
        cholesky_decompose_test(128*4-1);
}
  

the ones at the bottom are doing more traditional testing where you try to look for edge cases and boundaries that you know the implementation will have to deal with.

@SteveBronder
Copy link
Collaborator Author

Nice! Good to merge?

@seantalts seantalts merged commit a933f65 into stan-dev:develop Feb 12, 2019
@seantalts
Copy link
Member

🙌 🎉 Great work you two! Looking forward to rev and the really dope research we'll do on auto tuning 😎 😎 😎

@rok-cesnovar
Copy link
Member

Thank you Sean for your help and patience :)

@seantalts
Copy link
Member

Hey, looks like this didn't pass on develop for some reason:
http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/develop/214/pipeline

Looks like maybe when we added overloads for cholesky_decompose they were triggered but not compatible with forward mode autodiff. @syclik do we have a policy for this situation? It's only breaking forward mode unit tests with STAN_OPENCL enabled. Should we revert for now or try to roll forward? I could see it going either way.

@syclik
Copy link
Member

syclik commented Feb 13, 2019 via email

@SteveBronder
Copy link
Collaborator Author

Oh yikes! My apologies. Are there tests that happen just for develop that do not happen for branches?

I'm really not sure how fvar works so we can talk about how to handle this on the gpu in the stan meeting. My initial thought is just to not have the GPU stuff happen for fvar. But thats only because I don't know a lot about forward mode

@SteveBronder
Copy link
Collaborator Author

I think a quick fix for this would be to add & std::is_same<T, double>::value to the if statement in cholesky so it only goes off for type double. Though if there's an reasonable fix to make this work for fvar that would be cool

@seantalts
Copy link
Member

Do you know how this got through?

Yes, we don't run the full unit test suite with STAN_OPENCL defined until merge to develop, running only the GPU-related tests on PRs. This PR is overloading cholesky_decompose and I allowed the PR through just on the basis of prim implementation and tests because when I asked for mixed and reverse mode, we decided we could do those with the rev PR (as we're all aiming to keep PRs narrower and more focused). I should have put 2+2 together and realized the overload could interact in non-prim modes with tests. That said, at least we won't get test failures on other PRs or for non-GPU folks.

seantalts added a commit that referenced this pull request Feb 13, 2019
Had some tests failures on develop.

This reverts commit a933f65, reversing
changes made to 70edefd.
seantalts added a commit that referenced this pull request Feb 13, 2019
@seantalts
Copy link
Member

Reverted develop and pushed a branch with the revert on it (though in the future I think it's probably fine to let contributors revert the revert): feature/issue-1058-gpu-chol-prim

@syclik
Copy link
Member

syclik commented Feb 13, 2019 via email

@SteveBronder
Copy link
Collaborator Author

Thanks for pulling this back so quick, apologies again for the goof. Now that I know some of the tests do not go off for the GPU code on jenkins I'll make sure I run the full tests at home before we do a merge

@seantalts
Copy link
Member

No worries, I think it was probably on me as the reviewer to think about this interaction.

To run all the tests that call cholesky_decompose (which would be a good idea), you can use the following shell command:

 find test -name *_test.cpp | xargs grep -l cholesky_decompose | xargs ./runTests.py -j2 

though this is slightly less tests than

./runTests.py -j2 test/unit -f chol

@rok-cesnovar
Copy link
Member

This is maybe more a question for the discourse but what would be the best resource that would help me understand forward mode a bit better?

@syclik
Copy link
Member

syclik commented Feb 13, 2019 via email

@SteveBronder
Copy link
Collaborator Author

SteveBronder commented Feb 14, 2019

Am I allowed to use C++17? This is actually a nice use case for if constexpr. If not then I think I have to have two cholesky_decompose functions. One that works for non-double types and one that works for double types

@seantalts
Copy link
Member

Nope

@seantalts
Copy link
Member

One thing we can do to fix this super simply is just rename the function to avoid overloading cholesky_decompose until the rev (and fwd? or something blocking fwd) mode versions go in.

@syclik
Copy link
Member

syclik commented Feb 15, 2019

Am I allowed to use C++17? This is actually a nice use case for if constexpr. If not then I think I have to have two cholesky_decompose functions. One that works for non-double types and one that works for double types

Can't you accomplish the same using std::enable_if? Doc: cppreference

@SteveBronder
Copy link
Collaborator Author

^yeah I'm going to use enable_if. Apologies for the delay I'll have this up on the weekend

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants