
ARIMA - Kalman loop rewrite: single megakernel instead of host loop #4006

Merged: 17 commits into rapidsai:branch-21.08, Jul 12, 2021

Conversation

@Nyrio (Contributor) commented Jun 23, 2021

This PR brings speedups of the order of 10x to seasonal ARIMA. It replaces a legacy host loop based on cuBLAS batched operations and RAPIDS prims with a custom kernel, reducing launch overheads and unnecessary reads and writes in global memory. On top of that, it paves the way for supporting missing observations.
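For illustration, here is a minimal, hypothetical sketch of the megakernel pattern the PR describes (a toy scalar state-space model with made-up names, not the actual cuML kernel): one thread block handles one batch member and runs the entire time loop inside a single launch, so the filter state stays in registers instead of round-tripping through global memory between per-step batched cuBLAS calls.

```cuda
// Hypothetical, heavily simplified sketch of the megakernel idea (scalar
// state-space model; names and parameters are invented for illustration).
#include <math.h>

__global__ void kalman_megakernel(const double* __restrict__ obs,   // [n_batch * n_obs]
                                  double* __restrict__ loglike,     // [n_batch]
                                  int n_obs,
                                  double T, double Z, double RQR)   // scalar model parameters
{
  if (threadIdx.x != 0) return;        // toy scalar model: one worker thread per block
  const int bid = blockIdx.x;          // one block per series in the batch

  double alpha = 0.0, P = 1.0, ll = 0.0;           // state, covariance, log-likelihood
  for (int t = 0; t < n_obs; t++) {                // the whole time loop lives in the kernel
    double v = obs[bid * n_obs + t] - Z * alpha;   // innovation
    double F = Z * P * Z;                          // innovation variance
    double K = T * P * Z / F;                      // Kalman gain
    alpha = T * alpha + K * v;                     // state update
    P     = T * P * T - K * F * K + RQR;           // covariance update
    ll   -= 0.5 * (log(F) + v * v / F);            // accumulate log-likelihood
  }
  loglike[bid] = ll;
}
```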

The PR introduces a set of prims in `linalg/block.cuh` to compute block-local linear algebra operations, and corresponding unit tests in `test/prims/linalg_block.cu`.
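As a hedged sketch of what such a block-local prim could look like (the name `block_gemm` and its signature are invented here, not the `block.cuh` API): all threads of a block cooperate on a small dense matrix product held in shared memory, so several such operations can be chained inside one kernel without global-memory round trips or separate cuBLAS launches.

```cuda
// Hypothetical block-local GEMM sketch: C = A * B for small column-major
// matrices resident in shared memory. Must be called by all threads of a block.
__device__ void block_gemm(const double* A,  // m x k, column-major, shared memory
                           const double* B,  // k x n, column-major, shared memory
                           double* C,        // m x n, column-major, shared memory
                           int m, int n, int k)
{
  // Caller is expected to __syncthreads() after writing A and B.
  // Each thread owns a strided subset of the m*n output elements.
  for (int idx = threadIdx.x; idx < m * n; idx += blockDim.x) {
    int i = idx % m;
    int j = idx / m;
    double acc = 0.0;
    for (int l = 0; l < k; l++) acc += A[i + l * m] * B[l + j * k];
    C[idx] = acc;
  }
  __syncthreads();  // make C visible to the whole block before it is reused
}
```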

@Nyrio Nyrio requested review from a team as code owners June 23, 2021 17:08
@Nyrio Nyrio added the labels 3 - Ready for Review, Perf, non-breaking, and improvement, and removed the CMake label Jun 23, 2021
@Nyrio Nyrio added this to PR-WIP in v21.08 Release via automation Jun 24, 2021
@tfeher tfeher self-assigned this Jun 28, 2021
@tfeher (Contributor) left a comment

Thanks Louis for this PR; it is exciting to see the speedup offered by these changes!

The implementation looks good in general. My main question is whether the implementation of the block linalg prims in block.cuh should belong here or in the raft repository. Tagging @teju85, in case he has some input on these points:

  • From ARIMA's point of view they are temporary helper functions, and that would justify having them in cuML.
  • One could also argue that it would be a more natural fit to place these in RAFT, even if they are not overly optimized. This fact can be marked in their docstring.

Further question:

  • "my version of the loading function seemed to work much faster than the one I adapted from raft" Is this specific to your use case, or do you have any suggestions for improvement for contractions.cuh?

(Inline review comments on cpp/src_prims/linalg/block.cuh and cpp/test/prims/linalg_block.cu, since resolved.)
v21.08 Release automation moved this from PR-WIP to PR-Needs review Jul 5, 2021
@github-actions github-actions bot added the CMake label Jul 5, 2021
@Nyrio (Contributor, Author) commented Jul 5, 2021

@tfeher Thanks for your review!

> Further question:
> "my version of the loading function seemed to work much faster than the one I adapted from raft" Is this specific to your use case, or do you have any suggestions for improvement for contractions.cuh?

The loading function in raft first loads from global memory to registers, then from registers to shared memory. This increases register pressure, which is a bottleneck for my kernel, whose occupancy is limited by the number of registers used. So it's not necessarily something to change in raft, just in my case.
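To illustrate the distinction, here is a hypothetical, simplified contrast between the two loading strategies (not the actual raft contractions.cuh code; tile size and names are made up): the staged variant keeps a per-thread register array alive across the copy, while the direct variant streams straight from global to shared memory and avoids that register cost.

```cuda
#define TILE  128  // elements copied per block
#define ITEMS 4    // per-thread elements; assumes blockDim.x == TILE / ITEMS

// Staged load: global -> registers -> shared. The reg[] array is live for the
// whole copy, adding ITEMS registers of pressure per thread.
__device__ void load_tile_staged(double* smem, const double* gmem)
{
  double reg[ITEMS];
#pragma unroll
  for (int i = 0; i < ITEMS; i++)
    reg[i] = gmem[threadIdx.x + i * blockDim.x];   // global -> registers
#pragma unroll
  for (int i = 0; i < ITEMS; i++)
    smem[threadIdx.x + i * blockDim.x] = reg[i];   // registers -> shared
}

// Direct load: global -> shared, with no staging registers held across iterations.
__device__ void load_tile_direct(double* smem, const double* gmem)
{
  for (int idx = threadIdx.x; idx < TILE; idx += blockDim.x)
    smem[idx] = gmem[idx];
}
// (A __syncthreads() is needed after either variant before the tile is read.)
```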

> The implementation looks good in general. My main question is whether the implementation of the block linalg prims in block.cuh should belong here or in the raft repository. Tagging @teju85, in case he has some input on these points:
>
> From ARIMA's point of view they are temporary helper functions, and that would justify having them in cuML.
>
> One could also argue that it would be a more natural fit to place these in RAFT, even if they are not overly optimized. This fact can be marked in their docstring.

These prims can clearly be useful for other algorithms, but as you said, they're more of a temporary solution and not very optimized. I don't mind making the API a bit more generic and adding a perf warning if @teju85 and other PICs think it belongs in raft.

@Nyrio Nyrio requested a review from tfeher July 5, 2021 14:16
@tfeher (Contributor) left a comment

Thanks Louis for fixing the issues and adding more tests! The PR looks good to me.

I had an offline discussion with @teju85 about the location of the block linalg prims. Currently ARIMA is the only use case for them, so we can keep them at the current location (unless someone else has strong feelings otherwise).

Please update the PR description: the TODO items should preferably go to the ARIMA optimization tracker instead of the description (which will be part of the commit message). Additionally, "part 1" could be removed from the PR title.

@Nyrio Nyrio changed the title ARIMA - Kalman loop rewrite part 1: single megakernel instead of host loop ARIMA - Kalman loop rewrite: single megakernel instead of host loop Jul 9, 2021
@Nyrio (Contributor, Author) commented Jul 9, 2021

@tfeher Thanks for the review. I've shortened the PR description to what we want in the commit message.

Tagging @rapidsai/cuml-cmake-codeowners for required approval

@robertmaynard (Contributor) left a comment

CMake changes LGTM

v21.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 9, 2021
@codecov-commenter commented

Codecov Report

No coverage uploaded for pull request base (branch-21.08@bcc4cad).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #4006   +/-   ##
===============================================
  Coverage                ?   85.59%           
===============================================
  Files                   ?      230           
  Lines                   ?    18221           
  Branches                ?        0           
===============================================
  Hits                    ?    15596           
  Misses                  ?     2625           
  Partials                ?        0           
Flag       Coverage Δ
dask       48.14% <0.00%> (?)
non-dask   77.92% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bcc4cad...b07d5b1.

@dantegd (Member) commented Jul 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c9abba1 into rapidsai:branch-21.08 Jul 12, 2021
v21.08 Release automation moved this from PR-Reviewer approved to Done Jul 12, 2021
@Nyrio Nyrio mentioned this pull request Jul 20, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
…apidsai#4006)


Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#4006
Labels: 3 - Ready for Review, CMake, CUDA/C++, improvement, non-breaking, Perf