
ARIMA - Kalman loop rewrite: single megakernel instead of host loop #4006

Merged: 17 commits into rapidsai:branch-21.08, Jul 12, 2021

Conversation

@Nyrio (Contributor) commented Jun 23, 2021

This PR brings speedups of the order of 10x to seasonal ARIMA. It replaces a legacy host loop based on cuBLAS batched operations and RAPIDS prims with a custom kernel, reducing launch overheads and unnecessary reads and writes in global memory. On top of that, it paves the way for supporting missing observations.
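For illustration, here is a minimal, hypothetical sketch of the megakernel pattern the PR describes (a toy scalar state-space model with made-up names, not the actual cuML kernel): one thread block handles one batch member and runs the entire time loop inside a single launch, so the filter state stays in registers instead of round-tripping through global memory between per-step batched cuBLAS calls.

```cuda
// Hypothetical, heavily simplified sketch of the megakernel idea (scalar
// state-space model; names and parameters are invented for illustration).
#include <math.h>

__global__ void kalman_megakernel(const double* __restrict__ obs,   // [n_batch * n_obs]
                                  double* __restrict__ loglike,     // [n_batch]
                                  int n_obs,
                                  double T, double Z, double RQR)   // scalar model parameters
{
  if (threadIdx.x != 0) return;        // toy scalar model: one worker thread per block
  const int bid = blockIdx.x;          // one block per series in the batch

  double alpha = 0.0, P = 1.0, ll = 0.0;           // state, covariance, log-likelihood
  for (int t = 0; t < n_obs; t++) {                // the whole time loop lives in the kernel
    double v = obs[bid * n_obs + t] - Z * alpha;   // innovation
    double F = Z * P * Z;                          // innovation variance
    double K = T * P * Z / F;                      // Kalman gain
    alpha = T * alpha + K * v;                     // state update
    P     = T * P * T - K * F * K + RQR;           // covariance update
    ll   -= 0.5 * (log(F) + v * v / F);            // accumulate log-likelihood
  }
  loglike[bid] = ll;
}
```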

The PR introduces a set of prims in `linalg/block.cuh` to compute block-local linear algebra operations, and corresponding unit tests in `test/prims/linalg_block.cu`.
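As a hedged sketch of what such a block-local prim could look like (the name `block_gemm` and its signature are invented here, not the `block.cuh` API): all threads of a block cooperate on a small dense matrix product held in shared memory, so several such operations can be chained inside one kernel without global-memory round trips or separate cuBLAS launches.

```cuda
// Hypothetical block-local GEMM sketch: C = A * B for small column-major
// matrices resident in shared memory. Must be called by all threads of a block.
__device__ void block_gemm(const double* A,  // m x k, column-major, shared memory
                           const double* B,  // k x n, column-major, shared memory
                           double* C,        // m x n, column-major, shared memory
                           int m, int n, int k)
{
  // Caller is expected to __syncthreads() after writing A and B.
  // Each thread owns a strided subset of the m*n output elements.
  for (int idx = threadIdx.x; idx < m * n; idx += blockDim.x) {
    int i = idx % m;
    int j = idx / m;
    double acc = 0.0;
    for (int l = 0; l < k; l++) acc += A[i + l * m] * B[l + j * k];
    C[idx] = acc;
  }
  __syncthreads();  // make C visible to the whole block before it is reused
}
```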

@Nyrio Nyrio requested review from a team as code owners June 23, 2021 17:08
@Nyrio Nyrio added the labels 3 - Ready for Review, Perf, non-breaking, and improvement, and removed the CMake label Jun 23, 2021
@Nyrio Nyrio added this to PR-WIP in v21.08 Release via automation Jun 24, 2021
@tfeher tfeher self-assigned this Jun 28, 2021
@tfeher (Contributor) left a comment

Thanks Louis for this PR; it is exciting to see the speedup offered by these changes!

The implementation looks good in general. My main question is whether the implementation of the block linalg prims in block.cuh should belong here or in the raft repository. Tagging @teju85, in case he has some input on these points:

  • From ARIMA's point of view they are temporary helper functions, and that would justify having them in cuML.
  • One could also argue that it would be a more natural fit to place these in RAFT, even if they are not overly optimized. This fact can be marked in their docstring.

Further question:

  • "my version of the loading function seemed to work much faster than the one I adapted from raft" Is this specific to your use case, or do you have any suggestions for improvement for contractions.cuh?

(Inline review comments on cpp/src_prims/linalg/block.cuh and cpp/test/prims/linalg_block.cu, since resolved.)
v21.08 Release automation moved this from PR-WIP to PR-Needs review Jul 5, 2021
@github-actions github-actions bot added the CMake label Jul 5, 2021
@Nyrio (Contributor, Author) commented Jul 5, 2021

@tfeher Thanks for your review!

> Further question:
> "my version of the loading function seemed to work much faster than the one I adapted from raft" Is this specific to your use case, or do you have any suggestions for improvement for contractions.cuh?

The loading function in raft first loads from global memory to registers, then from registers to shared memory. This increases register pressure, which is a bottleneck for my kernel, whose occupancy is limited by the number of registers used. So it's not necessarily something to change in raft, just in my case.
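To illustrate the distinction, here is a hypothetical, simplified contrast between the two loading strategies (not the actual raft contractions.cuh code; tile size and names are made up): the staged variant keeps a per-thread register array alive across the copy, while the direct variant streams straight from global to shared memory and avoids that register cost.

```cuda
#define TILE  128  // elements copied per block
#define ITEMS 4    // per-thread elements; assumes blockDim.x == TILE / ITEMS

// Staged load: global -> registers -> shared. The reg[] array is live for the
// whole copy, adding ITEMS registers of pressure per thread.
__device__ void load_tile_staged(double* smem, const double* gmem)
{
  double reg[ITEMS];
#pragma unroll
  for (int i = 0; i < ITEMS; i++)
    reg[i] = gmem[threadIdx.x + i * blockDim.x];   // global -> registers
#pragma unroll
  for (int i = 0; i < ITEMS; i++)
    smem[threadIdx.x + i * blockDim.x] = reg[i];   // registers -> shared
}

// Direct load: global -> shared, with no staging registers held across iterations.
__device__ void load_tile_direct(double* smem, const double* gmem)
{
  for (int idx = threadIdx.x; idx < TILE; idx += blockDim.x)
    smem[idx] = gmem[idx];
}
// (A __syncthreads() is needed after either variant before the tile is read.)
```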

> The implementation looks good in general. My main question is whether the implementation of the block linalg prims in block.cuh should belong here or in the raft repository. Tagging @teju85, in case he has some input on these points:
>
> From ARIMA's point of view they are temporary helper functions, and that would justify having them in cuML.
>
> One could also argue that it would be a more natural fit to place these in RAFT, even if they are not overly optimized. This fact can be marked in their docstring.

These prims can clearly be useful for other algorithms, but as you said, they're more of a temporary solution and not very optimized. I don't mind making the API a bit more generic and adding a perf warning if @teju85 and other PICs think it belongs in raft.

@Nyrio Nyrio requested a review from tfeher July 5, 2021 14:16
@tfeher (Contributor) left a comment

Thanks Louis for fixing the issues and adding more tests! The PR looks good to me.

I had an offline discussion with @teju85 about the location of the block linalg prims. Currently ARIMA is the only use case for them, so we can keep them at the current location (unless someone else has strong feelings otherwise).

Please update the PR description: the TODO items should preferably go to the ARIMA optimization tracker instead of the description (which will be part of the commit message). Additionally, "part 1" could be removed from the PR title.

@Nyrio Nyrio changed the title ARIMA - Kalman loop rewrite part 1: single megakernel instead of host loop ARIMA - Kalman loop rewrite: single megakernel instead of host loop Jul 9, 2021
@Nyrio (Contributor, Author) commented Jul 9, 2021

@tfeher Thanks for the review. I've shortened the PR description to what we want in the commit message.

Tagging @rapidsai/cuml-cmake-codeowners for required approval

@robertmaynard (Contributor) left a comment

CMake changes LGTM

v21.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 9, 2021
@codecov-commenter commented

Codecov Report

No coverage uploaded for pull request base (branch-21.08@bcc4cad).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #4006   +/-   ##
===============================================
  Coverage                ?   85.59%           
===============================================
  Files                   ?      230           
  Lines                   ?    18221           
  Branches                ?        0           
===============================================
  Hits                    ?    15596           
  Misses                  ?     2625           
  Partials                ?        0           
Flag       Coverage Δ
dask       48.14% <0.00%> (?)
non-dask   77.92% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bcc4cad...b07d5b1.

@dantegd (Member) commented Jul 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c9abba1 into rapidsai:branch-21.08 Jul 12, 2021
v21.08 Release automation moved this from PR-Reviewer approved to Done Jul 12, 2021
@Nyrio Nyrio mentioned this pull request Jul 20, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
…apidsai#4006)


Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#4006
Labels: 3 - Ready for Review, CMake, CUDA/C++, improvement, non-breaking, Perf