
Faster glm ols-via-eigendecomposition algorithm #4201

Merged — 10 commits merged into rapidsai:branch-21.10 on Sep 24, 2021

Conversation

@achirkin (Contributor) commented Sep 10, 2021

This PR makes several improvements to the GLM ordinary least squares (OLS) implementation:

  1. Split `MLCommon::LinAlg::lstsq`, which solves OLS by computing the SVD of the feature matrix (A = UΣVᵀ). Originally, this function provided two algorithms to do so (using cusolver QR decomposition and using eigenvalue decomposition).
  2. Add a third algorithm for solving OLS via SVD: cusolver Jacobi iterations.
  3. Inline `raft::linalg::svdEig`. This function computes the SVD of the feature matrix A = UΣVᵀ via the eigenvalue decomposition QΛQᵀ = AᵀA, which is faster but requires additional manipulations to recover Σ and U (while V = Q). Instead, I use `raft::linalg::eigDC` directly to compute w = (AᵀA)⁻¹Aᵀb = (QΛ⁻¹Qᵀ)(Aᵀb). This also makes it possible to compute Aᵀb concurrently (in another CUDA stream) with the rest of the operations, which often fit on the GPU at the same time when n_cols << n_rows.
  4. A small thing: reduce the number of memory allocations by using `rmm::device_uvector`. This should reduce benchmarking variance when using the default memory allocator.
  5. Force the SVD-Jacobi algorithm when n_cols > n_rows, because none of the other algorithms support this case, either in theory or due to cusolver limitations.
  6. Update the Python interface to allow choosing among all available algorithms.
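The eig path in item 3 can be sketched in NumPy terms (this is an illustration, not the PR's CUDA code; `eigh` stands in for `raft::linalg::eigDC`):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 5          # n_cols << n_rows, the case the eig path targets
A = rng.standard_normal((n_rows, n_cols))
w_true = rng.standard_normal(n_cols)
b = A @ w_true

# Eigendecomposition of the symmetric Gram matrix: A^T A = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(A.T @ A)

# A^T b does not depend on the eigendecomposition, so on the GPU it can be
# computed concurrently in a separate stream while eigDC runs.
Atb = A.T @ b

# w = (A^T A)^{-1} A^T b = Q diag(1/lam) Q^T (A^T b)
w_eig = Q @ ((Q.T @ Atb) / lam)

# Reference solution via SVD-based least squares.
w_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(w_eig, w_svd)
```

Note that the normal-equations matrix AᵀA is singular when n_cols > n_rows, which is one way to see why item 5 falls back to SVD-Jacobi for that case.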

@achirkin achirkin added the 2 - In Progress Currently a work in progress, improvement Improvement / enhancement to an existing function, and non-breaking Non-breaking change labels and removed the CUDA/C++ label on Sep 10, 2021
@achirkin achirkin self-assigned this Sep 10, 2021
@achirkin (Contributor, Author) commented:

Some pictures for reference. Tested on an AMD Ryzen 9 5950X and an RTX 3090. The green line shows results for scikit-learn patched by Intel's intelex project; the blue line is cuML. The benchmark does not include input_to_cuml_array, which otherwise takes 80% of the execution time. fit_intercept is also turned off, which otherwise takes 40-60% of the remaining execution time.

Original:
[benchmark plot: glm_eig_cuvsix_orig_nofi_notrans_float32_1000000rows_upto500cols]

After modifications:
[benchmark plot: glm_eig_cuvsix_conc_nofi_notrans_float32_1000000rows_upto500cols]

@achirkin (Contributor, Author) commented Sep 10, 2021

For reference, the same setting after modifications, with:

fit_intercept = True:
[benchmark plot: glm_eig_cuvsix_conc_fi_notrans_float32_1000000rows_upto500cols]

fit_intercept = True and input_to_cuml_array included in the benchmark:
[benchmark plot: glm_eig_cuvsix_conc_fi_trans_float32_1000000rows_upto500cols]

Mind the weird spikes on the training/exec time plots. I ran the test only once for each plot point; maybe the spikes come from memory allocation? I profiled the troubled input sizes but did not find any anomalies.

@achirkin (Contributor, Author) commented:

Nsys timeline with 1000000 rows and 100 columns:

Original:
[Nsys screenshot: 2021-09-10 13-08-37]

After modifications:
[Nsys screenshot: 2021-09-10 13-09-42]

(The orange region is the time spent in raft::linalg::eigDC, which is the same in both variants.)

@caryr35 caryr35 added this to PR-WIP in v21.10 Release via automation Sep 13, 2021
@achirkin (Contributor, Author) commented:

This PR now depends on rapidsai/raft#327 and rapidsai/rmm#870, so I can reduce the unnecessary/duplicate helper code.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Sep 15, 2021
@achirkin (Contributor, Author) commented Sep 15, 2021

There is also a weird accuracy drop when n_cols == n_rows and fit_intercept == True for the SVD-based algorithms; I'm investigating it.
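For illustration only (this is a generic numerical effect, not the PR's diagnosis of the bug above): forming AᵀA squares the condition number, so on an ill-conditioned square system the eig path can lose far more accuracy than the SVD path. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
# Build a square matrix with a prescribed spread of singular values.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)          # cond(A) ~ 1e6, so cond(A^T A) ~ 1e12
A = U @ np.diag(s) @ V.T
w_true = rng.standard_normal(n)
b = A @ w_true

# Eig path: w = Q diag(1/lam) Q^T (A^T b), with lam the eigenvalues of A^T A.
lam, Q = np.linalg.eigh(A.T @ A)
w_eig = Q @ ((Q.T @ (A.T @ b)) / lam)

# SVD path: least squares directly on A.
w_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

err_eig = np.linalg.norm(w_eig - w_true) / np.linalg.norm(w_true)
err_svd = np.linalg.norm(w_svd - w_true) / np.linalg.norm(w_true)
assert err_svd < err_eig           # the SVD path is noticeably more accurate here
```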

@cjnolet (Member) commented Sep 17, 2021

@achirkin, I've held off on posting a review of this PR because it's still marked as a draft. Let me know when you are ready and I can also give it a look over for you.

@achirkin achirkin marked this pull request as ready for review September 20, 2021 11:35
@achirkin achirkin requested review from a team as code owners September 20, 2021 11:35
@achirkin (Contributor, Author) commented:

Thanks, @cjnolet. Before, I was waiting for my PRs to raft and rmm to get through. Now I've added the changes I wanted; if the CI checks pass, consider it ready for review.

@achirkin (Contributor, Author) commented:

rerun tests

@cjnolet (Member) left a review comment:


The changes look great. Very minor things (and a couple of comments in preparation for the move to raft).

```cpp
    int algo,
    cudaStream_t stream)

void lstsqSvdQR(const raft::handle_t& handle,
    math_t* A,  // apparently, this must not be const, because cusolverDn<t>gesvd() says
```
@cjnolet (Member) commented:

Even if we immediately cast it away, it would be nice to be consistent and keep A as a const so that users know that it shouldn't be expected to change. Maybe also add a little note in the function comment. It would be nice to see consts used in the other lstsq* functions as well.

@achirkin (Contributor, Author) replied Sep 22, 2021:

The problem is that I'm not sure A stays unmodified. For example, lstsqQR is guaranteed to update it (it converts A into a triangular system). For lstsqSvdJacobi, cuSOLVER's cusolverDn<t>gesvdj says "On exit, the contents of A are destroyed," and the parameter is not marked const. The same goes for lstsqSvdQR, except that cusolverDn<t>gesvd explicitly provides an option to write outputs into A to save memory (although I disabled it). In both cusolverDn<t>gesvd and cusolverDn<t>gesvdj, parameter A is marked [in/out], so we cannot guarantee it is not modified :( (https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-gesvd)

I could copy the const array A, though I'm not sure it's worth it.
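The trade-off being discussed, in a toy Python stand-in (not cusolver; the destructive routine here is invented to mimic gesvd's "[in/out] A" behavior): either the API exposes the backend's destruction of A, or a wrapper pays one copy to give callers const-like semantics.

```python
import numpy as np

def destructive_solve(A, b):
    """Stand-in for a gesvd/gesvdj-style routine that uses A as scratch space."""
    x = np.linalg.solve(A, b)   # compute the answer first...
    A[:] = 0.0                  # ...then clobber A, like "contents destroyed on exit"
    return x

def solve_keeping_A_intact(A, b):
    """Const-preserving wrapper: copies A so the caller's array is untouched."""
    return destructive_solve(A.copy(), b)

A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = solve_keeping_A_intact(A, b)
assert np.allclose(x, [1.0, 2.0])
assert np.allclose(A, [[2.0, 0.0], [0.0, 4.0]])   # A survived the call
```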

Additional review threads on cpp/src_prims/linalg/lstsq.cuh (resolved).
v21.10 Release automation moved this from PR-WIP to PR-Needs review Sep 20, 2021
@achirkin (Contributor, Author) commented:

rerun tests

@codecov-commenter commented:

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@d657178).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.10    #4201   +/-   ##
===============================================
  Coverage                ?   85.99%           
===============================================
  Files                   ?      231           
  Lines                   ?    19238           
  Branches                ?        0           
===============================================
  Hits                    ?    16543           
  Misses                  ?     2695           
  Partials                ?        0           
Flag Coverage Δ
dask 46.89% <0.00%> (?)
non-dask 78.74% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d657178...3b7da99.

@dantegd dantegd added the 4 - Waiting on Reviewer Waiting for reviewer to review or respond label and removed the 2 - In Progress Currently a work in progress label on Sep 23, 2021
@cjnolet (Member) left a review comment:

LGTM

v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 24, 2021
@cjnolet (Member) commented Sep 24, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 26408d3 into rapidsai:branch-21.10 Sep 24, 2021
v21.10 Release automation moved this from PR-Reviewer approved to Done Sep 24, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4201