
Add MG weighted k-means #3959

Merged
merged 25 commits into from
Jun 29, 2021

Conversation

lowener
Contributor

@lowener lowener commented Jun 8, 2021

This PR adds support for MG weighted k-means and is a continuation of @akkamesh's and @cjnolet's work on PR #2126.

@lowener lowener requested review from a team as code owners June 8, 2021 07:41
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Jun 8, 2021
@lowener lowener added Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 8, 2021
@dantegd dantegd added this to PR-WIP in v21.08 Release via automation Jun 8, 2021
@dantegd dantegd added the 3 - Ready for Review Ready for review by team label Jun 8, 2021
Member

@cjnolet cjnolet left a comment


Thanks for picking this one up. Overall it looks great but we do still have an issue to fix (see comment in the review).

@@ -620,6 +657,9 @@ void fit(const raft::handle_t &handle, const KMeansParams &params,
MLCommon::device_buffer<char> workspace(handle.get_device_allocator(),
stream);

// check if weights sum up to n_samples
checkWeights(handle, workspace, weight, stream);
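The `checkWeights` prim itself is not shown in this diff. As a rough host-side sketch of the behavior the comment describes (rescaling the sample weights so they sum to `n_samples`), assuming that is all the prim does (the function name, host types, and loop below are illustrative only, not the actual on-device cumlprims code):

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical host-side analogue of the checkWeights prim:
// if the sample weights do not already sum to n_samples,
// rescale them in place so that they do. The real prim would
// perform the sum as a device-side reduction on `stream`.
void check_weights(std::vector<double>& weight) {
  const double n_samples = static_cast<double>(weight.size());
  const double wsum = std::accumulate(weight.begin(), weight.end(), 0.0);
  if (wsum != n_samples) {
    const double scale = n_samples / wsum;
    for (double& w : weight) {
      w *= scale;
    }
  }
}
```

With this convention, uniform weights are the identity case (`{1, 1, 1, 1}` stays unchanged), and any other weighting is interpreted relative to that baseline.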

Great, you pulled this over from cumlprims! IIRC, the one remaining issue is that the single-GPU k-means normalizes the weights in predict, which will cause the multi-GPU version to normalize each partition individually, since it is embarrassingly parallel.

The weights are already normalized globally in the Dask-based predict, but the single-GPU predict is going to re-normalize them locally. The more straightforward fix might be to have the C++ predict() function accept a normalize_weights argument that defaults to true, and have the multi-GPU predict function turn it off. The goal here is to eliminate the need for predict() to use the comms, because then it would no longer be able to execute embarrassingly parallel.
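To see concretely why local re-normalization is a problem, here is a toy host-side sketch (hypothetical helper, not cuml code): if each partition independently rescales its weights to sum to its own length, the relative weighting *between* partitions is destroyed.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Mimics the single-GPU predict's local normalization: rescale
// the weights of one partition so they sum to its element count.
void normalize_local(std::vector<double>& w) {
  const double s = std::accumulate(w.begin(), w.end(), 0.0);
  const double scale = static_cast<double>(w.size()) / s;
  for (double& x : w) {
    x *= scale;
  }
}
```

Consider two partitions `{1, 1}` and `{3, 3}`: globally, the second partition's samples should count 3x as much, but after each partition normalizes locally both become `{1, 1}`, so the 3x weighting silently disappears. A `normalize_weights=false` flag for the multi-GPU path avoids this without requiring predict() to communicate.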

v21.08 Release automation moved this from PR-WIP to PR-Needs review Jun 9, 2021
@dantegd dantegd added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Jun 10, 2021
@lowener
Contributor Author

lowener commented Jun 22, 2021

rerun tests

@lowener lowener requested a review from cjnolet June 22, 2021 20:34
Member

@cjnolet cjnolet left a comment


LGTM pending CI

v21.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jun 23, 2021
@cjnolet
Member

cjnolet commented Jun 23, 2021

rerun tests

@dantegd
Member

dantegd commented Jun 29, 2021

rerun tests

@dantegd
Member

dantegd commented Jun 29, 2021

Docstring fix identified in CI:

Generating docs for compound /workspace/cpp/include/cuml/cluster/kmeans_mg.hpp:49: error: The following parameter of ML::kmeans::opg::fit(const raft::handle_t &handle, const KMeansParams &params, const float *X, int n_samples, int n_features, const float *sample_weight, float *centroids, float &inertia, int &n_iter) is not documented:

@dantegd
Member

dantegd commented Jun 29, 2021

@gpucibot merge

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@3887e32).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #3959   +/-   ##
===============================================
  Coverage                ?   85.46%           
===============================================
  Files                   ?      230           
  Lines                   ?    18116           
  Branches                ?        0           
===============================================
  Hits                    ?    15482           
  Misses                  ?     2634           
  Partials                ?        0           
Flag | Coverage Δ
--- | ---
dask | 48.11% <0.00%> (?)
non-dask | 77.73% <0.00%> (?)

Last update 3887e32...7e09369.

@rapids-bot rapids-bot bot merged commit 166667b into rapidsai:branch-21.08 Jun 29, 2021
v21.08 Release automation moved this from PR-Reviewer approved to Done Jun 29, 2021
@lowener lowener deleted the enh-ext-mg-weighted-kmeans branch June 29, 2021 22:10
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
This PR adds support for MG weighted k-means and is a continuation of @akkamesh's and @cjnolet's work on PR rapidsai#2126.

Authors:
  - Micka (https://github.com/lowener)
  - https://github.com/akkamesh
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#3959