[FEA] neighbor sampling in COO/CSR format #1982

MatthiasKohl · 2021-12-06T10:25:03Z

This pull request adds neighborhood sampling, as needed by GNN frameworks (DGL, PyTorch-Geometric).

Since I did not hear back on most of the other issues that need to be addressed before this, I am continuing with my plan of first opening a PR with just the API. Once we agree on the final API, and once a minimal version of cugraph-ops is integrated, we can add the implementation of this API.

In particular, for now I am suggesting that the sampling type is exposed in the public API (it does not exist yet in cugraph-ops since that has not been integrated yet). This must be decided ahead of sampling for best performance (either by the end user or some automatic heuristic on the original graph), which is why it makes sense to have as a separate parameter for this API.

EDIT: link to issue #1978

seunghwak

Looks good in general with some minor complaints.

And can we start with just the COO version first?

CSR version may require more thinking as the input graph (in case of SG) can be either CSR or CSC. Not sure how this will actually be implemented, but returning a CSR output for the CSC input graph can be tricky.

And we may use edgelist instead of coo in the public API name. cugraph has from_cudf_edgeelist function so it might be more consistent and edgelist sounds more generic to me than coo.

cpp/include/cugraph/algorithms.hpp

MatthiasKohl · 2021-12-06T18:27:06Z

Thanks for all the comments @seunghwak . I added some of my own comments inline.

In general, I'd prefer to restrict input graph to CSR for now (similar to !multi_gpu I'd only provide an implementation when the graph is not transposed). The reason is that none of the GNN frameworks work with transposed graphs and there is no plans to include this at the moment as far as I am aware. There probably good use cases for it, but as of now, I'd focus only on the non-transposed version.

And I agree that we can use edgelist.

aschaffer · 2021-12-06T18:43:36Z

cpp/include/cugraph/algorithms.hpp

+                      typename graph_t::vertex_type const* ptr_d_start,
+                      index_t num_vertices,
+                      index_t sampling_size,
+                      ops::sampling::SampleTypeT sampling_type);


I assume sampling_type captures how the sampling is done. This should probably be injected as a template policy / strategy (probably a functor: say, sampling_strategy_t ), rather than an enum, but that would complicate dispatching from C-API / Python.

I assume, then there would be a detail::<name>_impl that has that template policy, and this API would just instantiate the _impl with the proper sampling_strategy_t template argument. Is my assumption correct?

Yes, this captures how sampling is done.

There are several dimensions for how sampling is done actually:

The sampling algorithm being used to select the indexes: this will have an effect in terms of performance (possibly depending on input data), but it will not affect the functionality.

The kind of sampling in terms of functionality: for example, whether we use weights or not. I didn't plan to add this in the API for now, but with a functor (sampling_strategy_t kind of thing as you suggested, we may be able to add this already). This complicates things a little bit because we would have to read in data based on whether the functor requires it, but I'll see if I can think of an easy way to do that.

Whether to extract other data along with the sampled IDs: this will be useful for heterogeneous graphs, but it isn't addressed in this API at all since we would have to return more than just the sampled IDs. So this will be another overload in the future.

And yes, your assumption is correct that there will be a sampling_strategy_t in the implementation which will be selected based on the enum value.

For now, we don't need to care so much about C-API / Python since the integration of this with DGL will happen at the C++ level AFAIU, but the reason why we had an enum in cugnn is because it facilitates integration with Python.

So all of this to say that I'm open to changing this to a template type if it makes more sense to you, but basically, I'd consider this template type more of an implementation detail which the user shouldn't have to include.

Additionally, users will (in general) not have access to the implementation of that functor since cugraph-ops is closed-source. This means that there isn't much of an added value to the user to have such a functor interface, IMO.

At some point in the future, we may want to change this and allow users to add their own functors but this will most likely by as part of a revised/new API.

I think we need to create a GitHub issue with the larger picture of the goals that these pieces of API are trying to achieve (and link to that issue in this PR).

Specifically, I would like to see if I can, for example, use this for the purposes of MNMG MFG, or if there is some overlap. So that we can avoid code / effort duplication. Because I'm also looking into a generic MNMG neighborhood retrieval; by that I mean the process of scanning the neighborhood of a (set of) vertex (vertices) and retrieving some vertices / edges in that neighborhood according with an injected how criteria (again, most likely a static strategy = template policy). I'm specifically avoiding here the term neighborhood sampling because that is too closely tied with G(C)NN's purposes and random processes. As I see it, neighborhood sampling are just special cases of neighborhood retrieval that are dedicated to some specific G(C)NN purposes.

What I'm after is the more generic neighborhood retrieval one (I assumed BFS is that, but per recent conversations that might be too expensive; so something more "customizable" is needed here, where a retrieval policy can be injected, which knows how to take advantage of shortcuts and not "oversolve", like BFS does; that being said, a customized multi-seed BFS might be how this could be implemented, etc.). So then the question is: is there any overlap between this idea of generic neighborhood retrieval and the implementation(s) you are proposing here? If so, let's identify the overlap and plan accordingly to not duplicate code.

Sorry I forgot to link the issue, I commented below and edited the description above as well.

I agree that you'll be able to use part of the implementation of this API in order to implement other things, like MNMG neighborhood sampling or possibly a more generic neighborhood retrieval.

But what is the purpose of making the public-facing API more generic?
Do you have specific applications of the more generic neighborhood retrieval in mind?
Do we know whether the users want this generic neighborhood retrieval API over a set of more specific APIs?

In addition to this, even though the implementation can be re-used for selecting neighbors in other ways, it will be mostly tuned for uniform neighborhood sampling.
Thus, it's likely that other implementations are better for different ways of selecting neighbors.

Finally, the suggested API with output template<typename retrieval_strategy_t,...> is highly problematic since the implementation within cugraph-ops will be closed source, at least for now.

Thus, you would have to give me an exact list of instantiations of this retrieval_strategy_t in order for us not to expose the implementation.
Do you have this list?

The fact that it's SG doesn't seem to be mentioned in the linked issue (unless I missed it). If it's not, then probably it should. In fact all this comment above is much more descriptive than the content in the issue, I think.

Also, MFG is mentioned in the issue. It would probably be helpful if we don't both work on implementing overlapping functionality. However, given the specifics of MNMG implementation, there probably is not a real overlap. We need to revisit this later and possibly refactor. So, in light of the SG spec for this I will then assume there isn't any overlap and the 2 implementations, although both targetting MFG but in different contexts, don't have common parts (yet).

I don't understand how you plan to provide an open source API for which some implementation details are closed source. This intention along with some information about what exactly is closed source and how (to the extent that that information is shareable...you've already shared such information above on this public forum, so I assume some of that info is shareable) should be added into the description of the issue and / or this PR. Not become uncovered in PR comments.

The more we discuss this the more gets revealed. At this point I am confused about the scope of this PR. It seems that a lot of questions are still un-answered about that. Perhaps answering them before pushing an API signature might make more sense.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic. The immediate needs for this would have been, again, MFG and RW (MNMG, both). I don't know what other possible future uses of it might be, yet.

I've edited the issue and I'm mentioning cugraph-ops in the issue description now, which will provide the implementation.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic.

In this case, we cannot use your suggested API. What we could do is to allow users to provide an opaque structure representing an operator, similar to what cusparse does in SpMMOp (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op).
Edit: I am aware that this is a C API, but C++ will not allow us to use compile-time typing here either since that obviously involves access to the sources of cugraph-ops.
Because of the complications this involves and because there is no indication that we will use this for anything other than random selection at the moment, I'd suggest not to opt for that right away.
We can always provide such an API later on if necessary.

seunghwak · 2021-12-06T18:54:45Z

In general, I'd prefer to restrict input graph to CSR for now.

Sounds fine to me. Actually, CSC-ish formats are used only for a limited set of algorithms (PageRank, Katz Centrality, and HITS) and we use CSR-ish (-ish as it is no-longer a simple CSR in multi-GPU) formats by default.

MatthiasKohl · 2021-12-07T08:45:44Z

Forgot to link the issue here, I also edited the description above: #1978

MatthiasKohl · 2021-12-07T11:20:09Z

And we may use edgelist instead of coo in the public API name. cugraph has from_cudf_edgeelist function so it might be more consistent and edgelist sounds more generic to me than coo.

Coming back to this, for the CSR interface, it actually doesn't make much sense to return an edgelist. Plus, the edgelist structure contains weights as well which makes sense to have in the future, but it's likely that DGL will simply gather things like weights and/or node/edge type (heterogeneous graphs) afterwards anyway. So for now, I think it makes more sense to simply return the CSR or COO structure directly.

Plus, in the mid-term, we definitely want to use the CSR-based method since both DGL and PyG use CSR ultimately.
The COO-based interface is mainly a compromise to allow DGL to integrate more easily in the beginning.

seunghwak · 2021-12-07T17:13:53Z

I agree that we should return COO or CSR.

I am just complaining about function names. I believe edge list and COO basically mean the same thing and adjacency list and CSR are (roughly) same as well, but COO & CSR are more rigorous technical terminologies for hardcore computer scientists while data scientists may feel more familiar with edge list or adjacency list (please correct me this understanding is not right). NetworkX has from_pandas_edgelist and from_pandas_adjacency but AFAIK, NetworkX doesn't use CSR or COO in their public API.

We have used COO & CSR in our legacy graph classes, but we're sort of moving away from these terminologies in our new public API. So, I am a bit uncomfortable about using COO or CSR in our public API, but pretty much agree that we should return COO or CSR.

MatthiasKohl · 2021-12-07T18:17:31Z

I am just complaining about function names. I believe edge list and COO basically mean the same thing and adjacency list and CSR are (roughly) same as well, but COO & CSR are more rigorous technical terminologies for hardcore computer scientists while data scientists may feel more familiar with edge list or adjacency list (please correct me this understanding is not right).

I do agree with this, I think the reason why I used COO/CSR is because DGL & PyG already use those terms and they are of course the primary target for this API.
I'm happy to rename it to something else though, maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ?
Do you have any other/better suggestions?

seunghwak · 2021-12-07T18:41:32Z

maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ? Do you have any other/better suggestions?

Either is fine but for some reason, we have been using edgelist instead of edge_list (maybe due to NetowrkX's from_pandas_edgelist). I think we'd better be consistent and stick with edgelist or change edgelist elsewhere in the C++ public API to edge_list.

@ChuckHastings Any thoughts?

edgelist is somewhat OK, but I guess adjacency_list is better than adjacencylist (NetworkX is skipping list in from_pandas_adjacency but I guess this is a bit inconsistent).

github-actions · 2022-01-07T14:05:06Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

…raph-ops

MatthiasKohl · 2022-01-12T11:36:22Z

@seunghwak @aschaffer I addressed your comments above and added an implementation based on cugraph-ops.
Of course, this cannot work yet since cugraph-ops isn't available for build right now, but could you already take a look at the updated interface and implementation to see if this makes sense to you in general?

In terms of the interface, I mainly changed the names and added a Rng argument which will be needed to keep random state in between calls to this API.
This argument can be treated essentially the same way as the RAFT handle for the user of the API.

…des must be available at runtime)

robertmaynard · 2022-02-22T13:55:23Z

cpp/CMakeLists.txt

@@ -285,6 +285,7 @@ target_include_directories(cugraph
 # - link libraries -------------------------------------------------------------
 target_link_libraries(cugraph
    PUBLIC
+        cugraphops::cugraphops


As cugraphops is part of the public api for cugraph, it needs to be properly exported.

This is causing the CI failures as consumers of the cugraph-config.cmake have no way of finding cugraphops::cugraphops. This causes the below error:

11:32:16 Target "cugraph_etl" links to target "cugraphops::cugraphops" but the 11:32:16 target was not found. Perhaps a find_package() call is missing for an 11:32:16 IMPORTED target, or an ALIAS target is missing?

https://github.com/rapidsai/cugraph/blob/branch-22.04/cpp/cmake/thirdparty/get_libcugraphops.cmake
needs to add BUILD and INSTALL export set entries to the rapids_find_generate_module and rapids_find_package lines ( cugraph-exports is the export set ).

@robertmaynard if you see the similar setup in cuML for the libcumlprims, it seems to not follow your suggestion, but that's working for cuML. Any ideas why?

Because cumlprims_mg is a private dependency of cuml and therefore doesn't need to be found by consumers of cuml.

Got it. Thanks Robert.

ajschmidt8

Approving ops-codeowner file changes

BradReesWork · 2022-02-22T17:54:22Z

@gpucibot merge

ChuckHastings · 2022-02-22T18:01:37Z

maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ? Do you have any other/better suggestions?

Either is fine but for some reason, we have been using edgelist instead of edge_list (maybe due to NetowrkX's from_pandas_edgelist). I think we'd better be consistent and stick with edgelist or change edgelist elsewhere in the C++ public API to edge_list.

@ChuckHastings Any thoughts?

edgelist is somewhat OK, but I guess adjacency_list is better than adjacencylist (NetworkX is skipping list in from_pandas_adjacency but I guess this is a bit inconsistent).

Don't have a strong preference. I agree that networkx is inconsistent. I also don't think it's necessary for us to follow networkx at the C++ level. It is sometimes convenient. But I expect as we start integrating with more graph frameworks we will find different organizational structures and naming conventions that will be different.

But for now I'd say let's go with what's in the PR. We can always rename things later if we settle on a different naming convention.

ChuckHastings · 2022-02-22T18:57:32Z

@gpucibot merge

This PR tracks work on MNMG Neighborhood Sampling, for G(C)NN needs. Dependencies: 1. #1982 2. #2064 3. Integration of rapidsai/cugraph-ops#24 into `cugraph` Authors: - Andrei Schaffer (https://github.com/aschaffer) Approvers: - Seunghwa Kang (https://github.com/seunghwak) - Kumar Aatish (https://github.com/kaatish) - Chuck Hastings (https://github.com/ChuckHastings) - Rick Ratzel (https://github.com/rlratzel) URL: #2073

added neighbor sampling API

e2479de

MatthiasKohl requested a review from a team as a code owner December 6, 2021 10:25

BradReesWork requested review from seunghwak and aschaffer December 6, 2021 14:43

BradReesWork changed the title ~~[FEA] neighbor sampling in COO/CSR format~~ [API Review] neighbor sampling in COO/CSR format Dec 6, 2021

BradReesWork added the 3 - Ready for Review label Dec 6, 2021

BradReesWork added this to the 22.02 milestone Dec 6, 2021

BradReesWork added this to PRs in 22.02 Release Dec 6, 2021

BradReesWork added the DRAFT label Dec 6, 2021

BradReesWork requested a review from ChuckHastings December 6, 2021 14:45

seunghwak reviewed Dec 6, 2021

View reviewed changes

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved

aschaffer requested changes Dec 6, 2021

View reviewed changes

rlratzel added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Dec 6, 2021

improved neighbor sampling api according to suggestions

dddbf07

github-actions bot added the inactive-30d label Jan 7, 2022

Merge branch 'branch-22.02' into fea_sampling_mfg_sg

983f0de

MatthiasKohl changed the title ~~[API Review] neighbor sampling in COO/CSR format~~ [WIP] [FEA] neighbor sampling in COO/CSR format Jan 12, 2022

added implementation of single-GPU neighborhood sampling based on cug…

59e3c3c

…raph-ops

github-actions bot removed the inactive-30d label Jan 12, 2022

BradReesWork added the 4 - Waiting on Author label Jan 20, 2022

BradReesWork added this to PRs in 22.04 Release Jan 26, 2022

aschaffer mentioned this pull request Feb 14, 2022

MNMG Neighborhood Sampling #2073

Merged

MatthiasKohl added 5 commits February 15, 2022 02:24

Merge branch 'branch-22.04' into fea_sampling_mfg_sg

e63787f

add sg sampling: clang format/copyright fix

71329dc

Merge branch 'branch-22.04' into fea_sampling_mfg_sg

b75c8cc

using latest cugraph-ops internal RNG

d60eb8d

libcugraphops added as run dependency (since shared library and inclu…

967ca79

…des must be available at runtime)

MatthiasKohl requested a review from a team as a code owner February 21, 2022 12:47

updated copyright

3a7a785

MatthiasKohl changed the title ~~[WIP] [FEA] neighbor sampling in COO/CSR format~~ [FEA] neighbor sampling in COO/CSR format Feb 21, 2022

cugraph-ops: added cmake dependency

9d94156

MatthiasKohl requested a review from a team as a code owner February 21, 2022 16:05

robertmaynard suggested changes Feb 22, 2022

View reviewed changes

MatthiasKohl added 3 commits February 22, 2022 06:11

fixed no install build

578e442

added cugraph ops to cmake build and install exports

b4e3122

added neighborhood sampling to Cmake and fixed compilation

96c00c8

ajschmidt8 approved these changes Feb 22, 2022

View reviewed changes

BradReesWork removed DRAFT labels Feb 22, 2022

robertmaynard approved these changes Feb 22, 2022

View reviewed changes

aschaffer self-requested a review February 22, 2022 18:13

aschaffer approved these changes Feb 22, 2022

View reviewed changes

ChuckHastings approved these changes Feb 22, 2022

View reviewed changes

rapids-bot bot merged commit e95171f into rapidsai:branch-22.04 Feb 22, 2022

22.04 Release automation moved this from PRs to Done Feb 22, 2022

MatthiasKohl mentioned this pull request Apr 14, 2022

[FEA] Neighborhood sampling (MFG, single-GPU) #1978

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] neighbor sampling in COO/CSR format #1982

[FEA] neighbor sampling in COO/CSR format #1982

MatthiasKohl commented Dec 6, 2021 •

edited

seunghwak left a comment

MatthiasKohl commented Dec 6, 2021

aschaffer Dec 6, 2021

MatthiasKohl Dec 6, 2021

MatthiasKohl Dec 6, 2021

aschaffer Dec 6, 2021

MatthiasKohl Dec 7, 2021

MatthiasKohl Dec 7, 2021

MatthiasKohl Dec 7, 2021

aschaffer Dec 7, 2021

aschaffer Dec 7, 2021

MatthiasKohl Dec 8, 2021 •

edited

seunghwak commented Dec 6, 2021

MatthiasKohl commented Dec 7, 2021

MatthiasKohl commented Dec 7, 2021

seunghwak commented Dec 7, 2021 •

edited

MatthiasKohl commented Dec 7, 2021

seunghwak commented Dec 7, 2021 •

edited

github-actions bot commented Jan 7, 2022

MatthiasKohl commented Jan 12, 2022

robertmaynard Feb 22, 2022

teju85 Feb 22, 2022

robertmaynard Feb 22, 2022

teju85 Feb 22, 2022

ajschmidt8 left a comment

BradReesWork commented Feb 22, 2022

ChuckHastings commented Feb 22, 2022

ChuckHastings commented Feb 22, 2022

[FEA] neighbor sampling in COO/CSR format #1982

[FEA] neighbor sampling in COO/CSR format #1982

Conversation

MatthiasKohl commented Dec 6, 2021 • edited

seunghwak left a comment

Choose a reason for hiding this comment

MatthiasKohl commented Dec 6, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MatthiasKohl Dec 8, 2021 • edited

Choose a reason for hiding this comment

seunghwak commented Dec 6, 2021

MatthiasKohl commented Dec 7, 2021

MatthiasKohl commented Dec 7, 2021

seunghwak commented Dec 7, 2021 • edited

MatthiasKohl commented Dec 7, 2021

seunghwak commented Dec 7, 2021 • edited

github-actions bot commented Jan 7, 2022

MatthiasKohl commented Jan 12, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajschmidt8 left a comment

Choose a reason for hiding this comment

BradReesWork commented Feb 22, 2022

ChuckHastings commented Feb 22, 2022

ChuckHastings commented Feb 22, 2022

MatthiasKohl commented Dec 6, 2021 •

edited

MatthiasKohl Dec 8, 2021 •

edited

seunghwak commented Dec 7, 2021 •

edited

seunghwak commented Dec 7, 2021 •

edited