Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] neighbor sampling in COO/CSR format #1982

Merged
merged 14 commits into from Feb 22, 2022

Conversation

MatthiasKohl
Copy link
Contributor

@MatthiasKohl MatthiasKohl commented Dec 6, 2021

This pull request adds neighborhood sampling, as needed by GNN frameworks (DGL, PyTorch-Geometric).

Since I did not hear back on most of the other issues that need to be addressed before this, I am continuing with my plan of first opening a PR with just the API. Once we agree on the final API, and once a minimal version of cugraph-ops is integrated, we can add the implementation of this API.

In particular, for now I am suggesting that the sampling type is exposed in the public API (it does not exist yet in cugraph-ops since that has not been integrated yet). This must be decided ahead of sampling for best performance (either by the end user or some automatic heuristic on the original graph), which is why it makes sense to have as a separate parameter for this API.

EDIT: link to issue #1978

@MatthiasKohl MatthiasKohl requested a review from a team as a code owner December 6, 2021 10:25
@BradReesWork BradReesWork changed the title [FEA] neighbor sampling in COO/CSR format [API Review] neighbor sampling in COO/CSR format Dec 6, 2021
@BradReesWork BradReesWork added this to the 22.02 milestone Dec 6, 2021
@BradReesWork BradReesWork added this to PRs in 22.02 Release Dec 6, 2021
Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general with some minor complaints.

And can we start with just the COO version first?

CSR version may require more thinking as the input graph (in case of SG) can be either CSR or CSC. Not sure how this will actually be implemented, but returning a CSR output for the CSC input graph can be tricky.

And we may use edgelist instead of coo in the public API name. cugraph has from_cudf_edgeelist function so it might be more consistent and edgelist sounds more generic to me than coo.

cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
cpp/include/cugraph/algorithms.hpp Outdated Show resolved Hide resolved
@MatthiasKohl
Copy link
Contributor Author

Thanks for all the comments @seunghwak . I added some of my own comments inline.

In general, I'd prefer to restrict input graph to CSR for now (similar to !multi_gpu I'd only provide an implementation when the graph is not transposed). The reason is that none of the GNN frameworks work with transposed graphs and there is no plans to include this at the moment as far as I am aware. There probably good use cases for it, but as of now, I'd focus only on the non-transposed version.

And I agree that we can use edgelist.

typename graph_t::vertex_type const* ptr_d_start,
index_t num_vertices,
index_t sampling_size,
ops::sampling::SampleTypeT sampling_type);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume sampling_type captures how the sampling is done. This should probably be injected as a template policy / strategy (probably a functor: say, sampling_strategy_t ), rather than an enum, but that would complicate dispatching from C-API / Python.

I assume, then there would be a detail::<name>_impl that has that template policy, and this API would just instantiate the _impl with the proper sampling_strategy_t template argument. Is my assumption correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this captures how sampling is done.

There are several dimensions for how sampling is done actually:

  1. The sampling algorithm being used to select the indexes: this will have an effect in terms of performance (possibly depending on input data), but it will not affect the functionality.
  2. The kind of sampling in terms of functionality: for example, whether we use weights or not. I didn't plan to add this in the API for now, but with a functor (sampling_strategy_t kind of thing as you suggested, we may be able to add this already). This complicates things a little bit because we would have to read in data based on whether the functor requires it, but I'll see if I can think of an easy way to do that.
  3. Whether to extract other data along with the sampled IDs: this will be useful for heterogeneous graphs, but it isn't addressed in this API at all since we would have to return more than just the sampled IDs. So this will be another overload in the future.

And yes, your assumption is correct that there will be a sampling_strategy_t in the implementation which will be selected based on the enum value.

For now, we don't need to care so much about C-API / Python since the integration of this with DGL will happen at the C++ level AFAIU, but the reason why we had an enum in cugnn is because it facilitates integration with Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all of this to say that I'm open to changing this to a template type if it makes more sense to you, but basically, I'd consider this template type more of an implementation detail which the user shouldn't have to include.

Additionally, users will (in general) not have access to the implementation of that functor since cugraph-ops is closed-source. This means that there isn't much of an added value to the user to have such a functor interface, IMO.

At some point in the future, we may want to change this and allow users to add their own functors but this will most likely by as part of a revised/new API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to create a GitHub issue with the larger picture of the goals that these pieces of API are trying to achieve (and link to that issue in this PR).

Specifically, I would like to see if I can, for example, use this for the purposes of MNMG MFG, or if there is some overlap. So that we can avoid code / effort duplication. Because I'm also looking into a generic MNMG neighborhood retrieval; by that I mean the process of scanning the neighborhood of a (set of) vertex (vertices) and retrieving some vertices / edges in that neighborhood according with an injected how criteria (again, most likely a static strategy = template policy). I'm specifically avoiding here the term neighborhood sampling because that is too closely tied with G(C)NN's purposes and random processes. As I see it, neighborhood sampling are just special cases of neighborhood retrieval that are dedicated to some specific G(C)NN purposes.

What I'm after is the more generic neighborhood retrieval one (I assumed BFS is that, but per recent conversations that might be too expensive; so something more "customizable" is needed here, where a retrieval policy can be injected, which knows how to take advantage of shortcuts and not "oversolve", like BFS does; that being said, a customized multi-seed BFS might be how this could be implemented, etc.). So then the question is: is there any overlap between this idea of generic neighborhood retrieval and the implementation(s) you are proposing here? If so, let's identify the overlap and plan accordingly to not duplicate code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I forgot to link the issue, I commented below and edited the description above as well.

I agree that you'll be able to use part of the implementation of this API in order to implement other things, like MNMG neighborhood sampling or possibly a more generic neighborhood retrieval.

But what is the purpose of making the public-facing API more generic?
Do you have specific applications of the more generic neighborhood retrieval in mind?
Do we know whether the users want this generic neighborhood retrieval API over a set of more specific APIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this, even though the implementation can be re-used for selecting neighbors in other ways, it will be mostly tuned for uniform neighborhood sampling.
Thus, it's likely that other implementations are better for different ways of selecting neighbors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, the suggested API with output template<typename retrieval_strategy_t,...> is highly problematic since the implementation within cugraph-ops will be closed source, at least for now.

Thus, you would have to give me an exact list of instantiations of this retrieval_strategy_t in order for us not to expose the implementation.
Do you have this list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that it's SG doesn't seem to be mentioned in the linked issue (unless I missed it). If it's not, then probably it should. In fact all this comment above is much more descriptive than the content in the issue, I think.

Also, MFG is mentioned in the issue. It would probably be helpful if we don't both work on implementing overlapping functionality. However, given the specifics of MNMG implementation, there probably is not a real overlap. We need to revisit this later and possibly refactor. So, in light of the SG spec for this I will then assume there isn't any overlap and the 2 implementations, although both targetting MFG but in different contexts, don't have common parts (yet).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how you plan to provide an open source API for which some implementation details are closed source. This intention along with some information about what exactly is closed source and how (to the extent that that information is shareable...you've already shared such information above on this public forum, so I assume some of that info is shareable) should be added into the description of the issue and / or this PR. Not become uncovered in PR comments.

The more we discuss this the more gets revealed. At this point I am confused about the scope of this PR. It seems that a lot of questions are still un-answered about that. Perhaps answering them before pushing an API signature might make more sense.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic. The immediate needs for this would have been, again, MFG and RW (MNMG, both). I don't know what other possible future uses of it might be, yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've edited the issue and I'm mentioning cugraph-ops in the issue description now, which will provide the implementation.

I don't have an exact list of possible retrieval_strategy_t, by virtue of it being generic.

In this case, we cannot use your suggested API. What we could do is to allow users to provide an opaque structure representing an operator, similar to what cusparse does in SpMMOp (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op).
Edit: I am aware that this is a C API, but C++ will not allow us to use compile-time typing here either since that obviously involves access to the sources of cugraph-ops.
Because of the complications this involves and because there is no indication that we will use this for anything other than random selection at the moment, I'd suggest not to opt for that right away.
We can always provide such an API later on if necessary.

@seunghwak
Copy link
Contributor

In general, I'd prefer to restrict input graph to CSR for now.

Sounds fine to me. Actually, CSC-ish formats are used only for a limited set of algorithms (PageRank, Katz Centrality, and HITS) and we use CSR-ish (-ish as it is no-longer a simple CSR in multi-GPU) formats by default.

@rlratzel rlratzel added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Dec 6, 2021
@MatthiasKohl
Copy link
Contributor Author

Forgot to link the issue here, I also edited the description above: #1978

@MatthiasKohl
Copy link
Contributor Author

And we may use edgelist instead of coo in the public API name. cugraph has from_cudf_edgeelist function so it might be more consistent and edgelist sounds more generic to me than coo.

Coming back to this, for the CSR interface, it actually doesn't make much sense to return an edgelist. Plus, the edgelist structure contains weights as well which makes sense to have in the future, but it's likely that DGL will simply gather things like weights and/or node/edge type (heterogeneous graphs) afterwards anyway. So for now, I think it makes more sense to simply return the CSR or COO structure directly.

Plus, in the mid-term, we definitely want to use the CSR-based method since both DGL and PyG use CSR ultimately.
The COO-based interface is mainly a compromise to allow DGL to integrate more easily in the beginning.

@seunghwak
Copy link
Contributor

seunghwak commented Dec 7, 2021

I agree that we should return COO or CSR.

I am just complaining about function names. I believe edge list and COO basically mean the same thing and adjacency list and CSR are (roughly) same as well, but COO & CSR are more rigorous technical terminologies for hardcore computer scientists while data scientists may feel more familiar with edge list or adjacency list (please correct me this understanding is not right). NetworkX has from_pandas_edgelist and from_pandas_adjacency but AFAIK, NetworkX doesn't use CSR or COO in their public API.

We have used COO & CSR in our legacy graph classes, but we're sort of moving away from these terminologies in our new public API. So, I am a bit uncomfortable about using COO or CSR in our public API, but pretty much agree that we should return COO or CSR.

@MatthiasKohl
Copy link
Contributor Author

I am just complaining about function names. I believe edge list and COO basically mean the same thing and adjacency list and CSR are (roughly) same as well, but COO & CSR are more rigorous technical terminologies for hardcore computer scientists while data scientists may feel more familiar with edge list or adjacency list (please correct me this understanding is not right).

I do agree with this, I think the reason why I used COO/CSR is because DGL & PyG already use those terms and they are of course the primary target for this API.
I'm happy to rename it to something else though, maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ?
Do you have any other/better suggestions?

@seunghwak
Copy link
Contributor

seunghwak commented Dec 7, 2021

maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ? Do you have any other/better suggestions?

Either is fine but for some reason, we have been using edgelist instead of edge_list (maybe due to NetowrkX's from_pandas_edgelist). I think we'd better be consistent and stick with edgelist or change edgelist elsewhere in the C++ public API to edge_list.

@ChuckHastings Any thoughts?

edgelist is somewhat OK, but I guess adjacency_list is better than adjacencylist (NetworkX is skipping list in from_pandas_adjacency but I guess this is a bit inconsistent).

@github-actions
Copy link

github-actions bot commented Jan 7, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@MatthiasKohl MatthiasKohl changed the title [API Review] neighbor sampling in COO/CSR format [WIP] [FEA] neighbor sampling in COO/CSR format Jan 12, 2022
@MatthiasKohl
Copy link
Contributor Author

@seunghwak @aschaffer I addressed your comments above and added an implementation based on cugraph-ops.
Of course, this cannot work yet since cugraph-ops isn't available for build right now, but could you already take a look at the updated interface and implementation to see if this makes sense to you in general?

In terms of the interface, I mainly changed the names and added a Rng argument which will be needed to keep random state in between calls to this API.
This argument can be treated essentially the same way as the RAFT handle for the user of the API.

@MatthiasKohl MatthiasKohl requested a review from a team as a code owner February 21, 2022 12:47
@MatthiasKohl MatthiasKohl changed the title [WIP] [FEA] neighbor sampling in COO/CSR format [FEA] neighbor sampling in COO/CSR format Feb 21, 2022
@MatthiasKohl MatthiasKohl requested a review from a team as a code owner February 21, 2022 16:05
@@ -285,6 +285,7 @@ target_include_directories(cugraph
# - link libraries -------------------------------------------------------------
target_link_libraries(cugraph
PUBLIC
cugraphops::cugraphops
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As cugraphops is part of the public api for cugraph, it needs to be properly exported.

This is causing the CI failures as consumers of the cugraph-config.cmake have no way of finding cugraphops::cugraphops. This causes the below error:

11:32:16   Target "cugraph_etl" links to target "cugraphops::cugraphops" but the
11:32:16   target was not found.  Perhaps a find_package() call is missing for an
11:32:16   IMPORTED target, or an ALIAS target is missing?

https://github.com/rapidsai/cugraph/blob/branch-22.04/cpp/cmake/thirdparty/get_libcugraphops.cmake
needs to add BUILD and INSTALL export set entries to the rapids_find_generate_module and rapids_find_package lines ( cugraph-exports is the export set ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertmaynard if you see the similar setup in cuML for the libcumlprims, it seems to not follow your suggestion, but that's working for cuML. Any ideas why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because cumlprims_mg is a private dependency of cuml and therefore doesn't need to be found by consumers of cuml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks Robert.

Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

@BradReesWork
Copy link
Member

@gpucibot merge

@ChuckHastings
Copy link
Collaborator

maybe neighbor_sampling_edge_list or sample_neighbors_edge_list ? Do you have any other/better suggestions?

Either is fine but for some reason, we have been using edgelist instead of edge_list (maybe due to NetowrkX's from_pandas_edgelist). I think we'd better be consistent and stick with edgelist or change edgelist elsewhere in the C++ public API to edge_list.

@ChuckHastings Any thoughts?

edgelist is somewhat OK, but I guess adjacency_list is better than adjacencylist (NetworkX is skipping list in from_pandas_adjacency but I guess this is a bit inconsistent).

Don't have a strong preference. I agree that networkx is inconsistent. I also don't think it's necessary for us to follow networkx at the C++ level. It is sometimes convenient. But I expect as we start integrating with more graph frameworks we will find different organizational structures and naming conventions that will be different.

But for now I'd say let's go with what's in the PR. We can always rename things later if we settle on a different naming convention.

@aschaffer aschaffer self-requested a review February 22, 2022 18:13
@ChuckHastings
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e95171f into rapidsai:branch-22.04 Feb 22, 2022
22.04 Release automation moved this from PRs to Done Feb 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 17, 2022
This PR tracks work on MNMG Neighborhood Sampling, for G(C)NN needs.

Dependencies:
1. #1982
2. #2064
3. Integration of rapidsai/cugraph-ops#24 into `cugraph`

Authors:
  - Andrei Schaffer (https://github.com/aschaffer)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Kumar Aatish (https://github.com/kaatish)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)

URL: #2073
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
No open projects
22.04 Release
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

9 participants