[WIP] Optimize Sampling for graph_store #2283

Merged: 2 commits, May 19, 2022

Conversation

VibhuJawa
Member

This PR speeds up the sampling function in graph_store by more than 3x by removing the host-side code and running the sampling end to end on the GPU.

More importantly, this change makes the actual sampling in batched_ego_graphs the bottleneck. Previously we spent only about 32% of the time in the core sampling code (extract_subgraph plus batched_ego_graphs); now we spend 98.5% of the time there.
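As a quick check of these figures, the short calculation below uses only the totals and per-line timings from the line_profiler output in this description. Counting `extract_subgraph` together with `batched_ego_graphs` as the "core sampling code" is my reading of the numbers, not something the description states explicitly, and the variable names are illustrative.

```python
# Timings copied from the profiles in this PR description (microseconds).
before_total_us = 17.3772e6   # "Total time: 17.3772 s"
after_total_us = 5.73069e6    # "Total time: 5.73069 s"

# Profile lines 108-112: extract_subgraph + batched_ego_graphs.
# Assumption: these lines are what "core sampling code" refers to.
before_core_us = 129790 + 1 + 5467307 + 1 + 0
after_core_us = 143943 + 0 + 5500286 + 1 + 0

speedup = before_total_us / after_total_us
before_core_pct = 100 * before_core_us / before_total_us
after_core_pct = 100 * after_core_us / after_total_us

print(f"speedup: {speedup:.2f}x")
print(f"core share before: {before_core_pct:.1f}%")
print(f"core share after:  {after_core_pct:.1f}%")
```

The core sampling calls take nearly the same absolute time in both runs (about 5.6 s); the speedup comes almost entirely from removing the work around them.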

See the benchmarks below:

Before PR:

Timer unit: 1e-06 s

Total time: 17.3772 s
File: /home/nfs/vjawa/dgl/cugraph/python/cugraph/cugraph/gnn/graph_store.py
Function: sample_neighbors at line 73

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    73                                               def sample_neighbors(self,
    74                                                                    nodes,
    75                                                                    fanout=-1,
    76                                                                    edge_dir='in',
    77                                                                    prob=None,
    78                                                                    replace=False):
    79                                                   """
    80                                                   Sample neighboring edges of the given nodes and return the subgraph.
    81                                           
    82                                                   Parameters
    83                                                   ----------
    84                                                   nodes : array (single dimension)
    85                                                       Node IDs to sample neighbors from.
    86                                                   fanout : int
    87                                                       The number of edges to be sampled for each node on each edge type.
    88                                                   edge_dir : str {"in" or "out"}
    89                                                       Determines whether to sample inbound or outbound edges.
    90                                                       Can take either in for inbound edges or out for outbound edges.
    91                                                   prob : str
    92                                                       Feature name used as the (unnormalized) probabilities associated
    93                                                       with each neighboring edge of a node. Each feature must be a
    94                                                       scalar. The features must be non-negative floats, and the sum of
    95                                                       the features of inbound/outbound edges for every node must be
    96                                                       positive (though they don't have to sum up to one). Otherwise,
    97                                                       the result will be undefined. If not specified, sample uniformly.
    98                                                   replace : bool
    99                                                       If True, sample with replacement.
   100                                           
   101                                                   Returns
   102                                                   -------
   103                                                   CuPy array
   104                                                       The sampled arrays for bipartite graph.
   105                                                   """
   106         1         18.0     18.0      0.0          num_nodes = len(nodes)
   107         1       7833.0   7833.0      0.0          current_seeds = nodes.reindex(index=np.arange(0, num_nodes))
   108         2     129790.0  64895.0      0.7          _g = self.__G.extract_subgraph(create_using=cugraph.Graph,
   109         1          1.0      1.0      0.0                                         allow_multi_edges=True)
   110         2    5467307.0 2733653.5     31.5          ego_edge_list, seeds_offsets = batched_ego_graphs(_g,
   111         1          1.0      1.0      0.0                                                            current_seeds,
   112         1          0.0      0.0      0.0                                                            radius=1)
   113         1        123.0    123.0      0.0          all_parents = cupy.ndarray(0)
   114         1         12.0     12.0      0.0          all_children = cupy.ndarray(0)
   115                                                   # filter and get a certain size neighborhood
   116      1001       1143.0      1.1      0.0          for i in range(1, len(seeds_offsets)):
   117      1000     262330.0    262.3      1.5              pos0 = seeds_offsets.values_host[i-1]
   118      1000     211487.0    211.5      1.2              pos1 = seeds_offsets.values_host[i]
   119      1000     335515.0    335.5      1.9              edge_list = ego_edge_list[pos0:pos1]
   120                                                       # get randomness fanout
   121      1000    6202089.0   6202.1     35.7              filtered_list = edge_list[edge_list['dst'] == current_seeds[i-1]]
   122                                           
   123                                                       # get sampled_list
   124      1000      14097.0     14.1      0.1              if len(filtered_list) > fanout:
   125      1654      19502.0     11.8      0.1                  sampled_indices = random.sample(
   126       827     192781.0    233.1      1.1                          filtered_list.index.to_arrow().to_pylist(), fanout)
   127       827    4080293.0   4933.8     23.5                  filtered_list = filtered_list.reindex(index=sampled_indices)
   128                                           
   129      1000     146293.0    146.3      0.8              children = cupy.asarray(filtered_list['src'])
   130      1000     126122.0    126.1      0.7              parents = cupy.asarray(filtered_list['dst'])
   131      1000     105440.0    105.4      0.6              all_parents = cupy.append(all_parents, parents)
   132      1000      74987.0     75.0      0.4              all_children = cupy.append(all_children, children)
   133         1          1.0      1.0      0.0          return all_parents, all_children
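To see where the time goes in the profile above, the snippet below adds up the `Time` column for the per-seed loop (profile lines 116 to 132). The numbers are copied from the table; the dictionary layout and names are only illustrative.

```python
# Time (microseconds) per profile line inside the per-seed loop, lines 116-132.
loop_times_us = {
    116: 1143, 117: 262330, 118: 211487, 119: 335515,
    121: 6202089,            # boolean filter on edge_list['dst']
    124: 14097, 125: 19502, 126: 192781,
    127: 4080293,            # reindex with host-sampled indices
    129: 146293, 130: 126122, 131: 105440, 132: 74987,
}
total_us = 17.3772e6         # "Total time: 17.3772 s"

loop_us = sum(loop_times_us.values())
loop_pct = 100 * loop_us / total_us
top_two_pct = 100 * (loop_times_us[121] + loop_times_us[127]) / total_us

print(f"loop: {loop_us / 1e6:.2f} s ({loop_pct:.1f}% of total)")
print(f"filter + reindex alone: {top_two_pct:.1f}% of total")
```

About two thirds of the run is spent in the 1000-iteration loop, and most of that in the two per-seed DataFrame operations.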

After PR:

Timer unit: 1e-06 s

Total time: 5.73069 s
File: /datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.8/site-packages/cugraph-22.6.0a0+86.gd9ec8f718.dirty-py3.8-linux-x86_64.egg/cugraph/gnn/graph_store.py
Function: sample_neighbors at line 73

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 73                                               def sample_neighbors(self,
 74                                                                    nodes,
 75                                                                    fanout=-1,
 76                                                                    edge_dir='in',
 77                                                                    prob=None,
 78                                                                    replace=False):
 79                                                   """
 80                                                   Sample neighboring edges of the given nodes and return the subgraph.
 81                                           
 82                                                   Parameters
 83                                                   ----------
 84                                                   nodes : array (single dimension)
 85                                                       Node IDs to sample neighbors from.
 86                                                   fanout : int
 87                                                       The number of edges to be sampled for each node on each edge type.
 88                                                   edge_dir : str {"in" or "out"}
 89                                                       Determines whether to sample inbound or outbound edges.
 90                                                       Can take either in for inbound edges or out for outbound edges.
 91                                                   prob : str
 92                                                       Feature name used as the (unnormalized) probabilities associated
 93                                                       with each neighboring edge of a node. Each feature must be a
 94                                                       scalar. The features must be non-negative floats, and the sum of
 95                                                       the features of inbound/outbound edges for every node must be
 96                                                       positive (though they don't have to sum up to one). Otherwise,
 97                                                       the result will be undefined. If not specified, sample uniformly.
 98                                                   replace : bool
 99                                                       If True, sample with replacement.
100                                           
101                                                   Returns
102                                                   -------
103                                                   CuPy array
104                                                       The sampled arrays for bipartite graph.
105                                                   """
106         1         20.0     20.0      0.0          num_nodes = len(nodes)
107         1       7681.0   7681.0      0.1          current_seeds = nodes.reindex(index=np.arange(0, num_nodes))
108         2     143943.0  71971.5      2.5          _g = self.__G.extract_subgraph(create_using=cugraph.Graph,
109         1          0.0      0.0      0.0                                         allow_multi_edges=True)
110         2    5500286.0 2750143.0     96.0          ego_edge_list, seeds_offsets = batched_ego_graphs(_g,
111         1          1.0      1.0      0.0                                                            current_seeds,
112         1          0.0      0.0      0.0                                                            radius=1)
113                                                   # filter and get a certain size neighborhood
114                                           
115                                                   # Step 1
 116                                                   # Get Filtered List of ego_edge_list corresponding to current_seeds
 117                                                   # We filter by creating a series of destination nodes
 118                                                   # corresponding to the offsets and filtering non matching values
119                                           
120         1        719.0    719.0      0.0          seeds_offsets_s = cudf.Series(seeds_offsets).values
121         1        174.0    174.0      0.0          offset_lens = seeds_offsets_s[1:] - seeds_offsets_s[0:-1]
122         1       4042.0   4042.0      0.1          dst_seeds = current_seeds.repeat(offset_lens)
123         1        637.0    637.0      0.0          dst_seeds.index = ego_edge_list.index
124         1       5196.0   5196.0      0.1          filtered_list = ego_edge_list[ego_edge_list["dst"] == dst_seeds]
125                                           
126                                                   # Step 2
127                                                   # Sample Fan Out
128                                                   # for each dst take maximum of fanout samples
129         2      67247.0  33623.5      1.2          filtered_list = sample_groups(filtered_list,
130         1          1.0      1.0      0.0                                        by="dst",
131         1          1.0      1.0      0.0                                        n_samples=fanout)
132                                           
133         1        744.0    744.0      0.0          return filtered_list['src'].values,  filtered_list['dst'].values
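The two steps in the profile above can be sketched on the CPU as follows. This is a minimal NumPy stand-in for the cuDF/CuPy code, not the PR's implementation: the body of `sample_groups` is not shown in this description, so `sample_per_group` below is a hypothetical substitute that samples without replacement by ranking random keys within each `dst` group. It assumes a positive `fanout`; `filter_by_seed` and the toy data are also illustrative.

```python
import numpy as np


def filter_by_seed(src, dst, seeds, offsets):
    """Step 1: keep only edges whose dst equals the seed of their ego graph."""
    lens = np.diff(offsets)                # number of edges per ego graph
    dst_seeds = np.repeat(seeds, lens)     # one seed id per edge row
    mask = dst == dst_seeds
    return src[mask], dst[mask]


def sample_per_group(src, dst, fanout, rng):
    """Step 2 (hypothetical): keep at most `fanout` random edges per dst."""
    if dst.size == 0:
        return src, dst
    keys = rng.random(dst.size)
    order = np.lexsort((keys, dst))        # sort by dst, random order within dst
    src_s, dst_s = src[order], dst[order]
    is_start = np.r_[True, dst_s[1:] != dst_s[:-1]]
    start_idx = np.flatnonzero(is_start)   # first row of each dst group
    group_id = np.cumsum(is_start) - 1
    rank = np.arange(dst_s.size) - start_idx[group_id]
    keep = rank < fanout
    return src_s[keep], dst_s[keep]


# Toy input: two seeds, ego-graph edge lists concatenated, offsets mark boundaries.
seeds = np.array([0, 1])
offsets = np.array([0, 4, 6])
src = np.array([1, 2, 3, 1, 0, 2])
dst = np.array([0, 0, 0, 2, 1, 0])

f_src, f_dst = filter_by_seed(src, dst, seeds, offsets)
s_src, s_dst = sample_per_group(f_src, f_dst, fanout=2,
                                rng=np.random.default_rng(0))
print(f_src, f_dst)
print(s_src, s_dst)
```

Both steps operate on whole arrays at once, which is what lets the GPU version avoid the per-seed host loop and the device-to-host transfers it required.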

Todo:

  • Add Unit Tests

@VibhuJawa VibhuJawa requested a review from a team as a code owner May 17, 2022 22:22
@rlratzel rlratzel added the improvement, non-breaking, and python labels May 18, 2022
@rlratzel rlratzel added this to the 22.06 milestone May 18, 2022
@codecov-commenter

codecov-commenter commented May 19, 2022

Codecov Report

Merging #2283 (25af2e1) into branch-22.06 (38be932) will decrease coverage by 7.02%.
The diff coverage is n/a.

❗ Current head 25af2e1 differs from pull request most recent head c79b2b2. Consider uploading reports for the commit c79b2b2 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06    #2283      +/-   ##
================================================
- Coverage         70.82%   63.80%   -7.03%     
================================================
  Files               170      100      -70     
  Lines             11036     4481    -6555     
================================================
- Hits               7816     2859    -4957     
+ Misses             3220     1622    -1598     
Impacted Files Coverage Δ
python/cugraph/cugraph/__init__.py 100.00% <ø> (ø)
python/cugraph/cugraph/centrality/__init__.py 100.00% <ø> (ø)
...graph/cugraph/centrality/betweenness_centrality.py 89.65% <ø> (ø)
...on/cugraph/cugraph/centrality/degree_centrality.py 81.81% <ø> (ø)
...thon/cugraph/cugraph/centrality/katz_centrality.py 88.23% <ø> (-1.24%) ⬇️
python/cugraph/cugraph/community/egonet.py 97.36% <ø> (ø)
...ython/cugraph/cugraph/community/ktruss_subgraph.py 88.23% <ø> (+2.94%) ⬆️
python/cugraph/cugraph/community/leiden.py 100.00% <ø> (+7.69%) ⬆️
python/cugraph/cugraph/community/louvain.py 100.00% <ø> (+7.69%) ⬆️
python/cugraph/cugraph/community/triangle_count.py 100.00% <ø> (+11.11%) ⬆️
... and 139 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d9ec8f7...c79b2b2.

@BradReesWork
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3dcc4b8 into rapidsai:branch-22.06 May 19, 2022