Parallelize child column construction in scatter() for lists columns #6791

Closed

Conversation

mythrocks
Contributor

@mythrocks mythrocks commented Nov 17, 2020

This is a followup to #6768 (which adds scatter() support for list columns). @harrism advises that the child-column construction could be a lot faster:

This is doing a parallel-for with each thread doing a sequential copy of a range of values to a destination offset. This means your parallelism is limited to the length of list_vector, rather than to the size of child_column. This is important because if you have only k=100 rows in the list column but a total of N=100M elements, you will only use 100 threads and each thread will process 1M elements. This could be suboptimal by a factor of 10-100K or more!

If you flatten these loops you can get full parallelism. Each thread just has to figure out which row it is part of using a thrust::lower_bound(thrust::seq, ...), and use that to look up the appropriate destination index. Then you can just use a single thrust::transform to avoid having to capture offsets or index the array directly.

This would use N threads and have work complexity O(N log k). This is fine as long as k << N. You can do better (asymptotically). You can convert the offsets to head flags using a thrust::scatter, and then scan the head flags to get the offset for each thread, and then do your O(N) transform.

Offsets: [0, 3, 5, 9]
Scatter to head flags (scatter a 1 to each location in offsets) (O(k))
Head flags: [0, 0, 0, 1, 0, 1, 0, 0, 0]
Exclusive scan (O(N)) to get offsets:
Offsets: [0, 0, 0, 1, 1, 2, 2, 2, 2]
Then use the offsets in your transform

This is O(k) + O(N) + O(N) == O(N), so asymptotically better, but it requires 3 Thrust calls (kernel launches) instead of 1, so it may be slower for small N. For our example above (k=100, N=100M), the scan version could be up to 10x faster than the lower_bound version; but for a lot of short lists, the lower_bound version might be faster because it launches only a single kernel.
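For reference, the two approaches can be sketched on the host, with std:: algorithms standing in for their Thrust counterparts. The helper names here are made up for illustration, and the offsets are assumed to contain no duplicates (i.e. no empty lists):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// lower_bound version: O(N log k). Each child element searches the offsets
// to find the list row it belongs to.
std::vector<int> row_map_lower_bound(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  std::vector<int> map(num_child_rows);
  for (int i = 0; i < num_child_rows; ++i) {
    // The index of the last offset <= i is the owning row.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
    map[i]  = static_cast<int>(it - offsets.begin()) - 1;
  }
  return map;
}

// scatter + scan version: O(k) + O(N). Scatter a 1 at each interior offset
// to build head flags, then inclusive-scan the flags to get the row map.
std::vector<int> row_map_scan(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  std::vector<int> flags(num_child_rows, 0);
  for (std::size_t k = 1; k + 1 < offsets.size(); ++k) flags[offsets[k]] = 1;
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

Both produce the same mapping for offsets [0, 3, 5, 9]; the difference is only in work complexity and kernel-launch count.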

My initial attempt at this works for columns without empty lists. E.g.

Offsets:        [0,       3,    5, 5,      9] // ------> Lists[2] is an empty list.
Head flags:     [0, 0, 0, 1, 0, 1, 0, 0, 0] // -----> Scatter produces the same mapping as with [0,3,5,9].
Inclusive scan: [0, 0, 0, 1, 1, 2, 2, 2, 2] // -----> Scan produces the same output as before.

For the above, we should have ideally produced [0, 0, 0, 1, 1, 3, 3, 3, 3] to correctly map the children back to the lists. This needs figuring out to support empty lists.

@mythrocks mythrocks added the 2 - In Progress Currently a work in progress label Nov 17, 2020
@mythrocks mythrocks requested review from a team as code owners November 17, 2020 22:44
@mythrocks mythrocks self-assigned this Nov 17, 2020
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks mythrocks marked this pull request as draft November 17, 2020 22:45
@harrism
Member

harrism commented Nov 17, 2020

For the above, we should have ideally produced [0, 0, 0, 1, 1, 3, 3, 3, 3] to correctly map the children back to the lists. This needs figuring out to support empty lists.

Just change your scatter to atomic increment rather than setting the value to 1 for each offset.

Offsets:        [0,       3,    5, 5,      9] // ------> Lists[2] is an empty list.
Head flags:     [0, 0, 0, 1, 0, 2, 0, 0, 0] // -----> Atomic increment writes 2 at the repeated offset 5.
Inclusive scan: [0, 0, 0, 1, 1, 3, 3, 3, 3] // -----> Scan now produces the desired mapping.

BTW, can you please file an issue for this for tracking (rather than just a PR)?

@harrism
Copy link
Member

harrism commented Nov 18, 2020

Hmmm, I realized that there's no way to atomically increment the scattered output with Thrust... This requires some thought.

@harrism
Copy link
Member

harrism commented Nov 18, 2020

Ah. You could use a thrust::reduce_by_key on offsets to get the number of lists with the same offset. Then scatter the result of that rather than just scattering ones. This would not be as efficient as the atomic version I mentioned, but at least it's possible with pure Thrust. It would still overall be O(N).
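As a host-side sketch of this idea (the helper name is illustrative, and std:: algorithms stand in for reduce_by_key/scatter/scan; the handling of duplicate offsets at position 0 is an extra wrinkle not covered in the comment above):

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Runs of equal offsets are counted (a run of length r at offset v means
// r-1 empty lists end at v); the run lengths, rather than constant 1s,
// are scattered into the head flags, which are then scanned.
std::vector<int> row_map_with_empty_lists(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  if (num_child_rows == 0) return {};

  // reduce_by_key equivalent: (unique offset, run length) pairs.
  std::vector<std::pair<int, int>> runs;
  for (int v : offsets) {
    if (!runs.empty() && runs.back().first == v)
      ++runs.back().second;
    else
      runs.emplace_back(v, 1);
  }

  std::vector<int> flags(num_child_rows, 0);
  for (auto const& [off, count] : runs) {
    if (off == 0)
      flags[0] += count - 1;  // empty lists in row 0 shift every child up
    else if (off < num_child_rows)
      flags[off] = count;     // runs at offset num_child_rows own no children
  }
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

For offsets [0, 3, 5, 5, 9] this yields [0, 0, 0, 1, 1, 3, 3, 3, 3], the desired mapping from the earlier comments.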

Member

@harrism harrism left a comment


Should be easy to support empty lists.

Comment on lines +275 to +303
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");

  auto scatter_map   = cudf::slice(list_offsets, {1, list_offsets.size() - 1})[0];
  auto d_scatter_map = scatter_map.data<cudf::size_type>();
  auto ret       = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  auto scatter_1 = thrust::make_constant_iterator<cudf::size_type>(1);

  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  scatter_1,
                  scatter_1 + scatter_map.size(),
                  d_scatter_map,
                  ret.begin());

  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());

  return ret;
}
Member


To support empty lists...

Suggested change
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");
  auto scatter_map   = cudf::slice(list_offsets, {1, list_offsets.size() - 1})[0];
  auto d_scatter_map = scatter_map.data<cudf::size_type>();
  auto ret       = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  auto scatter_1 = thrust::make_constant_iterator<cudf::size_type>(1);
  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  scatter_1,
                  scatter_1 + scatter_map.size(),
                  d_scatter_map,
                  ret.begin());
  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());
  return ret;
}
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_rows,
                                                              cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");
  // reduce_by_key over the offsets: runs of equal offsets mark empty lists.
  // The offsets column has num_rows + 1 entries, so size the outputs to match.
  rmm::device_uvector<cudf::size_type> head_keys(num_rows + 1, stream);
  rmm::device_uvector<cudf::size_type> head_flags(num_rows + 1, stream);
  auto new_end = thrust::reduce_by_key(rmm::exec_policy(stream)->on(stream.value()),
                                       list_offsets.begin<cudf::size_type>(),
                                       list_offsets.end<cudf::size_type>(),
                                       thrust::make_constant_iterator<cudf::size_type>(1),
                                       head_keys.begin(),
                                       head_flags.begin());
  auto ret = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  // Skip the first run (key 0) and the last run (key == num_child_rows):
  // their keys fall outside [0, num_child_rows), i.e. slice after the reduce_by_key.
  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  head_flags.begin() + 1,
                  new_end.second - 1,
                  head_keys.begin() + 1,
                  ret.begin());
  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());
  return ret;
}

@harrism harrism added this to PR-WIP in v0.18 Release via automation Nov 18, 2020
@harrism
Member

harrism commented Nov 18, 2020

A benchmark should be added as part of this PR, to demonstrate the value before and after, and to catch future regressions. There is an existing scatter benchmark (under copying); we just need to add benchmark cases for the different nested types.

@mythrocks
Contributor Author

I finally got my head around what needs doing here, based on the following example from @harrism:

[0, 0, 3, 5, 5, 9, 9]
    --reduce_by_key--> ([0, 3, 5, 9], [2, 1, 2, 2])
    --slice-->         ([3, 5], [1, 2])
    --scatter-->       [0, 0, 0, 1, 0, 2, 0, 0, 0]
    --scan-->          [0, 0, 0, 1, 1, 3, 3, 3, 3]

I've been slicing the offsets column before the scatter; that slicing can stay, but it needs to happen after the reduce_by_key().
This is heady stuff.
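Those stages can be traced on the host; this sketch uses std:: algorithms in place of the Thrust calls and mirrors the slice-after-reduce_by_key ordering (the function name is illustrative):

```cpp
#include <numeric>
#include <vector>

// Trace of the pipeline above: reduce_by_key --> slice --> scatter --> scan,
// with each intermediate matching the values in the example.
std::vector<int> child_row_to_list_map(std::vector<int> const& offsets, int num_child_rows)
{
  // reduce_by_key: unique offsets and their run lengths.
  std::vector<int> keys, counts;
  for (int v : offsets) {
    if (!keys.empty() && keys.back() == v)
      ++counts.back();
    else {
      keys.push_back(v);
      counts.push_back(1);
    }
  }
  // slice: drop the first run (offset 0) and the last (== num_child_rows);
  // scatter: write each surviving run length at its offset position.
  std::vector<int> flags(num_child_rows, 0);
  for (std::size_t k = 1; k + 1 < keys.size(); ++k) flags[keys[k]] = counts[k];
  // scan: inclusive scan of the head flags yields the row map.
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

On the example offsets [0, 0, 3, 5, 5, 9, 9] this reproduces the final map [0, 0, 0, 1, 1, 3, 3, 3, 3] shown above.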

@mythrocks mythrocks added 3 - Ready for Review Ready for review by team 2 - In Progress Currently a work in progress and removed 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team labels Nov 24, 2020
@github-actions

This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@harrism
Member

harrism commented Feb 16, 2021

@mythrocks do you plan to resume this?

@github-actions github-actions bot removed the stale label Feb 17, 2021
@mythrocks
Contributor Author

@mythrocks do you plan to resume this?

I'd like to resume this work at some point. We were able to put the idea to use in #7189.

This PR has gone stale. I'll raise a new PR when I do resume.

@mythrocks mythrocks closed this Feb 18, 2021
v0.19 Release automation moved this from PR-WIP to Done Feb 18, 2021
@mythrocks
Contributor Author

#8255 addresses this. Thanks, @isVoid!
