Parallelize child column construction in scatter() for lists columns #6791

Closed

Conversation

mythrocks
Contributor

@mythrocks mythrocks commented Nov 17, 2020

This is a followup to #6768 (which adds scatter() support for list columns). @harrism advises that the child-column construction could be a lot faster:

This is doing a parallel-for with each thread doing a sequential copy of a range of values to a destination offset. This means your parallelism is limited to the length of list_vector, rather than to the size of child_column. This is important because if you have only k=100 rows in the list column but a total of N=100M elements, you will only use 100 threads and each thread will process 1M elements. This could be suboptimal by a factor of 10-100K or more!

If you flatten these loops you can get full parallelism. Each thread just has to figure out which row it is part of using a thrust::lower_bound(thrust::seq, ...), and use that to look up the appropriate destination index. Then you can just use a single thrust::transform to avoid having to capture offsets or index the array directly.

This would use N threads and have work complexity O(N log k). This is fine as long as k << N. You can do better (asymptotically). You can convert the offsets to head flags using a thrust::scatter, and then scan the head flags to get the offset for each thread, and then do your O(N) transform.

Offsets: [0, 3, 5, 9]
Scatter to head flags (scatter a 1 to each location in offsets) (O(k))
Head flags: [0, 0, 0, 1, 0, 1, 0, 0, 0]
Exclusive scan (O(N)) to get offsets:
Offsets: [0, 0, 0, 1, 1, 2, 2, 2, 2]
Then use the offsets in your transform

This is O(k) + O(N) + O(N) == O(N), so asymptotically better, but it requires 3 Thrust calls (kernel launches) instead of 1, so it may be slower for small N. For our example above (k=100, N=100M), the scan version could be up to 10x faster than the lower_bound version; but for a lot of short lists, the lower_bound version might be faster because it launches only a single kernel.
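For reference, the two approaches can be sketched on the host, with std:: algorithms standing in for their Thrust counterparts. The helper names here are made up for illustration, and the offsets are assumed to contain no duplicates (i.e. no empty lists):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// lower_bound version: O(N log k). Each child element searches the offsets
// to find the list row it belongs to.
std::vector<int> row_map_lower_bound(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  std::vector<int> map(num_child_rows);
  for (int i = 0; i < num_child_rows; ++i) {
    // The index of the last offset <= i is the owning row.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
    map[i]  = static_cast<int>(it - offsets.begin()) - 1;
  }
  return map;
}

// scatter + scan version: O(k) + O(N). Scatter a 1 at each interior offset
// to build head flags, then inclusive-scan the flags to get the row map.
std::vector<int> row_map_scan(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  std::vector<int> flags(num_child_rows, 0);
  for (std::size_t k = 1; k + 1 < offsets.size(); ++k) flags[offsets[k]] = 1;
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

Both produce the same mapping for offsets [0, 3, 5, 9]; the difference is only in work complexity and kernel-launch count.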

My initial attempt at this works for columns without empty lists. E.g.

Offsets:        [0,       3,    5, 5,      9] // ------> Lists[2] is an empty list.
Head flags:     [0, 0, 0, 1, 0, 1, 0, 0, 0] // -----> Scatter produces the same mapping as with [0,3,5,9].
Inclusive scan: [0, 0, 0, 1, 1, 2, 2, 2, 2] // -----> Scan produces the same output as before.

For the above, we should have ideally produced [0, 0, 0, 1, 1, 3, 3, 3, 3] to correctly map the children back to the lists. This needs figuring out to support empty lists.

@mythrocks mythrocks added the 2 - In Progress Currently a work in progress label Nov 17, 2020
@mythrocks mythrocks requested review from a team as code owners November 17, 2020 22:44
@mythrocks mythrocks self-assigned this Nov 17, 2020
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks mythrocks marked this pull request as draft November 17, 2020 22:45
@harrism
Member

harrism commented Nov 17, 2020

For the above, we should have ideally produced [0, 0, 0, 1, 1, 3, 3, 3, 3] to correctly map the children back to the lists. This needs figuring out to support empty lists.

Just change your scatter to atomic increment rather than setting the value to 1 for each offset.

Offsets:        [0,       3,    5, 5,      9] // ------> Lists[2] is an empty list.
Head flags:     [0, 0, 0, 1, 0, 2, 0, 0, 0] // -----> Atomic increment writes 2 at the repeated offset 5.
Inclusive scan: [0, 0, 0, 1, 1, 3, 3, 3, 3] // -----> Scan now produces the desired mapping.

BTW, can you please file an issue for this for tracking (rather than just a PR)?

@harrism
Copy link
Member

harrism commented Nov 18, 2020

Hmmm, I realized that there's no way to atomically increment the scattered output with Thrust... This requires some thought.

@harrism
Copy link
Member

harrism commented Nov 18, 2020

Ah. You could use a thrust::reduce_by_key on offsets to get the number of lists with the same offset. Then scatter the result of that rather than just scattering ones. This would not be as efficient as the atomic version I mentioned, but at least it's possible with pure Thrust. It would still overall be O(N).
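As a host-side sketch of this idea (the helper name is illustrative, and std:: algorithms stand in for reduce_by_key/scatter/scan; the handling of duplicate offsets at position 0 is an extra wrinkle not covered in the comment above):

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Runs of equal offsets are counted (a run of length r at offset v means
// r-1 empty lists end at v); the run lengths, rather than constant 1s,
// are scattered into the head flags, which are then scanned.
std::vector<int> row_map_with_empty_lists(std::vector<int> const& offsets)
{
  int const num_child_rows = offsets.back();
  if (num_child_rows == 0) return {};

  // reduce_by_key equivalent: (unique offset, run length) pairs.
  std::vector<std::pair<int, int>> runs;
  for (int v : offsets) {
    if (!runs.empty() && runs.back().first == v)
      ++runs.back().second;
    else
      runs.emplace_back(v, 1);
  }

  std::vector<int> flags(num_child_rows, 0);
  for (auto const& [off, count] : runs) {
    if (off == 0)
      flags[0] += count - 1;  // empty lists in row 0 shift every child up
    else if (off < num_child_rows)
      flags[off] = count;     // runs at offset num_child_rows own no children
  }
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

For offsets [0, 3, 5, 5, 9] this yields [0, 0, 0, 1, 1, 3, 3, 3, 3], the desired mapping from the earlier comments.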

Member

@harrism harrism left a comment


Should be easy to support empty lists.

Comment on lines +275 to +303
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");

  auto scatter_map   = cudf::slice(list_offsets, {1, list_offsets.size() - 1})[0];
  auto d_scatter_map = scatter_map.data<cudf::size_type>();
  auto ret       = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  auto scatter_1 = thrust::make_constant_iterator<cudf::size_type>(1);

  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  scatter_1,
                  scatter_1 + scatter_map.size(),
                  d_scatter_map,
                  ret.begin());

  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());

  return ret;
}
Member


To support empty lists...

Suggested change
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");
  auto scatter_map   = cudf::slice(list_offsets, {1, list_offsets.size() - 1})[0];
  auto d_scatter_map = scatter_map.data<cudf::size_type>();
  auto ret       = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  auto scatter_1 = thrust::make_constant_iterator<cudf::size_type>(1);
  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  scatter_1,
                  scatter_1 + scatter_map.size(),
                  d_scatter_map,
                  ret.begin());
  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());
  return ret;
}
// Helper to generate mapping between each child row and which list it belongs to.
rmm::device_vector<cudf::size_type> get_child_row_to_list_map(cudf::size_type num_rows,
                                                              cudf::size_type num_child_rows,
                                                              column_view const& list_offsets,
                                                              rmm::cuda_stream_view stream)
{
  CUDF_EXPECTS(list_offsets.size() >= 2, "Invalid list offsets.");
  // reduce_by_key over the offsets: runs of equal offsets mark empty lists.
  // The offsets column has num_rows + 1 entries, so size the outputs to match.
  rmm::device_uvector<cudf::size_type> head_keys(num_rows + 1, stream);
  rmm::device_uvector<cudf::size_type> head_flags(num_rows + 1, stream);
  auto new_end = thrust::reduce_by_key(rmm::exec_policy(stream)->on(stream.value()),
                                       list_offsets.begin<cudf::size_type>(),
                                       list_offsets.end<cudf::size_type>(),
                                       thrust::make_constant_iterator<cudf::size_type>(1),
                                       head_keys.begin(),
                                       head_flags.begin());
  auto ret = rmm::device_vector<cudf::size_type>(static_cast<std::size_t>(num_child_rows), 0);
  // Skip the first run (key 0) and the last run (key == num_child_rows):
  // their keys fall outside [0, num_child_rows), i.e. slice after the reduce_by_key.
  thrust::scatter(rmm::exec_policy(stream)->on(stream.value()),
                  head_flags.begin() + 1,
                  new_end.second - 1,
                  head_keys.begin() + 1,
                  ret.begin());
  thrust::inclusive_scan(rmm::exec_policy(stream)->on(stream.value()),
                         ret.begin(),
                         ret.end(),
                         ret.begin());
  return ret;
}

@harrism harrism added this to PR-WIP in v0.18 Release via automation Nov 18, 2020
@harrism
Member

harrism commented Nov 18, 2020

A benchmark should be added as part of this PR, to demonstrate the value before and after, and to catch future regressions. There is an existing scatter benchmark (under copying); we just need to add benchmark cases for the different nested types.

@mythrocks
Contributor Author

I finally got my head around what needs doing here, based on the following example from @harrism:

[0, 0, 3, 5, 5, 9, 9]
    --reduce_by_key--> ([0, 3, 5, 9], [2, 1, 2, 2])
    --slice-->         ([3, 5], [1, 2])
    --scatter-->       [0, 0, 0, 1, 0, 2, 0, 0, 0]
    --scan-->          [0, 0, 0, 1, 1, 3, 3, 3, 3]

I've been slicing the offsets column before the scatter; that slicing can stay, but it needs to happen after the reduce_by_key().
This is heady stuff.
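Those stages can be traced on the host; this sketch uses std:: algorithms in place of the Thrust calls and mirrors the slice-after-reduce_by_key ordering (the function name is illustrative):

```cpp
#include <numeric>
#include <vector>

// Trace of the pipeline above: reduce_by_key --> slice --> scatter --> scan,
// with each intermediate matching the values in the example.
std::vector<int> child_row_to_list_map(std::vector<int> const& offsets, int num_child_rows)
{
  // reduce_by_key: unique offsets and their run lengths.
  std::vector<int> keys, counts;
  for (int v : offsets) {
    if (!keys.empty() && keys.back() == v)
      ++counts.back();
    else {
      keys.push_back(v);
      counts.push_back(1);
    }
  }
  // slice: drop the first run (offset 0) and the last (== num_child_rows);
  // scatter: write each surviving run length at its offset position.
  std::vector<int> flags(num_child_rows, 0);
  for (std::size_t k = 1; k + 1 < keys.size(); ++k) flags[keys[k]] = counts[k];
  // scan: inclusive scan of the head flags yields the row map.
  std::inclusive_scan(flags.begin(), flags.end(), flags.begin());
  return flags;
}
```

On the example offsets [0, 0, 3, 5, 5, 9, 9] this reproduces the final map [0, 0, 0, 1, 1, 3, 3, 3, 3] shown above.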

@mythrocks mythrocks added 3 - Ready for Review Ready for review by team 2 - In Progress Currently a work in progress and removed 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team labels Nov 24, 2020
@github-actions

This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@harrism
Member

harrism commented Feb 16, 2021

@mythrocks do you plan to resume this?

@github-actions github-actions bot removed the stale label Feb 17, 2021
@mythrocks
Contributor Author

@mythrocks do you plan to resume this?

I'd like to resume this work at some point. We were able to put the idea to use in #7189.

This PR has gone stale. I'll raise a new PR when I do resume.

@mythrocks mythrocks closed this Feb 18, 2021
v0.19 Release automation moved this from PR-WIP to Done Feb 18, 2021
@mythrocks
Contributor Author

#8255 addresses this. Thanks, @isVoid!
