Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Implement full support for nested types #11844

Closed
GregoryKimball opened this issue Oct 3, 2022 · 3 comments
Closed

[FEA] Implement full support for nested types #11844

GregoryKimball opened this issue Oct 3, 2022 · 3 comments
Assignees
Labels
feature request New feature or request helps: Spark Functionality that helps Spark RAPIDS libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Oct 3, 2022

After the introduction of new row operators for nested types (#10186), it's time to build on that success with a new issue to focus on the outstanding items. The row operators for equality = (#10289) and hashing # (#10641) included support for arbitrarily-nested List and Struct types. List inequality < (#11129) added support for arbitrarily-nested List<List> and Struct<List>. Please note that the Spark-RAPIDS plugin has a companion issue (NVIDIA/spark-rapids#8550) to this story issue.

Part 1: Transition algorithms to new row operators

Algorithm Status Operation File Notes
scan_rank ✅ ️ ️ = rank_scan.cu #10289
distinct ✅ ️ # distinct.cu #10641
distinct_count #12776 #, =
unique_count #12776 #, =
hash-groupby ✅ ️ #, = groupby.cu #10770
sort ✅ ️ < sort_impl.cuh #11129, using legacy relational_compare in simple_comparator as part of two-path solution in sorted_order
list contains = contains.cu #10548, using legacy equality_compare as part of two-path solution in search_list_non_nested_types_fn, see also #11330 for the two-path solution for equality
struct binops =, < #11153
murmur hash # plus spark_murmur_hash
merge #14250 < merge.cuh merge creates a row_lexicographic_tagged_comparator that uses legacy element_relational_comparator as part of internal operations. Also see discussion in #13514 and #8050
sort-groupby nunique #11792 = group_nunique.cu uses legacy element_equality_comparator in is_unique_iterator_fn
sort-groupby rank_scan #11792 = group_rank_scan.cu uses legacy row_equality_comparator in permuted_comparator
sort-groupby #11792 = sort_helper.cu uses legacy row_equality_comparator in permuted_row_equality_comparator, linked issue #8039
hash map in parquet #12918 = chunk_dict.cu uses legacy equality_compare in find/insert. #8476 convert to cuco and #10635 optimize
unused includes round_robin.cu search_ordered.cu group_single_pass... #11857
partition #12761 # partitioning.cu uses legacy row_hasher in hash_partition_table
struct min/max #13069 < struct_minmax_util.cuh uses legacy row_lexicographic_comparator in row_arg_minmax_fn #8974, see also #8964. see draft work in #10811 and #13069
is_sorted #12752 < is_sorted.cu uses legacy row_lexicographic_comparator in is_sorted
rank #12481 = rank.cu uses legacy row_equality_comparator in unique_comparator
unique = unique.cu uses legacy row_equality_comparator in unique
stream compaction #12776 # stream_co... uses legacy row_hasher renamed as row_hash
one-hot encode = one_hot_encode.cu uses legacy element_equality_comparator in one_hot_encode_functor
join #12787 #, = join_common_utils.hpp uses legacy row_hasher and row_equality_comparator
mixed join #13028 #, = uses struct flattening
test util #12777 = column_utilities.cu uses legacy row_equality_comparator in corresponding_rows_unequal
contains_table #13119 #, = See #13032 and previous work in #11330
list contains #13810 = search_list uses legacy cudf::equality_compare also see #13672
list min/max #13676 < list_minmax_util.cuh See #13667, We need to introduce a utility similar to struct_minmax_util.cuh to handle top-level lists in aggregation values

Part 2: Improving performance and adding functionality

Name Status Notes
Make experimental comparators default see #12593, depends on Part 1
Improve two-path < ⏸️ (paused) See #11667. #11129 added a two-path implementation for sorted_order based on templating experimental <, with one instance for simple types and one for nested.
Improve two-path = #12676 See #12676 #11016, #11667, #11330 added a two-path implementation for lists::contains, with one path for simple types using legacy = and one path for nested using experimental =.
Enable < operator on Struct<List> types #13005 see #11672, we expected that these types would work as part of #11129, but we encountered failures and disabled it
Support < operator for List<Struct> types #12953 Proposal and initial design in #11222, impacts #10408, to be solved by #12953
Support self-comparison in two-table < #10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator
Support self-comparison in two-table = see #13371
Add strong index types to two-table # See #13116 for more information, may require upstream changes in cuco

Part 3: Expand support for nested types in cuDF-python

Update: Part 3 will be made into a separate story issue once this issue is closed

Name Issue Notes
drop_duplicates() on nested types #6784 doesn't drop duplicates correctly
groupby.apply on nested types #8039 throws exception from index with List type
binop comparison TBD exception handling in #11609
@GregoryKimball GregoryKimball added feature request New feature or request Needs Triage Need team to review and classify labels Oct 3, 2022
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning Oct 3, 2022
@GregoryKimball GregoryKimball changed the title [WIP] Story - Full support for nested types [FEA] Story - Full support for nested types Oct 3, 2022
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue helps: Spark Functionality that helps Spark RAPIDS helps: Python and removed Needs Triage Need team to review and classify labels Oct 3, 2022
@vyasr
Copy link
Contributor

vyasr commented Oct 3, 2022

Is this issue just a more complete version of the last table in #10186 describing algorithms that should be using the comparators?

rapids-bot bot pushed a commit that referenced this issue Feb 8, 2023
Converts the `rank` function to use experimental row comparators, which support list and struct types. Part of #11844.

[Throughput benchmarks](#12481 (comment)) are available below. It seems like when `size_bytes` is constrained, the generator generates fewer rows in `list` types for increasing depths. That's why, `depth=4` has a higher throughput than `depth=1` because the number of leaf elements generated are the same, but with much fewer rows.

Authors:
  - Divye Gala (https://github.com/divyegala)
  - Jordan Jacobelli (https://github.com/Ethyling)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #12481
rapids-bot bot pushed a commit that referenced this issue Feb 16, 2023
This PR is a part of #11844.

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #12752
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2023
rapids-bot bot pushed a commit that referenced this issue Mar 7, 2023
…or (#12776)

This PR is a part of #11844

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12776
rapids-bot bot pushed a commit that referenced this issue Mar 10, 2023
Contributes to #11844

This PR migrates parquet encoding to use the experimental `nan_equal` equality check instead of the legacy `equality_compare`.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Divye Gala (https://github.com/divyegala)

URL: #12918
rapids-bot bot pushed a commit that referenced this issue Mar 22, 2023
…omparator (#12777)

This PR is a part of #11844

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12777
rapids-bot bot pushed a commit that referenced this issue Apr 6, 2023
Part of #11844. I will create a separate PR for `mixed_join`.

Compilation times:
`main` 94bbc82 : `16m47.513s`
This PR 5d75db8 : `16m47.520s`

Benchmarks: #12787 (comment)

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12787
rapids-bot bot pushed a commit that referenced this issue Apr 21, 2023
…3028)

Part of #11844 

`mixed_join` cannot support nested types as the conditional part relies on AST. This PR adds no new tests or benchmarks for this reason. 

[Benchmarks](#13028 (comment))

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #13028
rapids-bot bot pushed a commit that referenced this issue Apr 25, 2023
…rator (#13119)

This is a part of #11844 

Benchmarks: #13119 (comment)

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13119
@ttnghia
Copy link
Contributor

ttnghia commented Apr 26, 2023

This should also address #13116.

rapids-bot bot pushed a commit that referenced this issue Jun 29, 2023
Par of #11844

Authors:
  - Divye Gala (https://github.com/divyegala)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13069
rapids-bot bot pushed a commit that referenced this issue Aug 3, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 28, 2023
…14250)

Part of #11844 

This PR also uses new experimental comparators for non-nested types by introducing a new device constructor for `cudf::experimental::row::lexicographic::device_row_comparator`. In the case of non-nested types, preprocessing can be skipped so comparators can be created on the fly. This solution helps us avoid creating 3 comparator types because `thrust::merge` can call the operator with indices from either side of the table.

Furthermore, the PR reworks `cudf/detail/merge.cuh` by removing any CUDA headers/components to expose a true detail API of the form `cudf/detail/merge.hpp`.

[Benchmark comparison for non-nested types](#14250 (comment))

Compilation time increases from ~6 mins to ~7 mins.

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)

URL: #14250
@GregoryKimball
Copy link
Contributor Author

Closing for now. We believe this to be complete. Congratulations team!

@GregoryKimball GregoryKimball changed the title [FEA] Story - Full support for nested types [FEA] Implement full support for nested types Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request helps: Spark Functionality that helps Spark RAPIDS libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
Archived in project
Status: Story Issue
Feature Planning
Needs prioritizing
Development

No branches or pull requests

4 participants