[FEA] Implement full support for nested types #11844

GregoryKimball · 2022-10-03T05:38:10Z

After the introduction of new row operators for nested types (#10186), it's time to build on that success with a new issue to focus on the outstanding items. The row operators for equality = (#10289) and hashing # (#10641) included support for arbitrarily-nested List and Struct types. List inequality < (#11129) added support for arbitrarily-nested List<List> and Struct<List>. Please note that the Spark-RAPIDS plugin has a companion issue (NVIDIA/spark-rapids#8550) to this story issue.

Part 1: Transition algorithms to new row operators

Algorithm	Status	Operation	File	Notes
scan_rank	✅ ️ ️	`=`️	rank_scan.cu	#10289
distinct	✅ ️	`#`	distinct.cu	#10641
distinct_count	✅ #12776	`#`, `=`
unique_count	✅ #12776	`#`, `=`
hash-groupby	✅ ️	`#`, `=`	groupby.cu	#10770
sort	✅ ️	`<`	sort_impl.cuh	#11129, using legacy `relational_compare` in `simple_comparator` as part of two-path solution in `sorted_order`
list contains	✅	`=`	contains.cu	#10548, using legacy `equality_compare` as part of two-path solution in `search_list_non_nested_types_fn`, see also #11330 for the two-path solution for equality
struct binops	✅	`=`, `<`		#11153
murmur hash	✅	`#`		plus spark_murmur_hash
merge	✅ #14250	`<`	merge.cuh	merge creates a `row_lexicographic_tagged_comparator` that uses legacy `element_relational_comparator` as part of internal operations. Also see discussion in #13514 and #8050
sort-groupby nunique	✅ #11792	`=`	group_nunique.cu	uses legacy `element_equality_comparator` in `is_unique_iterator_fn`
sort-groupby rank_scan	✅ #11792	`=`	group_rank_scan.cu	uses legacy `row_equality_comparator` in `permuted_comparator`
sort-groupby	✅ #11792	`=`	sort_helper.cu	uses legacy `row_equality_comparator` in `permuted_row_equality_comparator`, linked issue #8039
hash map in parquet	✅ #12918	`=`	chunk_dict.cu	uses legacy `equality_compare` in find/insert. #8476 convert to cuco and #10635 optimize
unused includes	✅		round_robin.cu search_ordered.cu group_single_pass...	#11857
partition	✅ #12761	`#`	partitioning.cu	uses legacy `row_hasher` in `hash_partition_table`
struct min/max	✅ #13069	`<`	struct_minmax_util.cuh	uses legacy `row_lexicographic_comparator` in `row_arg_minmax_fn` #8974, see also #8964. see draft work in #10811 and #13069
is_sorted	✅ #12752	`<`	is_sorted.cu	uses legacy `row_lexicographic_comparator` in `is_sorted`
rank	✅ #12481	`=`	rank.cu	uses legacy `row_equality_comparator` in `unique_comparator`
unique	✅	`=`	unique.cu	uses legacy `row_equality_comparator` in `unique`
stream compaction	✅ #12776	`#`	stream_co...	uses legacy `row_hasher` renamed as `row_hash`
one-hot encode	✅	`=`	one_hot_encode.cu	uses legacy `element_equality_comparator` in `one_hot_encode_functor`
join	✅ #12787	`#`, `=`	join_common_utils.hpp	uses legacy `row_hasher` and `row_equality_comparator`
mixed join	✅ #13028	`#`, `=`		uses struct flattening
test util	✅ #12777	`=`	column_utilities.cu	uses legacy `row_equality_comparator` in `corresponding_rows_unequal`
contains_table	✅ #13119	`#`, `=`		See #13032 and previous work in #11330
list contains	✅ #13810	`=`	search_list	uses legacy `cudf::equality_compare` also see #13672
list min/max	✅ #13676	`<`	`list_minmax_util.cuh`	See #13667, We need to introduce a utility similar to `struct_minmax_util.cuh` to handle top-level lists in aggregation values

Part 2: Improving performance and adding functionality

Name	Status	Notes
Make experimental comparators default		see #12593, depends on Part 1
Improve two-path `<`	⏸️ (paused)	See #11667. #11129 added a two-path implementation for `sorted_order` based on templating experimental `<`, with one instance for simple types and one for nested.
Improve two-path `=`	✅ #12676	See #12676 #11016, #11667, #11330 added a two-path implementation for `lists::contains`, with one path for simple types using legacy `=` and one path for nested using experimental `=`.
Enable `<` operator on `Struct<List>` types	✅ #13005	see #11672, we expected that these types would work as part of #11129, but we encountered failures and disabled it
Support `<` operator for `List<Struct>` types	✅ #12953	Proposal and initial design in #11222, impacts #10408, to be solved by #12953
Support self-comparison in two-table `<`		#10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator
Support self-comparison in two-table `=`		see #13371
Add strong index types to two-table `#`		See #13116 for more information, may require upstream changes in cuco

Part 3: Expand support for nested types in cuDF-python

Update: Part 3 will be made into a separate story issue once this issue is closed

Name	Issue	Notes
`drop_duplicates()` on nested types	#6784	doesn't drop duplicates correctly
`groupby.apply` on nested types	#8039	throws exception from index with List type
binop comparison	TBD	exception handling in #11609

The text was updated successfully, but these errors were encountered:

vyasr · 2022-10-03T23:39:50Z

Is this issue just a more complete version of the last table in #10186 describing algorithms that should be using the comparators?

Converts the `rank` function to use experimental row comparators, which support list and struct types. Part of #11844. [Throughput benchmarks](#12481 (comment)) are available below. It seems like when `size_bytes` is constrained, the generator generates fewer rows in `list` types for increasing depths. That's why, `depth=4` has a higher throughput than `depth=1` because the number of leaf elements generated are the same, but with much fewer rows. Authors: - Divye Gala (https://github.com/divyegala) - Jordan Jacobelli (https://github.com/Ethyling) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - AJ Schmidt (https://github.com/ajschmidt8) URL: #12481

This PR is a part of #11844. Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #12752

This PR is a part of #11844. Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #12761

…or (#12776) This PR is a part of #11844 Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vyas Ramasubramani (https://github.com/vyasr) URL: #12776

Contributes to #11844 This PR migrates parquet encoding to use the experimental `nan_equal` equality check instead of the legacy `equality_compare`. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Divye Gala (https://github.com/divyegala) URL: #12918

…omparator (#12777) This PR is a part of #11844 Authors: - Divye Gala (https://github.com/divyegala) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #12777

Part of #11844. I will create a separate PR for `mixed_join`. Compilation times: `main` 94bbc82 : `16m47.513s` This PR 5d75db8 : `16m47.520s` Benchmarks: #12787 (comment) Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #12787

…3028) Part of #11844 `mixed_join` cannot support nested types as the conditional part relies on AST. This PR adds no new tests or benchmarks for this reason. [Benchmarks](#13028 (comment)) Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) URL: #13028

…rator (#13119) This is a part of #11844 Benchmarks: #13119 (comment) Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #13119

ttnghia · 2023-04-26T16:44:25Z

This should also address #13116.

Par of #11844 Authors: - Divye Gala (https://github.com/divyegala) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13069

Part of #11844 Authors: - Divye Gala (https://github.com/divyegala) - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #13810

…14250) Part of #11844 This PR also uses new experimental comparators for non-nested types by introducing a new device constructor for `cudf::experimental::row::lexicographic::device_row_comparator`. In the case of non-nested types, preprocessing can be skipped so comparators can be created on the fly. This solution helps us avoid creating 3 comparator types because `thrust::merge` can call the operator with indices from either side of the table. Furthermore, the PR reworks `cudf/detail/merge.cuh` by removing any CUDA headers/components to expose a true detail API of the form `cudf/detail/merge.hpp`. [Benchmark comparison for non-nested types](#14250 (comment)) Compilation time increases from ~6 mins to ~7 mins. Authors: - Divye Gala (https://github.com/divyegala) Approvers: - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) URL: #14250

GregoryKimball · 2023-12-13T17:48:31Z

Closing for now. We believe this to be complete. Congratulations team!

GregoryKimball added feature request New feature or request Needs Triage Need team to review and classify labels Oct 3, 2022

github-actions bot added this to Needs prioritizing in Feature Planning Oct 3, 2022

GregoryKimball changed the title ~~[WIP] Story - Full support for nested types~~ [FEA] Story - Full support for nested types Oct 3, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue helps: Spark Functionality that helps Spark RAPIDS helps: Python and removed Needs Triage Need team to review and classify labels Oct 3, 2022

GregoryKimball added this to the List and Struct data types and operations milestone Oct 3, 2022

GregoryKimball mentioned this issue Oct 4, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

vyasr mentioned this issue Oct 21, 2022

[FEA] Split experimental/row_operators.cuh #11012

Open

GregoryKimball assigned divyegala Jan 6, 2023

GregoryKimball mentioned this issue Jan 23, 2023

[FEA] Refactor experimental/row_operators.cuh and make it default #12593

Open

4 tasks

bdice mentioned this issue Feb 6, 2023

Convert rank to use to experimental row comparators #12481

Merged

3 tasks

PointKernel mentioned this issue Mar 9, 2023

Migrate parquet encoding to use experimental row operators #12918

Merged

3 tasks

divyegala mentioned this issue Mar 28, 2023

Update mixed_join to use experimental row hasher and comparator #13028

Merged

3 tasks

divyegala mentioned this issue Apr 5, 2023

Update struct_minmax_util to experimental row comparator #13069

Merged

3 tasks

divyegala mentioned this issue Apr 11, 2023

Update contains_table to experimental row hasher and equality comparator #13119

Merged

3 tasks

GregoryKimball mentioned this issue Jun 29, 2023

GitHub infra updates #13542

Closed

5 tasks

divyegala mentioned this issue Jul 7, 2023

Update lists::contains to experimental row comparator #13672

Closed

3 tasks

divyegala mentioned this issue Aug 2, 2023

Update lists::contains to experimental row comparator #13810

Merged

3 tasks

divyegala mentioned this issue Oct 20, 2023

Reimplement cudf::merge for nested types without using comparators #14250

Merged

3 tasks

revans2 mentioned this issue Oct 30, 2023

[FEA] Audit nested type support NVIDIA/spark-rapids#9580

Open

GregoryKimball closed this as completed Dec 13, 2023

GregoryKimball changed the title ~~[FEA] Story - Full support for nested types~~ [FEA] Implement full support for nested types Mar 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Implement full support for nested types #11844

[FEA] Implement full support for nested types #11844

GregoryKimball commented Oct 3, 2022 •

edited

vyasr commented Oct 3, 2022

ttnghia commented Apr 26, 2023

GregoryKimball commented Dec 13, 2023

[FEA] Implement full support for nested types #11844

[FEA] Implement full support for nested types #11844

Comments

GregoryKimball commented Oct 3, 2022 • edited

Part 1: Transition algorithms to new row operators

Part 2: Improving performance and adding functionality

Part 3: Expand support for nested types in cuDF-python

vyasr commented Oct 3, 2022

ttnghia commented Apr 26, 2023

GregoryKimball commented Dec 13, 2023

GregoryKimball commented Oct 3, 2022 •

edited