Add column sanitization checks in `CUDF_TEST_EXPECT_COLUMN_*` macros #14559

SurajAralihalli · 2023-12-04T16:58:07Z

This PR addresses issue #12786.

The functions listed below now run a column sanitization check and raise a `std::invalid_argument` error when the check fails.

expect_column_properties_equal
expect_column_properties_equivalent
expect_columns_equal
expect_columns_equivalent

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

copy-pr-bot · 2023-12-04T16:58:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

SurajAralihalli · 2023-12-05T13:40:08Z

During local testing, the following tests did not pass:

REDUCTIONS_TEST
- SegmentedReductionStringTest.MaxIncludeNulls
CLAMP_TEST
- ClampStringTest.WithNullableColumn
- ClampStringTest.WithNullableColumnNullHigh
- ClampStringTest.WithReplaceString
- ClampDictionaryTest.WithNullableColumn
COPYING_TEST
- ...32 tests (specific tests not listed)
UTILITIES_TEST
- ColumnUtilitiesListsTest.UnsanitaryLists
ROLLING_TEST
- RollingDictionaryTest.LeadLag
JSON_PATH_TEST
- JsonPathTests.GetJsonObjectInvalidQuery
DICTIONARY_TEST
- DictionarySetKeysTest.StringsKeys
- DictionarySliceTest.SliceColumn

SurajAralihalli · 2023-12-05T14:01:48Z

The functions exercised by the tests listed above do not appear to produce sanitized columns. Each function needs a closer examination.

I started with ClampStringTest.WithNullableColumn, which tests cudf::clamp(). A manual comparison shows:

Returned:

Column: B, b, c, NULL, e, F, G, H, NULL, e, B
Offsets: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Nullmask: 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1

Expected:

Column: B, b, c, NULL, e, F, G, H, NULL, e, B
Offsets: 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9
Nullmask: 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1

I suspect this discrepancy could be because the offsets are computed without considering the null mask. See

cpp/tests/utilities/column_utilities.cu
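To make the discrepancy above concrete, here is a minimal host-side sketch of what "unsanitized" means for a strings column: a null row whose offsets still span one or more bytes. `StringsColumnModel`, `nonempty_null_rows`, and the helper functions are hypothetical names used only for illustration; they are not libcudf types or APIs, and the real check runs on device data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a strings column (not a libcudf type):
// row i owns the bytes [offsets[i], offsets[i + 1]); valid[i] == false marks a null row.
struct StringsColumnModel {
  std::vector<std::int32_t> offsets;  // number of rows + 1 entries
  std::vector<bool> valid;            // one entry per row
};

// Returns the indices of null rows whose offsets still span at least one byte.
std::vector<std::size_t> nonempty_null_rows(StringsColumnModel const& col)
{
  assert(col.offsets.size() == col.valid.size() + 1);
  std::vector<std::size_t> rows;
  for (std::size_t i = 0; i < col.valid.size(); ++i) {
    if (!col.valid[i] && col.offsets[i + 1] > col.offsets[i]) { rows.push_back(i); }
  }
  return rows;
}

// Null mask shared by both columns in the comparison above.
std::vector<bool> thread_null_mask()
{
  return {true, true, true, false, true, true, true, true, false, true, true};
}

// Offsets listed under "Returned" in the comparison above.
StringsColumnModel returned_column()
{
  return {{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, thread_null_mask()};
}

// Offsets listed under "Expected" in the comparison above.
StringsColumnModel expected_column()
{
  return {{0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9}, thread_null_mask()};
}
```

With these values, `nonempty_null_rows(returned_column())` reports rows 3 and 8, while `nonempty_null_rows(expected_column())` reports none, which matches the two null entries in the mask.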

davidwendt · 2023-12-05T16:42:28Z

Thanks for finding these. I can look into the strings and dictionary failures.

GregoryKimball · 2023-12-05T20:50:10Z

Thank you @SurajAralihalli. This is such a great investigation!

vyasr · 2023-12-05T23:18:26Z

Thanks! This should really help us tighten up our null handling throughout libcudf.

mythrocks · 2023-12-06T22:59:18Z

Your change looks good.

ROLLING_TEST
RollingDictionaryTest.LeadLag

I'm beginning to wonder if cudf::gather() on dictionary input might not be removing empty nulls. :/

davidwendt · 2023-12-06T23:02:00Z

> Your change looks good.
>
> ROLLING_TEST
> RollingDictionaryTest.LeadLag
>
> I'm beginning to wonder if cudf::gather() on dictionary input might not be removing empty nulls. :/

I verified that PR #14578 fixes this error.

Fixes the strings specialization logic in `cudf::clamp` to not produce unsanitized null entries. The code was refactored and simplified as well. Also removed unsanitized nulls in test input in the `cudf::clamp` gtests.

Reference: #14559 - fixes several of these gtests

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #14580

Fixes `cudf::dictionary::decode` logic to produce sanitized null entries for compound column types.

Reference: #14559 - fixes many of the errors found here concerning dictionary column gtests.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- MithunR (https://github.com/mythrocks)

URL: #14578

Fixes the string specialization logic in `cudf::segmented_reduce` to not produce unsanitized null entries. The functor used to build a gather map for argmin/argmax was corrected to handle include/exclude nulls correctly.

Reference: #14559

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #14586

Removes unsanitized rows from input data in gtests for COPYING_TEST. This fixes some errors found in #14559.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)

URL: #14600

Removes unsanitized rows from output result of `cudf::get_json_object` which may occur when querying an array path `$[*]` does not find any matches in the target string. In this case, a single `'['` remains in the output buffer (per row) though the row is marked as a null entry. This fixes the `JsonPathTests.GetJsonObjectInvalidQuery` error found in #14559.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #14609
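As an illustration of the kind of cleanup described in these fixes, the sketch below rebuilds a strings column so that null rows own zero bytes, using a stray `'['` left behind in a null row as the example. It is a host-side model under stated assumptions: `StringsColumnModel`, `purge_nonempty_nulls_model`, and `example_unsanitized` are hypothetical names, and this is not how libcudf implements the fix or its own `purge_nonempty_nulls` utility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side model of a strings column (not a libcudf type):
// row i owns chars[offsets[i], offsets[i + 1]); valid[i] == false marks a null row.
struct StringsColumnModel {
  std::string chars;
  std::vector<std::int32_t> offsets;  // number of rows + 1 entries
  std::vector<bool> valid;            // one entry per row
};

// Copies only the bytes of valid rows and recomputes offsets, so every null
// row ends up with offsets[i] == offsets[i + 1].
StringsColumnModel purge_nonempty_nulls_model(StringsColumnModel const& in)
{
  assert(in.offsets.size() == in.valid.size() + 1);
  StringsColumnModel out;
  out.valid = in.valid;
  out.offsets.reserve(in.offsets.size());
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < in.valid.size(); ++i) {
    if (in.valid[i]) {
      auto const begin = static_cast<std::size_t>(in.offsets[i]);
      auto const count = static_cast<std::size_t>(in.offsets[i + 1] - in.offsets[i]);
      out.chars.append(in.chars, begin, count);
    }
    out.offsets.push_back(static_cast<std::int32_t>(out.chars.size()));
  }
  return out;
}

// Illustrative input: row 1 is null but still owns a stray '[' byte.
StringsColumnModel example_unsanitized()
{
  return {"ab[cd", {0, 2, 3, 5}, {true, false, true}};
}
```

For the illustrative input, the rebuilt column has chars `abcd` and offsets `0, 2, 2, 4`; the null mask is unchanged, so only the hidden byte disappears.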

davidwendt · 2023-12-12T21:33:28Z

The UTILITIES_TEST ColumnUtilitiesListsTest.UnsanitaryLists is specifically coded in opposition to this change. It checks that an unsanitized list column matches its sanitized partner. If this PR is accepted, then this test would need to be removed.

cudf/cpp/tests/utilities_tests/column_utilities_tests.cpp

Line 284 in 21c90d6

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

vyasr · 2023-12-14T00:40:53Z

> The UTILITIES_TEST ColumnUtilitiesListsTest.UnsanitaryLists is specifically coded in opposition to this change. It checks that an unsanitized list column matches its sanitized partner. If this PR is accepted, then this test would need to be removed.
>
> cudf/cpp/tests/utilities_tests/column_utilities_tests.cpp
>
> Line 284 in 21c90d6
>
> TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

I think that's fine. We did something similar in 25ebec7 from #14363, and we discussed that in https://nvidia.slack.com/archives/C01CW5L51QC/p1699980706006349. Since our policies prohibit libcudf APIs from producing unsanitized outputs, we don't need to test unsanitized inputs. It's up to the user not to construct such inputs if they're creating their inputs manually.

davidwendt · 2023-12-14T00:50:17Z

@SurajAralihalli I would recommend you remove this test

cudf/cpp/tests/utilities_tests/column_utilities_tests.cpp

Line 284 in 21c90d6

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

in this PR.
And if you merge with the latest branch-24.02, I expect this PR should then pass CI.

SurajAralihalli · 2023-12-14T01:13:54Z

> @SurajAralihalli I would recommend you remove this test
>
> cudf/cpp/tests/utilities_tests/column_utilities_tests.cpp
>
> Line 284 in 21c90d6
>
> TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)
>
> in this PR.
>
> And if you merge with the latest branch-24.02, I expect this PR should then pass CI.

Sure @davidwendt! I'll do that by the end of this week when I get back from my trip. Thanks so much for letting me know.

davidwendt · 2023-12-18T13:08:17Z

/ok to test

vyasr · 2023-12-18T21:25:06Z

/merge

This PR removes an extra code path used for checking the equality of the null count when verifying that columns are equivalent (not equal). The purpose of this code path was to verify a specific definition of equivalence for columns containing unsanitized nulls, i.e. by ignoring the stored null count and directly verifying the validity of the underlying null mask. This is no longer necessary because we now require all libcudf APIs to output sanitized null masks (see the "libcudf expects nested types to have sanitized null masks" section in the [developer guide](https://docs.rapids.ai/api/libcudf/stable/developer_guide)), and this requirement will be enforced with the merge of #14559.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- Bradley Dice (https://github.com/bdice)

URL: #13312
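For readers unfamiliar with the idea of "directly verifying the validity of the underlying null mask", the sketch below counts nulls by reading validity bits instead of trusting a stored null count. It assumes 32-bit mask words with row i stored in bit `i % 32` of word `i / 32`, least-significant bit first; `count_nulls_from_mask` is a hypothetical helper for illustration and is not the code path that was removed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts null rows by inspecting validity bits directly.
// Assumption: bit (i % 32) of word (i / 32) is 1 when row i is valid.
// Bits beyond num_rows in the last word are ignored.
std::size_t count_nulls_from_mask(std::vector<std::uint32_t> const& mask_words,
                                  std::size_t num_rows)
{
  assert(mask_words.size() * 32 >= num_rows);
  std::size_t nulls = 0;
  for (std::size_t i = 0; i < num_rows; ++i) {
    std::uint32_t const word = mask_words[i / 32];
    bool const is_valid      = ((word >> (i % 32)) & 1u) != 0u;
    if (!is_valid) { ++nulls; }
  }
  return nulls;
}
```

As a usage example, an 11-row mask with rows 3 and 8 null is the single word `0x6F7`, and `count_nulls_from_mask({0x6F7u}, 11)` returns 2.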

SurajAralihalli added 2 commits December 2, 2023 05:33

check has_nonempty_nulls in macros

3a70103

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

update cu file

7f83e83

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

SurajAralihalli marked this pull request as ready for review December 5, 2023 14:07

SurajAralihalli requested a review from a team as a code owner December 5, 2023 14:07

SurajAralihalli requested review from mythrocks and ttnghia December 5, 2023 14:07

ttnghia reviewed Dec 5, 2023

cpp/tests/utilities/column_utilities.cu (outdated, resolved)

ttnghia reviewed Dec 5, 2023

cpp/tests/utilities/column_utilities.cu (outdated, resolved)

ttnghia changed the title ~~Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_ macros~~ Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros Dec 5, 2023

davidwendt mentioned this pull request Dec 5, 2023

Fix unsanitized nulls produced by libcudf dictionary decode #14578

Merged

3 tasks

davidwendt mentioned this pull request Dec 6, 2023

Fix unsanitized nulls produced by cudf::clamp APIs #14580

Merged

3 tasks

reduce code duplication and improve readability

1f4bb46

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

github-actions bot added the libcudf label Dec 6, 2023

ttnghia approved these changes Dec 6, 2023

davidwendt mentioned this pull request Dec 6, 2023

Fix unsanitized nulls from strings segmented-reduce #14586

Merged

3 tasks

mythrocks approved these changes Dec 6, 2023

davidwendt mentioned this pull request Dec 8, 2023

Remove unsanitized input test data from copy gtests #14600

Merged

3 tasks

davidwendt mentioned this pull request Dec 11, 2023

Remove non-empty nulls in cudf::get_json_object #14609

Merged

3 tasks

mythrocks assigned SurajAralihalli Dec 11, 2023

mythrocks added the tests, improvement, and non-breaking labels Dec 11, 2023

SurajAralihalli added 2 commits December 18, 2023 12:55

Merge branch 'branch-24.02' into verify_column_sanity_cudf_tests

fe57348

remove UnsanitaryLists test

6a6fc86

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

davidwendt added the "3 - Ready for Review" label Dec 18, 2023

vyasr mentioned this pull request Dec 18, 2023

Simplify null count checking in column equality comparator #13312

Merged

3 tasks

rapids-bot bot merged commit 3602816 into rapidsai:branch-24.02 Dec 18, 2023
67 checks passed

GregoryKimball mentioned this pull request Jan 5, 2024

[FEA] Make calling to purge_nonempty_nulls optional in various places #12567

Open
