Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros #14559

Conversation

SurajAralihalli
Copy link
Contributor

@SurajAralihalli SurajAralihalli commented Dec 4, 2023

This PR addresses Issue #12786

The listed functions have been modified to incorporate a column sanitization check; otherwise, they will raise a std::invalid_argument error.

  • expect_column_properties_equal
  • expect_column_properties_equivalent
  • expect_columns_equal
  • expect_columns_equivalent

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
Copy link

copy-pr-bot bot commented Dec 4, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Dec 5, 2023

During local testing, the following tests did not pass:

  1. REDUCTIONS_TEST
    • SegmentedReductionStringTest.MaxIncludeNulls
  2. CLAMP_TEST
    • ClampStringTest.WithNullableColumn
    • ClampStringTest.WithNullableColumnNullHigh
    • ClampStringTest.WithReplaceString
    • ClampDictionaryTest.WithNullableColumn
  3. COPYING_TEST
    • ...32 tests (specific tests not listed)
  4. UTILITIES_TEST
    • ColumnUtilitiesListsTest.UnsanitaryLists
  5. ROLLING_TEST
    • RollingDictionaryTest.LeadLag
  6. JSON_PATH_TEST
    • JsonPathTests.GetJsonObjectInvalidQuery
  7. DICTIONARY_TEST
    • DictionarySetKeysTest.StringsKeys
    • DictionarySliceTest.SliceColumn

@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Dec 5, 2023

It appears that the functions tested in the mentioned tests do not seem to produce sanitized columns. Each function needs a closer examination.

I started by analyzing ClampStringTest.WithNullableColumn which tests cudf::clamp(). Upon manual comparison:

Returned:

  • Column: B, b, c, NULL, e, F, G, H, NULL, e, B
  • Offsets: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
  • Nullmask: 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1

Expected:

  • Column: B, b, c, NULL, e, F, G, H, NULL, e, B
  • Offsets: 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9
  • Nullmask: 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1

I suspect this discrepancy could be because the offsets are computed without considering the null mask. See

@SurajAralihalli SurajAralihalli marked this pull request as ready for review December 5, 2023 14:07
@SurajAralihalli SurajAralihalli requested a review from a team as a code owner December 5, 2023 14:07
@davidwendt
Copy link
Contributor

Thanks for finding these. I can look into the strings and dictionary failures.

@ttnghia ttnghia changed the title Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_ macros Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros Dec 5, 2023
@GregoryKimball
Copy link
Contributor

Thank you @SurajAralihalli. This is such a great investigation!

@vyasr
Copy link
Contributor

vyasr commented Dec 5, 2023

Thanks! This should really help us tighten up our null handling throughout libcudf.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 6, 2023
@mythrocks
Copy link
Contributor

Your change looks good.

ROLLING_TEST
RollingDictionaryTest.LeadLag

I'm beginning to wonder if cudf::gather() on dictionary input might not be removing empty nulls. :/

@davidwendt
Copy link
Contributor

Your change looks good.

ROLLING_TEST
RollingDictionaryTest.LeadLag

I'm beginning to wonder if cudf::gather() on dictionary input might not be removing empty nulls. :/

I verified that PR #14578 fixes this error.

rapids-bot bot pushed a commit that referenced this pull request Dec 7, 2023
Fixes the strings specialization logic in `cudf::clamp` to not produce unsanitized null entries. The code was refactored and simplified as well. Also removed unsanitized nulls in test input in the `cudf::clamp` gtests.

Reference: #14559 - fixes several of these gtests

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14580
rapids-bot bot pushed a commit that referenced this pull request Dec 8, 2023
Fixes `cudf::dictionary::decode` logic to produced sanitized null entries for compound column types.

Reference: #14559 -- fixes many of the errors found here concerning dictionary column gtests.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)

URL: #14578
rapids-bot bot pushed a commit that referenced this pull request Dec 8, 2023
Fixes the string specialization logic in `cudf::segmented_reduce` to not produce unsanitized null entries. The functor used to build a gather map for argmin/argmax was corrected to handle include/exclude nulls correctly.

Reference: #14559

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #14586
@mythrocks mythrocks added tests Unit testing for project improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 11, 2023
rapids-bot bot pushed a commit that referenced this pull request Dec 12, 2023
Removes unsanitized rows from input data in gtests for COPYING_TEST.
This fixes some errors found in #14559

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14600
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Fixes the strings specialization logic in `cudf::clamp` to not produce unsanitized null entries. The code was refactored and simplified as well. Also removed unsanitized nulls in test input in the `cudf::clamp` gtests.

Reference: rapidsai#14559 - fixes several of these gtests

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#14580
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
…#14578)

Fixes `cudf::dictionary::decode` logic to produced sanitized null entries for compound column types.

Reference: rapidsai#14559 -- fixes many of the errors found here concerning dictionary column gtests.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#14578
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Fixes the string specialization logic in `cudf::segmented_reduce` to not produce unsanitized null entries. The functor used to build a gather map for argmin/argmax was corrected to handle include/exclude nulls correctly.

Reference: rapidsai#14559

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#14586
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Removes unsanitized rows from input data in gtests for COPYING_TEST.
This fixes some errors found in rapidsai#14559

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#14600
rapids-bot bot pushed a commit that referenced this pull request Dec 12, 2023
Removes unsanitized rows from output result of `cudf::get_json_object` which may occur when querying an array path `$[*]` does not find any matches in the target string. In this case, a single `'['` remains in the output buffer (per row) though the row is marked as a null entry.
This fixes the `JsonPathTests.GetJsonObjectInvalidQuery` error found in #14559

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14609
@davidwendt
Copy link
Contributor

The UTILITIES_TEST ColumnUtilitiesListsTest.UnsanitaryLists is specifically coded in opposition to this change. It checks that an unsanitized list column matches its sanitized partner. If this PR is accepted, then this test would need to be removed.

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

@vyasr
Copy link
Contributor

vyasr commented Dec 14, 2023

The UTILITIES_TEST ColumnUtilitiesListsTest.UnsanitaryLists is specifically coded in opposition to this change. It checks that an unsanitized list column matches its sanitized partner. If this PR is accepted, then this test would need to be removed.

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

I think that's fine. We did something similar in 25ebec7 from #14363, and we discussed that in https://nvidia.slack.com/archives/C01CW5L51QC/p1699980706006349. Since our policies prohibit libcudf APIs from producing unsanitized outputs, we don't need to test unsanitized inputs. It's up to the user not to construct such inputs if they're creating their inputs manually.

@davidwendt
Copy link
Contributor

@SurajAralihalli I would recommend you remove this test

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

in this PR.
And if you merge with the latest branch-24.02, I expect this PR should then pass CI.

@SurajAralihalli
Copy link
Contributor Author

@SurajAralihalli I would recommend you remove this test

TEST_F(ColumnUtilitiesListsTest, UnsanitaryLists)

in this PR.

And if you merge with the latest branch-24.02, I expect this PR should then pass CI.

Sure @davidwendt! I'll do that by the end of this week when I get back from my trip. Thanks so much for letting me know.

@davidwendt davidwendt added the 3 - Ready for Review Ready for review by team label Dec 18, 2023
@davidwendt
Copy link
Contributor

/ok to test

@vyasr
Copy link
Contributor

vyasr commented Dec 18, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3602816 into rapidsai:branch-24.02 Dec 18, 2023
67 checks passed
rapids-bot bot pushed a commit that referenced this pull request Dec 19, 2023
This PR removes an extra code path used for checking the equality of the null count when verifying if columns are equivalent (not equal). The purpose of this code path was to verify a specific definition of equivalence for columns containing unsanitized nulls, i.e. by ignoring the stored null count and directly verifying the validity of the underlying null mask. This is no longer necessary because we required sanitized null masks to be output from all libcudf APIs now (see the "libcudf expects nested types to have sanitized null masks" section in the [developer guide](https://docs.rapids.ai/api/libcudf/stable/developer_guide)), and this requirement will be enforced with the merge of #14559.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: #13312
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change tests Unit testing for project
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants