column: calculate null_count before release()ing the cudf::column #11365
Conversation
`release()` sets the null_count of a column to zero, so previously asking for the null_count returned an incorrect value. Fortunately this bug never manifested in the final column, since `Column.__init__` always ignores the provided null_count and computes it from the null_mask (if one is given).
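The ordering bug can be illustrated with a toy sketch (the class and method names here are hypothetical stand-ins, not the actual cudf API): reading a cached field after `release()` has transferred ownership and reset it observes the wrong value, so the read must happen first.

```python
class ToyColumn:
    """Hypothetical stand-in for a cudf::column-like owning object."""

    def __init__(self, data, null_count):
        self.data = data
        self._null_count = null_count

    def null_count(self):
        return self._null_count

    def release(self):
        # Transfers ownership of the buffers and, like the behaviour
        # described in this PR, resets the cached null count to zero.
        contents, self.data = self.data, None
        self._null_count = 0
        return contents


# Buggy order: release() first, then read the (now-reset) count.
buggy = ToyColumn(data=[1, None, 3], null_count=1)
_ = buggy.release()
assert buggy.null_count() == 0  # wrong value observed

# Fixed order (what this PR does): read the count before release().
col = ToyColumn(data=[1, None, 3], null_count=1)
null_count = col.null_count()
_ = col.release()
assert null_count == 1  # correct value preserved
```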
rerun tests
Codecov Report

```diff
@@            Coverage Diff             @@
##           branch-22.10   #11365   +/-   ##
===============================================
  Coverage              ?   86.43%
===============================================
  Files                 ?      144
  Lines                 ?    22808
  Branches              ?        0
===============================================
  Hits                  ?    19714
  Misses                ?     3094
  Partials              ?        0
```
@wence- could you explain a little more? It seems like if …
If you mean "to a value of
And any later access of So what is this PR fixing? Given that we are passing
This change just moves the access of Arguably a better change (although it is more pervasive since it changes the signature of |
Ah thanks for that! I missed that `release()` resets the null count. I am fine with this fix, although I agree that removing the null count would probably be a more robust change. However, it would be nice if we could actually reliably set the null count, since that would save us having to run the count kernel. @shwina is much more familiar with this part of the code base than I am. Ashwin, do you think it would be very disruptive to instead reorder …, so that we could actually optimize out the null count calculation if the calling code knows it beforehand? Do you think it's worth trying, to at least see how much code is currently relying on the implementation detail that @wence- discovered here? I wouldn't be surprised if some places are perhaps setting invalid null counts that could actually benefit performance-wise from setting the null count correctly if we reordered these calls.
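The optimization being discussed could be sketched like this (hypothetical names, not the cudf implementation): trust a caller-supplied null count, and only fall back to the expensive counting step when none is given.

```python
def make_column(values, mask, null_count=None):
    """Hypothetical constructor: trusts a caller-supplied null count,
    and only counts nulls (the expensive 'kernel' step) when it is None."""
    if null_count is None:
        # Expensive path: derive the count from the validity mask,
        # where True means valid and False means null.
        null_count = sum(1 for valid in mask if not valid)
    return {"values": values, "mask": mask, "null_count": null_count}


mask = [True, False, True, False]

# Caller already knows the count: no counting work is needed.
fast = make_column([1, 2, 3, 4], mask, null_count=2)

# Caller does not know it: fall back to computing from the mask.
slow = make_column([1, 2, 3, 4], mask)

assert fast["null_count"] == slow["null_count"] == 2
```

The trade-off is exactly the one raised later in this thread: a caller passing a wrong count silently corrupts the column, so the fast path only works if callers are held responsible for correctness.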
This is effectively a cache invalidation problem (AKA, one of the two hard problems in computer science). There are two levels at which the cache invalidation might be incorrectly done, both in the C++ layer. The invariant we would like to hold is:

```cpp
auto null_count = cv->_null_count;
cv.set_null_count(UNKNOWN); // force recalculation
assert cv.null_count() == null_count;
```

The API certainly allows this invariant to be broken, since we can call `set_null_count` with an arbitrary value. Similarly:

```cpp
auto mcv = mutable_column_view(...);
auto nulls = mcv->null_mask();
// modify nulls
// mcv->null_count() is incorrect
```

This latter is a reasonably local issue, since when grabbing a mutable column view from a column, we invalidate the null_count on the column so it will be recalculated the next time we ask the column. Other lurking issues (low possibility of happening): …
I haven't looked as carefully in the Cython/Python code, but I suspect many of the same issues apply.
I agree that getting caching to work just right is extremely difficult. However, most of the issues that you've outlined are pretty orthogonal to the specific use case here. If any of those examples materialize, then even if we don't propagate that caching to the Cython layer you're going to end up with code breaking somewhere. Memoization of quantities that are susceptible to this kind of modification always invites some level of bugs.

I think the Python code has to assume that the inputs are valid; if a user wants to optimize their code by passing in a null count along with a mask to prevent having to launch a kernel for computing that count, they are responsible for ensuring that the count is correct. At the Python level we should feel comfortable making the assumption that the numbers are correct since, if they weren't, they were invalidated by incorrect code that will cause errors somewhere else.

In any case, I don't think this is a blocker for this PR. Your fix is good. If we want to revisit the possibility of actually using the passed value to speed things up, we can always come back to it later.
@gpucibot merge |