column: calculate null_count before release()ing the cudf::column #11365
Conversation
`release()` sets the null_count of a column to zero, so previously asking for the null_count returned an incorrect value. Fortunately this bug never manifested in the final column, since `Column.__init__` always ignores the provided null_count and computes it from the null_mask (if one is given).
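The ordering bug can be illustrated with a toy sketch (the class and method names here are hypothetical stand-ins, not the actual cudf API): reading a cached field after `release()` has transferred ownership and reset it observes the wrong value, so the read must happen first.

```python
class ToyColumn:
    """Hypothetical stand-in for a cudf::column-like owning object."""

    def __init__(self, data, null_count):
        self.data = data
        self._null_count = null_count

    def null_count(self):
        return self._null_count

    def release(self):
        # Transfers ownership of the buffers and, like the behaviour
        # described in this PR, resets the cached null count to zero.
        contents, self.data = self.data, None
        self._null_count = 0
        return contents


# Buggy order: release() first, then read the (now-reset) count.
buggy = ToyColumn(data=[1, None, 3], null_count=1)
_ = buggy.release()
assert buggy.null_count() == 0  # wrong value observed

# Fixed order (what this PR does): read the count before release().
col = ToyColumn(data=[1, None, 3], null_count=1)
null_count = col.null_count()
_ = col.release()
assert null_count == 1  # correct value preserved
```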
rerun tests
Codecov Report

```diff
@@            Coverage Diff             @@
##           branch-22.10   #11365   +/-   ##
===============================================
  Coverage              ?   86.43%
===============================================
  Files                 ?      144
  Lines                 ?    22808
  Branches              ?        0
===============================================
  Hits                  ?    19714
  Misses                ?     3094
  Partials              ?        0
```
@wence- could you explain a little more? It seems like if …
If you mean "to a value of
And any later access of So what is this PR fixing? Given that we are passing
This change just moves the access of Arguably a better change (although it is more pervasive since it changes the signature of |
Ah thanks for that! I missed that `release()` resets the null count. I am fine with this fix, although I agree that removing the null count would probably be a more robust change. However, it would be nice if we could actually reliably set the null count, since that would save us having to run the count kernel. @shwina is much more familiar with this part of the code base than I am. Ashwin, do you think it would be very disruptive to instead reorder …, so that we could actually optimize out the null count calculation if the calling code knows it beforehand? Do you think it's worth trying, to at least see how much code is currently relying on the implementation detail that @wence- discovered here? I wouldn't be surprised if some places are perhaps setting invalid null counts that could actually benefit performance-wise from setting the null count correctly if we reordered these calls.
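The optimization being discussed could be sketched like this (hypothetical names, not the cudf implementation): trust a caller-supplied null count, and only fall back to the expensive counting step when none is given.

```python
def make_column(values, mask, null_count=None):
    """Hypothetical constructor: trusts a caller-supplied null count,
    and only counts nulls (the expensive 'kernel' step) when it is None."""
    if null_count is None:
        # Expensive path: derive the count from the validity mask,
        # where True means valid and False means null.
        null_count = sum(1 for valid in mask if not valid)
    return {"values": values, "mask": mask, "null_count": null_count}


mask = [True, False, True, False]

# Caller already knows the count: no counting work is needed.
fast = make_column([1, 2, 3, 4], mask, null_count=2)

# Caller does not know it: fall back to computing from the mask.
slow = make_column([1, 2, 3, 4], mask)

assert fast["null_count"] == slow["null_count"] == 2
```

The trade-off is exactly the one raised later in this thread: a caller passing a wrong count silently corrupts the column, so the fast path only works if callers are held responsible for correctness.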
This is effectively a cache invalidation problem (AKA, one of the two hard problems in computer science). There are two levels at which the cache invalidation might be incorrectly done, both in the C++ layer. The invariant we would like to hold is:

```cpp
auto null_count = cv->_null_count;
cv.set_null_count(UNKNOWN); // force recalculation
assert cv.null_count() == null_count;
```

The API certainly allows this invariant to be broken, since we can call `set_null_count` with an arbitrary value. Similarly:

```cpp
auto mcv = mutable_column_view(...);
auto nulls = mcv->null_mask();
// modify nulls
// mcv->null_count() is incorrect
```

This latter is a reasonably local issue, since when grabbing a mutable column view from a column, we invalidate the null_count on the column so it will be recalculated the next time we ask the column. Other lurking issues (low possibility of happening): …
I haven't looked as carefully in the Cython/Python code, but I suspect many of the same issues apply.
I agree that getting caching to work just right is extremely difficult. However, most of the issues that you've outlined are pretty orthogonal to the specific use case here. If any of those examples materialize, then even if we don't propagate that caching to the Cython layer you're going to end up with code breaking somewhere. Memoization of quantities that are susceptible to this kind of modification always invites some level of bugs.

I think the Python code has to assume that the inputs are valid; if a user wants to optimize their code by passing in a null count along with a mask to prevent having to launch a kernel for computing that count, they are responsible for ensuring that the count is correct. At the Python level we should feel comfortable making the assumption that the numbers are correct since, if they weren't, they were invalidated by incorrect code that will cause errors somewhere else.

In any case, I don't think this is a blocker for this PR. Your fix is good. If we want to revisit the possibility of actually using the passed value to speed things up, we can always come back to it later.
@gpucibot merge |