Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Adds fill, validityAsBooleanVector, replaceNulls, and isNull/isNotNull [skip ci] #2129

Merged
merged 8 commits into from
Jul 2, 2019

Conversation

abellina
Copy link
Contributor

@abellina abellina commented Jun 27, 2019

This PR adds support in the java api for:

  • Whole vector filling (internal api): to mirror cudf::fill for the full vector case
  • ColumnVector.replaceNulls: to mirror cudf::replace_nulls
  • ColumnVector.validityAsBooleanVector: Obtaining the validity bitmask as a boolean ColumnVector. Note: this function will go away if libcudf adds support for isNull as a column operation.
  • ColumnVector.isNull/isNotNull: required when filtering nullable columns
  • ColumnVector.fromScalar: given a Scalar and a size, returns a ColumnVector of that size filled with the scalar's value

@abellina abellina requested a review from a team as a code owner June 27, 2019 14:30
@abellina abellina added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Java) Reviewer Java Affects Java cuDF API. labels Jun 27, 2019
@jrhemstad
Copy link
Contributor

Out of curiosity, how is validityAsBooleanVector implemented? I don't believe we have this in libcudf yet, but we have planned to add it eventually.

@abellina
Copy link
Contributor Author

@jrhemstad we filled a boolean vector with true (using cudf::fill), then we take the validity bitmask from the original vector and set it on this new vector. This allows us to call cudf::replace_nulls with false.

@revans2 revans2 changed the title Adds fill, validityAsBooleanVector, replaceNulls, and isNull/isNotNull [REVIEW] Adds fill, validityAsBooleanVector, replaceNulls, and isNull/isNotNull [skip ci] Jun 27, 2019
@revans2
Copy link
Contributor

revans2 commented Jun 27, 2019

@abellina when the PR is ready for review please tag it, but also put the [REVIEW] in the title. Java is also not tied into the CI system yet, so until that happens you can put [skip ci] at the end to avoid running the ci tests.

@jrhemstad
Copy link
Contributor

@jrhemstad we filled a boolean vector with true (using cudf::fill), then we take the validity bitmask from the original vector and set it on this new vector. This allows us to call cudf::replace_nulls with false.

Ah, okay, that makes sense. It'd certainly be more efficient to implement this in a custom kernel.

I think your solution makes sense in the medium term. Would you mind creating an issue to add this as a native libcudf feature?

@abellina
Copy link
Contributor Author

@jrhemstad will do. It would be best if isNull and isNotNull also moved to libcudf as unary ops. I'll create issues for these as well.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little nervous about fill being in place, simply because none of our other code expects to have the contents of the vector change underneath it and we have to do things to update the cached java-side code etc. The way we are using it now is fine, but we may want to discuss a bit more about if we do expose it publicly how are we going to handle it so we don't have issues later on, and if we don't expose it then we should fold fill and fromScalar into a single function and remove a lot of the overlap between them.

java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ReplaceNullsTest.java Outdated Show resolved Hide resolved
@harrism
Copy link
Member

harrism commented Jun 27, 2019

I had requested @abellina to file an issue first so we could discuss the best approach for this in libcudf before implementing it...

@harrism
Copy link
Member

harrism commented Jun 27, 2019

  • Whole vector filling (internal api): to mirror cudf::fill for the full vector case

cudf::fill is not just "whole-vector" filling. You can use it to fill any contiguous range within a vector.

  • ColumnVector.replaceNulls: to mirror cudf::replace_nulls
  • ColumnVector.validityAsBooleanVector: Obtaining the validity bitmask as a boolean ColumnVector
  • ColumnVector.isNull/isNotNull: required when filtering nullable columns
  • ColumnVector.fromScalar: given a Scalar and a size, returns a ColumnVector of that size filled with the scalar's value

All of these things seem useful (except validityAsBooleanVector, which to me sounds identical to isNotNull) to have in libcudf, where we could ensure they are efficient.

@abellina
Copy link
Contributor Author

@harrism, for cudf::fill I didn't include the range version because I experienced this #2142. I can predict the null_count if we fill the whole vector (all nulls, or no nulls), so that was easy for me to work around in this case.

validityAsBooleanVector and isNotNull are equivalent. The idea is to remove validityAsBooleanVector once isNotNull is implemented in libcudf. I should have made that clear in the description of this PR. Here's the promised issue: #2143

@abellina abellina requested a review from revans2 June 28, 2019 14:17
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is conditional on approval from @harrism too.

I really would like to see isNull and isNotNull go in, because it is blocking us doing other things with Spark, but at the same time, I want to be sure that we do it right eventually so waiting until we know how to do it right is fine.

@harrism
Copy link
Member

harrism commented Jul 1, 2019

I'm fine with you guys proceeding as long as you replace your code with calls to the libcudf version once it exists.

@abellina
Copy link
Contributor Author

abellina commented Jul 1, 2019

I added a commit to remove the overwrite I did for null_count (after fill), as we now have #2142 merged. @tgravescs could you give it a look?

A follow up pr should add the java cudf::fill functions.

@abellina
Copy link
Contributor Author

abellina commented Jul 1, 2019

Actually, please hold on merging this. I think I found something else.

@abellina
Copy link
Contributor Author

abellina commented Jul 1, 2019

@tgravescs ok it's good now. I had to set the bitmask to all 1's before sending it to cudf::fill in my case, as it should be a valid bitmask and corresponding null_count. This makes sense, and it broke one of my tests once I removed a workaround for null_count I had added, so I had to debug.

java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/Cuda.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/native/src/CudaJni.cpp Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
@abellina
Copy link
Contributor Author

abellina commented Jul 1, 2019

@tgravescs comments addressed.

@abellina
Copy link
Contributor Author

abellina commented Jul 2, 2019

@kkraus14 I think we are good to go. Also, should we just not do [skip ci] s.t. pull requests don't get stuck here?

@kkraus14 kkraus14 merged commit 0254381 into rapidsai:branch-0.9 Jul 2, 2019
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Java) Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond Java Affects Java cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants