Add functionality to apply Dtype metadata to ColumnBase #8373
Conversation
Just realized this is targeting …
Just noticed that a pretty significant refactor of the dtype metadata copying got merged yesterday (#8278), and now the …
I just chatted briefly with @charlesbluca about this code. I'm not super familiar with this function yet, so @shwina it would be good to sync with you before I make a final recommendation, because I suspect that I'm missing some details here. What I'm noticing thus far suggests that the fact that it actually copies the passed column rather than simply applying a metadata change is an unexpected and IMO undesirable divergence from …
@vyasr as I mentioned in #8333 (comment), I really think we just need the one method …
for i in range(len(self.base_children))
)
other.set_base_children(base_children)
other = other._apply_type_metadata(self.dtype)
This method, along with _copy_type_metadata_from_arrow, is now a one-liner since all the logic is handled in _apply_type_metadata. Is it worth it to keep the copy functions, or should we just replace all their appearances with the corresponding _apply_type_metadata call?
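For illustration, here is a minimal sketch of the one-liner pattern being discussed; the method names follow this PR's diff, but the class body is hypothetical rather than the actual cudf source:

class ColumnBase:
    def _apply_type_metadata(self, dtype):
        # Return a new column with the metadata of `dtype` applied;
        # all of the real logic lives here (omitted in this sketch).
        return self

    def _copy_type_metadata(self, other):
        # With the logic above in place, the copy helper reduces to a
        # single delegating line, matching the diff line shown above.
        return other._apply_type_metadata(self.dtype)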
If they are really the same thing I'd vote to remove the old ones, less clutter, less confusion.
I think it's better to avoid the indirection and just delete the methods.
pa.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
cudf.core.column.as_column(
    [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
),
This test gave a somewhat cryptic warning:

cudf/tests/test_column.py::test_as_column_arrow_array[data2-expected2]
  /home/charlesbluca/Documents/GitHub/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pandas/core/dtypes/missing.py:484: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
    if np.any(np.asarray(left_value != right_value)):

-- Docs: https://docs.pytest.org/en/stable/warnings.html

Is there something happening in the as_column operation here that would be causing the elementwise comparison to fail?
assert_eq does its final checking in a loop in Python/pandas land. It's probably trying to compare two non-scalar objects ultimately, and something weird is happening. I'd suggest calling to_pandas on the cuDF side of the comparison and seeing what the things that are being compared look like in the end.
Checking out to_pandas shows the two objects do look the same:
>>> cudf.Series(actual_column).to_pandas()
0 [[1, 2, 3], [4, 5, 6]]
1 [[7, 8, 9], [10, 11, 12]]
dtype: object
>>> cudf.Series(expected).to_pandas()
0 [[1, 2, 3], [4, 5, 6]]
1 [[7, 8, 9], [10, 11, 12]]
dtype: object
Could this warning just be a consequence of comparing list columns with nested lists?
Probably. When cudf objects are converted to pandas, they first go through pyarrow. When converting from the Arrow format to pandas, nested lists are turned into nested arrays:
>>> arr = pa.array([[1, 2, 3]])
>>> arr
<pyarrow.lib.ListArray object at 0x7f43187e8a60>
[
[
1,
2,
3
]
]
>>> arr.to_pandas()
0 [1, 2, 3]
dtype: object
>>> arr.to_pandas().iloc[0]
array([1, 2, 3])
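For reference, a minimal standalone sketch of the kind of comparison that produces this DeprecationWarning on older NumPy versions; whether this is exactly what happens inside assert_eq for this test is an assumption on my part:

import numpy as np

# After the arrow -> pandas conversion above, each element of the object
# Series is itself an ndarray, and pandas compares elements with
# `left_value != right_value`. If the two sides cannot be broadcast against
# each other, older NumPy versions emit "DeprecationWarning: elementwise
# comparison failed; this will raise an error in the future." and return a
# scalar instead of an elementwise result (newer NumPy versions raise).
left_value = np.array([[1, 2, 3], [4, 5, 6]])
right_value = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(left_value != right_value)  # warns and prints True on older NumPy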
# in some cases, libcudf will return an empty ListColumn with no
# indices; in these cases, we must manually set the base_size to 0 to
# avoid it being negative
return max(0, len(self.base_children[0]) - 1)
Just noting this change: @shwina and I noticed that in some test cases, libcudf was returning an empty ListColumn with no indices (i.e. self.base_children[0] is just an empty NumericalColumn). Before, this would result in a ListColumn with a base_size of -1, which was breaking some test cases where this refactor required us to reconstruct the empty ListColumn using this erroneous size. This change ensures that ListColumn.base_size will always be at least 0.
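As a quick worked example of the arithmetic behind that guard, using plain Python lists with hypothetical offset values rather than actual cudf columns:

# A list column with n rows normally stores n + 1 offsets, so
# len(offsets) - 1 gives the number of rows.
offsets_normal = [0, 3, 6]               # two rows
print(max(0, len(offsets_normal) - 1))   # 2

# When libcudf returns an empty ListColumn with no offsets at all,
# len(offsets) - 1 would be -1; the max(0, ...) guard clamps it to 0.
offsets_empty = []
print(max(0, len(offsets_empty) - 1))    # 0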
Codecov Report
@@           Coverage Diff            @@
##           branch-21.08     #8373   +/-   ##
===============================================
  Coverage               ?    82.89%
===============================================
  Files                  ?       110
  Lines                  ?     18099
  Branches               ?         0
===============================================
  Hits                   ?     15004
  Misses                 ?      3095
  Partials               ?         0

Continue to review full report at Codecov.
@charlesbluca Sorry it's been a few days since I looked at this - can you merge the latest and rerun tests?
ugh I did the thing where I said "rerun t*sts" in a comment so it did it anyways :(
No worries 😄 looks like all tests are passing.
LGTM! Great work, @charlesbluca !
Looks great!
@gpucibot merge
…ing (#430)

This PR contains three distinct changes required to get cuspatial builds working and tests passing again:

1. RMM switched to rapids-cmake (rapidsai/rmm#800), which requires CMake 3.20.1, so this PR includes the required updates for that.
2. The Arrow upgrade in cudf also moved the location of testing utilities (rapidsai/cudf#7495). Long term, cuspatial needs to move away from use of the testing utilities, which are not part of cudf's public API, but we are currently blocked by rapidsai/cudf#8646, so this PR just imports the internal `assert_eq` method as a stopgap to get tests passing.
3. The changes in rapidsai/cudf#8373 altered the way that metadata was propagated to libcudf outputs from previously existing cuDF Python objects. The new code paths require cuspatial to override metadata copying at the GeoDataFrame rather than the GeoColumn level, because information about column types is lost in the libcudf round trip and the metadata copying functions are now called on the output DataFrame rather than the input one.

This PR supersedes #427, #428, and #429, all of which can now be closed.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Christopher Harris (https://github.com/cwharris)

URL: #430
Based on discussion on #8333, this PR:

- adds _with_type_metadata() to ColumnBase to return a new column with the metadata of dtype applied
- removes _copy_type_metadata[_from_arrow]() and uses this function in their place

These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object.
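As a rough illustration of the underlying idea using only public cudf API (the internal _with_type_metadata method is deliberately not called here, and the snippet assumes a working cudf installation):

import cudf

# The dtype object alone carries the metadata (here, the category values)
# that previously required access to the other column.
source = cudf.Series(["a", "b", "a"], dtype="category")
metadata_dtype = source.dtype            # a cudf CategoricalDtype
print(metadata_dtype.categories)         # the stored categories

# Reapplying that metadata to plain data needs only the dtype object:
plain = cudf.Series(["b", "a", "b"])
restored = plain.astype(metadata_dtype)
print(restored.dtype)                    # category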