[Datasets] Add support for slicing Arrow blocks that contain tensor columns. #19494

clarkzinzow
Contributor

Adds support for slicing ArrowBlocks that contain tensor columns.

Our current method of copying a table during block slicing breaks tensor column support: ChunkedArray.to_pandas() doesn't respect our extension array and returns a Pandas Series with an opaque np.object_ dtype, which causes later table construction to fail.

This PR fixes the problem by copying the table through a Pandas DataFrame roundtrip, pyarrow.Table.from_pandas(view.to_pandas()), which preserves the extension type metadata.

Related Issue

Closes #19476

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 18, 2021
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 18, 2021
@clarkzinzow
Contributor Author

clarkzinzow commented Oct 19, 2021

It seems as if the failing RayDP test is unrelated? It looks like the Java gateway is failing to launch: https://buildkite.com/ray-project/ray-builders-pr/builds/16131#5696868f-e647-48a8-a9d5-61bf0379b387

@clarkzinzow
Contributor Author

Yep it should be unrelated: #19471

@clarkzinzow clarkzinzow added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 19, 2021
@ericl ericl merged commit ad03917 into ray-project:master Oct 19, 2021
xwjiang2010 added a commit that referenced this pull request Oct 19, 2021
ericl pushed a commit that referenced this pull request Oct 19, 2021
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Oct 20, 2021
Successfully merging this pull request may close these issues.

[Datasets] Copying within Arrow block slicing breaks tensor column support.