Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for __arrow_array__ #2650

Merged
merged 9 commits into from
Aug 15, 2023
Merged

Conversation

agoose77
Copy link
Collaborator

Fixes #2647

@agoose77 agoose77 marked this pull request as ready for review August 15, 2023 18:26
@@ -1325,6 +1325,14 @@ def __array__(self, *args, **kwargs):
with ak._errors.OperationErrorContext("numpy.asarray", (self, *args), kwargs):
return ak._connect.numpy.convert_to_array(self._layout, args, kwargs)

def __arrow_array__(self, *args, **kwargs):
with ak._errors.OperationErrorContext(
f"{type(self).__name__}.__arrow_array__", (self, *args), kwargs
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be clear that this is Awkward's code, not pyarrow.array(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the error message would show up in ak.to_arrow, since you're calling the high-level version of that and not its _impl. But users may be confused because they don't have any ak.to_arrow in their own code, so this helps.

@codecov
Copy link

codecov bot commented Aug 15, 2023

Codecov Report

Merging #2650 (dfa5d06) into main (a960a56) will decrease coverage by 0.01%.
The diff coverage is 72.22%.

Additional details and impacted files
Files Changed Coverage Δ
src/awkward/operations/str/__init__.py 96.62% <50.00%> (-2.20%) ⬇️
src/awkward/highlevel.py 76.83% <66.66%> (-0.15%) ⬇️
src/awkward/_connect/pyarrow.py 91.23% <100.00%> (+0.08%) ⬆️

@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 18:48 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of the PR and how it's implemented, but something needs to be done about the Arrow bug it triggers in old versions of Arrow (see below).

@@ -1325,6 +1325,14 @@ def __array__(self, *args, **kwargs):
with ak._errors.OperationErrorContext("numpy.asarray", (self, *args), kwargs):
return ak._connect.numpy.convert_to_array(self._layout, args, kwargs)

def __arrow_array__(self, *args, **kwargs):
with ak._errors.OperationErrorContext(
f"{type(self).__name__}.__arrow_array__", (self, *args), kwargs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the error message would show up in ak.to_arrow, since you're calling the high-level version of that and not its _impl. But users may be confused because they don't have any ak.to_arrow in their own code, so this helps.

if args == () and kwargs == {}:
return out
else:
return pyarrow.array(out, *args, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your test passes, but I tried what ought to be the same thing and it didn't:

>>> import pyarrow as pa
>>> a = pa.array([1.1, 2.2, 3.3, 4.4, 5.5])
>>> b = pa.array(a, type=pa.float64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 317, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.DoubleScalar: 1.1> with type pyarrow.lib.DoubleScalar: tried to convert to double

(I was going to do a test to find out if wrapping an Arrow array with pa.array is zero-copy. It ought to be—I was just checking.)

It might be my pyarrow version (9.0.0): I checkout out this repo and tried to run the test and it failed (same error).

We currently support Arrow versions as low as 7.0.0. This is apparently a bug that they fixed in version $x$ such that $9 &lt; x \le 12$. We could increase that threshold, except that Arrow is an important requirement for basic things like saving and reading files, and other packages have a tendency of putting upper bounds on the Arrow version. (condastats wanted to lower it to Arrow 6, which made it tricky to build an environment in which conda stats can actually be analyzed.)

We can't pass pa.array arguments into to_arrow because to_arrow doesn't usually call pa.array directly; it uses pa.Array.from_buffers, which doesn't take the same arguments. Arrow version $x$ is only needed for the else case of this if-else: maybe check the Arrow version there? If it's less than $x$, raise an error message saying that passing arguments requires Arrow version $x$?

Or maybe that's overkill and we should just explain that it's an Arrow bug, fixed in version $x$, when our users encounter this. It would at least require an Arrow version check in the test, so that we can run the tests with lower versions of Arrow. (And maybe that needs to be added to the CI matrix, not just NumPy minimum version, but Arrow, too.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, my choice would be for the last one: just ensure that our tests work with the Arrow versions we say we support, even if the Arrow version we support has known bugs when some (not all) operations are attempted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot; I assumed that this part of the Arrow API was more stable than this. It looks like only 12.0.0 supports the casting that is performed by the test case in question. I've made it skipif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the other tests weren't working for me, so I've added a minimum pyarrow to the build matrix. We don't depend on it quite as much as NumPy, but increasingly, and it has more radical changes from one version to the next.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two of these failures are the new tests I was expecting, but some others are from pyarrow string functions. I'm going to test those locally now.

https://github.com/scikit-hep/awkward/actions/runs/5871689180/job/15921684610?pr=2650

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're all fine in my pyarrow 9.0.0, but apparently not in the pyarrow 7.0.0 that we say we support. It was good to test!

They're encoding mistakes. I can fix the string functions (in this PR) while you fix the casting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 20:54 — with GitHub Actions Inactive
@jpivarski jpivarski temporarily deployed to docs-preview August 15, 2023 21:08 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 21:30 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

I think this one will pass because the only error in your last commit was in the strings.

Feel free to merge when you're ready!

@jpivarski jpivarski temporarily deployed to docs-preview August 15, 2023 21:44 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit 961637e into main Aug 15, 2023
34 checks passed
@agoose77 agoose77 deleted the agoose77/feat-arrow-protocol branch August 15, 2023 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support pyarrow.array(ak.Array(...))
2 participants