fix: support typetracer in `ak.str.*` operations #2679
@jpivarski I've noticed that PyArrow doesn't seem to return proper-sized offsets for the following:

    import pyarrow as pa
    import pyarrow.compute as pc

    print(
        pc.utf8_split_whitespace(
            pa.array([], type=pa.large_string()),
        ).buffers()
    )
    print(pa.array([], type=pa.large_string()).buffers())

Could you give this code-block a once-over, and confirm that we should expect both sets of offsets buffers to have
It looks like

    >>> [None if x is None else len(x)
    ...  for x in pa.array([], type=pa.large_string()).buffers()]
    [None, 8, 0]
    >>> # string-mask (None), string-offsets (8 bytes), string-data (0 bytes)
    >>> [None if x is None else len(x)
    ...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).buffers()]
    [None, 4, None, 4, 4]
    >>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

versus

    >>> [None if x is None else len(x)
    ...  for x in pa.array([], type=pa.string()).buffers()]
    [None, 4, 0]
    >>> # string-mask (None), string-offsets (4 bytes), string-data (0 bytes)
    >>> [None if x is None else len(x)
    ...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.string())).buffers()]
    [None, 4, None, 4, 4]
    >>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

That's not obviously wrong: we have plenty of functions that take

Nope, it really is an Arrow bug: the declared type is still

    >>> pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).type
    ListType(list<item: large_string>)
    >>> pc.utf8_split_whitespace(pa.array([], type=pa.string())).type
    ListType(list<item: string>)

It's definitely wrong for Arrow to declare the type

I also find it a bit odd that the string data for the empty lists of strings returned by
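The mismatch can be checked with simple arithmetic, without PyArrow installed. The sketch below assumes the nominal Arrow layout, in which a variable-length array of length `n` carries `n + 1` offsets of 4 bytes each (`string`, `list`) or 8 bytes each (`large_string`, `large_list`). Implementations may treat zero-length arrays more loosely, so this is a sanity check rather than a statement of what PyArrow guarantees. The names `expected_offsets_nbytes`, `observed`, `cases`, and `results` are illustrative only, and the numbers are copied from the sessions above.

```python
# Nominal size of an Arrow offsets buffer: (length + 1) entries of
# 4 bytes (int32: string, list) or 8 bytes (int64: large_string, large_list).
def expected_offsets_nbytes(length: int, large: bool) -> int:
    itemsize = 8 if large else 4
    return (length + 1) * itemsize


# Buffer sizes copied from the REPL sessions in this thread (None = no buffer).
observed = {
    "large_string": [None, 8, 0],
    "string": [None, 4, 0],
    "split(large_string)": [None, 4, None, 4, 4],
    "split(string)": [None, 4, None, 4, 4],
}

# (label, index of the string-offsets buffer, is the declared string type "large"?)
cases = [
    ("large_string", 1, True),
    ("string", 1, False),
    ("split(large_string)", 3, True),
    ("split(string)", 3, False),
]

# Every array in the thread has length 0, so exactly one offset entry is expected.
results = {
    label: observed[label][index] == expected_offsets_nbytes(0, large)
    for label, index, large in cases
}

for label, ok in results.items():
    print(f"{label}: {'consistent' if ok else 'MISMATCH'}")
```

Only the `split(large_string)` case is flagged: its declared item type is `large_string`, but the string-offsets buffer is 4 bytes wide rather than 8.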
Good, that matches my understanding here. I'll file a bug with PyArrow.
b6f6ad0 to 0e2dd69
0e2dd69 to e2dc138
feat: add typetracer support to `index_in`
e2dc138 to 7c62d3f
I've rebased the commits such that they could be applied in sequence. That might make for easier per-commit reviewing than the whole changeset in one go.
`string_to32=True, bytestring_to32=True` is a fine way to solve the problem of Arrow functions not being properly implemented for 64-bit.

Solving the typetracer problem in one place, through `_apply_through_arrow`, is a good idea, but I've noticed that only some of the functions use it; others use `to_arrow` and `from_arrow` manually. Is it the case that only some of them need it? (There are several different classes of function signatures, I know.) Even if only some of them need it, could they all be made to go through the same helper function, for regularity, even if that helper has a trivial action on the ones that don't need it?

(If there's a good reason not to, then that's fine.)
…r' into agoose77/fix-arrow-str-typetracer
Great! This unblocks dask-awkward, so that it can wrap these functions.
All of the changes are things that look plausible to me; having to coerce Forms in a few cases, dealing with Arrow sometimes adding option, sometimes not (different versions), 32-bit versus 64-bit, etc.
We now have a `pyarrow==7` test to make sure that this code is compliant with our range of Arrow versions.
So it all looks good! Merge when you're ready!
Fixes #2675 by introducing a typetracer branch in a new `_apply_through_arrow` helper. This helper only handles the single-argument case, because the multi-array cases are more complex (they need finer control over how things get converted to Arrow).

Some functions in PyArrow have bugs that we have to work around by manually creating a layout from a form.
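As a rough illustration of what a "typetracer branch" in a single-argument helper can look like, here is a toy sketch in plain Python. It is not the code in this PR and uses no Awkward or PyArrow APIs: `Concrete`, `Traced`, `split_whitespace`, and `apply_through` are all hypothetical stand-ins. The assumption being illustrated is that, when the input carries only type information, the helper can run the kernel on a zero-length array of the same type and keep just the resulting type.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Concrete:
    """Toy array with a type label and real data (hypothetical, not an Awkward class)."""
    type_name: str
    data: Tuple


@dataclass(frozen=True)
class Traced:
    """Toy data-less array: only the type is known (stand-in for a typetracer)."""
    type_name: str


def split_whitespace(array: Concrete) -> Concrete:
    """Toy kernel that only understands concrete data, like a compute function."""
    pieces = tuple(tuple(s.split()) for s in array.data)
    return Concrete(f"list[{array.type_name}]", pieces)


def apply_through(
    kernel: Callable[[Concrete], Concrete], array: Union[Concrete, Traced]
) -> Union[Concrete, Traced]:
    """Single-argument helper with a data-less branch."""
    if isinstance(array, Traced):
        # Typetracer branch: build a zero-length array of the same type,
        # run the kernel on it, and keep only the type of the result.
        empty = Concrete(array.type_name, ())
        return Traced(kernel(empty).type_name)
    # Ordinary branch: the kernel sees the real data.
    return kernel(array)


print(apply_through(split_whitespace, Traced("string")))
print(apply_through(split_whitespace, Concrete("string", ("a b", "c"))))
```

In this toy model the kernel never needs to know about data-less inputs, which is the appeal of handling them in one helper. It also shows why a kernel that reports a wrong type for empty input (as in the PyArrow case discussed above) would need a separate workaround.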