Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support typetracer in ak.str. operations #2679

Merged
merged 20 commits into from
Sep 1, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Aug 29, 2023

Fixes #2675 by introducing a typetracer branch in a new _apply_through_arrow helper. This helper only handles the single-argument case, because the multi-array cases are more complex (they need finer control over how things get converted to arrow).

Some functions in PyArrow have bugs that require us to workaround by manually creating a layout from a form.

@agoose77
Copy link
Collaborator Author

@jpivarski I've noticed that PyArrow doesn't seem to return proper-sized offsets for the following

import pyarrow as pa
import pyarrow.compute as pc

print(
    pc.utf8_split_whitespace(
        pa.array([], type=pa.large_string()),
    ).buffers()
)

print(pa.array([], type=pa.large_string()).buffers())

Could you give this code-block a once-over, and confirm that we should expect both sets of offsets buffers to have size>=8, i.e. the size of a single offset (64 bit)? I assume this is a PyArrow bug, but I want to be sure before reporting.

@agoose77 agoose77 temporarily deployed to docs-preview August 29, 2023 13:52 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

It looks like pc.utf8_split_whitespace on arrays of large_string is returning arrays of lists of string.

>>> [None if x is None else len(x)
...  for x in pa.array([], type=pa.large_string()).buffers()]
[None, 8, 0]
>>> # string-mask (None), string-offsets (8 bytes), string-data (0 bytes)

>>> [None if x is None else len(x)
...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).buffers()]
[None, 4, None, 4, 4]
>>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

versus

>>> [None if x is None else len(x)
...  for x in pa.array([], type=pa.string()).buffers()]
[None, 4, 0]
>>> # string-mask (None), string-offsets (4 bytes), string-data (0 bytes)

>>> [None if x is None else len(x)
...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.string())).buffers()]
[None, 4, None, 4, 4]
>>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

That's not obviously wrong: we have plenty of functions that take Index32 inputs and return Index64 outputs. Arrow's bias is toward 32-bit; 64-bit was a late addition before version 1.0. Maybe Arrow is even deciding on the index size of the output based on the lengths of the string values, which would be a problem for our Form-stability, but we can cast the outputs if we need to. We could just have a policy of always turning Arrow 32-bit output indexes into 64-bit—that way, it would never be value-dependent.


Nope, it really is an Arrow bug: the declared type is still large_string, despite the fact that the buffer is 4 bytes long:

>>> pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).type
ListType(list<item: large_string>)
>>> pc.utf8_split_whitespace(pa.array([], type=pa.string())).type
ListType(list<item: string>)

It's definitely wrong for Arrow to declare the type large_string and have only 4 bytes for the offset. (It needs a single [0] for the offsets of an empty string, just like Awkward.) The large_ means 8-byte index.

I also find it a bit odd that the string data for the empty lists of strings returned by pc.utf8_split_whitespace is non-empty (the "???" above). I don't know what rules for too-large buffers Arrow allows; it would be allowed in Awkward, but perhaps unexpected in this case because I don't see what good it might do.

@agoose77
Copy link
Collaborator Author

It's definitely wrong for Arrow to declare the type large_string and have only 4 bytes for the offset. (It needs a single [0] for the offsets of an empty string, just like Awkward.) The large_ means 8-byte index.

Good, that matches my understanding here. I'll file a bug with PyArrow.

@agoose77 agoose77 temporarily deployed to docs-preview August 30, 2023 10:53 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from b6f6ad0 to 0e2dd69 Compare August 30, 2023 11:01
@agoose77 agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from 0e2dd69 to e2dc138 Compare August 30, 2023 11:04
@agoose77 agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from e2dc138 to 7c62d3f Compare August 30, 2023 11:10
@agoose77
Copy link
Collaborator Author

I've rebased the commits such that they could be applied in sequence. That might make for easier per-commit reviewing than the whole changeset in one go.

@agoose77 agoose77 marked this pull request as ready for review August 30, 2023 11:11
@codecov
Copy link

codecov bot commented Aug 30, 2023

Codecov Report

Merging #2679 (65710fe) into main (59d4235) will increase coverage by 0.03%.
The diff coverage is 99.25%.

Additional details and impacted files
Files Changed Coverage Δ
src/awkward/operations/str/akstr_extract_regex.py 100.00% <ø> (ø)
src/awkward/operations/str/__init__.py 99.11% <98.79%> (+2.48%) ⬆️
src/awkward/operations/str/akstr_index_in.py 97.22% <100.00%> (+0.34%) ⬆️
src/awkward/operations/str/akstr_is_in.py 97.22% <100.00%> (+0.34%) ⬆️
src/awkward/operations/str/akstr_join.py 92.85% <100.00%> (-0.48%) ⬇️
.../awkward/operations/str/akstr_join_element_wise.py 96.42% <100.00%> (+0.13%) ⬆️
src/awkward/operations/str/akstr_repeat.py 95.00% <100.00%> (-0.13%) ⬇️
src/awkward/operations/str/akstr_slice.py 100.00% <100.00%> (ø)
src/awkward/operations/str/akstr_split_pattern.py 100.00% <100.00%> (ø)
...wkward/operations/str/akstr_split_pattern_regex.py 100.00% <100.00%> (ø)
... and 2 more

... and 2 files with indirect coverage changes

@agoose77 agoose77 temporarily deployed to docs-preview August 30, 2023 11:20 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string_to32=True, bytestring_to32=True is a fine way to solve the problem of Arrow functions not being properly implemented for 64-bit.

Solving the typetracer problem in one place, through _apply_through_arrow, is a good idea, but I've noticed that only some of the functions use it; others use to_arrow and from_arrow manually. Is it the case that only some of them need it? (There are several different classes of function signatures, I know.) Even if only some of them need it, could they be made to all go through the same helper function, even if that helper function has a trivial action on the ones that don't need it, for regularity?

(If there's a good reason not to, then that's fine.)

@agoose77 agoose77 temporarily deployed to docs-preview September 1, 2023 10:37 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview September 1, 2023 10:46 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview September 1, 2023 11:07 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview September 1, 2023 11:38 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This unblocks dask-awkward, so that it can wrap these functions.

All of the changes are things that look plausible to me; having to coerce Forms in a few cases, dealing with Arrow sometimes adding option, sometimes not (different versions), 32-bit versus 64-bit, etc.

We now have a pyarrow==7 test to make sure that this code is compliant with our range of Arrow versions.

So it all looks good! Merge when you're ready!

@agoose77 agoose77 merged commit 81fc063 into main Sep 1, 2023
34 checks passed
@agoose77 agoose77 deleted the agoose77/fix-arrow-str-typetracer branch September 1, 2023 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix typetracer for string operations
2 participants