fix: support typetracer in `ak.str.` operations #2679

agoose77 · 2023-08-29T13:38:56Z

Fixes #2675 by introducing a typetracer branch in a new _apply_through_arrow helper. This helper only handles the single-argument case, because the multi-array cases are more complex (they need finer control over how their arguments are converted to Arrow).

Some PyArrow functions have bugs that we have to work around by manually creating a layout from a form.
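
As a rough illustration of what a typetracer branch in such a helper might look like, the toy sketch below avoids Awkward and PyArrow entirely. `ToyArray`, `toy_split_whitespace`, and `apply_through_kernel` are hypothetical names, not part of either library. The assumption is that a typetracer array knows its type but holds no data, so the helper runs the kernel on a length-zero stand-in purely to learn the output type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyArray:
    # Hypothetical stand-in for a layout; data=None plays the role of a
    # typetracer array (type known, no values).
    type: str
    data: Optional[list] = None


def toy_split_whitespace(strings):
    # Stand-in for an Arrow compute kernel: returns (output type, values).
    return "list[string]", [s.split() for s in strings]


def apply_through_kernel(kernel, array):
    if array.data is None:
        # Typetracer branch: call the kernel on a length-zero input only to
        # discover the output type, then discard the (empty) values.
        out_type, _ = kernel([])
        return ToyArray(out_type, None)
    # Concrete branch: hand the data to the kernel and wrap the result.
    out_type, values = kernel(array.data)
    return ToyArray(out_type, values)
```

In this sketch both branches report the same output type, which is the property a typetracer-aware helper needs to preserve.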

agoose77 · 2023-08-29T13:40:36Z

@jpivarski I've noticed that PyArrow doesn't seem to return properly sized offsets buffers for the following:

import pyarrow as pa
import pyarrow.compute as pc

print(
    pc.utf8_split_whitespace(
        pa.array([], type=pa.large_string()),
    ).buffers()
)

print(pa.array([], type=pa.large_string()).buffers())

Could you give this code block a once-over and confirm that we should expect both offsets buffers to have size >= 8 bytes, i.e. the size of a single 64-bit offset? I assume this is a PyArrow bug, but I want to be sure before reporting it.

jpivarski · 2023-08-29T14:55:15Z

It looks like pc.utf8_split_whitespace on arrays of large_string is returning arrays of lists of string.

>>> [None if x is None else len(x)
...  for x in pa.array([], type=pa.large_string()).buffers()]
[None, 8, 0]
>>> # string-mask (None), string-offsets (8 bytes), string-data (0 bytes)

>>> [None if x is None else len(x)
...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).buffers()]
[None, 4, None, 4, 4]
>>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

versus

>>> [None if x is None else len(x)
...  for x in pa.array([], type=pa.string()).buffers()]
[None, 4, 0]
>>> # string-mask (None), string-offsets (4 bytes), string-data (0 bytes)

>>> [None if x is None else len(x)
...  for x in pc.utf8_split_whitespace(pa.array([], type=pa.string())).buffers()]
[None, 4, None, 4, 4]
>>> # list-mask (None), list-offsets (4), string-mask (None), string-offsets (4), string-data (4???)

That's not obviously wrong: we have plenty of functions that take Index32 inputs and return Index64 outputs. Arrow's bias is toward 32-bit; 64-bit was a late addition before version 1.0. Maybe Arrow is even deciding on the index size of the output based on the lengths of the string values, which would be a problem for our Form-stability, but we can cast the outputs if we need to. We could just have a policy of always turning Arrow 32-bit output indexes into 64-bit—that way, it would never be value-dependent.

Nope, it really is an Arrow bug: the declared type is still large_string, despite the fact that the buffer is 4 bytes long:

>>> pc.utf8_split_whitespace(pa.array([], type=pa.large_string())).type
ListType(list<item: large_string>)
>>> pc.utf8_split_whitespace(pa.array([], type=pa.string())).type
ListType(list<item: string>)

It's definitely wrong for Arrow to declare the type large_string and have only 4 bytes for the offset. (It needs a single [0] for the offsets of an empty string, just like Awkward.) The large_ means 8-byte index.

I also find it a bit odd that the string data for the empty lists of strings returned by pc.utf8_split_whitespace is non-empty (the "???" above). I don't know what Arrow's rules are for larger-than-needed buffers; they would be allowed in Awkward, but it's unexpected here because I don't see what good it does.
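
To make the offsets argument above concrete, here is a minimal, library-free sketch. `build_offsets` and `offsets_buffer_is_large_enough` are hypothetical helpers written for this comment, not Arrow or Awkward API. They only encode the expectation stated above: an array of `length` strings needs `length + 1` offsets, each 4 bytes wide for string and 8 bytes wide for large_string.

```python
import struct

# Offset width in bytes -> little-endian struct format
# (string uses int32 offsets, large_string uses int64 offsets).
OFFSET_FORMATS = {4: "<i", 8: "<q"}


def build_offsets(strings, offset_width):
    # Offsets always start at 0 and have one more entry than there are strings.
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + len(s.encode("utf-8")))
    fmt = OFFSET_FORMATS[offset_width]
    return b"".join(struct.pack(fmt, o) for o in offsets)


def offsets_buffer_is_large_enough(nbytes, length, offset_width):
    # A buffer shorter than (length + 1) offsets cannot hold the leading 0.
    return nbytes >= (length + 1) * offset_width
```

Under these assumptions, the 4-byte string-offsets buffer reported for the large_string result fails the check with `offset_width=8`, while the 8-byte buffer from `pa.array([], type=pa.large_string())` passes.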

agoose77 · 2023-08-29T14:57:50Z

> It's definitely wrong for Arrow to declare the type large_string and have only 4 bytes for the offset. (It needs a single [0] for the offsets of an empty string, just like Awkward.) The large_ means 8-byte index.

Good, that matches my understanding here. I'll file a bug with PyArrow.

agoose77 · 2023-08-30T11:11:00Z

I've rebased the commits so that they can be applied in sequence. That might make per-commit reviewing easier than going through the whole changeset in one go.

codecov · 2023-08-30T11:20:38Z

Codecov Report

Merging #2679 (65710fe) into main (59d4235) will increase coverage by 0.03%.
The diff coverage is 99.25%.

Additional details and impacted files

Files Changed	Coverage Δ
src/awkward/operations/str/akstr_extract_regex.py	`100.00% <ø> (ø)`
src/awkward/operations/str/__init__.py	`99.11% <98.79%> (+2.48%)`	⬆️
src/awkward/operations/str/akstr_index_in.py	`97.22% <100.00%> (+0.34%)`	⬆️
src/awkward/operations/str/akstr_is_in.py	`97.22% <100.00%> (+0.34%)`	⬆️
src/awkward/operations/str/akstr_join.py	`92.85% <100.00%> (-0.48%)`	⬇️
.../awkward/operations/str/akstr_join_element_wise.py	`96.42% <100.00%> (+0.13%)`	⬆️
src/awkward/operations/str/akstr_repeat.py	`95.00% <100.00%> (-0.13%)`	⬇️
src/awkward/operations/str/akstr_slice.py	`100.00% <100.00%> (ø)`
src/awkward/operations/str/akstr_split_pattern.py	`100.00% <100.00%> (ø)`
...wkward/operations/str/akstr_split_pattern_regex.py	`100.00% <100.00%> (ø)`
... and 2 more

... and 2 files with indirect coverage changes

jpivarski

string_to32=True, bytestring_to32=True is a fine way to solve the problem of Arrow functions not being properly implemented for 64-bit.

Solving the typetracer problem in one place, through _apply_through_arrow, is a good idea, but I've noticed that only some of the functions use it; others use to_arrow and from_arrow manually. Is it the case that only some of them need it? (There are several different classes of function signatures, I know.) Even if only some of them need it, could they be made to all go through the same helper function, even if that helper function has a trivial action on the ones that don't need it, for regularity?

(If there's a good reason not to, then that's fine.)
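
For readers unfamiliar with what passing string_to32=True amounts to, the sketch below shows the general idea on raw offsets buffers: narrow 64-bit offsets to 32-bit before handing them to a kernel, then widen the result back to 64-bit so the output width never depends on the values. `narrow_offsets` and `widen_offsets` are hypothetical helpers using only the standard library; this is an illustration of the idea, not how Awkward implements the conversion.

```python
import struct

INT32_MAX = 2**31 - 1


def narrow_offsets(offsets64):
    # Convert little-endian int64 offsets to int32, refusing values that
    # would not fit in 32 bits.
    count = len(offsets64) // 8
    values = struct.unpack(f"<{count}q", offsets64)
    if max(values, default=0) > INT32_MAX:
        raise OverflowError("offsets do not fit in 32 bits")
    return struct.pack(f"<{count}i", *values)


def widen_offsets(offsets32):
    # Convert little-endian int32 offsets to int64; this can never overflow.
    count = len(offsets32) // 4
    values = struct.unpack(f"<{count}i", offsets32)
    return struct.pack(f"<{count}q", *values)
```

Narrowing can fail for very large string data, which is why the overflow check is there; widening is always safe.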

jpivarski

Great! This unblocks dask-awkward, so that it can wrap these functions.

All of the changes are things that look plausible to me; having to coerce Forms in a few cases, dealing with Arrow sometimes adding option, sometimes not (different versions), 32-bit versus 64-bit, etc.

We now have a pyarrow==7 test to make sure that this code is compliant with our range of Arrow versions.

So it all looks good! Merge when you're ready!

agoose77 mentioned this pull request Aug 29, 2023

[C++] "utf8_split_whitespace" kernel returned offset buffers are too small for large string in case of empty array apache/arrow#37437

Closed

agoose77 added 2 commits August 30, 2023 11:47

feat: add ability to roundtrip through arrow

95f693f

refactor: use roundtrip and generate bitmasks

42ecd62

agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from b6f6ad0 to 0e2dd69 Compare August 30, 2023 11:01

feat: add typetracer support to extract_regex

5159cdb

agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from 0e2dd69 to e2dc138 Compare August 30, 2023 11:04

agoose77 added 11 commits August 30, 2023 12:05

feat: add typetracer support to is_in

f6afaf8

feat: add typetracer support to index_in

d35fb4b

feat: add typetracer support to join

1e7230d

feat: add typetracer support to repeat

2147596

feat: add typetracer support to slice

6aac44f

feat: add typetracer support to split_pattern

d6df16b

feat: add typetracer support to split_pattern_regex

799947a

feat: add typetracer support to join_element_wise

561d5c9

refactor: use _apply_through_arrow to to_categorical

68c8273

feat: add typetracer support to split_whitespace

db0e105

test: ensure simple functions support typetracer

7c62d3f

agoose77 force-pushed the agoose77/fix-arrow-str-typetracer branch from e2dc138 to 7c62d3f Compare August 30, 2023 11:10

agoose77 marked this pull request as ready for review August 30, 2023 11:11

agoose77 requested a review from jpivarski August 30, 2023 11:11

test: add simple test for typetracer to_categorical

1fc7db9

jpivarski approved these changes Aug 30, 2023

refactor: cleanup option handling

12ebf50

fix: preserve categorical parameters

cb0249f

Merge branch 'main' into agoose77/fix-arrow-str-typetracer

35e8977

agoose77 added 2 commits September 1, 2023 12:25

refactor: re-use apply_through_arrow

7ff0e59

Merge remote-tracking branch 'origin/agoose77/fix-arrow-str-typetrace…

65710fe

…r' into agoose77/fix-arrow-str-typetracer

agoose77 requested a review from jpivarski September 1, 2023 11:25

jpivarski approved these changes Sep 1, 2023

agoose77 merged commit 81fc063 into main Sep 1, 2023
34 checks passed

agoose77 deleted the agoose77/fix-arrow-str-typetracer branch September 1, 2023 20:11
