feat: move string features into core #2547

agoose77 · 2023-06-27T14:05:12Z

This PR tackes #1682 by moving existing behavior-based string logic into the core Awkward routines.

Adds __bytes__ to ak.Array, to replace the ByteBehavior.__bytes__
Reworks __getitem__ bypass logic
Removes __add__ and __radd__ support for chat/byte arrays.
Hardcodes ufunc handling for strings
Hardcodes typestr handling for strings / chars

agoose77 · 2023-06-27T14:28:36Z

From #873, whilst the __iter__ and __getitem__ logic on string arrays will give us Python strings, the same operations on arrays of char do not.

The reason for this is that #873 implements the uint8[] → str transformation in ak.Array.__getitem__. We can't do the same for char, because the NumpyArray._getitem_at operation on char returns an integer, with no metadata about the fact that it is a unicode code point.

The "obvious" solution is to make NumpyArray._getitem_at aware of char/byte, and return the Python type. But, for symmetry reasons that makes me wonder whether ListOffsetArray should do the same for strings, and the existing logic on ak.Array be removed.

I can't clearly see what the risks of doing this are. My hunch is that it would be fine; anything interacting with layouts on a per-item basis can pay the cost of string conversion. But, your thoughts are welcome here @jpivarski

jpivarski · 2023-06-27T18:52:07Z

I think you should go with the "obvious" solution of adding the parameter check in NumpyArray._getitem_at, just as it is in ListOffsetArray._getitem_at. We can say that the reason the check is duplicated is for performance: putting together a string by repeatedly calling NumpyArray._getitem_at and having it repeatedly check the parameters would be a waste when it can be done once at the list level.

The weirdness of having lists check for stringiness and return a str and also having list contents check for stringiness and return a str is due to Python: the characters of a string are not a different type from the string itself. Formally, you'd expect them to be different, as they are in C (char vs char*) or C++ (char vs std::string) or Java (char vs java.lang.String). The weirdness is evident in that this doesn't terminate or raise an error:

def iterate_over(thing):
    for x in thing:
        print(x)
        iterate_over(x)

iterate_over("a string")

But to comply with Pythonic expectations, both list types and NumpyArray should check parameters and return a str.

agoose77 · 2023-06-27T19:19:12Z

I think you should go with the "obvious" solution of adding the parameter check in NumpyArray._getitem_at, just as it is in ListOffsetArray._getitem_at. We can say that the reason the check is duplicated is for performance: putting together a string by repeatedly calling NumpyArray._getitem_at and having it repeatedly check the parameters would be a waste when it can be done once at the list level.

The weirdness of having lists check for stringiness and return a str and also having list contents check for stringiness and return a str is due to Python: the characters of a string are not a different type from the string itself. Formally, you'd expect them to be different, as they are in C (char vs char*) or C++ (char vs std::string) or Java (char vs java.lang.String). The weirdness is evident in that this doesn't terminate or raise an error:
def iterate_over(thing):
    for x in thing:
        print(x)
        iterate_over(x)

iterate_over("a string")
But to comply with Pythonic expectations, both list types and NumpyArray should check parameters and return a str.

Could you clarify this slightly? The ListOffsetArray (et al.) and NumpyArray cater to different types; char is set on NumpyArray, whilst string is set on ListOffsetArray. Whilst it's hard to get a bare NumpyArray(..., parameters={"__array__": "char"}), it's not impossible (ak.Array(ak.to_layout("hello"))), and it is also trivial to pull out a character from a string (ak.Array(["hello"])[0, 0]). Pulling out characters and whole strings should both involve casting to a Python type (str, and str again!).

The question for me is whether this should always happen in Content._getitem_at, rather than only enforced at the highlevel. We can't actually enforce the char logic at highlevel for the reasons outlined above.

jpivarski · 2023-06-27T20:15:14Z

Oh, that's right: it was handled in ak.Array.__getitem__ (high-level) because there are multiple Content._getitem paths that can get you to a string or char: direct _getitem_at and indirect _getitem_next. We don't want one to return a length-1 str and the other to return an np.uint8.

Yeah, this test only happens at high-level:

awkward/src/awkward/highlevel.py

Lines 950 to 958 in f2783ba

    
           out = self._layout[where] 
        
           if isinstance(out, ak.contents.NumpyArray): 
        
               array_param = out.parameter("__array__") 
        
               if array_param == "byte": 
        
                   return ak._util.tobytes(out._raw(numpy)) 
        
               elif array_param == "char": 
        
                   return ak._util.tobytes(out._raw(numpy)).decode( 
        
                       errors="surrogateescape" 
        
                   )

and not at low-level:

awkward/src/awkward/contents/listoffsetarray.py

Lines 290 to 303 in f2783ba

    
           def _getitem_at(self, where: IndexType): 
        
               # Wrap `where` by length 
        
               if not is_unknown_scalar(where) and where < 0: 
        
                   length_index = self._backend.index_nplike.shape_item_as_index(self.length) 
        
                   where += length_index 
        
               # Validate `where` 
        
               if not ( 
        
                   is_unknown_scalar(where) 
        
                   or self.length is unknown_length 
        
                   or (0 <= where < self.length) 
        
               ): 
        
                   raise ak._errors.index_error(self, where) 
        
               start, stop = self._offsets[where], self._offsets[where + 1] 
        
               return self._content._getitem_range(start, stop)

awkward/src/awkward/contents/numpyarray.py

Lines 295 to 303 in f2783ba

    
           try: 
        
               out = self._data[where] 
        
           except IndexError as err: 
        
               raise ak._errors.index_error(self, where, str(err)) from err 
        
           if hasattr(out, "shape") and len(out.shape) != 0: 
        
               return NumpyArray(out, parameters=None, backend=self._backend) 
        
           else: 
        
               return out

>>> del ak.behavior["char"]
>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>
>>> ak.Array(ak.to_layout("hellothere"))[0]
104

So the problem is conveying the information that the final result of the slice is a char from the last step of Content.__getitem__, which could be _getitem_at, _getitem_next, or (maybe?) something else, to the ak.Array.__getitem__ routine, which does not deeply inspect the Content's __getitem__ machinery, just the final result. By the time it gets a np.uint8, the information about it being a char is gone.

Well, NumpyArray._getitem_at, NumpyArray._getitem_next, etc. could return np.character...?

No: here's a better idea. Note that Awkward slicing does not make a distinction between string and bytestring: if you index the contents of a Unicode string, your indexes are not counting Unicode characters; they're counting bytes. That was a decision to keep the slicing rules simple; a Unicode-aware codepoint-index slice could be implemented in a future strings module. (It would need compiled routines to get all the variable-width UTF-8 rules right.)

>>> money = ak.Array(["$", "¢", "€", "💰"])

>>> ak.num(money)
<Array [1, 2, 3, 4] type='4 * int64'>

>>> money.layout
<ListOffsetArray len='4'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='5'>
        [ 0  1  3  6 10]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='10'>
        <parameter name='__array__'>'char'</parameter>
        [ 36 194 162 226 130 172 240 159 146 176]
    </NumpyArray></content>
</ListOffsetArray>

>>> money[0, 0]
36
>>> money[1, 0]
194
>>> money[1, 1]
162
>>> money[2, 0]
226
>>> money[2, 1]
130
>>> money[2, 2]
172
>>> money[3, 0]
240
>>> money[3, 1]
159
>>> money[3, 2]
146
>>> money[3, 3]
176

Considering that ak.Array.__getitem__ into a string does not give you Unicode characters, we can't expect it to give you a length-1 string. It's showing you part of a UTF-8 encoding.

So this could just be a policy decision: we always show individual-item selections of a string as integers. That's what Python 3 users expect of bytes but not str:

>>> list("$¢€💰")
['$', '¢', '€', '💰']

>>> list("$¢€💰".encode("utf-8"))
[36, 194, 162, 226, 130, 172, 240, 159, 146, 176]

The policy can be that Awkward returns integers from both bytestrings and strings because it's not really selecting characters, anyway. I see a stronger case that this is the right thing to do, though it's not obvious at first and we may find ourselves explaining it as a gotcha.

In other words,

>>> del ak.behavior["char"]

>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>

>>> ak.Array(ak.to_layout(["hello", "there"]))
<Array ['hello', 'there'] type='2 * string'>

is fine as-is. What do you think of that?

agoose77 · 2023-06-27T21:22:14Z

is fine as-is. What do you think of that?

In general, I strongly prefer this. I'd really like to drop as much magic from these routines as possible, so that we can more easily reason about what's going on here!

I was looking to modify NumpyArray.to_list() to return integers instead of bytestrings for char, so this all fits in very nicely.

src/awkward/_connect/avro.py

agoose77 · 2023-06-27T22:46:15Z

OK, the latest iteration is thus:

to_list still converts char arrays to a single string.
N * char → str conversion still happens in Array.__getitem__ & Array.__iter__ (we want to be able to access the low-level char nodes at low-level, e.g. in from_iter with strings).
ak.prettyprint renders char arrays as [101, 93, 50, ...] instead of "e]2...". This is to ensure that parameters are still visible. If we really want this to render like a string, we can.
char elements are always returned as integers, because they may be partical codepoints

codecov · 2023-06-27T22:46:35Z

Codecov Report

Merging #2547 (79dbd36) into main (0901aa4) will increase coverage by 0.06%.
The diff coverage is 91.72%.

❗ Current head 79dbd36 differs from pull request most recent head b370831. Consider uploading reports for the commit b370831 to get more accurate results

Additional details and impacted files

Impacted Files	Coverage Δ
src/awkward/__init__.py	`96.96% <ø> (-0.18%)`	⬇️
src/awkward/_connect/avro.py	`86.94% <ø> (ø)`
src/awkward/behaviors/string.py	`63.63% <ø> (-10.87%)`	⬇️
src/awkward/behaviors/categorical.py	`71.42% <50.00%> (-3.92%)`	⬇️
src/awkward/_categorical.py	`58.62% <58.62%> (ø)`
src/awkward/_prettyprint.py	`69.38% <60.00%> (-2.76%)`	⬇️
src/awkward/types/type.py	`93.95% <88.88%> (-0.39%)`	⬇️
src/awkward/contents/numpyarray.py	`91.18% <95.83%> (+0.19%)`	⬆️
src/awkward/highlevel.py	`77.11% <95.83%> (+0.28%)`	⬆️
src/awkward/_connect/numpy.py	`91.80% <96.47%> (-2.74%)`	⬇️
... and 7 more

... and 1 file with indirect coverage changes

jpivarski · 2023-06-27T22:51:14Z

Having char arrays convert to str/bytes sounds like a good idea, and it's probably implemented that way either because your or I expected it or someone asked for it.

Individual items pulled from a char array should be integers, because they could be a partial codepoint.

The only odd one out is the pretty-print of a char array as a list of integers. Any alternative would be a very special case, since it would be a pretty-printed array that doesn't begin and end with [ and ]. I'd say we don't need to do that unless asked, when asked. Changing repr is not what I would consider a backward-incompatible change, since no one should be depending on the formatting of a repr (Python anti-pattern).

src/awkward/operations/ak_to_list.py

agoose77 · 2023-06-28T11:38:14Z

I believe this is good to go. Overall, it removes a feature: the ability to add two char / byte arrays. It isn't very easy to identify this from existing repo usage given, but my feeling is that this is very niche. Due to Array._getitem, it should be difficult to obtain a bare char array unless one performs ak.to_layout("hello world").

docs/user-guide/how-to-convert-python.md

agoose77 · 2023-06-28T14:52:15Z

@jpivarski could you check whether I've added the typer and lowering in the best place? It seems to wrap the getitem result, but I don't know enough about what to expect here!

jpivarski · 2023-06-28T15:00:38Z

@jpivarski could you check whether I've added the typer and lowering in the best place? It seems to wrap the getitem result, but I don't know enough about what to expect here!

While talking about this on Zoom, I looked into it, and I think that checking at the level of ContentType.getitem_at_check is fine. The string check could be done only in the list types, but going to the extra work of changing it to do that is unnecessary. Upon entering a Numba context, we assume that the array is fully valid; __array__: string would only be on list-types, so the check on all other types would always return False.

jpivarski

All good, except for the removal of get_at and get_field in prettyprint.py. I think we might still need these, so that a simple error doesn't explode into a RecursionError.

src/awkward/_connect/avro.py

jpivarski · 2023-06-28T19:49:24Z

src/awkward/_prettyprint.py

-# avoid recursion in which ak.Array.__getitem__ calls prettyprint
-# to form an error string: private reimplementation of ak.Array.__getitem__
-
-
-def get_at(data, index):
-    out = data._layout._getitem_at(index)
-    if isinstance(out, ak.contents.NumpyArray):
-        array_param = out.parameter("__array__")
-        if array_param == "byte":
-            return ak._util.tobytes(out._raw(numpy))
-        elif array_param == "char":
-            return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
-    if isinstance(out, (ak.contents.Content, ak.record.Record)):
-        return wrap_layout(out, data._behavior)
-    else:
-        return out
-
-
-def get_field(data, field):
-    out = data._layout._getitem_field(field)
-    if isinstance(out, ak.contents.NumpyArray):
-        array_param = out.parameter("__array__")
-        if array_param == "byte":
-            return ak._util.tobytes(out._raw(numpy))
-        elif array_param == "char":
-            return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
-    if isinstance(out, (ak.contents.Content, ak.record.Record)):
-        return wrap_layout(out, data._behavior)
-    else:
-        return out


Why is it okay to remove these? I think the problem they were addressing is that there was an error in constructing the error message, which is not easily tested in automated tests.

Also, they avoid the rather complex dispatch of Content._getitem, which checks for any kind of slice, with Content._getitem_at or Content._getitem_field. That's not (just) a performance thing, but it also avoids the possibility of errors in that dispatch, which can be a problem if it needs to print the array when there's an error.

agoose77 added 4 commits June 27, 2023 13:30

feat: add _getitem_ impl

bfd0101

feat: add __bytes__ to ak.Array

6fccb90

fix: don't use _raw

4f5adce

fix: support ufuncs internally

f8a6241

agoose77 added 3 commits June 27, 2023 20:41

fix: move char handling to contents

fcaca4a

fix: we don't support np.add on strings now

d116bee

fix: name is bytes, not bytestring

5b8a568

agoose77 added 6 commits June 27, 2023 22:27

feat!: drop chr representation of char

b4d44ef

fix: support strings in ak.to_list (stopgap)

ffdaa69

fix: typestr for regular

e957bc8

fix: avro should add __string__

cf9eefa

fix: char arrays

c2dc277

feat!: revert _getitem_at changes

2bf89fb

agoose77 commented Jun 27, 2023

View reviewed changes

src/awkward/_connect/avro.py Show resolved Hide resolved

agoose77 added 7 commits June 27, 2023 23:54

refactor: move string ufunc handling

e0e8d51

refactor: small code re-ordering

1a26d80

docs: remove note about behavior

9a602a4

refactor: move categorical to internals

623332d

refactor: move string lowering to built-in

64c6773

fix: wrap __iter__ yielded results

51a0467

fix: ufuncs on categoricals

69c8145

agoose77 commented Jun 28, 2023

View reviewed changes

src/awkward/operations/ak_to_list.py Show resolved Hide resolved

agoose77 temporarily deployed to docs-preview June 28, 2023 11:24 — with GitHub Actions Inactive

chore: appease pylint!

5581827

agoose77 marked this pull request as ready for review June 28, 2023 11:27

refactor: small typing

c0ca270

agoose77 temporarily deployed to docs-preview June 28, 2023 11:45 — with GitHub Actions Inactive

agoose77 added 3 commits June 28, 2023 13:48

refactor: small type hints

af27ee8

fix: don't show None parameters

e9afab8

refactor: more type hints

79dbd36

agoose77 requested a review from jpivarski June 28, 2023 12:54

agoose77 commented Jun 28, 2023

View reviewed changes

docs/user-guide/how-to-convert-python.md Show resolved Hide resolved

agoose77 temporarily deployed to docs-preview June 28, 2023 13:03 — with GitHub Actions Inactive

jpivarski approved these changes Jun 28, 2023

View reviewed changes

agoose77 added 4 commits June 28, 2023 21:00

fix: don't use error context in low-level slicer

9693772

chore: revert changes to prettyprint

ddeef5e

chore: revert introduction of _getitem

38f0455

Merge branch 'main' into agoose77/feat-builtin-string-ops

b370831

agoose77 enabled auto-merge (squash) June 28, 2023 21:17

agoose77 temporarily deployed to docs-preview June 28, 2023 21:28 — with GitHub Actions Inactive

agoose77 merged commit 1f6041e into main Jun 28, 2023
35 checks passed

agoose77 deleted the agoose77/feat-builtin-string-ops branch June 28, 2023 21:33

This was referenced Jun 30, 2023

Remove special-casing for strings #1682

Closed

ak.to_list has stopped working on some edge cases between awkward 1.8.0 and 2.1.1 #2376

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: move string features into core #2547

feat: move string features into core #2547

agoose77 commented Jun 27, 2023 •

edited

Loading

agoose77 commented Jun 27, 2023 •

edited

Loading

jpivarski commented Jun 27, 2023

agoose77 commented Jun 27, 2023

jpivarski commented Jun 27, 2023

agoose77 commented Jun 27, 2023 •

edited

Loading

agoose77 commented Jun 27, 2023

codecov bot commented Jun 27, 2023 •

edited

Loading

jpivarski commented Jun 27, 2023

agoose77 commented Jun 28, 2023

agoose77 commented Jun 28, 2023

jpivarski commented Jun 28, 2023

jpivarski left a comment

jpivarski Jun 28, 2023

feat: move string features into core #2547

feat: move string features into core #2547

Conversation

agoose77 commented Jun 27, 2023 • edited Loading

agoose77 commented Jun 27, 2023 • edited Loading

jpivarski commented Jun 27, 2023

agoose77 commented Jun 27, 2023

jpivarski commented Jun 27, 2023

agoose77 commented Jun 27, 2023 • edited Loading

agoose77 commented Jun 27, 2023

codecov bot commented Jun 27, 2023 • edited Loading

Codecov Report

jpivarski commented Jun 27, 2023

agoose77 commented Jun 28, 2023

agoose77 commented Jun 28, 2023

jpivarski commented Jun 28, 2023

jpivarski left a comment

Choose a reason for hiding this comment

jpivarski Jun 28, 2023

Choose a reason for hiding this comment

agoose77 commented Jun 27, 2023 •

edited

Loading

agoose77 commented Jun 27, 2023 •

edited

Loading

agoose77 commented Jun 27, 2023 •

edited

Loading

codecov bot commented Jun 27, 2023 •

edited

Loading