Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: move string features into core #2547

Merged
merged 29 commits into from
Jun 28, 2023
Merged

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jun 27, 2023

This PR tackes #1682 by moving existing behavior-based string logic into the core Awkward routines.

  • Adds __bytes__ to ak.Array, to replace the ByteBehavior.__bytes__
  • Reworks __getitem__ bypass logic
  • Removes __add__ and __radd__ support for chat/byte arrays.
  • Hardcodes ufunc handling for strings
  • Hardcodes typestr handling for strings / chars

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 27, 2023

From #873, whilst the __iter__ and __getitem__ logic on string arrays will give us Python strings, the same operations on arrays of char do not.

The reason for this is that #873 implements the uint8[] → str transformation in ak.Array.__getitem__. We can't do the same for char, because the NumpyArray._getitem_at operation on char returns an integer, with no metadata about the fact that it is a unicode code point.

The "obvious" solution is to make NumpyArray._getitem_at aware of char/byte, and return the Python type. But, for symmetry reasons that makes me wonder whether ListOffsetArray should do the same for strings, and the existing logic on ak.Array be removed.

I can't clearly see what the risks of doing this are. My hunch is that it would be fine; anything interacting with layouts on a per-item basis can pay the cost of string conversion. But, your thoughts are welcome here @jpivarski

@jpivarski
Copy link
Member

I think you should go with the "obvious" solution of adding the parameter check in NumpyArray._getitem_at, just as it is in ListOffsetArray._getitem_at. We can say that the reason the check is duplicated is for performance: putting together a string by repeatedly calling NumpyArray._getitem_at and having it repeatedly check the parameters would be a waste when it can be done once at the list level.

The weirdness of having lists check for stringiness and return a str and also having list contents check for stringiness and return a str is due to Python: the characters of a string are not a different type from the string itself. Formally, you'd expect them to be different, as they are in C (char vs char*) or C++ (char vs std::string) or Java (char vs java.lang.String). The weirdness is evident in that this doesn't terminate or raise an error:

def iterate_over(thing):
    for x in thing:
        print(x)
        iterate_over(x)

iterate_over("a string")

But to comply with Pythonic expectations, both list types and NumpyArray should check parameters and return a str.

@agoose77
Copy link
Collaborator Author

I think you should go with the "obvious" solution of adding the parameter check in NumpyArray._getitem_at, just as it is in ListOffsetArray._getitem_at. We can say that the reason the check is duplicated is for performance: putting together a string by repeatedly calling NumpyArray._getitem_at and having it repeatedly check the parameters would be a waste when it can be done once at the list level.

The weirdness of having lists check for stringiness and return a str and also having list contents check for stringiness and return a str is due to Python: the characters of a string are not a different type from the string itself. Formally, you'd expect them to be different, as they are in C (char vs char*) or C++ (char vs std::string) or Java (char vs java.lang.String). The weirdness is evident in that this doesn't terminate or raise an error:

def iterate_over(thing):
    for x in thing:
        print(x)
        iterate_over(x)

iterate_over("a string")

But to comply with Pythonic expectations, both list types and NumpyArray should check parameters and return a str.

Could you clarify this slightly? The ListOffsetArray (et al.) and NumpyArray cater to different types; char is set on NumpyArray, whilst string is set on ListOffsetArray. Whilst it's hard to get a bare NumpyArray(..., parameters={"__array__": "char"}), it's not impossible (ak.Array(ak.to_layout("hello"))), and it is also trivial to pull out a character from a string (ak.Array(["hello"])[0, 0]). Pulling out characters and whole strings should both involve casting to a Python type (str, and str again!).

The question for me is whether this should always happen in Content._getitem_at, rather than only enforced at the highlevel. We can't actually enforce the char logic at highlevel for the reasons outlined above.

@jpivarski
Copy link
Member

Oh, that's right: it was handled in ak.Array.__getitem__ (high-level) because there are multiple Content._getitem paths that can get you to a string or char: direct _getitem_at and indirect _getitem_next. We don't want one to return a length-1 str and the other to return an np.uint8.

Yeah, this test only happens at high-level:

out = self._layout[where]
if isinstance(out, ak.contents.NumpyArray):
array_param = out.parameter("__array__")
if array_param == "byte":
return ak._util.tobytes(out._raw(numpy))
elif array_param == "char":
return ak._util.tobytes(out._raw(numpy)).decode(
errors="surrogateescape"
)

and not at low-level:

def _getitem_at(self, where: IndexType):
# Wrap `where` by length
if not is_unknown_scalar(where) and where < 0:
length_index = self._backend.index_nplike.shape_item_as_index(self.length)
where += length_index
# Validate `where`
if not (
is_unknown_scalar(where)
or self.length is unknown_length
or (0 <= where < self.length)
):
raise ak._errors.index_error(self, where)
start, stop = self._offsets[where], self._offsets[where + 1]
return self._content._getitem_range(start, stop)

try:
out = self._data[where]
except IndexError as err:
raise ak._errors.index_error(self, where, str(err)) from err
if hasattr(out, "shape") and len(out.shape) != 0:
return NumpyArray(out, parameters=None, backend=self._backend)
else:
return out


>>> del ak.behavior["char"]
>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>
>>> ak.Array(ak.to_layout("hellothere"))[0]
104

So the problem is conveying the information that the final result of the slice is a char from the last step of Content.__getitem__, which could be _getitem_at, _getitem_next, or (maybe?) something else, to the ak.Array.__getitem__ routine, which does not deeply inspect the Content's __getitem__ machinery, just the final result. By the time it gets a np.uint8, the information about it being a char is gone.

Well, NumpyArray._getitem_at, NumpyArray._getitem_next, etc. could return np.character...?


No: here's a better idea. Note that Awkward slicing does not make a distinction between string and bytestring: if you index the contents of a Unicode string, your indexes are not counting Unicode characters; they're counting bytes. That was a decision to keep the slicing rules simple; a Unicode-aware codepoint-index slice could be implemented in a future strings module. (It would need compiled routines to get all the variable-width UTF-8 rules right.)

>>> money = ak.Array(["$", "¢", "€", "💰"])

>>> ak.num(money)
<Array [1, 2, 3, 4] type='4 * int64'>

>>> money.layout
<ListOffsetArray len='4'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='5'>
        [ 0  1  3  6 10]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='10'>
        <parameter name='__array__'>'char'</parameter>
        [ 36 194 162 226 130 172 240 159 146 176]
    </NumpyArray></content>
</ListOffsetArray>

>>> money[0, 0]
36
>>> money[1, 0]
194
>>> money[1, 1]
162
>>> money[2, 0]
226
>>> money[2, 1]
130
>>> money[2, 2]
172
>>> money[3, 0]
240
>>> money[3, 1]
159
>>> money[3, 2]
146
>>> money[3, 3]
176

Considering that ak.Array.__getitem__ into a string does not give you Unicode characters, we can't expect it to give you a length-1 string. It's showing you part of a UTF-8 encoding.

So this could just be a policy decision: we always show individual-item selections of a string as integers. That's what Python 3 users expect of bytes but not str:

>>> list("$¢€💰")
['$', '¢', '€', '💰']

>>> list("$¢€💰".encode("utf-8"))
[36, 194, 162, 226, 130, 172, 240, 159, 146, 176]

The policy can be that Awkward returns integers from both bytestrings and strings because it's not really selecting characters, anyway. I see a stronger case that this is the right thing to do, though it's not obvious at first and we may find ourselves explaining it as a gotcha.

In other words,

>>> del ak.behavior["char"]

>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>

>>> ak.Array(ak.to_layout(["hello", "there"]))
<Array ['hello', 'there'] type='2 * string'>

is fine as-is. What do you think of that?

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 27, 2023

is fine as-is. What do you think of that?

In general, I strongly prefer this. I'd really like to drop as much magic from these routines as possible, so that we can more easily reason about what's going on here!

I was looking to modify NumpyArray.to_list() to return integers instead of bytestrings for char, so this all fits in very nicely.

@agoose77
Copy link
Collaborator Author

OK, the latest iteration is thus:

  • to_list still converts char arrays to a single string.
  • N * charstr conversion still happens in Array.__getitem__ & Array.__iter__ (we want to be able to access the low-level char nodes at low-level, e.g. in from_iter with strings).
  • ak.prettyprint renders char arrays as [101, 93, 50, ...] instead of "e]2...". This is to ensure that parameters are still visible. If we really want this to render like a string, we can.
  • char elements are always returned as integers, because they may be partical codepoints

@codecov
Copy link

codecov bot commented Jun 27, 2023

Codecov Report

Merging #2547 (79dbd36) into main (0901aa4) will increase coverage by 0.06%.
The diff coverage is 91.72%.

❗ Current head 79dbd36 differs from pull request most recent head b370831. Consider uploading reports for the commit b370831 to get more accurate results

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/__init__.py 96.96% <ø> (-0.18%) ⬇️
src/awkward/_connect/avro.py 86.94% <ø> (ø)
src/awkward/behaviors/string.py 63.63% <ø> (-10.87%) ⬇️
src/awkward/behaviors/categorical.py 71.42% <50.00%> (-3.92%) ⬇️
src/awkward/_categorical.py 58.62% <58.62%> (ø)
src/awkward/_prettyprint.py 69.38% <60.00%> (-2.76%) ⬇️
src/awkward/types/type.py 93.95% <88.88%> (-0.39%) ⬇️
src/awkward/contents/numpyarray.py 91.18% <95.83%> (+0.19%) ⬆️
src/awkward/highlevel.py 77.11% <95.83%> (+0.28%) ⬆️
src/awkward/_connect/numpy.py 91.80% <96.47%> (-2.74%) ⬇️
... and 7 more

... and 1 file with indirect coverage changes

@jpivarski
Copy link
Member

Having char arrays convert to str/bytes sounds like a good idea, and it's probably implemented that way either because your or I expected it or someone asked for it.

Individual items pulled from a char array should be integers, because they could be a partial codepoint.

The only odd one out is the pretty-print of a char array as a list of integers. Any alternative would be a very special case, since it would be a pretty-printed array that doesn't begin and end with [ and ]. I'd say we don't need to do that unless asked, when asked. Changing repr is not what I would consider a backward-incompatible change, since no one should be depending on the formatting of a repr (Python anti-pattern).

@agoose77 agoose77 temporarily deployed to docs-preview June 28, 2023 11:24 — with GitHub Actions Inactive
@agoose77 agoose77 marked this pull request as ready for review June 28, 2023 11:27
@agoose77
Copy link
Collaborator Author

I believe this is good to go. Overall, it removes a feature: the ability to add two char / byte arrays. It isn't very easy to identify this from existing repo usage given, but my feeling is that this is very niche. Due to Array._getitem, it should be difficult to obtain a bare char array unless one performs ak.to_layout("hello world").

@agoose77 agoose77 temporarily deployed to docs-preview June 28, 2023 11:45 — with GitHub Actions Inactive
@agoose77 agoose77 requested a review from jpivarski June 28, 2023 12:54
@agoose77 agoose77 temporarily deployed to docs-preview June 28, 2023 13:03 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

@jpivarski could you check whether I've added the typer and lowering in the best place? It seems to wrap the getitem result, but I don't know enough about what to expect here!

@jpivarski
Copy link
Member

@jpivarski could you check whether I've added the typer and lowering in the best place? It seems to wrap the getitem result, but I don't know enough about what to expect here!

While talking about this on Zoom, I looked into it, and I think that checking at the level of ContentType.getitem_at_check is fine. The string check could be done only in the list types, but going to the extra work of changing it to do that is unnecessary. Upon entering a Numba context, we assume that the array is fully valid; __array__: string would only be on list-types, so the check on all other types would always return False.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, except for the removal of get_at and get_field in prettyprint.py. I think we might still need these, so that a simple error doesn't explode into a RecursionError.

src/awkward/_connect/avro.py Show resolved Hide resolved
Comment on lines 38 to 67
# avoid recursion in which ak.Array.__getitem__ calls prettyprint
# to form an error string: private reimplementation of ak.Array.__getitem__


def get_at(data, index):
out = data._layout._getitem_at(index)
if isinstance(out, ak.contents.NumpyArray):
array_param = out.parameter("__array__")
if array_param == "byte":
return ak._util.tobytes(out._raw(numpy))
elif array_param == "char":
return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
if isinstance(out, (ak.contents.Content, ak.record.Record)):
return wrap_layout(out, data._behavior)
else:
return out


def get_field(data, field):
out = data._layout._getitem_field(field)
if isinstance(out, ak.contents.NumpyArray):
array_param = out.parameter("__array__")
if array_param == "byte":
return ak._util.tobytes(out._raw(numpy))
elif array_param == "char":
return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
if isinstance(out, (ak.contents.Content, ak.record.Record)):
return wrap_layout(out, data._behavior)
else:
return out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it okay to remove these? I think the problem they were addressing is that there was an error in constructing the error message, which is not easily tested in automated tests.

Also, they avoid the rather complex dispatch of Content._getitem, which checks for any kind of slice, with Content._getitem_at or Content._getitem_field. That's not (just) a performance thing, but it also avoids the possibility of errors in that dispatch, which can be a problem if it needs to print the array when there's an error.

@agoose77 agoose77 enabled auto-merge (squash) June 28, 2023 21:17
@agoose77 agoose77 temporarily deployed to docs-preview June 28, 2023 21:28 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit 1f6041e into main Jun 28, 2023
35 checks passed
@agoose77 agoose77 deleted the agoose77/feat-builtin-string-ops branch June 28, 2023 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants