feat: move string features into core #2547
From #873, whilst the The reason for this is that #873 implements the The "obvious" solution is to make I can't clearly see what the risks of doing this are. My hunch is that it would be fine; anything interacting with layouts on a per-item basis can pay the cost of string conversion. But, your thoughts are welcome here @jpivarski |
I think you should go with the "obvious" solution of adding the The weirdness of having lists check for stringiness and return a
def iterate_over(thing):
    for x in thing:
        print(x)
        iterate_over(x)

iterate_over("a string")
But to comply with Pythonic expectations, both list types and |
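The hazard in the snippet above comes from plain Python: iterating a `str` yields length-1 `str` objects, which are themselves iterable, whereas iterating `bytes` yields integers. A minimal sketch in plain Python (no Awkward involved; `depth_of` and its `limit` parameter are hypothetical names used only for illustration):

```python
def depth_of(thing, limit=5):
    """Follow the first element of each iterable, at most `limit` levels deep."""
    depth = 0
    while depth < limit:
        try:
            thing = next(iter(thing))
        except (TypeError, StopIteration):
            break  # not iterable, or empty: we reached the bottom
        depth += 1
    return depth


str_depth = depth_of("a string")     # never bottoms out; stopped only by `limit`
bytes_depth = depth_of(b"a string")  # bytes -> int, then stops

first_char = next(iter("a string"))   # a str of length 1
first_byte = next(iter(b"a string"))  # an int

print(str_depth, bytes_depth, repr(first_char), first_byte)
```

Returning integers for single items, as `bytes` does, is what gives the iteration a natural bottom.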
Could you clarify this slightly? The The question for me is whether this should always happen in |
Oh, that's right: it was handled in Yeah, this test only happens at high-level: awkward/src/awkward/highlevel.py Lines 950 to 958 in f2783ba
and not at low-level: awkward/src/awkward/contents/listoffsetarray.py Lines 290 to 303 in f2783ba
awkward/src/awkward/contents/numpyarray.py Lines 295 to 303 in f2783ba
>>> del ak.behavior["char"]
>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>
>>> ak.Array(ak.to_layout("hellothere"))[0]
104
So the problem is conveying the information that the final result of the slice is a char from the last step of Well, No: here's a better idea. Note that Awkward slicing does not make a distinction between string and bytestring: if you index the contents of a Unicode string, your indexes are not counting Unicode characters; they're counting bytes. That was a decision to keep the slicing rules simple; a Unicode-aware codepoint-index slice could be implemented in a future strings module. (It would need compiled routines to get all the variable-width UTF-8 rules right.)
>>> money = ak.Array(["$", "¢", "€", "💰"])
>>> ak.num(money)
<Array [1, 2, 3, 4] type='4 * int64'>
>>> money.layout
<ListOffsetArray len='4'>
<parameter name='__array__'>'string'</parameter>
<offsets><Index dtype='int64' len='5'>
[ 0 1 3 6 10]
</Index></offsets>
<content><NumpyArray dtype='uint8' len='10'>
<parameter name='__array__'>'char'</parameter>
[ 36 194 162 226 130 172 240 159 146 176]
</NumpyArray></content>
</ListOffsetArray>
>>> money[0, 0]
36
>>> money[1, 0]
194
>>> money[1, 1]
162
>>> money[2, 0]
226
>>> money[2, 1]
130
>>> money[2, 2]
172
>>> money[3, 0]
240
>>> money[3, 1]
159
>>> money[3, 2]
146
>>> money[3, 3]
176
Considering that So this could just be a policy decision: we always show individual-item selections of a string as integers. That's what Python 3 users expect of
>>> list("$¢€💰")
['$', '¢', '€', '💰']
>>> list("$¢€💰".encode("utf-8"))
[36, 194, 162, 226, 130, 172, 240, 159, 146, 176]
The policy can be that Awkward returns integers from both bytestrings and strings because it's not really selecting characters, anyway. I see a stronger case that this is the right thing to do, though it's not obvious at first and we may find ourselves explaining it as a gotcha. In other words,
>>> del ak.behavior["char"]
>>> ak.Array(ak.to_layout("hello"))
<Array [104, 101, 108, 108, 111] type='5 * char'>
>>> ak.Array(ak.to_layout(["hello", "there"]))
<Array ['hello', 'there'] type='2 * string'>
is fine as-is. What do you think of that? |
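The offsets and byte values shown in the `money.layout` dump can be reproduced with the standard library alone. This is a rough sketch of the offsets-plus-flat-buffer idea, not Awkward's implementation; `byte_at` is a hypothetical helper:

```python
strings = ["$", "¢", "€", "💰"]

# Encode each string to UTF-8 and concatenate into one flat byte buffer.
encoded = [s.encode("utf-8") for s in strings]
content = b"".join(encoded)

# offsets[i]:offsets[i + 1] delimits the bytes of string i.
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))


def byte_at(i, j):
    """Return the j-th byte (an int) of string i, counting bytes, not characters."""
    start, stop = offsets[i], offsets[i + 1]
    return content[start:stop][j]


print(offsets)        # [0, 1, 3, 6, 10]
print(list(content))  # [36, 194, 162, 226, 130, 172, 240, 159, 146, 176]
print(byte_at(2, 1))  # 130
```

Indexing by byte position keeps the lookup a simple offset calculation, which is the simplicity argument made in the comment above.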
In general, I strongly prefer this. I'd really like to drop as much magic from these routines as possible, so that we can more easily reason about what's going on here! I was looking to modify |
OK, the latest iteration is thus:
|
Having char arrays convert to Individual items pulled from a char array should be integers, because they could be a partial codepoint. The only odd one out is the pretty-print of a char array as a list of integers. Any alternative would be a very special case, since it would be a pretty-printed array that doesn't begin and end with |
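To illustrate the "partial codepoint" point with plain Python (a sketch using only the standard library, independent of Awkward): a single byte taken from a multi-byte UTF-8 character is not decodable on its own, so an integer is the only faithful representation of it.

```python
euro = "€".encode("utf-8")  # three bytes: 226, 130, 172
partial = euro[:1]          # only the first byte of the codepoint

# Strict decoding of a partial codepoint fails.
try:
    partial.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The "surrogateescape" error handler maps the undecodable byte to a lone
# surrogate, which can be encoded back to the original byte.
escaped = partial.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, euro[0], round_trip == partial)
```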
I believe this is good to go. Overall, it removes a feature: the ability to add two |
@jpivarski could you check whether I've added the typer and lowering in the best place? It seems to wrap the getitem result, but I don't know enough about what to expect here! |
While talking about this on Zoom, I looked into it, and I think that checking at the level of |
All good, except for the removal of get_at and get_field in prettyprint.py. I think we might still need these, so that a simple error doesn't explode into a RecursionError.
src/awkward/_prettyprint.py
Outdated
# avoid recursion in which ak.Array.__getitem__ calls prettyprint
# to form an error string: private reimplementation of ak.Array.__getitem__


def get_at(data, index):
    out = data._layout._getitem_at(index)
    if isinstance(out, ak.contents.NumpyArray):
        array_param = out.parameter("__array__")
        if array_param == "byte":
            return ak._util.tobytes(out._raw(numpy))
        elif array_param == "char":
            return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
    if isinstance(out, (ak.contents.Content, ak.record.Record)):
        return wrap_layout(out, data._behavior)
    else:
        return out


def get_field(data, field):
    out = data._layout._getitem_field(field)
    if isinstance(out, ak.contents.NumpyArray):
        array_param = out.parameter("__array__")
        if array_param == "byte":
            return ak._util.tobytes(out._raw(numpy))
        elif array_param == "char":
            return ak._util.tobytes(out._raw(numpy)).decode(errors="surrogateescape")
    if isinstance(out, (ak.contents.Content, ak.record.Record)):
        return wrap_layout(out, data._behavior)
    else:
        return out
Why is it okay to remove these? I think the problem they were addressing is that there was an error in constructing the error message, which is not easily tested in automated tests.
Also, they bypass the rather complex dispatch of Content._getitem, which checks for any kind of slice, by calling Content._getitem_at or Content._getitem_field directly. That's not (just) a performance thing: it also avoids the possibility of errors in that dispatch, which can be a problem if it needs to print the array when there's an error.
This PR tackles #1682 by moving existing behavior-based string logic into the core Awkward routines.
__bytes__ to ak.Array, to replace the ByteBehavior.__bytes__
__getitem__ bypass logic
__add__ and __radd__ support for char/byte arrays.
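For readers unfamiliar with the `__bytes__` protocol mentioned in the list above: Python's built-in `bytes(obj)` calls `obj.__bytes__()` when it is defined. The sketch below uses a hypothetical `CharBuffer` class to show the protocol together with integer-valued single-item access; it is an illustration only and makes no claim about how `ak.Array.__bytes__` is written.

```python
class CharBuffer:
    """Hypothetical stand-in for a wrapper around raw uint8 data."""

    def __init__(self, data: bytes):
        self._data = data

    def __bytes__(self):
        # Invoked by the built-in bytes(); returns the whole buffer.
        return self._data

    def __getitem__(self, index):
        # A single integer index yields an int, as with Python's bytes.
        return self._data[index]


buf = CharBuffer("hello".encode("utf-8"))

as_bytes = bytes(buf)  # whole-buffer conversion via __bytes__
first = buf[0]         # single item stays an integer

print(as_bytes, first)
```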