Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove string casting from ak.to_layout #2098

Merged
merged 4 commits into from
Jan 10, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jan 10, 2023

As per #2070, this PR changes ak.to_layout("string") to return the string (if allow_other=True).

To support the primary use case of the old behaviour, a new __cast__ behavior has been added for [byte]strings so that ufunc operations can accept Python strings.

Changes

  • Add new is_non_string_like_sequence, rename is_non_string_like_iterable helper
  • Treat bytes/str as leaf types in to_layout
  • Add __cast__ support to string-likes
  • Replaces usages of is_non_string_like_iterable with is_non_string_like_sequence where appropriate

@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

Merging #2098 (30be100) into main (4db99b6) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 30be100 differs from pull request most recent head 3474d51. Consider uploading reports for the commit 3474d51 to get more accurate results

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/behaviors/string.py 76.66% <100.00%> (+0.63%) ⬆️
src/awkward/operations/ak_to_layout.py 88.37% <100.00%> (-0.76%) ⬇️
src/awkward/_util.py 82.00% <0.00%> (+0.18%) ⬆️
src/awkward/_connect/numpy.py 68.96% <0.00%> (+0.43%) ⬆️

@agoose77 agoose77 temporarily deployed to docs-preview January 10, 2023 13:30 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

Just adding support for bytestrings into is_non_string_iterable.

@agoose77 agoose77 temporarily deployed to docs-preview January 10, 2023 16:16 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so if to_layout is passed a string (or bytestring) and allow_other=True, ak_to_layout.py line 122 will return the string (or bytestring) as-is.

It's telling that no tests depended on this. That helpsto justify calling it a bug-fix.

It's good to merge!

@agoose77
Copy link
Collaborator Author

It's telling that no tests depended on this. That helpsto justify calling it a bug-fix.

Technically a couple of tests did implicitly ­— requiring that ufuncs support bare strings. I think the solution is actually closer to what we want to happen now, using __cast__ (and our first usage of __cast__ internally that I can tell).

@agoose77 agoose77 merged commit 6d4de4c into main Jan 10, 2023
@agoose77 agoose77 deleted the agoose77/fix-to-layout-string branch January 10, 2023 17:03
@agoose77
Copy link
Collaborator Author

@jpivarski before this makes it into a release ..., after our discussion about strings decaying to character arrays, do we want to revert this and replace it with a character-array codepath?

@jpivarski
Copy link
Member

I've lost track of what this refers to. What are you saying would be the consequences of cutting a release with this in it?

@agoose77
Copy link
Collaborator Author

agoose77 commented Jan 12, 2023

This PR does two things:

  1. changes ak.to_layout("string") from returning 1 * string to str iff. allow_other=True.
  2. converts str to 1 * string via a __cast__ behavior for ufunc operations.

We then spoke about single strings simply decaying to char arrays, so really this PR is not right either — ak.to_layout("string") should return N * char`.

I think we still need the __cast__: we don't want ufunc operations treating Python strings as char arrays, I don't think.

Making this release would mean that users could start relying on ak.to_layout("string", allow_other) == "string", which would break in the future.1

Footnotes

  1. Well, CharBehavior means that array_of_char == "string" should work after we make the suggested "decay" change, but it's not actually a str.

@jpivarski
Copy link
Member

allow_other=True is for passing through non-array arguments, some of which may be (Python) string-valued options. (In particular, do any ufuncs have arguments of string type, e.g. direction="up"? It would be hard to search for such things. We also accept ufuncs from libraries other than NumPy.)

Perhaps it should not be used so loosely; functions that call it with allow_other=True should pick which arguments are supposed to be arrays, make them arrays, and which arguments are not supposed to be arrays, and leave them as they are.

The reimplementation to make this return an array of characters ("decay") would be pretty easy:

>>> ak.to_layout(["hello"])[0]
<NumpyArray dtype='uint8' len='5'>
    <parameter name='__array__'>'char'</parameter>
    [104 101 108 108 111]
</NumpyArray>

Since the old bug (to_layout returned an array of one string) didn't trip over any tests, we're evidently not testing this case (other than tests/test_2070_to_layout_string.py) and the change should go through quickly.

So yes, it would be good to update to_layout to return an array of characters before the release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants