FIX: use str dtype without size information #8932

kmuehlbauer · 2024-04-12T10:59:45Z

Aims to resolve parts of #8844.

xarray/tests/test_accessor_str.py::test_case_str: AssertionError: assert dtype('<U26') == dtype('<U30')

I'm not sure this is the right location for the fix, at least it fixes those errors. AFAICT this is some issue somewhere inside apply_ufunc where the string dtype size is kept. So this fix removes the size information from the dtype (actually recreating it).

Closes #xxxx
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

keewis · 2024-04-12T11:13:17Z

I have no idea, either (but I also didn't do an exhaustive search).

@seberg, do you know which of the changes to the 2.0 branch of numpy in the past ~2-3 weeks might have caused this (or longer, we're using the scientific-python-nightly-wheels repo)

(I'll try to investigate / write a reproducer that uses just numpy so you don't have to look through our code to understand what's going on)

seberg · 2024-04-12T11:18:56Z

Hmmm, thanks for the ping. Puzzles me on first sight. A lot of the string functions (i.e. in np.char.... did see changes, but the code change looks more fundamental, which I find odd. A "NumPy repro" would be awesome.

Let me ping @lysnikolaou who has pushed and taken the helm for many of these string improvements.

kmuehlbauer · 2024-04-12T11:49:04Z

Thanks @seberg for chiming in. I've tracked the code further, we are using

np.vectorize along the process with the otypes kwarg set to something like dtype('<U26') (the number depends on the size of the string) and the string size changes in the vectorized function.

I can't immediately see a problem inside np.vectorize as it uses a similar approach to get rid of any size information. So the issue might still be somewhere in our code.

seberg · 2024-04-12T12:29:58Z

What I find strange, is that the line just before seems to test a very similar function, that would seems to go through a lambda x: x.capitalize() while this is just lambda x: x.casefold(), if I understand that right.

I wonder if we have a very subtle bug completely unrelated. I.e. if you are not careful in NumPy, it is possible to forget to make a new dtype and accidentally modify an existing one. If it is something like that, it may be hard to isolate.

keewis · 2024-04-12T13:48:25Z

the pure numpy reproducer is (which was kinda what @kmuehlbauer described, but now you can try yourself):

import numpy as np

arr = np.array(["SOme wOrd Ǆ ß ᾛ ΣΣ ﬃ⁵Å Ç Ⅰ"]).astype(np.str_)
f = str.casefold
np.vectorize(f, otypes=[arr.dtype])(arr)

str.casefold returns a string that's four characters longer, and before numpy<2 the dtype would be extended to fit the string (<30U) while with numpy>=2 the old dtype would be used, truncating the string.

str.capitalize does not change the length of the string, which is why that doesn't fail.

FWIW, the new string dtype doesn't have that issue, and numpy.strings has a lot of the vectorized functions we've been creating using np.vectorize (i.e. those will most likely be much faster).

ngoldbaum · 2024-04-12T14:36:05Z

FWIW, the new string dtype doesn't have that issue, and numpy.strings has a lot of the vectorized functions we've been creating using np.vectorize (i.e. those will most likely be much faster).

Unfortunately not casefold yet but it's on our list.

Adding new string ufuncs isn't terribly hard, we'd be happy to add more or review contributions if there are missing ones. I already know that pandas has some functions that wrap the regex module. Adding a regex engine to numpy is probably not going to happen but other similar things can be added. For casefolding we'll need the unicode database CPython uses.

mathause · 2024-04-12T17:25:53Z

Could it be related to numpy/numpy#26136?

seberg · 2024-04-12T18:15:24Z

Yes it wasNathan already pushed a fix.

keewis · 2024-04-13T12:23:29Z

@kmuehlbauer, should we close this, now that it has been fixed upstream in numpy?

dcherian · 2024-04-13T13:00:34Z

Thanks for looking in to this Kai. It's not easy!

kmuehlbauer · 2024-04-13T13:27:28Z

Thanks for looking in to this Kai. It's not easy!

Indeed, apply_ufunc triggered my interest. Nevertheless, good it could be fixed upstream.

FIX: use str dtype without size information

2cd70c6

kmuehlbauer mentioned this pull request Apr 12, 2024

⚠️ Nightly upstream-dev CI failed ⚠️ #8844

Open

kmuehlbauer added the run-upstream Run upstream CI label Apr 12, 2024

seberg mentioned this pull request Apr 12, 2024

BUG/REG: Possible bug/regression in np.vectorize with strings numpy/numpy#26269

Closed

keewis added this to In progress in numpy2 compat via automation Apr 13, 2024

kmuehlbauer closed this Apr 13, 2024

numpy2 compat automation moved this from In progress to Done Apr 13, 2024

kmuehlbauer deleted the string_dtype branch April 15, 2024 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX: use str dtype without size information #8932

FIX: use str dtype without size information #8932

kmuehlbauer commented Apr 12, 2024 •

edited

keewis commented Apr 12, 2024 •

edited

seberg commented Apr 12, 2024

kmuehlbauer commented Apr 12, 2024

seberg commented Apr 12, 2024

keewis commented Apr 12, 2024 •

edited

ngoldbaum commented Apr 12, 2024

mathause commented Apr 12, 2024

seberg commented Apr 12, 2024

keewis commented Apr 13, 2024

dcherian commented Apr 13, 2024

kmuehlbauer commented Apr 13, 2024

FIX: use str dtype without size information #8932

FIX: use str dtype without size information #8932

Conversation

kmuehlbauer commented Apr 12, 2024 • edited

keewis commented Apr 12, 2024 • edited

seberg commented Apr 12, 2024

kmuehlbauer commented Apr 12, 2024

seberg commented Apr 12, 2024

keewis commented Apr 12, 2024 • edited

ngoldbaum commented Apr 12, 2024

mathause commented Apr 12, 2024

seberg commented Apr 12, 2024

keewis commented Apr 13, 2024

dcherian commented Apr 13, 2024

kmuehlbauer commented Apr 13, 2024

kmuehlbauer commented Apr 12, 2024 •

edited

keewis commented Apr 12, 2024 •

edited

keewis commented Apr 12, 2024 •

edited