fix: disallow nullable on-disk categoricals for strings #2254
Conversation
ilan-gold commented Dec 16, 2025
- See Keep providing a non-nullable string type? #2252 (comment)
- Tests added
- Release note added (or unnecessary)
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2254      +/-  ##
==========================================
- Coverage   86.72%   84.65%    -2.08%
==========================================
  Files          46       46
  Lines        7196     7196
==========================================
- Hits         6241     6092     -149
- Misses        955     1104     +149
    v.categories._values
    if not pd.api.types.is_string_dtype(v.categories)
    else np.array(v.categories),
    dataset_kwargs=dataset_kwargs,
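For context on the branch condition in that snippet: `pd.api.types.is_string_dtype` matches both plain object-dtype string arrays and pandas' nullable `string` extension dtype. A quick sketch of that behavior (pandas 2.x):

```python
import numpy as np
import pandas as pd

# `is_string_dtype` treats both object-dtype string arrays and pandas'
# nullable "string" extension dtype as strings:
obj_strings = np.array(["a", "b"], dtype=object)
nullable_strings = pd.array(["a", "b"], dtype="string")
numbers = np.array([1, 2])

print(pd.api.types.is_string_dtype(obj_strings))       # True
print(pd.api.types.is_string_dtype(nullable_strings))  # True
print(pd.api.types.is_string_dtype(numbers))           # False
```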
I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …
Also I noticed that we write T arrays as non-nullable even though we should check whether they're nullable and decide based on that (but that's maybe out of scope for this PR)
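The nullability check mentioned here could look roughly like the following; the helper name is hypothetical and not part of anndata:

```python
import pandas as pd

def has_nullable_missing(values) -> bool:
    # Hypothetical helper: True if `values` is a pandas extension array
    # (i.e. a nullable container) that actually holds missing entries.
    return isinstance(values, pd.api.extensions.ExtensionArray) and bool(
        pd.isna(values).any()
    )

print(has_nullable_missing(pd.array(["a", None], dtype="string")))  # True
print(has_nullable_missing(pd.array(["a", "b"], dtype="string")))   # False
```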
anndata/src/anndata/_io/specs/methods.py
Line 581 in 46a8c60
_values is also iffy …
Oh it's crazy that we use this, no doubt.
I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …
What would this solve? I think v.categories is a pandas object, and when we do np.array (I guess I assumed this), it should go to object dtype, in which case write_vlen_string_array is dispatched. So you're saying we should do np.array(v.categories, dtype=object) then?
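As a sanity check on that assumption (pandas 2.x behavior; the default Index dtype may differ in newer pandas):

```python
import numpy as np
import pandas as pd

cats = pd.Index(["x", "y"])  # object dtype by default in pandas 2.x

implicit = np.array(cats)                # inherits object dtype
explicit = np.array(cats, dtype=object)  # forces it, as suggested above

print(implicit.dtype, explicit.dtype)  # object object
```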
Your current code (both if branches) is just v.categories.to_numpy(), right? I don't understand why that ternary exists at all.
I can't say I know exactly what v.categories._values is for or what its expected return type is, just that it works at the moment. So I didn't want to touch it yet (hence the draft). But yes, to_numpy on the other end for strings is better.
I want to minimize numpy conversion because, if arrow types land in zarr, we should preserve the type of the categories (they might not be strings), in which case we would use .array and not ._values
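For illustration of that distinction: .array keeps the extension type (which is what would matter for arrow-backed categories), while .to_numpy() converts element by element. A small sketch with nullable string categories:

```python
import pandas as pd

idx = pd.Index(["a", "b"], dtype="string")  # nullable string categories

print(type(idx.array).__name__)  # StringArray: extension type preserved
print(idx.to_numpy().dtype)      # object: element-wise numpy conversion
```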
Using to_numpy now on the string side :)
(happy to do both sides if you feel strongly, just trying to minimize behavior changes)
._values is very well documented: https://github.com/pandas-dev/pandas/blob/9c40b37d7102a2a21ec45e31a4a21dc12756d59b/pandas/core/series.py#L794
It’s “.to_numpy() if it’s a NumpyExtensionArray, else .array”.
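That documented behavior can be checked directly (pandas 2.x; note that _values is still private, which is the point being made):

```python
import numpy as np
import pandas as pd

plain = pd.Series(["a", "b"])                     # numpy-backed (object dtype)
nullable = pd.Series(["a", "b"], dtype="string")  # extension-backed

# `_values` unwraps a NumpyExtensionArray to a plain ndarray, and
# otherwise returns the ExtensionArray itself:
print(type(plain._values).__name__)     # ndarray
print(type(nullable._values).__name__)  # StringArray

# Public equivalents, as suggested:
print(plain.to_numpy().dtype)         # object
print(type(nullable.array).__name__)  # StringArray
```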
So I’m saying let’s get rid of private APIs, no matter how well-documented, and use public ones instead.
Alright, sounds good!
Behaviorally, I'd say we should use whatever causes the fewest repetitions of "go through the whole array and convert every single element".
As pointed out in #2272, hdf5 understands numpy string dtype arrays, but I think pandas doesn’t do them yet.
…tegoricals for strings) (#2289) Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>