Skip to content

fix: disallow nullable on-disk categoricals for strings#2254

Merged
ilan-gold merged 4 commits intomainfrom
ig/categorical_on_disk
Jan 8, 2026
Merged

fix: disallow nullable on-disk categoricals for strings#2254
ilan-gold merged 4 commits intomainfrom
ig/categorical_on_disk

Conversation

@ilan-gold
Copy link
Copy Markdown
Contributor

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.65%. Comparing base (bb7a31c) to head (a28fc4c).
⚠️ Report is 49 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2254      +/-   ##
==========================================
- Coverage   86.72%   84.65%   -2.08%     
==========================================
  Files          46       46              
  Lines        7196     7196              
==========================================
- Hits         6241     6092     -149     
- Misses        955     1104     +149     
Files with missing lines Coverage Δ
src/anndata/_io/specs/methods.py 90.57% <ø> (-0.35%) ⬇️

... and 6 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.7 milestone Dec 16, 2025
@flying-sheep flying-sheep modified the milestones: 0.12.7, 0.12.8 Dec 16, 2025
Comment thread src/anndata/_io/specs/methods.py Outdated
Comment on lines +1098 to +1101
v.categories._values
if not pd.api.types.is_string_dtype(v.categories)
else np.array(v.categories),
dataset_kwargs=dataset_kwargs,
Copy link
Copy Markdown
Member

@flying-sheep flying-sheep Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …

Also I noticed that we write T arrays as non-nullable even though we should check if they’re nullable and decide based on that (but that’s maybe out-of-scope for this PR)

@_REGISTRY.register_write(ZarrGroup, (np.ndarray, "T"), IOSpec("string-array", "0.2.0"))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values is also iffy …

Oh it's crazy that we use this, no doubt.

I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …

What would this solve? I think v.categories is a pandas object and when we do np.array (I guess I assumed this), it should go to object dtype in which case write_vlen_string_array is dispatched. So you're saying we should do np.array(v.categories, dtype=object) then?

Copy link
Copy Markdown
Member

@flying-sheep flying-sheep Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your current code (both if branches) are just v.categories.to_numpy(), right? I don’t understand why that ternary exists at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I know exactly what v.categories._values is for or expected return type, just that it works at the moment. So I didn't want to touch it yet (hence draft). But yes, to_numpy on the other end for strings is better.

I want to minimize numpy conversion because if arrow types land in zarr, we should preserve the type of the categories (they could be not string) in which case we would use .array and not ._values

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using to_numpy now on the string side :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(happy to do both sides if you feel strongly, just trying to minimze behavior changes)

Copy link
Copy Markdown
Member

@flying-sheep flying-sheep Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

._values is very well documented: https://github.com/pandas-dev/pandas/blob/9c40b37d7102a2a21ec45e31a4a21dc12756d59b/pandas/core/series.py#L794

It’s “.to_numpy() if it’s a NumpyExtensionArray, else .array”.

So I’m saying let’s get rid of private APIs, no matter how well-documented, and use public ones instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, sounds good!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behaviorally, I’d say we should use what causes the fewest repetitions of “go through the whole array and convert every single element”.

As pointed out in #2272, hdf5 understands numpy string dtype arrays, but I think pandas doesn’t do them yet.

@ilan-gold ilan-gold marked this pull request as ready for review January 8, 2026 12:30
@ilan-gold ilan-gold requested a review from flying-sheep January 8, 2026 12:30
@ilan-gold ilan-gold enabled auto-merge (squash) January 8, 2026 12:48
@ilan-gold ilan-gold merged commit 088fa16 into main Jan 8, 2026
21 checks passed
@ilan-gold ilan-gold deleted the ig/categorical_on_disk branch January 8, 2026 12:52
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Jan 8, 2026
ilan-gold added a commit that referenced this pull request Jan 8, 2026
…tegoricals for strings) (#2289)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants