fix: disallow nullable on-disk categoricals for strings #2254
Conversation
ilan-gold commented Dec 16, 2025
- See Keep providing a non-nullable string type? #2252 (comment)
- Tests added
- Release note added (or unnecessary)
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2254      +/-  ##
==========================================
- Coverage   86.72%   84.65%    -2.08%
==========================================
  Files          46       46
  Lines        7196     7196
==========================================
- Hits         6241     6092     -149
- Misses        955     1104     +149
    v.categories._values
    if not pd.api.types.is_string_dtype(v.categories)
    else np.array(v.categories),
    dataset_kwargs=dataset_kwargs,
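For context on the branch condition in that snippet: `pd.api.types.is_string_dtype` matches both plain object-dtype string arrays and pandas' nullable `string` extension dtype. A quick sketch of that behavior (pandas 2.x):

```python
import numpy as np
import pandas as pd

# `is_string_dtype` treats both object-dtype string arrays and pandas'
# nullable "string" extension dtype as strings:
obj_strings = np.array(["a", "b"], dtype=object)
nullable_strings = pd.array(["a", "b"], dtype="string")
numbers = np.array([1, 2])

print(pd.api.types.is_string_dtype(obj_strings))       # True
print(pd.api.types.is_string_dtype(nullable_strings))  # True
print(pd.api.types.is_string_dtype(numbers))           # False
```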
I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …
Also I noticed that we write T arrays as non-nullable even though we should check whether they're nullable and decide based on that (but that's maybe out of scope for this PR)
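The nullability check mentioned here could look roughly like the following; the helper name is hypothetical and not part of anndata:

```python
import pandas as pd

def has_nullable_missing(values) -> bool:
    # Hypothetical helper: True if `values` is a pandas extension array
    # (i.e. a nullable container) that actually holds missing entries.
    return isinstance(values, pd.api.extensions.ExtensionArray) and bool(
        pd.isna(values).any()
    )

print(has_nullable_missing(pd.array(["a", None], dtype="string")))  # True
print(has_nullable_missing(pd.array(["a", "b"], dtype="string")))   # False
```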
anndata/src/anndata/_io/specs/methods.py
Line 581 in 46a8c60
_values is also iffy …
Oh it's crazy that we use this, no doubt.
I wonder if we should call into write_vlen_string_array{,_zarr} instead of converting this multiple times. _values is also iffy …
What would this solve? I think v.categories is a pandas object, and when we do np.array (I guess I assumed this), it should go to object dtype, in which case write_vlen_string_array is dispatched. So you're saying we should do np.array(v.categories, dtype=object) then?
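As a sanity check on that assumption (pandas 2.x behavior; the default Index dtype may differ in newer pandas):

```python
import numpy as np
import pandas as pd

cats = pd.Index(["x", "y"])  # object dtype by default in pandas 2.x

implicit = np.array(cats)                # inherits object dtype
explicit = np.array(cats, dtype=object)  # forces it, as suggested above

print(implicit.dtype, explicit.dtype)  # object object
```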
Your current code (both if branches) is just v.categories.to_numpy(), right? I don't understand why that ternary exists at all.
I can't say I know exactly what v.categories._values is for or what its expected return type is, just that it works at the moment. So I didn't want to touch it yet (hence the draft). But yes, to_numpy on the other end for strings is better.
I want to minimize numpy conversion because, if arrow types land in zarr, we should preserve the type of the categories (they might not be strings), in which case we would use .array and not ._values
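For illustration of that distinction: .array keeps the extension type (which is what would matter for arrow-backed categories), while .to_numpy() converts element by element. A small sketch with nullable string categories:

```python
import pandas as pd

idx = pd.Index(["a", "b"], dtype="string")  # nullable string categories

print(type(idx.array).__name__)  # StringArray: extension type preserved
print(idx.to_numpy().dtype)      # object: element-wise numpy conversion
```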
Using to_numpy now on the string side :)
(happy to do both sides if you feel strongly, just trying to minimize behavior changes)
._values is very well documented: https://github.com/pandas-dev/pandas/blob/9c40b37d7102a2a21ec45e31a4a21dc12756d59b/pandas/core/series.py#L794
It’s “.to_numpy() if it’s a NumpyExtensionArray, else .array”.
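That documented behavior can be checked directly (pandas 2.x; note that _values is still private, which is the point being made):

```python
import numpy as np
import pandas as pd

plain = pd.Series(["a", "b"])                     # numpy-backed (object dtype)
nullable = pd.Series(["a", "b"], dtype="string")  # extension-backed

# `_values` unwraps a NumpyExtensionArray to a plain ndarray, and
# otherwise returns the ExtensionArray itself:
print(type(plain._values).__name__)     # ndarray
print(type(nullable._values).__name__)  # StringArray

# Public equivalents, as suggested:
print(plain.to_numpy().dtype)         # object
print(type(nullable.array).__name__)  # StringArray
```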
So I’m saying let’s get rid of private APIs, no matter how well-documented, and use public ones instead.
Alright, sounds good!
Behaviorally, I'd say we should use whatever causes the fewest repetitions of "go through the whole array and convert every single element".
As pointed out in #2272, hdf5 understands numpy string dtype arrays, but I think pandas doesn’t do them yet.
…tegoricals for strings) (#2289) Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>