Skip to content

feat: subchunk write order#160

Open
ilan-gold wants to merge 9 commits intomainfrom
ig/subchunk_write_order
Open

feat: subchunk write order#160
ilan-gold wants to merge 9 commits intomainfrom
ig/subchunk_write_order

Conversation

@ilan-gold
Copy link
Collaborator

@ilan-gold ilan-gold commented Mar 22, 2026

  1. I could not find a clean way to do Literal -> Enum in PyO3 despite extensive googling
  2. Maybe @d-v-b we standardize where this option goes since it applies to everyone? Happy to contribute this upstream to zarr-python (including C + random write order)
  3. Major release after this?

Comment on lines 214 to +223
#[pyo3(signature = (
array_metadata,
store_config,
*,
validate_checksums=false,
chunk_concurrent_minimum=None,
chunk_concurrent_maximum=None,
num_threads=None,
direct_io=false,
subchunk_write_order=SubchunkWriteOrderWrapper(SubchunkWriteOrder::Random),
Copy link
Collaborator Author

@ilan-gold ilan-gold Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically just got this working with the compiler but is it used since we have stubgen? I found https://pyo3.rs/v0.28.2/function/signature#type-annotations-in-the-signature which might be interesting

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, yeah that’s new! We could try migrating to it in a different PR.

@ilan-gold
Copy link
Collaborator Author

A complication: this only applies to the outer sharding codec.

Also maybe this isn't a pipeline setting in zarr-python but a codec setting (even though I think a lot of people just use shards= for sharding). Not sure here....

src/utils.rs Outdated

impl pyo3_stub_gen::PyStubType for SubchunkWriteOrderWrapper {
fn type_output() -> pyo3_stub_gen::TypeInfo {
pyo3_stub_gen::TypeInfo::builtin("str")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a bit cleaner for type hints: what about pyo3_stub_gen::TypeInfo::with_module("zarrs.SubchunkWriteOrder", "zarrs".into()) with

class SubchunkWriteOrder(StrEnum):
    C = "C"
    random = "random"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_order: SubchunkWriteOrder would only allow passing a enum member and not a regular string then right?

So users couldn’t do write_order="C" according to Python’s type system, they’d have to do write_order=SubchunkWriteOrder.C

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but i find "lexicographic" a bit nicer and less NumPy-specific than "C"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So users couldn’t do write_order="C" according to Python’s type system, they’d have to do write_order=SubchunkWriteOrder.C

I didn't even try to check if the zarr.config accepts arbitrary objects but I think in theory one could set these via environment variables as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but i find "lexicographic" a bit nicer and less NumPy-specific than "C"

Wow I didn't realize C was numpy-specific!

@d-v-b
Copy link

d-v-b commented Mar 23, 2026

Also maybe this isn't a pipeline setting in zarr-python but a codec setting (even though I think a lot of people just use shards= for sharding). Not sure here....

My feeling is that in zarr-python today the sharding codec owns the subchunk order, and that this setting should be passed to the sharding codec constructor. But why does this only apply to the outer sharding codec? The subchunk order is a degree of freedom for inner sharding codecs too, no?

@ilan-gold
Copy link
Collaborator Author

ilan-gold commented Mar 23, 2026

But why does this only apply to the outer sharding codec?

Yes in zarrs this is no problem: the option gets put in the codec, which can be nested

However, since we have just been making everything a runtime config setting here, I then though "oh, this could be done in zarr-python as well in zarr.config." But then I (re)realized last night that this is a codec setting as I had literally just implemented in rust. And therefore it applies per (potentially nested) codec, which I forgot in the course of making this PR.

So I think that since we have a way of grabbing the codecs from the ArrayMetadata, a codec setting is the way to go, you're right. And then zarrs-python is responsible for handling that setting.

@d-v-b
Copy link

d-v-b commented Mar 23, 2026

what does "random" mean? is there a randomness guarantee, or is it just that there is no order guarantee?

@ilan-gold
Copy link
Collaborator Author

ilan-gold commented Mar 23, 2026

what does "random" mean? is there a randomness guarantee, or is it just that there is no order guarantee?

That's a good point, "no order guarantee" I would say. This is effectively random in zarrs due to the threading so I guess that was the naming convention but I am not sure what happesns with only one thread.

@d-v-b
Copy link

d-v-b commented Mar 23, 2026

If the order is sensitive to execution time, then subchunks that take more time to process (e.g., they are harder to compress, or just larger if we get a rectilinear chunk grid inside shards) will tend to get stored "toward the back" of the chunk. In this case, "random" might mislead people into thinking the order is really random.

@ilan-gold
Copy link
Collaborator Author

Great point @d-v-b - random is definitely a misnomer then.

}

#[derive(Debug, Clone)]
pub struct SubchunkWriteOrderWrapper(pub SubchunkWriteOrder);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is requeired because of the orphan rule right? SubchunkWriteOrder is not our type and IntoPyObject isn’t either. Did you try implementing SubchunkWriteOrderExt instead for which you then implement IntoPyObject? The orphan rule probably prevents that too, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is requeired because of the orphan rule right? SubchunkWriteOrder is not our type and IntoPyObject isn’t either.

So went my thinking, although I think IntoPyObject not being our trait has nothing to do with this - it's just that we don't create SubchunkWriteOrderhere

Did you try implementing SubchunkWriteOrderExt instead for which you then implement IntoPyObject?

Is that different than what we have here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it’d be a trait instead of a type, which would let us use the unwrapped type. But as said, probably doesn’t work.

README.md Outdated
- Defaults to `False`.
- `codec_pipeline.strict`: raise exceptions for unsupported operations instead of falling back to the default codec pipeline of `zarr-python`.
- Defaults to `False`.
- `codec_pipeline.subchunk_write_order`: Tells `zarrs` in what order to write subchunks within a shard. One of "C" or "random."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a bit of reasoning why one would be chosen over the other.

ilan-gold and others added 4 commits March 23, 2026 12:00
@flying-sheep
Copy link
Collaborator

Regarding "random": maybe "arbitrary" would work instead? Or just None if we don’t want to reserve that to mean “the default” or “auto-derive” or so?

We don’t have to keep (and therefore explain in the readme) the same name as zarrs if we agree it’s a misnomer.

@ilan-gold
Copy link
Collaborator Author

We don’t have to keep (and therefore explain in the readme) the same name as zarrs if we agree it’s a misnomer.

Agreed, but I'd like to figure out what zarr-python is going to do. There is not much rush on this feature so I'm happy to take some time to get it right.

impl pyo3_stub_gen::PyStubType for SubchunkWriteOrderWrapper {
fn type_output() -> pyo3_stub_gen::TypeInfo {
pyo3_stub_gen::TypeInfo::builtin("str")
pyo3_stub_gen::TypeInfo::with_module("typing.Literal['C', 'random']", "typing".into())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah duh!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parameter is called name, so it’s not clear that it can accept any code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants