Remove attrs #1660

Merged
merged 31 commits into zarr-developers:v3 on Feb 19, 2024

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented Feb 8, 2024

This PR removes the use of attrs from the v3 branch. We were using attrs because it provides a convenient way to define classes with runtime type validation and serialization to / from dicts, among other features. However, we decided that, if possible, we would like to avoid the dependency on attrs. Therefore, we need to implement the following features ourselves:

  • faux-immutable classes
  • runtime validation of inputs to class constructors, which entails simple type checking ("is the input an int?") and also potentially recursively deserializing metadata classes from dicts.
  • to / from dict serialization (I distinguish this operation from JSON de/serialization, which we can handle separately)

This PR uses @dataclass(frozen=True) to decorate metadata classes. For each attribute a of class X, there should be a parser function (typically called parse_a) that X.__init__ calls; these functions either return parsed input or raise an exception. They generally have the type signature parse_x(data: JSON) -> SpecificType, that is, they narrow the type of their input. These functions are not tightly coupled to classes, so they can be reused easily.

After parsing an attribute, __init__ then calls object.__setattr__ to assign the parsed attribute to self (we cannot do self.a = parsed_a because the dataclass is frozen). Writing the parse_ functions is tedious but important, because if we make these functions strict, then we can avoid type checking elsewhere.
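
For example, a parser in this style might look roughly like the following (a sketch only -- parse_shape and the JSON alias here are illustrative, not the exact code in this PR):

from typing import Any, Tuple

# simplified stand-in for the JSON union type referred to above
JSON = Any

def parse_shape(data: JSON) -> Tuple[int, ...]:
    # narrow untyped JSON input to a tuple of ints, or raise
    if isinstance(data, (tuple, list)) and all(isinstance(v, int) for v in data):
        return tuple(data)
    raise TypeError(f"Expected an iterable of ints, got {data!r}")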

See an illustrative example here for v3 array metadata.

For to / from dict serialization, all the metadata classes inherit from this base class (right now it's abstract, but maybe this is pointless since I define concrete methods), which defines base to_dict and from_dict methods that do what you would expect from the names. Subclasses refine these methods as needed.
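
Roughly, that base class does something like the following (a sketch under my own assumptions, not the exact implementation; subclasses refine these methods as needed):

from dataclasses import dataclass, fields
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T", bound="Metadata")

@dataclass(frozen=True)
class Metadata:
    def to_dict(self) -> Dict[str, Any]:
        # default: one key per dataclass field, recursing into nested Metadata values
        out: Dict[str, Any] = {}
        for field in fields(self):
            value = getattr(self, field.name)
            out[field.name] = value.to_dict() if isinstance(value, Metadata) else value
        return out

    @classmethod
    def from_dict(cls: Type[T], data: Dict[str, Any]) -> T:
        # default: hand the dict to the constructor, which parses each field
        return cls(**data)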

This is still evolving, and there are a lot of other changes in here that might get obviated by other PRs, so I'm happy to wade through messy merges on my side.

Specifically, I am working through major changes to the codec API here that were necessary to ensure that we do all of our input validation in one place (i.e., on construction of the ArrayMetadata). In the v3 branch, codecs get validated in the CodecPipeline object, which isn't associated with ArrayMetadata, but this means we are doing validation in two places, and ideally we would only do it in one place. So in this branch, ArrayMetadata.codecs is a list of validated Codec instances instead of CodecMetadata, and the codec pipeline object is obviated, which entails a lot of changes that I'm still ironing out.
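
As a rough illustration of what "validation on construction" could look like (a hedged sketch only -- Codec, codec_registry, and parse_codecs are hypothetical names, and ArrayMetadata is stripped down to a single field):

from dataclasses import dataclass
from typing import Any, Dict, Iterable, Tuple, Union

class Codec:
    """Stand-in for a validated codec class."""

# hypothetical mapping from codec name to codec class
codec_registry: Dict[str, Any] = {}

def parse_codecs(data: Iterable[Union[Codec, Dict[str, Any]]]) -> Tuple[Codec, ...]:
    # accept already-validated Codec instances or raw metadata dicts
    out = []
    for item in data:
        if isinstance(item, Codec):
            out.append(item)
        elif isinstance(item, dict):
            # look up the codec class by name and let it validate its own configuration
            out.append(codec_registry[item["name"]](**item.get("configuration", {})))
        else:
            raise TypeError(f"Expected Codec or dict, got {type(item)}")
    return tuple(out)

@dataclass(frozen=True)
class ArrayMetadata:
    codecs: Tuple[Codec, ...]

    def __init__(self, codecs: Iterable[Union[Codec, Dict[str, Any]]]) -> None:
        object.__setattr__(self, "codecs", parse_codecs(codecs))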

Because of the above, tests are definitely not passing, but I'm working on it!

…data to NamedConfig; turn chunk encoding into stand-alone functions; put codecs on ArrayMetadata instead of just CodecMetadata; add runtime_configuration parameter to codec encode / decode methods; add parsers to codecs
@pep8speaks

pep8speaks commented Feb 8, 2024

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 11:1: E302 expected 2 blank lines, found 1

Comment last updated at 2024-02-19 13:12:48 UTC

@d-v-b d-v-b mentioned this pull request Feb 8, 2024
@d-v-b d-v-b added the V3 label (Related to compatibility with V3 spec) Feb 8, 2024
@normanrz normanrz changed the title from "[WIP] Remove attrs" to "Remove attrs" Feb 12, 2024
@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

we are using the following pattern here

from dataclasses import dataclass
from typing import Any, Literal

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class foo:
    property: Literal["correct"]

    def __init__(self, property_in):
        property_parsed = parse_property(property_in)
        object.__setattr__(self, "property", property_parsed)

I am a little worried about the cost of inheritance with this approach. For example, merely changing the type of property in the above example requires a new implementation of __init__. Does anyone think this is worth worrying about? (I am aware that addressing a finite sum of concerns like these will take us to a re-implementation of attrs / pydantic 😀 )

@normanrz
Contributor

Not sure I understand what you mean wrt cost of inheritance here?

@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

I will expand the example:

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class Foo:
    property_a: Literal["correct"]

    def __init__(self, property_a_in):
        property_a_parsed = parse_property(property_a_in)
        object.__setattr__(self, "property_a", property_a_parsed)

@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # we have to re-implement the __init__ method, even though we "just" added a new attribute
    def __init__(self, property_a_in, property_b_in):
        property_a_parsed = parse_property(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_a", property_a_parsed)
        object.__setattr__(self, "property_b", property_b_parsed)

Basically, if we subclass and add a new attribute (or subclass and change the type of an attribute), then we end up needing to re-implement the __init__ method, which could be error-prone if we have classes with lots of attributes. An alternative would be to invest in a registry between types and validator functions, and have classes use this registry in __init__ (I'm guessing pydantic and attrs use this approach, but I could be wrong).

To be clear, I'm not arguing against the approach we are taking in this PR. Rather, I'm pointing out a drawback of that approach, and inviting criticism :)

@normanrz
Contributor

Thanks for expanding. We could delegate in Bar.__init__:

@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # delegate parsing of `property_a` to Foo.__init__, then handle only the new attribute
    def __init__(self, property_a_in, property_b_in):
        super().__init__(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_b", property_b_parsed)

The method signature would have to deal with all arguments. I don't think it is a bad thing to have all arguments listed. That is what users will also have to know about.

@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

While it works if an attribute is being added, delegation with super() is unattractive if the child class is narrowing the type of an attribute defined on the parent class -- we end up setting the same attribute twice (once in parent.__init__ and again in child.__init__), and wasting the compute required to parse the attribute in parent.__init__. And given python's lax type system, there may be cases where a child class overrides the type of an attribute in a way that fails the parent's validation, but passes the child's validation.

Of course there's a direct solution for these situations -- we can just re-write __init__ from scratch in the subclass -- but if we think that's too onerous, then we would need to make our ad-hoc "classes with runtime-validated attributes" system a bit fancier.

@normanrz
Contributor

but if we think that's too onerous,

I don't think so. Most init functions are only a few lines of code.

@d-v-b d-v-b mentioned this pull request Feb 14, 2024
@normanrz normanrz mentioned this pull request Feb 16, 2024
@normanrz
Contributor

I fixed some typing issues to make mypy happier. I think it would be great to turn that back on soon. I think it really helps for development.
Anyways, I feel like this PR is getting bigger and bigger. We should merge it quickly. @d-v-b would you be ready for that?

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

Anyways, I feel like this PR is getting bigger and bigger.

I think this is OK? I want to get this design right, since it feels pretty core to the library. I worry that if we sign off on a design in this PR that is provisional, then we won't have a chance to fix it in the future.

In particular, one thing I'm still turning over in my head is whether we want these metadata classes to be easy to subclass. Our current design adds just enough friction to this process that I worry that it would discourage people, or lead them to make mistakes in the __init__ method that would produce invalid zarr metadata.

A concrete example: suppose a user wants to subclass ArrayMetadata to model an array where the data type must be uint8. In our current design, they have to do this:

import numpy as np
from dataclasses import dataclass
from typing import Any, Literal

def parse_uint8_dtype(data: Any) -> Literal[np.uint8]:
    if data == 'uint8':
        return np.uint8
    if data == np.uint8:
        return data
    raise ValueError("data wasn't uint8")

@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]

    def __init__(self, **kwargs):
        super().__init__(**kwargs) # this will redundantly set the `data_type` attribute
        dtype_parsed = parse_uint8_dtype(kwargs["data_type"]) # the user had to write a parser that handles this type
        object.__setattr__(self,  'data_typ',  dtype_parsed)

I included a subtle but realistic bug in the example above -- object.__setattr__ is setting the data_typ attribute, not data_type. Because data_type was set in super().__init__(), this will not error, but the class will not do what the user expects. We should try to prevent these kinds of mistakes by design.

An alternative design that I'm leaning towards is to have a central registry that's basically a type : parser mapping, where the __init__ routine of classes that inherit from Metadata uses the type annotation of its fields to look up the correct parsing function. E.g.

from dataclasses import dataclass, fields
from typing import Any

def parse_int(data: Any) -> int:
    if isinstance(data, int):
        return data
    raise TypeError(f'Expected an int, got {type(data)}')

# the real thing wouldn't be a mere dict
parser_registry = {int: parse_int}

@dataclass(frozen=True)
class Metadata:

    def __init__(self, **kwargs):
        for field in fields(self):
            parser = parser_registry.get(field.type)
            parsed_value = parser(kwargs[field.name])
            object.__setattr__(self, field.name, parsed_value)

# init=False so that Foo inherits Metadata.__init__ rather than getting a generated one
@dataclass(frozen=True, init=False)
class Foo(Metadata):
    a: int  # because this is annotated and the type is registered, no need to implement `__init__` for this class

With this design, a user who wants to subclass ArrayMetadata to restrict the data_type to uint8 would write the following code:

@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]

Provided that the type Literal[np.uint8] can be resolved to a parser, this is a lot easier for the user, and less error-prone. Also, it superficially looks like pydantic / other runtime type checkers that check types purely from the annotation, which is nicer for users.

Logically, the type annotation of a field should determine which parser is run against that field. If we can implement that constraint in code, then we should. I think our task is simplified by the finite set of input types we have to deal with -- we don't need to handle arbitrary python types, only the types that we are using for zarr metadata.

I have some travel coming up but I will try to find time to push on this.

@normanrz
Contributor

Anyways, I feel like this PR is getting bigger and bigger.

I think this is OK? I want to get this design right, since it feels pretty core to the library. I worry that if we sign off on a design in this PR that is provisional, then we won't have a chance to fix it in the future.

I don't think designs have to be fixed once we merge PRs into v3. Later PRs can change the APIs. On the other hand, it is quite difficult to make progress with large PRs lingering around. For example, I based my #1670 PR on this branch, which is not ideal because then, by default, the PR would end up targeting your fork.

I am not a fan of a parser registry. It makes things more complex for, imo, very little gain. The current approach is much simpler and easier to reason about. Also, I find that most attributes are specific to a particular class. Where they are shared among multiple classes, we can share the function -- simple. Using the Python type system is not sufficient. For example, how would you specify a type that is an integer between -131072 and 22 (zstd level)?

I am not concerned with making it easy to subclass. I don't see real use cases for subclassing the ArrayMetadata. Zarr 3 has explicit extension points (e.g. codecs, chunk grids, data types). We should aim to make those easier to extend with base classes and registries.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

For example, how would you specify a type that is an integer between -131072 and 22 (zstd level)?

I would use typing.Annotated and declare the range check as a validator, like pydantic does. Alternatively, one could define a class that wraps int.
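
For illustration, a minimal sketch of the typing.Annotated idea (not pydantic, and not code from this PR -- Range and parse_annotated are hypothetical names):

from dataclasses import dataclass
from typing import Annotated, Any

@dataclass(frozen=True)
class Range:
    # constraint marker carried inside an Annotated type
    lo: int
    hi: int

    def __call__(self, data: Any) -> int:
        if not isinstance(data, int):
            raise TypeError(f"Expected an int, got {type(data)}")
        if not (self.lo <= data <= self.hi):
            raise ValueError(f"Expected an int in [{self.lo}, {self.hi}], got {data}")
        return data

# e.g. the zstd compression level
ZstdLevel = Annotated[int, Range(-131072, 22)]

def parse_annotated(annotation: Any, data: Any) -> Any:
    # Annotated[...] types store their extra arguments in __metadata__
    for constraint in getattr(annotation, "__metadata__", ()):
        data = constraint(data)
    return data

parse_annotated(ZstdLevel, 3)  # returns 3
# parse_annotated(ZstdLevel, 100)  # raises ValueError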

I am not concerned with making it easy to subclass. I don't see real use cases for subclassing the ArrayMetadata. Zarr 3 has explicit extension points (e.g. codecs, chunk grids, data types). We should aim to make those easier to extend with base classes and registries.

You personally may not be concerned with subclassing ArrayMetadata, but the use cases are real, and bringing up specification-defined extension points indicates that you don't appreciate the goal I am advocating for. My interest in subclassing is not to support extensions of ArrayMetadata, but rather constraints. In the example I gave above, a user wants to define array metadata with constraints, e.g. that the dtype only be uint8. I think this is actually a realistic scenario. Another realistic scenario is a user who wants to define an array or group with constrained attributes. Many communities using Zarr today are working with groups or arrays that depend on structured attributes, e.g. ome-ngff, xarray, anndata, etc. If we think of the Zarr group / array metadata document as a type, then these arrays / groups with structured metadata are subtypes of that broader type; it's pretty natural to represent this type / subtype relationship with inheritance in python, but only if we write an API that facilitates it. The alternative is people re-writing the array metadata in their own library which does support validation of custom attributes (e.g., pydantic-zarr).

All that being said, I do take your point that we don't need to settle this API in this PR, so I'm happy deferring these decisions for later, and I'm happy to merge this PR as-is

@normanrz
Contributor

You personally may not be concerned with subclassing ArrayMetadata, but the use cases are real, and bringing up specification-defined extension points indicates that you don't appreciate the goal I am advocating for.

Maybe I am just not understanding it, sorry!

Many communities using Zarr today are working with groups or arrays that depend on structured attributes, e.g. ome-ngff, xarray, anndata, etc. If we think of the Zarr group / array metadata document as a type, then these arrays / groups with structured metadata are subtypes of that broader type; it's pretty natural to represent this type / subtype relationship with inheritance in python, but only if we write an API that facilitates it. The alternative is people re-writing the array metadata in their own library which does support validation of custom attributes (e.g., pydantic-zarr).

I also thought about the structured attributes use case. I'm not sure the only way to do that is through subclassing. I think composition would also work well. The ArrayMetadata class could have a hook for parsing/validating attributes, e.g. an init arg.
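
For example, something along these lines (a hedged sketch -- the real ArrayMetadata has many more fields, and attributes_parser is a hypothetical name for the hook):

from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping

def parse_attributes_default(data: Mapping[str, Any]) -> Dict[str, Any]:
    # default hook: accept any mapping as-is
    return dict(data)

@dataclass(frozen=True)
class ArrayMetadata:
    attributes: Dict[str, Any]

    def __init__(
        self,
        attributes: Mapping[str, Any],
        attributes_parser: Callable[[Mapping[str, Any]], Dict[str, Any]] = parse_attributes_default,
    ) -> None:
        # the caller-supplied hook validates the attributes before they are stored
        object.__setattr__(self, "attributes", attributes_parser(attributes))

A library built around structured attributes (e.g. ome-ngff tooling) could then pass a parser that, say, requires a "multiscales" key, without subclassing ArrayMetadata.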

I can see the need for subclassing Arrays to implement different behavior. With the ArrayMetadata, I like that it is quite strict and not as easy to customize, except for the well-defined hooks.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

and to be fair, our baseline in zarr-python today is basically no declarative representation of the metadata document, so I think we have already surpassed this considerably. I think we can do much more, but I will leave that stuff for other prs.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

any objections to me merging this?

@normanrz
Contributor

any objections to me merging this?

Please go ahead :shipit:

@d-v-b d-v-b merged commit 76c3450 into zarr-developers:v3 Feb 19, 2024
7 checks passed
@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

thanks for your help here @normanrz!

@d-v-b d-v-b deleted the remove_attrs branch February 19, 2024 22:08
@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 5, 2024