Remove attrs #1660

Merged
merged 31 commits into zarr-developers:v3 on Feb 19, 2024

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented Feb 8, 2024

This PR removes the use of attrs from the v3 branch. We were using attrs because it provides a convenient way to define classes with runtime type validation and serialization to / from dicts, among other features. However, we decided that, if possible, we would like to avoid the dependency on attrs. Therefore, we need to implement the following features ourselves:

  • faux-immutable classes
  • runtime validation of inputs to class constructors, which entails simple type checking ("is the input an int?") and also potentially recursively deserializing metadata classes from dicts.
  • to / from dict serialization (I distinguish this operation from JSON de/serialization, which we can handle separately)

This PR uses @dataclass(frozen=True) to decorate metadata classes. For each attribute a of class X, there should be a parser function (typically called parse_a) that X.__init__ calls; these functions either return parsed input or raise an exception. They generally have the type signature parse_x(data: JSON) -> SpecificType, that is, they narrow the type of their input. These functions are not tightly coupled to classes, so they can be reused easily.

After parsing an attribute, __init__ then calls object.__setattr__ to assign the parsed attribute to self (we cannot do self.a = parsed_a because the dataclass is frozen). Writing the parse_ functions is tedious but important, because if we make these functions strict, then we can avoid type checking elsewhere.
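
For example, a parser in this style might look roughly like the following (a sketch only -- parse_shape and the JSON alias here are illustrative, not the exact code in this PR):

from typing import Any, Tuple

# simplified stand-in for the JSON union type referred to above
JSON = Any

def parse_shape(data: JSON) -> Tuple[int, ...]:
    # narrow untyped JSON input to a tuple of ints, or raise
    if isinstance(data, (tuple, list)) and all(isinstance(v, int) for v in data):
        return tuple(data)
    raise TypeError(f"Expected an iterable of ints, got {data!r}")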

See an illustrative example here for v3 array metadata.

For to / from dict serialization, all the metadata classes inherit from this base class (right now it's abstract, but maybe this is pointless since I define concrete methods), which defines base to_dict and from_dict methods that do what you would expect from the names. Subclasses refine these methods as needed.
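
Roughly, that base class does something like the following (a sketch under my own assumptions, not the exact implementation; subclasses refine these methods as needed):

from dataclasses import dataclass, fields
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T", bound="Metadata")

@dataclass(frozen=True)
class Metadata:
    def to_dict(self) -> Dict[str, Any]:
        # default: one key per dataclass field, recursing into nested Metadata values
        out: Dict[str, Any] = {}
        for field in fields(self):
            value = getattr(self, field.name)
            out[field.name] = value.to_dict() if isinstance(value, Metadata) else value
        return out

    @classmethod
    def from_dict(cls: Type[T], data: Dict[str, Any]) -> T:
        # default: hand the dict to the constructor, which parses each field
        return cls(**data)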

This is still evolving, and there are a lot of other changes in here that might get obviated by other PRs, so I'm happy to wade through messy merges on my side.

Specifically, I am working through major changes to the codec API here that were necessary to ensure that we do all of our input validation in one place (i.e., on construction of the ArrayMetadata). In the v3 branch, codecs get validated in the CodecPipeline object, which isn't associated with ArrayMetadata, but this means we are doing validation in two places, and ideally we would only do it in one place. So in this branch, ArrayMetadata.codecs is a list of validated Codec instances instead of CodecMetadata, and the codec pipeline object is obviated, which entails a lot of changes that I'm still ironing out.
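
As a rough illustration of what "validation on construction" could look like (a hedged sketch only -- Codec, codec_registry, and parse_codecs are hypothetical names, and ArrayMetadata is stripped down to a single field):

from dataclasses import dataclass
from typing import Any, Dict, Iterable, Tuple, Union

class Codec:
    """Stand-in for a validated codec class."""

# hypothetical mapping from codec name to codec class
codec_registry: Dict[str, Any] = {}

def parse_codecs(data: Iterable[Union[Codec, Dict[str, Any]]]) -> Tuple[Codec, ...]:
    # accept already-validated Codec instances or raw metadata dicts
    out = []
    for item in data:
        if isinstance(item, Codec):
            out.append(item)
        elif isinstance(item, dict):
            # look up the codec class by name and let it validate its own configuration
            out.append(codec_registry[item["name"]](**item.get("configuration", {})))
        else:
            raise TypeError(f"Expected Codec or dict, got {type(item)}")
    return tuple(out)

@dataclass(frozen=True)
class ArrayMetadata:
    codecs: Tuple[Codec, ...]

    def __init__(self, codecs: Iterable[Union[Codec, Dict[str, Any]]]) -> None:
        object.__setattr__(self, "codecs", parse_codecs(codecs))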

Because of the above, tests are definitely not passing, but I'm working on it!

…data to NamedConfig; turn chunk encoding into stand-alone functions; put codecs on ArrayMetadata instead of just CodecMetadata; add runtime_configuration parameter to codec encode / decode methods; add parsers to codecs
@pep8speaks

pep8speaks commented Feb 8, 2024

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 11:1: E302 expected 2 blank lines, found 1

Comment last updated at 2024-02-19 13:12:48 UTC

@d-v-b d-v-b mentioned this pull request Feb 8, 2024
@d-v-b d-v-b added the V3 label (Related to compatibility with V3 spec) Feb 8, 2024
@normanrz normanrz changed the title from "[WIP] Remove attrs" to "Remove attrs" Feb 12, 2024
@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

we are using the following pattern here

from dataclasses import dataclass
from typing import Any, Literal

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class foo:
    property: Literal["correct"]

    def __init__(self, property_in):
        property_parsed = parse_property(property_in)
        object.__setattr__(self, "property", property_parsed)

I am a little worried about the cost of inheritance with this approach. For example, merely changing the type of property in the above example requires a new implementation of __init__. Does anyone think this is worth worrying about? (I am aware that addressing a finite sum of concerns like these will take us to a re-implementation of attrs / pydantic 😀 )

@normanrz
Contributor

Not sure I understand what you mean wrt cost of inheritance here?

@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

I will expand the example:

def parse_property(data: Any) -> Literal["correct"]:
    # check that the data is correct
    if data == 'correct':
        return data
    raise ValueError('It was not correct')

@dataclass(frozen=True)
class Foo:
    property_a: Literal["correct"]

    def __init__(self, property_a_in):
        property_a_parsed = parse_property(property_a_in)
        object.__setattr__(self, "property_a", property_a_parsed)

@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # we have to re-implement the __init__ method, even though we "just" added a new attribute
    def __init__(self, property_a_in, property_b_in):
        property_a_parsed = parse_property(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_a", property_a_parsed)
        object.__setattr__(self, "property_b", property_b_parsed)

Basically, if we subclass and add a new attribute (or subclass and change the type of an attribute), then we end up needing to re-implement the __init__ method, which could be error-prone if we have classes with lots of attributes. An alternative would be to invest in a registry between types and validator functions, and have classes use this registry in __init__ (I'm guessing pydantic and attrs use this approach, but I could be wrong).

To be clear, I'm not arguing against the approach we are taking in this PR. Rather, I'm pointing out a drawback of that approach, and inviting criticism :)

@normanrz
Contributor

Thanks for expanding. We could delegate in Bar.__init__:

@dataclass(frozen=True)
class Bar(Foo):
    # add a new property with the same type as `Foo.property_a`
    property_b: Literal["correct"]

    # delegate parsing of `property_a` to Foo.__init__, then handle only the new attribute
    def __init__(self, property_a_in, property_b_in):
        super().__init__(property_a_in)
        property_b_parsed = parse_property(property_b_in)
        object.__setattr__(self, "property_b", property_b_parsed)

The method signature would have to deal with all arguments. I don't think it is a bad thing to have all arguments listed. That is what users will also have to know about.

@d-v-b
Contributor Author

d-v-b commented Feb 13, 2024

While it works if an attribute is being added, delegation with super() is unattractive if the child class is narrowing the type of an attribute defined on the parent class -- we end up setting the same attribute twice (once in parent.__init__ and again in child.__init__), and wasting the compute required to parse the attribute in parent.__init__. And given python's lax type system, there may be cases where a child class overrides the type of an attribute in a way that fails the parent's validation, but passes the child's validation.

Of course there's a direct solution for these situations -- we can just re-write __init__ from scratch in the subclass -- but if we think that's too onerous, then we would need to make our ad-hoc "classes with runtime-validated attributes" system a bit fancier.

@normanrz
Contributor

but if we think that's too onerous,

I don't think so. Most init functions are only a few lines of code.

@d-v-b d-v-b mentioned this pull request Feb 14, 2024
@normanrz normanrz mentioned this pull request Feb 16, 2024
@normanrz
Contributor

I fixed some typing issues to make mypy happier. I think it would be great to turn that back on soon. I think it really helps for development.
Anyways, I feel like this PR is getting bigger and bigger. We should merge it quickly. @d-v-b would you be ready for that?

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

Anyways, I feel like this PR is getting bigger and bigger.

I think this is OK? I want to get this design right, since it feels pretty core to the library. I worry that if we sign off on a design in this PR that is provisional, then we won't have a chance to fix it in the future.

In particular, one thing I'm still turning over in my head is whether we want these metadata classes to be easy to subclass. Our current design adds just enough friction to this process that I worry that it would discourage people, or lead them to make mistakes in the __init__ method that would produce invalid zarr metadata.

A concrete example: suppose a user wants to subclass ArrayMetadata to model an array where the data type must be uint8. In our current design, they have to do this:

import numpy as np
from dataclasses import dataclass
from typing import Any, Literal

def parse_uint8_dtype(data: Any) -> Literal[np.uint8]:
    if data == 'uint8':
        return np.uint8
    if data == np.uint8:
        return data
    raise ValueError("data wasn't uint8")

@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]

    def __init__(self, **kwargs):
        super().__init__(**kwargs) # this will redundantly set the `data_type` attribute
        dtype_parsed = parse_uint8_dtype(kwargs["data_type"]) # the user had to write a parser that handles this type
        object.__setattr__(self,  'data_typ',  dtype_parsed)

I included a subtle but realistic bug in the example above -- object.__setattr__ is setting the data_typ attribute, not data_type. Because data_type was set in super().__init__(), this will not error, but the class will not do what the user expects. We should try to prevent these kinds of mistakes by design.

An alternative design that I'm leaning towards is to have a central registry that's basically a type : parser mapping, where the __init__ routine of classes that inherit from Metadata uses the type annotation of its fields to look up the correct parsing function. E.g.

from dataclasses import dataclass, fields
from typing import Any

def parse_int(data: Any) -> int:
    if isinstance(data, int):
        return data
    raise TypeError(f'Expected an int, got {type(data)}')

# the real thing wouldn't be a mere dict
parser_registry = {int: parse_int}

@dataclass(frozen=True)
class Metadata:

    def __init__(self, **kwargs):
        for field in fields(self):
            parser = parser_registry.get(field.type)
            parsed_value = parser(kwargs[field.name])
            object.__setattr__(self, field.name, parsed_value)

# init=False so that Foo inherits Metadata.__init__ rather than getting a generated one
@dataclass(frozen=True, init=False)
class Foo(Metadata):
    a: int  # because this is annotated and the type is registered, no need to implement `__init__` for this class

With this design, a user who wants to subclass ArrayMetadata to restrict the data_type to uint8 would write the following code:

@dataclass(frozen=True)
class Uint8ArrayMetadata(ArrayMetadata):
    data_type: Literal[np.uint8]

Provided that the type Literal[np.uint8] can be resolved to a parser, this is a lot easier for the user, and less error-prone. Also, it superficially looks like pydantic / other runtime type checkers that check types purely from the annotation, which is nicer for users.

Logically, the type annotation of a field should determine which parser is run against that field. If we can implement that constraint in code, then we should. I think our task is simplified by the finite set of input types we have to deal with -- we don't need to handle arbitrary python types, only the types that we are using for zarr metadata.

I have some travel coming up but I will try to find time to push on this.

@normanrz
Contributor

Anyways, I feel like this PR is getting bigger and bigger.

I think this is OK? I want to get this design right, since it feels pretty core to the library. I worry that if we sign off on a design in this PR that is provisional, then we won't have a chance to fix it in the future.

I don't think designs have to be fixed once we merge PRs into v3. Later PRs can change the APIs. On the other hand, it is quite difficult to make progress with large PRs lingering around. For example, I based my #1670 PR on this branch, which is not ideal because then, by default, the PR would end up targeting your fork.

I am not a fan of a parser registry. It makes things more complex for, imo, very little gain. The current approach is much simpler and easier to reason about. Also, I find that most attributes are specific to a particular class. Where they are shared among multiple classes, we can share the function -- simple. Using the Python type system is not sufficient. For example, how would you specify a type that is an integer between -131072 and 22 (zstd level)?

I am not concerned with making it easy to subclass. I don't see real use cases for subclassing the ArrayMetadata. Zarr 3 has explicit extension points (e.g. codecs, chunk grids, data types). We should aim to make those easier to extend with base classes and registries.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

For example, how would you specify a type that is an integer between -131072 and 22 (zstd level)?

I would use typing.Annotated and declare the range check as a validator, like pydantic does. Alternatively, one could define a class that wraps int.
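
For illustration, a minimal sketch of the typing.Annotated idea (not pydantic, and not code from this PR -- Range and parse_annotated are hypothetical names):

from dataclasses import dataclass
from typing import Annotated, Any

@dataclass(frozen=True)
class Range:
    # constraint marker carried inside an Annotated type
    lo: int
    hi: int

    def __call__(self, data: Any) -> int:
        if not isinstance(data, int):
            raise TypeError(f"Expected an int, got {type(data)}")
        if not (self.lo <= data <= self.hi):
            raise ValueError(f"Expected an int in [{self.lo}, {self.hi}], got {data}")
        return data

# e.g. the zstd compression level
ZstdLevel = Annotated[int, Range(-131072, 22)]

def parse_annotated(annotation: Any, data: Any) -> Any:
    # Annotated[...] types store their extra arguments in __metadata__
    for constraint in getattr(annotation, "__metadata__", ()):
        data = constraint(data)
    return data

parse_annotated(ZstdLevel, 3)  # returns 3
# parse_annotated(ZstdLevel, 100)  # raises ValueError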

I am not concerned with making it easy to subclass. I don't see real use cases for subclassing the ArrayMetadata. Zarr 3 has explicit extension points (e.g. codecs, chunk grids, data types). We should aim to make those easier to extend with base classes and registries.

You personally may not be concerned with subclassing ArrayMetadata, but the use cases are real, and bringing up specification-defined extension points indicates that you don't appreciate the goal I am advocating for. My interest in subclassing is not to support extensions of ArrayMetadata, but rather constraints. In the example I gave above, a user wants to define array metadata with constraints, e.g. that the dtype only be uint8. I think this is actually a realistic scenario. Another realistic scenario is a user who wants to define an array or group with constrained attributes. Many communities using Zarr today are working with groups or arrays that depend on structured attributes, e.g. ome-ngff, xarray, anndata, etc. If we think of the Zarr group / array metadata document as a type, then these arrays / groups with structured metadata are subtypes of that broader type; it's pretty natural to represent this type / subtype relationship with inheritance in python, but only if we write an API that facilitates it. The alternative is people re-writing the array metadata in their own library which does support validation of custom attributes (e.g., pydantic-zarr).

All that being said, I do take your point that we don't need to settle this API in this PR, so I'm happy deferring these decisions for later, and I'm happy to merge this PR as-is

@normanrz
Contributor

You personally may not be concerned with subclassing ArrayMetadata, but the use cases are real, and bringing up specification-defined extension points indicates that you don't appreciate the goal I am advocating for.

Maybe I am just not understanding it, sorry!

Many communities using Zarr today are working with groups or arrays that depend on structured attributes, e.g. ome-ngff, xarray, anndata, etc. If we think of the Zarr group / array metadata document as a type, then these arrays / groups with structured metadata are subtypes of that broader type; it's pretty natural to represent this type / subtype relationship with inheritance in python, but only if we write an API that facilitates it. The alternative is people re-writing the array metadata in their own library which does support validation of custom attributes (e.g., pydantic-zarr).

I also thought about the structured attributes use case. I'm not sure the only way to do that is through subclassing. I think composition would also work well. The ArrayMetadata class could have a hook for parsing/validating attributes, e.g. an init arg.
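
For example, something along these lines (a hedged sketch -- the real ArrayMetadata has many more fields, and attributes_parser is a hypothetical name for the hook):

from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping

def parse_attributes_default(data: Mapping[str, Any]) -> Dict[str, Any]:
    # default hook: accept any mapping as-is
    return dict(data)

@dataclass(frozen=True)
class ArrayMetadata:
    attributes: Dict[str, Any]

    def __init__(
        self,
        attributes: Mapping[str, Any],
        attributes_parser: Callable[[Mapping[str, Any]], Dict[str, Any]] = parse_attributes_default,
    ) -> None:
        # the caller-supplied hook validates the attributes before they are stored
        object.__setattr__(self, "attributes", attributes_parser(attributes))

A library built around structured attributes (e.g. ome-ngff tooling) could then pass a parser that, say, requires a "multiscales" key, without subclassing ArrayMetadata.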

I can see the need for subclassing Arrays to implement different behavior. With the ArrayMetadata, I like that it is quite strict and not as easy to customize, except for the well-defined hooks.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

and to be fair, our baseline in zarr-python today is basically no declarative representation of the metadata document, so I think we have already surpassed this considerably. I think we can do much more, but I will leave that stuff for other prs.

@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

any objections to me merging this?

@normanrz
Contributor

any objections to me merging this?

Please go ahead :shipit:

@d-v-b d-v-b merged commit 76c3450 into zarr-developers:v3 Feb 19, 2024
7 checks passed
@d-v-b
Contributor Author

d-v-b commented Feb 19, 2024

thanks for your help here @normanrz!

@d-v-b d-v-b deleted the remove_attrs branch February 19, 2024 22:08
@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 5, 2024