Skip to content

Allow disambiguating union types by unique literal types #273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
phiresky opened this issue Jun 13, 2022 · 9 comments
Open

Allow disambiguating union types by unique literal types #273

phiresky opened this issue Jun 13, 2022 · 9 comments

Comments

@phiresky
Copy link

phiresky commented Jun 13, 2022

Description

Right now, it seems like union types are only supported if (a) the options are all attrs classes or None and (b) the attrs classes have unique attributes.

I use a ton of "tagged unions", so it would be great there was support for those. (They have one attribute that has an unambiguous value). Here's an example:

@attrs.define
class MatchedThing:
    matched: Literal[True] = True
    xxxxxx: str


@attrs.define
class UnmatchedThing:
    matched: Literal[False] = False
    yyyyyy: str


@attrs.define
class MaybeSomething:
    something: MatchedThing | UnmatchedThing

In this example, I'm using a literal boolean but much more common is to use a literal string, for example type: Literal["request", "response", "error"]. Other possibilities are literal integers (covered by the below code) and literal enum values (not sure if those work with the below code)

This pattern is very common in JSON files.

Here's an implementation that seems to work for me (attrs classes only though the same should work with typing.get_type_hints(anything))`:

def create_literal_field_disambiguator_function(*classes: type) -> Callable[..., Any]:
    """Given attr classes, generate a disambiguation function.
    The function is based a common field with unique literal value types."""
    common_attrs: set[str] = set.intersection(
        *(set(at.name for at in attrs.fields(get_origin(cl) or cl)) for cl in classes)
    )

    def check_attr_is_unique(attr: str) -> dict[str, type] | None:
        attr_value_to_class = {}
        for cl in classes:
            field_type = getattr(attrs.fields(get_origin(cl) or cl), attr).type
            if not cattrs.converters.is_literal(field_type):
                # type is not a literal
                return None
            for value in typing.get_args(field_type):
                if value in attr_value_to_class:
                    # this value is ambiguous -> attribute not usable for disambiguation
                    return None
                attr_value_to_class[value] = cl
        return attr_value_to_class

    unique_attr_value_to_class = None
    # try each attribute to see if the types are Literal and the values are unique
    for attr in common_attrs:
        unique_attr_value_to_class = check_attr_is_unique(attr)
        if unique_attr_value_to_class:
            # found unique attribute
            break
    else:
        raise ValueError(
            f"{classes} cannot be disambiguated with a common literal attribute. checked types of common fields: {common_attrs}"
        )

    def dis_func(data: Mapping, type: Any) -> Optional[type]:
        if not isinstance(data, Mapping):
            raise ValueError("Only input mappings are supported.")
        return unique_attr_value_to_class[data[attr]]

    return dis_func

...
def get_converter():
    ...
    converter.register_structure_hook_factory(
        lambda t: cattrs.converters.is_attrs_union(t),
        lambda t: create_literal_field_disambiguator_function(*typing.get_args(t)),
    )
...

It should probably be added to the existing disambiguation.py machinery, but those parts aren't publicly exposed

@Tinche
Copy link
Member

Tinche commented Jun 15, 2022

I think doing something like this would be an awesome feature.

Not sure if I would enable it by default due to backwards compatibility, but it should definitely be easily accessible.

It'd also be great if the tag could come from other places, like the name of the class or something like that.

This will probably be the focus of the release after this one.

@phiresky
Copy link
Author

phiresky commented Jun 20, 2022

Disambiguating by modifying unstructuring to add the class name as a tag also makes a lot of sense and has advantages in having less repetition in the class definition. The way I did it has the advantage that it's translatable 1:1 to and from TypedDict definitions or TypeScript definitions of the same data structures and doesn't require external information (the configuration of cattrs) to understand how ser/deser will work.

My variant is implemented in a way that (should) be backwards compatible since it only disambiguates a case that would otherwise throw an error (if the disambiguation function is added behind the existing unique-property disambiguator).

@Tinche
Copy link
Member

Tinche commented Jun 20, 2022

So I've been thinking about this since we have a lot of these kinds of situations at work.

I think in some situations we can do better than a literal field though. I think that piece of information is unnecessary on the Python level, so the field is just pure overhead. The field should be init=False and frozen, and if you think about it, it should actually be on the class (a ClassVar), not on the instance. Then if you think about it some more, it shouldn't even exist since that data is encoded into the class on the Python level.

Here's a counterproposal.

@attrs.define
class MatchedThing:
    xxxxxx: str


@attrs.define
class UnmatchedThing:
    yyyyyy: str


@attrs.define
class MaybeSomething:
    something: MatchedThing | UnmatchedThing

# Set up the converter to handle MaybeSomething as a tagged union.
configure_tagged_union(
    converter, 
    MaybeSomething, 
    "matched",  # the field name
    {True: MatchedThing, False: UnmatchedThing},  # the mapping of the field name values to classes, and reverse
)

What do you think?

@phiresky
Copy link
Author

phiresky commented Jun 20, 2022

What you're saying makes sense with the disambiguating attribute not really being a class member at all.

I think the core issue is that there's no real standard integrated way to represent tagged unions / sumtypes in python. So that results in different workarounds people use to create them.

The reason I did it as above is that I started from JSON data that I already had and started representing that (without "structuring") as TypedDicts:

class MatchedThing(TypedDict):
    matched: Literal[True] = True
    xxxxxx: str

class UnmatchedThing(TypedDict):
    matched: Literal[False] = False
    yyyyyy: str

And disambiguated them by checking the value if x["matched"] == True

Only later did I move to attrs.

I can think of three issues with your suggestion with the last being the largest:

  1. It removes the possibility of disambiguating by the value (if x.matched == True needs to be replaced with if isinstance(x, MatchedThing). Probably not too much of an issue
  2. You don't have the info about what the serialized value looks like immediately and you can't tell how the disambiguation works without understanding cattrs and looking at the configure_tagged_union() code which is elsewhere than the class definition
  3. MaybeSomething shouldn't be part of the definition imo. I use the same or overlapping union types in many different places, so it shouldn't depend on the container type. Ideally it also shouldn't depend on the other members of the union. For example:
@attrs.define
class MatchedThing:
    matched: Literal[True] = True
    xxxxxx: str


@attrs.define
class UnmatchedThing:
    matched: Literal[False] = False
    yyyyyy: str

@attrs.define
class ThirdThing:
    matched: Literal["specialcase"] = "specialcase"
    zzzzzz: str

MaybeMatched = MatchedThing | UnmatchedThing

@attrs.define
class Foo:
    x: int
    y: int
    something: MaybeMatched

@attrs.define
class Bar:
    greeting: str
    first_thing: MatchedThing
    second_thing: MaybeMatched
    unmatched_thing: Optional[UnmatchedThing]
    different_thing: MaybeMatched | ThirdThing

How would this look with your idea?

@Tinche
Copy link
Member

Tinche commented Jun 20, 2022

You might have noticed cattrs generally puts logic on converters instead of on the models; this is a conscious design decision. The idea is the same model should be reusable in many different contexts/converters. That said, practicality beats purity, so this isn't set in stone ;)

Point 1: I think you'd be using the match statement to process these, or a chain of isinstance() checks on lower Python versions.

Point 2: yeah, you'd have to look at the converter and not the model. But the model could be reused in different contexts and with different union rules.

Point 3: I'm not sure if I agree or not, it's a difficult question. You can't have a union of one type in Python's type system (i.e. Union[MatchedThing] is MatchedThing), and I think serializing the tag (matched) even if the class isn't used in a union would be wasteful.

As for your example, I think it'd work out of the box without support for tagged unions since all your unions have unique required fields, but that's beside the point ;)

Here's your example adjusted:

@define
class MatchedThing:
    xxxxxx: str


@define
class UnmatchedThing:
    yyyyyy: str

@define
class ThirdThing:
    zzzzzz: str

MaybeMatched = MatchedThing | UnmatchedThing

@define
class Foo:
    x: int
    y: int
    something: MaybeMatched

@define
class Bar:
    greeting: str
    first_thing: MatchedThing
    second_thing: MaybeMatched
    unmatched_thing: Optional[UnmatchedThing]
    different_thing: MaybeMatched | ThirdThing

converter = Converter()

configure_tagged_union(
    converter,
    MaybeMatched,
    "matched",
    {True: MatchedThing, False: UnmatchedThing}
)
configure_tagged_union(
    converter,
    MaybeMatched | ThirdThing,
    "matched",
    {True: MatchedThing, False: UnmatchedThing, "specialcase": ThirdThing}
)

I see your point though, there's repetition. But on the other hand, you don't need to prepare the classes to be used in unions, you just need to prepare the converter. Maybe the verboseness could be solved with a hook factory or something. Needs more consideration ;)

@phiresky
Copy link
Author

phiresky commented Jun 23, 2022

Your example is actually better than I thought, because now the second parameter of your configure_tagged_union function is the Union[], instead of an outer attrs class as in your previous example (MaybeSomething). So I think what your suggested solution makes sense now.

Not having the disambiguating member be part of the model might still cause an issue for some use cases. It seems weird that the model serializes differently depending on which union it is a part of. It might be good to look at other languages or libraries to see how they do it. The typedload library for example disambiguates by simply running the structuring variants one after another until one doesn't throw an error. As more "inspiration" the "serde" library of rust uses the type name as the "tag", and is configurable in whether the tag is added inside the object or outside of it. See https://serde.rs/enum-representations.html . It's easier for Rust though because unions are always tagged.

@Tinche
Copy link
Member

Tinche commented Jun 23, 2022

We'll probably need to support both having the tag name inside the class and not, and combinations.

I was playing around with this idea:

@define
class MatchedThing:
    matched: Final[bool] = True

    xxxxxx: str

According to https://peps.python.org/pep-0591/, a final variable defined in the class body with a default should be a class variable. I like this approach since it's clean. But attrs needs to be upgraded to support this; it doesn't yet.

@wbolster
Copy link
Member

wbolster commented Nov 15, 2023

i tried to achieve something very similar, and was about to post a new issue/question, but then i encountered this issue 🙃

i have multiple attrs classes, each with a ‘type’ field annotated as a Literal allowing one or more enum.Enum members.

each attrs class happens to have some unique (required) fields, so i think i am lucky because the standard disambiguation machinery seems to handle this case.

however, while it seems to work, mypy doesn't seem too keen about it for unclear reasons. full example below:

from __future__ import annotations

import enum
from typing import Literal

import attrs
import cattrs


class E(enum.StrEnum):
    a = enum.auto()
    b = enum.auto()


@attrs.define(kw_only=True)
class A:
    type: Literal[E.a]


@attrs.define(kw_only=True)
class B:
    type: Literal[E.b]
    extra: int


d = {"type": "a"}

# Annotating the result is needed, otherwise mypy complains:
# error: Need type annotation for "result"  [var-annotated]
result: A | B

# mypy complains about the .structure() statement below:
# error: Argument 2 has incompatible type "UnionType"; expected "type[Never]" [arg-type]
result = cattrs.structure(d, A | B)

print(result)

thoughts or advice welcome!

@Tinche
Copy link
Member

Tinche commented Nov 15, 2023

@wbolster You're not doing anything wrong, it's just a limitation of the Python typing system.

The method is annotated to accept type[T] and return T, this doesn't work with unions (or protocols, and some other types) in Mypy. (In Pyright it works I think.) TypeForms (so TypeForm[T] instead of type[T]) was an idea put forward to fix the issue, but the PEP never got written up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants