Different behavior in input data between v2.4.2 and v2.5.2 #8349

Closed
1 task done
MarshalX opened this issue Dec 11, 2023 · 2 comments
Labels
bug V2 (Bug related to Pydantic V2), pending (Awaiting a response / confirmation)

Comments

MarshalX commented Dec 11, 2023

Initial Checks

  • I confirm that I'm using Pydantic V2

Description

Hello! After upgrading, my unit tests started failing because the input data is being mutated. I've attached the smallest reproducible example I could. I understand that the problem is that DotDict doesn't copy the input dict, BUT the question is why it works on pydantic v2.4.2 and doesn't work on v2.5.2. I just want to understand what causes this, and whether it is a bug in pydantic or an unplanned change.

to reproduce:

  • pip install pydantic==2.4.2
  • run the script below. it will print "OK"
  • pip install pydantic==2.5.2
  • run the script below again. it will throw AssertionError

Example Code

import re
import typing as t
from copy import deepcopy

import typing_extensions as te
from pydantic import BaseModel, ConfigDict, Field, GetJsonSchemaHandler
from pydantic.json_schema import JsonSchemaValue
from pydantic_core import core_schema


def _convert_camel_case_to_snake_case(string: str) -> str:
    s = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', string)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s).lower()


def _convert_snake_case_to_camel_case(string: str) -> str:
    s = ''.join([w.capitalize() for w in string.split('_')])
    return s[0].lower() + s[1:]


def _is_snake_case(string: str) -> bool:
    return string == _convert_camel_case_to_snake_case(string)


def _convert_to_opposite_case(string: str) -> str:
    if _is_snake_case(string):
        return _convert_snake_case_to_camel_case(string)
    return _convert_camel_case_to_snake_case(string)


class DotDict:
    """Dot notation for dictionaries.

    Note:
        If the record is out of the official lexicon, it's impossible to deserialize it to a proper data model.
        Such models will fall back to dictionaries.
        All unknown "Union" types will also be caught as dicts.
        This class exists to provide an ability to use such fallbacks as “real” data models.

    Example:
        >>> test_data = {'a': 1, 'b': {'c': 2}, 'd': [{'e': 3}, 4, 5]}
        >>> model = DotDict(test_data)
        >>> assert isinstance(model, DotDict)
        >>> assert model.nonExistingField is None
        >>> assert model.a == 1
        >>> assert model['a'] == 1
        >>> assert model['b']['c'] == 2
        >>> assert model.b.c == 2
        >>> assert model.b['c'] == 2
        >>> assert model['b'].c == 2
        >>> assert model.d[0].e == 3
        >>> assert model['d'][0]['e'] == 3
        >>> assert model['d'][0].e == 3
        >>> assert model['d'][1] == 4
        >>> assert model['d'][2] == 5
        >>> model['d'][0]['e'] = 6
        >>> assert model['d'][0]['e'] == 6
        >>> assert DotDict(test_data) == DotDict(test_data)
        >>> assert model.to_dict() == test_data
    """

    def __init__(self, data: dict) -> None:
        self._data = data
        for k, v in self._data.items():
            self.__setitem__(k, v)

    def to_dict(self) -> dict:
        """Unwrap DotDict to Python built-in dict."""
        return deepcopy(self._data)

    def __getitem__(self, item: str) -> t.Optional[t.Any]:
        value = self._data.get(item)
        if value is not None:
            return value

        return self._data.get(_convert_to_opposite_case(item))

    __getattr__ = __getitem__

    def __setitem__(self, key: str, value: t.Any) -> None:
        if key == '_data':
            super().__setattr__(key, value)
            return

        # keep an existing key as it was first stored; new snake_case keys are stored as camelCase to avoid duplicates
        if key not in self._data and _is_snake_case(key):
            key = _convert_snake_case_to_camel_case(key)

        self._data.__setitem__(key, DotDict.__convert(value))

    __setattr__ = __setitem__

    def __eq__(self, other: t.Any) -> bool:
        if isinstance(other, DotDict):
            return self._data == other._data
        if isinstance(other, dict):
            return self._data == other

        raise NotImplementedError

    def __str__(self) -> str:
        return str(self._data)

    def __repr__(self) -> str:
        return repr(self._data)

    def __reduce_ex__(self, protocol: int):
        return getattr(self._data, '__reduce_ex__', None)(protocol)

    def __reduce__(self):
        return getattr(self._data, '__reduce__', None)()

    @staticmethod
    def __convert(obj: t.Any) -> t.Any:
        if isinstance(obj, dict):
            return DotDict(obj)
        if isinstance(obj, list):
            return [DotDict.__convert(v) for v in obj]
        if isinstance(obj, set):
            return {DotDict.__convert(v) for v in obj}
        if isinstance(obj, tuple):
            return tuple(DotDict.__convert(v) for v in obj)
        return obj


class _DotDictPydanticAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(
            cls,
            _source_type: t.Any,
            _handler: t.Callable[[t.Any], core_schema.CoreSchema],
    ) -> core_schema.CoreSchema:
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * dicts will be parsed as `DotDict` instances with the dict as the _data attribute
        * `DotDict` instances will be parsed as `DotDict` instances without any changes
        * Nothing else will pass validation
        * Serialization will always return just a dict
        """

        def validate_from_dict(value: dict) -> DotDict:
            return DotDict(value)

        from_dict_schema = core_schema.chain_schema(
            [
                core_schema.dict_schema(),
                core_schema.no_info_plain_validator_function(validate_from_dict),
            ]
        )

        return core_schema.json_or_python_schema(
            json_schema=from_dict_schema,
            python_schema=core_schema.union_schema(
                [
                    # check if it's an instance first before doing any further work
                    core_schema.is_instance_schema(DotDict),
                    from_dict_schema,
                ]
            ),
            serialization=core_schema.plain_serializer_function_ser_schema(lambda instance: instance.to_dict()),
        )

    @classmethod
    def __get_pydantic_json_schema__(
            cls, _core_schema: core_schema.CoreSchema, handler: GetJsonSchemaHandler
    ) -> JsonSchemaValue:
        # Use the same schema that would be used for `dict`
        return handler(core_schema.dict_schema())


DotDictType = te.Annotated[DotDict, _DotDictPydanticAnnotation]


class ModelBase(BaseModel):
    model_config = ConfigDict(extra='forbid', populate_by_name=True, strict=True)


class BlobRefLink(BaseModel):
    link: str = Field(alias='$link')


class BlobRef(BaseModel):
    model_config = ConfigDict(extra='forbid', populate_by_name=True, strict=True)

    mime_type: str = Field(alias='mimeType')
    size: int
    ref: BlobRefLink
    py_type: te.Literal['blob'] = Field(default='blob', alias='$type')


class Model1(ModelBase):
    blob: BlobRef
    py_type: te.Literal['model1'] = Field(default='model1', alias='$type', frozen=True)


class Model2(ModelBase):
    blob: BlobRef
    py_type: te.Literal['model2'] = Field(default='model2', alias='$type', frozen=True)


class Model3(ModelBase):
    blob: BlobRef
    py_type: te.Literal['model3'] = Field(default='model3', alias='$type', frozen=True)


UnknownRecordTypePydantic = te.Annotated[
    t.Union[
        'Model1',
        'Model2',
        'Model3',
    ],
    Field(discriminator='py_type'),
]
UnknownType: te.TypeAlias = t.Union[UnknownRecordTypePydantic, DotDictType]


class Record(ModelBase):
    value: 'UnknownType'


if __name__ == '__main__':
    Record.model_rebuild()

    test_data = {
        'value': {
            '$type': 'model1',
            'blob': {
                '$type': 'blob',
                'ref': {
                    '$link': 'blabla'
                },
                'mimeType': 'image/png',
                'size': 40930
            },
        }
    }

    instance = Record(**test_data)
    # call again with the same input dict to reproduce the issue
    instance2 = Record(**test_data)

    assert isinstance(instance.value.blob, BlobRef)
    assert isinstance(instance.value.blob.ref, BlobRefLink)

    # assert fails on pydantic >= 2.5.0
    # works fine on pydantic < 2.5.0
    assert isinstance(instance2.value.blob, BlobRef)
    assert isinstance(instance2.value.blob.ref, BlobRefLink)

    print('OK')

Python, Pydantic & OS Version

             pydantic version: 2.5.2
        pydantic-core version: 2.14.5
          pydantic-core build: profile=release pgo=false
                 install path: /Users/ilyasiamionau/Library/Caches/pypoetry/virtualenvs/atproto-iBRicY1L-py3.8/lib/python3.8/site-packages/pydantic
               python version: 3.8.16 (default, Apr 18 2023, 09:49:55)  [Clang 14.0.3 (clang-1403.0.22.14.1)]
                     platform: macOS-14.1.2-arm64-arm-64bit
             related packages: mypy-1.3.0 typing_extensions-4.7.1 pydantic-settings-2.0.3


----------------------------------------------------------------------------------

             pydantic version: 2.4.2
        pydantic-core version: 2.10.1
          pydantic-core build: profile=release pgo=false
                 install path: /Users/ilyasiamionau/Library/Caches/pypoetry/virtualenvs/atproto-iBRicY1L-py3.8/lib/python3.8/site-packages/pydantic
               python version: 3.8.16 (default, Apr 18 2023, 09:49:55)  [Clang 14.0.3 (clang-1403.0.22.14.1)]
                     platform: macOS-14.1.2-arm64-arm-64bit
             related packages: mypy-1.3.0 typing_extensions-4.7.1 pydantic-settings-2.0.3
MarshalX added the bug V2 and pending labels Dec 11, 2023

davidhewitt (Contributor) commented

Thanks for sharing. The mutation is happening in your DotDict validator. If I add some prints, and adjust your __repr__ of DotDict to show it's a DotDict:

        def validate_from_dict(value: dict) -> DotDict:
            print("start", value)
            result = DotDict(value)
            print("result", result)
            print("end", value)
            return result

Then I see that calling DotDict(value) is replacing nested dictionaries inside value with DotDict instances.
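
The same effect can be reproduced without pydantic. The sketch below is only an illustration: `MiniDotDict` is a hypothetical, stripped-down stand-in for DotDict (case conversion and list handling are left out) that keeps a reference to the input dict and wraps nested dicts in place, as the original `__init__` does.

```python
class MiniDotDict:
    """Hypothetical, stripped-down stand-in for DotDict (no case conversion)."""

    def __init__(self, data: dict) -> None:
        # Keeps a reference to the caller's dict instead of copying it.
        self._data = data
        for key, value in list(self._data.items()):
            if isinstance(value, dict):
                # Overwrites the nested dict inside the caller's own dict.
                self._data[key] = MiniDotDict(value)


payload = {'blob': {'size': 40930}}
MiniDotDict(payload)

# The caller's dict no longer contains a plain nested dict.
print(type(payload['blob']).__name__)  # MiniDotDict
```

Once the nested value is no longer a plain dict, a later validation of the same input cannot treat it as one, which matches the failing second `Record(**test_data)` call in the report.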

The reason this only broke with the bump to 2.5 is that the new union behaviour does a little more work than before, to check that DotDict is not a better match than your model instances. On 2.4 the DotDict validation was never run; on 2.5 it is run, which is why the mutation bug shows up.
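
One possible way to make the wrapper safe to run on any input, sketched here on the assumption that copying the data is acceptable, is to copy each level of the dict before wrapping it. `CopyingDotDict` is a hypothetical simplified stand-in, not the class from the issue; it handles only nested dicts, so lists, sets and tuples would need the same treatment in real code.

```python
class CopyingDotDict:
    """Hypothetical simplified stand-in for DotDict that never touches its input."""

    def __init__(self, data: dict) -> None:
        # Shallow-copy this level; nested dicts are copied by the recursive calls.
        self._data = dict(data)
        for key, value in list(self._data.items()):
            if isinstance(value, dict):
                self._data[key] = CopyingDotDict(value)


payload = {'blob': {'ref': {'$link': 'blabla'}}}
first = CopyingDotDict(payload)
second = CopyingDotDict(payload)

# The input still holds plain dicts, so a second validation sees the same data.
print(type(payload['blob']).__name__)  # dict
```

With a change along these lines, it should no longer matter whether the union runs the DotDict branch, because running it leaves the input as it was.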

MarshalX (Author) commented

@davidhewitt So the stricter validation in the new version of pydantic helped me find a bug in my own code. Awesome! Thank you. I guess we can close this.
