Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy fails on collections.OrderedDict dataclass with required args #105736

Open
ringohoffman opened this issue Jun 13, 2023 · 8 comments
Open

copy fails on collections.OrderedDict dataclass with required args #105736

ringohoffman opened this issue Jun 13, 2023 · 8 comments
Assignees
Labels
3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ringohoffman
Copy link

ringohoffman commented Jun 13, 2023

Bug report

Related: huggingface/transformers#8978

import collections
import copy
import dataclasses


@dataclasses.dataclass
class ModelOutputDictBase(dict):
    a: int
    b: int = 2

# works
copy.deepcopy(
    ModelOutputDictBase(
        a=1,
        b=2,
    )
)

@dataclasses.dataclass
class ModelOutputAllDefaults(collections.OrderedDict):
    a: int = 1
    b: int = 2

# works
copy.deepcopy(
    ModelOutputAllDefaults(
        a=1,
        b=2,
    )
)

@dataclasses.dataclass
class ModelOutput(collections.OrderedDict):
    a: int
    b: int = 2


# fails
copy.deepcopy(
    ModelOutput(
        a=1,
        b=2,
    )
)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 39
     35     a: int
     36     b: int = 2
---> 39 copy.deepcopy(
     40     ModelOutput(
     41         a=1,
     42         b=2,
     43     )
     44 )

File [...lib/python3.8/copy.py:172), in deepcopy(x, memo, _nil)
    170                 y = x
    171             else:
--> 172                 y = _reconstruct(x, memo, *rv)
    174 # If is its own copy, don't memoize.
    175 if y is not x:

File [.../lib/python3.8/copy.py:264), in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    262 if deep and args:
    263     args = (deepcopy(arg, memo) for arg in args)
--> 264 y = func(*args)
    265 if deep:
    266     memo[id(x)] = y

TypeError: __init__() missing 1 required positional argument: 'a'

Your environment

  • CPython versions tested on: Python 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0] :: Anaconda, Inc. on linux
  • Operating system and architecture: Ubuntu 20.04.6 LTS x86_64

Linked PRs

@ringohoffman ringohoffman added the type-bug An unexpected behavior, bug, or error label Jun 13, 2023
@AlexWaygood AlexWaygood added stdlib Python modules in the Lib dir 3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes labels Jun 13, 2023
@eendebakpt
Copy link
Contributor

@ringohoffman I can confirm the deepcopy (or copy.copy) fails on python 3.11 with the example above. What is the reason for mixing the dataclass decorator and the dict or OrderedDict subclassing?

For the OrderedDict subclassing the __reduce__ on the ModelOutput becomes <method '__reduce__' of 'collections.OrderedDict' objects> which is not handling the copy of the dataclass correctly (in combination with the copy._reconstruct). I do not know whether this combination is supposed to work

@ringohoffman
Copy link
Author

transformers uses this pattern as part of their ModelOutput base class so you can access elements both by attribute and by item: https://github.com/huggingface/transformers/blob/0c3fdccf2f271fb7c44f6ea6e9f4ee234795f2c5/src/transformers/utils/generic.py#L315-L332

dict and its subclasses have special handling by torch in terms of how it handles splitting tensors that are nested elements of dict across threads and processes: https://github.com/pytorch/pytorch/blob/d0ff640ec865dd8c2898b59f1dab689992ed4621/torch/nn/parallel/scatter_gather.py#L50-L51

@sobolevn
Copy link
Member

CC @rhettinger

@rhettinger
Copy link
Contributor

Here's a smaller reproducer:

from pprint import pp
import collections
import copy

class A(dict):
    def __init__(self, a: int, b: int=2):
        self.a = a
        self.b = b

class B(collections.OrderedDict):
    def __init__(self, a: int, b: int=2):
        self.a = a
        self.b = b

pp(A(1, 3).__reduce__())
pp(B(1, 3).__reduce__())

a = copy.copy(A(1, 3))
b = copy.copy(B(1, 3))

@rhettinger rhettinger self-assigned this Jul 9, 2023
@rhettinger
Copy link
Contributor

For copying to work in this case, the __reduce__ method must return a callable that creates an instance but does not run the class's __init__ method since we can't know its signature in advance.

Also, even before copying, the example class isn't functional for the pure python OrderedDict implementation because it overrides __init__ which is needed to set up the underlying data structures. That setup work needs to be moved from __init__ to __new__.

@rhettinger
Copy link
Contributor

@ericsnowcurrently Eric, do you have any thoughts on how we can (and/or whether we should) modify OrderedDict.__reduce__ (in both the Python and C versions) to survive a subclass that overrides __init__, doesn't call the parent __init__, and that has some required arguments?

@ericsnowcurrently
Copy link
Member

I don't have much stake in this, but that situation certainly gives me pause. We're already dealing with a dict subclass, which makes things dicey. Furthermore, subclassing is necessarily cooperative, particularly when __init__() is involved. I'd hesitate to accommodate subclasses that don't cooperate. (FWIW, I would have expected dataclasses to do the right thing in the original post. Maybe it can't and the use must supply the proper __init__()?)

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 21, 2023
…ersion (pythonGH-108098)

(cherry picked from commit 20cc90c)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 21, 2023
…ersion (pythonGH-108098)

(cherry picked from commit 20cc90c)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
serhiy-storchaka pushed a commit that referenced this issue Aug 21, 2023
…version (GH-108098) (GH-108201)

(cherry picked from commit 20cc90c)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
Yhg1s pushed a commit that referenced this issue Aug 21, 2023
…version (GH-108098) (#108200)

gh-105736: Sync pure python version of OrderedDict with the C version (GH-108098)
(cherry picked from commit 20cc90c)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
@hugovk
Copy link
Member

hugovk commented Nov 10, 2023

Is there anything left to do here, or can we close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

7 participants