Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full pickle support #242

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

peternowee
Copy link
Member

@peternowee peternowee commented Oct 26, 2020

This branch aims to provide full support for pickle serialization of pydot objects. It should fix issues #127, #216 and #217 and is aimed for pydot 2.0.0.

An overview of the commits (original commit hashes before rebasing between parentheses):

  • First, we bind our dynamically defined methods (get_*/set_*/create_*/write_*) to the class instead of the instance. This prevents that pickle tries and fails to pickle them. It also brings performance improvements to pydot as a whole.
  • Then, we remove the limitation of only pickling the obj_dict attribute of an object dictionary. This will ensure that the entire object dictionary, including prog and shape_files, gets pickled.

See the commit messages for details and side notes.

Reviews are welcome, also of the commit messages. Neither the code, nor the commit messages are set in stone yet.

About the pylint errors mentioned in 4b75945 (57a8adc) and 5bd4a38 (9cb6e0d), I welcome opinions not only on if they deserve fixing or not, but also on the question if, when we explicitely decide to ignore a certain pylint error, people appreciate the use of pragmas (# pylint: disable= comments), or think that these just mess up the code. Of course, if an error can be fixed that is preferred, but the list is quite long at the moment and I don't know how many false positives will be left in the end. I know there are alternative linters, but then they question remains (e.g. do we want to see # noqa pragmas for flake8, etc.).

Other comments are also welcome of course. I won't merge this PR for another 2 weeks at least.

Each individual commit (before rebasing) passed the test suite on Python 3.7.3 and 2.7.16, Debian buster 10.6 amd64, PyParsing 2.4.7 and Graphviz 2.40.1 (20161225.0304).

These tests should catch copy-paste bugs, because there is inherently
some code duplication involved in dynamically defining from multiple
template functions. I tested the tests using 10 examples of mix-ups and
all were caught as either `FAIL`, `ERROR` or failure in `import pydot`.

(This commit is part of a series of changes in how the `get_*`,
`set_*`, `create_*` and `write_*` methods are dynamically created.)
This small refactoring prepares for a later commit that will switch to
binding the methods to the class instead of to the instance. The
instance attribute `prog` will not be available then anymore, because
no instance exists yet at the time the class is created. Therefore, the
reference is now replaced by the string `'dot'`.

Note that this does not actually change the default value at all. It
always got fixed to the value of `prog` at the time of function
definition, which was always `'dot'`, as can be seen from line 1713.
The default value already never changed along with the instance
attribute value anymore after function definition.

Side note:

- One could argue that `None` would make a better default value here.
  However, changing that now would be changing the external behavior of
  the methods themselves, while this series aims to change the way in
  which they are created with minimal external effects. I plan to
  propose a switch to `None` together with other changes to
  `create_*`/`write_*` in another series, though.

(This commit is part of a series of changes in how the `get_*`,
`set_*`, `create_*` and `write_*` methods are dynamically created.)
Not seeing any customizations that necessitate the use of the magic
method `__setattr__()` here, so using `setattr()` should be equivalent.
[[1]] [[2]]

Timeit is reporting 20 to 80us reductions in object instantation times,
all 20% drops relative to pydot 1.4.1.

No other changes in external behavior are expected.

[1]: https://stackoverflow.com/questions/14756289/setattrobject-name-value-vs-object-setattr-name-value
[2]: https://stackoverflow.com/questions/7559170/whats-the-difference-between-setattr-and-object-setattr

(This commit is part of a series of changes in how the `get_*`,
`set_*`, `create_*` and `write_*` methods are dynamically created.)
Replace lambdas with function definitions. In a later commit, we will
switch to binding the function as a method to the class instead of to
the instance. In that process, we will be changing some of its
attributes, so then the function object needs to be assigned to a name.
A regular function `def` takes care of that assignment already.

Some other reasons:

- Using lambdas as class/instance methods is frowned upon. Error code
  E731 in flake8, for example. [[1]] [[2]]
- Easier to add docstrings to a function later. [[3]]
- Lambdas cannot contain statements. Here this stands in the way of
  replacing `__setitem__()` with a regular assignment later.

Lambdas return functions as well, so no changes in behavior are
expected [[4]] [[5]]. Timeit is not showing any significant changes in
performance.

The use of the same name (`func`) for these two short-lived functions
triggers pylint error `E0102`/`function-redefined`. Fixing this by
giving the functions two different names might actually increase the
risk of copy-paste bugs, while the redefining of the functions actually
still continues because we are in a loop. It is just that pylint will
not see it anymore then. I plan to discuss how this pylint error should
be addressed in the pull request containing this commit. For now, a
pragma is added to suppress it.

[1]: https://realpython.com/python-lambda/#python-classes
[2]: https://pycodestyle.pycqa.org/en/latest/intro.html#error-codes
[3]: pydot@f374f66
[4]: https://stackoverflow.com/questions/12264834/what-is-the-difference-for-python-between-lambda-and-regular-function
[5]: https://stackoverflow.com/questions/44840101/in-python-lambda-expression-is-fast-or-the-normal-function

(This commit is part of a series of changes in how the `get_*`,
`set_*`, `create_*` and `write_*` methods are dynamically created.)
We now bind the dynamically created methods to their class instead of
generating them for each instance separately. Now, they should get
created only once, at module load, and from then on get the same
treatment as regularly defined methods.

This should lead to better support for serialization by pickle, by
ensuring that the methods are an integral part of the class, and
thereby always available to instances, also after unpickling.

Furthermore, this brings significant speedups and reduced memory usage
for pydot objects.

Notes on pickle:

- This commit prevents errors after unpickling like the following:

      AttributeError: 'Dot' object has no attribute 'write_png'
      AttributeError: 'Dot' object has no attribute 'get_bgcolor'

- The alternative of recreating the methods during unpickling by
  letting `__setstate__` or `__reduce__` call `__init__()` again was
  not chosen because it seems fundamentally wrong [[1]] [[2]] [[3]].

- The other alternative of serializing the instance methods along with
  the rest of the instance was not chosen, because:

  - It also feels fundamentally wrong. Why serialize methods that
    should never differ between instances to begin with?

  - Pickle refuses to work on the instance-bound methods, even after
    lambdas are changed to functions. This could be seen by temporarily
    removing `__getstate__`:

        AttributeError: Can't pickle local object 'Dot.__init__.<locals>.new_method'
        AttributeError: Can't pickle local object 'Common.create_attribute_methods.<locals>.func'

  - Even if an alternative to `pickle` like `dill` is used, the
    serializing of the methods means slower pickling and increased size
    of the representations.

- Moving these methods from the instance to the class also eliminates
  the need for the current custom `__getstate__` and `__setstate__`
  methods. They will be removed in a later commit.

[1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances
[2]: https://stackoverflow.com/questions/50308214/python-3-alternatives-for-getinitargs#comment87633434_50308545
[3]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_init_isnt_called

Notes on performance:

- Module load time increased between 1.5 to 2.0ms, which is only 2% on
  the load time reported by bash's `time`, but 60% on the `self` for
  `pydot` reported by `python -X importtime`.
- Time saved on each object instantiation, according to `timeit`: 0.1
  to 0.3ms, depending on the class, but for all it translates into a
  further 74 to 78% drop relative to pydot 1.4.1.
- The more objects in a graph, the sooner there will be a net time
  saving. I estimate the break-even point to be around 5 to 15 objects.
  The speedups are most noticeable when building graphs from the ground
  up programmatically, like `test_names_of_a_thousand_nodes` from the
  test suite, which cuts its run time by almost 300ms or 92%.
- Memory savings reported by `pympler.asizeof()` show the sizes of Dot,
  Edge and Node instances are cut by 20, 9 and 7 KiB respectively,
  which translates to 87 to 90% reductions.
- Measured using: Celeron N3350, Linux 4.19 amd64, Debian 10.6, CPython
  3.7.3, timeit, -X importtime, dill 0.3.2, Pympler 0.9.

Further notes:

- As part of the dynamic creation of each method, its function is
  now also renamed (`__name__` and `__qualname__`) to the name it would
  have had if it had been defined in the class itself. This prevents
  that a temporary, generic name shows up in tracebacks.

- The binding to the class is performed from class decorators. Some
  alternatives that were considered:
  - Metaclasses: Got it to work, but more complex than class decorators
    and may cause metaclass conflicts for the subclassing user because
    of their inheritance rules. Many authors suggest to use decorators
    instead of metaclasses where possible. [[4]] [[5]]
  - `__init_subclass__`: Got it to work from class `Common`, but the
    problem is that we are not customizing all its subclasses in the
    same way: Subclass `Dot` needs output format methods, some other
    subclasses need DOT attribute getters/setters, and some others do
    not need anything at all. This can be solved by adding arguments to
    the `__init_subclass__` signature and/or inspection of the
    subclass, plus the necessary logic to switch between the different
    cases then. Or, to prevent that, by overhauling the class hierarchy
    and adding some new classes between `Common` and the subclasses. In
    both cases, class decorators seem a lot simpler in the end.

[4]: https://www.python.org/dev/peps/pep-3129/#rationale
[5]: https://realpython.com/python-metaclasses/#is-this-really-necessary

- `create_attributes_methods()`, which creates the `get_*`/`set_*`
  methods, is transformed to a decorator factory: It now takes only the
  set of DOT attributes and returns the true decorator function that
  will add exactly that set to the class. It can be used as a
  parameterized class decorator [[6]] [[7]] or from a metaclass.
  - Since it does not need to be inherited by subclasses anymore, it is
    moved from class `Common` to module-level.
  - An alternative to using a parameterized decorator was to let the
    decorator determine which set of DOT attributes to apply based on
    inspection of the class. But switching based on the class name
    string seems fragile. Switching by type is difficult for custom
    subclasses because `issubclass()` does not work when classes are
    still being created. Reading the set from a class attribute can
    work, as `create_format_methods()` shows, though I wonder if it
    will work from a metaclass `__new__()` for example, which runs much
    earlier in the class creation process. Also, it raises questions
    about the meaning of such attribute in subclasses that already
    inherit all the methods created for their base class. For example,
    `Cluster` needs to override the set with its own while `Subgraph`
    keeps the set of its parent. Also, the sets of DOT attributes are
    currently kept as global constants and would then either have to be
    moved to the classes (API change) or kept in two places. So, in
    this case, the parameterized solution seemed to require the least
    modifications and provide an easy way to pass custom DOT attributes
    for further subclassing.

[6]: https://stackoverflow.com/questions/681953/how-to-decorate-a-class/44556596#44556596
[7]: https://stackoverflow.com/questions/5929107/decorators-with-parameters

(This commit is part of a series of changes in how the `get_*`,
`set_*`, `create_*` and `write_*` methods are dynamically created.)
This commit wraps up a series of changes aiming to provide full support
for pickle serialization of pydot objects.

The last few commits ensured that dynamically defined methods are bound
to their classes instead of their instances. With that in place, there
is no more need to limit serialization of objects dictionaries to only
their `obj_dict` attribute. Therefore, this commit removes the custom
`__getstate__` and `__setstate__` methods. This makes pickle return to
its normal behavior of serializing the entire dictionary, which means
that an unpickled object will now have all the same attributes as its
original.

This resolves errors like this from unpickled objects:

    AttributeError: 'Dot' object has no attribute 'prog'
    AttributeError: 'Dot' object has no attribute 'shape_files'

Together with the last few commits, issues like pydot#127,
pydot#216 and pydot#217 should now be fixed for pickle and
similar tools like dill.

Background:

Git history shows that in the Initial import of 2007 (pydot 0.9.10,
commit c0233c6), `__getstate__` and `__setstate__` selected the entire
instance's dictionary and from that deleted the convenience methods. In
2010 (between pydot 1.0.2 and 1.0.4, commit b125c5d) this was changed
to selecting only the `obj_dict` attribute of the instance's
dictionary, which did not include the convenience methods to start
with. That change was obviously related to issue pydot#16 that
was closed a few days before it. It seems that the idea was to move
towards keeping all instance-related data under a single `obj_dict`
attribute and not have any other important data in attributes directly
on the instance.

However, currently, a newly created `Dot` instance contains, besides
the `obj_dict`, the attributes `prog` and `shape_files`. Both are set
during instantiation (`Dot.__init__()`), and as such will normally not
be recreated on unpickling [[1]]. Still, they seem important enough
that they should actually get pickled, which is what this commit
accomplishes. The `prog` attribute is just a short string, and
`shape_files` just a list of file paths, so it seems unlikely there
could ever have been any real concerns about pickling them. Instances
of `Graph`, `Subgraph`, `Cluster`, `Node` and `Edge` do not have
dictionary entries other than `obj_dict`, at least not on
instantiation.

[1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances

Some examples sizes for a `Dot` object with 1 named node serialized by
dill, an alternative to pickle:

- Before the previous commit, when instances still had all those
  convenience methods bound to them, but `__getstate__` would only
  pickle `obj_dict`: 351 bytes.
- Same, but then what happens when `__getstate__` would not exist or a
  `__reduce__` method is added that does not call `__getstate__` and
  the entire object dictionary gets pickled: 22986 bytes. That shows
  the weight of all those methods that get packed then.
- With this commit, now that the previous commit removed all those
  methods and we can safely start pickling the entire dictionary again:
  413 bytes. The extra 62 bytes compared to the first case reflect the
  object attributes `prog` and `shapefiles` that get pickled now as
  well.

(Measured using `len(dill.dumps(g))`. Dill was used because pickle was
not able to pickle pydot objects before. It can now. For the final
result, I checked that pickle gave the same result as dill.)

Further notes:

- The ChangeLog will mention that pickled objects created with the
  previous code are incompatible with the new code and vice versa. This
  is not unheard of with pickle in general [[2]]. Class versioning and
  conversion [[3]] could solve this, but my guess is that most use
  cases have no need for this. Otherwise please report or provide a
  patch.

  [2]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_old_pickles_look_like_old_code
  [3]: https://docs.python.org/3.9/library/pickle.html#what-can-be-pickled-and-unpickled

- Because the Python documentation lists `__getstate__` and
  `__setstate__` only in relation with pickle [[4]], I expect that this
  change will have not have any side effects on other uses of pydot.

  [4]: https://docs.python.org/3.9/genindex-_.html

- This commit will cause Pylint 2.6.0 to emit 9 errors like:

      pydot.py:520:19: E1101: Instance of 'Common' has no 'obj_dict' member (no-member)

  Actually, this reflects an existing situation not caused by this
  commit. Pylint just did not report it as strongly before, because it
  saw `obj_dict` get assigned to from `__setstate__`. Before this
  commit, there was only this 1 warning:

      pydot.py:431:8: W0201: Attribute 'obj_dict' defined outside __init__ (attribute-defined-outside-init)

  I considered adding an `__init__()` method to class `Common` just to
  set `obj_dict` to an empty dictionary, but decided it would not be
  good to do that just for Pylint. None of the subclasses currently
  call `super().__init__()` anyway and the `Common` docstring clearly
  states that it should not be used directly.

  There seems to be room for unifying the `__init__()` methods of the
  various classes and some deduplication to `Common.__init__()`. That
  would bring a meaningful resolution to these Pylint errors. However,
  such changes deserve to be discussed separately.

- Pickle-related tests were updated and extended. Notes:

  - The main pickle tests are now performed on a `Dot` instance instead
    of on a `Graph`, because `Dot` is the only class that has both the
    dynamically generated `create_*`/`write_*` and `get_*`/`set_*`
    methods.

  - Regarding the removal of the use of a tuple in the initialization
    of one of the edges: Support for that was removed in 2016 (pydot
    1.2.4, commit de30c14) and it would lead to a `TypeError` now that
    we start requesting a string representation. If tuple support is
    ever restored again, it should have its own test anyway.
@peternowee
Copy link
Member Author

Rebased on top of pydot 2.0.0.dev0 commit e48d90c, which is basically pydot 1.4.2 except for the version number change.

@sebastiandohnany
Copy link

Hi! Please work on merging this to master. Parallelisation is so important with Pydot! It cuts the time significantly.

@peternowee
Copy link
Member Author

@sebastiandohnany Did you test this branch? Any remarks?

@sebastiandohnany
Copy link

@sebastiandohnany Did you test this branch? Any remarks?

I haven't done any proper testing, that is why I just wrote a message of encouragement – I can try to look at it more extensively later. But for all I know, everything I'm using pydot for (traversing graph, changing properties, etc.) works + usage with multiprocessing is smooth too. As I'm working on animations (uwaicl/epsilon-animate), this cuts my running times greatly, so thank you.

@peternowee
Copy link
Member Author

@sebastiandohnany

this cuts my running times greatly

With "this" you mean working with this branch, right? That is what I meant to ask. Did not run into any problems?

@peternowee
Copy link
Member Author

@sebastiandohnany Reading you last reply again, it seems that you really tried out this branch, so thanks for letting me know your results! About the schedule for merging to master: Because we are now at the start of the pydot 2.0 development cycle, I first wanted to merge the stuff that messes up git blame most, like formatting (done), dropping support for Python 2 (hopefully end of next week) and a module/package restructuring (#230, PR in the making). After that, maybe I can rebase and merge this one, but there are also some other people waiting for a response from me. Plus I am still looking for work, so I cannot really put a time frame on anything right now. Hope you understand.

Copy link

This pull request has conflicts, please resolve those so that the changes can be evaluated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants