Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lookup and hash is slow #178

Closed
jensens opened this issue Feb 16, 2020 · 5 comments
Closed

lookup and hash is slow #178

jensens opened this issue Feb 16, 2020 · 5 comments

Comments

@jensens
Copy link
Member

jensens commented Feb 16, 2020

This performance problem bugs me since a while and some optimization was already done by me, primary in InterfaceClass.

I took the long-running and intense component-architecture using Plone tests as a measure here.
I took Py-Spy to get numbers by letting it sample for ~100s. The "Own-Time" of a function or line is here of interest: Where is time wasted?

Two functions always in the top 5 are zope.interface.interface._lookup and zope.interface.interface.InterfaceClass.__hash__. Both together take with:

  • zope.interface 4.7.1 about 17%
  • zope.interface master about 11%

of the test time.

For __hash__ I looked into the code. It is primary in there, because if __eq__ is implemented __hash__ needs also to be implemented. Just, our __eq__ does not use the __hash__ (which is fine.

Our current __hash__ hashes the tuple of name and module. To be sortable it would need to be swapped anyway.

So what if the InterfaceClass.__hash__ uses the very fast object.__hash__?

I tried this and it breaks one test in zope.interface. And I question if this test makes sense at all.
I'll prepare a PR to discuss it.
I used it with the Plone tests and nothing breaks so far.

The combined time of _lookup and __hash__ shrinks to about 5%

I created Plone site and added content, restarted Zope: all fine.

So, its doubles the speed compared to the current state.

For _lookup I have no further ideas to optimize. Maybe by shifting this over to C-code? But I would keep this for a different discussion.

I really appreciate feedback here.

@jamadden
Copy link
Member

jamadden commented Feb 16, 2020

So what if the InterfaceClass.hash uses the very fast object.hash?

I tried this and it breaks one test in zope.interface. And I question if this test makes sense at all.
I'll prepare a PR to discuss it.

If __eq__ (comparison in general) is implemented to do something custom, __hash__ must also be implemented to do that same thing. According to the language spec, equal objects must have equal hashes:

The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple

So simply setting InterfaceClass.__hash__ = object.__hash__ is wrong, with very far reaching consequences, staring with the fact that dicts and sets will be broken.

@jensens
Copy link
Member Author

jensens commented Feb 16, 2020

Well, in fact nothing is broken. I know the language specification and I can not see how in our case __eq__ and __hash__ are working together. The only important thing - that is my understanding of the specification - here is, if the result of __eq__ has a different output, then __hash__ must change too (or vice versa). This is given here, because both are stable: the instances values __eq__ takes to compare never changes, also the hash never changes.

@jamadden
Copy link
Member

jamadden commented Feb 16, 2020

Well, in fact nothing is broken

Yes, it is.

First, let's break things and establish that __eq__ no longer implies __hash__ is also equal:

>>> from zope.interface.interface import InterfaceClass
>>> InterfaceClass.__hash__ = object.__hash__
>>> from zope.interface import Interface
>>> i1 = InterfaceClass('I1', (Interface,), {})
>>> i1_2 = InterfaceClass('I1', (Interface,), {})
>>> i1 == i1_2
True
>>> hash(i1) == hash(i1_2)
False

Lists still work, because they only test __eq__:

>>> l = [i1]
>>> i1 in l
True
>>> i1_2 in l
True
>>> l.remove(i1_2)
None
>>> l
[]

But dict, and any other hash-based container, is badly broken compared to our expectations. Let's look at an object that does the correct thing. I'll use int because that's easy.

>>> int_1 = 234563
>>> int_1_2 = int(str(int_1))
# Not the same object instance, but equal, and same hash code.
>>> int_1 is int_1_2
False
>>> int_1 == int_1_2
True
>>> int_1_2 == int_1
True
>>> d = {int_1: "Yes"}
>>> int_1_2 in d
True
>>> d.pop(int_1_2)
'Yes'

But our interface violates this.

>>> i1 == i1_2
True
>>> d = {i1: "Yes"}
>>> i1 in d
True
# What, no, they're equal, why isn't it there?
>>> i1_2 in d 
False 
>>> d.pop(i1_2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <InterfaceClass __main__.I1>
>>>

It is often the case that the simplest uses of zope.interface will ensure that there is only one InterfaceClass instance with the same __name__ and __module__. But that's far from guaranteed.

@jensens
Copy link
Member Author

jensens commented Feb 16, 2020

It is often the case that the simplest uses of zope.interface will ensure that there is only one InterfaceClass instance with the same name and module. But that's far from guaranteed.

I can not imagine any non-insane use case to have two Interface classes with the same name and module (at the moment this may happen by accident, because of the way we detect __module__ does not consider local defined classes, but that is more like a bug). I would really like to see such a use-case.

We can ensure to not have two of the same with a global and some validating code.

And given that we do not have more than one InterfaceClass instance with same name and module, we can use the object.hash, because then eq and hash always match the same object.

Given no one comes up with a sane use case to have two InterfaceClass instances with same module and name - for performance sake - I would propose to add this breaking change with this kind of singleton-by-module-name-pair InterfaceClass instances.

@jensens
Copy link
Member Author

jensens commented Feb 16, 2020

ok, I kick the idea, I found another micro-optimization, so the influence of __hash__ is minimized to an amount it do not hurt anymore. I better focus on __lookup__ which is more a problem.

@jensens jensens closed this as completed Feb 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants