Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation: Writing persistent objects: Be careful about __eq__ and __hash__ #106

Closed
jamadden opened this issue Sep 10, 2016 · 8 comments

Comments

@jamadden
Copy link
Member

In general, objects without custom __eq__ and __hash__ objects are going to be friendlier on the DB and the cache (if you can get away with identity semantics).

Suppose you have two classes:

import uuid

class WithHash(Persistent):
   def __init__(self):
      self.id = uuid.uuid()
   def __eq__(self, other):
      return self.id == other.id
   def __hash__(self):
      return hash(self.id)

class WithoutHash(Persistent):
    def __init__(self):
       self.id = uuid.uuid()

Now if you had some dictionaries using those classes as keys:

conn = db.open()
conn.root.with_hash_dict = {WithHash(): i for i in range(5000)}
conn.root.wout_hash_dict = {WithoutHash(): i for i in range(5000)}

Doing something like:

conn = db.open()
len(conn.root.with_hash_dict)

is going to unghost 5000 WithHash objects, whereas

conn = db.open()
len(conn.root.wout_hash_dict)

isn't going to unghost any objects (because unpickling a dictionary re-hashes all the keys, and hash() looks up __hash__ on the class not the instance, so if the __hash__ method doesn't access any attributes---like the default one---nothing has to be unghosted) .

Even if your object cache is sized appropriately, large-ish dictionaries can take a long time to unpickle when accessed for the first time in a particular connection, adding lots of load to the DB and/or cache system; the same happens when creating a dictionary in memory for the first time of such persistent objects.

If you can accept identity semantics (and for persistent objects, you surprisingly often can), it's better to avoid custom __eq__ and __hash__ methods if you'll ever be creating dictionaries or sets of your persistent objects.

This is a lesson we learned the hard way; coming from a Java background almost all of our objects defined custom __eq__ and __hash__ methods, and that was fine until we started to get a lot of objects, when it became a performance burden. Now it turns out that many such of those objects don't need these methods.

Worth adding to the docs?

@tseaver
Copy link
Member

tseaver commented Sep 10, 2016

SGTM

@jimfulton
Copy link
Member

On Sat, Sep 10, 2016 at 1:08 PM, Jason Madden notifications@github.com
wrote:

In general, objects without custom eq and hash objects are going
to be friendlier on the DB and the cache (if you can get away with identity
semantics).

Suppose you have two classes:

import uuid
class WithHash(Persistent):
def init(self):
self.id = uuid.uuid()
def eq(self, other):
return self.id == other.id
def hash(self):
return hash(self.id)
class WithoutHash(Persistent):
def init(self):
self.id = uuid.uuid()

Now if you had some dictionaries using those classes as keys:

conn = db.open()
conn.root.with_hash_dict = {WithHash(): i for i in range(5000)}
conn.root.wout_hash_dict = {WithoutHash(): i for i in range(5000)}

Doing something like:

conn = db.open()len(conn.root.with_hash_dict)

is going to unghost 5000 WithHash objects, whereas

conn = db.open()len(conn.root.wout_hash_dict)

isn't going to unghost any objects (because unpickling a dictionary
re-hashes all the keys, and hash() looks up hash on the class not the
instance, so if the hash method doesn't access any attributes---like
the default one---nothing has to be unghosted) .

Even if your object cache is sized appropriately, large-ish dictionaries
can take a long time to unpickle when accessed for the first time in a
particular connection, adding lots of load to the DB and/or cache system;
the same happens when creating a dictionary in memory for the first time of
such persistent objects.

If you can accept identity semantics (and for persistent objects, you
surprisingly often can), it's better to avoid custom eq and hash
methods if you'll ever be creating dictionaries or sets of your persistent
objects.

This is a lesson we learned the hard way; coming from a Java background
almost all of our objects defined custom eq and hash methods, and
that was fine until we started to get a lot of objects, when it became a
performance burden. Now it turns out that many such of those objects don't
need these methods.

Worth adding to the docs?

Probably, but I'm not sure where. It's a bit obscure.

Maybe in "Other things you can do, but shouldn't".

Jim

Jim Fulton
http://jimfulton.info

@jamadden
Copy link
Member Author

Maybe in "Other things you can do, but shouldn't".

As in, you can add custom __hash__ but you maybe shouldn't?

@jimfulton
Copy link
Member

Yup.

It occurs to me that it might be nice to have mix-in classes (or maybe just one) that implements identity-based hash and comparison based on OIDs. In the past, I wanted PxBTrees, but maybe IdentityHashablePersistent and IdentyComparablePersistent (or maybe just the later and maybe with better names :))

@jamadden
Copy link
Member Author

Ok, I'll write this up and submit a PR. (I think it could also use something about implementing comparable methods to be used in a BTree, at least a pointer.)

Those sound like pretty good mixin classes. What would do you do before the object is assigned an OID though? Use its id?

@jimfulton
Copy link
Member

I would error. Using it's id would be a disaster.

@jamadden jamadden self-assigned this Sep 12, 2016
@jamadden
Copy link
Member Author

I had cause to remember about zope.keyreference today, which implements something remarkably similar

@jamadden
Copy link
Member Author

Opened PR #118 for this.

jimfulton added a commit that referenced this issue Sep 17, 2016
Add a section on the pitfalls of __eq__/__hash__. Fixes #106.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants