Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow serializing coordinates as doubles #2510

Closed
danpol opened this issue Jun 25, 2019 · 7 comments
Closed

Allow serializing coordinates as doubles #2510

danpol opened this issue Jun 25, 2019 · 7 comments
Assignees
Milestone

Comments

@danpol
Copy link

danpol commented Jun 25, 2019

Description:

RDKit loses precision in 3D coordinates on pickling. The code below creates a molecule, adds 3D coordinates. It then compares the coordinates of the original molecule vs molecule from multiprocessing. It seems like RDKit saves float64 into float or similar, as pickling/unpickling twice has no effect compared to single pickle/unpickle.

Although the difference is small, it makes calculations irreproducible when one runs processes on a single thread or multiple threads.

It seems like a bug might occur somewhere around this line: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/MolPickler.cpp#L1385

Example:

from rdkit import Chem
from rdkit.Chem import rdDistGeom
from multiprocessing import Pool
import numpy as np

mol = Chem.MolFromSmiles('CCCOc1cc(Cl)c(Cl)cc1C(O)C1CCCC1')
rdDistGeom.EmbedMolecule(mol)

def g(mol):
    return mol

def compare(mol1, mol2):
    coord1 = mol1.GetConformer(0).GetPositions()
    coord2 = mol2.GetConformer(0).GetPositions()
    return np.abs(coord1 - coord2).max()

# Losing precision during pickling
v1 = g(mol)
v2 = list(Pool(2).map(g, [mol]))[0]
print(compare(v1, v2))

# Repeated pickling does not change coordinates
v3 = list(Pool(2).map(g, [mol]))[0]
print(compare(v2, v3))

This code gives:

1.7723052092577518e-07
0.0
@bp-kelley
Copy link
Contributor

bp-kelley commented Jun 26, 2019 via email

@greglandrum
Copy link
Member

I’ve been thinking about this a bit since the issue was filed and I’m not 100% sure what I think we should do.

First things first though: this isn’t connected to reproducibility. The results should be the same every time you run a particular experiment. They may not be what you expect, but they will be the same.

It would be easy to switch to using doubles, but that’s going to increase the size of the serialized objects substantially. I wonder if it’s worth it.

The other thing we could do is add an option (like the one that controls whether or not properties are pickled). Users who care about having full double precision in the serialized format could then set that option; others wouldn’t have to pay the price.

@bp-kelley
Copy link
Contributor

I think at the minimum, we should have the option to save double precision, so I would vote for that.

If you don't mind considering a conformer a "property" we could add it to the property pickle namespace.

PropertyPickleOptions

@bp-kelley
Copy link
Contributor

We can debate about what the defaults should be, but I'm going to add the option to pickle as doubles. I've run into this issue before when calculating gradients, and it led to a massive internal overhaul of another toolkit, so I'm a bit biased :)

@bp-kelley
Copy link
Contributor

Actually, I would default to "correct" and allow users to switch back to the older representation. We could always compress the coords :)

@bp-kelley bp-kelley self-assigned this Sep 12, 2019
@danpol
Copy link
Author

danpol commented Jan 21, 2020

Hi, @bp-kelley! Did you manage to fix the data type issue?

@bp-kelley
Copy link
Contributor

As a side-note @greglandrum interestingly, the pickleV1 saved them as doubles :)

bp-kelley pushed a commit to bp-kelley/rdkit that referenced this issue Jan 24, 2020
@greglandrum greglandrum added this to the 2020_03_1 milestone Jan 28, 2020
@greglandrum greglandrum changed the title Pickle of Chem.Mol changes conformations' coordinates Allow serializing coordinates as doubles Jan 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants