-
Notifications
You must be signed in to change notification settings - Fork 845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow serializing coordinates as doubles #2510
Comments
You are correct, conformations internally are stored as doubles but are
saved as floats. I think this is a bug at worst and misleading at best.
@greg Landrum <greg.landrum@gmail.com> what do you think?
…On Tue, Jun 25, 2019 at 9:02 AM Daniil Polykovskiy ***@***.***> wrote:
*Description:*
RDKit loses precision in 3D coordinates on pickling. The code below
creates a molecule, adds 3D coordinates. It then compares the coordinates
of the original molecule vs molecule from multiprocessing. It seems like
RDKit saves float64 into float or similar, as pickling/unpickling twice
has no effect compared to single pickle/unpickle.
Although the difference is small, it makes calculations unreproducible
when one runs processes on a single thread or multiple threads.
It seems like a bug might occur somewhere around this line:
https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/MolPickler.cpp#L1385
*Example:*
from rdkit import Chem
from rdkit.Chem import rdDistGeom
from multiprocessing import Pool
import numpy as np
mol = Chem.MolFromSmiles('CCCOc1cc(Cl)c(Cl)cc1C(O)C1CCCC1')
rdDistGeom.EmbedMolecule(mol)
def g(mol):
return mol
def compare(mol1, mol2):
coord1 = mol1.GetConformer(0).GetPositions()
coord2 = mol2.GetConformer(0).GetPositions()
return np.abs(coord1 - coord2).max()
# Losing precision during pickling
v1 = g(mol)
v2 = list(Pool(2).map(g, [mol]))[0]
print(compare(v1, v2))
# Repeated pickling does not change coordinates
v3 = list(Pool(2).map(g, [mol]))[0]
print(compare(v2, v3))
This code gives:
1.7723052092577518e-07
0.0
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#2510?email_source=notifications&email_token=ACLOFIC6U5MCW4G57QCPEADP4IJNTA5CNFSM4H3ICDG2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G3RKDUA>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACLOFIALMLYYCWHIFSC4VULP4IJNTANCNFSM4H3ICDGQ>
.
|
I’ve been thinking about this a bit since the issue was filed and I’m not 100% sure what I think we should do. First things first though: this isn’t connected to reproducibility. The results should be the same every time you run a particular experiment. They may not be what you expect, but they will be the same. It would be easy to switch to using doubles, but that’s going to increase the size of the serialized objects substantially. I wonder if it’s worth it. The other thing we could do is add an option (like the one that controls whether or not properties are pickled). Users who care about having full double precision in the serialized format could then set that option; others wouldn’t have to pay the price. |
I think at the minimum, we should have the option to save double precision, so I would vote for that. If you don't mind considering a conformer a "property" we could add it to the property pickle namespace. PropertyPickleOptions |
We can debate about what the defaults should be, but I'm going to add the option to pickle as doubles. I've run into this issue before when calculating gradients, and it led to a massive internal overhaul of another toolkit, so I'm a bit biased :) |
Actually, I would default to "correct" and allow users to switch back to the older representation. We could always compress the coords :) |
Hi, @bp-kelley! Did you manage to fix the data type issue? |
As a side-note @greglandrum interestingly, the pickleV1 saved them as doubles :) |
Description:
RDKit loses precision in 3D coordinates on pickling. The code below creates a molecule, adds 3D coordinates. It then compares the coordinates of the original molecule vs molecule from multiprocessing. It seems like RDKit saves
float64
intofloat
or similar, as pickling/unpickling twice has no effect compared to single pickle/unpickle.Although the difference is small, it makes calculations irreproducible when one runs processes on a single thread or multiple threads.
It seems like a bug might occur somewhere around this line: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/MolPickler.cpp#L1385
Example:
This code gives:
The text was updated successfully, but these errors were encountered: