-
Notifications
You must be signed in to change notification settings - Fork 845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Chemical Markup Language, CML, for writing #3024
Conversation
Later I will implement CMLReader as well if I have time. |
@e-kwsm , I could see that having a reader and writer for CML would be useful and really appreciate the contribution, but we are planning the feature freeze for the 2020.03 release tomorrow (Monday). Getting this reviewed and merged in time will be tricky. |
Test is not implemented, so it should be postponed. |
Now tests are implemented and ready for review. |
CI failure is due to timeout.
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.AddHs(Chem.MolFromSmiles('c1ccccc1/C=C/C#N')) # cinnamonitrile
AllChem.EmbedMolecule(mol, randomSeed=0)
r = AllChem.MMFFOptimizeMolecule(mol)
if r != 0:
raise RuntimeError('not converged')
def num_total_hs(mol):
print([a.GetTotalNumHs() for a in mol.GetAtoms()])
num_total_hs(mol)
num_total_hs(Chem.MolFromMolBlock(Chem.MolToMolBlock(mol)))
num_total_hs(Chem.MolFromMolBlock(Chem.MolToMolBlock(mol), removeHs=False))
num_total_hs(Chem.MolFromTPLBlock(Chem.MolToTPLBlock(mol))) prints
|
ddcae19
to
fef7708
Compare
8bc9c0e
to
d2c0b82
Compare
@greglandrum - It might be time to review this now. |
@bp-kelley : agreed |
@bp-kelley @greglandrum @ptosco This one has been sitting for a while, so I'm poking it again. |
|
||
const auto n_rad_es = a->getNumRadicalElectrons(); | ||
mol_num_radical_electrons += n_rad_es; | ||
atom.put("<xmlattr>.spinMultiplicity", n_rad_es + 1u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right.
paired electrons = multiplicity = 1
one unpaired electron == spin multiplicity = 2
two unpaired electron == spin multiplicity can be 1 or 3 depending on opposite or same spine
Is this doing something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct.
There is no spin information, so spinMultiplicity
attribute is added only in the case of zero or one radical electron.
5360b07
to
b0dccc6
Compare
56bdcd4
to
d3c4bfc
Compare
Ready to merge. |
Thanks! I will review in the next couple of days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numerous comments and suggestions here.
@@ -738,6 +738,58 @@ M END | |||
} | |||
} | |||
|
|||
TEST_CASE("CML writer", "[CML][writer]") { | |||
SECTION("basics") { | |||
std::unique_ptr<RWMol> mol{new RWMol{}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly not necessary for the merge, but you could perhaps make this a lot simpler by just constructing the molecule from a mol block:
hydrogen cyanide and deuterated hydroxide
OpenBabel09112017493D
5 3 0 0 0 0 0 0 0 0999 V2000
-1.0650 0.0000 0.0000 H 0 0 0 0 0 1 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 4 0 0 0 0 0 0
1.1600 0.0000 0.0000 N 0 0 0 0 0 3 0 0 0 0 0 0
0.0000 0.0000 2.0000 O 0 5 0 0 0 1 0 0 0 0 0 0
0.0000 0.9400 2.0000 H 0 0 0 0 0 1 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 3 0 0 0 0
4 5 1 0 0 0 0
M RAD 5 1 1 2 1 3 1 4 1 5 1
M ISO 1 5 2
M CHG 1 4 -1
M END
@@ -160,6 +160,19 @@ inline void MolToV3KMolFile(const ROMol &mol, const std::string &fName, | |||
MolToMolFile(mol, fName, includeStereo, confId, kekulize, true); | |||
} | |||
|
|||
RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have an overload like this for any of the other writers.
That doesn't make it wrong, but I'm curious why you think it's necessary for the CML writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was suggested by @bp-kelley (#3024 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, then the question is for @bp-kelley: why did you suggest adding a new overload for this writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to remove duplicated code from the previous functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But does it need to be in the public API? I’m pushing because we don’t have an equivalent for any of the other readers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that’s just where the code was originally. These could all be in a cpp file.
if (kekulize) { | ||
MolOps::Kekulize(rwmol); | ||
} | ||
const auto& conf = rwmol.getConformer(confId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the atomic coordinates are optional in CML.
Given that, I think it makes sense if we make writing the conformer optional too.
The easiest way to do this (I think) is to use a pointer to the active conformer which is null if the molecule doesn't have a conformer.
Below you would then check if the pointer is null before writing the atomic coords.
|
||
void MolToCMLBlock(std::ostream& os, const ROMol& mol, int confId, | ||
bool kekulize) { | ||
auto pt = molToPTree(mol, confId, kekulize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea to use the boost property tree to deal with the XML parts.
af1ddee
to
b191127
Compare
Implement RDKit::MolToCMLBlock and RDKit::MolToCMLFile http://www.xml-cml.org/
std::vector<RDGeom::Point3D> xyzs; | ||
std::vector<unsigned> neighbors; | ||
for (auto nbri : boost::make_iterator_range(mol.getAtomBonds(a))) { | ||
const auto* const b = mol[nbri]; | ||
const auto o = b->getOtherAtom(a)->getIdx(); | ||
neighbors.push_back(o); | ||
xyzs.push_back(conf.getAtomPos(o)); | ||
} | ||
|
||
if (neighbors.size() != 4u) { | ||
continue; | ||
} | ||
|
||
/* atomParity = sign(det([[1, 1, 1, 1], | ||
[x0, x1, x2, x3], | ||
[y0, y1, y2, y3], | ||
[z0, z1, z2, z3]])) | ||
*/ | ||
const auto det = determinant(xyzs[1u].x, xyzs[2u].x, xyzs[3u].x, // | ||
xyzs[1u].y, xyzs[2u].y, xyzs[3u].y, // | ||
xyzs[1u].z, xyzs[2u].z, xyzs[3u].z) - | ||
determinant(xyzs[0u].x, xyzs[2u].x, xyzs[3u].x, // | ||
xyzs[0u].y, xyzs[2u].y, xyzs[3u].y, // | ||
xyzs[0u].z, xyzs[2u].z, xyzs[3u].z) + | ||
determinant(xyzs[0u].x, xyzs[1u].x, xyzs[3u].x, // | ||
xyzs[0u].y, xyzs[1u].y, xyzs[3u].y, // | ||
xyzs[0u].z, xyzs[1u].z, xyzs[3u].z) - | ||
determinant(xyzs[0u].x, xyzs[1u].x, xyzs[2u].x, // | ||
xyzs[0u].y, xyzs[1u].y, xyzs[2u].y, // | ||
xyzs[0u].z, xyzs[1u].z, xyzs[2u].z); | ||
if (det == 0.0) { | ||
continue; | ||
} | ||
|
||
auto& atomParity = atom.add("cml:atomParity", det > 0.0 ? 1 : -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a simpler solution to this problem. I will do a PR against your branch with a suggestion.
1. stop writing the cml: namespace to the output. Tools like marvin can't read that and it's not nececssary 2. fix the H count 3. continue writing atom info for 2D confs 4. simplify the writing of atom parity
write bond hash/wedge information
some suggested changes
RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol, | ||
int confId = -1, | ||
bool kekulize = true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please just remove this from the public API. It's fine to have the function in CMLWriter.cpp
and use it there, but there's no need to expose it more broadly.
RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol, | |
int confId = -1, | |
bool kekulize = true); |
@e-kwsm Once that last change I requested (to remove the extra function from the public API) and the CI tests pass I will be happy to merge this one. |
@e-kwsm : thanks for making this contribution and sticking with it through the PR process! |
What does this implement/fix? Explain your changes.
Support Chemical Markup Language, CML, for writing
http://www.xml-cml.org/
CML is an XML variant and thus machine-friendly.
Example:
produces
Note that
hydrogenCount
is manually corrected.Any other comments?
TODO