Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Chemical Markup Language, CML, for writing #3024

Merged
merged 6 commits into from
Apr 29, 2021

Conversation

e-kwsm
Copy link
Contributor

@e-kwsm e-kwsm commented Mar 22, 2020

What does this implement/fix? Explain your changes.

Support Chemical Markup Language, CML, for writing
http://www.xml-cml.org/

CML is an XML variant and thus machine-friendly.

Example:

from rdkit import Chem

mol = """acrylonitrile


  7  6  0  0  0  0  0  0  0  0999 V2000
   -0.6992   -1.1405    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5210    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.7820    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7845   -1.1255    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2103   -2.1109    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4773    0.9730    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  3
  1  5  1  0
  1  6  1  0
  2  3  1  0
  2  7  1  0
  3  4  3  0
M  ISO  1   7   2
M  END
"""

m = Chem.MolFromMolBlock(mol, removeHs=False)
print(Chem.MolToCMLBlock(m))

produces

<?xml version="1.0" encoding="utf-8"?>
<cml:cml xmlns:cml="http://www.xml-cml.org/schema" xmlns:convention="http://www.xml-cml.org/convention/" convention="convention:molecular">
  <cml:molecule id="m-1" formalCharge="0" spinMultiplicity="1">
    <cml:name>acrylonitrile</cml:name>
    <cml:atomArray>
      <cml:atom id="a0" elementType="C" formalCharge="0" hydrogenCount="2" spinMultiplicity="1" x3="-0.699200" y3="-1.140500" z3="0.000000"/>
      <cml:atom id="a1" elementType="C" formalCharge="0" hydrogenCount="1" spinMultiplicity="1" x3="0.000000" y3="0.000000" z3="0.000000"/>
      <cml:atom id="a2" elementType="C" formalCharge="0" hydrogenCount="0" spinMultiplicity="1" x3="1.521000" y3="0.000000" z3="0.000000"/>
      <cml:atom id="a3" elementType="N" formalCharge="0" hydrogenCount="0" spinMultiplicity="1" x3="2.782000" y3="0.000000" z3="0.000000"/>
      <cml:atom id="a4" elementType="H" formalCharge="0" hydrogenCount="0" spinMultiplicity="1" x3="-1.784500" y3="-1.125500" z3="0.000000"/>
      <cml:atom id="a5" elementType="H" formalCharge="0" hydrogenCount="0" spinMultiplicity="1" x3="-0.210300" y3="-2.110900" z3="0.000000"/>
      <cml:atom id="a6" elementType="H" formalCharge="0" hydrogenCount="0" isotopeNumber="2" spinMultiplicity="1" x3="-0.477300" y3="0.973000" z3="0.000000"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a0 a1" id="b0" order="2"/>
      <cml:bond atomRefs2="a0 a4" id="b1" order="1"/>
      <cml:bond atomRefs2="a0 a5" id="b2" order="1"/>
      <cml:bond atomRefs2="a1 a2" id="b3" order="1"/>
      <cml:bond atomRefs2="a1 a6" id="b4" order="1"/>
      <cml:bond atomRefs2="a2 a3" id="b5" order="3"/>
    </cml:bondArray>
  </cml:molecule>
</cml:cml>

Note that hydrogenCount is manually corrected.

Any other comments?

TODO

  • test

@e-kwsm
Copy link
Contributor Author

e-kwsm commented Mar 22, 2020

Later I will implement CMLReader as well if I have time.

@greglandrum
Copy link
Member

@e-kwsm , I could see that having a reader and writer for CML would be useful and really appreciate the contribution, but we are planning the feature freeze for the 2020.03 release tomorrow (Monday). Getting this reviewed and merged in time will be tricky.
Do you mind if we postpone review until after the release?

@e-kwsm
Copy link
Contributor Author

e-kwsm commented Mar 22, 2020

Test is not implemented, so it should be postponed.

@e-kwsm e-kwsm marked this pull request as ready for review March 25, 2020 09:14
@e-kwsm
Copy link
Contributor Author

e-kwsm commented Mar 25, 2020

Now tests are implemented and ready for review.

@e-kwsm e-kwsm changed the title Implement RDKit::MolToCMLBlock and RDKit::MolToCMLFile Support Chemical Markup Language, CML, for writing Mar 25, 2020
@e-kwsm
Copy link
Contributor Author

e-kwsm commented Mar 25, 2020

CI failure is due to timeout.


atom/@hydrogenCount would be incorrect:

from rdkit import Chem
from rdkit.Chem import AllChem


mol = Chem.AddHs(Chem.MolFromSmiles('c1ccccc1/C=C/C#N'))  # cinnamonitrile
AllChem.EmbedMolecule(mol, randomSeed=0)
r = AllChem.MMFFOptimizeMolecule(mol)
if r != 0:
    raise RuntimeError('not converged')


def num_total_hs(mol):
    print([a.GetTotalNumHs() for a in mol.GetAtoms()])


num_total_hs(mol)
num_total_hs(Chem.MolFromMolBlock(Chem.MolToMolBlock(mol)))
num_total_hs(Chem.MolFromMolBlock(Chem.MolToMolBlock(mol), removeHs=False))
num_total_hs(Chem.MolFromTPLBlock(Chem.MolToTPLBlock(mol)))

prints

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 0, 1, 1, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

@bp-kelley
Copy link
Contributor

@greglandrum - It might be time to review this now.

@greglandrum
Copy link
Member

@bp-kelley : agreed

@bp-kelley
Copy link
Contributor

@bp-kelley @greglandrum @ptosco This one has been sitting for a while, so I'm poking it again.

Code/GraphMol/FileParsers/CMLWriter.cpp Outdated Show resolved Hide resolved

const auto n_rad_es = a->getNumRadicalElectrons();
mol_num_radical_electrons += n_rad_es;
atom.put("<xmlattr>.spinMultiplicity", n_rad_es + 1u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right.

paired electrons = multiplicity = 1
one unpaired electron == spin multiplicity = 2
two unpaired electron == spin multiplicity can be 1 or 3 depending on opposite or same spine

Is this doing something different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct.
There is no spin information, so spinMultiplicity attribute is added only in the case of zero or one radical electron.

Code/GraphMol/FileParsers/CMLWriter.cpp Outdated Show resolved Hide resolved
Code/GraphMol/FileParsers/CMLWriter.cpp Show resolved Hide resolved
@e-kwsm e-kwsm force-pushed the CMLWriter branch 3 times, most recently from 5360b07 to b0dccc6 Compare November 19, 2020 09:08
@e-kwsm
Copy link
Contributor Author

e-kwsm commented Apr 6, 2021

Ready to merge.

@greglandrum
Copy link
Member

Thanks! I will review in the next couple of days.

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numerous comments and suggestions here.

@@ -738,6 +738,58 @@ M END
}
}

TEST_CASE("CML writer", "[CML][writer]") {
SECTION("basics") {
std::unique_ptr<RWMol> mol{new RWMol{}};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly not necessary for the merge, but you could perhaps make this a lot simpler by just constructing the molecule from a mol block:

hydrogen cyanide and deuterated hydroxide
 OpenBabel09112017493D

  5  3  0  0  0  0  0  0  0  0999 V2000
   -1.0650    0.0000    0.0000 H   0  0  0  0  0  1  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  4  0  0  0  0  0  0
    1.1600    0.0000    0.0000 N   0  0  0  0  0  3  0  0  0  0  0  0
    0.0000    0.0000    2.0000 O   0  5  0  0  0  1  0  0  0  0  0  0
    0.0000    0.9400    2.0000 H   0  0  0  0  0  1  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  3  0  0  0  0
  4  5  1  0  0  0  0
M  RAD  5   1   1   2   1   3   1   4   1   5   1
M  ISO  1   5   2
M  CHG  1   4  -1
M  END

@@ -160,6 +160,19 @@ inline void MolToV3KMolFile(const ROMol &mol, const std::string &fName,
MolToMolFile(mol, fName, includeStereo, confId, kekulize, true);
}

RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have an overload like this for any of the other writers.
That doesn't make it wrong, but I'm curious why you think it's necessary for the CML writer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was suggested by @bp-kelley (#3024 (comment)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, then the question is for @bp-kelley: why did you suggest adding a new overload for this writer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to remove duplicated code from the previous functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does it need to be in the public API? I’m pushing because we don’t have an equivalent for any of the other readers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that’s just where the code was originally. These could all be in a cpp file.

if (kekulize) {
MolOps::Kekulize(rwmol);
}
const auto& conf = rwmol.getConformer(confId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the atomic coordinates are optional in CML.
Given that, I think it makes sense if we make writing the conformer optional too.
The easiest way to do this (I think) is to use a pointer to the active conformer which is null if the molecule doesn't have a conformer.
Below you would then check if the pointer is null before writing the atomic coords.

Code/GraphMol/FileParsers/CMLWriter.cpp Outdated Show resolved Hide resolved
Code/GraphMol/FileParsers/CMLWriter.cpp Show resolved Hide resolved
Code/GraphMol/FileParsers/CMLWriter.cpp Show resolved Hide resolved
Code/GraphMol/FileParsers/CMLWriter.cpp Outdated Show resolved Hide resolved
Code/GraphMol/FileParsers/CMLWriter.cpp Show resolved Hide resolved

void MolToCMLBlock(std::ostream& os, const ROMol& mol, int confId,
bool kekulize) {
auto pt = molToPTree(mol, confId, kekulize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea to use the boost property tree to deal with the XML parts.

@e-kwsm e-kwsm force-pushed the CMLWriter branch 2 times, most recently from af1ddee to b191127 Compare April 20, 2021 02:27
Implement RDKit::MolToCMLBlock and RDKit::MolToCMLFile
http://www.xml-cml.org/
Comment on lines 131 to 165
std::vector<RDGeom::Point3D> xyzs;
std::vector<unsigned> neighbors;
for (auto nbri : boost::make_iterator_range(mol.getAtomBonds(a))) {
const auto* const b = mol[nbri];
const auto o = b->getOtherAtom(a)->getIdx();
neighbors.push_back(o);
xyzs.push_back(conf.getAtomPos(o));
}

if (neighbors.size() != 4u) {
continue;
}

/* atomParity = sign(det([[1, 1, 1, 1],
[x0, x1, x2, x3],
[y0, y1, y2, y3],
[z0, z1, z2, z3]]))
*/
const auto det = determinant(xyzs[1u].x, xyzs[2u].x, xyzs[3u].x, //
xyzs[1u].y, xyzs[2u].y, xyzs[3u].y, //
xyzs[1u].z, xyzs[2u].z, xyzs[3u].z) -
determinant(xyzs[0u].x, xyzs[2u].x, xyzs[3u].x, //
xyzs[0u].y, xyzs[2u].y, xyzs[3u].y, //
xyzs[0u].z, xyzs[2u].z, xyzs[3u].z) +
determinant(xyzs[0u].x, xyzs[1u].x, xyzs[3u].x, //
xyzs[0u].y, xyzs[1u].y, xyzs[3u].y, //
xyzs[0u].z, xyzs[1u].z, xyzs[3u].z) -
determinant(xyzs[0u].x, xyzs[1u].x, xyzs[2u].x, //
xyzs[0u].y, xyzs[1u].y, xyzs[2u].y, //
xyzs[0u].z, xyzs[1u].z, xyzs[2u].z);
if (det == 0.0) {
continue;
}

auto& atomParity = atom.add("cml:atomParity", det > 0.0 ? 1 : -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a simpler solution to this problem. I will do a PR against your branch with a suggestion.

greglandrum and others added 4 commits April 23, 2021 07:02
1. stop writing the cml: namespace to the output. Tools like marvin can't read that and it's not nececssary
2. fix the H count
3. continue writing atom info for 2D confs
4. simplify the writing of atom parity
write bond hash/wedge information
Comment on lines 163 to 165
RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol,
int confId = -1,
bool kekulize = true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just remove this from the public API. It's fine to have the function in CMLWriter.cpp and use it there, but there's no need to expose it more broadly.

Suggested change
RDKIT_FILEPARSERS_EXPORT void MolToCMLBlock(std::ostream &os, const ROMol &mol,
int confId = -1,
bool kekulize = true);

@greglandrum
Copy link
Member

@e-kwsm Once that last change I requested (to remove the extra function from the public API) and the CI tests pass I will be happy to merge this one.

@greglandrum greglandrum added this to the 2021_09_1 milestone Apr 29, 2021
@greglandrum greglandrum merged commit f4c7d34 into rdkit:master Apr 29, 2021
@greglandrum
Copy link
Member

@e-kwsm : thanks for making this contribution and sticking with it through the PR process!

@e-kwsm e-kwsm deleted the CMLWriter branch April 29, 2021 05:42
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Feb 2, 2022
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Feb 7, 2022
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Feb 13, 2022
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Feb 22, 2022
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Mar 7, 2022
e-kwsm added a commit to e-kwsm/rdkit that referenced this pull request Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants