Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change MMPA to use CanonicalRankAtoms instead of _CIPRank #561

Closed
adalke opened this issue Aug 10, 2015 · 1 comment
Closed

Change MMPA to use CanonicalRankAtoms instead of _CIPRank #561

adalke opened this issue Aug 10, 2015 · 1 comment
Milestone

Comments

@adalke
Copy link
Contributor

adalke commented Aug 10, 2015

I propose that the rdkit/Contrib/mmpa/indexing.py:get_symmetry_class() change from using Chem.AssignStereochemistry() to use the new CanonicalRankAtoms().

I'll start by dredging up an old conversation, which you can see at http://sourceforge.net/p/rdkit/mailman/message/27902778/

On Aug 3, 2011, at 8:58 PM, Alan Smith wrote:

I was just wondering if it there is a function to get the "symmetry class" of an atom in a molecule using rdkit ? Just to clarify, what I mean by "symmetry class" is that it is an arbitrary label for atoms that are symmetrically equivalent in a molecule - so to identify atoms that are symmetrically equivalent you just look for atoms that have the same symmetry class.

On Aug 5, 2011, at 6:43 AM, Greg Landrum wrote:

I did a bit more investigation this morning, and it turns out that it is possible to get the required values already. Here's an example:

>>> m = Chem.MolFromSmiles('c1ncccc1c1cccnc1')
>>> Chem.AssignStereochemistry(m,cleanIt=True,force=True,flagPossibleStereoCenters=True)
>>> [atom.GetProp('_CIPRank') for atom in m.GetAtoms()]
['4', '5', '3', '1', '0', '2', '2', '0', '1', '3', '5', '4']

The key pieces is the "flagPossibleStereoCenters" argument, which tells the stereochemistry assignment code to determine the CIP ranks of atoms even if it doesn't find a potential stereocenter in the molecule.

This insight is used in the MMPA code in rdkit/Contrib/mmpa/indexing.py .

Earlier this year, as a result of github issue #414 at #421 (titled "expose MolOps::rankAtoms() and MolOps::rankAtomsInFragment() to python"), there is a new function named CanonicalRankAtoms(). I decided to investigate if it was more appropriate for the MMPA code, and concluded that it should be used instead of the older CIP code.

I wrote a short program which loops through a set of structures to compute the two assignment lists:

Chem.AssignStereochemistry(mol, flagPossibleStereoCenters=True)
cip = [int(a.GetProp('_CIPRank')) for a in mol.GetAtoms()]

rank = list(Chem.CanonicalRankAtoms(mol, breakTies=False))

This creates values like:

[1, 3, 2, 6, 0, 0, 2, 6, 3, 1, 5, 4, 4, 5]

I also wrote a helper function to convert those lists into a set of symmetry class assignments:

from collections import defaultdict
def get_groups(values):
    value_to_indices = defaultdict(list)
    for i, value in enumerate(values):
        value_to_indices[value].append(i)
    groups = set(tuple(indices) for indices in value_to_indices.values())
    return groups

>>> get_groups([1, 3, 2, 6, 0, 0, 2, 6, 3, 1, 5, 4, 4, 5])
set([(2, 6), (4, 5), (10, 13), (1, 8), (0, 9), (3, 7), (11, 12)])
>>> sorted(get_groups([1, 3, 2, 6, 0, 0, 2, 6, 3, 1, 5, 4, 4, 5]))
[(0, 9), (1, 8), (2, 6), (3, 7), (4, 5), (10, 13), (11, 12)]

In this case, the atom with index 10 is in the same group as the atom with index 13. These sets should be identical if and only if the AssignStereochemistry() and CanonicalRankAtoms() assign atoms to the same symmetry groups.

I processed 86781 isomeric SMILES from PubChem, selected with the following:

gzcat ~/pubchem/Compound_034*.sdf.gz |   \
   awk 'flg {flg=0;print} /PUBCHEM_OPENEYE_ISO_SMILES/ {flg=1}' > x.dat

and found 25 differences. The full script is attached.

Here's one example:

Mismatch with C(NC(=O)[C@@H]([C@@H](C(=O)NCO)O)O)O
Canonical O=C(NCO)[C@@H](O)[C@@H](O)C(=O)NCO
 Rank: [7, 9, 11, 1, 13, 12, 10, 0, 8, 6, 2, 4, 5, 3] => [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,), (11,), (12,), (13,)]
 CIP: [1, 3, 2, 6, 0, 0, 2, 6, 3, 1, 5, 4, 4, 5] => [(0, 9), (1, 8), (2, 6), (3, 7), (4, 5), (10, 13), (11, 12)]

In that one, the topologies are the same but there isn't perfect symmetry. To get perfect symmetry, the current SMILES:

C(NC(=O)[C@@H]([C@@H](C(=O)NCO)O)O)O

should be

C(NC(=O)[C@@H]([C@H](C(=O)NCO)O)O)O

I think the CanonicalRankAtom is correct (in the context of the needs for the MMPA code) in saying these are different.

Here's another one:

Mismatch with CC1=C(C=C(C=C1)CC2=NN=N[N-]2)C
Canonical Cc1ccc(Cc2nnn[n-]2)cc1C
Rank: [0, 12, 13, 4, 10, 2, 3, 5, 11, 8, 6, 7, 9, 1] => [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,), (11,), (12,), (13,)]
 CIP: [0, 7, 8, 5, 6, 3, 4, 2, 9, 10, 11, 11, 10, 1] => [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9, 12), (10, 11), (13,)]

with the simpler test case of:

Mismatch with c1nnn[n-]1
Canonical c1nnn[n-]1
 Rank: [0, 1, 2, 3, 4] => [(0,), (1,), (2,), (3,), (4,)]
 CIP: [0, 1, 2, 2, 1] => [(0,), (1, 4), (2, 3)]

The difference seems to be the [N-] as changing the charge does not affect the CIP rank:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("c1nnn[n-]1")
>>> Chem.AssignStereochemistry(mol, flagPossibleStereoCenters=True)
>>> [int(a.GetProp('_CIPRank')) for a in mol.GetAtoms()]
[0, 1, 2, 2, 1]
>>> a=mol.GetAtomWithIdx(4)
>>> a.GetFormalCharge()
-1
>>> a.SetFormalCharge(4)
>>> Chem.AssignStereochemistry(mol, force=True, flagPossibleStereoCenters=True)
>>> [int(a.GetProp('_CIPRank')) for a in mol.GetAtoms()]
[0, 1, 2, 2, 1]

while it does get CanonicalRankAtoms() to give a symmetrical assignment:

>>> mol = Chem.MolFromSmiles("c1nnn[n-]1")
>>> list(Chem.CanonicalRankAtoms(mol, breakTies=False))
[0, 1, 2, 3, 4]
>>> a.GetFormalCharge()
-1
>>> a.SetFormalCharge(0)
>>> list(Chem.CanonicalRankAtoms(mol, breakTies=False))
[0, 1, 3, 3, 1]

As charge is important for MMPA, I believe this example also shows that CanonicalRankAtom() is the better ranking method than AssignStereochemistry(), for what MMPA needs.

See https://gist.github.com/adalke/fe39441b22251c0c1a5a for a patch and for the comparison code.

@greglandrum greglandrum added this to the 2015_09_1 milestone Aug 14, 2015
@greglandrum
Copy link
Member

Thanks for identifying the problem and providing a patch.

I agree that the atom ranking code is a better method for this than the CIP ranks.

sriniker pushed a commit to sriniker/rdkit that referenced this issue Oct 20, 2015
patch from Andrew Dalke
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants