Take additional context like an atom query into account when canonicalizing a molecule to SMILES #6401

Open
kienerj opened this issue May 26, 2023 · 2 comments

kienerj commented May 26, 2023

Is your feature request related to a problem? Please describe.

When a canonical SMILES is created from an RDKit molecule that carries additional context, that context is ignored during canonicalization. The issue is rather difficult to describe, so please be patient and ask if my explanations are unclear.

The generated SMILES look the same, but the atoms are written in a different order depending on the input the molecule was generated from. For my use case it matters that the atom indices stay the same.

Example 1 with SMARTS:

from rdkit import Chem
m1 = Chem.MolFromSmarts("[C]!@;:C-,=C(-[C&R1])-C")
m2 = Chem.MolFromSmarts("C-C(-[C&R1])-,=C!@;:[C]")

Chem.MolToSmiles(m1)
# CC(C)C~C
Chem.MolToSmiles(m2)
# CC(C)C~C

# As noted, the output is the same, but now look at the atom output order
m1.GetProp("_smilesAtomOutputOrder")
# [3,2,4,1,0,]
m2.GetProp("_smilesAtomOutputOrder")
# [0,1,2,3,4,]

image

Depending on how the molecule was created, the canonical SMILES can start with either the "C" atom or the "[C&R1]" atom. This isn't even clear from the image above. Since "C" and "[C&R1]" are not the same, it would be very helpful to "break the tie" by taking additional context, such as an atom query, into account so that the order is well defined (either the query atom always comes first, or vice versa).
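
To make the tie concrete, here is a minimal sketch that parses the two `_smilesAtomOutputOrder` strings quoted above and lists each atom's query in the order it was written to the SMILES. It is plain Python using values copied from this issue. The helper name `parse_output_order` and the hard-coded per-atom query lists are illustrative assumptions, not part of the RDKit API.

```python
def parse_output_order(prop):
    """Parse a '_smilesAtomOutputOrder' string such as '[3,2,4,1,0,]' into a list of ints."""
    return [int(tok) for tok in prop.strip("[]").split(",") if tok.strip()]


# Atom queries by atom index, read off the two SMARTS strings in Example 1.
m1_atoms = ["[C]", "C", "C", "[C&R1]", "C"]   # "[C]!@;:C-,=C(-[C&R1])-C"
m2_atoms = ["C", "C", "[C&R1]", "C", "[C]"]   # "C-C(-[C&R1])-,=C!@;:[C]"

# Output orders as reported in the issue.
m1_order = parse_output_order("[3,2,4,1,0,]")
m2_order = parse_output_order("[0,1,2,3,4,]")

# Query of each atom, listed in the order the atoms appear in the SMILES.
m1_written = [m1_atoms[i] for i in m1_order]
m2_written = [m2_atoms[i] for i in m2_order]

print(m1_written)  # ['[C&R1]', 'C', 'C', 'C', '[C]']
print(m2_written)  # ['C', 'C', '[C&R1]', 'C', '[C]']
```

Both molecules give `CC(C)C~C`, but the first SMILES atom is the `[C&R1]` query in `m1` and the plain `C` in `m2`, so an annotation stored by SMILES position would point at different atoms.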

Example 2 with attachment point:

image

Same issue here. The same SMILES *OC.c1ccncc1 is generated, but the atom order is different. As a result, the "variable attachment atoms" get different atom indices, namely 3,7,8 for m and 3,4,5 for m1.

In case of a tie (i.e. when the order doesn't matter for the final SMILES), additional context, in this case participation in a variable attachment, should be taken into account to get a canonical order.
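
The index mismatch can be sketched in plain Python. The snippet below reads the `ENDPTS` field from the two V3000 bond lines given in the molfiles under "Additional context" and maps the endpoint atoms to positions in the written SMILES. The helper names `parse_endpts` and `to_smiles_positions` are hypothetical. The two output orders are an assumption: they were reconstructed by hand so that they are consistent with `*OC.c1ccncc1` and with the indices reported above, and they are not copied from RDKit output.

```python
import re

_ENDPTS_RE = re.compile(r"ENDPTS=\((\d+(?:\s+\d+)*)\)")


def parse_endpts(bond_line):
    """Return 0-based endpoint atom indices from a V3000 bond line ([] if absent)."""
    match = _ENDPTS_RE.search(bond_line)
    if match is None:
        return []
    # The first number is the endpoint count; the rest are 1-based atom indices.
    count, *atoms = (int(tok) for tok in match.group(1).split())
    if count != len(atoms):
        raise ValueError(f"ENDPTS count {count} does not match {atoms}")
    return [a - 1 for a in atoms]


def to_smiles_positions(atom_indices, output_order):
    """Map original atom indices to their (sorted) positions in the written SMILES."""
    position_of = {atom_idx: pos for pos, atom_idx in enumerate(output_order)}
    return sorted(position_of[i] for i in atom_indices)


# Bond lines copied from the two molfiles in this issue.
endpts_m = parse_endpts("M  V30 7 1 7 8 ENDPTS=(3 1 5 6) ATTACH=ANY")
endpts_other = parse_endpts("M  V30 7 1 7 8 ENDPTS=(3 4 5 3) ATTACH=ANY")

# ASSUMPTION: hand-reconstructed atom output orders, not RDKit output.
order_m = [6, 7, 8, 0, 1, 2, 3, 4, 5]
order_other = [6, 7, 8, 4, 3, 2, 1, 0, 5]

print(to_smiles_positions(endpts_m, order_m))          # [3, 7, 8]
print(to_smiles_positions(endpts_other, order_other))  # [3, 4, 5]
```

Under these assumed orders, the ring is written in opposite directions for the two inputs. Both directions give the same SMILES string, which is exactly the kind of tie that the variable attachment information could break.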

Describe the solution you'd like

When canonicalizing to SMILES, additional context should be taken into account, because that context may need to be used elsewhere and linked back to the canonical SMILES by atom index.

Additional context

Molfiles for molecules with attachment points:

m = Chem.MolFromMolBlock('''
  Mrv2007 06232015292D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 9 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.7083 2.415 0 0
M  V30 2 C -3.042 1.645 0 0
M  V30 3 C -3.042 0.105 0 0
M  V30 4 N -1.7083 -0.665 0 0
M  V30 5 C -0.3747 0.105 0 0
M  V30 6 C -0.3747 1.645 0 0
M  V30 7 * -0.8192 1.3883 0 0
M  V30 8 O -0.8192 3.6983 0 0
M  V30 9 C 0.5145 4.4683 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 7 8 ENDPTS=(3 1 5 6) ATTACH=ANY
M  V30 8 1 8 9
M  V30 END BOND
M  V30 END CTAB
M  END''')

m2 = Chem.MolFromMolBlock("""
  ChemDraw05262313462D

  0  0  0     0  0              0 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 9 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -0.887074 0.827626 0.000000 0
M  V30 2 N -0.887074 0.002626 0.000000 0
M  V30 3 C -0.172603 -0.409874 0.000000 0
M  V30 4 C 0.541868 0.002626 0.000000 0
M  V30 5 C 0.541868 0.827626 0.000000 0
M  V30 6 C -0.172603 1.240126 0.000000 0
M  V30 7 * 0.303711 0.140126 0.000000 0
M  V30 8 O 0.887074 -0.443237 0.000000 0
M  V30 9 C 0.673548 -1.240126 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 2 3 4
M  V30 4 1 4 5
M  V30 5 2 5 6
M  V30 6 1 6 1
M  V30 7 1 7 8 ENDPTS=(3 4 5 3) ATTACH=ANY
M  V30 8 1 8 9
M  V30 END BOND
M  V30 END CTAB
M  END
""")

greglandrum (Member) commented

Hi @kienerj. I'd love to be able to do this, but canonicalizing queries is quite involved (almost a small research project) and not something which is likely to show up any time soon.

kienerj commented May 30, 2023

Fair enough, and I was thinking too small. It can get much more complex than my simple examples.

The other elephant in the room might be performance, so yes, this is certainly not a simple thing.
