Take additional context like an atom query into account when canonicalizing molecule to SMILES #6401

kienerj · 2023-05-26T12:43:57Z

Is your feature request related to a problem? Please describe.

When creating canonical SMILES from an RDKit molecule with additional context, said context will be ignored for canonicalization. The issue is rather difficult to describe, so please be patient and ask if my explanations are unclear.

The generated SMILES look the same, but the atom indices are in a different order depending on the input from which the molecule was generated. For my use case it matters that the atom indices stay the same.

Example 1 with SMARTS:

from rdkit import Chem
m1 = Chem.MolFromSmarts("[C]!@;:C-,=C(-[C&R1])-C")
m2 = Chem.MolFromSmarts("C-C(-[C&R1])-,=C!@;:[C]")

Chem.MolToSmiles(m1)
# CC(C)C~C
Chem.MolToSmiles(m2)
# CC(C)C~C

# As said, the output is the same, but now let's look at the atom indices
m1.GetProp("_smilesAtomOutputOrder")
# [3,2,4,1,0,]
m2.GetProp("_smilesAtomOutputOrder")
# [0,1,2,3,4,]

Depending on how the molecule was created, the canonical SMILES can start with either the "C" atom or the "[C&R1]" atom. This isn't even clear from the image above. Since "C" and "[C&R1]" are not the same, it would be very helpful to break the tie by taking additional context, such as a query, into account, so that the order is well defined (either the query atom always comes first, or always last).
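To see which query atom each molecule's SMILES starts with, the output order can be mapped back onto the atoms' SMARTS. The following is a minimal sketch, not part of the original report: the helper names `parse_atom_output_order` and `smarts_in_output_order` are made up for illustration, and it assumes the property string has the `[3,2,4,1,0,]` shape shown above.

```python
def parse_atom_output_order(prop_value):
    """Turn a string such as '[3,2,4,1,0,]' into [3, 2, 4, 1, 0]."""
    tokens = prop_value.strip("[] \n").split(",")
    # The trailing comma leaves an empty token, which is skipped here.
    return [int(tok) for tok in tokens if tok.strip()]


def smarts_in_output_order(mol):
    """Return each atom's SMARTS in the order the atoms were written to SMILES."""
    # Imported lazily so the parser above can be used without RDKit installed.
    from rdkit import Chem

    Chem.MolToSmiles(mol)  # sets the _smilesAtomOutputOrder property
    order = parse_atom_output_order(mol.GetProp("_smilesAtomOutputOrder"))
    return [mol.GetAtomWithIdx(idx).GetSmarts() for idx in order]
```

With the orders reported above, m1 starts at input atom 3, which is the "[C&R1]" atom in its SMARTS, while m2 starts at input atom 0, the plain "C" atom.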

Example 2 with attachment point:

Same issue here. The same SMILES *OC.c1ccncc1 is generated, but the atom order is different. As a result, the "variable attachment atoms" get different atom indices, namely 3,7,8 for m and 3,4,5 for m1.

In case of a tie (i.e. when the order doesn't matter for the final SMILES), additional context, in this case participation in a variable attachment, should be taken into account to get a canonical order.
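The attachment indices quoted above can be reproduced by translating the ENDPTS atoms into their positions in the SMILES output. This is a rough sketch under assumptions: it assumes RDKit keeps the ENDPTS value as the bond property `_MolFileBondEndPts` in its molfile form (for example `(3 1 5 6)`, a count followed by 1-based atom numbers), and the helper names are hypothetical.

```python
def parse_endpts(value):
    """'(3 1 5 6)' -> [0, 4, 5]: drop the leading count, make indices 0-based."""
    numbers = [int(tok) for tok in value.strip("() ").split()]
    count, atoms = numbers[0], numbers[1:]
    if count != len(atoms):
        raise ValueError(f"malformed ENDPTS value: {value!r}")
    return [atom - 1 for atom in atoms]


def positions_in_smiles(atom_indices, output_order):
    """Map input atom indices to their positions in the SMILES output order."""
    position_of = {atom_idx: pos for pos, atom_idx in enumerate(output_order)}
    return sorted(position_of[idx] for idx in atom_indices)


def variable_attachment_positions(mol):
    """Return {bond index: SMILES positions of its variable attachment atoms}."""
    # Imported lazily so the two helpers above work without RDKit installed.
    from rdkit import Chem

    Chem.MolToSmiles(mol)  # sets the _smilesAtomOutputOrder property
    raw = mol.GetProp("_smilesAtomOutputOrder")
    output_order = [int(tok) for tok in raw.strip("[] ").split(",") if tok.strip()]
    result = {}
    for bond in mol.GetBonds():
        # Assumed property name for the ENDPTS data read from the molfile.
        if bond.HasProp("_MolFileBondEndPts"):
            endpoints = parse_endpts(bond.GetProp("_MolFileBondEndPts"))
            result[bond.GetIdx()] = positions_in_smiles(endpoints, output_order)
    return result
```

The positions returned this way are only comparable between two inputs if ties are broken the same way for both, which is exactly what this request is about.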

Describe the solution you'd like

When canonicalizing to SMILES, additional context should be considered whenever that context needs to be used elsewhere and linked to the canonical SMILES by atom index.

Additional context

Molfiles for molecules with attachment points:

m = Chem.MolFromMolBlock('''
  Mrv2007 06232015292D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 9 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.7083 2.415 0 0
M  V30 2 C -3.042 1.645 0 0
M  V30 3 C -3.042 0.105 0 0
M  V30 4 N -1.7083 -0.665 0 0
M  V30 5 C -0.3747 0.105 0 0
M  V30 6 C -0.3747 1.645 0 0
M  V30 7 * -0.8192 1.3883 0 0
M  V30 8 O -0.8192 3.6983 0 0
M  V30 9 C 0.5145 4.4683 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 7 8 ENDPTS=(3 1 5 6) ATTACH=ANY
M  V30 8 1 8 9
M  V30 END BOND
M  V30 END CTAB
M  END''')

m2 = Chem.MolFromMolBlock("""
  ChemDraw05262313462D

  0  0  0     0  0              0 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 9 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -0.887074 0.827626 0.000000 0
M  V30 2 N -0.887074 0.002626 0.000000 0
M  V30 3 C -0.172603 -0.409874 0.000000 0
M  V30 4 C 0.541868 0.002626 0.000000 0
M  V30 5 C 0.541868 0.827626 0.000000 0
M  V30 6 C -0.172603 1.240126 0.000000 0
M  V30 7 * 0.303711 0.140126 0.000000 0
M  V30 8 O 0.887074 -0.443237 0.000000 0
M  V30 9 C 0.673548 -1.240126 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 2 3 4
M  V30 4 1 4 5
M  V30 5 2 5 6
M  V30 6 1 6 1
M  V30 7 1 7 8 ENDPTS=(3 4 5 3) ATTACH=ANY
M  V30 8 1 8 9
M  V30 END BOND
M  V30 END CTAB
M  END
""")


greglandrum · 2023-05-30T13:08:09Z

Hi @kienerj. I'd love to be able to do this, but canonicalizing queries is quite involved (almost a small research project) and not something which is likely to show up any time soon.

kienerj · 2023-05-30T18:09:31Z

Fair enough, I was thinking too small. It can get much more complex than my simple examples.

The other elephant in the room might be performance, so yes, this is certainly not a simple thing.

kienerj added the enhancement label May 26, 2023

kienerj mentioned this issue May 31, 2023

Molecules with query can get a different hash due to SMILES canonicalization not taking query features into account kienerj/UniqueMoleculeHash#3

Open
