kekulize option not yielding kekulized MHFP results #3942

greglandrum · 2021-03-17T12:32:25Z

Here's the demo using one of the MHFP example molecules

In [11]: m = Chem.MolFromSmiles('Cn1cnc2c1c(=O)[nH]c(=O)n2C')

In [12]: enc = rdMHFPFingerprint.MHFPEncoder(128,42)

In [13]: sh = enc.CreateShinglingFromMol(m,kekulize=True)

In [14]: list(sh)[:10]
Out[14]: 
['c1cncn1',
 'c1cnc[nH]c1',
 'CN',
 'CN(:C):C',
 'CN(:C:N):C(:C):C',
 'N(:C)(:C)C',
 'N(C)(:C:N):C(:C):C',
 'N1(C):C:N:C(:N):C:1:C(:[NH])=O',
 'C(:N):N',
 'C(:N:C):N(:C)C']

The problem here is that the MHFP code is generating a SMILES with the doKekule argument set to true and expecting that this will cause the SMILES writer to kekulize the molecule. As of the 2020.09 release cycle this does not happen (see #2788 for a discussion of that).

The fix is quite easy: just kekulize the molecule before generating the fingerprint. However since kekulization is not a canonical process, this does result in the code potentially returning different fingerprints for different kekule structures.
Here's a demo of that:

n [42]: m = Chem.MolFromSmiles('CC1=NC=CC=C1')

In [43]: m2 = Chem.MolFromSmiles('N1C=CC=CC=1C')

In [44]: Chem.MolToSmiles(m)==Chem.MolToSmiles(m2)
Out[44]: True

In [45]: Chem.Kekulize(m)

In [46]: Chem.Kekulize(m2)

In [49]: Chem.MolToSmiles(m,kekuleSmiles=True)==Chem.MolToSmiles(m2,kekuleSmiles=True)
Out[49]: False

In [50]: enc = rdMHFPFingerprint.MHFPEncoder(128,42)

In [53]: s1=enc.CreateShinglingFromMol(m,kekulize=True)

In [54]: s2=enc.CreateShinglingFromMol(m2,kekulize=True)

In [56]: s1 = sorted(s1)

In [57]: s2 = sorted(s2)

In [58]: s1==s2
Out[58]: False

In [59]: s1
Out[59]: 
['C(=C)C',
 'C(=C)C',
 'C(=C)C',
 'C(=C)N',
 'C(=CC)C(C)=N',
 'C(=CC)N=C',
 'C(C)(C)=N',
 'C(C)(C=C)=NC',
 'C(C=C)=CC',
 'C(C=C)=CN',
 'C1(C)=NC=CC=C1',
 'C1=CC(C)=NC=C1',
 'C1=CC=CC(C)=N1',
 'C1=CC=CN=C1C',
 'C1=CN=CC=C1',
 'CC',
 'CC(C)=N',
 'CC(C=C)=NC',
 'N(=C)C',
 'N(C=C)=C(C)C',
 'N1=C(C)C=CC=C1',
 'c1ccncc1']

In [60]: s2
Out[60]: 
['C(=C)(C)N',
 'C(=C)C',
 'C(=C)C',
 'C(=C)C',
 'C(=CC)C=N',
 'C(C)(=CC)N=C',
 'C(C)=N',
 'C(C=C)=C(C)N',
 'C(C=C)=CC',
 'C(C=C)=NC',
 'C1(C)=CC=CC=N1',
 'C1=C(C)N=CC=C1',
 'C1=CC=CN=C1',
 'C1=CC=NC(C)=C1',
 'C1=NC(C)=CC=C1',
 'CC',
 'CC(=C)N',
 'CC(=CC)N=C',
 'N(=C)C',
 'N(=CC)C(=C)C',
 'N1=CC=CC=C1C',
 'c1ccncc1']

As an aside: the MHFP publication says "All substructure SMILES are canonicalized and kekulized". This is certainly not the case in the C++ implementation and given the difficulty of doing canonical kekulization, I wonder if it is in the original python implementation. Given that this is the case, I think it would make sense to change the defaults in this code and make kekulize=False the default.

The text was updated successfully, but these errors were encountered:

greglandrum · 2021-03-17T12:35:18Z

@daenuprobst, I will fix the main bug - that the kekulize argument doesn't actually do what it says - as part of fixing #2788. But I want to wait and give you a chance to comment before changing the default in this code to make kekulize=False the default.

FanwangM · 2021-07-19T19:24:00Z

enc = rdMHFPFingerprint.MHFPEncoder(128,42)

I didn't quite understand what the number 128 and 42 mean here after looking at the documentation, https://www.rdkit.org/docs/source/rdkit.Chem.rdMHFPFingerprint.html.
Can you say a little about it? Thank you so much!! @greglandrum

FanwangM · 2021-07-19T19:41:00Z

I got it with detailed explanations and with source code, https://github.com/rdkit/rdkit/blob/2053a98ffa93ab7e7e433362d6164750fac17bf9/Code/GraphMol/Fingerprints/MHFP.cpp#L34-39.

    # n_permutations: analogous to the number of bits ECFP fingerprints are folded into. Higher is better, lower is less exact.
    # seed: seed for the MinHash operation. Has to be the same for comparable fingerprints.

Thank you.

greglandrum added bug changes results labels Mar 17, 2021

greglandrum added a commit to greglandrum/rdkit that referenced this issue Mar 17, 2021

Fixes rdkit#3942

0ef905d

greglandrum mentioned this issue Mar 17, 2021

Fixes #2788: kekuleSmiles=true not producing kekule SMILES #3943

Merged

greglandrum closed this as completed in 31c90a3 Mar 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kekulize option not yielding kekulized MHFP results #3942

kekulize option not yielding kekulized MHFP results #3942

greglandrum commented Mar 17, 2021 •

edited

greglandrum commented Mar 17, 2021

FanwangM commented Jul 19, 2021

FanwangM commented Jul 19, 2021

kekulize option not yielding kekulized MHFP results #3942

kekulize option not yielding kekulized MHFP results #3942

Comments

greglandrum commented Mar 17, 2021 • edited

greglandrum commented Mar 17, 2021

FanwangM commented Jul 19, 2021

FanwangM commented Jul 19, 2021

greglandrum commented Mar 17, 2021 •

edited