GetSubstructMatches() loops at 43690 iterations. #4558

ricrogz · 2021-09-25T01:07:34Z

The attached solvent.sdf.gz file contains a large number of atoms (84525 atoms in 28175 water molecules with explicit Hydrogen atoms).

Running the following script should successfully match and return all the water molecules. uniquify is intentionally disabled to allow a quick execution of the query, so we expect the script to print that it found 56350 matches (since the query is symmetric, it can match the same molecule as H-O-H' and also as H'-O-H).

from rdkit import Chem
from itertools import groupby


def all_equal(iterable):
    g = groupby(iterable)
    return next(g, True) and not next(g, False)


mol = Chem.MolFromMolFile('solvent.sdf', sanitize=False, removeHs=False)
print(f'num atoms: {mol.GetNumAtoms()}')

q = Chem.MolFromSmarts('[H]O[H]')
max_sz = 500000
w = mol.GetSubstructMatches(q, uniquify=False, maxMatches=max_sz)

print(len(w))

On my computer, and with a Debug build, the script returns after 7 seconds, but it surprisingly prints:

Since there's only 28175 water molecules, at least one of them must have been returned multiple (more than 2) times. Notice that we got the maximum number of matches we allowed. And even if we further increase this number, we always get returned the maximum number of matches, up to the point where the call becomes too slow to to wait for it to return, and I abort the experiment.

Digging a bit more, I tried to see if there was some kind of pattern in the molecules that were being returned. And there was. I added this to the script:

for i in range(20):  # Check the first 20 mols
    needle = w[i]
    idxs = [i]
    pos = i
    
    # Find the indexes of repeated molecules
    try:
        for _ in range(len(w)):
            pos = w.index(needle, pos + 1)
            idxs.append(pos)
    except ValueError:
        pass

    # Try to find a pattern in the repetitions
    deltas = [b - a for a, b in zip(idxs, idxs[1:])]
    equal = all_equal(deltas)
    print(f'indexes: {idxs}')
    print(f'deltas : {deltas}')
    print(f'all equal: {equal}')
    print('')

Which prints:

indexes: [0, 43690, 87380, 131070, 174760, 218450, 262140, 305830, 349520, 393210, 436900, 480590]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [1, 43691, 87381, 131071, 174761, 218451, 262141, 305831, 349521, 393211, 436901, 480591]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [2, 43692, 87382, 131072, 174762, 218452, 262142, 305832, 349522, 393212, 436902, 480592]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [3, 43693, 87383, 131073, 174763, 218453, 262143, 305833, 349523, 393213, 436903, 480593]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [4, 43694, 87384, 131074, 174764, 218454, 262144, 305834, 349524, 393214, 436904, 480594]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [5, 43695, 87385, 131075, 174765, 218455, 262145, 305835, 349525, 393215, 436905, 480595]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [6, 43696, 87386, 131076, 174766, 218456, 262146, 305836, 349526, 393216, 436906, 480596]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [7, 43697, 87387, 131077, 174767, 218457, 262147, 305837, 349527, 393217, 436907, 480597]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [8, 43698, 87388, 131078, 174768, 218458, 262148, 305838, 349528, 393218, 436908, 480598]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [9, 43699, 87389, 131079, 174769, 218459, 262149, 305839, 349529, 393219, 436909, 480599]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [10, 43700, 87390, 131080, 174770, 218460, 262150, 305840, 349530, 393220, 436910, 480600]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [11, 43701, 87391, 131081, 174771, 218461, 262151, 305841, 349531, 393221, 436911, 480601]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [12, 43702, 87392, 131082, 174772, 218462, 262152, 305842, 349532, 393222, 436912, 480602]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [13, 43703, 87393, 131083, 174773, 218463, 262153, 305843, 349533, 393223, 436913, 480603]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [14, 43704, 87394, 131084, 174774, 218464, 262154, 305844, 349534, 393224, 436914, 480604]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [15, 43705, 87395, 131085, 174775, 218465, 262155, 305845, 349535, 393225, 436915, 480605]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [16, 43706, 87396, 131086, 174776, 218466, 262156, 305846, 349536, 393226, 436916, 480606]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [17, 43707, 87397, 131087, 174777, 218467, 262157, 305847, 349537, 393227, 436917, 480607]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [18, 43708, 87398, 131088, 174778, 218468, 262158, 305848, 349538, 393228, 436918, 480608]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [19, 43709, 87399, 131089, 174779, 218469, 262159, 305849, 349539, 393229, 436919, 480609]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

So, basically, GetSubstructMatches() finds the first 43690 matches, and then loops and starts finding them again. Adding or removing some molecules from the input file do not make this number change.

The text was updated successfully, but these errors were encountered:

* fix * add test

ricrogz added the bug label Sep 25, 2021

ricrogz self-assigned this Sep 25, 2021

This was referenced Sep 25, 2021

Fixes #4558 #4559

Merged

Improve performance of removing substruct/tautomer duplicates #4560

Merged

greglandrum modified the milestones: 2021_03_6, 2021_09_1 Sep 25, 2021

greglandrum closed this as completed in #4559 Sep 25, 2021

greglandrum pushed a commit that referenced this issue Sep 25, 2021

Fixes #4558 (#4559)

2c7485f

* fix * add test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GetSubstructMatches() loops at 43690 iterations. #4558

GetSubstructMatches() loops at 43690 iterations. #4558

ricrogz commented Sep 25, 2021 •

edited

GetSubstructMatches() loops at 43690 iterations. #4558

GetSubstructMatches() loops at 43690 iterations. #4558

Comments

ricrogz commented Sep 25, 2021 • edited

ricrogz commented Sep 25, 2021 •

edited