Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GetSubstructMatches() loops at 43690 iterations. #4558

Closed
ricrogz opened this issue Sep 25, 2021 · 0 comments · Fixed by #4559
Closed

GetSubstructMatches() loops at 43690 iterations. #4558

ricrogz opened this issue Sep 25, 2021 · 0 comments · Fixed by #4559
Assignees
Labels
Milestone

Comments

@ricrogz
Copy link
Contributor

ricrogz commented Sep 25, 2021

The attached solvent.sdf.gz file contains a large number of atoms (84525 atoms in 28175 water molecules with explicit Hydrogen atoms).

Running the following script should successfully match and return all the water molecules. uniquify is intentionally disabled to allow a quick execution of the query, so we expect the script to print that it found 56350 matches (since the query is symmetric, it can match the same molecule as H-O-H' and also as H'-O-H).

from rdkit import Chem
from itertools import groupby


def all_equal(iterable):
    g = groupby(iterable)
    return next(g, True) and not next(g, False)


mol = Chem.MolFromMolFile('solvent.sdf', sanitize=False, removeHs=False)
print(f'num atoms: {mol.GetNumAtoms()}')

q = Chem.MolFromSmarts('[H]O[H]')
max_sz = 500000
w = mol.GetSubstructMatches(q, uniquify=False, maxMatches=max_sz)

print(len(w))

On my computer, and with a Debug build, the script returns after 7 seconds, but it surprisingly prints:

500000

Since there's only 28175 water molecules, at least one of them must have been returned multiple (more than 2) times. Notice that we got the maximum number of matches we allowed. And even if we further increase this number, we always get returned the maximum number of matches, up to the point where the call becomes too slow to to wait for it to return, and I abort the experiment.

Digging a bit more, I tried to see if there was some kind of pattern in the molecules that were being returned. And there was. I added this to the script:

for i in range(20):  # Check the first 20 mols
    needle = w[i]
    idxs = [i]
    pos = i
    
    # Find the indexes of repeated molecules
    try:
        for _ in range(len(w)):
            pos = w.index(needle, pos + 1)
            idxs.append(pos)
    except ValueError:
        pass

    # Try to find a pattern in the repetitions
    deltas = [b - a for a, b in zip(idxs, idxs[1:])]
    equal = all_equal(deltas)
    print(f'indexes: {idxs}')
    print(f'deltas : {deltas}')
    print(f'all equal: {equal}')
    print('')

Which prints:

indexes: [0, 43690, 87380, 131070, 174760, 218450, 262140, 305830, 349520, 393210, 436900, 480590]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [1, 43691, 87381, 131071, 174761, 218451, 262141, 305831, 349521, 393211, 436901, 480591]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [2, 43692, 87382, 131072, 174762, 218452, 262142, 305832, 349522, 393212, 436902, 480592]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [3, 43693, 87383, 131073, 174763, 218453, 262143, 305833, 349523, 393213, 436903, 480593]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [4, 43694, 87384, 131074, 174764, 218454, 262144, 305834, 349524, 393214, 436904, 480594]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [5, 43695, 87385, 131075, 174765, 218455, 262145, 305835, 349525, 393215, 436905, 480595]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [6, 43696, 87386, 131076, 174766, 218456, 262146, 305836, 349526, 393216, 436906, 480596]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [7, 43697, 87387, 131077, 174767, 218457, 262147, 305837, 349527, 393217, 436907, 480597]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [8, 43698, 87388, 131078, 174768, 218458, 262148, 305838, 349528, 393218, 436908, 480598]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [9, 43699, 87389, 131079, 174769, 218459, 262149, 305839, 349529, 393219, 436909, 480599]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [10, 43700, 87390, 131080, 174770, 218460, 262150, 305840, 349530, 393220, 436910, 480600]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [11, 43701, 87391, 131081, 174771, 218461, 262151, 305841, 349531, 393221, 436911, 480601]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [12, 43702, 87392, 131082, 174772, 218462, 262152, 305842, 349532, 393222, 436912, 480602]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [13, 43703, 87393, 131083, 174773, 218463, 262153, 305843, 349533, 393223, 436913, 480603]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [14, 43704, 87394, 131084, 174774, 218464, 262154, 305844, 349534, 393224, 436914, 480604]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [15, 43705, 87395, 131085, 174775, 218465, 262155, 305845, 349535, 393225, 436915, 480605]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [16, 43706, 87396, 131086, 174776, 218466, 262156, 305846, 349536, 393226, 436916, 480606]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [17, 43707, 87397, 131087, 174777, 218467, 262157, 305847, 349537, 393227, 436917, 480607]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [18, 43708, 87398, 131088, 174778, 218468, 262158, 305848, 349538, 393228, 436918, 480608]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

indexes: [19, 43709, 87399, 131089, 174779, 218469, 262159, 305849, 349539, 393229, 436919, 480609]
deltas : [43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690, 43690]
all equal: True

So, basically, GetSubstructMatches() finds the first 43690 matches, and then loops and starts finding them again. Adding or removing some molecules from the input file do not make this number change.

@ricrogz ricrogz added the bug label Sep 25, 2021
@ricrogz ricrogz self-assigned this Sep 25, 2021
@greglandrum greglandrum modified the milestones: 2021_03_6, 2021_09_1 Sep 25, 2021
greglandrum pushed a commit that referenced this issue Sep 25, 2021
* fix

* add test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants