You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The attached solvent.sdf.gz file contains a large number of atoms (84525 atoms in 28175 water molecules with explicit Hydrogen atoms).
Running the following script should successfully match and return all the water molecules. uniquify is intentionally disabled to allow a quick execution of the query, so we expect the script to print that it found 56350 matches (since the query is symmetric, it can match the same molecule as H-O-H' and also as H'-O-H).
from rdkit import Chem
from itertools import groupby
def all_equal(iterable):
g = groupby(iterable)
return next(g, True) and not next(g, False)
mol = Chem.MolFromMolFile('solvent.sdf', sanitize=False, removeHs=False)
print(f'num atoms: {mol.GetNumAtoms()}')
q = Chem.MolFromSmarts('[H]O[H]')
max_sz = 500000
w = mol.GetSubstructMatches(q, uniquify=False, maxMatches=max_sz)
print(len(w))
On my computer, and with a Debug build, the script returns after 7 seconds, but it surprisingly prints:
500000
Since there's only 28175 water molecules, at least one of them must have been returned multiple (more than 2) times. Notice that we got the maximum number of matches we allowed. And even if we further increase this number, we always get returned the maximum number of matches, up to the point where the call becomes too slow to to wait for it to return, and I abort the experiment.
Digging a bit more, I tried to see if there was some kind of pattern in the molecules that were being returned. And there was. I added this to the script:
for i in range(20): # Check the first 20 mols
needle = w[i]
idxs = [i]
pos = i
# Find the indexes of repeated molecules
try:
for _ in range(len(w)):
pos = w.index(needle, pos + 1)
idxs.append(pos)
except ValueError:
pass
# Try to find a pattern in the repetitions
deltas = [b - a for a, b in zip(idxs, idxs[1:])]
equal = all_equal(deltas)
print(f'indexes: {idxs}')
print(f'deltas : {deltas}')
print(f'all equal: {equal}')
print('')
So, basically, GetSubstructMatches() finds the first 43690 matches, and then loops and starts finding them again. Adding or removing some molecules from the input file do not make this number change.
The text was updated successfully, but these errors were encountered:
The attached solvent.sdf.gz file contains a large number of atoms (84525 atoms in 28175 water molecules with explicit Hydrogen atoms).
Running the following script should successfully match and return all the water molecules.
uniquify
is intentionally disabled to allow a quick execution of the query, so we expect the script to print that it found 56350 matches (since the query is symmetric, it can match the same molecule as H-O-H' and also as H'-O-H).On my computer, and with a Debug build, the script returns after 7 seconds, but it surprisingly prints:
Since there's only 28175 water molecules, at least one of them must have been returned multiple (more than 2) times. Notice that we got the maximum number of matches we allowed. And even if we further increase this number, we always get returned the maximum number of matches, up to the point where the call becomes too slow to to wait for it to return, and I abort the experiment.
Digging a bit more, I tried to see if there was some kind of pattern in the molecules that were being returned. And there was. I added this to the script:
Which prints:
So, basically,
GetSubstructMatches()
finds the first 43690 matches, and then loops and starts finding them again. Adding or removing some molecules from the input file do not make this number change.The text was updated successfully, but these errors were encountered: