Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove check for ring information from Atom::Match #6063

Merged
merged 2 commits into from
Feb 8, 2023

Conversation

fwaibl
Copy link
Contributor

@fwaibl fwaibl commented Feb 7, 2023

Description

I ran into a case where I had two different SMILES for identical molecules, but no matching was found (i.e., HasSubstructMatch returned False)

A minimal example would be:

mol1 = Chem.MolFromSmiles('c12cc(c(Cc3ccc(cc3)Cc5ccc(C2)cc5)cc1)')
mol2 = Chem.MolFromSmiles('c1cc2ccc1Cc1ccc(cc1)Cc1ccc(cc1)C2')

In that case, Chem.MolToSmiles(mol1) == Chem.MolToSmiles(mol2) is True, but mol1.GetSubstructMatch(mol2) is empty.

It seems that this is a difficult case for the ring searching algorithm, such that len(Chem.GetSymmSSSR(mol1)), len(Chem.GetSymmSSSR(mol2)) is (7, 9). Since the number of rings is queried in Atom::Match, the substructure search fails as well.

This pull request removes the check for the ring number from Atom::Match in Code/GraphMol/Atom.cpp, so that the substructure search is successful. It also adds a test case for a slightly smaller version of that molecule (1 ring less).

All other tests (ran using ctest) are still passing.

Timings

I also did some timing checks, and it does not seem to make a big difference. Running new_timings.py without (top) and with (bottom) the change gives:

| 2023.03.1pre | 6.6 | 3.3 | 2.1 | 0.0 | 13.0 | 13.7 | 0.0 | 52.7 | 51.6 | 12.5 | 15.8 | 7.6 | 56.5 | 2.2 |
| 2023.03.1pre | 6.2 | 3.1 | 2.0 | 0.0 | 12.9 | 13.6 | 0.0 | 52.3 | 52.1 | 12.2 | 15.2 | 7.4 | 56.2 | 2.2 |

I also ran some small tests myself (using cyclosporin A, a linear peptide with sequence "DGAPSTE", as well as the molecule mentioned above), and did not find a large difference, except for the case were no alignment was found previously. The timings were collected using the timeit.repeat function in Python, with 1000 runs and 5 repeats.

Without the change:

Complicated ring system:
Can match: False
Average times for substructure search: 3.3, 3.4, 3.3, 3.3, 3.3 µs
Cyclosporin A
Average times for substructure search: 453.8, 464.6, 468.4, 466.7, 466.0 µs
Cyclosporin A, search for amides
Average times for substructure search: 8.4, 9.9, 8.5, 8.8, 8.4 µs
Cyclosporin A, search for (nonexistent) prolines
Average times for substructure search: 5.7, 5.6, 5.6, 6.0, 5.5 µs
linear peptide, search for prolines
Average times for substructure search: 4.4, 4.6, 4.5, 4.5, 4.4 µs

With the change:

Complicated ring system:
Can match: True
Average times for substructure search: 86.7, 89.0, 89.9, 89.9, 89.6 µs
Cyclosporin A
Average times for substructure search: 410.4, 413.1, 411.0, 411.1, 413.5 µs
Cyclosporin A, search for amides
Average times for substructure search: 8.6, 8.5, 8.5, 8.5, 8.5 µs
Cyclosporin A, search for (nonexistent) prolines
Average times for substructure search: 5.6, 5.6, 5.5, 5.6, 5.5 µs
linear peptide, search for prolines
Average times for substructure search: 4.4, 4.4, 4.4, 4.4, 4.4 µs

Questions

  • Could this change lead to other unexpected issues?
  • Are there other things I should try out?

I would be very grateful if you could have a look at this. This is my first pull request to rdkit, so please let me know if I forgot something.

Best regards,
Franz Waibl

Franz Waibl added 2 commits February 1, 2023 09:42
In high-symmetry cases where the symmetric SSSR does not find all
possible rings, substructure searches can fail because of this check.
Removing it fixes those cases, but is likely to decrease the performance
of substructure matching.

Also, adds a unit test where the old code fails
Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@greglandrum greglandrum added the bug label Feb 8, 2023
@greglandrum greglandrum added this to the 2022_09_5 milestone Feb 8, 2023
@greglandrum greglandrum merged commit c99115e into rdkit:master Feb 8, 2023
@greglandrum
Copy link
Member

Thanks for this well-explained and tested fix @fwaibl !

greglandrum pushed a commit that referenced this pull request Feb 23, 2023
In high-symmetry cases where the symmetric SSSR does not find all
possible rings, substructure searches can fail because of this check.
Removing it fixes those cases, but is likely to decrease the performance
of substructure matching.

Also, adds a unit test where the old code fails

Co-authored-by: Franz Waibl <waiblfranz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants