Enhanced Stereochemistry canonicalization errors #7041

Open
wants to merge 89 commits into
base: master


@tadhurst-cdd commented Jan 12, 2024

Reference Issue

Enhanced Stereochemistry canonicalization errors

What does this implement/fix? Explain your changes.

Many compounds can be written as SMILES with different enhanced stereochemistry specifications that nonetheless describe the same compound. For example:

N[C@H]1CC[C@@H](O)CC1 |a:1,4|
N[C@H]1CC[C@@H](O)CC1 |o1:1,4|
N[C@H]1CC[C@@H](O)CC1 |&1:1,4|
N[C@@H]1CC[C@H](O)CC1 |a:1,4|
N[C@@H]1CC[C@H](O)CC1 |o1:1,4|
N[C@@H]1CC[C@H](O)CC1 |&1:1,4|

These all describe the same compound, but without the new code they generate different canonical SMILES.

These are also the same:

C[C@@H](Cl)C[C@H](C)Cl |a:1,4,|
C[C@@H](Cl)C[C@H](C)Cl |o1:1,4,|
C[C@H](Cl)C[C@@H](C)Cl |a:1,4,|
C[C@@H](Cl)C[C@H](C)Cl |&1:1,4,|
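
For readers less familiar with the notation: in the extension block after the SMILES, `a:` marks an absolute group, `o<n>:` an OR group and `&<n>:` an AND group, each followed by zero-based atom indices. The following is a minimal Python sketch that pulls those groups out of the strings above. The function name `parse_enhanced_stereo` is hypothetical, it is not part of RDKit, and it only handles the enhanced stereo fields rather than the full CXSMILES grammar.

```python
import re

# One enhanced-stereo field inside the |...| block:
# "a" (absolute), "o<n>" (OR group n) or "&<n>" (AND group n), then atom indices.
_FIELD = re.compile(r"(a|o\d+|&\d+):([\d,]*)")


def parse_enhanced_stereo(cxsmiles):
    """Split 'SMILES |ext|' into the SMILES and a {label: [atom indices]} dict.

    Hypothetical helper for illustration only; not an RDKit function.
    """
    smiles, _, ext = cxsmiles.partition(" ")
    ext = ext.strip().strip("|")
    groups = {}
    for label, atoms in _FIELD.findall(ext):
        # Tolerate a trailing comma such as "a:1,4,"
        groups[label] = [int(i) for i in atoms.split(",") if i]
    return smiles, groups


print(parse_enhanced_stereo("N[C@H]1CC[C@@H](O)CC1 |o1:1,4|"))
print(parse_enhanced_stereo("C[C@@H](Cl)C[C@H](C)Cl |a:1,4,|"))
```

In each example the two listed indices (1 and 4) are the two stereocentres, so every variant differs only in the group type and in which mirror-image drawing was used.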

Any other comments?

@mc-robinson

@tadhurst-cdd this looks like a very nice change. It may go towards fixing an issue I recently reported, #7266.

@greglandrum self-assigned this Apr 13, 2024
@tadhurst-cdd
Contributor Author

Addressed changes to fix errors in tests provided by Greg Landrum. There were a couple of fixes, and the code now does NOT throw an error if the enhanced procedure does not work, but simply calls the old canonicalization method.

@tadhurst-cdd
Contributor Author

I know that one concern is the performance of the new rigorousEnhancedStereo functionality in RDKit.

I do not think that this is a major problem.

First, the difference between time required to do the canonicalization WITHOUT the new functionality and WITH the new stuff is a comparison between producing incorrect results and producing correct results. I think we really want the correct results.

Second, the new stuff does not affect the time required to canonicalize structures that do NOT have enhanced stereochemistry. Less than 4% of the structures our customers have registered in CDD Vault contain enhanced stereochemistry, so the impact is very small.

The currently suggested method does this:

Enumerates the possible structures that the enhanced notation represents.
Produces a unique SMILES for each.
Makes a unique list of those SMILES.
Converts that list of SMILES into a list of mols to be represented.
Finds a canonical enhanced stereo representation that expresses that unique list.
(The steps listed above are done twice: once for any OR type enhanced markers, and once for the AND type markers.)
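
The first three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in this PR: molecules are reduced to a dict of chirality tags, the centres of one stereo group are inverted together, and `toy_canonicalize` is a hypothetical stand-in for the real canonical SMILES generation that simply treats a configuration and its mirror image as identical (which holds for the achiral cyclohexane example, but not in general).

```python
from itertools import product


def flip(tag):
    """Invert a SMILES chirality tag."""
    return "@" if tag == "@@" else "@@"


def enumerate_configs(tags, groups):
    """Yield every tag assignment that the stereo groups represent.

    tags   : dict of atom index -> "@" or "@@" as drawn
    groups : list of lists of atom indices; the centres of one group
             are inverted together, so their relative configuration is kept
    """
    for choice in product((False, True), repeat=len(groups)):
        config = dict(tags)
        for invert, atoms in zip(choice, groups):
            if invert:
                for atom in atoms:
                    config[atom] = flip(config[atom])
        yield config


def toy_canonicalize(config):
    """Hypothetical canonicalizer: a config and its mirror image get one key."""
    key = tuple(config[a] for a in sorted(config))
    mirrored = tuple(flip(t) for t in key)
    return min(key, mirrored)


def unique_structures(tags, groups, canonicalize):
    """Enumerate, canonicalize each member, and deduplicate."""
    return sorted({canonicalize(c) for c in enumerate_configs(tags, groups)})


# N[C@H]1CC[C@@H](O)CC1 with both centres in one OR (or AND) group
tags = {1: "@", 4: "@@"}
print(unique_structures(tags, [[1, 4]], toy_canonicalize))
```

With this toy canonicalizer the single group collapses to one unique structure, which is the situation in the first example: the OR, AND and absolute forms all express the same list. The remaining steps (rebuilding mols and choosing the canonical enhanced stereo representation) are not covered by the sketch.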

This method relies heavily on the current functionality for producing canonical SMILES for stereo-labeled compounds, and that is the source of computational complexity. It would be possible to have the new method NOT actually generate and subsequently parse the canonical SMILES, but the work of canonicalization would still need to be done. I doubt that any substantial improvement in performance could be made.

One possible change to the method might be to produce, more directly, a list of mols by reordering the atoms and bonds according to the canonical atom rankings. It would be necessary to be able to compare and sort these mols to produce a unique list.
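
A rough Python sketch of that idea, under simplifying assumptions: a molecule is just a list of atom labels plus a list of bonds, the canonical ranks are taken as given (a permutation with ties already broken), and any stereo information in the labels is assumed to be independent of atom order. The function `canonical_key` is hypothetical and is not an RDKit API; real chirality tags depend on neighbor ordering and would need extra handling when atoms are renumbered.

```python
def canonical_key(atoms, bonds, ranks):
    """Build a hashable, sortable key for a molecule graph.

    atoms : list of atom labels, e.g. "C", "N"
    bonds : list of (i, j, order) tuples using original atom indices
    ranks : ranks[i] is the canonical position of atom i
            (assumed to be a permutation of 0..n-1)
    """
    # Place each atom label at its canonical position.
    ordered_atoms = [None] * len(atoms)
    for i, label in enumerate(atoms):
        ordered_atoms[ranks[i]] = label
    # Renumber bond endpoints, orient each bond low-to-high, then sort.
    ordered_bonds = sorted(
        (min(ranks[i], ranks[j]), max(ranks[i], ranks[j]), order)
        for i, j, order in bonds
    )
    return tuple(ordered_atoms), tuple(ordered_bonds)


# The same N-C-O chain written with two different atom numberings
key_a = canonical_key(["N", "C", "O"], [(0, 1, 1), (1, 2, 1)], [0, 1, 2])
key_b = canonical_key(["O", "C", "N"], [(0, 1, 1), (1, 2, 1)], [2, 1, 0])
print(key_a == key_b)

# Keys are plain tuples, so a unique, sorted list is a set followed by sorted()
print(sorted({key_a, key_b}))
```

Because the keys are tuples, comparison and sorting come for free, which is the property the unique list needs. Whether this would be faster than the round trip through SMILES is not something the sketch can show.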

I am interested in other thoughts and suggestions.

tad
