Add MolVS tautomer canonicalization #2886

greglandrum · 2020-01-16T14:09:17Z

During the 2018 GSoC project to do a C++ implementation of MolVS, tautomer enumeration and canonicalization were stretch goals. @susanhleung actually managed to complete the tautomer enumeration, but since canonicalization wasn't complete, we didn't publicize this particularly widely.

This PR does the last bit of work and adds tautomer canonicalization.

Notes to reviewers:

The goal of this first merge is to implement the tautomer scoring and canonicalization schemes that are used in MolVS. Once we have that in place the arguments can start (if really necessary) about whether or not we want a different/modified default scoring scheme.
I will probably be adding some additional tests over the next couple of days. Those won't change the design or (hopefully) the scoring/canonicalization code itself, so it should be fine to start looking at this.

This isn't optimal in terms of performance, but all the MolVS tests pass.

bp-kelley · 2020-01-16T23:32:35Z

Code/GraphMol/MolStandardize/Tautomer.cpp

+  std::string d_smarts;
+  std::shared_ptr<ROMol> dp_mol;
+  smarts_mol_holder(const std::string &smarts) : d_smarts(smarts) {
+    dp_mol.reset(SmartsToMol(smarts));


Should we check that this isn't set to a null SMARTS?

Looks like it's all internal, so no.

It's also tested at match time (line 130)
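The pattern discussed in this thread, keeping the SMARTS text next to a parsed query that may be null and checking it only when matching, can be sketched in plain Python. This is a minimal illustration and not the RDKit code: `PatternHolder` and `parse_pattern` are hypothetical names, and `parse_pattern` is a stand-in (plain substring search) for a real SMARTS parser that returns a null result on failure.

```python
from typing import Callable, Optional


def parse_pattern(text: str) -> Optional[Callable[[str], bool]]:
    """Hypothetical stand-in for a SMARTS parser: returns None if parsing fails."""
    if not text:
        return None  # mimic a parser that yields a null query for bad input
    # Assumption: substring search stands in for real substructure matching.
    return lambda target: text in target


class PatternHolder:
    """Keeps the pattern text together with its (possibly null) parsed form."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.query = parse_pattern(text)  # may be None; not checked here

    def matches(self, target: str) -> bool:
        # The null check happens at match time rather than at construction.
        if self.query is None:
            raise ValueError(f"could not parse pattern: {self.text!r}")
        return self.query(target)
```

Deferring the check keeps construction cheap and still fails loudly the first time a bad pattern is actually used.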

bp-kelley · 2020-01-16T23:35:46Z

Code/GraphMol/MolStandardize/Tautomer.cpp

+  // a note on efficiency here: we'll construct the SubstructTerm objects here
+  // repeatedly, but the SMARTS parsing for each entry will only be done once
+  // since we're using the boost::flyweights above to cache them
+  const std::vector<SubstructTerm> substructureTerms{


Part of me thinks that these structs + scores should be passed in to be easier to modify.

It looks like you've captured this in the score func though.

I thought about adding that option, but then figured it's more straightforward from the API perspective to just leave it out since the user can always provide their own scoring function.

bp-kelley · 2020-01-16T23:38:42Z

Code/GraphMol/MolStandardize/Wrap/testMolStandardize.py

+    ctaut = enumerator.Canonicalize(m, scorefunc1)
+    self.assertEqual(Chem.MolToSmiles(ctaut), "OC1=CCCCC1")
+    ctaut = enumerator.Canonicalize(m, scorefunc2)
+    self.assertEqual(Chem.MolToSmiles(ctaut), "O=C1CCCCC1")


Might be worth writing a function with the wrong API to see what happens :)

Good point. I also added one to make sure/demonstrate that you can use lambdas from Python
(boost.python is absolutely magic).

Yeah, this aspect of boost is pretty awesome.
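The idea exercised by the test in this thread, choosing the canonical tautomer with a caller-supplied scoring function, can be sketched without RDKit. Everything here is an assumption made for illustration: `canonicalize` is a hypothetical plain-Python function operating on SMILES strings, the tie-break by alphabetical SMILES is this sketch's choice, and the two lambdas only mirror the spirit of `scorefunc1` and `scorefunc2`.

```python
from typing import Callable, Iterable


def canonicalize(tautomers: Iterable[str],
                 scorefunc: Callable[[str], int]) -> str:
    """Return the highest-scoring tautomer; ties go to the smallest SMILES."""
    # Key: negated score so higher scores sort first, then the SMILES
    # string itself as a tie-breaker (an assumption of this sketch).
    return min(tautomers, key=lambda smi: (-scorefunc(smi), smi))


tautomers = ["O=C1CCCCC1", "OC1=CCCCC1"]

# Two user-supplied scoring functions with opposite preferences,
# written as lambdas to show that any callable works.
prefer_enol = lambda smi: 1 if smi.startswith("OC") else 0
prefer_keto = lambda smi: 1 if smi.startswith("O=C") else 0

enol = canonicalize(tautomers, prefer_enol)  # "OC1=CCCCC1"
keto = canonicalize(tautomers, prefer_keto)  # "O=C1CCCCC1"


def wrong_api() -> int:
    """Takes no arguments, so it cannot be called with a tautomer."""
    return 0
```

In this pure-Python sketch, passing `wrong_api` fails with a `TypeError` on the first call. How the boost.python wrapper reports the same mistake is not shown here.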

greglandrum added 10 commits January 15, 2020 15:47

first pass at implementing molvs-style tautomer scoring

b591a7e

This isn't optimal in terms of performance, but all the MolVS tests pass.

clang format

09d930e

A bit of refactoring of the tautomer stuff

07e1f04

first pass at python wrappers

9c1f426

allow specifying the tautomer scoring function from C++

3863dd8

EFF: use boost::flyweight so SMARTS is only parsed once

fe0f5a0

improve the python API

7192198

switch to boost::function instead of using function pointers

ff021a1

allow user-provided tautomer scoring functions

a89c306

documentation and scorer version

aa14e3e

greglandrum added the enhancement label Jan 16, 2020

greglandrum added this to the 2020_03_1 milestone Jan 16, 2020

bp-kelley reviewed Jan 16, 2020

change in response to review

59d3c15

bp-kelley approved these changes Jan 17, 2020

bp-kelley merged commit f8a4020 into rdkit:master Jan 17, 2020
