RDKit learns how to filter PAINS/BRENK/ZINC/NIH via FilterCatalog #536

Merged
merged 3 commits into rdkit:master on Jul 15, 2015

bp-kelley commented Jul 14, 2015

FilterCatalogs give RDKit the ability to screen out or reject
undesirable molecules based on various criteria. Supplied
with RDKit are the following filter sets:

  • PAINS - Pan assay interference patterns.
    These are separated into three sets PAINS_A, PAINS_B and PAINS_C.
    Reference: Baell JB, Holloway GA. New Substructure Filters for
    Removal of Pan Assay Interference Compounds (PAINS)
    from Screening Libraries and for Their Exclusion in
    Bioassays.
    J Med Chem 53 (2010) 2719-40. doi:10.1021/jm901137j.
  • BRENK - filters unwanted functionality due to potential tox reasons
    or unfavorable pharmacokinetics.
    Reference: Brenk R et al. Lessons Learnt from Assembling Screening
    Libraries for Drug Discovery for Neglected Diseases.
    ChemMedChem 3 (2008) 435-444. doi:10.1002/cmdc.200700139.
  • NIH - annotated compounds with problematic functional groups
    Reference: Doveston R, et al. A Unified Lead-oriented Synthesis of
    over Fifty Molecular Scaffolds. Org Biomol Chem 13
    (2014) 859-65.
    doi:10.1039/C4OB02287D.
    Reference: Jadhav A, et al. Quantitative Analyses of Aggregation,
    Autofluorescence, and Reactivity Artifacts in a Screen
    for Inhibitors of a Thiol Protease.
    J Med Chem 53 (2009) 37-51. doi:10.1021/jm901070c.
  • ZINC - Filtering based on drug-likeness and unwanted functional
    groups
    Reference: http://blaster.docking.org/filtering/

The following are C++ and Python examples of how to filter molecules.

[C++]

#include <GraphMol/FilterCatalog.h>

using namespace RDKit;

SmilesMolSupplier suppl(…);

// setup the desired catalogs
FilterCatalogParams params;
params.addCatalog(FilterCatalogParams::PAINS_A);
params.addCatalog(FilterCatalogParams::PAINS_B);
params.addCatalog(FilterCatalogParams::PAINS_C);

// create the catalog
FilterCatalog catalog(params);

std::unique_ptr<ROMol> mol; // automatically cleans up after us
int count = 0;
while(!suppl.atEnd()){
  mol.reset(suppl.next());
  TEST_ASSERT(mol.get());

  // Does a PAINS filter hit?
  if (catalog.hasMatch(*mol)) {
    std::cerr << "Warning: molecule failed filter " << std::endl;
  }

  // More detailed data by retrieving the catalog entry
  const FilterCatalogEntry *entry = catalog.getFirstMatch(*mol);
  if (entry) {
    std::cerr << "Warning: molecule failed filter: reason " <<
      entry->getDescription() << std::endl;

    // get the matched substructure atoms for visualization
    std::vector<FilterMatch> matches;
    if (entry->getFilterMatches(*mol, matches)) {
      for(std::vector<FilterMatch>::const_iterator it = matches.begin();
          it != matches.end(); ++it) {
        // Get the SmartsMatcherBase that matched
        const FilterMatch & fm = (*it);
        boost::shared_ptr<SmartsMatcherBase> matchingFilter =
          fm.filterMatch;

        // Get the matching atom indices
        // (the inner iterator gets its own name so it does not shadow `it`)
        const MatchVectType &vect = fm.atomPairs;
        for (MatchVectType::const_iterator pairIt = vect.begin();
             pairIt != vect.end(); ++pairIt) {
             int atomIdx = pairIt->second;
        }

      }
    }
  }
  count ++;
} // end while

Python API

import sys
from rdkit.Chem import FilterCatalog

params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS_A)
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS_B)
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS_C)
catalog = FilterCatalog.FilterCatalog(params)

...
for mol in mols:
    if catalog.HasMatch(mol):
        print("Warning: molecule failed filter", file=sys.stderr)
    # more detailed
    entry = catalog.GetFirstMatch(mol)
    if entry:
        print("Warning: molecule failed filter: reason %s" % (
            entry.GetDescription()), file=sys.stderr)

        # get to the atoms involved in the substructure
        #  there may be many matching filters here...
        for filterMatch in entry.GetFilterMatches(mol):
            matcher = filterMatch.filterMatch
            # get a description of the matching filter
            print(matcher)
            for queryAtomIdx, atomIdx in filterMatch.atomPairs:
                # do something with the substructure matches
                pass

Advanced

FilterCatalogs are fully serializable and can be stored for later use.

To serialize a catalog, use the catalog.Serialize() method:
std::string pickle = catalog.Serialize();

To deserialize, pass the resulting string to the constructor:
FilterCatalog catalog(pickle);

The underlying matchers can be arbitrarily complicated and new
ones with more complicated semantics can be created. The default
matching objects are:

SmartsMatcher - match a SMARTS pattern or query molecule with a minimum
and maximum count
ExclusionList - returns false if any of the supplied patterns match

And - true only if both matchers are true
Or - true if either of the two matchers is true
Not - invert the match (note that this can have confusing semantics
when dealing with substructure matches)

Entries can be added at any time to a catalog:

ExclusionList excludedList;

excludedList.addPattern(SmartsMatcher("Pattern 1", smarts)); 
excludedList.addPattern(SmartsMatcher("Pattern 2", smarts2)); 

A FilterCatalog supports a few different types of matching. One is
a traditional rejection filter where if a substructure exists in
the target molecule, the molecule is rejected.

These types of queries can indicate the substructure that triggered
the rejection through the FilterCatalogEntry::GetMatch(mol)
function.

The FilterCatalog also supports acceptance filters, which are
designed to indicate which molecules are ok. These have
to be transformed into rejection filters or simply wrapped in a
Not( acceptanceFilter ) when entered into the catalog. For example,
from ZINC:

carbons [#6] 40

means that we have a maximum of 40 carbon atoms. We can write this by
converting the max count to a min count (i.e. the pattern is triggered
when the molecule has at least minCount matching atoms):

const unsigned int minCount = 40+1;
SmartsMatcher( "Too many carbons", "[#6]", minCount );

This can be properly substructure searched.

Or we can wrap this in a Not:

const unsigned int minCount = 0;
const unsigned int maxCount = 40;
Not( SmartsMatcher( "ok number of carbons", "[#6]", minCount, maxCount) );

Note: Wrapping in a Not loses the ability to highlight the rejecting
pattern when visualizing the molecule.
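
The equivalence of the two formulations can be checked with a small, library-independent sketch. The `CountMatcher` and `Not` classes below are hypothetical stand-ins written for illustration only (they are not the RDKit matcher classes), and a molecule is modelled simply as a list of atomic numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CountMatcher:
    """Hypothetical stand-in: matches when min_count <= occurrences <= max_count."""
    name: str
    atomic_num: int
    min_count: int = 1
    max_count: int = 2**32 - 1

    def has_match(self, atoms: List[int]) -> bool:
        n = atoms.count(self.atomic_num)
        return self.min_count <= n <= self.max_count


@dataclass
class Not:
    """Hypothetical stand-in: inverts the wrapped matcher."""
    inner: CountMatcher

    def has_match(self, atoms: List[int]) -> bool:
        return not self.inner.has_match(atoms)


# Rejection form: fires at 41 or more carbons.
too_many_carbons = CountMatcher("Too many carbons", 6, min_count=41)
# Acceptance form (0..40 carbons is ok), inverted into a rejection.
not_ok_carbons = Not(CountMatcher("ok number of carbons", 6, 0, 40))


def rejected_both_ways(n_carbons: int) -> Tuple[bool, bool]:
    """Return (rejection-form result, inverted-acceptance result)."""
    atoms = [6] * n_carbons + [8]  # n carbons plus one oxygen
    return too_many_carbons.has_match(atoms), not_ok_carbons.has_match(atoms)
```

Both forms agree on which molecules are rejected; the difference noted above is only that the direct rejection form still has a concrete match to highlight.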

greglandrum commented Jul 14, 2015

@bp-kelley : I think you forgot to include $RDBASE/rdkit/Chem/FilterCatalog in the PR

bp-kelley commented Jul 14, 2015

You are correct, it's in my tree but not in the pull request!



@greglandrum greglandrum merged commit ad99cc6 into rdkit:master Jul 15, 2015

0 of 2 checks passed

continuous-integration/appveyor: AppVeyor build failed
continuous-integration/travis-ci/pr: The Travis CI build failed

@greglandrum greglandrum added this to the 2015_09_1 milestone Jul 16, 2015

apahl commented Jul 22, 2015

This filter set is a very interesting addition to the RDKit, esp. the PAINS filters, thanks a lot.
I have tried the original SMARTS filters from the KNIME implementation of the filters (Simon Saubern, Rajarshi Guha and Jonathan B. Baell; KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries; Mol. Inf. 2011, 30 (10), 847-850) and could increase their compound hit count with RDKit/Python in their example data set of 10k compounds from 329 to 799 by adding hydrogens to the molecules. However, this is still less than what PPilot finds (858) and what the original SLN filters found (861).
My question is:
Is anyone aware of a reference compound set that covers all (480, I think) SMARTS filters of the PAINS set?
I have just run the whole ChEMBL20 with them and found that only 322 filters (out of 480) were hitting (with PPilot).
I could of course go on and try other open databases and hope to hit more, but for such a prominent filter set like the PAINS, it would be really great to have a reference compound set.

Kind regards,
Axel

bp-kelley commented Jul 22, 2015

Axel,
Thanks for the comments. One of the reference datasets (used in the original paper) is MDDR 2008, which we could use to generate a reference set, but I don't know the hit count. I believe ChEMBL may also be a good reference, but I will have to look at that.

Practically, we have found that a lot of differences are actually caused by the aromaticity perception of the underlying chemistry engine; however, this is just a guess at this point (and doesn't take into account the SLN->SMARTS conversion). I know of a few alternate implementations of PAINS that could perhaps be used to generate a proper reference set, which (based on your analysis) seems like the right thing to do at this point, if only for validation purposes.

Cheers,
Brian

greglandrum commented Jul 22, 2015

Given that Brian and I are both sitting in the same room with Rajarshi Guha for the next couple days, I think we should be able to come up with something here. :-)

apahl commented Jul 22, 2015

That sounds just great!!

Kind regards,
Axel

greglandrum commented Jul 22, 2015

I just did a bit of experimentation.
I was going to suggest that Axel shouldn't need to add Hs to make the queries work. The "mergeHs" argument to MolFromSmarts() should really prevent that. However... while testing I found a bug, #544, which is relevant to this discussion.

apahl commented Jul 23, 2015

I don't know, Greg, just using mergeHs does not work for me, here is an example:

In [26]:
# Filter indol_3yl_alk(461) from the PAINS set:
smarts = 'n:1(c(c(c:2:c:1:c:c:c:c:2-[#1])-[#6;X4]-[#1])-[$([#6](-[#1])-[#1]),$([#6]=,:[!#6&!#1]),$([#6](-[#1])-[#7]),$([#6](-[#1])(-[#6](-[#1])-[#1])-[#6](-[#1])(-[#1])-[#7](-[#1])-[#6](-[#1])-[#1])])-[$([#1]),$([#6](-[#1])-[#1])]'
mol = Chem.MolFromSmiles("Cc1ccc(NC(=O)Cc2c(C(=O)O)[nH]c3ccccc23)c(Br)c1")

In [27]:
smarts_mol = Chem.MolFromSmarts(smarts)
mol.HasSubstructMatch(smarts_mol)

Out[27]:
False

In [28]:
smarts_mol = Chem.MolFromSmarts(smarts, mergeHs=True)
mol.HasSubstructMatch(smarts_mol)

Out[28]:
False

In [29]:
mol_h = Chem.AddHs(mol)
smarts_mol = Chem.MolFromSmarts(smarts)
mol_h.HasSubstructMatch(smarts_mol)

Out[29]:
True

Kind regards,
Axel

greglandrum commented Jul 23, 2015

Hi Axel,

There are (at least) two problems there:

  • a problem in the way Hs are merged in recursive SMARTS : #544. I have made some progress on this
  • there is one piece of the SMARTS: '[$(#1)]' that isn't currently well handled (and will be difficult to handle)

I'm going to do some bulk testing here to find other incorrect examples to help get as close as possible to the bottom of this.

apahl commented Jul 24, 2015

Thanks a lot, Greg.
Sorry for opening this can of worms....

greglandrum commented Aug 4, 2015

@apahl and @bp-kelley ,
Here's where this stands: Yesterday I checked in a few more code changes along with a bunch of edits of the SMARTS from the KNIME workflow. These edits are mainly intended to remove explicit Hs appearing in the SMARTS that the H merging code will not be able to handle. A lot of these are explicit Hs inside an atom list (an OR of atom types); these need to be merged by hand. For example, I convert things like C-[#1,#6,#7] to [C;!H0,$(C-[#6,#7])].
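
As a rough illustration of why that hand-merged form is meant to be equivalent, the sketch below runs the explicit-H query against a molecule with Hs added and the merged query against the same molecule without explicit Hs. It assumes RDKit is installed; `query_hits` is a hypothetical helper name, and the import sits inside the function only so that the definition itself loads without RDKit.

```python
from typing import Tuple


def query_hits(smiles: str) -> Tuple[bool, bool]:
    """Return (explicit-H query on H-added mol, merged query on H-suppressed mol)."""
    # Imported lazily; this sketch assumes RDKit is available when called.
    from rdkit import Chem

    explicit_h_query = Chem.MolFromSmarts("C-[#1,#6,#7]")
    merged_query = Chem.MolFromSmarts("[C;!H0,$(C-[#6,#7])]")

    mol = Chem.MolFromSmiles(smiles)  # hydrogens are implicit here
    mol_h = Chem.AddHs(mol)           # hydrogens become real atoms

    with_hs = mol_h.HasSubstructMatch(explicit_h_query)
    without_hs = mol.HasSubstructMatch(merged_query)
    return with_hs, without_hs
```

If the two values in the returned pair differ for some molecule, that molecule is the kind of discrepancy the test script below is looking for.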

I tested using the 10000 WEHI molecules that were part of the KNIME workflow.
The test code looks like this:

import csv
import sys
from rdkit import Chem

smas = [x[0] for x in csv.reader(open('./wehi_pains.csv'))]
osmas = [x[0] for x in csv.reader(open('./wehi_pains.orig.csv'))]
if len(sys.argv)>1:
    keep = int(sys.argv[1])
    smas = [smas[keep]]
    osmas = [osmas[keep]]
opatts = [Chem.MolFromSmarts(x,mergeHs=False) for x in smas]
patts = [Chem.MolFromSmarts(x,mergeHs=True) for x in smas]
print("   reading mols")
smis = [x[0] for x in csv.reader(open('./test_data/wehi_mols.csv'))]
ms = [Chem.MolFromSmiles(x) for x in smis]
mhs = [Chem.AddHs(x) for x in ms]

print("   filtering")
matches=[]
found=0
for i,(m,mh) in enumerate(zip(ms,mhs)):
    for j,(patt,opatt) in enumerate(zip(patts,opatts)):
        t1 = m.HasSubstructMatch(patt)
        t2 = mh.HasSubstructMatch(opatt)
        if t1:
            found+=1
        if t1^t2:
            matches.append((i,j,smis[i],smas[j]))
            print(i,j,smis[i],smas[j])
    if not (i+1)%100:
        print("    Done: ",i+1," matches: ",len(matches)," found: ",found)

I'm running something similar across CHEMBL20 now. This will take a while and is certainly going to turn up additional problems.

Along the way I am keeping track of how many times each pattern occurs; would be good to make sure that we have at least one example for each pattern.
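
The per-pattern bookkeeping mentioned here could look roughly like the following. This is a minimal sketch and not the script that was actually run: `pattern_coverage`, `uncovered` and the `matches` callable are illustrative names. With RDKit the callable would typically be something like `lambda mol, patt: mol.HasSubstructMatch(patt)`.

```python
from collections import Counter
from typing import Any, Callable, List, Sequence


def pattern_coverage(mols: Sequence[Any],
                     patterns: Sequence[Any],
                     matches: Callable[[Any, Any], bool]) -> Counter:
    """Count, for each pattern index, how many molecules it hits."""
    # Start every pattern at zero so unmatched patterns still appear.
    counts = Counter({j: 0 for j in range(len(patterns))})
    for mol in mols:
        for j, patt in enumerate(patterns):
            if matches(mol, patt):
                counts[j] += 1
    return counts


def uncovered(counts: Counter) -> List[int]:
    """Indices of patterns that did not hit any molecule."""
    return sorted(j for j, n in counts.items() if n == 0)
```

Any index returned by `uncovered` is a pattern that still needs an example molecule from some other data set.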

greglandrum commented Aug 4, 2015

Ok, the ChEMBL experiment is done and I fixed a couple more SMARTS.
The SMARTS in https://github.com/rdkit/rdkit/blob/3af7aeaaea348ef25e974056ad1b593efa4e7f8d/Data/Pains/wehi_pains.csv
now work on hydrogen-suppressed molecules, i.e. molecules without explicit Hs (as long as you use the mergeHs=True argument to Chem.MolFromSmarts()).

This has been tested against all 10K molecules in the WEHI set that was used in the original KNIME workflow (only has matches for 144 of the PAINS) and the 1.4M molecules in ChEMBL20 (only has matches for 293 of the PAINS).

I will try to find a dataset (likely going to be pubchem... shudder!) to find matches for the remaining ~200 PAINS so that we have at least one matching molecule for each of them to use in a test set.

apahl commented Aug 5, 2015

Greg, thanks a lot again for all the work you put into this.
Do you think it is possible that a lot of the remaining filters in the PAINS set don't code for "real" molecules (although RDKit can generate mols from them)?

greglandrum commented Aug 5, 2015

That is certainly a possibility; I'm not convinced that the SLN -> SMARTS translation went flawlessly.

In the interests of having some kind of test set, I am currently running through the full ZINC set with the hydrogen-suppressed SMARTS version (doing the searches with Hs in the molecule is a bit too slow). It's about 13 million compounds in and has turned up matches for 89 of the remaining patterns. There are still 82 left to go.

The supplementary material for the "PAINS in KNIME" paper (http://onlinelibrary.wiley.com/store/10.1002/minf.201100076/asset/supinfo/minf_201100076_sm_miscellaneous_information.pdf?v=1&s=5e9531fd6b228e4f1d4db17b31bc31b254d23558) includes a set of examples that did have matches in those 10K molecules from SLN but that do not from SMARTS. I'm going to spot-check some of those to see if I can figure out why. Going to have to brush up on my SLN first though... :-)

I'm going to need to do a blog post about this exercise.

greglandrum commented Aug 6, 2015

This morning I ran the remaining unmatched queries (for some reason now I only have 81) against pubchem. This gave me an additional 62 matches (only 18 left without matches) and allowed me to do some more testing/tweaking of the SMARTS. The differences were mainly due to different aromaticity models (primarily to do with the impact of exocyclic double bonds). At this point most of the pubchem matches also match with the RDKit code.

The changes are all checked in, as are a set of tests that have one matching molecule per SMARTS. This isn't optimal, but going much beyond that would be pretty onerous.

@bp-kelley : apologies, but the FilterCatalog is going to need to be updated again. I think I'm done tweaking these things for now though.

bp-kelley commented Aug 6, 2015

No worries, I have a script that does the conversion, so it is fairly trivial.

I'll also incorporate your test into the filter catalog suite when I have a moment.



jir322 commented Aug 6, 2015

Thank you for the PAINS SMARTS curation, Greg. We were making progress on
our own effort to curate them, and it was a real slog! We appreciate this
very much, along with so much else you do in RDKit!

John


greglandrum commented Aug 7, 2015

Thanks for the positive feedback John! It's always great to hear that the work is appreciated.

The cleanup that's been done so far was, indeed, something of a slog. Enough of one that I haven't yet tried to figure out why I'm not finding matches for the 18 PAINS I still haven't found examples for. I'm guessing that something got seriously borked in the conversion from SLN to SMARTS, but haven't drilled into it yet.

This will show up in a blog post at some point (hopefully soon), but here are the missing SMARTS as well as the name of the corresponding PAINS in case anyone has the time/desire to investigate further:

"[#8]-[#6](=[#8])-[#6](-[#1])(-[#1])-[#16;X2]-[#6](=[#7]-[#6]#[#7])-[#7](-[#1])-c:1:c:c:c:c:c:1","<regId=cyanamide_A(1)>"
"c:1-3:c(:c:c:c:c:1)-[#16]-[#6](=[#7]-[#7]=[#6]-2-[#6]=[#6]-[#6]=[#6]-[#6]=[#6]-2)-[#7]-3-[#6](-[#1])-[#1]","<regId=colchicine_het(1)>"
"c:1(:c(:c:2:c(:n:c:1-[#7](-[#1])-[#1]):c:c:c(:c:2-[#7](-[#1])-[#1])-[#6]#[#7])-[#6]#[#7])-[#6]#[#7]","<regId=cyano_amino_het_A(1)>"
"[#6](-[#1])-[#6]:2:[#7]:[#7](-c:1:c:c:c:c:c:1):[#16]:3:[!#6&!#1]:[!#1]:[#6]:[#6]:2:3","<regId=het_thio_N_55(5)>"
"[#6]-2(=[#16])-[#7]-1-[#6]:[#6]-[#7]=[#7]-[#6]-1=[#7]-[#7]-2-[#1]","<regId=thio_urea_K(2)>"
"[#7](-[#1])(-[#1])-c:1:c(:c(:c(:c(:c:1-[#7](-[#1])-[#16](=[#8])=[#8])-[#1])-[#7](-[#1])-[#6](-[#1])-[#1])-[F,Cl,Br,I])-[#1]","<regId=anil_NH_no_alk_B(1)>"
"[#7]-4(-c:1:c:c:c:c:c:1)-[#6](=[#7+](-c:2:c:c:c:c:c:2)-[#6](=[#7]-c:3:c:c:c:c:c:3)-[#7]-4)-[#1]","<regId=het_5_inium(1)>"
"c:1:3:c(:c:c:c:c:1)-[#7]-2-[#6](=[#8])-[#6](=[#6](-[F,Cl,Br,I])-[#6]-2=[#8])-[#7](-[#1])-[#6]:[#6]:[#6]:[#6](-[#8]-[#6](-[#1])-[#1]):[#6]:[#6]:3","<regId=anil_OC_alk_B(3)>"
"[#6]-1(=[#6](-!@[#6]=[#7])-[#16]-[#6](-[#7]-1)=[#8])-[$([F,Cl,Br,I]),$([#7+](:[#6]):[#6])]","<regId=thiaz_ene_C(11)>"
"s:1:c(:c(-[#1]):c(:c:1-[#6]-3=[#7]-c:2:c:c:c:c:c:2-[#6](=[#7]-[#7]-3-[#1])-c:4:c:c:n:c:c:4)-[#1])-[#1]","<regId=het_76_A(1)>"
"[#7]=[#6]-1-[#7](-[#1])-[#6](=[#6](-[#7]-[#1])-[#7]=[#7]-1)-[#7]-[#1]","<regId=het_6_imidate_A(4)>"
"[#6]-2(=[#7]-c1c(c(nn1-[#6](-[#6]-2(-[#1])-[#1])=[#8])-[#7](-[#1])-[#1])-[#7](-[#1])-[#1])-[#6]","<regId=het_65_G(1)>"
"c:1:2(:c(:c(:c(:o:1)-[#6])-[#1])-[#1])-[#6](=[#8])-[#7](-[#1])-[#6]:[#6](-[#1]):[#6](-[#1]):[#6](-[#1]):[#6](-[#1]):[#6]:2-[#6](=[#8])-[#8]-[#1]","<regId=anthranil_acid_I(1)>"
"[#6](-[#1])(-c:1:c(:c(:c(:c(:c:1-[#1])-[#1])-[Cl])-[#1])-[#1])(-c:2:c(:c(:c(:c(:c:2-[#1])-[#1])-[Cl])-[#1])-[#1])-[#8]-[#6](-[#1])(-[#1])-[#6](-[#1])(-[#1])-[#6](-[#1])(-[#1])-c3nc(c(n3-[#6](-[#1])(-[#1])-[#1])-[#1])-[#1]","<regId=misc_imidazole(1)>"
"c2(c-1n(-[#6](-[#6]=[#6]-[#7]-1)=[#8])nc2-c3cccn3)-[#6]#[#7]","<regId=het_65_H(1)>"
"[#7](-[#1])(-c:1:c(:c(:c(:c(:c:1-[#1])-[#1])-[#1])-[#1])-[#8]-[#1])-[#6]-2=[#6](-[#8]-[#6](-[#7]=[#7]-2)=[#7])-[#7](-[#1])-[#1]","<regId=het_6_imidate_B(1)>"
"[#8]=[#6]-3-c:1:c(:c:c:c:c:1)-[#6]-2=[#6](-[#8]-[#1])-[#6](=[#8])-[#7]-c:4:c-2:c-3:c:c:c:4","<regId=quinone_C(2)>"
"c:1:c:c-2:c(:c:c:1)-[#7](-[#6](-[#8]-[#6]-2)(-[#6](=[#8])-[#8]-[#1])-[#6](-[#1])-[#1])-[#6](=[#8])-[#6](-[#1])-[#1]","<regId=misc_aminal_acid(1)>"
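
For anyone who wants to pick these up programmatically, here is a minimal sketch of parsing rows in the `"SMARTS","<regId=name(n)>"` layout shown above. `parse_pains_rows` is a hypothetical helper, not part of RDKit. The meaning of the number in parentheses is not stated in this thread, so it is kept only as an opaque integer.

```python
import csv
import io
import re
from typing import List, Tuple

# Matches tags such as <regId=thio_urea_K(2)>
REGID = re.compile(r"<regId=(?P<name>.+)\((?P<num>\d+)\)>")


def parse_pains_rows(text: str) -> List[Tuple[str, int, str]]:
    """Return (name, number, smarts) for each non-empty CSV row."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        smarts, tag = row[0], row[1]
        m = REGID.fullmatch(tag)
        if m is None:
            raise ValueError("unexpected regId tag: %r" % tag)
        rows.append((m.group("name"), int(m.group("num")), smarts))
    return rows
```

The resulting SMARTS strings could then be passed to `Chem.MolFromSmarts()` one at a time to investigate each unmatched pattern.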