RDKit learns how to filter PAINS/BRENK/ZINC/NIH via FilterCatalog #536
FilterCatalogs give RDKit the ability to screen out or reject
The following are C++ and Python examples of how to filter molecules.
using namespace RDKit;
params = FilterCatalog.FilterCatalogParams()
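For readers who want something runnable on the Python side, here is a minimal sketch of PAINS screening with the FilterCatalog. The helper name `pains_alerts` is made up for illustration, and the test molecule is simply the one discussed later in this thread; which alerts (if any) it triggers depends on the RDKit version and its SMARTS definitions.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog that holds the PAINS filter set.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)


def pains_alerts(smiles):
    """Return the descriptions of all PAINS entries matching a SMILES.

    Hypothetical helper, not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]


# Molecule from the indol_3yl_alk discussion in this thread.
alerts = pains_alerts("Cc1ccc(NC(=O)Cc2c(C(=O)O)[nH]c3ccccc23)c(Br)c1")
print(len(alerts), "PAINS alert(s):", alerts)
```

`catalog.HasMatch(mol)` is the cheaper call when only a yes/no answer is needed.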
FilterCatalogs are fully serializable and can be stored for later use.
To serialize a catalog, use the catalog.Serialize() method.
To deserialize, pass the resulting string to the constructor
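A minimal round-trip sketch of the serialization described above, assuming an RDKit build with serialization support enabled. The choice of the PAINS_A subset is only to keep the example small.

```python
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a small catalog (PAINS family A only, chosen for brevity).
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS_A)
catalog = FilterCatalog(params)

# Serialize to a binary string, then rebuild a catalog from it.
blob = catalog.Serialize()
restored = FilterCatalog(blob)

print(catalog.GetNumEntries(), restored.GetNumEntries())
```

The serialized blob can be written to disk and reloaded later instead of rebuilding the catalog from its parameters.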
The underlying matchers can be arbitrarily complicated and new
SmartsMatcher - match a SMARTS pattern or query molecule with a minimum
And - combine two matchers
Entries can be added at any time to a catalog:
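As a rough sketch of adding an entry from Python: the snippet below builds an empty catalog and adds one SMARTS-based entry. The entry name and the pattern are invented for illustration and are not part of any shipped filter set.

```python
from rdkit import Chem
from rdkit.Chem import FilterCatalog as fcat

catalog = fcat.FilterCatalog()  # start from an empty catalog

# A matcher that fires on a chain of five aromatic carbons (toy pattern).
matcher = fcat.SmartsMatcher("Aromatic carbon chain")
matcher.SetPattern(Chem.MolFromSmarts("c:c:c:c:c"))
matcher.SetMinCount(1)

# Wrap the matcher in a named entry and add it to the catalog.
entry = fcat.FilterCatalogEntry("aromatic_chain_demo", matcher)
catalog.AddEntry(entry)

benzene = Chem.MolFromSmiles("c1ccccc1")
butane = Chem.MolFromSmiles("CCCC")
print(catalog.HasMatch(benzene), catalog.HasMatch(butane))
```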
A FilterCatalog supports a few different types of matching. One is
These types of queries can indicate the substructure that triggered
The FilterCatalog also supports acceptance filters, that are
means that we have a maximum of 40 carbon atoms. We can write this by
This can be properly substructure searched.
Or we can wrap this in a not:
Note: Wrapping in a Not loses the ability to highlight the rejecting
Jul 15, 2015
This filter set is a very interesting addition to the RDKit, esp. the PAINS filters, thanks a lot.
Practically, we have found that many of the differences are caused by the aromaticity perception of the underlying chemistry engine, though this is just a guess at this point (and it doesn't take into account the SLN->SMARTS conversion). I know of a few alternate implementations of PAINS that could perhaps be used to generate a proper reference set, which (based on your analysis) seems like the right thing to do at this point, if only for validation purposes.
I just did a bit of experimentation.
I don't know, Greg, just using mergeHs does not work for me, here is an example:
In : # Filter indol_3yl_alk(461) from the PAINS set:
smarts = 'n:1(c(c(c:2:c:1:c:c:c:c:2-[#1])-[#6;X4]-[#1])-[$([#6](-[#1])-[#1]),$([#6]=,:[!#6&!#1]),$([#6](-[#1])-[#7]),$([#6](-[#1])(-[#6](-[#1])-[#1])-[#6](-[#1])(-[#1])-[#7](-[#1])-[#6](-[#1])-[#1])])-[$([#1]),$([#6](-[#1])-[#1])]'
mol = Chem.MolFromSmiles("Cc1ccc(NC(=O)Cc2c(C(=O)O)[nH]c3ccccc23)c(Br)c1")

In : smarts_mol = Chem.MolFromSmarts(smarts)
mol.HasSubstructMatch(smarts_mol)
Out: False

In : smarts_mol = Chem.MolFromSmarts(smarts, mergeHs=True)
mol.HasSubstructMatch(smarts_mol)
Out: False

In : mol_h = Chem.AddHs(mol)
smarts_mol = Chem.MolFromSmarts(smarts)
mol_h.HasSubstructMatch(smarts_mol)
Out: True
There are (at least) two problems there:
I'm going to do some bulk testing here to find other incorrect examples to help get as close as possible to the bottom of this.
@apahl and @bp-kelley,
I tested using the 10000 WEHI molecules that were part of the KNIME workflow.
import csv
import sys
from rdkit import Chem

# Assumes the SMARTS / SMILES string is in the first column of each CSV row.
smas = [x[0] for x in csv.reader(open('./wehi_pains.csv'))]
osmas = [x[0] for x in csv.reader(open('./wehi_pains.orig.csv'))]
if len(sys.argv)>1:
    # Optionally restrict the run to a single pattern, selected by index.
    keep = int(sys.argv[1])
    smas = [smas[keep]]
    osmas = [osmas[keep]]
opatts = [Chem.MolFromSmarts(x,mergeHs=False) for x in smas]
patts = [Chem.MolFromSmarts(x,mergeHs=True) for x in smas]
print(" reading mols")
smis = [x[0] for x in csv.reader(open('./test_data/wehi_mols.csv'))]
ms = [Chem.MolFromSmiles(x) for x in smis]
mhs = [Chem.AddHs(x) for x in ms]
print(" filtering")
matches=[]  # (mol index, pattern index, SMILES, SMARTS) where the two searches disagree
found=0
for i,(m,mh) in enumerate(zip(ms,mhs)):
    for j,(patt,opatt) in enumerate(zip(patts,opatts)):
        t1 = m.HasSubstructMatch(patt)    # H-merged query vs. H-suppressed molecule
        t2 = mh.HasSubstructMatch(opatt)  # explicit-H query vs. molecule with Hs added
        if t1:
            found+=1
        if t1^t2:
            matches.append((i,j,smis[i],smas[j]))
            print(i,j,smis[i],smas[j])
    if not (i+1)%100:
        print(" Done: ",i+1," matches: ",len(matches)," found: ",found)
I'm running something similar across CHEMBL20 now. This will take a while and is certainly going to turn up additional problems.
Along the way I am keeping track of how many times each pattern occurs; would be good to make sure that we have at least one example for each pattern.
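The bookkeeping mentioned here could look roughly like the sketch below. The function name `pattern_coverage` and the three toy patterns are invented for illustration; they are not PAINS definitions.

```python
from collections import Counter
from rdkit import Chem


def pattern_coverage(smiles_list, named_smarts):
    """Count matching molecules per pattern and list patterns with no hits.

    Hypothetical helper: named_smarts maps a pattern name to a SMARTS string.
    """
    patterns = {name: Chem.MolFromSmarts(sma) for name, sma in named_smarts.items()}
    counts = Counter({name: 0 for name in patterns})
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        for name, patt in patterns.items():
            if mol.HasSubstructMatch(patt):
                counts[name] += 1
    unmatched = sorted(name for name, n in counts.items() if n == 0)
    return counts, unmatched


# Toy data, for illustration only.
toy_patterns = {
    "phenol": "c[OX2H]",
    "nitro": "[N+](=O)[O-]",
    "carboxylic_acid": "C(=O)[OX2H1]",
}
counts, unmatched = pattern_coverage(["c1ccccc1O", "CCO", "CC(=O)O"], toy_patterns)
print(dict(counts), "still without an example:", unmatched)
```

Patterns that end up in `unmatched` are the ones that still need a matching molecule from some other dataset.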
Ok, the ChEMBL experiment is done and I fixed a couple more SMARTS.
This has been tested against all 10K molecules in the WEHI set that was used in the original KNIME workflow (only has matches for 144 of the PAINS) and the 1.4M molecules in ChEMBL20 (only has matches for 293 of the PAINS).
I will try to find a dataset (likely going to be pubchem... shudder!) to find matches for the remaining ~200 PAINS so that we have at least one matching molecule for each of them to use in a test set.
That is certainly a possibility, I'm not convinced that the SLN -> SMARTS translation went flawlessly.
In the interests of having some kind of test set, I am currently running through the full ZINC set with the hydrogen-suppressed SMARTS version (doing the searches with Hs in the molecule is a bit too slow). It's about 13 million compounds in and has turned up matches for 89 of the remaining patterns. There are still 82 left to go.
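For anyone following along, the two search modes being compared are sketched below on a deliberately trivial pattern (a hydroxyl written with an explicit hydrogen). This is a toy case chosen for illustration; as the indol_3yl_alk example above shows, real PAINS patterns with hydrogens inside recursive SMARTS did not necessarily behave this simply.

```python
from rdkit import Chem

smarts = "[#8]-[#1]"             # toy pattern with an explicit hydrogen
mol = Chem.MolFromSmiles("CCO")  # ethanol, hydrogens implicit

explicit_patt = Chem.MolFromSmarts(smarts)             # H stays a query atom
merged_patt = Chem.MolFromSmarts(smarts, mergeHs=True)  # H folded into the O query

# Explicit-H query against an H-suppressed molecule: no H atom to match.
unmerged_no_hs = mol.HasSubstructMatch(explicit_patt)
# H-merged query against the same molecule: the O only needs an attached H.
merged_no_hs = mol.HasSubstructMatch(merged_patt)
# Explicit-H query against a molecule with hydrogens added.
unmerged_with_hs = Chem.AddHs(mol).HasSubstructMatch(explicit_patt)

print(unmerged_no_hs, merged_no_hs, unmerged_with_hs)  # False True True
```

The merged form avoids calling `AddHs` on every molecule, which is where the speed difference on large sets comes from.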
The supplementary material for the "PAINS in KNIME" paper (http://onlinelibrary.wiley.com/store/10.1002/minf.201100076/asset/supinfo/minf_201100076_sm_miscellaneous_information.pdf?v=1&s=5e9531fd6b228e4f1d4db17b31bc31b254d23558) includes a set of examples that did have matches in those 10K molecules from SLN but that do not from SMARTS. I'm going to spot-check some of those to see if I can figure out why. Going to have to brush up on my SLN first though... :-)
I'm going to need to do a blog post about this exercise.
This morning I ran the remaining unmatched queries (for some reason now I only have 81) against pubchem. This gave me an additional 62 matches (only 18 left without matches) and allowed me to do some more testing/tweaking of the SMARTS. The differences were mainly due to different aromaticity models (primarily to do with the impact of exocyclic double bonds). At this point most of the pubchem matches also match with the RDKit code.
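To illustrate the kind of aromaticity effect described here, the sketch below uses 2-pyridone, a ring with an exocyclic double bond that the RDKit perceives as aromatic. The molecule and both SMARTS are made up for this example and are not among the PAINS patterns that were changed; the `=,:` bond form is the same one that appears in the indol_3yl_alk SMARTS earlier in the thread.

```python
from rdkit import Chem

# 2-pyridone written in Kekule form; the RDKit aromatizes the ring on parsing.
mol = Chem.MolFromSmiles("O=C1C=CC=CN1")
ring_is_aromatic = any(atom.GetIsAromatic() for atom in mol.GetAtoms())

# Pattern that insists on alternating single/double bonds in the ring.
strict = Chem.MolFromSmarts("[#8]=[#6]-[#6]=[#6]")
# Same pattern, but each ring bond may also be aromatic.
tolerant = Chem.MolFromSmarts("[#8]=[#6]-,:[#6]=,:[#6]")

strict_hit = mol.HasSubstructMatch(strict)
tolerant_hit = mol.HasSubstructMatch(tolerant)
print(ring_is_aromatic, strict_hit, tolerant_hit)  # True False True
```

A toolkit that leaves this ring non-aromatic would match the strict pattern, which is one way the same SMARTS can give different hit lists in different chemistry engines.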
The changes are all checked in, as are a set of tests that have one matching molecule per SMARTS. This isn't optimal, but going much beyond that would be pretty onerous.
@bp-kelley : apologies, but the FilterCatalog is going to need to be updated again. I think I'm done tweaking these things for now though.
No worries, I have a script that does the conversion, so it is fairly trivial.
I'll also incorporate your test into the filter catalog suite when I have a moment.
Thank you for the PAINS SMARTS curation, Greg. We were making progress on
On Thu, Aug 6, 2015 at 4:42 AM, Greg Landrum email@example.com
Thanks for the positive feedback John! It's always great to hear that the work is appreciated.
The cleanup that's been done so far was, indeed, something of a slog. Enough of one that I haven't yet tried to figure out why I'm not finding matches for the 18 PAINS I still haven't found examples for. I'm guessing that something got seriously borked in the conversion from SLN to SMARTS, but haven't drilled into it yet.
This will show up in a blog post at some point (hopefully soon), but here are the missing SMARTS as well as the name of the corresponding PAINS in case anyone has the time/desire to investigate further: