Added summary section

commit 543f6434e2317d735ea799033105a2264aae56b3 1 parent 3516c06
@rajarshi authored
Showing with 77 additions and 16 deletions.
  1. +14 −2 paper.bib
  2. +63 −14 paper.tex
16 paper.bib
@@ -2,7 +2,7 @@
-%% Created for Rajarshi Guha at 2012-04-17 10:36:35 -0400
+%% Created for Rajarshi Guha at 2012-04-17 22:53:54 -0400
%% Saved with string encoding Unicode (UTF-8)
@@ -11,6 +11,18 @@
@string{jcim = {J.~Chem.~Inf.~Model.}}
+ Author = {Guha, R.},
+ Chapter = {On Exploring Structure Activity Relationships},
+ Date-Added = {2012-04-17 21:59:48 -0400},
+ Date-Modified = {2012-04-17 22:05:07 -0400},
+ Editor = {Kortegere, S.},
+ Publisher = {Humana Press},
+ Series = {Methods in Molecular Biology},
+ Title = {In silico Models for Drug Discovery},
+ Volume = {submitted},
+ Year = {2012}}
Abstract = {The privacy of chemical structure is of paramount importance for the industrial sector, in particular for the pharmaceutical industry. At the same time, companies handle large amounts of physico-chemical and biological data that could be shared in order to improve our molecular understanding of pharmacokinetic and toxicological properties, which could lead to improved predictivity and shorten the development time for drugs, in particular in the early phases of drug discovery. The current study provides some theoretical limits on the information required to produce reverse engineering of molecules from generated descriptors and demonstrates that the information content of molecules can be as low as less than one bit per atom. Thus theoretically just one descriptor can be used to completely disclose the molecular structure. Instead of sharing descriptors, we propose to share surrogate data. The sharing of surrogate data is nothing else but sharing of reliably predicted molecules. The use of surrogate data can provide the same information as the original set. We consider the practical application of this idea to predict lipophilicity of chemical compounds and we demonstrate that surrogate and real (original) data provides similar prediction ability. Thus, our proposed strategy makes it possible not only to share descriptors, but also complete collections of surrogate molecules without the danger of disclosing the underlying molecular structures.},
Author = {Tetko, Igor V. and Abagyan, Ruben and Oprea, Tudor I.},
@@ -25260,7 +25272,7 @@ @comment{BibDesk
<key>group name</key>
<string>Exploring SAR</string>
- <string>Bajorath:2009ai,Korff:2006aa,Chen:2009hb</string>
+ <string>Chen:2009hb,Korff:2006aa,Bajorath:2009ai</string>
<key>group name</key>
77 paper.tex
@@ -73,11 +73,10 @@ \section{Introduction}
numerical forms range from a set of 3D coordinates (which coupled with
appropriate atom types, is sufficient for methods such as QM approaches
and docking) to more abstract numerical descriptions derived from 2D
-or 3D representations which can be useful is statistical approaches.
-It is now possible to evaluate thousands of numerical descriptors of
-chemical structure. As will be discussed later, many of these
-descriptors are closely related - one can be substituted for
+or 3D representations which can be useful in statistical
+approaches. It is now possible to evaluate thousands of numerical
+descriptors of chemical structure. As will be discussed later, many of
+these descriptors are closely related - one can be substituted for
another. The choice of descriptors is a well known problem and given a
large collection of them, approaches to identify a suitable subset
have been discussed extensively in the literature
@@ -314,8 +313,7 @@ \section{A Categorization of Descriptors}
\section{What is a Useful Descriptor?}
-Clearly, one can calculate a huge variety of descriptors, many of
-which will be correlated with others. It is then critical to ask, what
+Given that we can generate thousands of descriptors, it is critical to ask, what
makes a descriptor useful? Fundamentally, a descriptor must correlate
structural features with some physicochemical property and show
minimal correlation with other descriptors. In addition, a generally
@@ -399,10 +397,9 @@ \section{Descriptor Implementations}
Bioclipse \cite{Spjuth:2007aa}, MOE (Chemical Computing Group) and
Maestro (Schr\"{o}dinger, Inc.). Workflow tools are also a useful
class of applications for descriptor generation and allow one to
-easily generate descriptors from multiple sources.
-Given the focus of this article on Open Source implementations, we
-have noted the availability in Table \ref{tab:impl}. While a number of
+easily generate descriptors from multiple sources. Given the focus of
+this article on Open Source implementations, we have noted the
+availability in Table \ref{tab:impl}. While a number of
implementations are not strictly Open Source according to OSI
definitions, the fact that they provide free academic licenses does
allow them to be used somewhat freely.
@@ -447,7 +444,7 @@ \subsection{Comparing implementations}
different cheminformatics toolkits will be identical. Gupta et al
employed SMARTS based descriptors from the CDK and MOE to develop
decision tree models to predict human liver microsomal metabolic
-stability\cite{Gupta:2010uq}. Their results indicated very similar
+stability \cite{Gupta:2010uq}. Their results indicated very similar
performance between the two implementations. In general, one would
expect a high degree of correlation between different implementations
of well-defined descriptors. Here ``well defined'' indicates that the
@@ -467,8 +464,9 @@ \subsection{Comparing implementations}
reasonable to compare calculated log P values from different
implementations to the experimentally observed values, rather than
between themselves. Figure \ref{fig:logp} compares computed log P
-values from the CDK, ChemAxon and ACD Labs for a set of 10,000
-molecules taken from the logPstar dataset.
+values from the CDK (specifically, an implementation of the XlogP
+method), ChemAxon and ACD Labs for a set of 10,000 molecules taken
+from the logPstar dataset.
\subsection{Descriptor Naming \& Versioning}
@@ -518,6 +516,57 @@ \subsection{Descriptor Naming \& Versioning}
+Molecular descriptors play a fundamental role in cheminformatic and
+chemometric analyses and as we have described in this article it is
+possible to evaluate thousands of descriptors using a variety of
+software. Though the bulk of descriptors are well defined in the
+literature, multiple implementations of the same descriptor can yield
+different results. These differences arise from differences in the
+underlying chemistry models and reference data used by the
+implementations. As shown by Gupta et al \cite{Gupta:2010uq},
+the performance of predictive models developed using commercial or
+Open Source descriptor implementations is very similar. While
+different tools differ in the specific number and types of descriptors
+that they calculate, the fact that many descriptors are correlated with
+others suggests that it does not matter too much which set of
+descriptors is used in an application. Of course, this cannot be a
+general rule - specific applications may require a specific set of
+descriptors. This is not to say that all descriptor tools exhibit
+similar performance. For example, it is clear from Figure \ref{fig:logp}
+that the ACD Labs implementation of log P performs significantly
+better than other implementations. Though this is not a completely
+rigorous comparison (the methods underlying the ACD Labs, CDK and
+ChemAxon implementations are different), it does highlight that
+certain implementations of a descriptor can fare better than others -
+especially in those cases where the descriptor is based on a predictive
+model.
+
+Given these challenges it is imperative that descriptor tools provide
+access to version information. This allows the user to provide an
+exact specification of how the descriptor was calculated. However, one
+aspect that has not been standardized across different implementations
+is the naming scheme. In other words, tools that evaluate the same
+descriptor may name it in a similar but not exactly identical
+manner. This makes automated comparisons and merging of descriptors
+from different sources problematic. To address this, the CDK has developed
+the concept of a ``descriptor specification'' which is associated with
+a descriptor value and includes information on the vendor, descriptor
+title and a reference to an entry in a descriptor dictionary that
+contains more details of the descriptor in RDF. While the extra
+descriptor metadata is currently used to implement a simple
+classification scheme, it is conceivable that in the future the
+specification approach could be adopted by multiple vendors, allowing
+for automated reasoning over descriptor implementations.
+In summary, there is no dearth of tools to generate molecular
+descriptors. Many of them are available under liberal Open Source
+licenses, though these implementations do not necessarily cover all the
+descriptors described to date. The tools range from toolkits (which
+implies that one must write a program to generate descriptors) to
+self-contained GUI or command line tools. Given the caveats regarding
+implementation specific differences, it is important to keep track of
+provenance information when calculating descriptors to ensure
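
The added summary argues for recording provenance with every calculated descriptor and describes a "descriptor specification" carrying the vendor, the descriptor title and a reference to a dictionary entry. A minimal Python sketch of that idea follows. It is not the CDK API: the class names, field names and example values are hypothetical, and the `version` field is an assumption motivated by the summary's call for version information.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class DescriptorSpec:
    """Hypothetical provenance record; fields mirror the summary's description."""
    vendor: str      # who provides the implementation
    title: str       # descriptor title as named by that vendor
    reference: str   # identifier of an entry in a descriptor dictionary
    version: str     # assumed field: version of the implementing tool


@dataclass(frozen=True)
class DescriptorValue:
    """A calculated value together with the specification that produced it."""
    name: str
    value: float
    spec: DescriptorSpec


def to_json(dv: DescriptorValue) -> str:
    """Serialize the value and its provenance so both can be stored together."""
    return json.dumps(asdict(dv), sort_keys=True)


def comparable(a: DescriptorValue, b: DescriptorValue) -> bool:
    """Treat two values as directly comparable only if their specs match exactly."""
    return a.spec == b.spec


# Illustrative, made-up specifications and values.
spec_a = DescriptorSpec("ToolkitA", "XLogP", "dict#xlogp", "1.0")
spec_b = DescriptorSpec("ToolkitB", "logP", "dict#logp", "2.3")
v1 = DescriptorValue("XLogP", 2.1, spec_a)
v2 = DescriptorValue("XLogP", 1.4, spec_a)
v3 = DescriptorValue("logP", 1.9, spec_b)
```

Here `comparable(v1, v2)` holds because both values share one specification, while `v1` and `v3` would need an explicit mapping between naming schemes before they could be merged.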