# rajarshi/cacm-article (public)

# Some changes and added references... #13

Merged: 5 commits, +67 −44

No description provided.

joergkurtwegner Minot change 42a335f
joergkurtwegner Minor rewrite 3cffda8
joergkurtwegner Minor rewrite 9731432
joergkurtwegner Minor rewrite f417566
joergkurtwegner Reference added 4fbaba3

rajarshi referenced this pull request from a commit: Merge pull request #13 from joergkurtwegner/master Some changes and added references... 6b7368c

Merged as commit 6b7368c and closed.

Showing 5 unique commits by 1 author.

Nov 17, 2011
Minot change 42a335f
Minor rewrite 3cffda8
Minor rewrite 9731432
Minor rewrite f417566
Reference added 4fbaba3
@@ -70,10 +70,11 @@ \section*{Algorithmic graph theory}
 
 \item \emph{Searching within complex data types, e.g. molecules, for semantic web approaches}.
 
-The key bonus of the semantic web is that different data sources can be readily integrated with each other. In the field
+One key concept of the linked data web, the semantic web, is that different data sources can be readily integrated with each other. Still, in the field
 of Cheminformatics, we are not only interest in linking two molecules
-(this normalization problem for different protomers, tautomers, or special cases of isomerisms remains open), but we
-are also interested in being able to search efficiently within molecules when being linked via semantic web approaches.
+(the linking normalization problem for different protomers, tautomers, or special cases of isomerisms remain open), but we
+are also interested in being able to search efficiently within molecules when being linked via semantic web approaches. Typical
+searches will require being able to apply substructure or similarity searches.
 What could be algorithmic solutions for this?
 \end{enumerate}
 
@@ -81,61 +82,54 @@ \section*{Cryptography}
 One-way molecular featurization (???)
 
 \section*{Data mining}
-evaluation of similarities in a heterogeneous network. What is a specific example here?
-
-integrate chemical structure information with ontologies. Specific example problem?
-
-Classify molecular descriptors up to equivalence, and dependence on one another.
-
-systems-level understanding of small molecules. What does this mean, and what would a specific challenge problem be?
-
 \begin{enumerate}
-
-\item \emph{Efficient molecule browsing, e.g. on scaffold level}.
-
-Chemical Abstract Services have a molecule browsing tool called SubScape, which allows to brows large-scale
-chemical spaces efficiently. What could be large-scale solutions for doing this within (combined and aligned)
-public databases.
+\item evaluation of similarities in a heterogeneous network (JKW: What are heterogenous networks?). What is a specific example here?
+\item integrate chemical structure information with ontologies. Specific example problem?
+\item Classify molecular descriptors up to equivalence, and dependence on one another.
+\item systems-level understanding of small molecules (JKW: which systems? This will help being clearer on the challenge). What does this mean, and what would a specific challenge problem be?
+%\item \emph{Efficient molecule browsing, e.g. on scaffold level}.
+%
+%Chemical Abstract Services have a molecule browsing tool called SubScape, which allows to browse large-scale
+%chemical spaces efficiently. What could be large-scale solutions for doing this within (combined and aligned)
+%public databases.
 %
 \item \emph{Large-scale browsing of molecular property spaces, e.g. on scaffold level, side-effect-level, ...}.
 
-Certain molecules might have hundreds of biological activities, side-effects in humans (from clinical trials), or
-many other properties attached to them. What are large-scale mining and visulization options, especially when thinking
-about mining private and public datasources at the very same time?
+Certain molecules might have hundreds of biological activities, side-effects in humans (SIDER database \cite{Kuhn_Campillos_Letunic_Jensen_Bork_2010}), or
+many other properties attached to them. What are large-scale mining and visulization options?
+How can we mine private and public data sources at the very same time?
 %
-\item \emph{Patent text mining (curation)}.
+\item \emph{Chemical image/text mining in patents (curation)}.
 
-There are various tools for doing automatic text mining on chemical patents. Still, the overall acceptance rate is improvable
-since many medicinal chemists are very concerned about the data quality of such efforts. What could be done to improve
-the mining quality and to provide confidence level estimations for each molecule coming from patent mining?
+There are various tools for doing automatic text mining on chemical patents. Still, the overall acceptance rate of chemical
+text mining is improvable, since many medicinal chemists are very concerned about the data quality of such efforts.
+What could be done to improve the mining quality, curate the obtained data, and to provide confidence level estimations
+for each molecule coming from patent mining? Do require image2structure and text2structure mining also data stores
+for ensuring a sufficient amount of confidence and data quality?
 How can patent mining be used to create new drugs faster or to speed-up collaboration/licensing discussions?
 \end{enumerate}
 
 \section*{Machine learning}
-I don't know. Help? Ideas: better kernelization techniques (something specific though), better algorithms to train X model about Y thing.
 \begin{enumerate}
 \item \emph{Large-scale vectorial versus kernels molecule similarity}
 
-Vectorial molecule encodings serve as efficient approximations.
+Vectorial molecule encodings can serve as efficient approximations of molecules.
 Sometimes non-vectorial molecular 3D shape or molecule kernel comparisons might be more suitabe to compare molecules, since
-they might better correlate with biological activity, toxicity in humans, etc. One key problem is that molecular 3D shape
-or molecule kernel approaches require to compare two molecules (or their 3D conformational eplosions) directly.
-This becomes prohibitively expensive when considering millions of molecules. What could be potential solutions for approximating boundary
-conditions or large-scale mining methods allowing within a second range to return all similar molecules to molecule X, especially
-considering not just one vectorial encoding, but molecule kernels, ligand-protein similarities (for example from ligand-protein crystal structures),
-or chemogenomics similarities (all reported biological activities for one molecule).
+they might better correlate with activities. One key problem is that non-vectorial encodings require to compare all molecules
+(or their 3D conformational eplosions) in a pair-wise manner.
+This becomes prohibitively expensive when considering millions of molecules.
+Can dyadic data approaches help \cite{Hochreiter:2006:SVM:1159508.1159516}? Other approximations or cascading flows?
 %
 \item \emph{Using multiple annotations for improving molecular mining/predictions (chemogenomics)}
 
 As an example: Biological activities might not be independent of each other, but have a certain correlation between each other.
 In Chemogenomics this is used for creating models of combining molecules with protein sequences, molecules with active sites of proteins,
-molecules with biological read-outs of multiple assays. How can we optimize such highly complex mining scenarios, especially
+or molecules with biological activities of multiple assays. How can we optimize such highly complex mining scenarios, especially
 when considering large-scale data sets with hundred of thousands molecules and thousands of biological activities?
 How can we combine, mine, and visualize categorial and continuous output variables, e.g. hydrophobicity of a molecule and toxicity in humans,
-by still being able to make conrete proposals to medicinal chemistry of which parts of the molecule needs to be changed to
-optimize the effect on a certain set of multi-objective variables? Is analoging (creating vrey small modifications of a molecule and measuring its activities)
+by still being able to make concrete proposals to medicinal chemistry? Is analoging (creating very small modifications of a molecule and measuring its activities)
 really the most efficient way forward? If we test molecules, should we test it in a single biological assay or in multiple biological assays, if multiple, which ones?
-If a company does not have a biological assay within reach, which other partner could offer testing a molecule within two days (vendor matching based on licenses)?
+If a company does not have a biological assay within reach, which other partner could offer testing a molecule within two days (vendor matching based on licenses or contracts)?
 \end{enumerate}
 
 \section*{Software engineering}
@@ -147,10 +141,11 @@ \section*{Software engineering}
 databases, especially with the explosion of public databases, but also creates hugh space and time complexity issues when
 searching within such databases. What could be better interfaces, maintenance, data structures, and private/public sharing
 scenarios for conformational 3D databases?
-Many software vendors use different solutions for parallizing comput jobs: SGE, PVM, MPI, etc.
-Everyone knowing thh enterprise structure within companies might know that having more than one parallel processing framework
-is not easy? What could be better ways to streamline parallel processing structures for
+Many software vendors use different solutions for parallizing compute jobs: SGE, PVM, MPI, etc.
+Everyone knowing the enterprise IT approval cycles might know that having more than one parallel processing framework
+is not easy. What could be better ways to streamline parallel processing structures for
 cheminformatics (and molecular modeling) algorithms? Is a cloud really an option? What about SaaS with secured data transfer?
+Can this also offer alternative licensing strategies for software suites in this domain?
 \end{enumerate}
 
 \section*{Enterprise software (KM,ELN)}
@@ -158,11 +153,12 @@ \section*{Enterprise software (KM,ELN)}
 \begin{enumerate}
 \item \emph{Public-private collaboration and security scenarios}
 
-Having within an organization, e.g. a commercial company, single or a small number of established KM and ELN products.
+Let us assume an organization, e.g. a commercial company, has a single or a small number of established KM and ELN products.
 How can we improve the maintenance, leveraging, and collaboration with many external partners (each of them potentially
-with another KM/ELN solution)? Which party is hosting which data in which data structure, and how can we ensure that only
-pre-defined data entries (and a limited number of annotations, e.g. bioological activities) are visible to a partner.
-How can this be organized for a multitude of partners? Cloud computing?
+with another KM/ELN solution)? Which party is hosting which data in which data structure (ontologies?), and
+how can we ensure that only pre-defined data entries
+(and a limited number of annotations, e.g. biological activities) are visible to a partner.
+How can this be organized for a multitude of partners? Cloud computing, user management, encryption granularity and efficient security management?
 \end{enumerate}
 
 \bibliographystyle{abbrv}
@@ -1243,7 +1243,7 @@ @book{TopologicalLook
 }
 
 @article{MCSreview,
-  author = {John W. Raymond, and Peter Willett},
+  author = {John W. Raymond and Peter Willett},
   title = {Maximum common subgraph isomorphism algorithms for the matching of chemical structures},
   journal = {Journal of Computer-Aided Molecular Design},
   volume = {16},
@@ -1270,3 +1270,30 @@ @article{Epp-JGAA-99
   pages = {1--27},
   year = {1999},
   review = {MR-2001b-05154}}
+
+@article{Kuhn_Campillos_Letunic_Jensen_Bork_2010,
+  title={A side effect resource to capture phenotypic effects of drugs.},
+  volume={6}, url={http://www.ncbi.nlm.nih.gov/pubmed/20087340},
+  number={343}, journal={Molecular Systems Biology},
+  publisher={Nature Publishing Group},
+  author={Kuhn, Michael and Campillos, Monica and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
+  year={2010},
+  pages={343}}
+
+@article{Hochreiter:2006:SVM:1159508.1159516,
+  author = {Hochreiter, Sepp and Obermayer, Klaus},
+  title = {Support vector machines for dyadic data},
+  journal = {Neural Comput.},
+  volume = {18},
+  issue = {6},
+  month = {June},
+  year = {2006},
+  issn = {0899-7667},
+  pages = {1472--1510},
+  numpages = {39},
+  url = {http://dl.acm.org/citation.cfm?id=1159508.1159516},
+  doi = {10.1162/neco.2006.18.6.1472},
+  acmid = {1159516},
+  publisher = {MIT Press},
+  address = {Cambridge, MA, USA},
+}
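
The "Machine learning" item in the diff states that comparing all molecules in a pair-wise manner becomes prohibitively expensive for millions of molecules. A minimal sketch of why, under assumptions that are not in the article: molecules are represented as sets of "on" fingerprint bits, similarity is the Tanimoto (Jaccard) coefficient, and the names `tanimoto`, `all_pairs`, and `pair_count` are hypothetical helpers chosen for illustration.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits.

    Assumption: fingerprints are bit sets; the article does not fix an encoding.
    """
    union = len(a | b)
    # Two empty fingerprints are treated as identical by convention here.
    return len(a & b) / union if union else 1.0


def all_pairs(fingerprints):
    """Score every unordered pair of molecules (quadratic in the input size)."""
    indices = range(len(fingerprints))
    return {
        (i, j): tanimoto(fingerprints[i], fingerprints[j])
        for i, j in combinations(indices, 2)
    }


def pair_count(n: int) -> int:
    """Number of unordered pairs among n molecules: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical toy fingerprints, for illustration only.
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
scores = all_pairs(fps)
print(scores[(0, 1)])         # 0.5
print(pair_count(1_000_000))  # 499999500000
```

Even with a cheap set-based score, one million molecules already imply roughly 5 × 10^11 comparisons, which is the cost the item asks approximations or cascading flows to avoid; kernel or 3D shape comparisons would make each individual comparison more expensive still.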