Commit

Merge branch 'DL_edition' into remove_dm_dep
mbackenkoehler authored May 17, 2023
2 parents 917b3fa + 8e3e774 commit b7d4efe
Showing 42 changed files with 3,039 additions and 2,607 deletions.
@@ -2,12 +2,9 @@

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

**<span style="color:red">Important</span>:** Currently, this talktorial uses Datamol which has to be installed using `conda install -c conda-forge datamol`.


Authors:

- [Gerrit Großmann](https://mosi.uni-saarland.de/people/gerrit/), 2022, Saarland University
- Gerrit Großmann, 2022, [Chair for Modelling and Simulation](https://mosi.uni-saarland.de/people/gerrit/), Saarland University


__Talktorial T033__: This talktorial is part of the TeachOpenCADD pipeline described in the TeachOpenCADD publication (TODO), comprising of talktorials T033 to T038.
@@ -41,10 +38,10 @@ Specifically, we learn about molecular representations and find that representin
* [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/)
* Papers:
* [Molecular representations in AI-driven drug discovery: a review and practical guide](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00460-5#:~:text=Traditionally%2C%20molecules%20are%20represented%20as,of%20chemical%20structures%20in%20cheminformatics.)
* [A Review of molecular representation in the Age of Machine Learning](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wcms.1603)
* [A Review of molecular representation in the Age of machine learning](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wcms.1603)
* [Point-based molecular representation learning from conformers](https://openreview.net/pdf?id=pjePBJjlBby)
* [Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations](https://openreview.net/pdf?id=hm2tNDdgaFK)
* Talktorials:
* Talktorials:
* [T008 - Protein data acquisition: Protein Data Bank (PDB)](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T008_query_pdb/talktorial.ipynb)
* [T017 - Advanced NGLview usage](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T017_advanced_nglview_usage/talktorial.ipynb)
* Deep learning talktorials T033 to T038
@@ -1,6 +1,7 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4IUaerZ4G9Ox"
@@ -12,27 +13,29 @@
"\n",
"Authors:\n",
"\n",
"- [Gerrit Großmann](https://mosi.uni-saarland.de/people/gerrit/), 2022, Saarland University"
"- Gerrit Großmann, 2022, [Chair for Modelling and Simulation](https://mosi.uni-saarland.de/people/gerrit/), Saarland University"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YSkj7U8pHJNi"
},
"source": [
"__Talktorial T033__: This talktorial is part of the TeachOpenCADD pipeline described in the TeachOpenCADD publication (TODO), comprising of talktorials T033 to T038."
"__Talktorial T033__: This talktorial is part of the TeachOpenCADD pipeline described in the TeachOpenCADD publication, consisting of Talktorials T033 to T038."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "g8iOcdFAHqHM"
},
"source": [
"## Aim of this talktorial\n",
"\n",
"In this talktorial, we conduct the groundwork for the deep learning talktorials (<span style=\"color:pink\">add references: 034, 035, 036, 037, 038</span>).\n",
"In this talktorial, we conduct the groundwork for the deep learning talktorials.\n",
"Specifically, we learn about molecular representations and find that representing a molecule in a computer is not a trivial task. Different representations come with their specific implications and (dis-)advantages."
]
},
@@ -232,6 +235,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "85vLHU-BPcYZ"
@@ -248,8 +252,14 @@
"\n",
"*Figure 3*: \n",
"CPK coloring from Wikipedia.\n",
"\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2D visualizations are easy to draw and come in many different flavors.\n",
"For instance, the **Lewis structure** contains no 3D information (excess electrons that form lone pairs are sometimes shown as dots, we skip this part here).\n",
"\n",
@@ -266,9 +276,14 @@
"![Ethanol visualization](images/ethanol.png)\n",
"\n",
"*Figure 4*: \n",
"Different visualizations of ethanol.\n",
"\n",
"\n",
"Different visualizations of ethanol.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A special feature of this is the **Natta Projection** which provides basic (but not in every case complete) information about the relative positions of the atoms in 3D. For instance, consider the kinase inhibitor from the [RDKit Cookbook](https://www.rdkit.org/docs/Cookbook.html):\n",
"\n",
"![Kinase Inhibitor](images/kinase_inhibitor.png)\n",
@@ -280,8 +295,7 @@
"* Solid wedges indicate a bond that points out of the plane;\n",
"* Dashed wedges indicate a bond that points into the plane (away from the observer)\n",
"\n",
"You can find the corresponding ball-and-stick plot [here](https://molview.org/?q=C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65).\n",
"\n"
"You can find the corresponding ball-and-stick plot [here](https://molview.org/?q=C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65).\n"
]
},
{
@@ -359,14 +373,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "N0FsCL83SH3t"
},
"source": [
"**Text-based representations:**\n",
"\n",
"Text-based representations use a sequence of characters to specify a molecule. This is possible for practically all (small) molecules relevant in practice. Here, we discuss SMILES, InChI, and SELFIES. For a deeper dive, we refer the reader to the SMILES talktorial [T035 · SMILES based property prediction](https://github.com/volkamerlab/teachopencadd/tree/master/teachopencadd/talktorials)\n",
"Text-based representations use a sequence of characters to specify a molecule. This is possible for practically all (small) molecules relevant in practice. Here, we discuss SMILES, InChI, and SELFIES. For a deeper dive, we refer the reader to the SMILES __Talktorial T034__.\n",
"\n",
"**SMILES** (Simplified Molecular Input Line Entry Specification) is the most widely used text-based representation and can be handled by all common frameworks. When we specify a molecule in RDKit, we often use SMILES notation (more on this in the practical part): \n",
"\n",
@@ -379,8 +394,14 @@
"The main problem with SMILES for molecule representation is that two (or more) different SMILES strings might refer to the same molecule. Researchers try to circumvent this by resorting to a **canonical SMILES** notation. However, the canonicalization depends on the canonicalization algorithms and is therefore not standardized. \n",
"\n",
"In the other direction, a single SMILES string typically identifies no more than one molecule. However, when stereochemistry information is not given in the SMILES string, it leaves room for ambiguity (in some cases, it might not even be possible to remove all ambiguity for different molecular configurations). \n",
"\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**InChI** (International Chemical Identifier) is a more modern and also widely-used alternative to SMILES. The key advantage is that it exhibits less chemical ambiguity and that a standard canonical exists. \n",
"The downside is that it is difficult for humans to read.\n",
"\n",
@@ -396,6 +417,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "WE_GZsYbSH6L"
@@ -422,11 +444,8 @@
"\n",
"**Permutation invariance:** Assume you build a machine learning model that takes as input molecular graphs and outputs some prediction. It would be desirable that your model guarantees that isomorphic graphs (like *Graph 1* and *Graph 2*) generate the same output.\n",
"We call neural networks (or functions in general) that have these guarantees (node-)permutation invariant (or equivariant for node-level outputs).\n",
"However, there is a trade-off. Functions that are permutation invariant are typically not universal. That is, they are not able to tell all graphs that are non-isomorphic apart. If both were given, permutation invariance and the ability to produce a different output for all non-isomorphic graphs, our neural network would solve the graph isomorphism problem (which is computationally extremely difficult). \n",
"\n",
"\n",
"However, there is a trade-off. Functions that are permutation invariant are typically not universal. That is, they are not able to tell all graphs that are non-isomorphic apart. If both were given, permutation invariance and the ability to produce a different output for all non-isomorphic graphs, our neural network would solve the graph isomorphism problem (which is computationally extremely difficult). This is also discussed in the __Talktorial T035__ about graph neural networks. \n",
"\n",
"TODO: link to talktorial (<span style=\"color:pink\">resolve this</span>)\n",
"\n",
"**Representational power:** Another problem is that graphs do not contain 3D information. \n",
"Specifically, different [isomers](https://en.wikipedia.org/wiki/Isomer) can correspond to the same molecular graph but differ in the relative 3D positions of the atoms. These are called spatial isomers. \n",
@@ -507,6 +526,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "XRA2lGEkZODP"
@@ -526,7 +546,7 @@
"\n",
"They can be easily used for classical machine learning tasks because the architecture does not need to be invariant/equivariant to the node-ordering or geometric operations.\n",
"\n",
"[T004 · Ligand-based screening: compound similarity](https://projects.volkamerlab.org/teachopencadd/talktorials/T004_compound_similarity.html) explains several molecular fingerprints.\n",
"__Talktorial T004__ explains several molecular fingerprints.\n",
"\n",
"---\n",
"\n",
@@ -1031,12 +1051,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "AWII15ft83YJ"
},
"source": [
"NGLViewer allows us to see a ball-and-stick visualization (example taken from [this](http://nglviewer.org/nglview/release/v0.6.1/api.html) tutorial, we also refer the reader to [Talktorial 009](https://github.com/volkamerlab/teachopencadd/blob/gg-033-molecular_representations/teachopencadd/talktorials/T009_compound_ensemble_pharmacophores/talktorial.ipynb))."
"NGLViewer allows us to see a ball-and-stick visualization (example taken from [this](http://nglviewer.org/nglview/release/v0.6.1/api.html) tutorial, we also refer the reader to __Talktorial T009__)."
]
},
{
@@ -1111,7 +1132,7 @@
},
"outputs": [],
"source": [
"# generate moleucle from Smiles\n",
"# generate molecule from Smiles\n",
"aspirin = Chem.MolFromSmiles(\"CC(=O)OC1=CC=CC=C1C(=O)O\")"
]
},
@@ -1677,7 +1698,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.9.15"
},
"toc-autonumbering": true,
"vscode": {
Binary file not shown.