import gzip
from rdkit import Chem
from rdkit.Chem import Draw,AllChem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import Image




<table style="border:none">
<tr style="border:none">
<td colspan=2 style="border:none">
</td>
</tr>
<tr style="border:none">
<td colspan=2 style="border:none">
<h1>RDKit: State of the toolkit (2016 UGM edition)</h1>
</td>
</tr>
<tr style="border:none">
<td style="border:none">
Greg Landrum, Ph.D.<br />
T5 Informatics, KNIME.com <br />
Basel, Switzerland<br />
<img style="align:left" align="left" src="images/T5.shaded.132.png" alt="T5 logo" />
</td>
<td style="border:none">
<img src="images/logo.lrg.png" alt="RDKit logo" />
</td>
</tr></table>

# An overview of the RDKit



## Open-source toolkit for cheminformatics
- Business-friendly BSD license
- Core data structures and algorithms in C++
- Python (2.x and 3.x) wrapper generated using Boost.Python
- Java and C\# wrappers generated with SWIG
- 2D and 3D molecular operations
- Descriptor generation for machine learning
- Molecular database cartridge for PostgreSQL
- Cheminformatics nodes for KNIME (distributed from the KNIME community site: http://tech.knime.org/community/rdkit)


## Ecosystem

![RDKit ecosystem](images/ecodesystem.png)

*Exact same algorithms/implementations accessible from many different endpoints*

## Operational
- http://www.rdkit.org
- Supports Mac/Windows/Linux
- Releases every 6 months
- Web presence:
    - Homepage: http://www.rdkit.org
      Documentation, links
    - Github (https://github.com/rdkit)
      Downloads, bug tracker, git repository
    - Sourceforge (http://sourceforge.net/projects/rdkit)
      Mailing lists
    - Blog (https://rdkit.blogspot.com)
      Tips, tricks, random stuff
    - Tutorials (https://github.com/rdkit/rdkit-tutorials)
      Jupyter-based tutorials for using the RDKit
    - KNIME integration (https://github.com/rdkit/knime-rdkit)
      RDKit nodes for KNIME
- Mailing lists at https://sourceforge.net/p/rdkit/mailman/, searchable archives available for [rdkit-discuss](http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/) and [rdkit-devel](http://www.mail-archive.com/rdkit-devel@lists.sourceforge.net/)
- Social media:
    - Twitter: @RDKit_org
    - LinkedIn: https://www.linkedin.com/groups/8192558
    - Google+: https://plus.google.com/u/0/116996224395614252219
    - Slack: https://rdkit.slack.com (invite required, contact Greg)

## History and Milestones:
- 2000-2006: initial development work at Rational Discovery
- 2006: code open sourced and released on sourceforge.net
- 2007: First NIBR contribution (chemical reaction handling); Noel discovers the RDKit (=first rdkit-discuss post?)
- 2008: first POC of Java wrapper; Mac support added; SLN and Mol2 parsers
- 2009: Morgan fingerprints; switch to cmake; switch to VF2 for SSS
- 2010: PostgreSQL cartridge; First iteration of the KNIME nodes; $RDBASE/Contrib appears; SaltRemover and FunctionalGroups code
- 2011: New Java wrappers; more functionality moved to C++; InChI support; Avalontools integration
- 2012: First UGM; Speed improvements; MCS implementation; IPython integration; “RDKit Cookbook” appears
- 2013: Move to github; Pandas integration; MMFF and Open3DAlign support; PDB support; rdkit blog started
- 2014: python3 support; conda integration; experimental lucene integration; MCS implementation in C++
- 2015: new drawing code; improved canonicalization algorithm; improved 3D coordinate generation; reduced memory usage
- 2016: Regular patch releases; easier builds; performance improvements; KNIME nodes move to Github 

## Functionality Overview: Basics
- Input/Output: SMILES/SMARTS, SDF, TDT, SLN [1](#footnote1), Corina mol2 [1](#footnote1), PDB, sequence notation, FASTA (peptides only), HELM (peptides only)
- Substructure searching
- Canonical SMILES
- Chirality support (i.e. R/S or E/Z labeling)
- Chemical transformations (e.g. remove matching substructures)
- Chemical reactions
- Molecular serialization (e.g. mol \<-\> text)
- 2D depiction, including constrained depiction
- Fingerprinting: Daylight-like, atom pairs, topological torsions, Morgan algorithm, “MACCS keys”, extended reduced graphs, etc.
- Similarity/diversity picking
- Gasteiger-Marsili charges
- Bemis and Murcko scaffold determination
- Salt stripping
- Functional-group filters
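
As a rough illustration of a few of the items above (SMILES parsing, canonical SMILES, substructure search, Morgan fingerprints, and Bemis-Murcko scaffolds), here is a minimal sketch using the Python wrappers. The molecules, the SMARTS query, and the fingerprint settings (radius 2, 2048 bits) are arbitrary choices made for this example and are not taken from the talk; newer RDKit releases may flag `GetMorganFingerprintAsBitVect` as deprecated.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Parse two SMILES strings (arbitrary example molecules).
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = Chem.MolFromSmiles("OC(=O)c1ccccc1O")

# Canonical SMILES: two different spellings of ethanol give the same string.
same_canonical = (Chem.MolToSmiles(Chem.MolFromSmiles("OCC")) ==
                  Chem.MolToSmiles(Chem.MolFromSmiles("CCO")))

# Substructure search with a SMARTS query for a carboxylic acid group.
acid_query = Chem.MolFromSmarts("C(=O)[OH]")
has_acid = aspirin.HasSubstructMatch(acid_query)

# Morgan fingerprints (radius 2, 2048 bits) compared by Tanimoto similarity.
fp_a = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, 2, nBits=2048)
similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Bemis-Murcko scaffold of aspirin, written back out as SMILES.
scaffold = MurckoScaffold.GetScaffoldForMol(aspirin)
scaffold_smiles = Chem.MolToSmiles(scaffold)

print(same_canonical, has_acid, round(similarity, 3), scaffold_smiles)
```

The same operations are available from the other wrappers and endpoints, since they all sit on top of the same C++ core.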

## Functionality Overview: 2D
- 2D pharmacophores [1](#footnote1)
- Hierarchical subgraph/fragment analysis
- RECAP and BRICS implementations
- Multi-molecule maximum common substructure [2](#footnote2)
- Enumeration of molecular resonance structures
- Molecular descriptor library:
  - Topological (κ3, Balaban J, etc.)
  - Compositional (Number of Rings, Number of Aromatic Heterocycles, etc.)
  - Electrotopological state (Estate)
  - clogP, MR (Wildman and Crippen approach)
  - “MOE like” VSA descriptors
  - MQN [6](#footnote6)
- Similarity Maps [7](#footnote7)
- Machine Learning:
  - Clustering (hierarchical, Butina)
  - Information theory (Shannon entropy, information gain, etc.)
- Tight integration with the [Jupyter](http://jupyter.org) notebook (formerly the IPython notebook) and [Pandas](http://pandas.pydata.org/).
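
A few of the 2D descriptors and the BRICS fragmentation listed above can be sketched as follows. The molecule is an arbitrary example chosen for this illustration, and the selection of descriptors is only a small sample of what the library offers.

```python
from rdkit import Chem
from rdkit.Chem import BRICS, Crippen, Descriptors, rdMolDescriptors

# Arbitrary example molecule (paracetamol).
mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")

# Wildman-Crippen logP and molar refractivity.
logp = Crippen.MolLogP(mol)
mr = Crippen.MolMR(mol)

# Topological descriptors.
balaban_j = Descriptors.BalabanJ(mol)
kappa3 = Descriptors.Kappa3(mol)

# Compositional descriptors.
n_rings = rdMolDescriptors.CalcNumRings(mol)
n_aromatic_heterocycles = rdMolDescriptors.CalcNumAromaticHeterocycles(mol)

# Molecular Quantum Numbers: a fixed-length list of integer counts.
mqn = rdMolDescriptors.MQNs_(mol)

# BRICS decomposition: a set of fragment SMILES with labelled attachment points.
fragments = sorted(BRICS.BRICSDecompose(mol))

print(round(logp, 2), round(mr, 2), n_rings, len(mqn), fragments)
```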


## Functionality Overview: 3D
- 2D-\>3D conversion/conformational analysis via distance geometry 
- UFF and MMFF94/MMFF94S implementations for cleaning up structures
- Pharmacophore embedding (generate a pose of a molecule that matches a 3D pharmacophore) [1](#footnote1)
- Feature maps
- Shape-based similarity
- RMSD-based molecule-molecule alignment
- Shape-based alignment (subshape alignment [3](#footnote3)) [1](#footnote1)
- Unsupervised molecule-molecule alignment using the Open3DAlign algorithm [4](#footnote4)
- Integration with PyMOL for 3D visualization
- Molecular descriptor library:
  - PMI, NPR, PBF, etc.
  - Feature-map vectors [5](#footnote5)
- Torsion Fingerprint Differences for comparing conformations [8](#footnote8)
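
The 2D to 3D workflow above (distance-geometry embedding, force-field clean-up, RMSD alignment, 3D descriptors) can be sketched like this. The molecule, the number of conformers, the random seed, and the iteration limit are all assumptions made for the example rather than recommended settings.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

# Arbitrary example molecule; hydrogens are added before embedding.
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1"))

# Distance-geometry embedding of several conformers (fixed seed for repeatability).
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=5, randomSeed=42))

# MMFF94 clean-up of every conformer; each result is (not_converged, energy).
results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)
energies = [energy for _, energy in results]
best_id = conf_ids[energies.index(min(energies))]

# Align all conformers onto the first one; rms_list receives one RMSD value
# (computed over all atoms) for each of the other conformers.
rms_list = []
AllChem.AlignMolConformers(mol, RMSlist=rms_list)

# Shape descriptors for the lowest-energy conformer.
npr1 = rdMolDescriptors.CalcNPR1(mol, confId=best_id)
npr2 = rdMolDescriptors.CalcNPR2(mol, confId=best_id)
pbf = rdMolDescriptors.CalcPBF(mol, confId=best_id)

print(len(conf_ids), best_id, round(npr1, 3), round(npr2, 3), round(pbf, 3))
```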


## Documentation
[Overview](http://rdkit.readthedocs.org/en/latest/):

![docs overview](images/docs_overview.png)

Generated with Sphinx (the standard Python documentation tool)

## Documentation
[Sample](http://rdkit.readthedocs.org/en/latest/GettingStartedInPython.html#reading-single-molecules):

![doc zoom](images/docs_zoom.png)

All Python code samples are *tested* to protect against doc-rot.
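
One common way to test code samples embedded in documentation is Python's `doctest` module. The sketch below only illustrates that idea: the snippet text and the names `DOC_SNIPPET` and `doc_test` are made up for this example, and it is not a description of the actual RDKit test setup.

```python
import doctest

# A documentation fragment containing an interactive-session style sample.
DOC_SNIPPET = """
Reading a single molecule:

>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('Cc1ccccc1')
>>> m.GetNumAtoms()
7
"""

# Extract the examples from the text and run them, comparing the real output
# with the output recorded in the documentation.
parser = doctest.DocTestParser()
doc_test = parser.get_doctest(DOC_SNIPPET, {}, "doc_snippet", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(doc_test)

print(results.attempted, "examples run,", results.failed, "failed")
```

If a later release changed the behaviour shown in the sample, the run would report a failure instead of letting the documentation drift silently.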


## Tutorials

[Github repo](https://github.com/rdkit/rdkit-tutorials)
![tutorial](images/tutorial_example.png)

All Python code samples are *tested* to protect against doc-rot.


## Footnotes
<a name="footnote1">1</a>: These implementations are functional but are not necessarily the best, fastest, or most complete.

<a name="footnote2">2</a>: Originally contributed by Andrew Dalke

<a name="footnote3">3</a>: Putta, S., Eksterowicz, J., Lemmen, C. & Stanton, R. "A Novel Subshape Molecular Descriptor" *Journal of Chemical Information and Computer Sciences* **43:1623–35** (2003).

<a name="footnote4">4</a>: Tosco, P., Balle, T. & Shiri, F. "Open3DALIGN: an open-source software aimed at unsupervised ligand alignment." *J Comput Aided Mol Des* **25:777–83** (2011).

<a name="footnote5">5</a>: Landrum, G., Penzotti, J. & Putta, S. "Feature-map vectors: a new class of informative descriptors for computational drug discovery" *Journal of Computer-Aided Molecular Design* **20:751–62** (2006).

<a name="footnote6">6</a>: Nguyen, K. T., Blum, L. C., van Deursen, R. & Reymond, J.-L. "Classification of Organic Molecules by Molecular Quantum Numbers." *ChemMedChem* **4:1803–5** (2009).

<a name="footnote7">7</a>: Riniker, S. & Landrum, G. A. "Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods." *Journal of Cheminformatics* **5:43** (2013).

<a name="footnote8">8</a>: Schulz-Gasch, T., Schärfer, C., Guba, W. & Rarey, M. "TFD: Torsion Fingerprints As a New Measure To Compare Small Molecule Conformations." *J. Chem. Inf. Model.* **52:1499–1512** (2012).

<a name="footnote9">9</a>: Riniker, S. & Landrum, G. A. "Better informed distance geometry: Using what we know to improve conformation generation." *J. Chem. Inf. Model.* **55:2562–74** (2015). 


# Sustainability

Solving the bus problem...

- This clearly isn’t just a hobby project any more
- Used internally in NIBR and other companies in multiple production systems
- Contributions (features, bug fixes, etc) coming in from the community, including from other companies
- I’m no longer the only one answering questions on the mailing list
- Part of other open-source projects


# Community

The core of any open-source project


## Who's using it?

Hard to say with any certainty

- Active contributors to the mailing list from:
  - Big pharma
  - Small pharma/biotech
  - Software/Services
  - Academia
- All of the last three UGMs at capacity with 40+ attendees
- Contributions coming from the community:
  - bug reports  
  - wiki pages
  - code and documentation patches
  - changes to the build system
  - active use in other systems
- Community contributions for packaging:
  - rpms/debs for Fedora/Debian Linux
  - homebrew recipe for MacOS
  - conda packages


## Contrib dir

The Contrib directory, part of the standard RDKit distribution, includes code that has been contributed by members of the community.

### LEF: Local Environment Fingerprints

Contains python source code from the publications:

-   A. Vulpetti, U. Hommel, G. Landrum, R. Lewis and C. Dalvit, "Design and NMR-based screening of LEF, a library of chemical fragments with different Local Environment of Fluorine" *J. Am. Chem. Soc.* **131** (2009) 12949-12959. http://dx.doi.org/10.1021/ja905207t
-   A. Vulpetti, G. Landrum, S. Ruedisser, P. Erbel and C. Dalvit, "19F NMR Chemical Shift Prediction with Fluorine Fingerprint Descriptor" *J. of Fluorine Chemistry* **131** (2010) 570-577. http://dx.doi.org/10.1016/j.jfluchem.2009.12.024

Contribution from Anna Vulpetti

### M\_Kossner

Contains a set of pharmacophoric feature definitions as well as code for finding molecular frameworks.

Contribution from Markus Kossner

### PBF: Plane of best fit

Contains C++ source code and sample data from the publication:

N. C. Firth, N. Brown, and J. Blagg, "Plane of Best Fit: A Novel Method to Characterize the Three-Dimensionality of Molecules" *Journal of Chemical Information and Modeling* **52** 2516-2525 (2012). http://pubs.acs.org/doi/abs/10.1021/ci300293f

Contribution from Nicholas Firth

### mmpa: Matched molecular pairs

Python source and sample data for an implementation of the matched-molecular pair algorithm described in the publication:

Hussain, J., & Rea, C. "Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets." *Journal of chemical information and modeling* **50** 339-348 (2010). http://dx.doi.org/10.1021/ci900450m

Includes a fragment indexing algorithm from the publication:

Wagener, M., & Lommerse, J. P. "The quest for bioisosteric replacements." *Journal of chemical information and modeling* **46** 677-685 (2006).

Contribution from Jameed Hussain.

### SA\_Score: Synthetic accessibility score

Python source for an implementation of the SA score algorithm described in the publication:

Ertl, P. and Schuffenhauer A. "Estimation of Synthetic Accessibility Score of Drug-like Molecules based on Molecular Complexity and Fragment Contributions" *Journal of Cheminformatics* **1:8** (2009)

Contribution from Peter Ertl

### fraggle: A fragment-based molecular similarity algorithm

Python source for an implementation of the fraggle similarity algorithm developed at GSK and described in this RDKit UGM presentation: https://github.com/rdkit/UGM_2013/blob/master/Presentations/Hussain.Fraggle.pdf

Contribution from Jameed Hussain

### pzc: Tools for building and validating classifiers

Contribution from Paul Czodrowski

### ConformerParser: parser for Amber trajectory files

Contribution from Sereina Riniker

### AtomAtomSimilarity: atom-atom-path method for fragment similarity

Python source for an implementation of the Atom-Atom-Path similarity method for fragments described in the publication:

Gobbi, A., Giannetti, A. M., Chen, H. & Lee, M.-L. "Atom-Atom-Path similarity and Sphere Exclusion clustering: tools for prioritizing fragment hits." *J. Cheminformatics* **7:11** (2015). http://dx.doi.org/10.1186/s13321-015-0056-8

Contribution from Richard Hall

## Integration into other projects

- [ChEMBL Beaker](https://github.com/mnowotka/chembl_beaker) - standalone web server wrapper for RDKit and OSRA
- [myChEMBL](https://github.com/chembl/mychembl) ([blog post](http://chembl.blogspot.de/2013/10/chembl-virtual-machine-aka-mychembl.html), [paper](http://bioinformatics.oxfordjournals.org/content/early/2013/11/20/bioinformatics.btt666)) - A virtual machine implementation of open data and cheminformatics tools
- [ZINC](http://zinc15.docking.org) - Free database of commercially-available compounds for virtual screening
- [Coot](https://www2.mrc-lmb.cam.ac.uk/personal/pemsley/coot/) - software for macromolecular model building, model completion and validation
- [sdf_viewer.py](https://github.com/apahl/sdf_viewer) - an interactive SDF viewer
- [sdf2ppt](https://github.com/dkuhn/sdf2ppt) - Reads an SDFile and displays molecules as image grid in powerpoint/openoffice presentation.
- [MolGears](https://github.com/admed/molgears) - A cheminformatics tool for bioactive molecules
- [PYPL](http://www.biochemfusion.com/downloads/#OracleUtilities) - Simple cartridge that lets you call Python scripts from Oracle PL/SQL.
- [shape-it-rdkit](https://github.com/jandom/shape-it-rdkit) - Gaussian molecular overlap code shape-it (from silicos it) ported to RDKit backend
- [WONKA](http://wonka.sgc.ox.ac.uk/WONKA/) - Tool for analysis and interrogation of protein-ligand crystal structures
- [OOMMPPAA](http://oommppaa.sgc.ox.ac.uk/OOMMPPAA/) - Tool for directed synthesis and data analysis based on protein-ligand crystal structures
- [OCEAN](https://github.com/rdkit/OCEAN) - web-tool for target-prediction of chemical structures which uses ChEMBL as datasource
- [chemfp](http://chemfp.com)
- [rdkit_ipynb_tools](https://github.com/apahl/rdkit_ipynb_tools) - RDKit Tools for the IPython Notebook
- [chemicalite](https://github.com/rvianello/chemicalite) - SQLite integration for the RDKit
- [django-rdkit](https://github.com/rdkit/django-rdkit) - Django integration for the RDKit
- [Vernalis KNIME nodes](https://tech.knime.org/book/vernalis-nodes-for-knime-trusted-extension)
- [Erlwood KNIME nodes](https://tech.knime.org/community/erlwood)
- [AZOrange](https://github.com/AZcompTox/AZOrange)


<table style="border:none">
<tr style="border:none">
<td style="border:none; vertical-align:text-top">
<h1>Support</h1>
Another critical piece for any software project, open-source or otherwise.
</td>
<td rowspan=2 style="border:none; width:50%">
</td>
</tr>
<tr style="border:none">
<td style="border:none">
<h2>Options:</h2>
<ul>
<li> RDKit mailing list</li>
<li> Github </li>
<li> RDKit slack channel</li>
<li> Commercial (via T5 Informatics)</li>
</ul>
</td>
</tr></table>

## Patch Releases

- Starting with the 2016_03 version of the RDKit, we've been doing patch releases: about once a month we release a new version with just bug fixes.
- The changes are documented and these should always be safe to install.
- Some more thoughts on this appeared on the [T5 Informatics Blog](https://medium.com/@greg.landrum_t5/a-new-ish-rdkit-release-model-3efa17ff54b7#.8agtoocol)

Here's an example of the release notes from a [patch release](https://github.com/rdkit/rdkit/releases/tag/Release_2016_03_5):
```
  # Release_2016.03.5
  (Changes relative to Release_2016.03.4)

  ## Acknowledgements:
  Piotr Dabrowski, Markus Metz, Stephen Roughley, Riccardo Vianello

  ## Bug Fixes:
    - GetSSSR interrupts by segmentation fault
    (github issue #1023 from PiotrDabr)
    - typos in MMPA hash code
    (github issue #1044 from greglandrum)
    - Bond::BondDir::EITHERDOUBLE not exposed to python
    (github issue #1051 from greglandrum)
    - Fix leak with renumberAtoms() in the SWIG wrappers
    (github pull #1064 from greglandrum)
    - computeInitialCoords() should call the SSSR code before it calls assignStereochemistry()
    (github issue #1073 from greglandrum)
```
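
To see which release and patch level is installed, the version string can be split into its parts. This is a small sketch that assumes the `year.month.patch` numbering shown in the release notes above; `parse_release` is a helper written for this example and is not part of the RDKit.

```python
import re

import rdkit


def parse_release(version):
    """Split a version string such as '2016.03.5' into (year, month, patch)."""
    match = re.match(r"(\d{4})\.(\d{2})\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    year, month, patch = (int(part) for part in match.groups())
    return year, month, patch


# The installed version, e.g. a (year, month, patch) tuple for this environment.
installed = parse_release(rdkit.__version__)
print("RDKit release %d.%02d, patch level %d" % installed)
```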



<table style="border:none">
<tr style="border:none">
<td style="border:none; vertical-align:text-top">
<h2>Long-term support releases</h2>
<ul>
<li>The idea here is borrowed from Ubuntu Linux and some other open-source software packages: we will occasionally designate an RDKit release as a "long-term support release" (or "long-term release", the terminology isn't quite settled). We'll create patch releases with bug fixes against these releases for a fixed period of time, likely two years.</li>
<li>Some more thoughts on this appeared on the <a href="https://medium.com/@greg.landrum_t5/a-new-ish-rdkit-release-model-3efa17ff54b7#.8agtoocol">T5 Informatics Blog</a></li>
</ul>
</td>
<td style="border:none; width:30%">
</td>
</tr>
</table>


# The Future

Future work tends to be determined by what's needed for active projects or requests that come out of the community. So there's not really a roadmap.

But sometimes it's obvious. Here's a set of things that are already on the "ToDo" list or that are just being thought about.

## Some larger scale backend changes

### Moving to modern C++. 

The goal is to modernize the C++ codebase and allow the developers to work with more up-to-date tools.
This may also have a positive impact on performance/stability. 

We will start this after the "Q3" 2016 release. Here's some [more detail](https://medium.com/@greg.landrum_t5/the-rdkit-and-modern-c-48206b966218).

### Starting to make backwards incompatible changes.

The goal is to make the RDKit API simpler to use and understand and clean up some of the "less than optimal" decisions made early in the development of the toolkit.

Some specific planned changes:
- Dealing with the explicit/implicit hydrogen mess
- Improvements to the representation of stereochemistry

We will put a process in place for doing this before the Q1 2017 release, but will not necessarily start making the changes by then.
Here's some [more detail](https://medium.com/@greg.landrum_t5/breaking-with-the-past-making-backwards-incompatible-changes-in-the-rdkit-68e006579663).

## Some upcoming features/improvements

- Get the structure checker to v1.0
- Get the enumeration toolkit to v1.0
- More 3D descriptors
- Improved 3D integration in the jupyter notebook
- More KNIME nodes!


# Still lots of other stuff to do though...

## Technical Debt/Code improvements

- More demos/documentation for advanced functionality. *Started:* https://github.com/rdkit/rdkit-tutorials
- Ongoing performance improvements
- Explore use of GPUs
- Extend and better document the SWIG wrappers
- Switch to boost.logging

[This](http://bukai.pharm.or.jp/bukai_kozo/SARNews/SARNews_19.pdf) shouldn't be the only available tutorial for the ph4 embedding functionality:
![japanese ph4 tutorial](images/ph4_tutorial.png)

## Integrations
- Additional KNIME nodes
- Additional functionality for the PostgreSQL cartridge
- Getting the Lucene integration to v1
- Improve 3D integration with IPython notebook
- Interactive 2D sketches in IPython notebook
- Continued exploration of RDKit use in Javascript via emscripten
- Explore integration with one of the NoSQL document stores (e.g. MongoDB, CouchDB)
- Explore integration with Spark

## New features: 2D
- ongoing improvements in molecule-drawing code
- improved S group support
- pure RDKit molecular standardization
- get molecule hashing code to v1
- canonical tautomer generation
- canonical CTAB generation
- robust and flexible R-group decomposition
- implementation of a "scaffold hopping" fingerprint like ERG (extended reduced graphs)
- improved query-query matching to allow pseudo-Markush substructure searches

## New features: 3D
- implicit solvent model for the force fields
- implementation of a 3D pharmacophore fingerprint
- go beyond basics for 3D pharmacophore analysis
- get the pharmacophore embedding code to v1
- implementation of one-or-more shape-based fingerprints
- shape-based alignment
- other alignment-free 3D similarity approaches
- generation of molecular surfaces
- molecular-interaction fields (to allow 3D QSAR)
- template-guided embedding in a protein pocket

# Thanks!
