# Substructure Search Benchmarking

These benchmarks were originally run on an early 2015 MacBook Pro with a 2.7 GHz dual-core i5 processor and 8 GB of memory.

In addition to the dependencies listed below, they make use of three sets of fragments and patterns you can find in `mongordkit/data`. All of the large chemical databases that we search against are constructed from ChEMBL_27. 

## Setup Work
### Imports

In [1]:
from __future__ import print_function
import random, gzip, time, mongordkit, pymongo, rdkit, matplotlib
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import AllChem
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
import matplotlib.pyplot as plt
import numpy as np
import sys
import pandas as pd
from IPython.display import display, HTML

from mongordkit.Database import write
from mongordkit.Search import similarity
from mongordkit.Search import substructure
from mongordkit import Search

### Database Setup
Here we set up a database called `test` that will hold our molecules. We will construct a collection called `molecules_100K` to hold the first 100,000 molecules in the ChEMBL_27 dataset and a collection called `molecules_1M` to hold the first 1,000,000 molecules in the ChEMBL_27 dataset. If you have already run benchmarks from `mongo-rdkit` on your local MongoDB instance, these should have been set up already.

In [3]:
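# Connect to the `test` database (assumes a local MongoDB instance on the default host and port).
db = pymongo.MongoClient().test

# If necessary, write the first 100,000 compounds to molecules_100K.
if db.molecules_100K.count_documents({}) != 100000:
    write.WriteFromSDF(db.molecules_100K, '../../../chembl_27.sdf', chunk_size=1000, limit=100000)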

populating mongodb collection with compounds from SDF...
100000 molecules successfully imported
1 duplicates skipped


100000

In [None]:
# If necessary, write the first 1,000,000 compounds to molecules_1M.
if db.molecules_1M.count_documents({}) != 1000000:
    write.WriteFromSDF(db.molecules_1M, '../../../chembl_27.sdf', chunk_size=1000, limit=1000000)

populating mongodb collection with compounds from SDF...


In [7]:
# Let's ensure that there are actually 100,000 and 1M documents in these collections, respectively.
print(f"In molecules_100K: {db.molecules_100K.count_documents({})} documents")
print(f"In molecules_1M: {db.molecules_1M.count_documents({})} documents")

In molecules_100K: 100000 documents
In molecules_1M: 180512 documents


In [None]:
# Next, we have to prepare all of the documents in our collections for search by adding in fingerprints.
substructure.AddPatternFingerprints(db.molecules_100K)
substructure.AddPatternFingerprints(db.molecules_1M)
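
The idea behind the fingerprints added above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `mongordkit` or RDKit implement it: fingerprints are modelled as integers used as bitsets, molecules are modelled as sets of made-up feature names, and `toy_fingerprint`, `passes_screen`, and `screened_search` are hypothetical helpers. The point it shows is that a molecule can only contain the query if its fingerprint has every bit the query's fingerprint has, so a cheap bitwise test can discard most molecules before the expensive exact match.

```python
# Toy illustration of fingerprint screening; NOT the mongordkit implementation.
# Molecules and queries are modelled as sets of made-up feature names.

def toy_fingerprint(features, n_bits=64):
    """Set one bit per feature using a simple deterministic hash (illustrative only)."""
    fp = 0
    for feature in features:
        fp |= 1 << (sum(feature.encode()) % n_bits)
    return fp

def passes_screen(query_fp, mol_fp):
    """True if every bit set in the query fingerprint is also set in the molecule's."""
    return (query_fp & mol_fp) == query_fp

def screened_search(query, molecules, exact_match):
    """Screen by fingerprint first, then run the exact check on the survivors only."""
    query_fp = toy_fingerprint(query)
    candidates = [m for m in molecules if passes_screen(query_fp, toy_fingerprint(m))]
    hits = [m for m in candidates if exact_match(query, m)]
    return hits, len(candidates)

molecules = [
    frozenset({"ring", "amide", "halogen"}),
    frozenset({"ring", "halogen"}),
    frozenset({"amide"}),
    frozenset({"ring", "amide"}),
]
query = frozenset({"ring", "amide"})

# Here the "exact match" is a subset test standing in for a real substructure match.
hits, n_candidates = screened_search(query, molecules, lambda q, m: q <= m)
print(f"{n_candidates} of {len(molecules)} molecules passed the screen; {len(hits)} matched")
```

Bit collisions can let non-matching molecules through the screen (false positives), which is why the exact check is still required; the screen never rejects a true match.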

### Query Set Setup
For our queries, we'll use three sets of patterns identified by Greg Landrum in one of his [blog posts](http://rdkit.blogspot.com/2013/11/fingerprint-based-substructure.html) on substructure searching and discussed in this [mailing list](http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02066.html) and this [presentation](http://www.hinxton.wellcome.ac.uk/advancedcourses/MIOSS%20Greg%20Landrum.pdf). They are: 
- Fragments: 500 diverse molecules taken from the ZINC Fragments set
- Leads: 500 diverse molecules taken from the ZINC Lead-like set
- Pieces: 823 pieces of molecules obtained by doing a BRICS fragmentation of some molecules from the PubChem screening set.

In [6]:
# The .smi files hold one "SMILES name" pair per line; keep only the SMILES column.
with open('../../data/zinc.frags.500.q.smi') as f:
    fragments = [Chem.MolFromSmiles(line.split()[0]) for line in f]

with open('../../data/zinc.leads.500.q.smi') as f:
    leads = [Chem.MolFromSmiles(line.split()[0]) for line in f]

# fragqueries.q.txt holds one pattern per line.
with open('../../data/fragqueries.q.txt') as f:
    pieces = [Chem.MolFromSmiles(line) for line in f]

## Benchmarking
### Naive Substructure Search
`substructure.SubSearchNaive` simply loops through the dataset and checks each molecule for a substructure match. It is not benchmarked directly here because a single query takes upward of 5 seconds, which is far too slow to feel interactive.
### Substructure Search with Fingerprint Screening
Instead, we will benchmark the standard `SubSearch`, which makes use of fingerprint screening to dramatically increase efficiency. For each of our query sets, we will search all of their elements against `molecules_100K` and `molecules_1M`, then return the median and mean query times in seconds. 

In [11]:
def benchmark_query_set(query_set, dataset):
    results = []
    for pattern in query_set:
        start = time.time()
        substructure.SubSearch(pattern, dataset)
        end = time.time()
        results.append(end - start)
    return results
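
A possible variant of this timing loop is sketched below, as an option rather than what the benchmarks above used. It relies on `time.perf_counter`, a monotonic, high-resolution clock that is not affected by system clock adjustments the way `time.time` can be. The `time_queries` helper, its `repeats` parameter, and the stand-in search function are all hypothetical names introduced for illustration.

```python
import time

def time_queries(search_fn, query_set, repeats=1):
    """Return the best-of-`repeats` wall-clock time in seconds for each query."""
    timings = []
    for pattern in query_set:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(pattern)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings

# Stand-in search function so the sketch runs without a database; in the notebook
# this could be: lambda p: substructure.SubSearch(p, db.molecules_100K)
demo_times = time_queries(lambda p: sum(range(1000)), ["a", "b", "c"], repeats=3)
print(demo_times)
```

Taking the best of several repeats reduces noise from other processes, at the cost of running each query more than once.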

In [12]:
# Benchmark for search of all three query sets against 100K.
# This should take around five minutes; these calls can be commented out if necessary.
frag_times_100K = benchmark_query_set(fragments, db.molecules_100K)
lead_times_100K = benchmark_query_set(leads, db.molecules_100K)
pieces_times_100K = benchmark_query_set(pieces, db.molecules_100K)

results = [frag_times_100K, lead_times_100K, pieces_times_100K]
means_100K = [np.mean(times) for times in results]
medians_100K = [np.median(times) for times in results]

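# Use the lists computed above (the original referenced undefined names `means` and `medians`).
data = {'mean (100K)': means_100K, 'median (100K)': medians_100K}
df = pd.DataFrame(data, index=['fragments', 'leads', 'pieces'])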
df

| | mean (s) | median (s) |
|---|---|---|
| fragments | 0.06274 | 0.062074 |
| leads | 0.062592 | 0.062289 |
| pieces | 0.062739 | 0.06195 |


In [None]:
# Benchmark for search of all three query sets against 1M. 
# This should take around five minutes; these calls can be commented out if necessary.
frag_times_1M = benchmark_query_set(fragments, db.molecules_1M)
lead_times_1M = benchmark_query_set(leads, db.molecules_1M)
pieces_times_1M = benchmark_query_set(pieces, db.molecules_1M)

results = [frag_times_1M, lead_times_1M, pieces_times_1M]
means_1M = [np.mean(times) for times in results]
medians_1M = [np.median(times) for times in results]

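# Use the lists computed above (the original referenced undefined names `means` and `medians`).
data = {'mean (1M)': means_1M, 'median (1M)': medians_1M}
df = pd.DataFrame(data, index=['fragments', 'leads', 'pieces'])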
df

## Discussion

A median search time of less than 70 ms indicates decent performance: it is fast enough for single-molecule queries against large datasets to feel interactive (the traditional UI benchmark for instant feedback being 100 ms).
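
A median below the threshold does not by itself show how many individual queries stay under it. The sketch below is one way to check that, assuming a list of per-query times in seconds such as those returned by `benchmark_query_set`. The `summarize` helper and the example timings are hypothetical and are not measured results.

```python
import statistics

def summarize(timings, threshold=0.1):
    """Summarize per-query times (seconds) against a latency threshold (seconds)."""
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "fraction_under_threshold": sum(t < threshold for t in timings) / len(timings),
    }

# Made-up timings for illustration only; one slow outlier raises the mean
# above the threshold while the median stays well below it.
example = summarize([0.05, 0.06, 0.07, 0.25])
print(example)
```

In the notebook, this could be applied to `frag_times_100K`, `lead_times_100K`, and `pieces_times_100K` in turn.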

### Dataset Size
We are also interested in how this substructure search scales with the size of the dataset. To find out, we will run the same query set against datasets of increasing size, from 1,000 to 10,000 molecules.