# Similarity and Substructure Search

Last updated: 7/27/20

Methods for similarity and substructure search are included in the `mongordkit.Search` module.

In [1]:
from mongordkit.Search import similarity, substructure, utils
from mongordkit import Search
from mongordkit.Database import create, write
from rdkit import Chem
import pymongo

## Reset Cells

Run these cells to reset the MongoDB database used in this notebook.

In [2]:
client = pymongo.MongoClient()
client.drop_database('demo_db')
demo_db = client.demo_db

## Preparing for Search
Preparing a database for search means adding several fingerprints and hashes to it. `Search.PrepareForSearch` performs all of the setup required for both similarity and substructure search in a single call. A typical workflow goes straight from the following two lines into search calls:

In [3]:
write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')
Search.PrepareForSearch(demo_db, demo_db.molecules, demo_db.mfp_counts, demo_db.permutations)

populating mongodb collection with compounds from SDF...
200 molecules successfully imported
0 duplicates skipped
Preparing database and collections for search...
Added pattern fps, morgan fps, and support for LSH.


However, the rest of this notebook will explicitly note the addition of fingerprints and hashes in an effort to better communicate how the code actually works. Let's reset the database again so that we can insert the hashes step by step without any issues.

In [None]:
client.drop_database('demo_db')
demo_db = client.demo_db

## Similarity Search

`mongordkit.Search.similarity` works best on a MongoDB collection prepared by `mongordkit.Database.write`. For basic similarity search, any collection will do as long as its documents have the following fields:
- `'rdmol': binary pickle object`
- `'index': a unique identifier for each molecule`
- `'fingerprints': {a nested document that can be blank at the start}`
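
Before running a search against a collection that was not written by `write`, it can help to check that its documents have this shape. The following is a minimal sketch, not part of `mongordkit`: the helper name `missing_fields` and the example values are hypothetical.

```python
# Top-level fields named in the list above.
REQUIRED_FIELDS = ("rdmol", "index", "fingerprints")


def missing_fields(document):
    """Return the required top-level fields that are absent from a document."""
    return [field for field in REQUIRED_FIELDS if field not in document]


# Hypothetical document shaped like the list above. The bytes value stands in
# for a pickled RDKit molecule and the index value is a placeholder.
example = {"rdmol": b"placeholder", "index": "EXAMPLE-INDEX", "fingerprints": {}}
print(missing_fields(example))
```

The same check could be applied to a document returned by `find_one()` on the collection in question.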

Let's run through an example of similarity search. First, we'll write into the database 200 molecules from a data file included in the `mongordkit` package. We will use default write settings.

In [None]:
write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')

`similarity.SimSearchNaive` loops directly through the collection and scores every molecule. This is useful for verifying accuracy, but it is extremely slow on anything larger than a small database. To avoid a full scan, `similarity` supports precalculating Morgan fingerprints (default radius 2, length 2048) for screening through `similarity.AddMorganFingerprints`. For each document in the passed-in collection, this method adds the nested field `morgan_fp`, with subfields `bits` and `count`, to the document's `fingerprints` field. `AddMorganFingerprints` also creates indices on the `bits` and `count` subfields of `morgan_fp` to speed up search.

In [None]:
similarity.AddMorganFingerprints(demo_db.molecules, demo_db.mfp_counts)

In [None]:
demo_db.molecules.find_one()['fingerprints']['morgan_fp']
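
The stored bits and counts are what make screening possible. The sketch below illustrates the general idea of count-based screening on plain Python sets; it is an assumption about the conventional technique, not a copy of the `mongordkit` implementation, and all function names are hypothetical. A molecule with `n` on bits can score at most `min(q, n) / max(q, n)` against a query with `q` on bits, so any molecule whose bound falls below the threshold can be skipped without computing the full Tanimoto coefficient.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_possible_tanimoto(count_a, count_b):
    """Upper bound on the Tanimoto coefficient given only the on-bit counts."""
    larger = max(count_a, count_b)
    return min(count_a, count_b) / larger if larger else 0.0


def screened_search(query_bits, library, threshold):
    """Return [tanimoto, index] pairs at or above the threshold.

    library maps an index to the set of on bits of that molecule.
    """
    hits = []
    for index, bits in library.items():
        # Cheap screen: the counts alone rule this molecule out.
        if best_possible_tanimoto(len(query_bits), len(bits)) < threshold:
            continue
        score = tanimoto(query_bits, bits)
        if score >= threshold:
            hits.append([score, index])
    return sorted(hits, reverse=True)


# Toy fingerprints, for illustration only.
toy_library = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4, 5}, "c": {1}, "d": {7, 8, 9, 10}}
print(screened_search({1, 2, 3, 4}, toy_library, 0.8))
```

In a database setting the count test becomes a range query on an indexed field, which is why an index on the count helps.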

From here, we can directly perform similarity search. `similarity` provides two methods that take advantage of fingerprint screening: `SimSearch` and `SimSearchAggregate`. The latter shifts much of the computation onto the MongoDB server by using an aggregation pipeline, which can dramatically improve performance when working with sharded MongoDB servers.

In [None]:
q_mol = Chem.MolFromSmiles('Cc1ccccc1')

# Perform a similarity search on demo_db.molecules for q_mol with a Tanimoto threshold of 0.8.
results1 = similarity.SimSearch(q_mol, demo_db.molecules, demo_db.mfp_counts, 0.8)

# Do the same thing, but use the MongoDB Aggregation Pipeline. 
results2 = similarity.SimSearchAggregate(q_mol, demo_db.molecules, demo_db.mfp_counts, 0.8)

print('SimSearch: {}'.format(results1))
print('\n')
print('SimSearchAggregate: {}'.format(results2))

Note that the search returns only the index of each molecule, which in this case is the InChIKey. Going from an index to the full molecule document takes a single query on the collection. Returning indices also makes it easier to retrieve molecules when one index represents multiple tautomers or isomers in the collection.
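
Looking up the full documents for a result list can be sketched as follows. This assumes the results have the `[tanimoto, index]` format listed in the contents section and that documents store the identifier under `index`; the helper name `fetch_documents` is hypothetical and not part of `mongordkit`.

```python
def fetch_documents(mol_collection, results):
    """Return the full documents for a list of [tanimoto, index] results.

    mol_collection is expected to behave like a pymongo collection.
    """
    indices = [index for _, index in results]
    # A single $in query retrieves every matching document at once.
    return list(mol_collection.find({"index": {"$in": indices}}))
```

For example, `fetch_documents(demo_db.molecules, results1)` would return one document per stored molecule whose index appears in `results1`.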

`SimSearch` and `SimSearchAggregate` both use the conventional fingerprint screening method. `similarity` also supports searching with Locality Sensitive Hashing (LSH), as described by ChEMBL in an excellent [blog post](http://chembl.blogspot.com/2015/08/lsh-based-similarity-search-in-mongodb.html). The method here is called `SimSearchLSH` and requires a little more setup work:

In [None]:
# Generate 100 different permutations of length 2048 and save them in demo_db.permutations as separate documents.
similarity.AddRandPermutations(demo_db.permutations)

# Add locality-sensitive hash values to each document in demo_db.molecules by splitting the 100 different permutations
# in demo_db.permutations into 25 different buckets.
similarity.AddLocalityHashes(demo_db.molecules, demo_db.permutations, 25)

# Create 25 different collections in demo_db, each storing a subset of hash values for molecules in demo_db.molecules.
similarity.AddHashCollections(demo_db, demo_db.molecules)
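
To make the three setup steps more concrete, here is a small pure-Python sketch of min-hash style LSH on sets of on bits. It uses the standard library `random` module rather than `numpy`, and the function names are hypothetical; it illustrates the general technique under those assumptions and is not the `mongordkit` implementation.

```python
import random


def make_permutations(length, count, seed=None):
    """Generate `count` random permutations of the positions 0..length-1."""
    rng = random.Random(seed)
    return [rng.sample(range(length), length) for _ in range(count)]


def min_hashes(on_bits, permutations):
    """For each permutation, the smallest permuted position of any on bit.

    on_bits must be a non-empty collection of bit positions.
    """
    return [min(perm[bit] for bit in on_bits) for perm in permutations]


def bucket_hashes(hashes, n_buckets):
    """Split the min-hash values into n_buckets equally sized groups."""
    if len(hashes) % n_buckets:
        raise ValueError("number of permutations must be divisible by n_buckets")
    size = len(hashes) // n_buckets
    return [tuple(hashes[i:i + size]) for i in range(0, len(hashes), size)]


# Small sizes for illustration; the defaults above are 100 permutations of
# length 2048 split into 25 buckets.
perms = make_permutations(length=16, count=8, seed=0)
print(bucket_hashes(min_hashes({1, 5, 9}, perms), n_buckets=4))
```

Two fingerprints become search candidates for each other when at least one of their buckets is identical, which is why one collection per bucket is enough to look candidates up.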

Now let's try a search using the query molecule from earlier:

In [None]:
q_mol = Chem.MolFromSmiles('Cc1ccccc1')

results3 = similarity.SimSearchLSH(q_mol, demo_db, demo_db.molecules, demo_db.permutations, threshold=0.8)

print('SimSearchLSH: {}'.format(results3))

The LSH algorithm relies on random permutations generated with the `numpy` module, so its results are non-deterministic. LSH is well suited to *scanning* datasets, since it runs faster on large datasets than either of the screening-based search methods, but it is less accurate than regular similarity search, especially at thresholds below 0.7. Specific notes on benchmarks can be found in "Benchmarking Similarity Search."
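
The loss of accuracy at lower thresholds can be estimated with the standard min-hash banding formula. Under a random permutation, two sets share a min-hash value with probability equal to their Tanimoto similarity `s`, so with `r` hash values per bucket and `b` buckets the chance that at least one bucket matches is `1 - (1 - s**r)**b`. The sketch below assumes this textbook model applies to the default settings (100 permutations in 25 buckets, so 4 values per bucket); the function name is hypothetical.

```python
def candidate_probability(similarity, values_per_bucket, n_buckets):
    """Probability that two fingerprints share at least one identical bucket."""
    bucket_match = similarity ** values_per_bucket
    return 1.0 - (1.0 - bucket_match) ** n_buckets


# Defaults from this module: 100 permutations split into 25 buckets.
for s in (0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, 4, 25), 3))
```

Under this model a pair with similarity 0.8 is almost always found, while a pair with similarity 0.4 is found in just under half of the cases.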

## Substructure Search

`mongordkit.Search.substructure` works best on collections prepared by `write`. Requirements are identical to those for similarity search: a `molecules` collection whose documents have `rdmol` and `index` fields.

`substructure.SubSearchNaive` provides a fingerprint-less, slower implementation of substructure search suitable for very small databases:

In [None]:
q_mol = Chem.MolFromSmiles('C1=CC=CC=C1OC')

# Perform a substructure search for q_mol on demo_db.molecules.
substructure.SubSearchNaive(q_mol, demo_db.molecules, chirality=False)

By adding pattern fingerprints, which are optimized for substructure search, we can use `substructure.SubSearch`, which takes advantage of fingerprint screening to avoid as many expensive calls to `HasSubstructMatch` as possible. 

In [None]:
substructure.AddPatternFingerprints(demo_db.molecules)
substructure.SubSearch(q_mol, demo_db.molecules, chirality=False)
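
The screening idea behind `SubSearch` can be sketched without a database. A pattern fingerprint is built so that every bit set for a substructure is also set for any molecule containing it, so a molecule missing one of the query's bits can be discarded before the expensive exact check. The code below is a toy illustration under that assumption: strings stand in for molecules, substring containment stands in for `HasSubstructMatch`, and the function name is hypothetical.

```python
def screened_substructure_search(query_fp, library, matches):
    """Screen by fingerprint, then verify the survivors.

    query_fp -- set of on bits for the query pattern
    library  -- iterable of (index, fingerprint, molecule) tuples
    matches  -- callable(molecule) -> bool, the expensive exact check
    """
    hits = []
    for index, fingerprint, molecule in library:
        if not query_fp <= fingerprint:
            continue  # a required bit is missing, so no match is possible
        if matches(molecule):
            hits.append(index)
    return hits


# Toy stand-in: the set of characters in a string plays the role of the fingerprint.
query = "bc"
toy_library = [(text, set(text), text) for text in ("abc", "acb", "xyz")]
calls = []


def exact_check(text):
    calls.append(text)  # record how often the expensive check runs
    return query in text


toy_hits = screened_substructure_search(set(query), toy_library, exact_check)
print(toy_hits, calls)
```

The screen can let false positives through, which the exact check then rejects, but it never discards a true match.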

## `.Search` Contents

mongordkit.Search.**PrepareForSearch**(db (*MongoDB database for hash information*), mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), perm_collection (*MongoDB collection*)) --> None

## `.similarity` Contents

### Constants:
- DEFAULT_THRESHOLD = 0.8
- DEFAULT_MORGAN_RADIUS = 2
- DEFAULT_MORGAN_LENGTH = 2048
- DEFAULT_BIT_N = 2048
- DEFAULT_BUCKET_N = 25
- DEFAULT_PERM_LEN = 2048
- DEFAULT_PERM_N = 100

mongordkit.Search.similarity.**AddMorganFingerprints**(mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), radius=2 (*int: radius of Morgan fingerprint*), length=2048 (*int: length of Morgan fingerprint bit vector*)) --> None

mongordkit.Search.similarity.**SimSearchNaive**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*

mongordkit.Search.similarity.**SimSearch**(mol (*rdmol object*), mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*

mongordkit.Search.similarity.**SimSearchAggregate**(mol (*rdmol object*), mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*

mongordkit.Search.similarity.**AddRandPermutations**(perm_collection (*MongoDB collection*), len=2048 (*int: length corresponding to length of fingerprint bit vectors*), num=100 (*int: number of permutations*)) --> None

mongordkit.Search.similarity.**AddLocalityHashes**(mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), nBuckets=25 (*int: number of hash buckets. The number of permutations must be divisible by nBuckets*)) --> None

mongordkit.Search.similarity.**AddHashCollections**(db (*MongoDB database*), mol_collection (*MongoDB collection*)) --> None

mongordkit.Search.similarity.**SimSearchLSH**(mol (*rdmol object*), db (*MongoDB database containing hash collections*), mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*

## `.substructure` Contents

mongordkit.Search.substructure.**AddPatternFingerprints**(mol_collection (*MongoDB collection*), length=2048 (*int: length of Pattern fingerprint bit vector*)) --> None

mongordkit.Search.substructure.**SubSearchNaive**(pattern (*rdmol object*), db (*MongoDB collection*), chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*

mongordkit.Search.substructure.**SubSearch**(pattern (*rdmol object*), db (*MongoDB collection*), chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*