# Similarity and Substructure Search

Last updated: 7/11/20

Methods for similarity and substructure search are included in the `mongordkit.Search` module.

In [7]:
from mongordkit.Search import similarity, substructure, utils
from mongordkit.Database import create, write
from rdkit import Chem
import pymongo

## Reset Cells

Run these cells to reset the local MongoDB instance used in this notebook.

In [3]:
client = pymongo.MongoClient()
print(client.list_database_names())
client.drop_database('TestDatabase')
print(client.list_database_names())

['TestDatabase', 'admin', 'config', 'db', 'local']
['admin', 'config', 'db', 'local']


## Similarity Search

`mongordkit.Search.similarity` works best on a database prepared by `mongordkit.Database.write`. Any other database also works, provided it has a `molecules` collection in which each document has the following fields:
- `'rdmol': binary pickle object`
- `'smiles': some SMILES string`
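
As a quick check of the schema above, the following sketch tests whether a document carries the two required fields. `has_required_fields` is a hypothetical helper, not part of `mongordkit`, and it assumes the pickled molecule arrives as `bytes`, which is how pymongo returns plain BSON binary data in Python 3.

```python
def has_required_fields(doc):
    """Return True if doc has the 'rdmol' and 'smiles' fields described above."""
    # Assumption: 'rdmol' holds pickled bytes and 'smiles' a non-empty string.
    return (
        isinstance(doc.get("rdmol"), (bytes, bytearray))
        and isinstance(doc.get("smiles"), str)
        and bool(doc["smiles"])
    )


# Placeholder bytes stand in for a real pickled molecule.
example = {"rdmol": b"\x00placeholder", "smiles": "Cc1ccccc1"}
print(has_required_fields(example))            # True
print(has_required_fields({"smiles": "CCO"}))  # False
```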

Let's run through an example of similarity search. First, we'll have to set up our database:

In [4]:
TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)
write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')

populating mongodb collection with compounds from chembl...
200 molecules successfully imported


200

`similarity.SimSearchNaive` loops directly through the database and returns the results. However, this implementation is extremely slow for any decently sized database. Instead, `similarity` supports precalculating the following kinds of fingerprints for screening:
- Morgan (bit vector length set by the `length` argument)

Fingerprints are added with `similarity.addMorganFingerprints`. For each document in the `molecules` collection of the database passed in, this method creates a nested field of the form `{morgan_fp: {bits: ..., count: ...}}`, where `bits` lists the indices of the on bits and `count` is the number of on bits. Note that `addMorganFingerprints` also creates indices on `morgan_fp.bits` and `morgan_fp.count` to speed up searches.

In [5]:
similarity.addMorganFingerprints(TestDB, radius=2, length=1024)

In [6]:
TestDB.molecules.find_one()['morgan_fp']

{'bits': [33,
  56,
  84,
  130,
  313,
  314,
  356,
  547,
  650,
  698,
  744,
  747,
  849,
  853,
  967],
 'count': 15}
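
To show how a Tanimoto score relates to stored `bits` lists like the one above, here is a minimal sketch in plain Python. The function name `tanimoto` and the two example fingerprints are illustrative assumptions; they are not taken from `mongordkit` or from the database in this notebook.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as lists of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Two empty fingerprints share nothing; report 0.0 instead of dividing by zero.
    return common / union if union else 0.0


# Hypothetical documents in the same shape as the 'morgan_fp' field.
fp1 = {"bits": [1, 5, 9, 12], "count": 4}
fp2 = {"bits": [1, 5, 20], "count": 3}
print(tanimoto(fp1["bits"], fp2["bits"]))  # 2 shared bits / 5 distinct bits = 0.4
```

Because `count` is stored alongside `bits`, the denominator can be computed from the two counts and the size of the intersection, without building the union.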

From here, we can directly perform similarity search. `similarity` provides two methods that take advantage of fingerprint screening: `similaritySearch` and `similaritySearchAggregate`. The latter shifts much of the computation onto the MongoDB server by using an aggregation pipeline, which may help when the server is powerful or sharded.

In [19]:
q_mol = Chem.MolFromSmiles('Cc1ccccc1')

# Perform a similarity search on TestDB for q_mol with a Tanimoto threshold of 0.35.
results1 = similarity.similaritySearch(q_mol, TestDB, 0.35)

# Do the same thing, but use the MongoDB Aggregation Pipeline. 
results2 = similarity.similaritySearchAggregate(q_mol, TestDB, 0.35)

print('similaritySearch: {}'.format(results1))
print('\n')
print('similaritySearchAggregate: {}'.format(results2))

similaritySearch: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]


similaritySearchAggregate: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]
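
The fingerprint screening and server-side scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the library's confirmed implementation: `count_bounds` and `build_pipeline` are hypothetical names, and the field layout is assumed to match the `morgan_fp` documents shown earlier. The count screen relies on the fact that the Tanimoto score of two fingerprints can never exceed the ratio of the smaller on-bit count to the larger one.

```python
import math


def count_bounds(query_count, threshold):
    """Range of on-bit counts a molecule needs to possibly reach the threshold."""
    # Tanimoto <= min(n_a, n_b) / max(n_a, n_b), so counts outside this range cannot match.
    # The small tolerance keeps float rounding from excluding borderline counts;
    # the final threshold filter still removes any false positives.
    eps = 1e-9
    lower = math.ceil(query_count * threshold - eps)
    upper = math.floor(query_count / threshold + eps)
    return lower, upper


def build_pipeline(query_bits, threshold, field="morgan_fp"):
    """Aggregation stages that screen by count, then score and filter on the server."""
    n_query = len(query_bits)
    lower, upper = count_bounds(n_query, threshold)
    common = {"$size": {"$setIntersection": ["$" + field + ".bits", query_bits]}}
    return [
        # Stage 1: cheap screen on the stored on-bit count.
        {"$match": {field + ".count": {"$gte": lower, "$lte": upper}}},
        # Stage 2: Tanimoto = common / (n_query + n_doc - common).
        {"$project": {
            "smiles": 1,
            "tanimoto": {"$let": {
                "vars": {"common": common},
                "in": {"$divide": [
                    "$$common",
                    {"$subtract": [{"$add": [n_query, "$" + field + ".count"]}, "$$common"]},
                ]},
            }},
        }},
        # Stage 3: keep only documents at or above the threshold.
        {"$match": {"tanimoto": {"$gte": threshold}}},
    ]


print(count_bounds(15, 0.35))  # (6, 42)
```

A list built this way could be passed to pymongo's `collection.aggregate(...)`, so that only documents passing the threshold travel back to the client.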


## Substructure Search

Likewise, `mongordkit.Search.substructure` works best on databases prepared by `write`. Database requirements are identical to those for similarity search: a `molecules` collection whose documents have `rdmol` and `smiles` fields.

`substructure.SubSearchNaive` provides a fingerprint-less, slower implementation of substructure search suitable for very small databases:

In [27]:
q_mol = Chem.MolFromSmiles('C1=CC=CC=C1OC')

# Perform a substructure search for q_mol on TestDB. 
substructure.SubSearchNaive(q_mol, TestDB, chirality=False)

['c1ccc(-c2ccccc2OCCOc2ccccc2-c2ccccc2)cc1',
 'COc1ccc(Cc2ccc(OC)cc2)cc1',
 'COc1cc([N+](=O)[O-])c(N)c([N+](=O)[O-])c1',
 'COc1ccc(/C=N/O)cc1',
 'Cc1nc2ccccc2c(Oc2ccccc2)c1-c1ccccc1',
 'O/N=C/c1ccc2c(c1)OCO2',
 'COc1ccc(CC#N)cc1',
 'COc1ccc(C(C)(C)C#N)cc1']

By adding pattern fingerprints, which are optimized for substructure search, we can use `substructure.SubSearch`, which takes advantage of fingerprint screening to avoid as many expensive calls to `HasSubstructMatch` as possible. 

In [1]:
substructure.AddPatternFingerprints(TestDB.molecules, TestDB.mfp_counts, length=None)
substructure.SubSearch(q_mol, TestDB, chirality=False)

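NameError: name 'substructure' is not defined

This error comes from running the cell in a fresh kernel (note the `In [1]` counter). Re-run the import cell and the cells that define `TestDB` and `q_mol` before running this one.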
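
The screening idea behind pattern fingerprints can be illustrated without a database. A molecule can contain the query as a substructure only if every on bit of the query's pattern fingerprint is also on in the molecule's fingerprint, so the expensive match only needs to run on the survivors. The sketch below uses hypothetical bit lists and a stand-in `fake_match` function in place of `HasSubstructMatch`; none of these names come from `mongordkit`.

```python
def passes_screen(query_bits, candidate_bits):
    """A candidate can contain the query only if it has every on bit of the query."""
    return set(query_bits) <= set(candidate_bits)


def screened_search(query_bits, candidates, expensive_match):
    """Run expensive_match only on candidates that survive the fingerprint screen."""
    hits = []
    for doc in candidates:
        if passes_screen(query_bits, doc["bits"]) and expensive_match(doc):
            hits.append(doc["smiles"])
    return hits


# Hypothetical fingerprints; real ones would come from a pattern fingerprinter.
candidates = [
    {"smiles": "COc1ccccc1", "bits": [2, 7, 11, 30]},
    {"smiles": "CCO", "bits": [2, 30]},
]
calls = []


def fake_match(doc):
    # Stand-in for the real substructure check; records which documents were tested.
    calls.append(doc["smiles"])
    return True


print(screened_search([2, 7, 11], candidates, fake_match))  # ['COc1ccccc1']
print(calls)  # ['COc1ccccc1'], since 'CCO' was rejected before the expensive check
```

The screen can produce false positives but never false negatives, which is why the full substructure match is still required for every candidate that passes.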

## `.similarity` Contents

mongordkit.Search.similarity.**AddMorganFingerprints**(mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), radius=2 (*int: radius of Morgan fingerprint*), length=2048 (*int: length of Morgan fingerprint bit vector*)) --> None

mongordkit.Search.similarity.**SimSearchNaive**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*

mongordkit.Search.similarity.**SimSearch**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*

mongordkit.Search.similarity.**SimSearchAggregate**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*

mongordkit.Search.similarity.**AddRandPermutations**(perm_collection (*MongoDB collection*), len=2048, num=100) --> None

mongordkit.Search.similarity.**SimSearchLSH**(mol (*rdmol object*), db (*MongoDB database containing hash collections*), mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*

## `.substructure` Contents

mongordkit.Search.substructure.**AddPatternFingerprints**(db, length=2048 (*int: length of Pattern fingerprint bit vector*)) --> None

mongordkit.Search.substructure.**SubSearchNaive**(pattern (*rdmol object*), db, chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*

mongordkit.Search.substructure.**SubSearch**(pattern (*rdmol object*), db, chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*