# Creating and Writing to MongoDB

Last updated: 8/10/20

Methods that directly modify MongoDB database instances are included in the `mongordkit.Database` module.





In [1]:
from mongordkit.Database import create, write, utils, registration
from rdkit import Chem
import pymongo

## Reset Cells
Run the contents of this cell to reset the local MongoDB database, `demo_db`, used in this notebook.

In [2]:
client = pymongo.MongoClient()
client.drop_database('demo_db')
demo_db = client.demo_db

# Disable rdkit warnings (RDLogger must be imported explicitly)
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

## Creating Databases (DEPRECATED for now)
Users can opt to bring their own database instances, but `Database.create` provides methods that will create ready-made MongoDB instances, defaulting to your local MongoDB:

In [None]:
# # Return a database using a host port, such as the local port:
# db = create.createFromHostPort('demo_db', host='localhost', port=27017)

# # Return a database using a MongoDB URI, such as that provided by Atlas:
# TestDB = create.createFromURL('demo_db', url=None)

These databases are created with a `registration` collection. The registration collection includes several documents that consist of common pre-made settings, with the default being `STANDARD_SETTING`. All settings are documented in `Database.utils`.

In [None]:
# print(utils.STANDARD_SETTING)

## Data Registration
`Database.registration` constructs document representations of molecules according to configurable schemes and handles data registration settings.

It does this in two parts. First, it defines the global variable `HASH_FUNCTIONS` as a dictionary that maps hash function names to methods. It also defines the global variables `DEFAULT_SCHEME_NAME`, `DEFAULT_AUTHOR`, `DEFAULT_PREPROCESS`, and `DEFAULT_INDEX`, which are used in scheme creation and are thus defined for easy configuration. 

Second, the file defines the `MolDocScheme` object, which stores scheme information in its instance variables and is passed into `.write` methods in order to specify molecule document format. By default, `MolDocScheme` includes scheme name, author, whether or not the molecule has been pre-processed, an index option, two hashes, fingerprints, and value fields. All of the information contained in a `MolDocScheme` object can be used directly to generate documents for molecules:

In [3]:
rdmol = Chem.MolFromSmiles('Cc1ccccc1')
scheme = registration.MolDocScheme()
scheme.generate_mol_doc(rdmol)

{'rdmol': Binary(b'\xef\xbe\xad\xde\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x07\x00\x00\x00\x80\x01\x06\x00`\x00\x00\x00\x01\x03\x06@(\x00\x00\x00\x03\x04\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x0b\x00\x01\x00\x01\x02h\x0c\x02\x03h\x0c\x03\x04h\x0c\x04\x05h\x0c\x05\x06h\x0c\x06\x01h\x0c\x14\x01\x06\x01\x06\x05\x04\x03\x02\x17\x00\x00\x00\x00\x16', 0),
 'index': 'YXFVVABEGXRONW-UHFFFAOYSA-N',
 'smiles': 'Cc1ccccc1',
 'scheme': 'default',
 'hashes': {'MolFormula': 'C7H8',
  'SmallWorldIndexBRL': 'B7R1L5',
  'AtomBondCounts': '7,7',
  'cx_smiles': 'Cc1ccccc1',
  'NetCharge': '0',
  'CanonicalSmiles': 'Cc1ccccc1',
  'inchikey_standard': 'YXFVVABEGXRONW-UHFFFAOYSA-N',
  'inchikey_KET_15T': 'YXFVVABEGXRONW-UHFFFAOYNA-N',
  'SmallWorldIndexBR': 'B7R1',
  'DegreeVector': '0,1,5,1',
  'ElementGraph': 'CC1CCCCC1',
  'HetAtomTautomer': 'C[C]1[CH][C

The `MolDocScheme` class also defines a series of instance methods, such as `MolDocScheme.set_index` and `MolDocScheme.remove_field`, that can be used to modify document schemes:

In [4]:
scheme.remove_field('CanonicalSmiles')
scheme.add_hash_field('MolFormula')
scheme.set_index('MolFormula')
scheme.generate_mol_doc(rdmol)

removed AnonymousGraph from scheme


{'rdmol': Binary(b'\xef\xbe\xad\xde\x00\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x07\x00\x00\x00\x80\x01\x06\x00`\x00\x00\x00\x01\x03\x06@(\x00\x00\x00\x03\x04\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x06@h\x00\x00\x00\x03\x03\x01\x0b\x00\x01\x00\x01\x02h\x0c\x02\x03h\x0c\x03\x04h\x0c\x04\x05h\x0c\x05\x06h\x0c\x06\x01h\x0c\x14\x01\x06\x01\x06\x05\x04\x03\x02\x17\x00\x00\x00\x00\x16', 0),
 'index': 'C7H8',
 'smiles': 'Cc1ccccc1',
 'scheme': 'default',
 'hashes': {'MolFormula': 'C7H8',
  'SmallWorldIndexBRL': 'B7R1L5',
  'AtomBondCounts': '7,7',
  'cx_smiles': 'Cc1ccccc1',
  'NetCharge': '0',
  'CanonicalSmiles': 'Cc1ccccc1',
  'inchikey_standard': 'YXFVVABEGXRONW-UHFFFAOYSA-N',
  'inchikey_KET_15T': 'YXFVVABEGXRONW-UHFFFAOYNA-N',
  'SmallWorldIndexBR': 'B7R1',
  'DegreeVector': '0,1,5,1',
  'ElementGraph': 'CC1CCCCC1',
  'HetAtomTautomer': 'C[C]1[CH][CH][CH][CH][CH]1_0_0',
 

Because `MolDocScheme` objects contain no functions, only references to them by name (for example, `self.hashes` stores keys of `HASH_FUNCTIONS`), they can be pickled. The methods in `write` can therefore save `MolDocScheme` objects to a registration collection so that custom schemes are retrievable for later use.
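
The sketch below illustrates why keeping names instead of callables matters for pickling. It is not the library's code: `toy_functions` and `scheme_state` are hypothetical stand-ins, and toy string functions replace the rdkit hash functions. The only point is that a lambda cannot be pickled, while a record of plain strings can.

```python
import pickle

# Hypothetical registry of callables, similar in shape to HASH_FUNCTIONS.
toy_functions = {
    "Length": lambda s: len(s),
    "Upper": lambda s: s.upper(),
}

# Pickling a lambda fails because pickle stores functions by importable name.
try:
    pickle.dumps(toy_functions["Length"])
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

# A scheme-like record that keeps only the *names* of the functions it uses.
scheme_state = {"index_option": "Upper", "hashes": sorted(toy_functions)}

# Names are plain strings, so the record round-trips through pickle.
restored = pickle.loads(pickle.dumps(scheme_state))

# After loading, the names are resolved against the registry again.
doc = {name: toy_functions[name]("abc") for name in restored["hashes"]}
print(lambda_pickled, restored == scheme_state, doc)
```

Resolving names against the registry at document-generation time keeps the stored record small and independent of how the functions themselves are defined.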

## Writing to a Database
`Database.write` provides write functionality. Its core method is `WriteFromSDF`, which relies on rdkit's `ForwardSDMolSupplier` to write data from an SDF file into a specified database.

For each molecule in the SDF, `WriteFromSDF` inserts a document whose fields are specified by the `MolDocScheme` object passed into the function (one with default settings is created if the `scheme` argument is left blank).

In [5]:
# Write the contents of first_200.props.sdf, a test dataset, into the collection demo_db.molecules.
# The index will default to the molecule's inchikey.
# Return the number of molecules successfully imported.
write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')

populating mongodb collection with compounds from SDF...




200 molecules successfully imported
0 duplicates skipped


200

The above call is the most basic version of `WriteFromSDF`. For additional flexibility, `WriteFromSDF` takes several optional arguments: a custom scheme object, a registration collection to write scheme objects to, the number of molecules inserted at a time (this can affect performance), and a limit on the number of molecules written.

In [6]:
# Write the first 100 molecules of first_200.props.sdf, a test dataset, into demo_db.molecules.
# This write uses canonical SMILES as the identifying index and thus does not conflict with the above write.
# If we had used inchikey again, the write would have imported 0 molecules.
scheme = registration.MolDocScheme()
scheme.set_index('CanonicalSmiles')
write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf', 
                   scheme, reg_collection=demo_db.schema, chunk_size=50, limit=100)

populating mongodb collection with compounds from SDF...




100 molecules successfully imported
0 duplicates skipped


100
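
One way to picture how `chunk_size`, `limit`, and duplicate skipping interact is the sketch below. It rests on stated assumptions and is not the implementation of `WriteFromSDF`: the function `write_in_chunks` is hypothetical, a plain dict stands in for the MongoDB collection, and strings stand in for molecules.

```python
from itertools import islice

def write_in_chunks(store, items, index_of, chunk_size=100, limit=None):
    """Insert items into `store` (a dict keyed by index) chunk by chunk.

    Returns (imported, skipped). Items whose index already exists are skipped.
    """
    source = islice(iter(items), limit)  # limit=None means "no limit"
    imported = skipped = 0
    while True:
        # Pull at most chunk_size items; a real writer would insert each
        # chunk as one batch to reduce round trips to the database.
        chunk = list(islice(source, chunk_size))
        if not chunk:
            break
        for item in chunk:
            key = index_of(item)
            if key in store:
                skipped += 1
            else:
                store[key] = {"index": key, "value": item}
                imported += 1
    return imported, skipped

store = {}
smiles = ["CCO", "Cc1ccccc1", "CCO", "CCN", "c1ccccc1"]
first = write_in_chunks(store, smiles, index_of=str.upper, chunk_size=2)
second = write_in_chunks(store, smiles, index_of=str.upper, chunk_size=2, limit=3)
print(first, second)
```

The second call reuses the same index function, so every item it sees is already present and nothing is imported. This mirrors the remark above that repeating a write with the same index would import 0 molecules.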

For molecules that are not stored in an SDF file, `.write` also provides `WriteFromMolList`, which takes a Python list of rdmol objects in place of the SDF argument of `WriteFromSDF`.

## `.create` Module Contents

mongordkit.Database.create.**createFromHostPort**(database_name, host=None (*string*), port=None (*string*)) --> *a MongoDB database instance named database_name*

mongordkit.Database.create.**createFromURL**(database_name, url=None (*string*)) --> *a MongoDB database instance named database_name*

## `.registration` Module Contents

In [7]:
registration.HASH_FUNCTIONS

{'AnonymousGraph': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.AnonymousGraph)>,
 'ElementGraph': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.ElementGraph)>,
 'CanonicalSmiles': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles)>,
 'MurckoScaffold': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.MurckoScaffold)>,
 'ExtendedMurcko': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.ExtendedMurcko)>,
 'MolFormula': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.MolFormula)>,
 'AtomBondCounts': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Chem.rdMolHash.HashFunction.AtomBondCounts)>,
 'DegreeVector': <function mongordkit.Database.registration.<lambda>(rdmol, f=rdkit.Ch
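
The `f=...` default argument visible in each lambda signature above is a common Python idiom for binding a value at definition time. The following is a minimal sketch of that idiom, not the library's actual source: the toy string functions and the names `TOY_FUNCTIONS`, `apply_toy`, and `REGISTRY` are hypothetical stand-ins for rdkit's hash functions and `HASH_FUNCTIONS`.

```python
# Hypothetical stand-ins for hash functions; they operate on SMILES-like strings.
TOY_FUNCTIONS = {
    "Length": len,
    "Upper": str.upper,
    "CarbonCount": lambda s: s.lower().count("c"),
}

def apply_toy(text, f):
    """Apply one toy function and return its result as a string."""
    return str(f(text))

# Binding f as a default argument freezes the current loop value in each lambda.
# Without `f=f`, every lambda would look up `f` lazily and use the last value.
REGISTRY = {name: (lambda text, f=f: apply_toy(text, f))
            for name, f in TOY_FUNCTIONS.items()}

hashes = {name: func("Cc1ccccc1") for name, func in REGISTRY.items()}
print(hashes)
```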

**Class** mongordkit.Database.registration.**MolDocScheme()**

**Instance variables**:
```
self.scheme_name = DEFAULT_SCHEME_NAME
self.author = DEFAULT_AUTHOR
self.pre_processed = DEFAULT_PREPROCESS
self.index_option = DEFAULT_INDEX
self.hashes = set(HASH_FUNCTIONS.keys())
self.fingerprints = {}
self.value_fields = {}
```
**Instance methods**:
- set_index(self, new_index) --> *None*
- get_index_value(self, rdmol) --> *calculated index value*
- add_hash_field(self, field_name, field_method) --> *None*
- add_value_field(self, field_name, field_value) --> *None*
- add_all_hashes(self) --> *None*
- remove_field(self, field_name) --> *None*
- generate_mol_doc(self, rdmol) --> *Dict: document representing molecule according to scheme*
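
To make the relationship between these methods concrete, here is a toy analogue in plain Python. Everything in it is an assumption made for illustration: `ToyScheme`, `TOY_HASHES`, and `generate_doc` are hypothetical, strings stand in for rdmol objects, and the document layout is only loosely modeled on the output shown earlier. It is not the `MolDocScheme` implementation.

```python
# Hypothetical registry: names mapped to functions of a SMILES string.
TOY_HASHES = {
    "Upper": str.upper,
    "Length": lambda s: str(len(s)),
}

class ToyScheme:
    """Toy analogue of a document scheme; not the mongordkit class."""

    def __init__(self):
        self.index_option = "Upper"
        self.hashes = set(TOY_HASHES)  # names only, resolved when a doc is built
        self.value_fields = {}

    def set_index(self, new_index):
        if new_index not in TOY_HASHES:
            raise ValueError(f"unknown index option: {new_index}")
        self.index_option = new_index

    def remove_field(self, field_name):
        # Remove the field wherever it appears; missing names are ignored.
        self.hashes.discard(field_name)
        self.value_fields.pop(field_name, None)

    def add_value_field(self, field_name, field_value):
        self.value_fields[field_name] = field_value

    def generate_doc(self, smiles):
        return {
            "index": TOY_HASHES[self.index_option](smiles),
            "smiles": smiles,
            "hashes": {n: TOY_HASHES[n](smiles) for n in sorted(self.hashes)},
            "value_fields": dict(self.value_fields),
        }

scheme = ToyScheme()
scheme.set_index("Length")
scheme.remove_field("Upper")
scheme.add_value_field("source", "demo")
doc = scheme.generate_doc("Cc1ccccc1")
print(doc)
```

In this sketch the index is computed from the registry independently of the `hashes` set, so removing a hash field does not prevent it from serving as the index. Whether the library behaves the same way is not stated in this document.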

## `.write` Module Contents

mongordkit.Database.write.**WriteFromSDF**(database, sdf, scheme=MolDocScheme(), reg_collection=None, chunk_size=100, limit=None, warnings=False (*Make this true to turn on rdkit warnings*)) --> *int: number of molecules imported*

mongordkit.Database.write.**WriteFromMolList**(database, list, scheme=MolDocScheme(), reg_collection=None, chunk_size=100, limit=None) --> *int: number of molecules imported*