# Creating and Writing to MongoDB

Last updated: 7/12/20

Methods that directly modify MongoDB database instances are included in the `mongordkit.Database` module.

In [None]:
from mongordkit.Database import create, write, utils
import pymongo

## Reset Cells
Run the cell below to reset the local MongoDB database used in this notebook. It drops `TestDatabase` and prints the list of database names before and after the drop.

In [None]:
client = pymongo.MongoClient()
print(client.list_database_names())
client.drop_database('TestDatabase')
print(client.list_database_names())
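
If the reset is needed in more than one place, it can be wrapped in a small helper. The sketch below is illustrative only: `reset_database` is a hypothetical name that is not part of `mongordkit`, and it relies solely on the `list_database_names` and `drop_database` client methods already used in the cell above.

```python
def reset_database(client, name):
    """Drop the database `name` if it exists on `client`.

    Returns True if a database was dropped, False if there was nothing to drop.
    `client` is expected to behave like a pymongo.MongoClient.
    """
    if name in client.list_database_names():
        client.drop_database(name)
        return True
    return False


# Usage against a local server (requires pymongo and a running mongod):
#   import pymongo
#   reset_database(pymongo.MongoClient(), 'TestDatabase')
```

Returning a boolean makes it easy to tell a real reset apart from a no-op when the notebook is rerun.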

## Creating Databases
You can bring your own database instance, or use `Database.create`, whose methods return ready-made MongoDB databases and default to your local MongoDB server:

In [None]:
# Return a database from a host and port, such as a MongoDB server running locally:
TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)

# Return a database using a MongoDB URI, such as that provided by Atlas:
TestDB = create.createFromURL('TestDatabase', url=None)

These databases are created with a `registration` collection. The registration collection includes several documents that consist of common pre-made settings, with the default being `STANDARD_SETTING`. All settings are documented in `Database.utils`.

In [None]:
print(utils.STANDARD_SETTING)

## Writing to a Database
`Database.write` provides write functionality. Its core method is `writeFromSDF`, which relies on RDKit's `ForwardSDMolSupplier` to write data from an SDF file into a specified database.

For each molecule in the SDF, `writeFromSDF` inserts a document containing, at minimum, a unique identifying index, the molecule's SMILES, a pickle of the molecule's RDKit mol object, and a field that records the registration option used to store the molecule.

In [None]:
# Write the contents of first_200.props.sdf, a test dataset, into the TestDatabase created above.
# The index will default to the molecule's inchikey.
# Return the number of molecules successfully imported.
write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')

The above call is the most basic version of `writeFromSDF`. For additional flexibility, `writeFromSDF` takes several optional arguments that let users specify how inbound molecules should be standardized, customize the index, change how many molecules are inserted into the database at a time, and limit how many molecules are read. The required `source_name` argument (`'test'` above) records the data's origin.

In [None]:
# Write the contents of first_200.props.sdf, a test dataset, into the TestDatabase created above.
# This write will use canonical SMILES as the identifying index and thus does not conflict with the above write. 
# If we had used inchikey again, the write would have imported 0 molecules.
write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test', reg_option='standard_setting', index_option='canonical_smiles', chunk_size=100, limit=None)
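
To make the behavior described in the comments above concrete, here is a minimal pure-Python sketch of a chunked, index-keyed write. It is not `mongordkit`'s implementation: a plain `dict` stands in for the MongoDB collection, `write_in_chunks` is a hypothetical name, and the assumption is that a molecule whose index value is already present is skipped rather than overwritten.

```python
from itertools import islice


def write_in_chunks(collection, records, index_option, chunk_size=100, limit=None):
    """Insert records into `collection`, keyed by record[index_option].

    `collection` is a dict standing in for a MongoDB collection.
    Records are consumed `chunk_size` at a time, and at most `limit`
    records are read when `limit` is not None.
    Returns the number of records actually inserted.
    """
    stream = iter(records) if limit is None else islice(records, limit)
    inserted = 0
    while True:
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            break
        for record in chunk:
            key = record[index_option]
            if key in collection:  # index value already stored: skip it
                continue
            collection[key] = record
            inserted += 1
    return inserted


# Two toy records (ethanol and methanol) carrying both index fields.
molecules = [
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "canonical_smiles": "CCO"},
    {"inchikey": "OKKJLVBELUTLKV-UHFFFAOYSA-N", "canonical_smiles": "CO"},
]
store = {}
first = write_in_chunks(store, molecules, "inchikey")          # 2 inserted
repeat = write_in_chunks(store, molecules, "inchikey")         # 0 inserted
other = write_in_chunks(store, molecules, "canonical_smiles")  # 2 inserted
```

The repeated `inchikey` write inserts nothing because every key already exists, while the `canonical_smiles` write uses different keys and so inserts both records again.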

To maintain consistency, registration options and index options are drawn from a set of predetermined options specified in `Database.utils`.

## `.create` Module Contents

mongordkit.Database.create.**createFromHostPort**(database_name, host=None (*string*), port=None (*string*)) --> *a MongoDB database instance named database_name*

mongordkit.Database.create.**createFromURL**(database_name, url=None (*string*)) --> *a MongoDB database instance named database_name*

## `.write` Module Contents

mongordkit.Database.write.**writeFromSDF**(database, source_sdf, source_name *(string)*, reg_option="standard_setting", index_option="inchikey", chunk_size=100, limit=None) --> *int: number of molecules imported*

As of 7/15/20, `writeFromSDF` supports the following registration options: 
* 'standard_setting'

And the following index options: 
* 'inchikey'
* 'canonical_smiles'
* 'het_atom_tautomer'
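
Since both arguments come from fixed sets, a caller may want to check them before starting a long import. The following sketch assumes only the option names listed above; `check_options`, `REG_OPTIONS`, and `INDEX_OPTIONS` are hypothetical names, not part of `mongordkit`, and the library may validate its arguments differently.

```python
# Option names copied from the lists above; the containers are hypothetical.
REG_OPTIONS = {"standard_setting"}
INDEX_OPTIONS = {"inchikey", "canonical_smiles", "het_atom_tautomer"}


def check_options(reg_option="standard_setting", index_option="inchikey"):
    """Raise ValueError if either option is not one of the documented values."""
    if reg_option not in REG_OPTIONS:
        raise ValueError(f"unsupported reg_option: {reg_option!r}")
    if index_option not in INDEX_OPTIONS:
        raise ValueError(f"unsupported index_option: {index_option!r}")
    return reg_option, index_option
```

Calling `check_options(index_option='canonical_smiles')` before `writeFromSDF` would surface a misspelled option name before any molecules are read.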