Feature mongodb serialization #276
Conversation
Adds the ability to serialize pyGSTi objects to a MongoDB database by creating routines, parallel to those used for disk (filesystem) I/O, that read and write MongoDB documents. This commit adds support for many of pyGSTi's objects, and the main types (experiment designs, datasets, results, protocols) have been tested to read from and write to a MongoDB instance. Most of the implementation lies within the new io/mongodb.py module, which parallels metadir.py functionality for a database. Issues arise in that the paths used for filesystem I/O serve as both an ID (e.g., for a DB record) and a hierarchical link, indicating the parent and children of a record. The current MongoDB implementation uses subcollections to store related data, but this still seems a bit clunky and can result in many "nested" MongoDB subcollections. Also, removal of data is less straightforward in the DB setting. Overall, in addition to basic bug fixing, we still need to look into the following:
- the ability to write over (and update) existing DB documents
- the ability to remove documents cleanly
- resolving the issue of subcollection nesting
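The path-as-ID tension described above can be sketched in a few lines: instead of nesting subcollections to mirror the directory tree, each filesystem-style path can be flattened into a document whose `_id` is the full path and whose parent link is stored explicitly. This is an illustrative sketch only; `path_to_doc` is a hypothetical helper, not part of the PR.

```python
def path_to_doc(path):
    """Map a filesystem-style path to a flat, MongoDB-ready document.

    The full path doubles as a unique _id, while 'parent' records the
    hierarchical link explicitly, avoiding deeply nested subcollections.
    """
    parts = [p for p in path.split('/') if p]
    parent = '/'.join(parts[:-1]) or None
    return {'_id': '/'.join(parts), 'name': parts[-1], 'parent': parent}

doc = path_to_doc('my_edesign/data/results')
# doc['_id'] serves as the record ID; doc['parent'] gives the hierarchy link
```

With this flattening, children of a record are found with a single query on the `parent` field rather than by walking nested subcollections.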
…zes bulk_write Adds the ability to specify whether existing MongoDB documents may be overwritten when writing pyGSTi objects to a Mongo database. This is implemented by adding 'overwrite_existing' parameters to many of the write functions in mongodb.py and protocol.py. This commit also adds a pre-check for writing a document id that already exists (when overwrite_existing=False), so that this common type of write failure is indicated before any auxiliary documents are written to the database. Finally, the way writes are made to the MongoDB database is updated to collect all the insert/replace operations and then use pymongo's bulk_write function, instead of invoking insert_one / replace_one repeatedly throughout calls to write_auxtree_to_mongodb. This should benefit both performance and robustness.
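The collect-then-bulk-write pattern described above can be sketched without a live database. Here the operations are modeled as plain tuples standing in for pymongo's `InsertOne` / `ReplaceOne` request objects; `collect_write_ops` is a hypothetical helper illustrating the pre-check, not the PR's actual code.

```python
def collect_write_ops(docs, existing_ids, overwrite_existing=False):
    """Collect insert/replace operations for a single bulk write.

    Pre-checks id collisions (when overwrite_existing=False) so the common
    duplicate-id failure surfaces before any auxiliary documents are written.
    Returns ('insert'|'replace', doc) tuples, analogous to the InsertOne /
    ReplaceOne requests passed to pymongo's Collection.bulk_write.
    """
    ops = []
    for doc in docs:
        if doc['_id'] in existing_ids:
            if not overwrite_existing:
                raise ValueError(f"Document id {doc['_id']!r} already exists")
            ops.append(('replace', doc))
        else:
            ops.append(('insert', doc))
    return ops
```

Executing all accumulated operations in one `bulk_write` call amortizes round trips to the server, which is the performance benefit the commit message refers to.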
…al functions. Adds the `pygsti.io.mongodb.subcollection_names` dictionary, which centralizes the subcollection names used when writing pyGSTi objects to MongoDB instances. Also adds `pygsti.io.mongodb.STANDALONE_COLLECTION_NAME`, which names the top-level collection used to store "standalone" pyGSTi objects (classes that have their own "from_mongodb", "write_to_mongodb", and "remove_from_mongodb" methods but are NOT TreeNode derived). Updates sub-collection names, including those used for dataset/datarow serialization, to be more succinct (e.g., changed "pygsti_datasets" to "datasets") so the DB names are cleaner - ideally a single "pygsti_"-prefixed collection name with non-"pygsti_"-prefixed subcollections. Also fixes `remove_auxtree_from_mongodb` so it properly removes everything it should - linked standalone object documents in particular. At this point all the basic MongoDB operations have been tested to work with all the main data types: edesigns, protocol data, and results. Currently, we don't serialize result directories as their own records. This keeps things simpler, and currently these dir documents aren't needed because they don't store anything - but we may want to add these records for easier future expansion (FUTURE work).
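The centralized-naming idea might look like the following sketch. Only `subcollection_names`, `STANDALONE_COLLECTION_NAME`, and the "pygsti_datasets" → "datasets" rename are from the PR; all other keys, values, and the `subcollection_name` helper are illustrative assumptions.

```python
# Illustrative values only -- the real dictionary lives in pygsti.io.mongodb.
STANDALONE_COLLECTION_NAME = 'pygsti_standalone'  # hypothetical value

subcollection_names = {
    'datasets': 'datasets',   # renamed from 'pygsti_datasets' in this commit
    'datarows': 'datarows',   # hypothetical entry
    'edesigns': 'edesigns',   # hypothetical entry
}

def subcollection_name(top_collection, kind):
    """Build a subcollection name: only the top-level collection carries the
    'pygsti_' prefix, keeping database listings clean.  (Hypothetical helper.)"""
    return f"{top_collection}.{subcollection_names[kind]}"
```

Keeping every subcollection name in one dictionary means a rename (like the one in this commit) touches a single line instead of every read, write, and remove function.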
Only works if you have pymongo installed and a MongoDB server running on localhost:27017, but illustrates the usage and works on Erik's machine.
Adds the MongoDB read/write methods to CachedObjectiveFunction that were previously forgotten. Adds a MongoSerializable class as an extension of NicelySerializable that gives the object being serialized access to a MongoDB database for placing large/complex components. This framework is needed by ConfidenceRegionFactory, which uses gridfs to store Hessian and Hessian-projection data when serialized to a MongoDB database. This is necessary because (currently) the maximum MongoDB document size is 16MB. This commit also fixes a few minor bugs brought out by testing the serialization of large and nested trees of results, data, and edesigns that included confidence region factories and wildcard budgets.
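The size-based dispatch motivating MongoSerializable can be sketched as a simple threshold check: payloads approaching MongoDB's 16MB per-document limit must be routed to GridFS rather than stored inline. The helper below is a hypothetical illustration, not the PR's implementation.

```python
# MongoDB's per-document BSON size limit is 16MB; larger payloads
# (e.g. Hessian arrays) must go to GridFS instead of an inline field.
MAX_DOC_BYTES = 16 * 1024 * 1024

def choose_storage(payload: bytes, margin: int = 4096):
    """Return 'inline' for payloads that safely fit in one document and
    'gridfs' for those near or over the 16MB limit.  The margin leaves
    room for the document's other fields and BSON overhead.
    (Hypothetical helper for illustration.)"""
    return 'inline' if len(payload) + margin < MAX_DOC_BYTES else 'gridfs'
```

GridFS sidesteps the limit by splitting a large blob across many chunk documents, which is why it suits the Hessian data mentioned above.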
This looks good! Had one comment about a potential extraneous return, but nothing that would block this PR.
pygsti/io/metadir.py
Outdated
@@ -525,7 +525,7 @@ def write_obj_to_meta_based_dir(obj, dirname, auxfile_types_member, omit_attribu
     vals = obj.__dict__
     auxtypes = obj.__dict__[auxfile_types_member]

-    write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
+    return write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
This seems unnecessary? Unless there were plans to pass on the return value of `write_meta_based_dir` in the future, but that currently returns None as well.
Yep, good catch. This was a leftover from some other idea I had about implementing the mongoDB serialization. I've added a commit removing it.
This was recently added as a part of a possible alternate implementation for mongoDB serialization that wasn't ultimately pursued.
Adds ability to serialize pyGSTi objects to a MongoDB database.
Leverages the framework used to save pyGSTi objects to json files within a filesystem hierarchy to alternatively save these objects to a MongoDB database. This medium provides some potential benefits over a filesystem in that it's easier to share and distribute (see sharding MongoDB lingo), and can be made robust through transactions and automated backups.
This PR represents the first implementation of a MongoDB serialization interface, adding MongoDB counterparts to many of the file I/O functions. In particular, it adds `write_to_mongodb` methods for experiment design, protocol data, and protocol results objects, and adds `pygsti.io.read_edesign_from_mongodb`, `pygsti.io.read_data_from_mongodb`, and `pygsti.io.read_results_from_mongodb` functions. Human-readable names should be used as the `doc_id` arguments to these functions, which loosely correspond to the directory-node names within the filesystem picture.

A module of unit tests is included, which tests most but not all of the functionality (missing, in particular, serialization of `ConfidenceRegionFactory` objects). Note, however, that `pymongo` and a local MongoDB server are needed to run these tests, and currently `pymongo` is not listed as a "testing" dependency in pyGSTi.