
Feature mongodb serialization #276

Merged (7 commits into develop, Nov 17, 2022)

Conversation


@enielse enielse commented Nov 16, 2022

Adds ability to serialize pyGSTi objects to a MongoDB database.

Leverages the framework used to save pyGSTi objects to json files within a filesystem hierarchy to alternatively save these objects to a MongoDB database. This medium offers some potential benefits over a filesystem: it is easier to share and distribute (e.g., via "sharding", in MongoDB parlance), and it can be made robust through transactions and automated backups.

This PR represents the first implementation of a MongoDB serialization interface, adding MongoDB counterparts to many of the file I/O functions. In particular, it adds write_to_mongodb methods for experiment design, protocol data, and protocol results objects, and adds pygsti.io.read_edesign_from_mongodb, pygsti.io.read_data_from_mongodb, and pygsti.io.read_results_from_mongodb functions. Human-readable names should be used as the doc_id arguments to these functions, which loosely correspond to the directory-node names within the filesystem picture.

A module of unit tests is included, which tests most but not all of the functionality (missing in particular serialization of ConfidenceRegionFactory objects). Note, however, that pymongo and a local MongoDB server are needed to run these tests and currently pymongo is not listed as a "testing" dependency in pyGSTi.
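To make the intended call pattern concrete without requiring pygsti, pymongo, or a live MongoDB server, here is a hedged sketch in which a plain dict-backed class stands in for a MongoDB collection. The `FakeCollection` class and the `"my_gst_edesign"` doc_id are illustrative inventions; the real interface added by this PR is the `write_to_mongodb` methods and the `pygsti.io.read_*_from_mongodb` functions named above.

```python
class FakeCollection:
    """Minimal dict-backed stand-in for a pymongo Collection keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        if doc["_id"] in self._docs:
            raise ValueError("duplicate _id: %s" % doc["_id"])
        self._docs[doc["_id"]] = doc

    def find_one(self, query):
        return self._docs.get(query["_id"])


# Human-readable names serve as doc_id values, loosely mirroring the
# directory-node names used in the filesystem serialization picture.
edesigns = FakeCollection()
edesigns.insert_one({"_id": "my_gst_edesign", "type": "ExperimentDesign"})
doc = edesigns.find_one({"_id": "my_gst_edesign"})
print(doc["type"])  # ExperimentDesign
```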

Adds the ability to serialize pyGSTi objects to a MongoDB database by
creating routines, parallel to those used for disk (filesystem) I/O,
that read and write MongoDB documents.  This commit adds support for
many of pyGSTi's objects, and the main types (experiment designs, datasets,
results, protocols) have been verified to read from and write to a
MongoDB instance.

Most of the implementation lies within the new io/mongodb.py module,
which parallels metadir.py functionality for a database.  Issues arise
in that the paths used for filesystem I/O serve as both an ID (e.g.,
for a DB record) and a hierarchical link, indicating the parent and children
of a record.  The current MongoDB implementation uses subcollections
to store related data, but this still seems a bit clunky and can result
in many "nested" MongoDB subcollections.  Also, removal of data is less
straightforward in the DB setting.
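One alternative to nested subcollections is to keep every node in a single flat collection and record hierarchy via an explicit parent link, from which the filesystem-style path can be rebuilt. This is an illustrative sketch of that idea only, not the scheme this PR actually uses; all names here are hypothetical.

```python
docs = {}  # _id -> document, standing in for one flat MongoDB collection


def insert_node(doc_id, parent_id=None):
    """Store a node with an explicit link to its parent (None for the root)."""
    docs[doc_id] = {"_id": doc_id, "parent_id": parent_id}


def path_of(doc_id):
    """Rebuild the filesystem-style path by walking parent links upward."""
    parts = []
    while doc_id is not None:
        parts.append(doc_id)
        doc_id = docs[doc_id]["parent_id"]
    return "/".join(reversed(parts))


insert_node("root")
insert_node("child", parent_id="root")
insert_node("grandchild", parent_id="child")
print(path_of("grandchild"))  # root/child/grandchild
```

With this layout a record's ID and its hierarchical position are decoupled, which also makes removal simpler: deleting a subtree is a query on `parent_id` links rather than dropping nested subcollections.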

Overall, in addition to basic bug fixing, we still need to look into
the following:
- ability to write over (& update) existing DB documents
- ability to remove documents cleanly.
- resolve issue of subcollection nesting.
…zes bulk_write

Adds the ability to specify whether existing MongoDB documents may be
overwritten when writing pyGSTi objects to a mongo database.  This is implemented
by adding 'overwrite_existing' parameters to many of the write functions
in mongodb.py and protocol.py.  This commit also adds a pre-check for writing a
document id that already exists (when overwrite_existing=False), so that this
common type of write failure is indicated before any auxiliary documents are
written to the database.  Finally, writes to the MongoDB database are updated
to collect all the insert/replace operations and then use pymongo's bulk_write
function instead of invoking insert_one / replace_one repeatedly throughout
calls to write_auxtree_to_mongodb.  This should benefit both performance and
robustness.
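The batching-plus-pre-check pattern described above can be sketched in a few lines. A dict stands in for the collection and this `bulk_write` helper is a hypothetical simplification of pymongo's real `bulk_write`; the point is only that the duplicate-id check happens before any document is written, so a failure cannot leave auxiliary documents behind.

```python
def bulk_write(collection, ops, overwrite_existing=False):
    """Apply a batch of insert/replace operations to a dict-backed collection.

    When overwrite_existing is False, all ids are pre-checked before any
    write occurs, so a duplicate aborts the whole batch cleanly.
    """
    if not overwrite_existing:
        for op in ops:
            if op["_id"] in collection:
                raise ValueError("document %r already exists" % op["_id"])
    for op in ops:  # nothing has been written until every check passed
        collection[op["_id"]] = op


coll = {"existing_doc": {"_id": "existing_doc", "old": True}}
ops = [{"_id": "edesign_A"}, {"_id": "existing_doc"}]

try:
    bulk_write(coll, ops, overwrite_existing=False)
except ValueError as err:
    print("pre-check caught:", err)

assert "edesign_A" not in coll  # no partial write occurred

bulk_write(coll, ops, overwrite_existing=True)  # replaces existing_doc
```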
…al functions.

Adds the `pygsti.io.mongodb.subcollection_names` dictionary, which centralizes
the subcollection names used when writing pygsti objects to MongoDB instances.
Also adds `pygsti.io.mongodb.STANDALONE_COLLECTION_NAME`, which names the top-level
collection used to store "standalone" pygsti objects (classes that have their own
"from_mongodb", "write_to_mongodb", and "remove_from_mongodb" methods but are NOT
TreeNode derived).

Updates sub-collection names, including those used for dataset/datarow serialization,
to be more succinct (e.g., changed "pygsti_datasets" to "datasets") so the DB names
are cleaner - ideally a single "pygsti_" collection name with non-"pygsti_"-prefixed
subcollections.

Also fixes `remove_auxtree_from_mongodb` so it properly removes everything it
should - linked standalone object documents in particular.

At this point all the basic MongoDB operations have been tested to work with
all the main data types: edesigns, protocol data, and results.  Currently,
we don't serialize result directories as their own records.

This keeps things simpler, and currently these dir documents aren't needed because
they don't store anything - but we may want to add these records for easier
future expansion (FUTURE work).
Only works if you have pymongo installed and a MongoDB server
running on localhost:27017, but illustrates the usage and works
on Erik's machine.
Adds necessary read/write from/to MongoDB methods to CachedObjectiveFunction
(forgotten previously).  Adds MongoSerializable class as an extension of
NicelySerializable that gives the object being serialized access to a
MongoDB database for placing large/complex components.  This framework
is needed by ConfidenceRegionFactory, which uses gridfs to store Hessian
and Hessian projection data when serialized to a MongoDB database.  This
is necessary because (currently) the maximum MongoDB document size is 16MB.
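The 16 MB per-document limit is why GridFS is needed for large Hessian data: GridFS splits a payload into fixed-size chunk documents (255 kB each by default) and reassembles them in order on read. A hedged sketch of just the chunking idea, with a tiny chunk size and a small byte payload for illustration:

```python
def split_chunks(data, chunk_size):
    """Split a byte payload into fixed-size chunks, as GridFS does internally."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


payload = bytes(range(10)) * 100  # 1000-byte stand-in for a large Hessian array
chunks = split_chunks(payload, chunk_size=256)
print(len(chunks))  # 4 chunks: three of 256 bytes plus one of 232 bytes

# Reassembling the chunks in order recovers the original payload exactly.
assert b"".join(chunks) == payload
```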

This commit also fixes a few minor bugs brought out by testing the
serialization of large & nested trees of results, data, and edesigns
that included confidence region factories and wildcard budgets.
@sserita sserita left a comment

This looks good! Had one comment about a potential extraneous return, but nothing that would block this PR.

@@ -525,7 +525,7 @@ def write_obj_to_meta_based_dir(obj, dirname, auxfile_types_member, omit_attribu
vals = obj.__dict__
auxtypes = obj.__dict__[auxfile_types_member]

write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
return write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
sserita (Contributor) commented on the diff:

This seems unnecessary? Unless there were plans to pass on the return value of write_meta_based_dir in the future, but that currently returns None as well.

enielse (Collaborator, Author) replied:

Yep, good catch. This was a leftover from some other idea I had about implementing the mongoDB serialization. I've added a commit removing it.

This was recently added as a part of a possible alternate implementation
for mongoDB serialization that wasn't ultimately pursued.
@sserita sserita merged commit 92b8179 into develop Nov 17, 2022
@sserita sserita deleted the feature-mongodb-serialization branch November 17, 2022 00:45