
Feature mongodb serialization #276

Merged (7 commits into develop, Nov 17, 2022)

Conversation


@enielse enielse commented Nov 16, 2022

Adds ability to serialize pyGSTi objects to a MongoDB database.

Leverages the framework used to save pyGSTi objects to json files within a filesystem hierarchy to alternatively save these objects to a MongoDB database. This medium offers some potential benefits over a filesystem: it is easier to share and distribute (e.g., via "sharding", in MongoDB parlance), and it can be made robust through transactions and automated backups.

This PR represents the first implementation of a MongoDB serialization interface, adding MongoDB counterparts to many of the file I/O functions. In particular, it adds write_to_mongodb methods for experiment design, protocol data, and protocol results objects, and adds pygsti.io.read_edesign_from_mongodb, pygsti.io.read_data_from_mongodb, and pygsti.io.read_results_from_mongodb functions. Human-readable names should be used as the doc_id arguments to these functions, which loosely correspond to the directory-node names within the filesystem picture.

A module of unit tests is included, which tests most but not all of the functionality (missing in particular serialization of ConfidenceRegionFactory objects). Note, however, that pymongo and a local MongoDB server are needed to run these tests and currently pymongo is not listed as a "testing" dependency in pyGSTi.
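To make the intended call pattern concrete without requiring pygsti, pymongo, or a live MongoDB server, here is a hedged sketch in which a plain dict-backed class stands in for a MongoDB collection. The `FakeCollection` class and the `"my_gst_edesign"` doc_id are illustrative inventions; the real interface added by this PR is the `write_to_mongodb` methods and the `pygsti.io.read_*_from_mongodb` functions named above.

```python
class FakeCollection:
    """Minimal dict-backed stand-in for a pymongo Collection keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        if doc["_id"] in self._docs:
            raise ValueError("duplicate _id: %s" % doc["_id"])
        self._docs[doc["_id"]] = doc

    def find_one(self, query):
        return self._docs.get(query["_id"])


# Human-readable names serve as doc_id values, loosely mirroring the
# directory-node names used in the filesystem serialization picture.
edesigns = FakeCollection()
edesigns.insert_one({"_id": "my_gst_edesign", "type": "ExperimentDesign"})
doc = edesigns.find_one({"_id": "my_gst_edesign"})
print(doc["type"])  # ExperimentDesign
```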

Adds the ability to serialize pyGSTi objects to a MongoDB database by
creating routines, parallel to those used for disk (filesystem) I/O,
that read and write MongoDB documents.  This commit adds support for
many of pyGSTi's objects, and the main types (experiment designs, datasets,
results, protocols) have been verified to read from and write to a
MongoDB instance.

Most of the implementation lies within the new io/mongodb.py module,
which parallels metadir.py functionality for a database.  Issues arise
in that the paths used for filesystem I/O serve as both an ID (e.g.,
for a DB record) and a hierarchical link, indicating the parent and children
of a record.  The current MongoDB implementation uses subcollections
to store related data, but this still seems a bit clunky and can result
in many "nested" MongoDB subcollections.  Also, removal of data is less
straightforward in the DB setting.
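One alternative to nested subcollections is to keep every node in a single flat collection and record hierarchy via an explicit parent link, from which the filesystem-style path can be rebuilt. This is an illustrative sketch of that idea only, not the scheme this PR actually uses; all names here are hypothetical.

```python
docs = {}  # _id -> document, standing in for one flat MongoDB collection


def insert_node(doc_id, parent_id=None):
    """Store a node with an explicit link to its parent (None for the root)."""
    docs[doc_id] = {"_id": doc_id, "parent_id": parent_id}


def path_of(doc_id):
    """Rebuild the filesystem-style path by walking parent links upward."""
    parts = []
    while doc_id is not None:
        parts.append(doc_id)
        doc_id = docs[doc_id]["parent_id"]
    return "/".join(reversed(parts))


insert_node("root")
insert_node("child", parent_id="root")
insert_node("grandchild", parent_id="child")
print(path_of("grandchild"))  # root/child/grandchild
```

With this layout a record's ID and its hierarchical position are decoupled, which also makes removal simpler: deleting a subtree is a query on `parent_id` links rather than dropping nested subcollections.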

Overall, in addition to basic bug fixing, we still need to look into
the following:
- ability to write over (& update) existing DB documents
- ability to remove documents cleanly.
- resolve issue of subcollection nesting.
…zes bulk_write

Adds the ability to specify whether existing MongoDB documents may be
overwritten when writing pyGSTi objects to a mongo database.  This is implemented
by adding 'overwrite_existing' parameters to many of the write functions
in mongodb.py and protocol.py.  This commit also adds a pre-check for writing a
document id that already exists (when overwrite_existing=False), so that this
common type of write failure is indicated before any auxiliary documents are
written to the database.  Finally, writes to the MongoDB database are updated
to collect all the insert/replace operations and then use pymongo's bulk_write
function instead of invoking insert_one / replace_one repeatedly throughout
calls to write_auxtree_to_mongodb.  This should benefit both performance and
robustness.
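The batching-plus-pre-check pattern described above can be sketched in a few lines. A dict stands in for the collection and this `bulk_write` helper is a hypothetical simplification of pymongo's real `bulk_write`; the point is only that the duplicate-id check happens before any document is written, so a failure cannot leave auxiliary documents behind.

```python
def bulk_write(collection, ops, overwrite_existing=False):
    """Apply a batch of insert/replace operations to a dict-backed collection.

    When overwrite_existing is False, all ids are pre-checked before any
    write occurs, so a duplicate aborts the whole batch cleanly.
    """
    if not overwrite_existing:
        for op in ops:
            if op["_id"] in collection:
                raise ValueError("document %r already exists" % op["_id"])
    for op in ops:  # nothing has been written until every check passed
        collection[op["_id"]] = op


coll = {"existing_doc": {"_id": "existing_doc", "old": True}}
ops = [{"_id": "edesign_A"}, {"_id": "existing_doc"}]

try:
    bulk_write(coll, ops, overwrite_existing=False)
except ValueError as err:
    print("pre-check caught:", err)

assert "edesign_A" not in coll  # no partial write occurred

bulk_write(coll, ops, overwrite_existing=True)  # replaces existing_doc
```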
…al functions.

Adds the `pygsti.io.mongodb.subcollection_names` dictionary, which centralizes
the subcollection names used when writing pygsti objects to MongoDB instances.
Also adds `pygsti.io.mongodb.STANDALONE_COLLECTION_NAME`, which names the top-level
collection used to store "standalone" pygsti objects (classes that have their own
"from_mongodb", "write_to_mongodb", and "remove_from_mongodb" methods but are NOT
TreeNode derived).

Updates sub-collection names, including those used for dataset/datarow serialization,
to be more succinct (e.g., changed "pygsti_datasets" to "datasets") so the DB names
are cleaner - ideally a single "pygsti_" collection name with non-"pygsti_"-prefixed
subcollections.

Also fixes `remove_auxtree_from_mongodb` so it properly removes everything it
should - linked standalone object documents in particular.

At this point all the basic MongoDB operations have been tested to work with
all the main data types: edesigns, protocol data, and results.  Currently,
we don't serialize result directories as their own records.

This keeps things simpler, and currently these dir documents aren't needed because
they don't store anything - but we may want to add these records for easier
future expansion (FUTURE work).
Only works if you have pymongo installed and a MongoDB server
running on localhost:27017, but illustrates the usage and works
on Erik's machine.
Adds necessary read/write from/to MongoDB methods to CachedObjectiveFunction
(forgotten previously).  Adds MongoSerializable class as an extension of
NicelySerializable that gives the object being serialized access to a
MongoDB database for placing large/complex components.  This framework
is needed by ConfidenceRegionFactory, which uses gridfs to store Hessian
and Hessian projection data when serialized to a MongoDB database.  This
is necessary because (currently) the maximum MongoDB document size is 16MB.
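The 16 MB per-document limit is why GridFS is needed for large Hessian data: GridFS splits a payload into fixed-size chunk documents (255 kB each by default) and reassembles them in order on read. A hedged sketch of just the chunking idea, with a tiny chunk size and a small byte payload for illustration:

```python
def split_chunks(data, chunk_size):
    """Split a byte payload into fixed-size chunks, as GridFS does internally."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


payload = bytes(range(10)) * 100  # 1000-byte stand-in for a large Hessian array
chunks = split_chunks(payload, chunk_size=256)
print(len(chunks))  # 4 chunks: three of 256 bytes plus one of 232 bytes

# Reassembling the chunks in order recovers the original payload exactly.
assert b"".join(chunks) == payload
```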

This commit also fixes a few minor bugs brought out by testing the
serialization of large & nested trees of results, data, and edesigns
that included confidence region factories and wildcard budgets.
@sserita sserita left a comment

This looks good! Had one comment about a potential extraneous return, but nothing that would block this PR.

@@ -525,7 +525,7 @@ def write_obj_to_meta_based_dir(obj, dirname, auxfile_types_member, omit_attribu
vals = obj.__dict__
auxtypes = obj.__dict__[auxfile_types_member]

write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
return write_meta_based_dir(dirname, vals, auxtypes, init_meta=meta)
sserita (Contributor) commented on the diff:

This seems unnecessary? Unless there were plans to pass on the return value of write_meta_based_dir in the future, but that currently returns None as well.

enielse (Collaborator, Author) replied:

Yep, good catch. This was a leftover from some other idea I had about implementing the mongoDB serialization. I've added a commit removing it.

This was recently added as a part of a possible alternate implementation
for mongoDB serialization that wasn't ultimately pursued.
@sserita sserita merged commit 92b8179 into develop Nov 17, 2022
@sserita sserita deleted the feature-mongodb-serialization branch November 17, 2022 00:45