
Feature mongodb upgrade #284

Merged
merged 20 commits into from
Jan 10, 2023
Conversation

enielse
Collaborator

@enielse enielse commented Dec 20, 2022

Upgrades MongoDB serialization framework to be more powerful and efficient

Updates the structure of pyGSTi objects stored in a MongoDB database so that now:

  1. objects are identified by a normal bson.objectid.ObjectId identifier, rather than a filesystem-like path.
  2. a parent object holds the IDs of its child or attribute documents within its "main" document.

This results in a more typical data model and allows documents to serve in a many-to-one capacity, improving space efficiency. For example, an experiment design can be linked to multiple protocol data objects and a circuit document can be referenced by multiple experiment designs and/or data sets. This could result in massive data saving by not having to duplicate, e.g., circuits shared by many different experiment designs.
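The linked-document layout can be sketched with plain Python dicts (a simplification: real pyGSTi documents live in pymongo collections and are keyed by `bson.objectid.ObjectId` values, and the `'edesign_id'` field name below is hypothetical):

```python
import uuid

# In-memory stand-ins for MongoDB collections (dicts keyed by ID);
# uuid hex strings stand in for bson ObjectIds.
edesigns = {}
protocol_data = {}

def insert(collection, doc):
    doc_id = uuid.uuid4().hex  # stand-in for bson.ObjectId()
    collection[doc_id] = doc
    return doc_id

# One experiment design document...
edesign_id = insert(edesigns, {'circuits': ['Gx', 'Gy', 'GxGy']})

# ...referenced by two protocol-data documents: each parent holds the
# child's ID, so the edesign is stored only once (many-to-one).
data1_id = insert(protocol_data, {'edesign_id': edesign_id, 'counts': [10, 20]})
data2_id = insert(protocol_data, {'edesign_id': edesign_id, 'counts': [11, 19]})
```

Because both parents hold the same ID, the shared circuits never need to be duplicated in the database.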

Internally, the MongoSerializable base class has been elevated in importance and now serves as the base class for both large (experiment design, protocol data, and results) objects and small NicelySerializable-derived objects. This allows nearly all of pyGSTi's objects to be serialized to a MongoDB using the same interface.

This PR marks a point where the MongoDB support is much more complete and established than it was. I expect we'll still want to tweak some things before the format stabilizes completely, but this seems to be almost there. The biggest remaining issue concerns removing documents, which must be done with care since documents can now be shared, i.e., have multiple parent objects.

Unit tests exist and have been updated, but are not very extensive at this point.

enielse and others added 16 commits December 12, 2022 18:22
…mart" way.

Previously, setting overwrite=False when saving an object to a mongoDB
would fail if a document with the destination id already existed.  With
this update, an error is *not* raised in this case if the existing mongoDB
document matches the one being written.

This behaviour is advantageous when dealing with nested objects, and
paves the way toward a serialization that doesn't create duplicate
documents for objects that are contained within other objects.  For
example, saving a ProtocolData object that contains a reference to an
already-saved experiment design might be able to avoid re-writing that
experiment design by checking that a matching document exists in
the database already.  This also fixes an even more blatant issue:
when an object with standalone sub-objects was saved multiple times with
the same ID, the old standalone object documents were orphaned and new
ones were created on each save.

The use of a new function, pygsti.io.mongodb.prepare_doc_for_existing_doc_check,
massages a document that is being saved to smooth out aspects of a
strict dictionary comparison with a database document.
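A rough sketch of the "smart" `overwrite=False` behaviour described above (a dict stands in for a pymongo collection, and the document-massaging done by `prepare_doc_for_existing_doc_check` is omitted):

```python
def write_doc(collection, doc_id, doc, overwrite=False):
    """Insert `doc` under `doc_id`.  With overwrite=False, an existing
    document is an error only if it *differs* from the one being written;
    an identical existing document makes the write a harmless no-op."""
    existing = collection.get(doc_id)
    if existing is not None and not overwrite:
        if existing == doc:
            return doc_id  # matching document already stored: nothing to do
        raise ValueError(f"Document {doc_id!r} exists and differs")
    collection[doc_id] = doc
    return doc_id
```

This is what lets a ProtocolData save skip re-writing an experiment design that is already in the database.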
Adds preloaded_edesign and edesign_already_written arguments to the
I/O methods of ProtocolData, allowing users to load the edesign separately.
Similarly, adds data_already_written argument to ProtocolResultsDir write
methods.

This update may prove especially helpful when reading from and writing
to MongoDB databases, since it may be useful to separate edesigns, data,
and results more than is done when using the filesystem.

In all changes here, the default values of new arguments preserve the previous behavior.

 and already_loaded arguments to
…tions.

Adds these routines to pygsti.io.readers to maintain the parallel
filesystem & MongoDB APIs (filesystem versions of these functions exist
already and MongoDB versions were forgotten until now).
Previously, if edesign had a different MongoDB ID than the data or
results object that contained it, serialization would fail because
the data or results object would expect to be able to query for the
edesign object using the same id.  Now a `preloaded_edesign` can be
passed to TreeNode._init_children(...) explicitly.
…gous.

The initial MongoDB serialization support basically duplicated the
JSON-based filesystem serialization to write MongoDB documents.  This
update restructures the MongoDB document format used by pyGSTi objects,
while keeping the outward API nearly the same.

From an external perspective, the main change is that experiment design,
data, and results objects are no longer given path-like ids but will have
normal database identifiers.  These IDs are generated the first time an
object is written to the database (and returned by the "write" method)
and are held within the Python object for future writes.  You no longer
write *to* a path/ID, you simply write an object: either a new ID
is generated or an existing stored ID is used as appropriate.
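The write-an-object pattern can be illustrated with a toy class (names are illustrative, not pyGSTi's actual API; uuid stands in for `bson.ObjectId` and a dict for a collection):

```python
import uuid

class MongoSerializableSketch:
    """Toy model of the new interface: the first write generates a
    database ID, which the Python object remembers and reuses on
    every subsequent write."""
    def __init__(self, state):
        self.state = state
        self._dbcoordinates = None  # no database ID until first write

    def write(self, collection):
        if self._dbcoordinates is None:
            self._dbcoordinates = uuid.uuid4().hex  # ID made on first write
        collection[self._dbcoordinates] = {'state': self.state}
        return self._dbcoordinates
```

Writing the same object twice reuses the stored ID instead of creating a second document.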

Within the database, these IDs are used to link the various documents,
and now a parent object will hold the IDs of its children/attribute
documents within its "main" document.  This results in a more typical
data model and allows documents to serve in a many-to-one capacity,
improving space efficiency.  For example, an experiment design can
be linked to multiple protocol data objects and a circuit document
can be referenced by multiple experiment designs and/or data sets.

Internally, this commit makes MongoSerializable, which used to be
a separate and somewhat ad-hoc base class, a base class of
NicelySerializable (the base class for objects that have simple
JSON serializations).  A MongoSerializable object is one that can
be serialized to a MongoDB database, and so this update makes everything
that is JSON-serializable (most pyGSTi objects) also MongoDB-serializable,
which is nice and cohesive.  Furthermore, even pyGSTi's large
non-NicelySerializable objects (edesigns, protocol data, results,
protocols, datasets, estimates, confidence region factories,
cached objective fns) *are* MongoSerializable, and so are brought
onto the same footing as most of the rest of pyGSTi's objects.

Object removal is more difficult under the new data model, since
objects don't have clear owners and so it's difficult to tell
whether the members/attributes of an object should always be removed
along with it.  The `RecursiveRemovalSpecification` object is meant
to give some control over this, but users should be very careful
with removing database documents.
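The care required when removing shared documents can be sketched as a reference check before deletion (plain dicts stand in for collections, and the `'children'` field name is hypothetical; with pymongo one would use `count_documents` on the parent collection instead of the Python-side scan):

```python
def safe_remove_child(parents, children, child_id):
    """Delete a child document only if no parent still references it,
    since under the new data model a document may have several parents."""
    still_referenced = any(child_id in p.get('children', ())
                           for p in parents.values())
    if still_referenced:
        return False  # shared document: leave it in place
    del children[child_id]
    return True
```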

All in all, this commit represents a significant upgrade to the
MongoDB serialization capabilities in pyGSTi, but should be tested
more before we rely on it.
Passes on Erik's laptop, but requires a local MongoDB instance running
(tests are skipped if pymongo isn't installed).
Fixes serialization code for the case when the `_basis` of an OpGaugeGroupWithBasis
object is just a string and not a `Basis` object.
Fixes a bug so that "StdOutcomeQubits" can be used in a dataset file
header, and infers outcomes from the stated "Columns" field in the header.
The latter helps read in dataset files in a more consistent manner
(i.e., with a consistent ordering of outcome labels), facilitating the
comparison of multiple datasets.
…m_mongodb

Updates these functions to reflect recent changes to the MongoDB serialization
framework (such as taking a MongoDB instance instead of a collection).  These
updates should have been added along with recent changes but were just
forgotten/missed.
Several bugs were uncovered while testing MongoDB serialization,
and are fixed here:

- within core.run_iterative_gst a separate copy of the optimizer
  object is made when running the first iteration with a different
  number of finite difference iterations (see "fditers").  Previously
  the existing optimizer was updated, which is not good because the
  optimizer object is serializable and persistent, and so is expected
  to remain unchanged by running GST.

- adds previously omitted de-serialization of a TPState's ._basis member,
  so that TPStates are not altered by saving and loading them from disk/DB.

- adds forgotten '_dbcoordinates': 'none' to Estimate class .auxtypes
  member (this recently added member should never be serialized).
Replaces an occurrence of ReplaceOne(doc_id) with ReplaceOne({'_id': doc_id}),
as is needed by the pymongo API.
TPState objects that were serialized to disk prior to the fix to the
basis serialization of TPState objects (in commit
aa8a290) could not be loaded,
because they were at least sometimes saved with an empty basis.
This commit fixes the issue by adding a HACK in tpstate.py that
checks for an empty basis and converts it to None.  This isn't a
complete patch, since the basis should really be something non-None, but
it's the best we can do given that no information about the basis was
stored.
Contributor

@sserita sserita left a comment


Everything seems good but there's a lot of code here, so I asked a couple of clarifying questions to make sure I understood the goal of the PR a little better.

Also I wasn't sure if you were ready for this to go in, since you've been pushing bugfixes.

@@ -797,8 +798,12 @@ def _max_array_types(artypes_list): # get the maximum number of each array type

for j, obj_fn_builder in enumerate(iteration_objfn_builders):
tNxt = _time.time()
optimizer.fditer = optimizer.first_fditer if (i == 0 and j == 0) else 0
opt_result, mdc_store = run_gst_fit(mdc_store, optimizer, obj_fn_builder, printer - 1)
if i == 0 and j == 0: # special case: in first optimization run, use "first_fditer"
Contributor


Is this a bugfix? Don't see how it is related to the mongo serialization.

Collaborator Author


Yes, and it's related to mongo serialization :). The "bug" was that we were modifying the optimizer object (by setting its .fditer attribute on old line 800) within this GST-running routine. This modification is unexpected by the caller, and is caught by the MongoDB serialization which includes checks to make sure serializable objects like the optimizer (Optimizer is a subclass of NicelySerializable) don't change when we don't expect them to.
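The copy-instead-of-mutate fix can be sketched like this (the `Optimizer` class below is a stand-in for pyGSTi's serializable optimizer, with `fditer`/`first_fditer` modeled on the diff above):

```python
import copy

class Optimizer:
    """Stand-in for the serializable optimizer: fditer is the number of
    finite-difference iterations; first_fditer is its value for the
    very first optimization run."""
    def __init__(self, fditer=0, first_fditer=3):
        self.fditer = fditer
        self.first_fditer = first_fditer

def run_first_iteration(optimizer):
    # The fix: work on a copy, so the caller's persistent, serializable
    # optimizer object is not silently mutated by running GST.
    opt = copy.copy(optimizer)
    opt.fditer = opt.first_fditer
    return opt
```

After running, the caller's original optimizer still serializes exactly as before.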



class NicelySerializable(object):
class NicelySerializable(MongoSerializable):
Contributor


Just to confirm the swap of inheritance order. The intention is that everything that is filesystem serializable is also mongo serializable, but there are some objects (e.g. DataSet, Estimate, Protocol) which are currently only mongo serializable (as of this PR)? I assume the goal is to enable the storing/reuse of these subcomponents, which before were only dumped as part of a larger tree structure?

Collaborator Author


Yes, this is correct, and I agree it's a bit counterintuitive at first. The thinking is that "everything can be rendered as JSON can also be serialized to a MongoDB" which implies that anything that is NicelySerializable must also be MongoSerializable. Another way of thinking about this is that something that is NicelySerializable is able to render itself as simple JSON, whereas something that is MongoSerializable is able to insert itself into a Mongo database, perhaps using multiple records. In principle we could take all the records a MongoSerializable object generates and cram them together into one big piece of JSON, but in practice there are certain more complex objects (like DataSets, Estimates, and Protocols) that naturally can save themselves as multiple records in a database but would be a bit unwieldy as a huge piece of JSON.

Fixes the implementation of the 'circuit-str-json' auxtype, which is
used to serialize fiducial pair dictionaries (in experiment designs).
The metadir implementation is fine, but when I implemented the MongoDB
version I didn't realize exactly what was being serialized: I thought
it was just a list of circuits that should be stored as a list of
JSON circuit strings, but really it can be any JSON-able structure that
may contain Circuits and no other strings, and when serialized the
Circuit objects are converted to strings.  This commit fixes that
oversight so that the 'circuit-str-json' auxtype now works properly
with MongoDB serialization.
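A sketch of the recursive conversion the fixed auxtype needs (the `Circuit` class below is a minimal stand-in for `pygsti.circuits.Circuit`, and the function is reconstructed from the description, not copied from pyGSTi):

```python
class Circuit:
    """Minimal stand-in for pygsti.circuits.Circuit."""
    def __init__(self, layers):
        self.layers = layers
    def __str__(self):
        return ''.join(self.layers)

def circuits_to_strings(obj):
    """Recursively walk an arbitrary JSON-able structure, replacing any
    Circuit objects with their string form, leaving everything else as-is."""
    if isinstance(obj, Circuit):
        return str(obj)
    if isinstance(obj, dict):
        return {k: circuits_to_strings(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [circuits_to_strings(v) for v in obj]
    return obj
```

Because the structure contains no other strings, every string in the result can be unambiguously parsed back into a Circuit on load.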

This commit also adds a backwards compatibility HACK for loading
in old objects that were written before the MongoDB serialization
code existed and thus don't have {... _dbcoordinates: 'none'} in their
auxtypes member.  The hack adds a _dbcoordinates: 'none' entry to
the auxtypes dictionaries of objects that don't have them.
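The backwards-compatibility hack amounts to defaulting a missing entry (a sketch from the description; the 'none' value means "never serialize this member"):

```python
def patch_auxtypes(auxtypes):
    """Add the '_dbcoordinates': 'none' entry to auxtypes dicts of objects
    written before the MongoDB serialization code existed, which lack it."""
    auxtypes.setdefault('_dbcoordinates', 'none')
    return auxtypes
```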
…data_id.

Adds a warning and recovery attempt for the case when a results object
is being saved, is given the ID of a data object that is supposedly
already saved, but the data object doesn't contain any ._dbcoordinates
(i.e. this attribute is None).  Previously the code would just error,
and now it attempts to save the data object.  This really shouldn't
be needed (hence the warning), but it seems like good insurance.
Replaces a call to pymongo's find with find_one and count_documents
calls, which profiling shows to be faster.  I thought that iterating
through the results of a single find operation would be faster, but
I was wrong :)
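The pattern can be sketched with a duck-typed stand-in for a pymongo collection (the `FakeCollection` class is only for illustration; the timing claim is from profiling reported in the commit, not reproduced here):

```python
class FakeCollection:
    """In-memory stand-in exposing just the two pymongo-style methods
    needed to show the existence-check-then-fetch pattern."""
    def __init__(self, docs):
        self._docs = docs
    def count_documents(self, filt):
        return sum(all(d.get(k) == v for k, v in filt.items())
                   for d in self._docs)
    def find_one(self, filt):
        for d in self._docs:
            if all(d.get(k) == v for k, v in filt.items()):
                return d
        return None

def load_if_present(collection, doc_id):
    # Check existence with count_documents, then fetch a single document
    # with find_one, instead of iterating a general find() cursor.
    if collection.count_documents({'_id': doc_id}) == 0:
        return None
    return collection.find_one({'_id': doc_id})
```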
Removes a spurious pickling of 'json' data before saving it to
the database, most likely from a copy/paste error of the
'pickle' auxtype.  Before this fix, 'json' auxtype data would not
be saved correctly, and would be unloadable (eek!).  The prominent
example of a 'json'-auxtype member is the 'idealout_lists' attribute
of an RB experiment design.
@sserita sserita merged commit a4106a7 into develop Jan 10, 2023
@sserita sserita deleted the feature-mongodb-upgrade branch January 10, 2023 18:03