Feature mongodb upgrade #284
Conversation
…mart" way. Previously, setting overwrite=False when saving an object to a mongoDB would fail if a document with the destination id already existed. With this update, an error is *not* raised in this case if the existing mongoDB document matches the one being written. This behaviour is advantageous when dealing with nested objects, and paves the way toward a serialization that doesn't create duplicate documents for objects that are contained within other objects. For example, saving a ProtoclData object that contains a reference to an already-saved experiment design might be able to avoid re-writing that experiment design by checking that a matching document exists in the database already. This also fixes an even more blatent issue that when an object with standalone sub-objects was saved multiple times with the same ID this would continue to zombie the old and create new standalone object documents. The use of a new function, pygsti.io.mongodb.prepare_doc_for_existing_doc_check, massages a document that is being saved to smooth out aspects of a strict dictionary comparison with a database document.
Adds preloaded_edesign and edesign_already_written arguments to the I/O methods of ProtocolData, allowing users to load the edesign separately. Similarly, adds a data_already_written argument to ProtocolResultsDir write methods. This update may prove especially helpful when reading from and writing to MongoDB databases, since it may be useful to separate edesigns, data, and results more than is done when using the filesystem. In all changes here, the default values of the new arguments preserve the previous behavior.
…tions. Adds these routines to pygsti.io.readers to maintain the parallel filesystem & MongoDB APIs (filesystem versions of these functions exist already and MongoDB versions were forgotten until now).
Previously, if edesign had a different MongoDB ID than the data or results object that contained it, serialization would fail because the data or results object would expect to be able to query for the edesign object using the same id. Now a `preloaded_edesign` can be passed to TreeNode._init_children(...) explicitly.
…gous. The initial MongoDB serialization support basically duplicated the JSON-based filesystem serialization to write MongoDB documents. This update restructures the MongoDB document format used by pyGSTi objects, while keeping the outward API nearly the same. From an external perspective, the main change is that experiment design, data, and results objects are no longer given path-like ids but have normal database identifiers. These IDs are generated the first time an object is written to the database (and returned by the "write" method) and are held within the Python object for future writes. You no longer write *to* a path/ID, you simply write an object: either a new ID is generated or an existing stored ID is used, as appropriate. Within the database, these IDs are used to link the various documents, and now a parent object holds the IDs of its children/attribute documents within its "main" document. This results in a more typical data model and allows documents to serve in a many-to-one capacity, improving space efficiency. For example, an experiment design can be linked to multiple protocol data objects and a circuit document can be referenced by multiple experiment designs and/or data sets.

Internally, this commit makes MongoSerializable, which used to be a separate and somewhat ad-hoc base class, a base class of NicelySerializable (the base class for objects that have simple JSON serializations). A MongoSerializable object is one that can be serialized to a MongoDB database, so this update makes everything that is JSON-serializable (most pyGSTi objects) also MongoDB-serializable, which is nice and cohesive. Furthermore, even pyGSTi's large non-NicelySerializable objects (edesigns, protocol data, results, protocols, datasets, estimates, confidence region factories, cached objective fns) *are* MongoSerializable, and so are brought onto the same footing as most of the rest of pyGSTi's objects.
Object removal is more difficult under the new data model, since objects don't have clear owners and so it's difficult to tell whether the members/attributes of an object should always be removed along with it. The `RecursiveRemovalSpecification` object is meant to give some control over this, but users should be very careful with removing database documents. All in all, this commit represents a significant upgrade to the MongoDB serialization capabilities in pyGSTi, but should be tested more before we rely on it.
Passes on Erik's laptop, but requires a local MongoDB instance running (tests are skipped if pymongo isn't installed).
Fixes serialization code when the `_basis` of an OpGaugeGroupWithBasis object is just a string and not a `Basis` object.
Fixes a bug, allowing the use of "StdOutcomeQubits" in dataset file headers, and infers outcomes from the stated "Columns" field in the header. The latter helps read in dataset files more consistently (i.e. with a fixed order of outcome labels), facilitating the comparison of multiple datasets.
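The "Columns"-based outcome inference might look roughly like the following. This is a hypothetical helper, not pyGSTi's actual parser, and it assumes count-column names of the form "&lt;outcome&gt; count".

```python
def outcome_labels_from_columns(columns_field):
    """Infer ordered outcome labels from a dataset-file 'Columns' header value,
    e.g. '0 count, 1 count' -> ['0', '1'].  Columns that are not count columns
    (such as a 'count total' column) are skipped."""
    labels = []
    for col in columns_field.split(','):
        col = col.strip()
        if col.endswith(' count'):
            labels.append(col[:-len(' count')])
    return labels
```

Because the labels come from the header in order, every dataset read this way uses the same outcome-label ordering, which is what makes cross-dataset comparison straightforward.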
…m_mongodb Updates these functions to reflect recent changes to the MongoDB serialization framework (like taking a MongoDB instance instead of a collection). These updates should have been made along with the recent changes but were missed.
Several bugs were uncovered while testing MongoDB serialization, and are fixed here:
- Within core.run_iterative_gst, a separate copy of the optimizer object is now made when running the first iteration with a different number of finite-difference iterations (see "fditer"). Previously the existing optimizer was updated in place, which is not good because the optimizer object is serializable and persistent, and so is expected to remain unchanged by running GST.
- Adds previously omitted de-serialization of a TPState's ._basis member, so that TPStates are not altered by saving and loading them from disk/DB.
- Adds the forgotten '_dbcoordinates': 'none' entry to the Estimate class's .auxtypes member (this recently added member should never be serialized).
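The first fix follows a common pattern: mutate a copy, never the caller's object. A minimal sketch, where the `Optimizer` stub and `run_first_iteration` are illustrative stand-ins, not pyGSTi's real classes:

```python
import copy


class Optimizer:
    """Minimal stand-in for pyGSTi's serializable, persistent Optimizer object."""
    def __init__(self, first_fditer=3):
        self.first_fditer = first_fditer  # finite-difference iters for the first run
        self.fditer = None                # the value actually used by a GST fit


def run_first_iteration(optimizer):
    """Sketch of the fix: operate on a shallow copy so the caller's optimizer
    is not mutated as a side effect of running GST (the old code set
    optimizer.fditer directly, which serialization checks later flagged)."""
    opt = copy.copy(optimizer)
    opt.fditer = opt.first_fditer  # special case for the very first optimization
    return opt
```

This matters because the serialization framework checks that serializable objects are unchanged by running GST; mutating the caller's optimizer trips that check.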
Replaces an occurrence of ReplaceOne(doc_id) with ReplaceOne({'_id': doc_id}), as required by the pymongo API.
…nto feature-mongodb-upgrade
TPState objects that were serialized to disk prior to fixing the basis serialization of TPState objects (in commit aa8a290) could not be loaded, because they were at least sometimes saved with an empty basis. This commit fixes the issue by adding a HACK in tpstate.py that checks for an empty basis and converts it to None. This isn't a complete patch, as the basis should really be something non-None, but it's the best we can do given that no information about the basis was stored.
Everything seems good but there's a lot of code here, so I asked a couple of clarifying questions to make sure I understood the goal of the PR a little better.
Also I wasn't sure if you were ready for this to go in, since you've been pushing bugfixes.
```diff
@@ -797,8 +798,12 @@ def _max_array_types(artypes_list):  # get the maximum number of each array type

         for j, obj_fn_builder in enumerate(iteration_objfn_builders):
             tNxt = _time.time()
-            optimizer.fditer = optimizer.first_fditer if (i == 0 and j == 0) else 0
-            opt_result, mdc_store = run_gst_fit(mdc_store, optimizer, obj_fn_builder, printer - 1)
+            if i == 0 and j == 0:  # special case: in first optimization run, use "first_fditer"
```
Is this a bugfix? Don't see how it is related to the mongo serialization.
Yes, and it's related to mongo serialization :). The "bug" was that we were modifying the optimizer object (by setting its `.fditer` attribute on old line 800) within this GST-running routine. This modification is unexpected by the caller, and is caught by the MongoDB serialization, which includes checks to make sure serializable objects like the optimizer (`Optimizer` is a subclass of `NicelySerializable`) don't change when we don't expect them to.
```diff
-class NicelySerializable(object):
+class NicelySerializable(MongoSerializable):
```
Just to confirm the swap of inheritance order. The intention is that everything that is filesystem serializable is also mongo serializable, but there are some objects (e.g. DataSet, Estimate, Protocol) which are currently only mongo serializable (as of this PR)? I assume the goal is to enable the storing/reuse of these subcomponents, which before were only dumped as part of a larger tree structure?
Yes, this is correct, and I agree it's a bit counterintuitive at first. The thinking is that "everything that can be rendered as JSON can also be serialized to a MongoDB", which implies that anything that is `NicelySerializable` must also be `MongoSerializable`. Another way of thinking about this is that something that is `NicelySerializable` is able to render itself as simple JSON, whereas something that is `MongoSerializable` is able to insert itself into a Mongo database, perhaps using multiple records. In principle we could take all the records a `MongoSerializable` object generates and cram them together into one big piece of JSON, but in practice there are certain more complex objects (like `DataSet`s, `Estimate`s, and `Protocol`s) that naturally save themselves as multiple records in a database but would be a bit unwieldy as one huge piece of JSON.
Fixes the implementation of the 'circuit-str-json' auxtype, which is used to serialize fiducial pair dictionaries (in experiment designs). The metadir implementation is fine, but when I implemented the MongoDB version I didn't realize exactly what was being serialized (I thought it was just a list of circuits that should be stored as a list of JSON circuit strings, but really it can be anything JSON-able that may contain Circuits and doesn't contain any plain strings; when serialized, the Circuit objects are converted to strings). This commit fixes this oversight so that the 'circuit-str-json' auxtype now works properly with MongoDB serialization. This commit also adds a backwards-compatibility HACK for loading old objects that were written before the MongoDB serialization code existed and thus don't have {... _dbcoordinates: 'none'} in their auxtypes member. The hack adds a _dbcoordinates: 'none' entry to the auxtypes dictionaries of objects that lack it.
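The corrected behavior amounts to a recursive walk that stringifies only the embedded circuits. A sketch: the function name and `circuit_type` parameter are illustrative stand-ins (the real logic lives in pyGSTi's auxtype handling, and `circuit_type` stands in for pygsti.circuits.Circuit).

```python
def circuits_to_strs(obj, circuit_type):
    """Recursively convert every Circuit instance found inside an arbitrary
    JSON-able structure (dicts, lists, tuples, scalars) to its string form,
    leaving everything else unchanged.  Tuples come back as lists, matching
    what JSON round-tripping would produce anyway."""
    if isinstance(obj, circuit_type):
        return str(obj)
    if isinstance(obj, dict):
        return {k: circuits_to_strs(v, circuit_type) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [circuits_to_strs(v, circuit_type) for v in obj]
    return obj  # ints, floats, None, etc. pass through unchanged
```

The key point of the fix is that the traversal handles *any* nesting, not just a flat list of circuits.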
…data_id. Adds a warning and recovery attempt for the case when a results object is being saved and is given the ID of a data object that is supposedly already saved, but the data object doesn't contain any ._dbcoordinates (i.e. this attribute is None). Previously the code would just error; now it attempts to save the data object. This really shouldn't be needed (hence the warning), but it seems like good insurance.
Replaces a call to pymongo's find with find_one and count_documents calls, which profiling shows to be faster. I thought that iterating through the results of a single find operation would be faster, but I was wrong :)
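The replaced-call pattern can be sketched against pymongo's collection interface (`find_one` and `count_documents` are real pymongo `Collection` methods; the wrapper below is a hypothetical illustration, tested here with a stub collection rather than a live database):

```python
def load_unique_doc(collection, filter_dict):
    """Sketch of the profiling-driven change: instead of iterating a find(...)
    cursor to fetch a document and confirm its uniqueness, issue one
    count_documents(...) call and one find_one(...) call."""
    n = collection.count_documents(filter_dict)
    if n != 1:
        raise ValueError("expected exactly 1 matching document, found %d" % n)
    return collection.find_one(filter_dict)
```

Anything exposing pymongo's `count_documents`/`find_one` interface can be passed in, which also makes the logic easy to test without a running MongoDB instance.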
Removes a spurious pickling of 'json' data before saving it to the database, most likely from a copy/paste error of the 'pickle' auxtype. Before this fix, 'json'-auxtype data would not be saved correctly and would be unloadable (eek!). The prominent example of a 'json'-auxtype member is the 'idealout_lists' attribute of an RB experiment design.
Upgrades MongoDB serialization framework to be more powerful and efficient
Updates the structure of pyGSTi objects stored in a MongoDB database so that now:
- Objects are identified by a `bson.objectid.ObjectId` identifier, rather than a filesystem-like path.
- This results in a more typical data model and allows documents to serve in a many-to-one capacity, improving space efficiency. For example, an experiment design can be linked to multiple protocol data objects, and a circuit document can be referenced by multiple experiment designs and/or data sets. This could result in massive space savings by not having to duplicate, e.g., circuits shared by many different experiment designs.
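The id-linking scheme can be illustrated with plain dictionaries. Here `uuid4` stands in for `bson.objectid.ObjectId` (to stay dependency-free), and `write_object` is a hypothetical helper, not pyGSTi's actual API.

```python
from uuid import uuid4


def write_object(db, obj_doc, existing_id=None):
    """Sketch of the new data model: the first write generates a database id
    (a bson.objectid.ObjectId in the real code) which is then held by the
    Python object and reused on later writes, so parents link children by id."""
    doc_id = existing_id if existing_id is not None else uuid4().hex
    db[doc_id] = obj_doc
    return doc_id


# Two protocol-data documents sharing a single experiment-design document:
db = {}
edesign_id = write_object(db, {'type': 'edesign', 'circuits': ['Gx', 'Gy']})
data1_id = write_object(db, {'type': 'data', 'edesign_id': edesign_id})
data2_id = write_object(db, {'type': 'data', 'edesign_id': edesign_id})
```

The edesign document is stored once and referenced by id from both data documents, which is exactly the many-to-one capacity described above.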
Internally, the `MongoSerializable` base class has been elevated in importance and now serves as the base class for both large objects (experiment designs, protocol data, and results) and smaller `NicelySerializable`-derived objects. This allows nearly all of pyGSTi's objects to be serialized to a MongoDB using the same interface.

This PR marks a point where the MongoDB support is much more complete and established than it was. I expect we'll still want to tweak some things before the format stabilizes completely, but this seems to be almost there. The biggest issue regards removing documents, which should be done with care since documents can now be shared, e.g. have multiple parent objects.
Unit tests exist and have been updated, but are not very extensive at this point.