
Feature mongodb upgrade #284

Merged
merged 20 commits into from
Jan 10, 2023
Conversation

enielse
Collaborator

@enielse enielse commented Dec 20, 2022

Upgrades MongoDB serialization framework to be more powerful and efficient

Updates the structure of pyGSTi objects stored in a MongoDB database so that now:

  1. objects are identified by a normal bson.objectid.ObjectId identifier, rather than a filesystem-like path.
  2. a parent object holds the IDs of its child or attribute documents within its "main" document.

This results in a more typical data model and allows documents to serve in a many-to-one capacity, improving space efficiency. For example, an experiment design can be linked to multiple protocol data objects and a circuit document can be referenced by multiple experiment designs and/or data sets. This could result in massive data saving by not having to duplicate, e.g., circuits shared by many different experiment designs.
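The linked-document layout can be sketched with plain Python dicts (a simplification: real pyGSTi documents live in pymongo collections and are keyed by `bson.objectid.ObjectId` values, and the `'edesign_id'` field name below is hypothetical):

```python
import uuid

# In-memory stand-ins for MongoDB collections (dicts keyed by ID);
# uuid hex strings stand in for bson ObjectIds.
edesigns = {}
protocol_data = {}

def insert(collection, doc):
    doc_id = uuid.uuid4().hex  # stand-in for bson.ObjectId()
    collection[doc_id] = doc
    return doc_id

# One experiment design document...
edesign_id = insert(edesigns, {'circuits': ['Gx', 'Gy', 'GxGy']})

# ...referenced by two protocol-data documents: each parent holds the
# child's ID, so the edesign is stored only once (many-to-one).
data1_id = insert(protocol_data, {'edesign_id': edesign_id, 'counts': [10, 20]})
data2_id = insert(protocol_data, {'edesign_id': edesign_id, 'counts': [11, 19]})
```

Because both parents hold the same ID, the shared circuits never need to be duplicated in the database.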

Internally, the MongoSerializable base class has been elevated in importance and now serves as the base class for both large (experiment design, protocol data, and results) objects and small NicelySerializable-derived objects. This allows nearly all of pyGSTi's objects to be serialized to a MongoDB using the same interface.

This PR marks a point where the MongoDB support is much more complete and established than it was. I expect we'll still want to tweak some things before the format stabilizes completely, but this seems to be almost there. The biggest remaining issue concerns removing documents, which must be done with care since documents can now be shared, i.e., have multiple parent objects.

Unit tests exist and have been updated, but are not very extensive at this point.

enielse and others added 16 commits December 12, 2022 18:22
…mart" way.

Previously, setting overwrite=False when saving an object to a mongoDB
would fail if a document with the destination id already existed.  With
this update, an error is *not* raised in this case if the existing mongoDB
document matches the one being written.

This behaviour is advantageous when dealing with nested objects, and
paves the way toward a serialization that doesn't create duplicate
documents for objects that are contained within other objects.  For
example, saving a ProtocolData object that contains a reference to an
already-saved experiment design might be able to avoid re-writing that
experiment design by checking that a matching document exists in
the database already.  This also fixes an even more blatant issue:
when an object with standalone sub-objects was saved multiple times with
the same ID, the old standalone object documents were orphaned and new
ones were created on each save.

The use of a new function, pygsti.io.mongodb.prepare_doc_for_existing_doc_check,
massages a document that is being saved to smooth out aspects of a
strict dictionary comparison with a database document.
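A rough sketch of the "smart" `overwrite=False` behaviour described above (a dict stands in for a pymongo collection, and the document-massaging done by `prepare_doc_for_existing_doc_check` is omitted):

```python
def write_doc(collection, doc_id, doc, overwrite=False):
    """Insert `doc` under `doc_id`.  With overwrite=False, an existing
    document is an error only if it *differs* from the one being written;
    an identical existing document makes the write a harmless no-op."""
    existing = collection.get(doc_id)
    if existing is not None and not overwrite:
        if existing == doc:
            return doc_id  # matching document already stored: nothing to do
        raise ValueError(f"Document {doc_id!r} exists and differs")
    collection[doc_id] = doc
    return doc_id
```

This is what lets a ProtocolData save skip re-writing an experiment design that is already in the database.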
Adds preloaded_edesign and edesign_already_written arguments to the
I/O methods of ProtocolData, allowing users to load the edesign separately.
Similarly, adds data_already_written argument to ProtocolResultsDir write
methods.

This update may prove especially helpful when reading from and writing
to MongoDB databases, since it may be useful to separate edesigns, data,
and results more than is done when using the filesystem.

In all changes here, the default values of new arguments preserve the previous behavior.

 and already_loaded arguments to
…tions.

Adds these routines to pygsti.io.readers to maintain the parallel
filesystem & MongoDB APIs (filesystem versions of these functions exist
already and MongoDB versions were forgotten until now).
Previously, if edesign had a different MongoDB ID than the data or
results object that contained it, serialization would fail because
the data or results object would expect to be able to query for the
edesign object using the same id.  Now a `preloaded_edesign` can be
passed to TreeNode._init_children(...) explicitly.
…gous.

The initial MongoDB serialization support basically duplicated the
JSON-based filesystem serialization to write MongoDB documents.  This
update restructures the MongoDB document format used by pyGSTi objects,
while keeping the outward API nearly the same.

From an external perspective, the main change is that experiment design,
data, and results objects are no longer given path-like ids but will have
normal database identifiers.  These IDs are generated the first time an
object is written to the database (and returned by the "write" method)
and are held within the Python object for future writes.  You no longer
write *to* a path/ID, you simply write an object: either a new ID
is generated or an existing stored ID is used as appropriate.
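The write-an-object pattern can be illustrated with a toy class (names are illustrative, not pyGSTi's actual API; uuid stands in for `bson.ObjectId` and a dict for a collection):

```python
import uuid

class MongoSerializableSketch:
    """Toy model of the new interface: the first write generates a
    database ID, which the Python object remembers and reuses on
    every subsequent write."""
    def __init__(self, state):
        self.state = state
        self._dbcoordinates = None  # no database ID until first write

    def write(self, collection):
        if self._dbcoordinates is None:
            self._dbcoordinates = uuid.uuid4().hex  # ID made on first write
        collection[self._dbcoordinates] = {'state': self.state}
        return self._dbcoordinates
```

Writing the same object twice reuses the stored ID instead of creating a second document.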

Within the database, these IDs are used to link the various documents,
and now a parent object will hold the IDs of its children/attribute
documents within its "main" document.  This results in a more typical
data model and allows documents to serve in a many-to-one capacity,
improving space efficiency.  For example, an experiment design can
be linked to multiple protocol data objects and a circuit document
can be referenced by multiple experiment designs and/or data sets.

Internally, this commit makes MongoSerializable, which used to be
a separate and somewhat ad-hoc base class, a base class of
NicelySerializable (the base class for objects that have simple
JSON serializations).  A MongoSerializable object is one that can
be serialized to a MongoDB database, and so this update makes everything
that is JSON-serializable (most pyGSTi objects) also MongoDB-serializable,
which is nice and cohesive.  Furthermore, even pyGSTi's large
non-NicelySerializable objects (edesigns, protocol data, results,
protocols, datasets, estimates, confidence region factories,
cached objective fns) *are* MongoSerializable, and so are brought
onto the same footing as most of the rest of pyGSTi's objects.

Object removal is more difficult under the new data model, since
objects don't have clear owners and so it's difficult to tell
whether the members/attributes of an object should always be removed
along with it.  The `RecursiveRemovalSpecification` object is meant
to give some control over this, but users should be very careful
with removing database documents.
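The care required when removing shared documents can be sketched as a reference check before deletion (plain dicts stand in for collections, and the `'children'` field name is hypothetical; with pymongo one would use `count_documents` on the parent collection instead of the Python-side scan):

```python
def safe_remove_child(parents, children, child_id):
    """Delete a child document only if no parent still references it,
    since under the new data model a document may have several parents."""
    still_referenced = any(child_id in p.get('children', ())
                           for p in parents.values())
    if still_referenced:
        return False  # shared document: leave it in place
    del children[child_id]
    return True
```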

All in all, this commit represents a significant upgrade to the
MongoDB serialization capabilities in pyGSTi, but should be tested
more before we rely on it.
Passes on Erik's laptop, but requires a local MongoDB instance running
(tests are skipped if pymongo isn't installed).
Fixes serialization code for the case when the `_basis` of an OpGaugeGroupWithBasis
object is just a string and not a `Basis` object.
Fixes a bug so that "StdOutcomeQubits" can be used in a dataset file
header, and infers outcomes from the stated "Columns" field in the header.
The latter helps read in dataset files in a more consistent manner
(i.e., with a consistent ordering of outcome labels), facilitating the
comparison of multiple datasets.
…m_mongodb

Updates these functions to reflect recent changes to the MongoDB serialization
framework (such as taking a MongoDB instance instead of a collection).  These
updates should have been added along with recent changes but were just
forgotten/missed.
Several bugs were uncovered while testing MongoDB serialization,
and are fixed here:

- within core.run_iterative_gst a separate copy of the optimizer
  object is made when running the first iteration with a different
  number of finite difference iterations (see "fditers").  Previously
  the existing optimizer was updated, which is not good because the
  optimizer object is serializable and persistent, and so is expected
  to remain unchanged by running GST.

- adds previously omitted de-serialization of a TPState's ._basis member,
  so that TPStates are not altered by saving and loading them from disk/DB.

- adds forgotten '_dbcoordinates': 'none' to Estimate class .auxtypes
  member (this recently added member should never be serialized).
Replaces an occurrence of ReplaceOne(doc_id) with ReplaceOne({'_id': doc_id}),
as is needed by the pymongo API.
TPState objects that were serialized to disk prior to the fix to the
basis serialization of TPState objects (in commit
aa8a290) could not be loaded,
because they were at least sometimes saved with an empty basis.
This commit fixes the issue by adding a HACK in tpstate.py that
checks for an empty basis and converts it to None.  This isn't a
complete patch, since the basis should really be something non-None, but
it's the best we can do given that no information about the basis was
stored.
Contributor

@sserita sserita left a comment


Everything seems good but there's a lot of code here, so I asked a couple of clarifying questions to make sure I understood the goal of the PR a little better.

Also I wasn't sure if you were ready for this to go in, since you've been pushing bugfixes.

@@ -797,8 +798,12 @@ def _max_array_types(artypes_list): # get the maximum number of each array type

for j, obj_fn_builder in enumerate(iteration_objfn_builders):
tNxt = _time.time()
optimizer.fditer = optimizer.first_fditer if (i == 0 and j == 0) else 0
opt_result, mdc_store = run_gst_fit(mdc_store, optimizer, obj_fn_builder, printer - 1)
if i == 0 and j == 0: # special case: in first optimization run, use "first_fditer"
Contributor


Is this a bugfix? Don't see how it is related to the mongo serialization.

Collaborator Author


Yes, and it's related to mongo serialization :). The "bug" was that we were modifying the optimizer object (by setting its .fditer attribute on old line 800) within this GST-running routine. This modification is unexpected by the caller, and is caught by the MongoDB serialization which includes checks to make sure serializable objects like the optimizer (Optimizer is a subclass of NicelySerializable) don't change when we don't expect them to.
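The copy-instead-of-mutate fix can be sketched like this (the `Optimizer` class below is a stand-in for pyGSTi's serializable optimizer, with `fditer`/`first_fditer` modeled on the diff above):

```python
import copy

class Optimizer:
    """Stand-in for the serializable optimizer: fditer is the number of
    finite-difference iterations; first_fditer is its value for the
    very first optimization run."""
    def __init__(self, fditer=0, first_fditer=3):
        self.fditer = fditer
        self.first_fditer = first_fditer

def run_first_iteration(optimizer):
    # The fix: work on a copy, so the caller's persistent, serializable
    # optimizer object is not silently mutated by running GST.
    opt = copy.copy(optimizer)
    opt.fditer = opt.first_fditer
    return opt
```

After running, the caller's original optimizer still serializes exactly as before.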



class NicelySerializable(object):
class NicelySerializable(MongoSerializable):
Contributor


Just to confirm the swap of inheritance order. The intention is that everything that is filesystem serializable is also mongo serializable, but there are some objects (e.g. DataSet, Estimate, Protocol) which are currently only mongo serializable (as of this PR)? I assume the goal is to enable the storing/reuse of these subcomponents, which before were only dumped as part of a larger tree structure?

Collaborator Author


Yes, this is correct, and I agree it's a bit counterintuitive at first. The thinking is that "everything can be rendered as JSON can also be serialized to a MongoDB" which implies that anything that is NicelySerializable must also be MongoSerializable. Another way of thinking about this is that something that is NicelySerializable is able to render itself as simple JSON, whereas something that is MongoSerializable is able to insert itself into a Mongo database, perhaps using multiple records. In principle we could take all the records a MongoSerializable object generates and cram them together into one big piece of JSON, but in practice there are certain more complex objects (like DataSets, Estimates, and Protocols) that naturally can save themselves as multiple records in a database but would be a bit unwieldy as a huge piece of JSON.

Fixes the implementation of the 'circuit-str-json' auxtype, which is
used to serialize fiducial pair dictionaries (in experiment designs).
The metadir implementation is fine, but when I implemented the MongoDB
version I didn't realize exactly what was being serialized: I thought
it was just a list of circuits that should be stored as a list of
JSON circuit strings, but really it can be any JSON-able structure that
may contain Circuits and no other strings, and when serialized the
Circuit objects are converted to strings.  This commit fixes that
oversight so that the 'circuit-str-json' auxtype now works properly
with MongoDB serialization.
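A sketch of the recursive conversion the fixed auxtype needs (the `Circuit` class below is a minimal stand-in for `pygsti.circuits.Circuit`, and the function is reconstructed from the description, not copied from pyGSTi):

```python
class Circuit:
    """Minimal stand-in for pygsti.circuits.Circuit."""
    def __init__(self, layers):
        self.layers = layers
    def __str__(self):
        return ''.join(self.layers)

def circuits_to_strings(obj):
    """Recursively walk an arbitrary JSON-able structure, replacing any
    Circuit objects with their string form, leaving everything else as-is."""
    if isinstance(obj, Circuit):
        return str(obj)
    if isinstance(obj, dict):
        return {k: circuits_to_strings(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [circuits_to_strings(v) for v in obj]
    return obj
```

Because the structure contains no other strings, every string in the result can be unambiguously parsed back into a Circuit on load.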

This commit also adds a backwards compatibility HACK for loading
in old objects that were written before the MongoDB serialization
code existed and thus don't have {... _dbcoordinates: 'none'} in their
auxtypes member.  The hack adds a _dbcoordinates: 'none' entry to
the auxtypes dictionaries of objects that don't have them.
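The backwards-compatibility hack amounts to defaulting a missing entry (a sketch from the description; the 'none' value means "never serialize this member"):

```python
def patch_auxtypes(auxtypes):
    """Add the '_dbcoordinates': 'none' entry to auxtypes dicts of objects
    written before the MongoDB serialization code existed, which lack it."""
    auxtypes.setdefault('_dbcoordinates', 'none')
    return auxtypes
```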
…data_id.

Adds a warning and recovery attempt for the case when a results object
is being saved, is given the ID of a data object that is supposedly
already saved, but the data object doesn't contain any ._dbcoordinates
(i.e. this attribute is None).  Previously the code would just error,
and now it attempts to save the data object.  This really shouldn't
be needed (hence the warning), but it seems like good insurance.
Replaces a call to pymongo's find with find_one and count_documents
calls, which profiling shows to be faster.  I thought that iterating
through the results of a single find operation would be faster, but
I was wrong :)
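The pattern can be sketched with a duck-typed stand-in for a pymongo collection (the `FakeCollection` class is only for illustration; the timing claim is from profiling reported in the commit, not reproduced here):

```python
class FakeCollection:
    """In-memory stand-in exposing just the two pymongo-style methods
    needed to show the existence-check-then-fetch pattern."""
    def __init__(self, docs):
        self._docs = docs
    def count_documents(self, filt):
        return sum(all(d.get(k) == v for k, v in filt.items())
                   for d in self._docs)
    def find_one(self, filt):
        for d in self._docs:
            if all(d.get(k) == v for k, v in filt.items()):
                return d
        return None

def load_if_present(collection, doc_id):
    # Check existence with count_documents, then fetch a single document
    # with find_one, instead of iterating a general find() cursor.
    if collection.count_documents({'_id': doc_id}) == 0:
        return None
    return collection.find_one({'_id': doc_id})
```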
Removes a spurious pickling of 'json' data before saving it to
the database, most likely from a copy/paste error of the
'pickle' auxtype.  Before this fix, 'json' auxtype data would not
be saved correctly, and would be unloadable (eek!).  The prominent
example of a 'json'-auxtype member is the 'idealout_lists' attribute
of an RB experiment design.
@sserita sserita merged commit a4106a7 into develop Jan 10, 2023
@sserita sserita deleted the feature-mongodb-upgrade branch January 10, 2023 18:03