Vermongo

Thilo Planz edited this page Feb 4, 2012 · 1 revision

Vermongo: Simple Document Versioning with MongoDB

Vermongo is a simple versioning scheme for keeping older revisions of MongoDB documents.

It is designed to be extremely simple and place no constraints on the type of documents that it can be applied to.

Vermongo's optimistic concurrency control can also be used to just detect and avoid conflicting updates, even when one does not want to keep older revisions in the database.

Overview

Vermongo is similar to Simple Document Versioning with CouchDB in that it stores the previous version of a document somewhere else before overwriting it with an update.

Placing a document under version control is very unintrusive. Vermongo only adds a version number property (an int32 called _version) to the document. It does not touch any other fields. In particular, it also does not place any requirements on the contents of the _id field.

Older revisions are stored in a separate collection that shadows the original collection. They are MongoDB documents themselves, and can be queried.

When creating, updating, or deleting documents, you just have to follow the Vermongo protocol outlined here. There will be (at least) a Java library to facilitate this. The only major downside is that documents have to be updated one-by-one (no bulk updates possible), and that there is some overhead in retrieving and copying the old version.

Query operations can just proceed as normal.

Creating a document

When inserting a new document, just add a field _version with the value 1.

This is an int32 property (Integer in Java).

Since only old revisions (and not the current one) are stored in the shadow collection, there is nothing else to do.
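
As a minimal sketch, the insert step only needs to stamp the document before it is sent to the server. The helper name withInitialVersion is an assumption made for illustration and is not part of Vermongo. Note that the legacy mongo shell stores a bare numeric literal as a double, so NumberInt(1) would be needed there to get an int32.

```javascript
// Hypothetical helper (not part of Vermongo): returns a copy of `doc`
// that carries the initial version number.
function withInitialVersion(doc) {
  if ("_version" in doc) {
    throw new Error("document is already under version control");
  }
  // New documents always start at version 1.
  return { ...doc, _version: 1 };
}

const fresh = withInitialVersion({ a: "x" });
console.log(fresh); // this object is what would be inserted into 'foo'
```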

Updating a document

You can only update a document when you know the revision number of the version that you are going to replace (your base revision). This is similar to how Subversion works, and it protects against conflicting updates: if someone else has changed the document in the meantime, you will not be allowed to overwrite it. You first have to look at the most recent version and merge it into yours.

After checking that the version you are about to overwrite is the one you thought it was, you copy it into the shadow collection (so that that revision becomes part of the version history).

After the shadow copy has been created, you proceed to update the document, increasing the version number by one. Because the document might have changed again in the meantime, this is done using the atomic "Update if Current" pattern (querying against the expected base version number again).
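
The three steps above can be sketched as follows. The sketch uses two in-memory Maps as stand-ins for 'foo' and 'foo.vermongo' so that it runs without a server. The function name vermongoUpdate is hypothetical, and step 3 only simulates what would be a single atomic conditional update on a real server. How an already existing shadow copy should be treated is not specified here, so keeping it is an assumption.

```javascript
// Serializes an _id so it can be used as a Map key in this sketch.
const keyOf = (id) => JSON.stringify(id);

// Hypothetical helper: returns true on success, false on a version conflict.
function vermongoUpdate(coll, shadow, newDoc, baseVersion) {
  const k = keyOf(newDoc._id);

  // 1. Check that the stored revision is the one the changes are based on.
  const current = coll.get(k);
  if (!current || current._version !== baseVersion) return false;

  // 2. Copy that revision into the shadow collection. If a copy already
  //    exists (for example from an interrupted attempt), keeping it is
  //    assumed to be harmless because it holds the same revision.
  const shadowId = { _id: current._id, _version: current._version };
  if (!shadow.has(keyOf(shadowId))) {
    shadow.set(keyOf(shadowId), { ...current, _id: shadowId });
  }

  // 3. "Update if current": replace only if _id AND _version still match.
  //    On a real server this is one update whose query is
  //    { _id: ..., _version: baseVersion }; here it is simulated.
  const latest = coll.get(k);
  if (!latest || latest._version !== baseVersion) return false;
  coll.set(k, { ...newDoc, _version: baseVersion + 1 });
  return true;
}

const foo = new Map();
const fooShadow = new Map();
foo.set(keyOf(1), { _id: 1, a: "x", _version: 1 });

const first = vermongoUpdate(foo, fooShadow, { _id: 1, a: "y" }, 1);
const stale = vermongoUpdate(foo, fooShadow, { _id: 1, a: "z" }, 1);
console.log(first, stale); // true false
```

The second call fails because its base revision (1) is no longer current. The caller would then re-read the document, merge, and retry with base revision 2.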

Deleting a document

When you delete a document, you store both the current version and a dummy version that works as a delete flag in the shadow collection. After that, the document is deleted.

The delete flag dummy version is useful because it can be queried for, and it can also contain metadata (such as who deleted the document, and when).
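
A rough sketch of the delete step, again with in-memory Maps standing in for the two collections. The function name vermongoDelete and the metadata field deletedBy are made up for illustration; the marker layout follows the example shown later on this page. The sketch does not try to clean up the shadow entries if the final remove is rejected.

```javascript
// Serializes an _id so it can be used as a Map key in this sketch.
function idKey(id) {
  return JSON.stringify(id);
}

// Hypothetical helper: archives the document, writes a delete marker and
// removes the document. Returns false on a version conflict.
// `meta` holds optional marker fields and must not contain _id or _version.
function vermongoDelete(coll, shadow, id, baseVersion, meta = {}) {
  const current = coll.get(idKey(id));
  if (!current || current._version !== baseVersion) return false;

  // 1. Archive the last live revision.
  const lastId = { _id: id, _version: baseVersion };
  shadow.set(idKey(lastId), { ...current, _id: lastId });

  // 2. Add the delete marker under the next version number.
  const markerVersion = baseVersion + 1;
  const markerId = { _id: id, _version: markerVersion };
  shadow.set(idKey(markerId), {
    _id: markerId,
    _version: "deleted:" + markerVersion,
    ...meta,
  });

  // 3. Remove the live document, but only if it is still at baseVersion
  //    (on a real server: one remove whose query includes _version).
  const latest = coll.get(idKey(id));
  if (!latest || latest._version !== baseVersion) return false;
  return coll.delete(idKey(id));
}

const docs = new Map([[idKey(7), { _id: 7, a: "z", _version: 3 }]]);
const history = new Map();
const deleted = vermongoDelete(docs, history, 7, 3, { deletedBy: "alice" });
console.log(deleted, [...history.values()]);
```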

Structure of the shadow collection

For every collection "foo" there is a shadow collection "foo.vermongo", where old versions are stored.

The documents in the shadow collection are identical to the original document, except that their _id field is changed to include both the original _id and the _version number. This is done in a way that allows range queries against the shadow collection's _id index to retrieve all revisions for a given document, and also to retrieve a sorted range of versions within that document.
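
The mapping from a live document to its shadow copy can be sketched in a few lines. Both helper names are hypothetical, and a plain string is used in place of an ObjectId to keep the sketch self-contained.

```javascript
// Hypothetical helper: name of the shadow collection for a collection.
const shadowCollectionName = (name) => name + ".vermongo";

// Hypothetical helper: builds the shadow-collection form of a document.
function toShadowCopy(doc) {
  // Field order matters: the original _id first, then _version, so that
  // all revisions of one document sort next to each other.
  const shadowId = { _id: doc._id, _version: doc._version };
  // Everything else is kept as it was; only _id is replaced.
  return { ...doc, _id: shadowId };
}

const copy = toShadowCopy({ _id: "doc-1", a: "y", _version: 2 });
console.log(shadowCollectionName("foo")); // foo.vermongo
console.log(JSON.stringify(copy));
// {"_id":{"_id":"doc-1","_version":2},"a":"y","_version":2}
```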

Example

After

  • inserting into collection 'foo' a document { a: "x" },
  • updating it to { a: "y" },
  • then to { a: "z" },
  • and finally deleting it,

the shadow collection will contain the following (the original collection will not contain the document anymore, as it has been deleted):

> db.foo.vermongo.find()
{ "_id" : { "_id" : ObjectId("4c78da..."), "_version" : 1 }, "a" : "x", "_version" : 1 }
{ "_id" : { "_id" : ObjectId("4c78da..."), "_version" : 2 }, "a" : "y", "_version" : 2 }
{ "_id" : { "_id" : ObjectId("4c78da..."), "_version" : 3 }, "a" : "z", "_version" : 3 }
{ "_id" : { "_id" : ObjectId("4c78da..."), "_version" : 4 }, "_version" : "deleted:4" }

The most interesting aspect here is probably the structure of the _id field of the shadow copies: it is itself an embedded document with two fields, the original document's _id and the _version. Because of the way BSON documents are compared (field by field, in order), you can do range queries against these ids, similar to what you could do with a compound index (it is essential that _id comes before _version):

// get just revisions 2 to 3 for the document
> db.foo.vermongo.find( { _id :  {
   "$gt" :  { "_id" : ObjectId("4c78da1..."), "_version" : 1 }   ,
   "$lt" :  { "_id" : ObjectId("4c78da1..."), "_version" : 4 }  } } ) 

{ "_id" : { "_id" : ObjectId("4c78da1..."), "_version" : 2 }, "a" : "y", "_version" : 2 }
{ "_id" : { "_id" : ObjectId("4c78da1..."), "_version" : 3 }, "a" : "z", "_version" : 3 }
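
The same kind of filter can be built programmatically. The sketch below uses inclusive bounds ($gte/$lte) instead of the exclusive ones above, and checks the result against an in-memory array rather than a server. All function names are hypothetical, ids are plain strings instead of ObjectIds, and compareShadowIds is a deliberately simplified ordering that only covers this id shape.

```javascript
// Hypothetical helper: filter for revisions first..last (inclusive) of one
// document, suitable for passing to find() on the shadow collection.
function revisionRangeFilter(docId, first, last) {
  return {
    _id: {
      $gte: { _id: docId, _version: first },
      $lte: { _id: docId, _version: last },
    },
  };
}

// Simplified ordering of shadow ids for this sketch only: original _id
// first, then _version. Assumes string ids and numeric versions.
function compareShadowIds(x, y) {
  if (x._id !== y._id) return x._id < y._id ? -1 : 1;
  return x._version - y._version;
}

// Applies the filter to one document, standing in for the server.
function matches(filter, doc) {
  const { $gte, $lte } = filter._id;
  return compareShadowIds(doc._id, $gte) >= 0 &&
         compareShadowIds(doc._id, $lte) <= 0;
}

const shadowDocs = [
  { _id: { _id: "A", _version: 1 }, a: "x", _version: 1 },
  { _id: { _id: "A", _version: 2 }, a: "y", _version: 2 },
  { _id: { _id: "A", _version: 3 }, a: "z", _version: 3 },
  { _id: { _id: "B", _version: 2 }, a: "other", _version: 2 },
];
const filter = revisionRangeFilter("A", 2, 3);
const found = shadowDocs.filter((d) => matches(filter, d)).map((d) => d.a);
console.log(found); // [ 'y', 'z' ]
```

Revisions of document "B" are excluded even though their version number falls inside the range, because the original _id is compared before _version.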

Performance Considerations

Query performance (current versions)

Since the old versions of the documents are stored in a separate collection, and no changes are made to the original document (other than adding the small _version field) or its indexes, using Vermongo has no impact on the performance of query operations.

Update performance

Since every update needs to copy the revision it is going to overwrite, updates take longer: roughly twice as long, as an unmeasured estimate.

Bulk updates are no longer possible, and neither are asynchronous updates (those that do not wait for confirmation from the server).

Query performance (historical documents)

With the default index (on the _id field), the shadow collection can be queried efficiently (via an index range scan) for a range of revisions of any given document, ordered by version number.

Since the old revisions have the same structure as the original document, they can also be queried for application-specific data. This should be efficient if you are filtering within the revisions of a given document (as the _id index can still be used). If queries across documents need to be done, you can add appropriate indexes to the shadow collection, just like you would to the original collection.

Storage space requirement

Since the documents are stored uncompressed, every revision takes up as much space as the original document.

Delta compression would be an option, but it would make it hard to query the revision history.

There are plans to add document-level compression to MongoDB.

Note regarding storage space requirements for V7Files

V7Files uses Vermongo to keep track of file and folder metadata. Storing the version history of these small documents should not take an unreasonable amount of space, even without any compression. The contents of the files themselves are handled using a different scheme.