This repository has been archived by the owner on Oct 15, 2022. It is now read-only.

StorageFormat

Thilo Planz edited this page Jun 7, 2012 · 8 revisions

Internal Storage Format

Contents Storage and File Collections

v7files stores the contents of files separately from their metadata. This allows for de-duplication: the contents of two identical files need to be stored only once.

Shared contents collection

v7files provides a number of virtual file systems with different features, such as folder hierarchies, access control, or version tracking. Each file system implementation stores the metadata that makes up its files in collections under its own control (applications using v7files can use additional collections as well), but all contents are stored in the shared collection v7files.content.

To facilitate garbage collection of contents that are no longer used anywhere, backreferences from contents to files are kept in the collection v7files.refs.

Hierarchical file system

A hierarchical file system with support for versioning and access control that can be exposed via WebDAV is maintained in the v7files.files collection.

Buckets

Buckets are kept in the v7files.buckets collection. They have no folder hierarchy, only manual reference tracking, and no version history. See WebAppIntegration.

Contents Storage details

Every document in the v7files.content collection describes a piece of content, i.e. a sequence of bytes. It is uniquely identified by the SHA-1 digest for that sequence of bytes. This digest is used as the Mongo document _id (as twenty bytes of binary data).

Direct storage

The simplest way to store contents is by just putting them into a byte array in the content document:

{ _id: <sha1>, in: <bytes> }

The field is called in for "inline".
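
As a minimal sketch of direct storage and the de-duplication it enables, the following Python builds such documents in memory. The function name `direct_document` and the `contents` dictionary (a stand-in for the v7files.content collection) are assumptions made for illustration, not part of v7files.

```python
import hashlib

def direct_document(data: bytes) -> dict:
    """Build a direct-storage content document for `data` (hypothetical helper)."""
    # The _id is the raw 20-byte SHA-1 digest of the content itself.
    return {"_id": hashlib.sha1(data).digest(), "in": data}

# In-memory stand-in for the v7files.content collection, keyed by _id.
contents = {}
for payload in (b"hello world", b"hello world", b"something else"):
    doc = direct_document(payload)
    # Identical payloads yield the same _id, so they occupy a single entry.
    contents[doc["_id"]] = doc
```

Three payloads go in, but only two documents are kept, because the two identical payloads share one digest.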

Compressed storage

Some data can be more efficiently stored using gzip compression.

{ _id: <sha1>, store: 'gz', zin: <gzipped bytes>, length: 1234 }

The store field denotes the storage scheme, and zin stands for "zipped inline". Note that the SHA-1 is the digest of the uncompressed original data, and that the length field also indicates the uncompressed length.
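
The relationship between the digest, the length and the compressed bytes can be sketched as follows. The helper names `gzip_document` and `read_gzip_document` are hypothetical, and the verification step on reading is an assumption about what a careful reader of such a document might do, not a description of v7files itself.

```python
import gzip
import hashlib

def gzip_document(data: bytes) -> dict:
    """Build a 'gz' content document (hypothetical helper)."""
    return {
        "_id": hashlib.sha1(data).digest(),  # digest of the UNCOMPRESSED data
        "store": "gz",
        "zin": gzip.compress(data),          # "zipped inline"
        "length": len(data),                 # uncompressed length
    }

def read_gzip_document(doc: dict) -> bytes:
    """Decompress a 'gz' document and check it against its length and _id."""
    data = gzip.decompress(doc["zin"])
    if len(data) != doc["length"]:
        raise ValueError("uncompressed length does not match the length field")
    if hashlib.sha1(data).digest() != doc["_id"]:
        raise ValueError("uncompressed data does not match the SHA-1 _id")
    return data
```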

Concatenated storage

A content document can also pull in the contents from other documents and concatenate them. This allows for efficient differential storage of similar contents, as well as storing contents too large to fit into a single Mongo document.

{ _id: <sha1>,
  store: 'cat',
  base: [
    { sha: <sha1 of some content>, length: 1234 },
    { sha: <sha1 of some content>, length: 1234 }
  ]
}

store: 'cat' means that this content is a concatenation of (parts of) other content.

The base field is an array with the chunks to be concatenated. Each chunk can be a (presumably short) piece of new inline data (given as a byte array), or a reference to contents stored elsewhere. If the contents are stored elsewhere, an embedded document that we call a "content pointer" is given. There are a couple of options here, but it must contain at least the sha field that will be used to look up the data, and the length field (necessary to calculate the length of the combined piece).
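
To illustrate how a reader might assemble concatenated content, here is a sketch that resolves a 'cat' document against an in-memory dictionary standing in for v7files.content. The names `sha1`, `read_content` and `contents` are assumptions for this example. It handles only direct and concatenated storage, and it assumes that each pointer's length does not exceed the referenced content.

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def read_content(contents: dict, sha: bytes) -> bytes:
    """Return the bytes for `sha`, following 'cat' documents recursively."""
    doc = contents[sha]
    store = doc.get("store")
    if store is None:
        return doc["in"]  # direct storage
    if store != "cat":
        raise NotImplementedError(f"storage scheme {store!r} is not covered here")
    parts = []
    for chunk in doc["base"]:
        if isinstance(chunk, bytes):
            parts.append(chunk)  # a piece of new inline data
        else:
            # Content pointer: look up the referenced content by its sha
            # and keep the first `length` bytes (simplifying assumption).
            referenced = read_content(contents, chunk["sha"])
            parts.append(referenced[: chunk["length"]])
    return b"".join(parts)

# In-memory stand-in for the v7files.content collection, keyed by _id.
greeting, name = b"Hello, ", b"world"
whole = greeting + name + b"!"
contents = {
    sha1(greeting): {"_id": sha1(greeting), "in": greeting},
    sha1(name): {"_id": sha1(name), "in": name},
    sha1(whole): {
        "_id": sha1(whole),
        "store": "cat",
        "base": [
            {"sha": sha1(greeting), "length": len(greeting)},
            {"sha": sha1(name), "length": len(name)},
            b"!",  # short inline chunk
        ],
    },
}
assembled = read_content(contents, sha1(whole))
```

Note that the _id of the concatenated document is the digest of the assembled bytes, so the result can be verified after assembly.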

Referring to stored content

Content pointers

The v7files.content collection described above stores contents, but you also need some metadata to make a complete "file". There are many kinds of metadata and many places where it can be kept, but all of them use the same simple schema to interact with v7files: you refer to the contents by their SHA-1 digest, which you store in a MongoDB document (top-level or nested) in a field called sha, either as 20 bytes of binary data or as a 40-character hex-encoded String. Since this is wasteful for very short files, you are encouraged to inline short files into a byte array field called in instead. You will probably also want filename, contentType and length fields.
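
Because the sha field may hold either representation, code that reads content pointers has to accept both. A minimal sketch, assuming a hypothetical helper named `normalize_sha` that is not part of v7files:

```python
import hashlib

def normalize_sha(sha) -> bytes:
    """Return the 20-byte digest for a sha given as binary or as 40 hex characters."""
    if isinstance(sha, str):
        if len(sha) != 40:
            raise ValueError("a hex-encoded sha must be 40 characters long")
        sha = bytes.fromhex(sha)  # raises ValueError on non-hex input
    if not isinstance(sha, (bytes, bytearray)):
        raise TypeError("sha must be binary data or a hex-encoded string")
    if len(sha) != 20:
        raise ValueError("a binary sha must be 20 bytes long")
    return bytes(sha)

digest = hashlib.sha1(b"hello").digest()
binary_pointer = {"sha": digest, "filename": "hello.txt", "length": 5}
hex_pointer = {"sha": digest.hex(), "filename": "hello.txt", "length": 5}
same_content = normalize_sha(binary_pointer["sha"]) == normalize_sha(hex_pointer["sha"])
```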

Note that the length given in a content pointer can differ from the length of the referenced content: if it is smaller, the content is truncated; if it is larger, the content is repeated. There can also be an offset (off) to start the concatenation in the middle of the chunk.
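
The truncation and repetition rules can be sketched as below. The function name `apply_pointer` is hypothetical. How off interacts with repetition is not spelled out here, so this sketch assumes that the bytes from the offset onward are what gets repeated; treat that as a guess rather than documented behavior.

```python
def apply_pointer(data: bytes, length: int, off: int = 0) -> bytes:
    """Return `length` bytes taken from `data`, starting at offset `off`."""
    if length == 0:
        return b""
    tail = data[off:]  # ASSUMPTION: only the part after the offset is repeated
    if not tail:
        raise ValueError("offset lies beyond the end of the referenced content")
    repeats = -(-length // len(tail))  # ceiling division
    return (tail * repeats)[:length]

truncated = apply_pointer(b"abcdef", 3)        # length smaller than the content
repeated = apply_pointer(b"ab", 5)             # length larger than the content
from_middle = apply_pointer(b"abcdef", 2, 3)   # start at offset 3
```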

Examples:

// You might want to embed this in your Mongo-based application
{ sha : <sha1>, 
  filename: 'hello.txt', 
  length: 1993, 
  contentType: 'text/plain' }

// or this, if the file is short
{ in: <bytes>, 
  filename: 'short.txt', 
  contentType: 'text/plain' }

// content pointers are also used internally for chunked documents
// (see above for details)
{ _id: <sha1>,
  store: 'cat',
  base: [
    { sha: <sha1 of some content>, length: 1234 },
    { sha: <sha1 of some content>, length: 1234 }
  ]
}

// and a file in v7files' WebDAV looks like this
// (see below for details)
{
  _id : <ObjectId>,
  _version: 3,
  parent: <ObjectId>,
  acl: {},
  filename: 'a.txt',
  length: 123,
  sha: <sha1>,
  contentType: 'text/plain',
  createdAt: <Timestamp>,
  updatedAt: <Timestamp>
}

Reference tracking

The downside of sharing contents between multiple files is that when you delete a file, you cannot just delete the contents. This can only be done once all files that (directly or indirectly) refer to the content have been deleted.

Every file naturally has a reference to its contents, and v7files also keeps track of the reverse links. In the v7files.refs collection there is a document for every piece of content in v7files.content (with the same _id) that has the following fields:

  • refs: an array with the _id of every file that currently uses it.
  • refHistory: contains all current and previous entries of refs. When a file is deleted, the contents remain in v7files.content, but the file's backlink is removed from refs. A copy of it is kept in refHistory, however.
  • refBase: in addition to files referencing content, content can also be referenced by other content (when using out-of-band "alt" storage, such as concatenation). Those references (the SHA-1 _id of the referring content) are tracked here.

Conceptually, the references are part of the contents document, but keeping them in a separate collection makes it easier to update them (without having to touch the usually very large content documents).
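
The bookkeeping described above can be sketched with an in-memory dictionary standing in for v7files.refs. The function names are hypothetical, and the final check assumes that refBase lists only content that still exists; it expresses a necessary condition for garbage collection, not the actual collection procedure.

```python
def add_ref(refs: dict, sha: bytes, file_id) -> None:
    """Record that the file `file_id` now uses the content `sha`."""
    entry = refs.setdefault(
        sha, {"_id": sha, "refs": [], "refHistory": [], "refBase": []}
    )
    if file_id not in entry["refs"]:
        entry["refs"].append(file_id)
    if file_id not in entry["refHistory"]:
        entry["refHistory"].append(file_id)

def remove_ref(refs: dict, sha: bytes, file_id) -> None:
    """Drop the backlink of a deleted file; refHistory keeps its copy."""
    entry = refs[sha]
    if file_id in entry["refs"]:
        entry["refs"].remove(file_id)

def is_unreferenced(refs: dict, sha: bytes) -> bool:
    """True if no file and no other content currently refers to `sha`."""
    entry = refs.get(sha)
    return entry is None or (not entry["refs"] and not entry["refBase"])
```

For example, after two files have been linked to the same content and one of them is deleted, refs holds one _id while refHistory still holds both.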

File and folder metadata

Every file and folder is represented by a MongoDB document in the v7files.files collection. It has the following fields:

  • _id: an id which uniquely identifies the file, even if it moves around in the file system or changes its name. This is a randomly assigned ObjectId, except for the "root" folders, which are identified by Strings chosen by the user and mapped to URL endpoints.
  • _version: an integer, starting at 1 and incrementing with every update to the file
  • parent: the _id of the parent folder
  • acl: a nested object containing access control lists
  • filename: the name of the file. This becomes a URL component for WebDAV.
  • length: the length of the file in bytes. Missing in the case of a folder (or inline storage)
  • sha: a byte array with the SHA-1 hash of the file's contents. This is used to link the file to its contents, which are stored in v7files.content. Missing in the case of a folder (or inline storage)
  • in: extremely small files (smaller than their own SHA-1 hash) can be stored inline (as a byte array). In this case, the sha and length fields will be missing
  • contentType: the content type of the file
  • created_at: the creation date of the file
  • updated_at: the creation date of the current revision of the file, missing for the first version
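
The choice between the in field and the sha and length fields can be sketched as follows. The helper names are hypothetical, plain values stand in for ObjectIds, and the timestamp fields are left out for brevity.

```python
import hashlib

SHA1_LENGTH = 20  # a SHA-1 digest is 20 bytes long

def content_fields(data: bytes) -> dict:
    """Inline contents shorter than their own SHA-1 hash, else point to them."""
    if len(data) < SHA1_LENGTH:
        return {"in": data}
    return {"sha": hashlib.sha1(data).digest(), "length": len(data)}

def file_document(file_id, parent_id, filename: str, content_type: str, data: bytes) -> dict:
    """Build the first version of a file metadata document (illustrative only)."""
    doc = {
        "_id": file_id,
        "_version": 1,  # versions start at 1
        "parent": parent_id,
        "acl": {},
        "filename": filename,
        "contentType": content_type,
    }
    doc.update(content_fields(data))
    return doc
```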

Version history

When a file is modified, the _version field in the v7files.files collection is incremented by one, and the previous revision is moved to a shadow collection that tracks version history, called v7files.files.vermongo. Deleted files are also stored there. The "main" collection only contains the current versions of all files. See Versioning for details.
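
The update flow can be sketched with a dictionary standing in for v7files.files and a list standing in for the shadow collection. The function names are hypothetical, and the layout of the shadow documents is simplified to a plain copy here; the actual format is the one described on the Versioning page.

```python
import copy

def update_file(files: dict, history: list, file_id, changes: dict) -> dict:
    """Apply `changes` to a file, archiving the previous revision first."""
    current = files[file_id]
    history.append(copy.deepcopy(current))  # previous revision goes to the shadow store
    updated = {**current, **changes, "_version": current["_version"] + 1}
    files[file_id] = updated  # the main collection holds only the current version
    return updated

def delete_file(files: dict, history: list, file_id) -> None:
    """Remove a file from the main collection, keeping it in the history."""
    history.append(files.pop(file_id))
```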