This repository has been archived by the owner on Oct 15, 2022. It is now read-only.

StorageFormat

Thilo Planz edited this page Jan 30, 2012 · 8 revisions

Internal Storage Format

All file contents are stored in GridFS, but the file (and folder) metadata is stored in a separate collection, where it can also be versioned.

File and folder metadata

Every file and folder is represented by a MongoDB document in the v7files collection. It has the following fields:

  • _id: an id which uniquely identifies the file, even if it moves around in the filesystem or changes its name. This is a randomly assigned ObjectId, except for the "root" folders, which are identified by Strings chosen by the user and mapped to URL endpoints.
  • _version: an integer, starting at 1 and incrementing with every update to the file
  • parent: the _id of the parent folder
  • acl: a nested object containing access control lists
  • filename: the name of the file. This becomes a URL component for WebDAV.
  • length: the length of the file in bytes. Missing in the case of a folder (or inline storage)
  • sha: a byte array with the SHA-1 hash of the file's contents. This is used to link the file to its contents, which are stored in GridFS. Missing in the case of a folder (or inline storage)
  • in: extremely small files (smaller than their own SHA-1 hash) can be stored inline (as a byte array). In this case, the sha and length fields will be missing
  • contentType: the content type of the file
  • created_at: the creation date of the file
  • updated_at: the creation date of the current revision of the file, missing for the first version
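
The field rules above can be illustrated with a minimal sketch in Python. It is not the v7files implementation: the function name and the placeholder ids are hypothetical, plain strings stand in for ObjectIds, and the acl field is left out because its structure is not described here. The sketch only shows which fields appear for a regular file versus an inlined one.

```python
import hashlib
from datetime import datetime, timezone

SHA1_LENGTH = 20  # a SHA-1 digest is always 20 bytes long


def build_file_metadata(file_id, parent_id, filename, data, content_type):
    """Build a first-version metadata document for a file (illustrative only)."""
    doc = {
        "_id": file_id,          # placeholder; a real document would use an ObjectId
        "_version": 1,           # versions start at 1
        "parent": parent_id,
        "filename": filename,
        "contentType": content_type,
        "created_at": datetime.now(timezone.utc),
        # updated_at is deliberately absent: it is missing for the first version
    }
    if len(data) < SHA1_LENGTH:
        # Smaller than its own SHA-1 hash: store inline, without sha and length
        doc["in"] = data
    else:
        # Regular file: link to the GridFS contents through the SHA-1 digest
        doc["length"] = len(data)
        doc["sha"] = hashlib.sha1(data).digest()
    return doc


small = build_file_metadata("id-1", "root", "hello.txt", b"hi", "text/plain")
large = build_file_metadata("id-2", "root", "notes.txt", b"x" * 100, "text/plain")
```

Here small carries an in field and no sha or length, while large carries sha and length and no in field.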

Version history

When a file is modified, the _version field in the v7files collection is incremented by one, and the previous revision is moved to a shadow collection that tracks version history, called v7files.vermongo. Deleted files are also stored there. The "main" collection only contains the current versions of all files.
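
As a rough sketch of this flow, the following Python uses an in-memory dict and list in place of the two MongoDB collections. The function names are hypothetical, and how shadow documents are keyed in v7files.vermongo is not described on this page, so the list is purely an assumption made for illustration.

```python
import copy

current = {}  # stands in for the v7files collection, keyed by _id
shadow = []   # stands in for v7files.vermongo (keying scheme is an assumption)


def update_file(file_id, changes):
    """Apply changes to a file, archiving the previous revision first."""
    old = current[file_id]
    shadow.append(copy.deepcopy(old))  # previous revision moves to the shadow
    new = {**old, **changes, "_version": old["_version"] + 1}
    current[file_id] = new             # main collection keeps only the newest
    return new


def delete_file(file_id):
    """Remove a file from the main collection; its last revision goes to the shadow."""
    shadow.append(current.pop(file_id))


current["f1"] = {"_id": "f1", "_version": 1, "filename": "a.txt"}
update_file("f1", {"filename": "b.txt"})
```

After the update, current holds version 2 of the file and shadow holds version 1.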

File contents in GridFS

The v7files collection (and its shadow collection) stores only the file metadata. File contents are stored in GridFS, keyed by the SHA-1 hash of that data. This is a regular GridFS bucket called v7.fs (so its collections are v7.fs.files and v7.fs.chunks). The _id for these GridFS files is the binary SHA-1 hash (a byte array).

Because of this arrangement, renaming or duplicating a file (without changing its contents) will not take up additional storage.
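
A small Python sketch can make the deduplication concrete. A dict stands in for the v7.fs bucket, and store_content is a hypothetical helper, not part of v7files. The point is only that two metadata documents with identical contents share a single stored entry.

```python
import hashlib

contents = {}  # stands in for the v7.fs bucket, keyed by the binary SHA-1 digest


def store_content(data):
    """Store data under its SHA-1 digest; identical data is stored only once."""
    sha = hashlib.sha1(data).digest()
    contents.setdefault(sha, data)  # no-op if this digest is already present
    return sha


payload = b"quarterly numbers\n" * 10

original = {"_id": "file-1", "filename": "report.txt", "length": len(payload)}
original["sha"] = store_content(payload)

# Duplicating the file creates new metadata but re-uses the stored contents
duplicate = {**original, "_id": "file-2", "filename": "report-copy.txt"}
duplicate["sha"] = store_content(payload)
```

Both metadata documents point at the same digest, and contents still has exactly one entry.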

On the other hand, if you are not interested in retaining the complete file change history, you will need to eventually "garbage-collect" content that is no longer referenced.
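
One possible shape for such a collector is sketched below in Python. It relies on the refs and refHistory fields described further down this page, uses a dict in place of the GridFS bucket, and the function names are hypothetical. Whether v7files itself collects garbage this way is an assumption.

```python
SHA_A = b"\x01" * 20  # placeholder digests, not real SHA-1 values
SHA_B = b"\x02" * 20

contents = {
    SHA_A: {"refs": ["file-1"], "refHistory": ["file-1"]},
    SHA_B: {"refs": ["file-2", "file-3"], "refHistory": ["file-2", "file-3"]},
}


def release_content(contents, sha, file_id):
    """Drop a file's backlink from refs; refHistory keeps its copy."""
    entry = contents[sha]
    if file_id in entry["refs"]:
        entry["refs"].remove(file_id)


def collect_garbage(contents):
    """Delete every content entry that no current file references."""
    unreferenced = [sha for sha, entry in contents.items() if not entry["refs"]]
    for sha in unreferenced:
        del contents[sha]
    return unreferenced


release_content(contents, SHA_A, "file-1")               # file-1 was deleted
history_after_release = list(contents[SHA_A]["refHistory"])
removed = collect_garbage(contents)
```

Note that collecting on empty refs also discards contents that only older revisions in the version history still point to, which is why this only makes sense when the full change history is not needed.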

In addition to the fields defined by GridFS, v7files stores the following metadata on the GridFS documents:

  • refs: for garbage collection purposes, the content has a backlink to every file that currently uses it.
  • refHistory: contains all current and previous entries of refs. When a file is deleted, the contents remain in GridFS, but its backlink is removed from refs. It is kept as a copy in refHistory, however.
  • store: to support compression, the raw GridFS contents may be different from the file it represents. If this field is present (and not set to raw), a compression scheme is being used. For example, a value of z means that the GridFS holds the zlib-deflated contents of the file.
  • in: optionally the contents of small inlined files, see compression
  • alt: optionally references to out-of-band files, see compression
  • _id: the GridFS file id is the SHA-1 digest of the original data, and does not change because of compression.
  • length: because of compression, the length field describes the length of the GridFS chunk contents, which may be different from the length of the actual file (length in v7files).
  • filename: the contents can be re-used by many files (with different names). The filename field contains the name of the first file. It is not used by v7files and is retained for informational purposes only.
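
The interplay of _id, store and length under compression can be sketched in Python as follows. The helper names are hypothetical, zlib.compress is assumed to match what "zlib-deflated" means here, and the rule of compressing only when the result is smaller is an assumption made for the example, not something this page specifies.

```python
import hashlib
import zlib


def make_gridfs_entry(data, filename):
    """Return (metadata, stored_bytes) for some file contents."""
    sha = hashlib.sha1(data).digest()  # _id comes from the original bytes
    packed = zlib.compress(data)
    if len(packed) < len(data):
        # length describes the stored (compressed) bytes, not the original file
        meta = {"_id": sha, "store": "z", "length": len(packed), "filename": filename}
        return meta, packed
    # store is left out, which means the contents are stored raw
    return {"_id": sha, "length": len(data), "filename": filename}, data


def read_content(meta, stored_bytes):
    """Recover the original file contents from the stored bytes."""
    scheme = meta.get("store", "raw")
    if scheme == "raw":
        return stored_bytes
    if scheme == "z":
        return zlib.decompress(stored_bytes)
    raise ValueError("unsupported store scheme: " + scheme)


data = b"abc" * 1000
meta, stored = make_gridfs_entry(data, "letters.txt")
tiny_meta, tiny_stored = make_gridfs_entry(b"xy", "tiny.txt")
```

For the repetitive data, meta has store set to z and a length smaller than the 3000 original bytes, while the _id is still the SHA-1 of the uncompressed data. The two-byte example does not shrink under zlib, so it is stored raw.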