The missing half of the string interning patch :-|
 * Bug fix for half-assed string interning implementation (see the first
   sketch below)
  * Cleaned up path2keys() (also to increase performance)
  * Added indices on `tree`, `strings` tables for performance
 * Bug fix for report_disk_usage()
 * Bug fix: replaced sys.exit(1) with os._exit(1) in code that runs inside FUSE
 * The preferred key/value store is now gdbm because it supports fast
   vs. synchronous modes (see the second sketch below). Existing key/value
   stores are still accessed using the library that created them, and when
   gdbm isn't available any other persistent key/value store will do fine
   (using anydbm)
 * Switched storage of hashes from hexadecimal to binary strings (this
   breaks compatibility with older databases but saves a decent amount
   of disk space)
 * Changed remaining TEXT columns to BLOB, added the required
   sqlite3.Binary() / str() conversions
 * Added rudimentary profiling of logical operations
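
To make the interning bullets concrete, here is a minimal sketch of the technique the message seems to describe: path components are stored once in a `strings` table and looked up by integer id, so `path2keys()` turns a path into a list of ids. The schema and helper names below are guesses based on the bullets above, not the actual dedupfs.py code.

```python
# Hypothetical sketch of interned path components (Python 2, like
# dedupfs.py itself); only path2keys() and the `strings` table are
# taken from the commit message, everything else is assumed.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE strings (id INTEGER PRIMARY KEY, value BLOB)')
# The commit adds indices for performance; a unique index also prevents
# the same component from being stored twice.
conn.execute('CREATE UNIQUE INDEX strings_value ON strings (value)')

def intern_string(value):
    """Return the integer id of a path component, inserting it once."""
    row = conn.execute('SELECT id FROM strings WHERE value = ?',
                       (sqlite3.Binary(value),)).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute('INSERT INTO strings (value) VALUES (?)',
                          (sqlite3.Binary(value),))
    return cursor.lastrowid

def path2keys(path):
    """Map a path like '/usr/local/bin' to its component ids."""
    return [intern_string(part) for part in path.split('/') if part]

print path2keys('/usr/local/bin')  # e.g. [1, 2, 3]
print path2keys('/usr/local/lib')  # reuses the ids of 'usr' and 'local'
```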
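The key/value store bullet also deserves a sketch: prefer gdbm in fast mode for new stores, but open existing stores with whatever library created them. This illustrates the behaviour described above using the standard Python 2 `gdbm`, `anydbm` and `whichdb` modules; `open_datastore()` is an invented name.

```python
# Illustrative only: choose a dbm flavor the way the bullet describes.
import anydbm
import hashlib
import whichdb

def open_datastore(path):  # hypothetical helper, not from dedupfs.py
    if whichdb.whichdb(path):
        # Existing stores are accessed with the library that created
        # them; anydbm dispatches on the detected format.
        return anydbm.open(path, 'w')
    try:
        import gdbm
        # 'c' creates the file, 'f' selects gdbm's fast mode (writes
        # are not synchronized to disk on every update).
        return gdbm.open(path, 'cf')
    except ImportError:
        # Without gdbm any other persistent key/value store will do.
        return anydbm.open(path, 'c')

store = open_datastore('datastore.db')
# Hashes are now stored as binary strings: sha1().digest() is 20 bytes
# where the old hexdigest() representation needed 40.
block = 'example data block'
store[hashlib.sha1(block).digest()] = block
```
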
xolox committed Jun 24, 2010
1 parent 4215513 commit 8fae294
Showing 3 changed files with 200 additions and 140 deletions.
35 changes: 13 additions & 22 deletions README.md
@@ -1,3 +1,5 @@
# DedupFS: A deduplicating FUSE file system written in Python

The Python script `dedupfs.py` implements a file system in user-space using
FUSE. It's called DedupFS because the file system's primary feature is
deduplication, which enables it to store virtually unlimited copies of files
@@ -7,17 +9,11 @@ In addition to deduplication the file system also supports transparent
compression using any of the compression methods lzo, zlib and bz2.

These two properties make the file system ideal for backups: The author
currently stores 230 GB worth of backups in two databases of 2.5 GB each.
currently stores 250 GB worth of backups using only 8 GB of disk space.

The design of DedupFS was inspired by Venti and ZFS.

**Warning:** *The latest commits have introduced a hard-to-track bug that's
probably related to string interning. After spending two days tracking down the
bug I've suspended my efforts until I can find more time :-(. Obviously I don't
suggest using the file system until I've fixed the bug!*

USAGE
=======
## Usage

To use this script on Ubuntu (where it was developed) try the following:

@@ -31,8 +27,7 @@ To use this script on Ubuntu (where it was developed) try the following:
# - ~/.dedupfs-metastore.sqlite3 contains the tree and metadata
# - ~/.dedupfs-datastore.db contains the (compressed) data blocks

STATUS
========
## Status

Development on DedupFS began as a proof-of-concept to find out how much disk
space the author could free by employing deduplication to store his daily
@@ -45,25 +40,21 @@ prove the correctness of the code (the tests are being worked on).

The file system initially stored everything in a single SQLite database, but it
turned out that after the database grew beyond 8 GB the write speed would drop
from 8-12 MB/s to 2-3 MB/s. Therefore the file system now uses a second database
to store the data blocks. Berkeley DB is used for this database because it's
meant to be used as a key/value store and doesn't require escaping binary data.
from 8-12 MB/s to 2-3 MB/s. Therefore the file system now stores its data blocks
in a separate database, which is a persistent key/value store managed by dbm.
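
To make the split concrete, here is a hedged sketch of what the write and lookup paths might look like with a SQLite metastore and a dbm datastore. The `blocks` table, helper names and block-level layout are invented for illustration; only the two-database idea comes from the text above.

```python
# Hypothetical write/read paths: metadata in SQLite, blocks in dbm.
import anydbm
import hashlib
import sqlite3
import zlib

metastore = sqlite3.connect('metastore.sqlite3')
metastore.execute('CREATE TABLE IF NOT EXISTS blocks '
                  '(inode INTEGER, seq INTEGER, hash BLOB)')
datastore = anydbm.open('datastore.db', 'c')

def write_block(inode, seq, block):
    digest = hashlib.sha1(block).digest()
    if not datastore.has_key(digest):
        # New content: compress once, store once (deduplication).
        datastore[digest] = zlib.compress(block)
    # The metastore only records which block belongs where.
    metastore.execute('INSERT INTO blocks VALUES (?, ?, ?)',
                      (inode, seq, sqlite3.Binary(digest)))

def block_digests(inode):
    """Return the binary digests of an inode's blocks, in order."""
    rows = metastore.execute('SELECT hash FROM blocks WHERE inode = ?'
                             ' ORDER BY seq', (inode,))
    # sqlite3 returns BLOBs as buffers; str() restores the raw bytes.
    return [str(row[0]) for row in rows]
```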

DEPENDENCIES
==============
## Dependencies

This script requires the Python FUSE binding in addition to several Python
standard libraries like `dbm`, `sqlite3`, `hashlib` and `cStringIO`.
standard libraries like `anydbm`, `sqlite3`, `hashlib` and `cStringIO`.

CONTACT
=========
## Contact

If you have questions, bug reports, suggestions, etc. the author can be
contacted at <peter@peterodding.com>. The latest version of DedupFS is
available at <http://peterodding.com/code/dedupfs> and <http://github.com/xolox/dedupfs>.

LICENSE
=========
## License

DedupFS is licensed under the MIT license.
Copyright 2010 Peter Odding <peter@peterodding.com>.
This software is licensed under the MIT license.
© 2010 Peter Odding <peter@peterodding.com>.
16 changes: 11 additions & 5 deletions TODO
@@ -9,17 +9,17 @@ Here are some things on my to-do list, in no particular order:
else:
flush user & meta data (file contents & attributes)

* Implement rename() independently of [un]link() to improve performance?
* Implement rename() independently of link()/unlink() to improve performance?

* Implement --verify-reads option that recalculates hashes when reading to
* Implement `--verify-reads` option (sketched after this list) that recalculates hashes when reading to
check for data block corruption?

* report_disk_usage() has become way too expensive for regular status
* `report_disk_usage()` has become way too expensive for regular status
reports because it takes more than a minute on a 7.0 GB database. The only
way it might work is if the statistics are retrieved from the database only
once and from then on kept up to date inside Python (sketched after this
list), but that seems like an
awful lot of work. For now I've just removed the call to report_disk_usage()
from print_stats() and added a --print-stats command-line option that just
awful lot of work. For now I've removed the call to `report_disk_usage()`
from `print_stats()` and added a `--print-stats` command-line option that
reports the disk usage and then exits.

* Tag databases with a version number and implement automatic upgrades because
@@ -28,3 +28,9 @@ Here are some things on my to-do list, in no particular order:
* Change the project name because `DedupFS` is already used by at least two
other projects? One is a distributed file system which shouldn't cause too
much confusion, but the other is a deduplicating file system as well :-\

* Support directory hard links without upsetting FUSE and add a command-line
option that instructs `dedupfs.py` to search for identical subdirectories
and replace them with directory hard links.

* Support files that don't fit in RAM (virtual machine disk images…)
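
For the `--verify-reads` item above, a minimal sketch of read-time verification, assuming blocks are stored compressed under their binary SHA-1 digest as elsewhere in these notes:

```python
# Hypothetical verify-on-read for the --verify-reads idea.
import hashlib
import zlib

def read_block(datastore, digest):
    block = zlib.decompress(datastore[digest])
    if hashlib.sha1(block).digest() != digest:
        # Stored data no longer matches its key: corruption.
        raise IOError('corrupt data block %s' % digest.encode('hex'))
    return block
```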
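
And for the `report_disk_usage()` item, a sketch of the "keep the statistics up to date inside Python" alternative: seed counters from the database once at mount time, then update them on every write. The class and attribute names are invented:

```python
# Hypothetical incremental statistics instead of database scans.
class Stats(object):

    def __init__(self, apparent_size=0, actual_size=0):
        # Seeded once from the database when the file system mounts.
        self.apparent_size = apparent_size  # bytes as seen by the user
        self.actual_size = actual_size      # bytes actually stored

    def on_write(self, block_size, was_duplicate):
        self.apparent_size += block_size
        if not was_duplicate:
            self.actual_size += block_size

    def dedup_ratio(self):
        if not self.actual_size:
            return 1.0
        return float(self.apparent_size) / self.actual_size
```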
