The missing half of the string interning patch :-|
 * Bug fix for half-assed string interning implementation (see the first
   sketch below)
  * Cleaned up path2keys() (also to increase performance)
  * Added indices on `tree`, `strings` tables for performance
 * Bug fix for report_disk_usage()
 * Bug fix: replaced sys.exit(1) with os._exit(1) in code that runs inside FUSE
 * The preferred key/value store is now gdbm because it supports fast
   vs. synchronous modes (see the second sketch below). Existing key/value
   stores are still accessed using the library that created them, and when
   gdbm isn't available any other persistent key/value store will do fine
   (using anydbm)
 * Switched storage of hashes from hexadecimal to binary strings (this
   breaks compatibility with older databases but saves a decent amount
   of disk space)
 * Changed remaining TEXT columns to BLOB, added the required
   sqlite3.Binary() / str() conversions
 * Added rudimentary profiling of logical operations
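
To make the interning bullets concrete, here is a minimal sketch of the technique the message seems to describe: path components are stored once in a `strings` table and looked up by integer id, so `path2keys()` turns a path into a list of ids. The schema and helper names below are guesses based on the bullets above, not the actual dedupfs.py code.

```python
# Hypothetical sketch of interned path components (Python 2, like
# dedupfs.py itself); only path2keys() and the `strings` table are
# taken from the commit message, everything else is assumed.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE strings (id INTEGER PRIMARY KEY, value BLOB)')
# The commit adds indices for performance; a unique index also prevents
# the same component from being stored twice.
conn.execute('CREATE UNIQUE INDEX strings_value ON strings (value)')

def intern_string(value):
    """Return the integer id of a path component, inserting it once."""
    row = conn.execute('SELECT id FROM strings WHERE value = ?',
                       (sqlite3.Binary(value),)).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute('INSERT INTO strings (value) VALUES (?)',
                          (sqlite3.Binary(value),))
    return cursor.lastrowid

def path2keys(path):
    """Map a path like '/usr/local/bin' to its component ids."""
    return [intern_string(part) for part in path.split('/') if part]

print path2keys('/usr/local/bin')  # e.g. [1, 2, 3]
print path2keys('/usr/local/lib')  # reuses the ids of 'usr' and 'local'
```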
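The key/value store bullet also deserves a sketch: prefer gdbm in fast mode for new stores, but open existing stores with whatever library created them. This illustrates the behaviour described above using the standard Python 2 `gdbm`, `anydbm` and `whichdb` modules; `open_datastore()` is an invented name.

```python
# Illustrative only: choose a dbm flavor the way the bullet describes.
import anydbm
import hashlib
import whichdb

def open_datastore(path):  # hypothetical helper, not from dedupfs.py
    if whichdb.whichdb(path):
        # Existing stores are accessed with the library that created
        # them; anydbm dispatches on the detected format.
        return anydbm.open(path, 'w')
    try:
        import gdbm
        # 'c' creates the file, 'f' selects gdbm's fast mode (writes
        # are not synchronized to disk on every update).
        return gdbm.open(path, 'cf')
    except ImportError:
        # Without gdbm any other persistent key/value store will do.
        return anydbm.open(path, 'c')

store = open_datastore('datastore.db')
# Hashes are now stored as binary strings: sha1().digest() is 20 bytes
# where the old hexdigest() representation needed 40.
block = 'example data block'
store[hashlib.sha1(block).digest()] = block
```
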
xolox committed Jun 24, 2010
1 parent 4215513 commit 8fae294
Showing 3 changed files with 200 additions and 140 deletions.
35 changes: 13 additions & 22 deletions README.md
@@ -1,3 +1,5 @@
# DedupFS: A deduplicating FUSE file system written in Python

The Python script `dedupfs.py` implements a file system in user-space using
FUSE. It's called DedupFS because the file system's primary feature is
deduplication, which enables it to store virtually unlimited copies of files
@@ -7,17 +9,11 @@ In addition to deduplication the file system also supports transparent
compression using any of the compression methods lzo, zlib and bz2.

These two properties make the file system ideal for backups: The author
currently stores 230 GB worth of backups in two databases of 2.5 GB each.
currently stores 250 GB worth of backups using only 8 GB of disk space.

The design of DedupFS was inspired by Venti and ZFS.

**Warning:** *The latest commits have introduced a hard-to-track bug that's
probably related to string interning. After spending two days tracking down the
bug I've suspended my efforts until I can find more time :-(. Obviously I don't
suggest using the file system until I've fixed the bug!*

USAGE
=======
## Usage

To use this script on Ubuntu (where it was developed) try the following:

@@ -31,8 +27,7 @@ To use this script on Ubuntu (where it was developed) try the following:
# - ~/.dedupfs-metastore.sqlite3 contains the tree and metadata
# - ~/.dedupfs-datastore.db contains the (compressed) data blocks

STATUS
========
## Status

Development on DedupFS began as a proof-of-concept to find out how much disk
space the author could free by employing deduplication to store his daily
@@ -45,25 +40,21 @@ prove the correctness of the code (the tests are being worked on).

The file system initially stored everything in a single SQLite database, but it
turned out that after the database grew beyond 8 GB the write speed would drop
from 8-12 MB/s to 2-3 MB/s. Therefore the file system now uses a second database
to store the data blocks. Berkeley DB is used for this database because it's
meant to be used as a key/value store and doesn't require escaping binary data.
from 8-12 MB/s to 2-3 MB/s. Therefore the file system now stores its data blocks
in a separate database, which is a persistent key/value store managed by dbm.
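
To make the split concrete, here is a hedged sketch of what the write and lookup paths might look like with a SQLite metastore and a dbm datastore. The `blocks` table, helper names and block-level layout are invented for illustration; only the two-database idea comes from the text above.

```python
# Hypothetical write/read paths: metadata in SQLite, blocks in dbm.
import anydbm
import hashlib
import sqlite3
import zlib

metastore = sqlite3.connect('metastore.sqlite3')
metastore.execute('CREATE TABLE IF NOT EXISTS blocks '
                  '(inode INTEGER, seq INTEGER, hash BLOB)')
datastore = anydbm.open('datastore.db', 'c')

def write_block(inode, seq, block):
    digest = hashlib.sha1(block).digest()
    if not datastore.has_key(digest):
        # New content: compress once, store once (deduplication).
        datastore[digest] = zlib.compress(block)
    # The metastore only records which block belongs where.
    metastore.execute('INSERT INTO blocks VALUES (?, ?, ?)',
                      (inode, seq, sqlite3.Binary(digest)))

def block_digests(inode):
    """Return the binary digests of an inode's blocks, in order."""
    rows = metastore.execute('SELECT hash FROM blocks WHERE inode = ?'
                             ' ORDER BY seq', (inode,))
    # sqlite3 returns BLOBs as buffers; str() restores the raw bytes.
    return [str(row[0]) for row in rows]
```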

DEPENDENCIES
==============
## Dependencies

This script requires the Python FUSE binding in addition to several Python
standard libraries like `dbm`, `sqlite3`, `hashlib` and `cStringIO`.
standard libraries like `anydbm`, `sqlite3`, `hashlib` and `cStringIO`.

CONTACT
=========
## Contact

If you have questions, bug reports, suggestions, etc. the author can be
contacted at <peter@peterodding.com>. The latest version of DedupFS is
available at <http://peterodding.com/code/dedupfs> and <http://github.com/xolox/dedupfs>.

LICENSE
=========
## License

DedupFS is licensed under the MIT license.
Copyright 2010 Peter Odding <peter@peterodding.com>.
This software is licensed under the MIT license.
© 2010 Peter Odding <peter@peterodding.com>.
16 changes: 11 additions & 5 deletions TODO
@@ -9,17 +9,17 @@ Here are some things on my to-do list, in no particular order:
else:
flush user & meta data (file contents & attributes)

* Implement rename() independently of [un]link() to improve performance?
* Implement rename() independently of link()/unlink() to improve performance?

* Implement --verify-reads option that recalculates hashes when reading to
* Implement `--verify-reads` option (sketched after this list) that recalculates hashes when reading to
check for data block corruption?

* report_disk_usage() has become way too expensive for regular status
* `report_disk_usage()` has become way too expensive for regular status
reports because it takes more than a minute on a 7.0 GB database. The only
way it might work is if the statistics are retrieved from the database only
once and from then on kept up to date inside Python (sketched after this
list), but that seems like an
awful lot of work. For now I've just removed the call to report_disk_usage()
from print_stats() and added a --print-stats command-line option that just
awful lot of work. For now I've removed the call to `report_disk_usage()`
from `print_stats()` and added a `--print-stats` command-line option that
reports the disk usage and then exits.

* Tag databases with a version number and implement automatic upgrades because
@@ -28,3 +28,9 @@ Here are some things on my to-do list, in no particular order:
* Change the project name because `DedupFS` is already used by at least two
other projects? One is a distributed file system which shouldn't cause too
much confusion, but the other is a deduplicating file system as well :-\

* Support directory hard links without upsetting FUSE and add a command-line
option that instructs `dedupfs.py` to search for identical subdirectories
and replace them with directory hard links.

* Support files that don't fit in RAM (virtual machine disk images…)
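
For the `--verify-reads` item above, a minimal sketch of read-time verification, assuming blocks are stored compressed under their binary SHA-1 digest as elsewhere in these notes:

```python
# Hypothetical verify-on-read for the --verify-reads idea.
import hashlib
import zlib

def read_block(datastore, digest):
    block = zlib.decompress(datastore[digest])
    if hashlib.sha1(block).digest() != digest:
        # Stored data no longer matches its key: corruption.
        raise IOError('corrupt data block %s' % digest.encode('hex'))
    return block
```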
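
And for the `report_disk_usage()` item, a sketch of the "keep the statistics up to date inside Python" alternative: seed counters from the database once at mount time, then update them on every write. The class and attribute names are invented:

```python
# Hypothetical incremental statistics instead of database scans.
class Stats(object):

    def __init__(self, apparent_size=0, actual_size=0):
        # Seeded once from the database when the file system mounts.
        self.apparent_size = apparent_size  # bytes as seen by the user
        self.actual_size = actual_size      # bytes actually stored

    def on_write(self, block_size, was_duplicate):
        self.apparent_size += block_size
        if not was_duplicate:
            self.actual_size += block_size

    def dedup_ratio(self):
        if not self.actual_size:
            return 1.0
        return float(self.apparent_size) / self.actual_size
```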
