Commits on Jul 22, 2010
Commits on Jun 24, 2010
  1. Added more automated tests

     * Made the "write arbitrary data" tests use smaller
       files to make the test script faster.
     * Added test numbers to text folds.
     * Verify that written data persists when remounted.
     * Verify that garbage collection of unused data blocks works.
     * Verify that garbage collection of interned path segments works.
  2. Added links to similar projects

     * I started developing DedupFS because I was trying and failing to get
       lessfs working and decided that it would be more educational and less
       frustrating to write my own file system using the Python FUSE binding
       (I was right ;-)
     * I only learned about ArchiveFS a week ago so I haven't had the time to
       make a proper assessment, but it seems to me the projects share very
       similar goals yet take quite different approaches.
  3. The missing half of the string interning patch :-|

     * Bug fix for half-assed string interning implementation
      * Cleaned up path2keys() (also to increase performance)
      * Added indices on `tree`, `strings` tables for performance
     * Bug fix for report_disk_usage()
     * Bug fix: Replaced sys.exit(1) with os._exit(1) where `inside FUSE'
     * The preferred key/value store is now gdbm because it supports both
       fast and synchronous modes. Existing key/value stores are still
       opened with the library that created them, and when gdbm isn't
       available any other persistent key/value store will do (via anydbm)
     * Switched storage of hashes from hexadecimal to binary strings (this
       breaks compatibility with older databases but saves a decent amount
       of disk space)
     * Changed remaining TEXT to BLOB, added sqlite3.Binary() / str() calls
     * Added rudimentary profiling of logical operations
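
To illustrate two of the storage changes in commit 3 above (binary instead of hexadecimal hashes, and a generic persistent key/value store), here is a minimal sketch in Python 3, where `anydbm` is available as the `dbm` module. The SHA-1 hash, the block contents and the `store_block()` helper are assumptions made for the example; this is not DedupFS's actual code.

```python
import dbm
import hashlib
import os
import tempfile

def store_block(db, block):
    """Store a data block under its binary digest (hypothetical helper)."""
    # Assumption: SHA-1. The raw digest is 20 bytes, the hex form 40 characters.
    digest = hashlib.sha1(block).digest()
    if digest not in db:
        db[digest] = block  # identical blocks are stored only once
    return digest

directory = tempfile.mkdtemp()
path = os.path.join(directory, "datastore.db")

blocks = [b"A" * 1024, b"B" * 1024, b"A" * 1024]
# dbm.open() picks whichever backend is available (gdbm, ndbm, ...).
with dbm.open(path, "c") as db:
    digests = [store_block(db, block) for block in blocks]
    restored = b"".join(db[digest] for digest in digests)

unique_blocks = len(set(digests))
```

Here three blocks are written but only two distinct keys end up in the store, and each key is half the size of its hexadecimal form.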
Commits on May 31, 2010
Commits on May 29, 2010
  1. Made test script more human friendly

     * Stop after first failed test to make debugging easier
     * Use bold font for messages from test script so they can be easily
       distinguished from the file system logging output.
     * Color error messages red.
  2. Improve performance by using string interning and dropping indices

    After storing a few dozen full system backups on the file system things started
    to get slow again. By then the `datastore.db` file was about 2.9 GB and the
    `metastore.sqlite3` file was about that big as well but growing at a rapid
    rate... After some inspection I found two reasons it was growing so quickly:
     1. A while ago I added several indices to the SQLite3 database hoping to
        improve performance, but this was back when the data blocks were still
        stored using SQLite. Then I switched to storing the data blocks in
        Berkeley DB but I neglected to remove the SQLite database indices.
        This made the indices dead weight which is why I've removed them.
     2. My file system contained more than 500 MB of path segment strings while
        only about 5 MB was unique! Therefore I've added string interning of path
        segments which makes the code a bit more complicated but can save
        considerable disk space. I guess I might also have fixed a bug by
        switching the path segment string type from TEXT to BLOB?
    With these changes my `metastore.sqlite3` file shrank to 467 MB which makes
    SQLite less of a bottleneck for the file system, at least until I start using
    it more intensively... :-\
  3. Implemented proper nlink semantics, started adding tests

    The tests are simply a shell script that checks the basic
    functionality of the file system, which isn't very flexible
    at all but for now it'll have to do -- it's at least a whole
    lot better than no tests at all :-)
  4. Just enable the log_call() lines and be done with it!

    I keep commenting and uncommenting the log_call() lines and
    regularly forget to re-enable them, which means I get to
    re-run the test I just executed :-\. Therefore I've now
    decided to just keep the calls enabled, which means -vv will
    log all file system calls unless calls_log_filter is changed.
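
The string interning of path segments described in commit 2 above could look roughly like the sketch below. The `strings` and `tree` table names appear in this log, but the column layout and the `intern_segment()` and `add_path()` helpers are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each distinct path segment is stored exactly once (as a BLOB).
    CREATE TABLE strings (id INTEGER PRIMARY KEY, value BLOB NOT NULL UNIQUE);
    -- Tree nodes refer to their name by id instead of repeating the string.
    CREATE TABLE tree (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,
        name INTEGER NOT NULL REFERENCES strings (id)
    );
""")

def intern_segment(segment):
    """Return the id of a path segment, inserting it only on first use."""
    conn.execute("INSERT OR IGNORE INTO strings (value) VALUES (?)", (segment,))
    row = conn.execute(
        "SELECT id FROM strings WHERE value = ?", (segment,)
    ).fetchone()
    return row[0]

def add_path(path):
    """Insert one tree row per segment, each pointing at an interned string."""
    parent_id = None
    for segment in path.strip("/").split("/"):
        cursor = conn.execute(
            "INSERT INTO tree (parent_id, name) VALUES (?, ?)",
            (parent_id, intern_segment(segment.encode("utf-8"))),
        )
        parent_id = cursor.lastrowid

# Two backups that contain the same directory names.
for backup in ("monday", "tuesday"):
    add_path("/%s/usr/share/doc" % backup)

tree_rows = conn.execute("SELECT COUNT(*) FROM tree").fetchone()[0]
unique_strings = conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]
```

In this toy run eight tree rows share five stored strings; with many backups of the same system the ratio becomes far more lopsided, which is the effect the commit message reports.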
Commits on May 28, 2010
  1. Consider changing project name?

  2. Improved flexibility of SQLite database initialization

    Previously if the SQLite database file didn't exist yet it
    would be initialized for first use, otherwise the database
    was presumed to already be properly initialized. This
    approach made it impossible to use DedupFS with, for example,
    Python's tempfile.mkstemp() function, which creates the
    (empty) file before the file system gets to see it.
    I've now changed the file system's SQL initialization code
    to always run and I've added IF NOT EXISTS and OR IGNORE
    clauses so that the statements have no effect after the
    first run. This shouldn't take more than a second or so
    on each initialization but it makes the code more robust.
  3. Small optimization in path2keys()
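
The always-run initialization from commit 2 above can be sketched as follows. The `options` table, the `block_size` setting and the `open_metastore()` helper are hypothetical names chosen for the example; only the IF NOT EXISTS and OR IGNORE clauses come from the commit message.

```python
import os
import sqlite3
import tempfile

SCHEMA = """
    CREATE TABLE IF NOT EXISTS options (
        name TEXT PRIMARY KEY,
        value TEXT NOT NULL
    );
    INSERT OR IGNORE INTO options (name, value) VALUES ('block_size', '131072');
"""

def open_metastore(path):
    """Open the database and run the (idempotent) initialization every time."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn

# mkstemp() creates an empty file, so "does the file exist?" is no longer
# a usable test for "has the database been initialized?".
fd, path = tempfile.mkstemp(suffix=".sqlite3")
os.close(fd)

first = open_metastore(path)
first.execute("UPDATE options SET value = '65536' WHERE name = 'block_size'")
first.commit()
first.close()

# The second initialization neither fails nor resets the changed value.
second = open_metastore(path)
rows = second.execute("SELECT name, value FROM options").fetchall()
second.close()
os.remove(path)
```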

Commits on May 20, 2010
  1. Use Berkeley DB instead of SQLite to store the data blocks

    From the updated README:
    The file system initially stored everything in a single SQLite database, but it
    turned out that after the database grew beyond 8 GB the write speed would drop
    from 8-12 MB/s to 2-3 MB/s. Therefore the file system now uses a second database
    to store the data blocks. Berkeley DB is used for this database because it's
    meant to be used as a key/value store and doesn't require escaping binary data.
  2. Improved profiling support

     * Don't strip_dirs() because it causes confusion when filenames are included in listings.
     * Show the internal time for each function instead of
       the cumulative time (more useful after all?).
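
A minimal sketch of the kind of report the profiling commit above describes, using the standard `cProfile` and `pstats` modules: sorted by internal time and without calling `strip_dirs()`. The `busy_work()` function is a stand-in workload, not part of DedupFS.

```python
import cProfile
import io
import pstats

def busy_work(n):
    """Stand-in for a file system operation worth profiling."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
busy_work(100000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
# No stats.strip_dirs() here, so full file names stay in the listing.
stats.sort_stats("time")   # internal time; use "cumulative" for cumulative time
stats.print_stats(10)      # only show the ten most expensive entries
report = stream.getvalue()
```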
Commits on May 19, 2010
  1. Changed getattr(), statfs() to return named tuples

    This improves performance slightly and as a free bonus
    Stat() objects can be pretty printed without any code.
    I've `backported' this change from my experimental
    local branch that uses Berkeley DB to store data blocks
    so here's hoping I didn't miss any changes :-S
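
To show what returning named tuples buys, here is a small sketch. The `Stat` name comes from the commit message; the field list (modelled on `os.stat_result`) and the `getattr_sketch()` helper are assumptions.

```python
import stat
from collections import namedtuple

# Assumed field names, mirroring the attributes of os.stat_result.
Stat = namedtuple(
    "Stat",
    "st_mode st_ino st_nlink st_uid st_gid st_size st_atime st_mtime st_ctime",
)

def getattr_sketch(inode, size):
    """Hypothetical stand-in for a getattr() handler returning a named tuple."""
    return Stat(
        st_mode=stat.S_IFREG | 0o644,  # regular file, rw-r--r--
        st_ino=inode,
        st_nlink=1,
        st_uid=0,
        st_gid=0,
        st_size=size,
        st_atime=0,
        st_mtime=0,
        st_ctime=0,
    )

attrs = getattr_sketch(inode=42, size=1024)
printed = repr(attrs)  # readable without writing any __repr__ code
```

Fields can be read by name (`attrs.st_size`) or by position (`attrs[5]`), and the generated `repr()` lists every field, which is the "pretty printed without any code" bonus mentioned above.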
Commits on May 17, 2010
  1. Initial commit
