An academic implementation of a FUSE-based file system using Couchbase for distributed storage.

raycardillo/cbfuse

cbfuse

An academic exercise that implements a FUSE file-system using Couchbase as the data store.


This is currently just a fun/academic/experimental project that I started to help me learn more about FUSE and to gain some first hand experience with libcouchbase (the C SDK for Couchbase) after seeing all of the new and exciting features coming in Couchbase Server 7.0.

The idea is to use Couchbase Server as a distributed data store for a user-space file system built with FUSE. Because this was just a fun exercise, I wasn't focused on efficient file system design: I have not put much thought into optimizing key access patterns, and I have not dealt with distributed locks, CAS, etc. Network access could also be improved by using transactions or batch operations. Those topics would need to be addressed for this to become a more robust and useful distributed filesystem.

If you're looking for an actual distributed file server built on top of Couchbase, check out cbfs on CouchbaseLabs instead.

Implementation Notes:

  • I am currently using the FUSE high-level operations to create a logical overlay of a filesystem.
  • I have not fully tested FUSE in the normal multi-threaded daemon mode of operation (only tested with -f -s so far).
  • All of the calls to Couchbase are currently synchronous and I haven't optimized batch calls or looked into transactions.
  • Currently only developed and tested with macOS using macFUSE for convenience.
  • Paths are currently limited to 250 characters (the maximum length of a Couchbase key) but I have plans to expand that.
    • Here are some thoughts:
      1. Must try to take advantage of Couchbase keys for quick lookup (and future improvements I want to explore).
      2. Must only use more expensive operations/techniques when needed (e.g., when path is larger than 250 characters).
      3. Must support at least 4096 character upper limit (the current path limit for ext4 file systems).
      4. I want to avoid more time consuming lookup strategies that require multiple trips (e.g., path keys, collision documents).
      5. However, using a counter may be useful if the solution is fast and results in fewer calls and less complex keys.
    • Current idea:
      • When path <= 250 characters:
        • Just use it because it's already unique.
      • When path > 250 characters:
        • An XXH128 hash is computed over the entire path and encoded as a 22 character Base64 string (128 bits, unpadded).
        • The remaining 228 characters are samples taken from the path itself, which help keep the key unique.
        • From a path of size n (where n > 250) and starting at [0] the samples are taken from:
          • 30 chars starting at: [1]
          • 48 chars starting at: [n*0.25]
          • 50 chars starting at: [n*0.50]
          • 50 chars starting at: [n*0.75]
          • 50 chars starting at: [n-51]
        • Note that the sample positions scale with the path length, so samples are drawn from throughout the string. This is important because some storage patterns have common sub-structures where the unique part of the path appears early in the string and the rest is shared (or vice versa).
        • XXH128 itself has practically zero chance of collision (see: https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison).
        • The combination of XXH128 plus these character samples, with paths up to 4096, bounds the limits fairly well.
        • Collisions should be practically impossible, but I'll leave the math to bound the probability as an exercise for the theoretical Computer Scientists.
    • REMINDER: This is all because I want to use the high-level operations which are heavily based on the path string. To avoid multiple calls, I need to obtain a unique key on the client without reverting to strategies that would require multiple trips. When using the high-level operations, this is true for all operations, but it's especially true for operations like getattr which are called very frequently.
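
The "current idea" above can be sketched in C roughly as follows. This is only an illustration of the scheme as described, not the code in this repository: the names `cbfuse_build_key` and `digest_to_base64` are hypothetical, the standard Base64 alphabet is an assumption (the notes only say "Base64"), and the fractional offsets are assumed to round down. The 16-byte XXH128 digest is assumed to be computed by the caller with xxHash and passed in, so the sketch has no external dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CB_KEY_MAX 250 /* key length limit cited in the notes above */
#define HASH_CHARS 22  /* 128 bits as unpadded Base64 */

/* Assumption: standard Base64 alphabet. */
static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode a 16-byte digest as 22 Base64 characters plus a terminating NUL. */
static void digest_to_base64(const unsigned char digest[16],
                             char out[HASH_CHARS + 1])
{
    size_t o = 0;
    /* Five full 3-byte groups produce 20 characters. */
    for (size_t i = 0; i < 15; i += 3) {
        unsigned v = ((unsigned)digest[i] << 16) |
                     ((unsigned)digest[i + 1] << 8) |
                     (unsigned)digest[i + 2];
        out[o++] = B64[(v >> 18) & 63];
        out[o++] = B64[(v >> 12) & 63];
        out[o++] = B64[(v >> 6) & 63];
        out[o++] = B64[v & 63];
    }
    /* The last byte produces 2 characters (no '=' padding). */
    out[o++] = B64[digest[15] >> 2];
    out[o++] = B64[(digest[15] & 3) << 4];
    out[o] = '\0';
}

/*
 * Build a Couchbase key for `path` and return its length.
 * Paths of up to 250 characters are used as-is; longer paths become
 * 22 hash characters followed by 228 sampled characters.
 * `digest` is the caller-supplied 16-byte XXH128 digest of the whole path
 * (only read when the path is longer than 250 characters).
 */
size_t cbfuse_build_key(const char *path, const unsigned char digest[16],
                        char out[CB_KEY_MAX + 1])
{
    assert(path != NULL && digest != NULL && out != NULL);

    size_t n = strlen(path);
    if (n <= CB_KEY_MAX) {
        memcpy(out, path, n + 1); /* already unique, copy including NUL */
        return n;
    }

    /* Sample windows from the notes: start offsets and lengths. */
    const size_t start[5] = { 1, n / 4, n / 2, (3 * n) / 4, n - 51 };
    const size_t count[5] = { 30, 48, 50, 50, 50 };

    digest_to_base64(digest, out);
    size_t o = HASH_CHARS;
    for (int i = 0; i < 5; i++) {
        memcpy(out + o, path + start[i], count[i]);
        o += count[i];
    }
    out[o] = '\0';
    return o; /* 22 + 228 = 250 */
}
```

Because every sample window stays inside the path whenever n > 250, no bounds checks are needed beyond the length test, and the key is derived entirely on the client with no extra trips to the server.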

General Development Notes

  • Could be cleaner - this was just a quick mash up for experimentation and fun. I may improve it in the future if I have time or other ideas to explore.
  • Tested primarily on macOS Big Sur (11.4) (x86).
  • I tried to keep the CMake config "clean" so it should only require a little TLC to build for other Unix-like operating systems.
  • FUSE on Windows is a different story and Dokan is probably a better approach (rewrite or using the FUSE wrapper utility).

Development Build Environment Setup

  • Install cmake utility
    • brew install cmake
  • Install Couchbase Server 7.0
  • Install Couchbase C library
    • brew install libcouchbase
    • IMPORTANT: You'll need v3.1.0 until the next 7.0 RC is released!! (more info)
    • I installed v3.1.0 by downloading source and doing a manual build.
  • Install FUSE
    • brew install macfuse
    • Tested with macfuse (v4.1.2) == FUSE (v2.9)
  • Install cJSON
    • brew install cjson
    • Tested with v1.7.14
  • Install xxHash
    • brew install xxhash
    • Tested with 0.8.0
  • Install Visual Studio Code (optional)

Getting Started

  • Build cbfuse
    • see environment setup instructions above
    • mkdir build; cd build
    • cmake ..
    • cmake --build . --config Release
  • Setup Couchbase
    • Start the Couchbase server
    • Create a bucket (e.g., cbfuse)
    • See ./scripts/setup.sh, which does the following in the cbfuse bucket:
      • Under the _default Scope, add Collections:
        • stats - used for basic file stat attributes
        • blocks - used to store file data blocks
        • dentries - used to store directory entry info
  • Running a quick debug test
    • This filesystem runs in the foreground and is single-threaded.
    • Mount the filesystem
      • ./cbfuse/cbfuse ~/cbfuse --cb_connect=couchbase://127.0.0.1/cbfuse --cb_username=raycardillo --cb_password=raycardillo
    • Unmount the filesystem
      • umount ~/cbfuse
