make sure on get stores are tried in random order
Also update docs
muhamadazmy committed Sep 25, 2023
1 parent f27801f commit cb795e6
Showing 6 changed files with 90 additions and 131 deletions.
28 changes: 28 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -45,6 +45,7 @@ blake2b_simd = "1"
aes-gcm = "0.10"
hex = "0.4"
lazy_static = "1.4"
rand = "0.8"
# the following are only needed for the binaries
simple_logger = {version = "1.0.1", optional = true}
daemonize = { version = "0.5", optional = true }
11 changes: 7 additions & 4 deletions README.md
@@ -110,7 +110,7 @@ Options:
Similar to `mount`, `rfs` provides an `unpack` subcommand that downloads and extracts the entire content of an `fl` to a provided directory.

```bash
fs unpack --help
rfs unpack --help
unpack (downloads) content of an FL the provided location

Usage: rfs unpack [OPTIONS] --meta <META> <TARGET>
@@ -119,11 +119,14 @@ Arguments:
<TARGET> target directory to upload

Options:
-m, --meta <META> path to metadata file (flist)
-c, --cache <CACHE> directory used as cache for downloaded file chuncks [default: /tmp/cache]
-h, --help Print help
-m, --meta <META> path to metadata file (flist)
-c, --cache <CACHE> directory used as cache for downloaded file chuncks [default: /tmp/cache]
-p, --preserve-ownership preserve files ownership from the FL, otherwise use the current user ownership setting this flag to true normally requires sudo
-h, --help Print help
```

By default, the `-p` flag is not set when unpacking, which means downloaded files will be owned by the current user/group. If the `-p` flag is set, file ownership will be the same as that of the original files used to create the fl (the `uid` and `gid` of the files and directories are preserved); this normally requires `sudo` while unpacking.

# Specifications

Please check [docs](docs)
173 changes: 47 additions & 126 deletions docs/README.md
@@ -1,151 +1,72 @@
# Prerequisites
# FungiList specifications

## Introduction

If the term user space is unfamiliar, see [**Kernel space** and **User space**](https://en.wikipedia.org/wiki/User_space_and_kernel_space)
The idea behind the FL format is to build a full filesystem description that is compact and also easy to use from almost ANY language. The format needs to be easy to edit with tools like `rfs` or any other tool.

We eventually decided to use `sqlite`! Yes, the `FL` file is just a `sqlite` database that has the following [schema](../schema/schema.sql).

**FS** (File System): a way of managing how data is stored and retrieved, such as ext4 and NTFS.
## Tables

**How files are stored and retrieved**:
### Inode

- in Linux, every file has an associated structure called an <u>inode</u>
- the inode is a data structure which contains the following

- the file's metadata
- 12 direct pointers to 12 data blocks of the file
- an indirect pointer to 12 direct pointers, which point to the next 12 data blocks
- two doubly indirect pointers, each of which points to an indirect pointer, and that indirect pointer points to the next 12 data blocks
- three triply indirect pointers .. you know the idea :)
The inode table describes each entry in the filesystem. It closely matches the `inode` structure of the Linux operating system. Each inode has a unique id called `ino`, a parent `ino`, a name, and other parameters (user, group, etc.).

The type of the `inode` is defined by its `mode`, which is a `1:1` mapping from the Linux `mode`.

<u>Inode structure</u>
> from the [inode manual](https://man7.org/linux/man-pages/man7/inode.7.html)
| meta data |
```
POSIX refers to the stat.st_mode bits corresponding to the mask
S_IFMT (see below) as the file type, the 12 bits corresponding to
the mask 07777 as the file mode bits and the least significant 9
bits (0777) as the file permission bits.
| 12 | ----> data blocks (0)
| direct | ----> ...
| pointers | ----> data blocks (11)
The following mask values are defined for the file type:
|indirectPtr| ----> | 12 | ----> data blocks (12)
| direct | ----> ...
| pointers | ----> data blocks (23)
S_IFMT 0170000 bit mask for the file type bit field
|two doubly | ----> |indirectPtr| ----> | 12 | ----> data blocks (24)
|indirectPtr| ----- | direct | ----> ...
| | pointers | ----> data blocks (23)
|
|
--> |indirectPtr| ----> | 12 | ----> data blocks (24)
| direct | ----> ...
| pointers | ----> data blocks (23)
S_IFSOCK 0140000 socket
S_IFLNK 0120000 symbolic link
S_IFREG 0100000 regular file
S_IFBLK 0060000 block device
S_IFDIR 0040000 directory
S_IFCHR 0020000 character device
S_IFIFO 0010000 FIFO
```
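
As a rough illustration of how a tool might read the type of an inode from its `mode`, here is a minimal sketch using only the mask values quoted above. The `FileType` enum and the `file_type` and `permissions` helpers are hypothetical names invented for this example; they are not part of `rfs`.

```rust
// File-type mask and values, copied from the inode(7) excerpt above.
const S_IFMT: u32 = 0o170000;
const S_IFSOCK: u32 = 0o140000;
const S_IFLNK: u32 = 0o120000;
const S_IFREG: u32 = 0o100000;
const S_IFBLK: u32 = 0o060000;
const S_IFDIR: u32 = 0o040000;
const S_IFCHR: u32 = 0o020000;
const S_IFIFO: u32 = 0o010000;

/// Hypothetical enum for this sketch; not a type defined by rfs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Socket,
    Symlink,
    Regular,
    BlockDevice,
    Directory,
    CharDevice,
    Fifo,
    Unknown,
}

/// Extract the file type by masking the mode with S_IFMT.
pub fn file_type(mode: u32) -> FileType {
    match mode & S_IFMT {
        S_IFSOCK => FileType::Socket,
        S_IFLNK => FileType::Symlink,
        S_IFREG => FileType::Regular,
        S_IFBLK => FileType::BlockDevice,
        S_IFDIR => FileType::Directory,
        S_IFCHR => FileType::CharDevice,
        S_IFIFO => FileType::Fifo,
        _ => FileType::Unknown,
    }
}

/// The least significant 9 bits are the permission bits.
pub fn permissions(mode: u32) -> u32 {
    mode & 0o777
}
```

For instance, a mode of `0o100644` would decode as a regular file with permission bits `0o644`.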

| three | ----> |two doubly indirectPtr| ....
| triply | ----> |two doubly indirectPtr| ....
|indirectPtr| ----> |two doubly indirectPtr| ....
## Extra

**VFS** (Virtual File System): an abstraction layer over the actual mounted filesystems. User space sees only the virtual file system, and the VFS manages the underlying filesystems. <u>As a result</u>, any file system must register with the VFS.
The `extra` table holds any **optional** data associated with the inode, based on its type. For now it holds the `link target` for symlink inodes.

**FUSE** (File system in USEr space): a user space filesystem framework (consisting of two parts) used to build your own file system (a user-defined file system).
## Tag

A tag is a key/value pair holding user-defined data associated with the FL. The standard keys are:

**FUSE main components**:
- `version`
- `description`
- `author`

- FUSE file system daemon (with FUSE library i.e. libfuse)
- FUSE driver (with a request queue)
An FL author can also add other custom keys there.

These two components communicate through the `/dev/fuse` device, which works as IPC (Inter-Process Communication).
## Block

**How FUSE works**:
The `block` table associates data blocks with files. The `id` field is the blob `id` in the `store`, and the `key` is the key used to decrypt the blob. The current implementation of `rfs` does the following:

![image](https://user-images.githubusercontent.com/18401282/160552509-d40ab27a-a002-4fae-a6b6-fb983f97babf.png)
- For each blob (512k), the `sha256` of its content is computed. This becomes the encryption key of the block. We call it `key`
- The block is then `snap` compressed
- Then it is encrypted with `aes_gcm` using the `key`, with the first 12 bytes of the key as the `nonce`
- The final encrypted block is hashed again with `sha256`; this becomes the `id` of the block
- The final encrypted blob is then sent to the store using the `id` as a key.
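
The two purely mechanical parts of the steps above, splitting the data into blobs and taking the nonce from the key, can be sketched with the standard library alone. This is only an illustration under stated assumptions: `BLOB_SIZE` assumes "512k" means 512 KiB, and `split_blobs` and `nonce_from_key` are hypothetical helper names, not functions from `rfs`. Hashing, compression and encryption are left out because they need external crates.

```rust
/// Assumption: "512k" in the text means 512 KiB.
pub const BLOB_SIZE: usize = 512 * 1024;
/// The text uses the first 12 bytes of the key as the nonce.
pub const NONCE_LEN: usize = 12;

/// Split file content into blobs of at most BLOB_SIZE bytes.
/// The last blob may be shorter than BLOB_SIZE.
pub fn split_blobs(data: &[u8]) -> std::slice::Chunks<'_, u8> {
    data.chunks(BLOB_SIZE)
}

/// Take the first 12 bytes of a 32-byte key (the sha256 of the blob)
/// to use as the nonce.
pub fn nonce_from_key(key: &[u8; 32]) -> [u8; NONCE_LEN] {
    let mut nonce = [0u8; NONCE_LEN];
    nonce.copy_from_slice(&key[..NONCE_LEN]);
    nonce
}
```

Because the key is derived from the content of the blob itself, identical blobs produce the same key and nonce, and therefore the same encrypted blob and the same `id`.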

## Route

The route table holds routing information for the blobs. It describes where to find `blobs` with certain `ids`. The routing is done as follows:

- suppose we take the `/tmp/rmnt` directory as a mount point for our own file system (i.e. rfs)
- The FUSE driver registers `/tmp/rmnt` as a mount point in the VFS
- when you are inside `/tmp/rmnt` and you run an application such as `ls` (which means you need to read the content of the current directory)
- This leads to a call to the VFS
- The VFS knows that `/tmp/rmnt` is related to the FUSE driver, so the VFS routes the operation to it.
- the driver allocates a FUSE request structure and puts it in the FUSE queue in a wait state
- the FUSE daemon then picks the request from the kernel queue by reading from `/dev/fuse`.
- the userspace filesystem (i.e. `rfs`) will read the request and process it.
> Note routing table is loaded one time when `rfs` is started and
---
# rfs (Rust File-System)
rfs (Rust File-System): a FUSE file system which is used in Zero-OS.

**The main idea** of `rfs` is reading a remote file as needed chunk by chunk.

<u>Explanation</u>:

- your files are stored on a remote server, which is [hub.grid.tf](hub.grid.tf)
- and you want to read a file or a video locally
- rfs will only download the file or the video in chunks (not the whole file or video)
- every chunk is actually a file with a hash that represents the id of that chunk
- the downloaded chunks are saved in a cache (which is actually a directory on your local machine, e.g. `/tmp/cache`)

<u> But how does rfs know about the structure of the remote server's filesystem? </u>

- the server [hub.grid.tf](hub.grid.tf) receives a `.tar.gz` file.
- It contains the whole file system structure with its contents.
- after the `.tar.gz` file is uploaded, the server builds a `.flist` file (an acronym for File LIST)
- This `.flist` file is a sqlite database holding the file system structure without the actual contents (only the names of the directories and the files).


**The interaction between `rfs` and the `FUSE daemon`**

- the FUSE daemon picks up the request from the kernel queue by reading from `/dev/fuse`
- `rfs` reads the request and checks which operation is to be served (the operation must be implemented to be served)
- after handling the request, `rfs` replies with the result

<u>Handling the read operation</u>


- The request asks, for example, to read 400B from position x (where x < 400 and the chunk size is 100B)
- Calculate the cursor offset and the chunk index (offset = x, chunk_index = x/chunk_size)
- Read the chunk (chunk_index) from the cache; if the chunk is not in the cache, it is downloaded and cached first
- Repeat the previous step until the requested size is fulfilled, then respond to the FUSE request


<u>Use Case</u>
(suppose that every block is 100B)

|-------0--------|--------1---------|---------2-------|---------3-------|--------4--------|

^--------offset = 150 && size = 250 - ---^

- chunk_index = offset/chunk_size = 150/100 = 1
- here the read operation will start from block (1) and continue until block (3) to fulfill the requested size
- if a block was already downloaded, it will be found in the cache and read from there
- if the block is not in the cache, it will be downloaded from the server


---
# rfs Usage

rfs --help

USAGE:
rfs [FLAGS] [OPTIONS] <TARGET> --meta <META>

FLAGS:
-d, --daemon daemonize process
--debug enable debug logging
-h, --help Prints help information
-V, --version Prints version information

OPTIONS:
--cache <cache> cache directory [default: /tmp/cache]
--storage-url <hub> storage url to retrieve files from [default: redis://hub.grid.tf:9900]
--log <log> log file only in daemon mode
--meta <META> metadata file, can be a .flist file, a .sqlite3 file or a directory with a
`flistdb.sqlite3` inside

ARGS:
<TARGET>



### Use case

rfs --meta <filename>.flist /tmp/rmnt
- We use the first byte of the blob `id` as the `route key`
- The `route key` is then looked up in the routing table
- While building an `FL`, all matching stores are updated with the new blob. This is how the system does replication
- On `getting` an object, the list of matching routes is tried in random order; the first one to return a value is used
- Note that identical and overlapping ranges are allowed; this is how sharding and replication are done.
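
The lookup described above can be sketched as follows. This is a simplified illustration, not the `Router` from `src/store/mod.rs`: the `Route` struct, its field names, and the inclusive treatment of the range bounds are all assumptions made for this example.

```rust
/// Hypothetical route entry: a range of route keys mapped to a store URL.
/// Assumption: both `start` and `end` are inclusive.
pub struct Route {
    pub start: u8,
    pub end: u8,
    pub url: String,
}

/// Return every route whose range contains the first byte of the blob id.
/// Several routes may match; overlapping ranges are allowed.
pub fn matching_routes<'a>(routes: &'a [Route], id: &[u8]) -> Vec<&'a Route> {
    // The route key is the first byte of the blob id.
    let route_key = match id.first() {
        Some(byte) => *byte,
        None => return Vec::new(), // an empty id matches nothing
    };
    routes
        .iter()
        .filter(|r| r.start <= route_key && route_key <= r.end)
        .collect()
}
```

When writing, every store returned by such a lookup would receive the blob; when reading, the commit shuffles the matching stores with `rand::seq::SliceRandom` before trying them one by one.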
Binary file removed docs/fs-behavior.png
8 changes: 7 additions & 1 deletion src/store/mod.rs
@@ -3,6 +3,7 @@ pub mod dir;
mod router;
pub mod zdb;

use rand::seq::SliceRandom;
use std::{collections::HashMap, pin::Pin};

pub use bs::BlockStore;
@@ -117,7 +118,12 @@ impl Store for Router {
return Err(Error::InvalidKey);
}
let mut errors = Vec::default();
for store in self.route(key[0]) {

// to make it fair we shuffle the list of matching routers randomly every time
// before we do a get
let mut routers: Vec<&Box<dyn Store>> = self.route(key[0]).collect();
routers.shuffle(&mut rand::thread_rng());
for store in routers {
match store.get(key).await {
Ok(object) => return Ok(object),
Err(err) => errors.push(err),