
Customize block size #177

Open
road2react opened this issue Aug 12, 2022 · 15 comments
Labels
topic:performance, type:format-change (issues requiring an archive format change)

Comments

@road2react

Right now, blocks are limited to 1 MB each. When backing up to cloud storage, read and write latency can be significant, reaching up to several seconds per operation.

A potential way to reduce overhead would be to increase the block size.

Would this be possible?

@sourcefrog
Owner

So yes, we could add an option, at least as a way to experiment.

I would like to set a default that is good for the widest range of situations rather than making people work it out.

The block size also determines the granularity with which identical content can be found across different files, or across different versions of the same file. So increasing it is likely to decrease block reuse to some extent.

One goal for Conserve is to issue lots of parallel IO very aggressively, making use of Rust's fearless concurrency. It already does this to some extent, and there is room to do much more. Many factors in contemporary systems align with this approach: many cores, deep SSD device command queues, and networks with high bandwidth-delay products. If we have, say, 10-100 requests in flight, then per-request latency still matters, but not as much. So I think this is the main thing to lean on, but it is more complicated to implement than just increasing the target block size.

Another thing to consider is that Conserve tries to write blocks of a certain size, but for various reasons some objects can be smaller. They could be made larger, but that would have other tradeoffs, e.g. around recovering from an interrupted backup. So again: parallelism for throughput.
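To make the parallelism idea concrete, here is a minimal sketch of bounded parallel uploads using the futures crate. This is not Conserve's actual code; RemoteTransport and put_block are hypothetical stand-ins for whatever transport API ends up doing the remote writes.

// Sketch only: keep many block uploads in flight so per-request latency
// overlaps instead of being paid serially for each 1 MB object.
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for a remote transport.
struct RemoteTransport;

impl RemoteTransport {
    async fn put_block(&self, _name: String, _data: Vec<u8>) -> std::io::Result<()> {
        // e.g. an HTTP PUT of the compressed block to object storage
        Ok(())
    }
}

async fn upload_blocks(
    transport: &RemoteTransport,
    blocks: Vec<(String, Vec<u8>)>,
) -> Vec<std::io::Result<()>> {
    const IN_FLIGHT: usize = 32; // keep ~10-100 requests outstanding
    stream::iter(blocks)
        .map(|(name, data)| transport.put_block(name, data))
        .buffer_unordered(IN_FLIGHT)
        .collect()
        .await
}

With something like this, the effective cost per block approaches the latency divided by the number of requests in flight, which matters more than the block size itself.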

Finally: are you really seeing seconds per 1MB object to cloud storage? Which service, over what network? I wouldn't be surprised by 300-1000ms but seconds seems high?

@sourcefrog
Owner

If someone is interested in doing this, here is a guide:

There are actually a few variables to consider, including:

conserve/src/lib.rs

Lines 91 to 98 in 1c8a847

/// Break blocks at this many uncompressed bytes.
pub(crate) const MAX_BLOCK_SIZE: usize = 1 << 20;
/// Maximum file size that will be combined with others rather than being stored alone.
const SMALL_FILE_CAP: u64 = 100_000;
/// Target maximum uncompressed size for combined blocks.
const TARGET_COMBINED_BLOCK_SIZE: usize = MAX_BLOCK_SIZE;

pub const MAX_ENTRIES_PER_HUNK: usize = 1000;

  • Add variables to BackupOptions
  • Migrate users of the constants to look in BackupOptions instead
  • Add command line options to set them: rather than adding a bunch of specific, rarely-used options, perhaps we should have -o entries_per_hunk=10000
  • Add CLI tests that use them and then inspect the stats and the contents of the repo

I think that's it.
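Roughly, the first two steps could look like the sketch below. The field names and defaults here are only illustrative, not the current BackupOptions definition; callers of the constants would then read the option fields instead.

// Sketch only: turn the compile-time constants into per-backup options,
// keeping the current values as defaults. Field names are illustrative.
pub struct BackupOptions {
    // ... existing options ...
    /// Break blocks at this many uncompressed bytes (today's MAX_BLOCK_SIZE).
    pub max_block_size: usize,
    /// Maximum number of index entries per hunk (today's MAX_ENTRIES_PER_HUNK).
    pub max_entries_per_hunk: usize,
}

impl Default for BackupOptions {
    fn default() -> Self {
        BackupOptions {
            max_block_size: 1 << 20,    // 1 MiB
            max_entries_per_hunk: 1000,
        }
    }
}

A -o style flag would then parse key=value pairs into these fields, and the CLI tests could assert on the resulting stats and repository contents.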

@WolverinDEV
Contributor

Hey, just a quick thought:
What happens if we mix backups with different block sizes in the same archive?

@road2react
Author

Finally: are you really seeing seconds per 1MB object to cloud storage? Which service, over what network? I wouldn't be surprised by 300-1000ms but seconds seems high?

I'm using Box mounted using rclone. Using the terminal, cd-ing into the mount takes ~1s for a normal directory, but several seconds if there are many files in the directory.

@sourcefrog
Owner

Hey, just a quick thought: What happens if we mix backups with different block sizes in the same archive?

Nothing too bad: nothing should be making strong assumptions that the blocks are of any particular size.

Unchanged files (same mtime) will continue to reference the blocks they used last time.

Files not recognized as unchanged but which in fact have content in common will no longer match that content if the block size changes, so all their content will be written again to new-sized blocks. That would include cases like: the file was touched (mtime updated with no content change); the file was renamed or copied; more data was appended to the file.

We should still test it of course. And if this is relied upon it should be (more?) explicit in the docs.

If we want larger files, the index hunks would probably be the place to start.

There is also an assumption that a number of blocks can be fairly freely held in memory. So we shouldn't make them 2GB or anything extreme like that, where holding 20 simultaneously could cause problems.

@sourcefrog
Owner

Finally: are you really seeing seconds per 1MB object to cloud storage? Which service, over what network? I wouldn't be surprised by 300-1000ms but seconds seems high?

I'm using Box mounted using rclone. Using the terminal, cd-ing into the mount takes ~1s for a normal directory, but several seconds if there are many files in the directory.

Interesting... I wonder how many API calls are generated from a single file read or write.

Running conserve with -D may give you an idea which file IOs are slow.

There might be a big win from a transport that talks to the Box API directly, which would be some more work, but perhaps not an enormous amount.

@road2react
Author

I ran with -D and only received the following output:

2022-08-12T17:38:55.829222Z TRACE conserve: tracing enabled
2022-08-12T17:38:55.829283Z DEBUG globset: built glob set; 1 literals, 2 basenames, 0 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes    

Nothing else has been printed and the progress indicator does not show.

Running without -D allows the progress bar to show and indicates that the backup is working.

Without -D, the progress line "10 new entries, 0 changed, 0 deleted, 0 unchanged" shows a new entry about 2-3 times per second. (I assume the several seconds taken by cd is due to traversing every file in the directory.)

@road2react
Author

There might be a big win from a transport that talks to the Box API directly, which would be some more work, but perhaps not an enormous amount.

I'm using rclone's encryption function, which may not work if talking to the Box API directly.

@sourcefrog
Owner

Oh, the logging might be on my SFTP branch.

@sourcefrog
Owner

I don't think Conserve reads any archive directory during a backup. (It does during validate, delete, and gc.)

If rclone reads the remote directory repeatedly even when the app does not request it, that may be a performance drag regardless of block size.

Perhaps you can get a request log out of rclone?

And let's split Box/rclone performance into a separate bug.

@road2react
Author

road2react commented Aug 13, 2022

rclone reports a lot of repeated reads:

rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: Open: flags=OpenReadOnly
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: Open: flags=O_RDONLY
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: >Open: fd=conserve/b0000/BANDHEAD (r), err=<nil>
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: >Open: fh=&{conserve/b0000/BANDHEAD (r)}, err=<nil>
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: Attr:
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: >Attr: a=valid=1m0s ino=0 size=56 mode=-rw-r--r--, err=<nil>
rclone[376785]: DEBUG : &{conserve/b0000/BANDHEAD (r)}: Read: len=4096, offset=0
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: ChunkedReader.openRange at 0 length 1048576
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: ChunkedReader.Read at 0 length 4096 chunkOffset 0 chunkSize 1048576
rclone[376785]: DEBUG : &{conserve/b0000/BANDHEAD (r)}: >Read: read=56, err=<nil>
rclone[376785]: DEBUG : &{conserve/b0000/BANDHEAD (r)}: Flush:
rclone[376785]: DEBUG : &{conserve/b0000/BANDHEAD (r)}: >Flush: err=<nil>
rclone[376785]: DEBUG : &{conserve/b0000/BANDHEAD (r)}: Release:
rclone[376785]: DEBUG : conserve/b0000/BANDHEAD: ReadFileHandle.Release closing

These reads repeat several times before reading conserve/d/<id>.

Each instance of the BANDHEAD read operation aligns with an increment of the amount of new entries in the progress display.

Stopping and restarting the backup operation continues the reads on b0000, even after newer bands are created.

Occasionally, rclone reports

DEBUG : conserve/d/1d6: Re-reading directory (25h16m9.701252182s old)
DEBUG : conserve/d/1d6/: >Lookup: node=conserve/d/1d6/1d61f19d98dbc87b0a6b3b1941b7133aaaebed5a19db44f85b3f42a1d739a36cab1d0e7179ba1297a0338d21873e253662152156fd4f6b551f2b36e8a71a9674, err=<nil>

in succession, for different ids. I believe that this corresponds to writing a block file.

@sourcefrog
Owner

This might be connected to #175, just fixed by @WolverinDEV, which is one cause of repeatedly re-reading files.

However, there are some other cases where it reads a small file repeatedly in a way that is cheap on a local filesystem (where it will be in cache) but might be very slow remotely. It's definitely worth fixing, and I think I have fixed some in the sftp branch, but there are probably more.

@road2react
Author

With the latest changes:

It looks like there are no more repeated reads. Now, for each block, it first checks whether the block is already written. If not, it creates a temp file (tmp08SFF2), writes to that file, and renames it to the block id. This appears to be 3 round trips per block, which aligns with the network pattern:

[image: network throughput graph]

A ~1 MB spike around every second, where each operation has a ~300 ms round trip.

@sourcefrog
Owner

Yep, it currently does the following:

  1. See if the block is already present, in which case we don't need to do the work to compress it.
  2. Write to a temporary file, so that if the process is interrupted we won't be left with an incomplete file under the final name. (This is not 100% guaranteed by the filesystem, but it's the usual pattern.)
  3. Rename into place. (Steps 2 and 3 are sketched below.)
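Here is a rough sketch of that temp-file-then-rename pattern, not Conserve's actual code (which uses randomized temp names like the tmp08SFF2 seen above): write the data to a temporary file and only rename it into place once it is complete.

use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch only: a crash mid-write leaves at most a stray temp file,
// never a truncated block under its final name.
fn write_block_atomically(dir: &Path, block_hash: &str, data: &[u8]) -> std::io::Result<()> {
    let tmp_path = dir.join(format!("{}.tmp", block_hash));
    let final_path = dir.join(block_hash);
    let mut f = fs::File::create(&tmp_path)?;
    f.write_all(data)?;
    f.sync_all()?; // flush the data before the rename makes it visible
    fs::rename(&tmp_path, &final_path)?; // atomic on POSIX filesystems
    Ok(())
}

On a high-latency remote filesystem each of those steps is a separate round trip, which is exactly the cost being discussed here.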

This is pretty reasonable (although perhaps not optimal) locally, but not good if the filesystem has very high latency.

A few options:

  1. Just issue more parallel IO.
  2. Remember which blocks are referenced by the basis index: we can already assume they're present and should not need to check. (The most common case, of an unchanged file, does not check, but there might be other edge cases. This should be pretty rare.)
  3. Similarly, remember blocks that we've already seen are present (Cache in RAM for presence of blocks #106); see the sketch after this list.
  4. If we have a Transport API for the remote filesystem, then in some cases that may already support a reliable atomic write that cannot leave the file half-written. For example this should be possible on S3. Then we don't need the rename.
  5. Even on Unix or Windows maybe a faster atomic write is possible?
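For option 3, a minimal sketch of the in-RAM presence cache might look like this (the type and field names are illustrative, not necessarily what #106 will use):

use std::collections::HashSet;

// Sketch only: remember block hashes already confirmed to exist in the
// archive, so later occurrences skip the remote existence check entirely.
struct BlockPresenceCache {
    present: HashSet<String>,
}

impl BlockPresenceCache {
    fn new() -> Self {
        BlockPresenceCache { present: HashSet::new() }
    }

    /// True if we already know the block exists; avoids a round trip.
    fn contains(&self, hash: &str) -> bool {
        self.present.contains(hash)
    }

    /// Record that a block was found or written, so later checks are free.
    fn insert(&mut self, hash: String) {
        self.present.insert(hash);
    }
}

The cache only grows within one backup run, so a block seen as present can reasonably be assumed to stay present for the rest of the run.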

@WolverinDEV
Contributor

This might be connected to #175, just fixed by @WolverinDEV, which is one cause of repeatedly re-reading files.

What @road2react describes seems to be pretty much what I experienced as well.
The odds are high that this has been fixed by #175.
