Customize block size #177
So yes, we could add an option, at least as a way to experiment. I would like to set a default that is good for the widest range of situations rather than making people work it out.

The block size also determines the granularity with which identical content can be found across different files, or across different versions of the same file. So increasing it is likely to decrease block reuse to some extent.

One goal for Conserve is to very aggressively issue lots of parallel IO, making use of Rust's fearless concurrency. It already does this to some extent, and there is room to do much more. Many factors in contemporary systems align with this approach: many cores, deep SSD device command queues, high bandwidth-delay networks. If we have, say, 10-100 requests in flight then per-request latency still matters, but not as much. So I think this is the main thing to lean on, but it is more complicated to implement than just increasing the target block size.

Another thing to consider is that Conserve tries to write blocks of a certain size, but for various reasons some objects can be smaller; they could be made larger, but that would have other tradeoffs, e.g. around recovering from an interrupted backup. So again, parallelism for throughput.

Finally: are you really seeing seconds per 1MB object to cloud storage? Which service, over what network? I wouldn't be surprised by 300-1000ms, but seconds seems high?
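To make the parallelism point concrete, here is a minimal sketch of keeping many block writes in flight at once with scoped threads. This is not Conserve's actual code; `upload_block`, `upload_all`, and the worker count are hypothetical stand-ins.

```rust
use std::thread;

/// Hypothetical stand-in for whatever actually stores one block remotely:
/// assume roughly one network round trip per call.
fn upload_block(block: &[u8]) -> std::io::Result<()> {
    let _ = block;
    Ok(())
}

/// Keep roughly `workers` uploads in flight at once, so per-request latency
/// is paid in parallel rather than once per block in sequence.
fn upload_all(blocks: &[Vec<u8>], workers: usize) -> std::io::Result<()> {
    let workers = workers.max(1);
    thread::scope(|scope| {
        let mut handles = Vec::new();
        // Deal blocks out round-robin; each worker uploads its share serially.
        for w in 0..workers {
            let mine: Vec<&Vec<u8>> = blocks.iter().skip(w).step_by(workers).collect();
            handles.push(scope.spawn(move || -> std::io::Result<()> {
                for block in mine {
                    upload_block(block)?;
                }
                Ok(())
            }));
        }
        for handle in handles {
            handle.join().expect("upload worker panicked")?;
        }
        Ok(())
    })
}
```

Under a pattern like this, a 300-1000ms per-request latency is overlapped across 10-100 in-flight requests instead of being paid once per block in sequence.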
If someone is interested in doing this, here is a guide. There are actually a few variables to consider, including:

Lines 91 to 98 in 1c8a847

Line 31 in 1c8a847

I think that's it.
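As a rough illustration of the shape of the change (all names below are hypothetical, not Conserve's real identifiers; the real constants and option handling are at the lines referenced above), the hard-coded target size would become a field on the backup options and be consulted wherever files are split into blocks:

```rust
/// Hypothetical default, standing in for the existing hard-coded constant.
const DEFAULT_TARGET_BLOCK_SIZE: usize = 1 << 20; // 1 MiB

pub struct BackupOptions {
    /// Target uncompressed size of each block written to the archive.
    pub target_block_size: usize,
    // ... other backup options ...
}

impl Default for BackupOptions {
    fn default() -> Self {
        BackupOptions {
            target_block_size: DEFAULT_TARGET_BLOCK_SIZE,
        }
    }
}

/// Split file content into blocks of at most the configured size.
fn chunk_file<'a>(content: &'a [u8], options: &BackupOptions) -> std::slice::Chunks<'a, u8> {
    content.chunks(options.target_block_size)
}
```

A command-line option would then just populate that field instead of the default.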
Hey, just a quick thought:
I'm using Box mounted using rclone. Using the terminal, cd-ing into the mount takes ~1s for a normal directory, but several seconds if there are many files in the directory.
Nothing too bad: nothing should be making strong assumptions that the blocks are of any particular size.

Unchanged files (same mtime) will continue to reference the blocks they used last time. Files not recognized as unchanged, but which in fact have content in common, will no longer match that content if the block size changes, so all their content will be written again to new-sized blocks. That would include cases like: the file was touched (mtime updated with no content change); the file was renamed or copied; more data was appended to the file.

We should still test it of course. And if this is relied upon it should be (more?) explicit in the docs.

If we want larger files, probably the index hunks would be the place to start.

There is also an assumption that a number of blocks can be fairly freely held in memory. So we shouldn't make them 2GB or anything extreme like that, where holding 20 simultaneously could cause problems.
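To illustrate the reuse rule in that first paragraph, here is a simplified sketch (hypothetical types and names, not the real index code): a file judged unchanged by mtime keeps pointing at whatever blocks it referenced last time, while anything else is re-chunked at the current target size.

```rust
use std::time::SystemTime;

/// Hypothetical, simplified index entry: which blocks hold a file's content.
struct IndexEntry {
    mtime: SystemTime,
    block_addrs: Vec<String>, // content hashes of the blocks holding this file
}

fn backup_file(
    file_mtime: SystemTime,
    previous: Option<&IndexEntry>,
    content: &[u8],
    target_block_size: usize,
) -> IndexEntry {
    if let Some(prev) = previous {
        if prev.mtime == file_mtime {
            // Unchanged: keep the old block addresses, whatever size those blocks were.
            return IndexEntry {
                mtime: file_mtime,
                block_addrs: prev.block_addrs.clone(),
            };
        }
    }
    // Touched, renamed, copied, appended to, or genuinely changed:
    // the content is stored again, now split at the current block size.
    let block_addrs = content
        .chunks(target_block_size)
        .map(store_block)
        .collect();
    IndexEntry {
        mtime: file_mtime,
        block_addrs,
    }
}

/// Stand-in for hashing a chunk and writing it to the block store if absent.
fn store_block(chunk: &[u8]) -> String {
    use std::hash::{Hash, Hasher};
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    chunk.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```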
Interesting... I wonder how many API calls are generated from a single file read or write. Running conserve with

There might be a big win from a transport that talks to the Box API directly, which would be some more work, but perhaps not an enormous amount.
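For what it's worth, the shape of such a transport might look something like the trait below. This is purely illustrative; it does not claim to match Conserve's actual transport interface, and a Box backend would implement it by calling the Box HTTP API directly rather than going through an rclone mount.

```rust
use std::io;

/// Hypothetical storage-transport abstraction.
pub trait Transport: Send + Sync {
    /// Read the whole file at `relpath` within the archive.
    fn read_file(&self, relpath: &str) -> io::Result<Vec<u8>>;
    /// Atomically create or replace the file at `relpath`.
    fn write_file(&self, relpath: &str, content: &[u8]) -> io::Result<()>;
    /// Cheap existence check: ideally one metadata request, not a listing.
    fn exists(&self, relpath: &str) -> io::Result<bool>;
    /// List names directly under `relpath` (needed by validate, delete, gc).
    fn list_dir(&self, relpath: &str) -> io::Result<Vec<String>>;
}
```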
I ran with
Nothing else has been printed and the progress indicator does not show.

Running without

Without
I'm using rclone's encryption function, which may not work if talking to the Box API directly.
Oh the logging might be on my SFTP branch.
I don't think Conserve requests to read any archive directory during a backup. (It does during validate, delete, and gc.) If rclone reads the remote directory repeatedly even when the app does not request it, that may be a performance drag regardless of block size. Perhaps you can get a request log out of rclone?

And, let's split Box/rclone performance into a separate bug.
rclone reports a lot of repeated reads:
These reads repeat several times before reading

Each instance of the

Stopping and restarting the backup operation continues reads on

Occasionally, rclone reports

in succession, for different ids. I believe that this corresponds to writing a block file.
This might be connected to #175, just fixed by @WolverinDEV, which is one cause of repeatedly re-reading files. However, there are some other cases where it reads a small file repeatedly in a way that is cheap on a local filesystem (where it will be in cache) but might be very slow remotely. It's definitely worth fixing, and I think I have fixed some in the sftp branch, but there are probably more.
With the latest changes, it looks like there are no more repeated reads. Now, for each block, it first checks whether it is already written. If not, it will create a temp file (

1 MB spike around every second, where each operation has a ~300ms round trip.
Yep, it does currently
This is pretty reasonable (although perhaps not optimal) locally, but not good if the filesystem has very high latency. A few options:
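For reference, the per-block write path being discussed is roughly the following sketch (hypothetical names, not the actual conserve source, and not one of the options above). Each step is cheap on a local filesystem but becomes its own round trip on a high-latency remote mount.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Store one compressed block under its hash, unless it already exists:
/// one existence check, one temp-file write, one rename per block.
fn store_block_if_absent(block_dir: &Path, hash: &str, compressed: &[u8]) -> io::Result<()> {
    let final_path = block_dir.join(hash);
    // Round trip 1: is the block already present (e.g. from an earlier backup)?
    if final_path.exists() {
        return Ok(());
    }
    // Round trip 2: write to a temporary name so a partial write is never
    // visible under the final name.
    let tmp_path = block_dir.join(format!("tmp-{hash}"));
    fs::write(&tmp_path, compressed)?;
    // Round trip 3: atomically move it into place.
    fs::rename(&tmp_path, &final_path)
}
```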
What @road2react describes seems to be pretty much what I experienced as well.
Right now, blocks are limited to 1 MB each. When backing up to cloud storage, the latency of each read and write may be significant, reaching up to several seconds per operation.
A potential way to reduce overhead would be to increase the block size.
Would this be possible?
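As a back-of-the-envelope illustration of why this matters (the tree size and latency below are assumptions for the sake of the arithmetic, not measurements):

```rust
fn main() {
    let tree_bytes: u64 = 10 * 1024 * 1024 * 1024; // say, 10 GiB of new data
    let block_bytes: u64 = 1024 * 1024;            // current ~1 MiB blocks
    let latency_secs: f64 = 1.0;                   // assumed seconds per remote operation

    let ops = tree_bytes / block_bytes;            // 10,240 block writes
    let serial_hours = ops as f64 * latency_secs / 3600.0;
    println!("{ops} serial operations ~ {serial_hours:.1} hours of pure latency");
    // Larger blocks shrink `ops`; more requests in flight shrink the effective
    // cost of each one. Both attack the same term.
}
```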