New feature: "rcat" (inverse of cat) command for rclone #1001

Closed
DurvalMenezes opened this issue Jan 7, 2017 · 15 comments

DurvalMenezes commented Jan 7, 2017

As per this forum thread.

ncw added the enhancement label Jan 8, 2017
ncw added this to the v1.37 milestone Jan 8, 2017
ncw commented Feb 11, 2017

Note that this is also mentioned in #230.

Roman2K commented Feb 19, 2017

Waiting for rclone rcat too 👍 It would be very welcome for a project of mine, to avoid temp files (issue).

eborisch commented

Just dropping a note here: over in #230 I posted a link to a wrapper script that provides pipe in/out with integrity (checksum) checks along the way.

breunigs self-assigned this Jul 1, 2017
breunigs commented Jul 2, 2017

@ncw I have an implementation ready in the rcat branch, where I tested whether I could get away without creating a new interface. I simply set file.Size = -1 and that works fine for Google Drive and Dropbox (even with accounting and the bandwidth limiter working out of the box, to my surprise).

Do you have an easy means of running this against the other remotes, to see which ones either need patching or cannot support files of undefined length? I'd have to set up accounts for most of the providers first. The branch has an integration test for uploading files of unspecified size.
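
For illustration, here is a minimal sketch of that idea: an ObjectInfo-like value whose Size() returns -1. The interface below is a hypothetical, trimmed-down stand-in, not rclone's actual fs.ObjectInfo:

package main

import (
	"fmt"
	"time"
)

// objectInfo is a hypothetical, simplified stand-in for rclone's
// fs.ObjectInfo, just enough to illustrate the Size() == -1 idea.
type objectInfo interface {
	Remote() string
	Size() int64 // -1 signals "length unknown, stream it"
	ModTime() time.Time
}

// streamInfo describes a file arriving on stdin whose final size
// is not known until the stream ends.
type streamInfo struct {
	remote  string
	modTime time.Time
}

func (s *streamInfo) Remote() string     { return s.remote }
func (s *streamInfo) Size() int64        { return -1 }
func (s *streamInfo) ModTime() time.Time { return s.modTime }

func main() {
	var src objectInfo = &streamInfo{remote: "backup.tar.gz", modTime: time.Now()}
	if src.Size() < 0 {
		fmt.Println("unknown size: the backend must accept a streaming upload")
	}
}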

DurvalMenezes commented

@breunigs, I have an immediate application for this on Google Drive (at the tail end of the tar-pipe commands I use to back up my VMs).

Is there an easy way for me to get a Linux-AMD64 rclone binary with this enabled? I would like to avoid compiling it myself right now, if possible.

breunigs commented Jul 2, 2017

https://www.breunig.xyz/share/rclone-beta/v1.36-235-g96ee739b-rcat%CE%B2/

I advise against uploading the tar files through these means, though. The upload is non-chunked and the data is not kept around, meaning retries are not possible and any failure will break the pipe. If the tar files are expensive to compute or your uplink is slow, you might be better off with eborisch's script (I haven't tried it): https://github.com/eborisch/rpipe

DurvalMenezes commented Jul 4, 2017

Hello @breunigs,

Thanks for making the binary available, and for the detailed warnings. My tarfiles are neither difficult nor expensive to compute, and on those machines I have a reliable and reasonably fast uplink, so the lack of retries is OK with me (the worst that could happen is that I miss a backup window).

I'm happy to report I've just tested it to produce a backup from one of my VMs (actually, from an LXC container running inside the VM, both running Devuan Jessie 1.0.0) to an encrypted remote configured for my Google Drive account, and it worked great. The command I used was like this:

(cd /; sudo tar --one-file-system --numeric-owner -czpf - .) | rclone-rcat rcat egd:HOSTNAME_-_FS_-_TIMESTAMP.tar.gz

After it finished, I verified it completely by piping an "rclone cat" of the same remote tarfile above into "tar d", like this:

rclone cat egd:HOSTNAME_-_FS_-_TIMESTAMP.tar.gz | (cd /; sudo tar --one-file-system --numeric-owner -dzpf -)

And it reported just the expected differences (~/.rclone.conf itself, system log files, etc.).

So things seem to be working perfectly. I have just one minor quibble: is there a reason for "rclone rcat" to run at less than half the speed of the subsequent "rclone cat" (apart from possible internet vagaries between my VM/LXC container and the Google Drive servers)? I clocked "rclone rcat" at just 4.82 MB/s (roughly half the nominal speed of the symmetrical 100 Mbps link this machine has), while "rclone cat" ran at a much more expected 10.77 MB/s.

Thanks again, and Cheers,
-- Durval.

breunigs commented Jul 4, 2017

Interesting about the speed. I have a few ideas about where that might be coming from. Can you try the following?

  1. (cd /; sudo tar --one-file-system --numeric-owner -czpf - .) | pv >/dev/null. pv is packaged for most distros and shows the speed of data flowing through it. If this doesn't reach at least ~12 MB/s, then your disk is the bottleneck.
  2. Can you compare the upload speed of rclone copy pregenerated.tar egd:/bla vs rclone rcat? If copy is also slow, it's probably the network path being congested.
  3. If copy is fast, does it get slow when you run it with something like rclone copy --drive-upload-cutoff 1GB (needs to be bigger than the file you're trying to upload)? If so, it's because of the different endpoints used – single upload vs. multipart upload. I guess that's pretty unlikely, though.
  4. If copy is still fast, I would assume stdin is read without (additional?) buffering, and the overhead of reading it in tiny pieces is what's limiting the speed. For Drive it would make sense to reuse the --chunk-size setting, though not with the current approach; see the sketch after this list.
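
To make point 4 concrete, here is a hypothetical sketch (not rclone's actual code) of what buffering stdin might look like, wrapping it in a bufio.Reader so downstream code sees large reads instead of tiny pipe-sized ones:

package main

import (
	"bufio"
	"io"
	"log"
	"os"
)

func main() {
	// Wrap stdin in a 1 MiB buffered reader; a real implementation
	// might reuse the --chunk-size setting instead of a constant.
	in := bufio.NewReaderSize(os.Stdin, 1<<20)
	// Stream everything through; replace os.Stdout with the upload
	// path to compare throughput with and without the buffer.
	if _, err := io.Copy(os.Stdout, in); err != nil {
		log.Fatal(err)
	}
}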

ncw commented Jul 8, 2017

@breunigs Just looked at your rcat branch - looking very nice! file.Size = -1 will definitely work for some remotes, but not others (e.g. B2, ACD). This should probably be a Feature flag, which would be helpful for rclone mount too - it is essentially the same way that rclone mount uploads files.

If the remote doesn't have the "streaming upload" feature then it could fall back to storing the file on disk first. I would make a temporary local Fs and then use fs.Copy() to copy the file up, which will do retries, etc. That could also be a flag option.
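
A rough sketch of that fallback, with a placeholder upload() standing in for the real fs.Copy(): spool stdin to a temporary file so the upload has a known size and can be retried:

package main

import (
	"io"
	"log"
	"os"
)

// upload is a placeholder for a size-known, retryable upload
// (e.g. what fs.Copy() would do in rclone itself).
func upload(path string) error {
	log.Printf("uploading %s with retries", path)
	return nil
}

func main() {
	// Spool stdin to a temporary file first, since this backend
	// cannot stream data of unknown length.
	tmp, err := os.CreateTemp("", "rcat-spool-")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(tmp.Name())

	if _, err := io.Copy(tmp, os.Stdin); err != nil {
		log.Fatal(err)
	}
	if err := tmp.Close(); err != nil {
		log.Fatal(err)
	}
	// Now the size is known and the upload can be retried on failure.
	if err := upload(tmp.Name()); err != nil {
		log.Fatal(err)
	}
}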

I put some comments on the code too, one of which might explain the speed issue.

ncw commented Jul 8, 2017

@breunigs I ran your integration test on all the remotes.

I got failures from

  • acd
  • yandex

Surprisingly to me, the b2 remote passed. I think what happened is that, because the src didn't have an SHA-1, the b2 backend spooled it to disk.

breunigs commented Jul 8, 2017

Thanks for checking that! I will see about adding the StreamingUpload interface/feature. I'm not quite sure how one can use a temporary local Fs, but since the test suite uses it extensively I should be able to find out. It sounds like a decent way to reuse the checking code.
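
As a thought experiment, the feature flag might look something like this; the names are illustrative, not rclone's actual Features struct:

package main

import "fmt"

// Features is a hypothetical per-backend capability record.
type Features struct {
	StreamingUpload bool // can the backend accept unknown-length data?
}

type Remote struct {
	Name     string
	Features Features
}

// rcatStrategy picks the upload path based on the capability flag.
func rcatStrategy(r Remote) string {
	if r.Features.StreamingUpload {
		return "stream stdin directly"
	}
	return "spool to local disk, then copy with retries"
}

func main() {
	for _, r := range []Remote{
		{Name: "drive", Features: Features{StreamingUpload: true}},
		{Name: "b2", Features: Features{StreamingUpload: false}},
	} {
		fmt.Printf("%s: %s\n", r.Name, rcatStrategy(r))
	}
}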

breunigs commented Jul 14, 2017

Note to self: Dropbox upload fails with 301 MiB but still works with 300 MiB. Need to figure out where this limitation comes from. Fixed

breunigs commented

@ncw would you mind taking another look?

I added direct streaming support for all remotes that can handle uploads of unknown size. The Dropbox "small files code path" only supports up to 150 MB (documented) / 300 MB (actually) before there's an error within the dropbox-sdk code. The chunked upload path requires chunks+1 API calls, with the last one carrying only metadata. I guess it would be possible to buffer chunkSize bytes to check whether there's enough data to save one HTTP request, but it didn't seem a good trade-off; see the sketch below.
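
For reference, here is a toy sketch of that buffering idea (peek one chunk's worth of data to choose between the single-request and chunked endpoints); it's illustrative only, not the actual dropbox backend code:

package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

const chunkSize = 8 // tiny for demonstration; real chunks are megabytes

// decide reads up to chunkSize+1 bytes: if the stream fits in one
// chunk, a single small-file upload suffices; otherwise use the
// chunked path. The peeked bytes are stitched back onto the reader.
func decide(r io.Reader) (string, io.Reader) {
	buf := make([]byte, chunkSize+1)
	n, _ := io.ReadFull(r, buf)
	rest := io.MultiReader(bytes.NewReader(buf[:n]), r)
	if n <= chunkSize {
		return "single-request upload", rest
	}
	return "chunked upload", rest
}

func main() {
	mode, _ := decide(strings.NewReader("short"))
	fmt.Println(mode) // single-request upload
	mode, _ = decide(strings.NewReader("definitely more than eight bytes"))
	fmt.Println(mode) // chunked upload
}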

ncw modified the milestones: v1.37, v1.38 Jul 19, 2017
breunigs commented Aug 2, 2017

@ncw I guess my review request got lost during the 1.37 release days? I rebased the branch on master and checked the speed:

# copying a 4GB file                        # runtime in seconds
cat x | pv -at > y                          # 40, 58, 52
cat x | pv -at | ./rclone rcat localtest:y  # 44, 56, 62

Without properly benchmarking it, I would say this puts rclone at least in the same ballpark, which I guess is good enough.

ncw commented Aug 2, 2017

@breunigs sorry about missing your request. Making a PR is probably the best way of getting me to review stuff (yes, you can make a PR from and to the same repo!).

The code looks fine - I think you should merge that.

I made rclone info, which makes a start at discovering things about filenames. You could add a bit in there to discover whether streaming uploads work, if you wanted.
