
rclone multi-thread copy is slow on local write #3419

Closed
mmatczuk opened this issue Aug 8, 2019 · 23 comments

@mmatczuk
Contributor

mmatczuk commented Aug 8, 2019

What is the problem you are having with rclone?

Rclone takes significantly more time to write a file than cp unix command.

I traced it down to multi-thread copy being slow, at least in that case.

I conducted an experiment with a 1 GB file that is not in the page cache, copying it to an empty directory. With cp the runtime is about a second.

$ pcstat 1GB.bin.1
+-----------+----------------+------------+-----------+---------+
| Name      | Size (bytes)   | Pages      | Cached    | Percent |
|-----------+----------------+------------+-----------+---------|
| 1GB.bin.1 | 1073741824     | 262144     | 0         | 000.000 |
+-----------+----------------+------------+-----------+---------+
$ taskset -c 0 /usr/bin/time -v cp 1GB.bin.1 /var/empty/rclone/
        Command being timed: "cp 1GB.bin.1 /var/empty/rclone/"
        User time (seconds): 0.00
        System time (seconds): 0.50
        Percent of CPU this job got: 44%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2492
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 152
        Voluntary context switches: 3769
        Involuntary context switches: 28
        Swaps: 0
        File system inputs: 2097696
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

With rclone, the same copy takes about 15 seconds.

$ taskset -c 0 /usr/bin/time -v ./rclone copy 1GB.bin.1 /var/empty/rclone
        Command being timed: "./rclone copy 1GB.bin.1 /var/empty/rclone"
        User time (seconds): 14.46
        System time (seconds): 0.81
        Percent of CPU this job got: 93%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.41
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 27600
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2228
        Voluntary context switches: 7190
        Involuntary context switches: 1980
        Swaps: 0
        File system inputs: 2097152
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

This is quite surprising, as a plain io.Copy program is almost as fast as cp:

package main

import (
	"flag"
	"io"
	"os"
)

func main() {
	src := flag.String("src", "", "")
	dst := flag.String("dst", "", "")
	mode := flag.Int("mode", 0, "")
	flag.Parse()

	var err error
	switch *mode {
	case 0:
		err = stdCopy(*src, *dst)
	default:
		panic("No such mode")
	}
	if err != nil {
		panic(err)
	}
}

func stdCopy(src, dst string) error {
	r, err := os.OpenFile(src, os.O_RDONLY, 0)
	if err != nil {
		return err
	}
	w, err := os.OpenFile(dst, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
	if err != nil {
		r.Close() // don't leak the source handle if the destination fails to open
		return err
	}

	if _, err := io.Copy(w, r); err != nil {
		return err
	}
	if err = r.Close(); err != nil {
		return err
	}
	if err = w.Close(); err != nil {
		return err
	}

	return nil
}
$ pcstat 1GB.bin
+---------+----------------+------------+-----------+---------+
| Name    | Size (bytes)   | Pages      | Cached    | Percent |
|---------+----------------+------------+-----------+---------|
| 1GB.bin | 1073741824     | 262144     | 0         | 000.000 |
+---------+----------------+------------+-----------+---------+
$ taskset -c 0 /usr/bin/time -v ./poc -mode 0 -src 1GB.bin -dst /var/empty/rclone/1GB.bin
        Command being timed: "./poc -mode 0 -src 1GB.bin -dst /var/empty/rclone/1GB.bin"
        User time (seconds): 0.06
        System time (seconds): 0.59
        Percent of CPU this job got: 44%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1572
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 156
        Voluntary context switches: 33945
        Involuntary context switches: 11214
        Swaps: 0
        File system inputs: 2097152
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

What is your rclone version (output from rclone version)

commit d51a970

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Linux lh 5.2.5-200.fc30.x86_64 #1 SMP Wed Jul 31 14:37:17 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Which cloud storage system are you using? (eg Google Drive)

Only local

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone copy 1GB.bin.1 /var/empty/rclone

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

 michal@lh:scylla/rclone (master)$ ./rclone -vv copy 1GB.bin.2 /var/empty/rclone
2019/08/08 12:03:26 DEBUG : rclone: Version "v1.48.0-114-gd51a9709-beta" starting with parameters ["./rclone" "-vv" "copy" "1GB.bin.2" "/var/empty/rclone"]
2019/08/08 12:03:26 DEBUG : Using config file from "/home/michal/.config/rclone/rclone.conf"
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: Couldn't find file - need to transfer
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: Starting multi-thread copy with 4 parts of size 256M
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/4 (268435456-536870912) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/4 (0-268435456) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 3/4 (536870912-805306368) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M finished
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 3/4 (536870912-805306368) size 256M finished
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/4 (268435456-536870912) size 256M finished
2019/08/08 12:03:36 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/4 (0-268435456) size 256M finished
2019/08/08 12:03:36 DEBUG : 1GB.bin.2: Finished multi-thread copy with 4 parts of size 256M
2019/08/08 12:03:39 INFO  : 1GB.bin.2: Multi-thread Copied (new)
2019/08/08 12:03:39 INFO  : 
Transferred:            1G / 1 GBytes, 100%, 74.522 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:       13.7s

When running without multi-thread copy, we get almost a 10x improvement.

$ ./rclone -vv copy --multi-thread-cutoff 10G 1GB.bin.2 /var/empty/rclone
2019/08/08 12:06:11 DEBUG : rclone: Version "v1.48.0-114-gd51a9709-beta" starting with parameters ["./rclone" "-vv" "copy" "--multi-thread-cutoff" "10G" "1GB.bin.2" "/var/empty/rclone"]
2019/08/08 12:06:11 DEBUG : Using config file from "/home/michal/.config/rclone/rclone.conf"
2019/08/08 12:06:11 DEBUG : 1GB.bin.2: Couldn't find file - need to transfer
2019/08/08 12:06:13 INFO  : 1GB.bin.2: Copied (new)
2019/08/08 12:06:13 INFO  : 
Transferred:            1G / 1 GBytes, 100%, 551.474 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        1.8s

2019/08/08 12:06:13 DEBUG : 3 go routines active
2019/08/08 12:06:13 DEBUG : rclone: Version "v1.48.0-114-gd51a9709-beta" finishing with parameters ["./rclone" "-vv" "copy" "--multi-thread-cutoff" "10G" "1GB.bin.2" "/var/empty/rclone"]
@mmatczuk mmatczuk changed the title rclone is fairly slow on local write rclone multi-thread copy is slow on local write Aug 8, 2019
@ncw
Member

ncw commented Aug 8, 2019

Multithread copy is a definite win for copying over the network.

I can see why it wouldn't be a win for the local backend though, as in general I/O through the kernel isn't optimised for writing many streams at once.

What do you think we should do - disable it for local -> local copies?

Or do you think there is a way we can fix multithread copy?

Making this bigger would probably help

multithreadBufferSize = 32 * 1024

@ncw ncw added the bug label Aug 8, 2019
@ncw ncw added this to the v1.49 milestone Aug 8, 2019
@mmatczuk
Contributor Author

mmatczuk commented Aug 9, 2019

I think that disabling multi-thread copy for local-to-local copies makes sense.

The issue is that network I/O can nowadays be faster than disk. Probably not on my laptop, but on a beefy server in AWS reading from S3 in the same region it can be. We can always disable it manually.

I spotted something weird, and I'm not sure why it's happening. Take a look at the timings:

2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/4 (268435456-536870912) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/4 (0-268435456) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 3/4 (536870912-805306368) size 256M starting
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M finished
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 3/4 (536870912-805306368) size 256M finished
2019/08/08 12:03:26 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/4 (268435456-536870912) size 256M finished
2019/08/08 12:03:36 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/4 (0-268435456) size 256M finished

Every stream finishes quickly except the first one: it starts at 2019/08/08 12:03:26 but doesn't finish until 2019/08/08 12:03:36.

When I change the number of streams to 2, it is similar:

$ ./rclone -vv copy --multi-thread-streams 2 1GB.bin.2 /var/empty/rclone/
2019/08/09 11:44:04 DEBUG : rclone: Version "v1.48.0-115-g42a93249-mmt/fadvise-beta" starting with parameters ["./rclone" "-vv" "copy" "--multi-thread-streams" "2" "1GB.bin.2" "/var/empty/rclone/"]
2019/08/09 11:44:04 DEBUG : Using config file from "/home/michal/.config/rclone/rclone.conf"
2019/08/09 11:44:04 DEBUG : 1GB.bin.2: Couldn't find file - need to transfer
2019/08/09 11:44:04 DEBUG : 1GB.bin.2: Starting multi-thread copy with 2 parts of size 512M
2019/08/09 11:44:04 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/2 (536870912-1073741824) size 512M starting
2019/08/09 11:44:04 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/2 (0-536870912) size 512M starting
2019/08/09 11:44:05 DEBUG : 1GB.bin.2: multi-thread copy: stream 2/2 (536870912-1073741824) size 512M finished
2019/08/09 11:44:25 DEBUG : 1GB.bin.2: multi-thread copy: stream 1/2 (0-536870912) size 512M finished
2019/08/09 11:44:25 DEBUG : 1GB.bin.2: Finished multi-thread copy with 2 parts of size 512M
2019/08/09 11:44:29 INFO  : 1GB.bin.2: Multi-thread Copied (new)
2019/08/09 11:44:29 INFO  : 
Transferred:            1G / 1 GBytes, 100%, 41.758 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:       24.5s

Any thoughts @ncw ?

ncw added a commit that referenced this issue Aug 10, 2019
Before this change, if the caller didn't provide a hint, we would
calculate all hashes for reads and writes.

The new whirlpool hash is particularly expensive and that has become noticeable.

Now we don't calculate any hashes on upload or download unless hints are provided.

This means that some operations may run slower and these will need to be discovered!

It does not affect anything calling operations.Copy, which already puts
the correct hints in.
ncw added a commit that referenced this issue Aug 10, 2019
Before this change for a post copy Hash check we would run the hashes sequentially.

Now we run the hashes in parallel for a useful speedup.

Note that this refactors the hash check in Copy to use the standard
hash checking routine.
ncw added a commit that referenced this issue Aug 10, 2019


Before this change we calculated the checksums when using
--ignore-checksum but ignored them at the end.

Now we don't calculate the checksums at all which is more efficient.
@ncw
Member

ncw commented Aug 10, 2019

Ah ha, well spotted... That led to a string of improvements, which I'll go into in detail here!

I pressed CTRL-\ during that pause at the end, and one of the goroutines is doing this:

goroutine 9 [running]:
github.com/rclone/rclone/vendor/github.com/jzelinskie/whirlpool.(*whirlpool).transform(0xc000236000)
	/home/ncw/go/src/github.com/rclone/rclone/vendor/github.com/jzelinskie/whirlpool/whirlpool.go:95 +0x525 fp=0xc00021fb78 sp=0xc00021fa60 pc=0x754ad5
github.com/rclone/rclone/vendor/github.com/jzelinskie/whirlpool.(*whirlpool).Write(0xc000236000, 0xc000238000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
	/home/ncw/go/src/github.com/rclone/rclone/vendor/github.com/jzelinskie/whirlpool/whirlpool.go:144 +0x139 fp=0xc00021fbb8 sp=0xc00021fb78 pc=0x754e09
io.(*multiWriter).Write(0xc000322180, 0xc000238000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
	/opt/go/go1.12/src/io/multi.go:60 +0x87 fp=0xc00021fc18 sp=0xc00021fbb8 pc=0x46af77
github.com/rclone/rclone/fs/hash.(*MultiHasher).Write(...)
	/home/ncw/go/src/github.com/rclone/rclone/fs/hash/hash.go:202
github.com/rclone/rclone/backend/local.(*localOpenFile).Read(0xc000366240, 0xc000238000, 0x8000, 0x8000, 0xc000238000, 0x8000, 0x8000)
	/home/ncw/go/src/github.com/rclone/rclone/backend/local/local.go:889 +0xee fp=0xc00021fcd8 sp=0xc00021fc18 pc=0xd5c1ee

As in computing the hash...

By default the local backend will compute all the hashes and the whirlpool hash is particularly expensive. I recently fixed a similar problem in rcat c014b2e.

And the reason it only affects the first chunk is that the local backend calculates all the hashes only if you start reading from 0.

If I set the default hashes for the local backend to none then I get this, with no long pause on the first segment:

$ rm 1G.copy ; rclone --multi-thread-streams 4 copyto -vv 1GB 1G.copy
2019/08/10 09:34:15 DEBUG : rclone: Version "v1.48.0-DEV" starting with parameters ["rclone" "--multi-thread-streams" "4" "copyto" "-vv" "1GB" "1G.copy"]
2019/08/10 09:34:15 DEBUG : Using config file from "/home/ncw/.rclone.conf"
2019/08/10 09:34:15 DEBUG : 1GB: Couldn't find file - need to transfer
2019/08/10 09:34:15 DEBUG : 1GB: Starting multi-thread copy with 4 parts of size 256M
2019/08/10 09:34:15 DEBUG : 1GB: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M starting
2019/08/10 09:34:15 DEBUG : 1GB: multi-thread copy: stream 2/4 (268435456-536870912) size 256M starting
2019/08/10 09:34:15 DEBUG : 1GB: multi-thread copy: stream 1/4 (0-268435456) size 256M starting
2019/08/10 09:34:15 DEBUG : 1GB: multi-thread copy: stream 3/4 (536870912-805306368) size 256M starting
2019/08/10 09:34:16 DEBUG : 1GB: multi-thread copy: stream 2/4 (268435456-536870912) size 256M finished
2019/08/10 09:34:16 DEBUG : 1GB: multi-thread copy: stream 1/4 (0-268435456) size 256M finished
2019/08/10 09:34:16 DEBUG : 1GB: multi-thread copy: stream 3/4 (536870912-805306368) size 256M finished
2019/08/10 09:34:16 DEBUG : 1GB: multi-thread copy: stream 4/4 (805306368-1073741824) size 256M finished
2019/08/10 09:34:16 DEBUG : 1GB: Finished multi-thread copy with 4 parts of size 256M
2019/08/10 09:34:16 calculating hash MD5
2019/08/10 09:34:18 done calculating hash MD5
2019/08/10 09:34:18 calculating hash MD5
2019/08/10 09:34:20 done calculating hash MD5
2019/08/10 09:34:20 INFO  : 1GB: Multi-thread Copied (new)
2019/08/10 09:34:20 INFO  : 
Transferred:   	        1G / 1 GBytes, 100%, 205.256 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        4.9s

However, I see the hashes aren't parallelized...

Parallelizing them gives

Transferred:   	        1G / 1 GBytes, 100%, 319.413 MBytes/s, ETA 0s

so quite a big improvement!

I don't think we are going to improve on that, since the multi-thread transfer has to read the source twice and both write and read back the destination, whereas the non-multi-thread copy only has to read the source once (the hash is calculated as it goes) and write the file once (again, the hash is calculated as it goes).

You can disable the post copy checksum with --ignore-checksum in which case the multithread copy becomes quicker as it doesn't calculate the checksums at all whereas I think the normal copy still calculates the checksums but doesn't check them...

I put in a fix for that too...

$ rm -f 1G.copy* ; rclone --multi-thread-streams 1 --ignore-checksum copyto -vv 1GB 1G.copy2
Transferred:   	        1G / 1 GBytes, 100%, 372.985 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        2.7s

$ rm -f 1G.copy* ; rclone --multi-thread-streams 2 --ignore-checksum copyto -vv 1GB 1G.copy2
Transferred:   	        1G / 1 GBytes, 100%, 457.752 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        2.2s

$ rm -f 1G.copy* ; rclone --multi-thread-streams 4 --ignore-checksum copyto -vv 1GB 1G.copy2
Transferred:   	        1G / 1 GBytes, 100%, 515.836 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        1.9s

These results don't take sync time into account, so here they are with sync included.

Note the ~3 seconds of user time for calculating the hashes. Note
also that with --ignore-checksum all methods are comparable with cp!

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 0 copyto  1GB 1G.copy2 ; sync )

real	0m5.335s
user	0m3.981s
sys	0m1.127s

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 2 copyto  1GB 1G.copy2 ; sync )

real	0m9.084s
user	0m4.142s
sys	0m1.832s

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 4 copyto  1GB 1G.copy2 ; sync )

real	0m8.476s
user	0m3.981s
sys	0m1.851s

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 0 --ignore-checksum copyto  1GB 1G.copy2 ; sync )

real	0m5.894s
user	0m0.184s
sys	0m1.557s

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 2 --ignore-checksum copyto  1GB 1G.copy2 ; sync )

real	0m5.726s
user	0m0.119s
sys	0m1.076s

$ rm -f 1G.copy* ; sync ; time ( rclone --multi-thread-streams 4 --ignore-checksum copyto  1GB 1G.copy2 ; sync )

real	0m5.258s
user	0m0.153s
sys	0m1.081s

$ rm -f 1G.copy* ; sync ; time ( cp -av 1GB 1G.copy2 ; sync )
'1GB' -> '1G.copy2'

real	0m6.980s
user	0m0.016s
sys	0m1.034s

Here is a beta with those things in (or pull from fix-3419-multithread branch).

Can you have a play and tell me what you think?

https://beta.rclone.org/branch/v1.48.0-152-gf9276355-fix-3419-multithread-beta/ (uploaded in 15-30 mins)

@mmatczuk
Contributor Author

mmatczuk commented Aug 10, 2019 via email

ncw added commits that referenced this issue Aug 12, 2019 (same commit messages as above)
@ncw
Member

ncw commented Aug 12, 2019

I've updated the branch according to your comments on the commit and added the final piece of the jigsaw: disabling multi-thread copies for local -> local copies.

Please have a go with this and tell me what you think!

https://beta.rclone.org/branch/v1.48.0-169-g784e5448-fix-3419-multithread-beta/

@mmatczuk
Contributor Author

I suppose you can drop e3008c9.

@mmatczuk
Contributor Author

mmatczuk commented Aug 13, 2019

I think that
784e544#diff-d1d05ad248544b6abc221402b0b813f3L313 and the following few lines could benefit from refactoring. Maybe add a function that decides whether or not to use multiple threads?

I was thinking that with 43a6774 in place the idea was to check whether the backend is local or not. Checking whether src supports OpenWriterAt feels like a coincidence, and if we have a remote backend that supports it, it may still be beneficial to use multiple threads...

@mmatczuk
Contributor Author

I think it would be a good idea to be explicit in the logs whenever the hash is/was calculated.

For instance, here it's very clear:

 michal@lh:scylla/rclone ((784e54484...))$ ./rclone copyto -vv 1GB.bin /var/empty/rclone/1GB.bin.2
2019/08/13 11:01:53 DEBUG : rclone: Version "v1.48.0-169-g784e5448-beta" starting with parameters ["./rclone" "copyto" "-vv" "1GB.bin" "/var/empty/rclone/1GB.bin.2"]
2019/08/13 11:01:53 DEBUG : Using config file from "/home/michal/.config/rclone/rclone.conf"
2019/08/13 11:01:53 DEBUG : 1GB.bin: Modification times differ by -118h56m57.558042664s: 2019-08-13 10:59:49.081421262 +0200 CEST, 2019-08-08 12:02:51.523378598 +0200 CEST
2019/08/13 11:01:55 DEBUG : 1GB.bin: MD5 = e347bf8447d84826265916745655c302 (Local file system at /home/michal/work/scylla/rclone)
2019/08/13 11:01:55 DEBUG : 1GB.bin.2: MD5 = 793f54796b31281fd6a03a2fee1ebf7d (Local file system at /var/empty/rclone)
2019/08/13 11:01:55 DEBUG : 1GB.bin: MD5 differ
2019/08/13 11:01:58 INFO  : 1GB.bin: Copied (replaced existing)
2019/08/13 11:01:58 INFO  : 
Transferred:            1G / 1 GBytes, 100%, 347.532 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        2.9s

2019/08/13 11:01:58 DEBUG : 3 go routines active
2019/08/13 11:01:58 DEBUG : rclone: Version "v1.48.0-169-g784e5448-beta" finishing with parameters ["./rclone" "copyto" "-vv" "1GB.bin" "/var/empty/rclone/1GB.bin.2"]

But here the hash is calculated, yet there is no sign of it in the logs:

 michal@lh:scylla/rclone ((784e54484...))$ ./rclone copyto -vv 1GB.bin /var/empty/rclone/1GB.bin.3
2019/08/13 11:02:36 DEBUG : rclone: Version "v1.48.0-169-g784e5448-beta" starting with parameters ["./rclone" "copyto" "-vv" "1GB.bin" "/var/empty/rclone/1GB.bin.3"]
2019/08/13 11:02:36 DEBUG : Using config file from "/home/michal/.config/rclone/rclone.conf"
2019/08/13 11:02:36 DEBUG : 1GB.bin: Couldn't find file - need to transfer
2019/08/13 11:02:39 INFO  : 1GB.bin: Copied (new)
2019/08/13 11:02:39 INFO  : 
Transferred:            1G / 1 GBytes, 100%, 350.713 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            1 / 1, 100%
Elapsed time:        2.9s

2019/08/13 11:02:39 DEBUG : 3 go routines active
2019/08/13 11:02:39 DEBUG : rclone: Version "v1.48.0-169-g784e5448-beta" finishing with parameters ["./rclone" "copyto" "-vv" "1GB.bin" "/var/empty/rclone/1GB.bin.3"]

@ncw
Member

ncw commented Aug 13, 2019

I think it would be a good idea to be explicit in the logs whenever the hash is/was calculated.

You get that long debug if the hashes differ. If they are the same then rclone just carries on.

Do you think a single DEBUG line like filename: MD5 = 1231231231312312313123 Matched would be sufficient for the matching case? I don't want to make too much noise in the debug log...

I suppose you can drop e3008c9.

Even though it is the same line of code as the next patch the commit message tells a different story!

I think that
784e544#diff-d1d05ad248544b6abc221402b0b813f3L313 and the following few lines could benefit of a refactoring. Maybe add a function to calculate whether to use or not multiple threads?

You are probably right!

I was thinking that with 43a6774 in place the idea was to check if it's local or not. Checking if src supports OpenWriterAt feels like a coincidence and if we have a remote backend that supports that it may be still beneficial to use multiple threads...

Yes you are right about this.

It isn't actually that easy to tell if a backend is the local backend! Yes, you are right that OpenWriterAt isn't brilliant as a detection mechanism. For example, SFTP could support OpenWriterAt easily enough...

I'll have another go.

As for checking the unwrapped Fs, that was my first attempt; however, I discovered that a multi-thread copy for a crypt backend wrapping a local backend is much faster than a single-thread copy. I think that is because the crypto can then use multiple CPUs!

So I think what I should be checking is if sourceFs == local && destFS == local and not unwrapping stuff.

...

Did you have a go with the performance? What do you think now?

@mmatczuk
Contributor Author

So I think what I should be checking is if sourceFs == local && destFS == local and not unwrapping stuff.

Sounds good, how about throttling?

Did you have a go with the performance? What do you think now?

Yes, it's clearly better now. There are some things I do not understand and did not take the effort to profile properly. Rclone seems to be doing something extra: it spends a significant amount of time in user space (User time (seconds): 0.16) and CPU usage is over 100%, which is kind of surprising.

 michal@lh:scylla/rclone ((784e54484...))$ taskset -c 0,1 /usr/bin/time -v cp 1GB.bin /var/empty/rclone/1GB.bin.9
        Command being timed: "cp 1GB.bin /var/empty/rclone/1GB.bin.9"
        User time (seconds): 0.00
        System time (seconds): 0.33
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.34
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2492
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 152
        Voluntary context switches: 1
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 0
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
 michal@lh:scylla/rclone ((784e54484...))$ sync
 michal@lh:scylla/rclone ((784e54484...))$ taskset -c 0,1 /usr/bin/time -v ./rclone copyto --ignore-checksum 1GB.bin /var/empty/rclone/1GB.bin.20
        Command being timed: "./rclone copyto --ignore-checksum 1GB.bin /var/empty/rclone/1GB.bin.20"
        User time (seconds): 0.16
        System time (seconds): 0.47
        Percent of CPU this job got: 158%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.40
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45024
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 6325
        Voluntary context switches: 2165
        Involuntary context switches: 72
        Swaps: 0
        File system inputs: 0
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ncw
Member

ncw commented Aug 13, 2019

So I think what I should be checking is if sourceFs == local && destFS == local and not unwrapping stuff.

Sounds good, how about throttling?

What do you mean about throttling? You can use --bwlimit if you want to throttle the speed.

Did you have a go with the performance? What do you think now?

Yes it's clearly better now,

Great!

there are some things I do not understand and did not take the effort to profile properly. Rclone seems to be doing something extra: it spends a significant amount of time in user space (User time (seconds): 0.16) and CPU usage is over 100%, which is kind of surprising.

I did a CPU profile on rclone and I got this result

	Command being timed: "rclone copyto --ignore-checksum --cpuprofile cpu.prof 1GB 1G.copy"
	User time (seconds): 0.22
	System time (seconds): 1.38
	Percent of CPU this job got: 45%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.55
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 31264
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2943
	Voluntary context switches: 3446
	Involuntary context switches: 174
	Swaps: 0
	File system inputs: 2097240
	File system outputs: 2097160
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

(pprof call-graph image: profile001)

(pprof) top20
Showing nodes accounting for 1.49s, 100% of 1.49s total
Showing top 20 nodes out of 66
      flat  flat%   sum%        cum   cum%
     1.17s 78.52% 78.52%      1.19s 79.87%  syscall.Syscall
     0.11s  7.38% 85.91%      0.11s  7.38%  runtime.memmove
     0.10s  6.71% 92.62%      0.10s  6.71%  syscall.Syscall6
     0.01s  0.67% 93.29%      0.53s 35.57%  github.com/rclone/rclone/backend/local.fadviseReadCloser.Read
     0.01s  0.67% 93.96%      0.13s  8.72%  github.com/rclone/rclone/fs/accounting.(*Account).Read
     0.01s  0.67% 94.63%      0.01s  0.67%  github.com/rclone/rclone/fs/asyncreader.(*AsyncReader).getBuffer
     0.01s  0.67% 95.30%      0.80s 53.69%  internal/poll.(*FD).Write
     0.01s  0.67% 95.97%      0.02s  1.34%  runtime.deferreturn
     0.01s  0.67% 96.64%      0.01s  0.67%  runtime.exitsyscall
     0.01s  0.67% 97.32%      0.01s  0.67%  runtime.futex
     0.01s  0.67% 97.99%      0.01s  0.67%  runtime.jmpdefer
     0.01s  0.67% 98.66%      0.01s  0.67%  runtime.reentersyscall
     0.01s  0.67% 99.33%      0.01s  0.67%  runtime.stopm
     0.01s  0.67%   100%      0.01s  0.67%  time.Time.IsZero
         0     0%   100%      0.93s 62.42%  github.com/rclone/rclone/backend/local.(*Fs).Put
         0     0%   100%      0.93s 62.42%  github.com/rclone/rclone/backend/local.(*Object).Update
         0     0%   100%      0.10s  6.71%  github.com/rclone/rclone/backend/local.(*fadvise).freePages
         0     0%   100%      0.10s  6.71%  github.com/rclone/rclone/backend/local.(*fadvise).freePagesIfNeeded
         0     0%   100%      0.10s  6.71%  github.com/rclone/rclone/backend/local.(*fadvise).next
         0     0%   100%      0.93s 62.42%  github.com/rclone/rclone/cmd.Main

So you can see that almost all the time is being used by syscall.Syscall and this is split between the read path and write path as you might expect.

Those times are elapsed times, not CPU times, though.

`time` reported `User time (seconds): 0.22`, so if we ignore the syscall time we are left with the major CPU usage being `runtime.memmove`, i.e. copying buffers about.

Rclone almost certainly does more memory copies than it needs to, and that is something that could be optimised, I think.

Rclone implements the `io.Reader` interface extensively. I suspect implementing `io.ReaderFrom` (or maybe `io.WriterTo`) in a few strategic places would fix this.

Anyway, I'm happy with the performance for the moment!

@mmatczuk
Contributor Author

mmatczuk commented Aug 13, 2019 via email

ncw added a commit that referenced this issue Aug 13, 2019
Before this change, if the caller didn't provide a hint, we would
calculate all hashes for reads and writes.

The new whirlpool hash is particularly expensive and that has become noticeable.

Now we don't calculate any hashes on upload or download unless hints are provided.

This means that some operations may run slower and these will need to be discovered!

It does not affect anything calling operations.Copy which already puts
the correct hints in.
ncw added a commit that referenced this issue Aug 13, 2019
Before this change for a post copy Hash check we would run the hashes sequentially.

Now we run the hashes in parallel for a useful speedup.

Note that this refactors the hash check in Copy to use the standard
hash checking routine.
ncw added a commit that referenced this issue Aug 13, 2019


Before this change we calculated the checksums when using
--ignore-checksum but ignored them at the end.

Now we don't calculate the checksums at all which is more efficient.
ncw added a commit that referenced this issue Aug 13, 2019
Before this change we didn't calculate or check hashes of transferred
files if --size-only mode was explicitly set.

This problem was introduced in 20da3e6 which was released with v1.37

After this change hashes are checked for all transfers unless
--ignore-checksum is set.
ncw added a commit that referenced this issue Aug 13, 2019
...unless --multi-thread-streams has been set explicitly

Fixes #3419
@mmatczuk
Contributor Author

I'll send a patch to fadvise to issue the syscall on a dedicated goroutine.

@mmatczuk
Contributor Author

mmatczuk commented Aug 14, 2019

Here is the patch: #3440. After applying it things look much better, but accounting is still an issue.

[pprof graph: profile001]

@mmatczuk
Contributor Author

mmatczuk commented Aug 14, 2019

@ncw if I change the body of `func (acc *Account) WithBuffer() *Account` to just `return acc` (and apply your patches plus my fadvise patch), it turns out that rclone is more or less as good as `cp`.

$ taskset -c 0,1 /usr/bin/time -v ./rclone copyto --ignore-checksum  1GB.bin /var/empty/rclone/1GB.bin.40
        Command being timed: "./rclone copyto --ignore-checksum 1GB.bin /var/empty/rclone/1GB.bin.40"
        User time (seconds): 0.07
        System time (seconds): 0.69
        Percent of CPU this job got: 68%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.12
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 27092
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1973
        Voluntary context switches: 11448
        Involuntary context switches: 146
        Swaps: 0
        File system inputs: 2097152
        File system outputs: 2097152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The pprof graph looks perfect!

[pprof graph: profile002]

So my question is why do we need this buffering?

@ncw
Member

ncw commented Aug 14, 2019

Excellent analysis :-)

So my question is why do we need this buffering?

It is read-ahead buffering, so by default rclone tries to get 16M ahead in a stream.

It was put in originally to solve performance problems with reading from Windows disks, so one thread could be reading ahead while the other was writing.

It is used by people who like to stream media off rclone mount so there is a buffer in case of reading hiccoughs.

I think your patch would be equivalent to doing --buffer-size 0 - it looks like it in my tests.

So we could skip the buffering if doing a local -> local copy easily enough. That might degrade performance on Windows though.

@mmatczuk
Contributor Author

That's fine by me, but maybe `--buffer-size` should be 0 by default?

@mmatczuk
Contributor Author

mmatczuk commented Aug 14, 2019

I discovered another issue on the hot path. If you take a look at the `time` output above, you can notice a large number of voluntary context switches; usually it's about 2k. I have straced that, and it leads me to the observation that during a copy one mutex is heavily hammered.

futex(0xc000090848, FUTEX_WAKE_PRIVATE, 1) = 1
write(6, "\300i\363\\Qg\247\317\312\21D\352F\222\314\3431R\362(@\234G\244\\\v\16?\\\312\206\216"..., 32768) = 32768
read(5, "\260\355\236q\261\336\360\367\237J;Gn6\27\221<X,\10\307\253\304\242Ui\0\365\326\30Hf"..., 32768) = 32768
write(6, "\260\355\236q\261\336\360\367\237J;Gn6\27\221<X,\10\307\253\304\242Ui\0\365\326\30Hf"..., 32768) = 32768
read(5, "H\365\v@\213\236b\23\336\345\326\r\265\255\205\270\t\363\356'\253\215F\256\310\375\17\36\374\260lB"..., 32768) = 32768
write(6, "H\365\v@\213\236b\23\336\345\326\r\265\255\205\270\t\363\356'\253\215F\256\310\375\17\36\374\260lB"..., 32768) = 32768
read(5, "\374\275O+\234\200\261{\20\5\257\200\315\2311\232\364\362\316^\213\212%\215&8\234\243\251O\212\200"..., 32768) = 32768
write(6, "\374\275O+\234\200\261{\20\5\257\200\315\2311\232\364\362\316^\213\212%\215&8\234\243\251O\212\200"..., 32768) = 32768
read(5, "\v\251\320\251\311\7\274\272J\204\344\361]\366\263\255C\347\27^\17\377/\305zrz\2642\355\3032"..., 32768) = 32768
futex(0xc000090848, FUTEX_WAKE_PRIVATE, 1) = 1
write(6, "\v\251\320\251\311\7\274\272J\204\344\361]\366\263\255C\347\27^\17\377/\305zrz\2642\355\3032"..., 32768) = 32768
read(5, "B\234\3644n\370\237V\207\246\23\235C\25\224\244\270\5z@\377Qo\22\376\263uL\264\345\30\252"..., 32768) = 32768
write(6, "B\234\3644n\370\237V\207\246\23\235C\25\224\244\270\5z@\377Qo\22\376\263uL\264\345\30\252"..., 32768) = 32768
read(5, "\322o^M\261\374#_\255MQ$~`\201\354J5\36\377W\261\324\243\351\236\334\3526\317#\310"..., 32768) = 32768
write(6, "\322o^M\261\374#_\255MQ$~`\201\354J5\36\377W\261\324\243\351\236\334\3526\317#\310"..., 32768) = 32768
read(5, "y\210\376W=sX\352J\261J\3376Yp'\257\324\245\305T\222c\326iX\360\216/P+\302"..., 32768) = 32768
write(6, "y\210\376W=sX\352J\261J\3376Yp'\257\324\245\305T\222c\326iX\360\216/P+\302"..., 32768) = 32768
read(5, "T\323\4\251\32?E\202\244\3\28V\340\326\206\244az\365E,{\326;\333O\20\327a\33\236"..., 32768) = 32768

Roughly every 80 writes we get a futex call:

$ grep 'futex(0xc0000999c8' rsync.strace | wc -l
426
$ grep write\(6 rsync.strace | wc -l
32768

I'm pretty sure this one is caused by https://github.com/rclone/rclone/blob/master/fs/accounting/accounting.go#L31. I read the comment on it and do not fully understand it. Can we use atomics here?

@ncw
Member

ncw commented Aug 14, 2019

I've merged the commits so far to master so they can get some testing in the wider world...

That's fine by me but maybe buffer-size should be 0 by default?

Just for local -> local copies? Or for everything?

I'm pretty sure this one is caused by https://github.com/rclone/rclone/blob/master/fs/accounting/accounting.go#L31, I read the comment on it and do not fully understand it. Can we use atomics here?

// The mutex is to make sure Read() and Close() aren't called
// concurrently.  Unfortunately the persistent connection loop
// in http transport calls Read() after Do() returns on
// CancelRequest so this race can happen when it apparently
// shouldn't.

The problem is that Read() and Close() can be called concurrently, because the http transport caches open connections.

This was put in as 74994a2

There is a discussion here: https://groups.google.com/forum/#!topic/golang-dev/0Nl6k5Sj6UU

It might be that the mutex is no longer needed. I tried to make the original test code go wrong with go1.12 but failed; it does go wrong with go1.5.

However removing that mutex makes me nervous...

I would have thought that mutex isn't under contention, so why does it appear in strace? I thought there was a fast path for an unlocked mutex?

// Lock locks m.
// If the lock is already in use, the calling goroutine
// blocks until the mutex is available.
func (m *Mutex) Lock() {
	// Fast path: grab unlocked mutex.
	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
		if race.Enabled {
			race.Acquire(unsafe.Pointer(m))
		}
		return
	}

I tried using Go's mutex profiling tools, and they say there aren't any contended mutexes.

@mmatczuk
Contributor Author

mmatczuk commented Aug 14, 2019

I need to do some more experimentation, will be off for a couple of days.

@mmatczuk
Contributor Author

Overall the large number of voluntary context switches is strange, but I do not think this is a major issue for now.

The performance now is good 👍

Thanks for your work on it @ncw.

@mmatczuk
Contributor Author

I did a block profile that points to `func (acc *Account) averageLoop()`; if I remove that, the futex syscalls go away. Maybe this should be opened as a new issue. I do not think removing it improves performance or significantly reduces the number of voluntary context switches, though.

[pprof graph: profile003]

@ncw
Member

ncw commented Aug 22, 2019

Thank you for your help in this issue - I think we've made significant progress speeding rclone up :-)

I'll close this issue for now, but do make more issues if you think there is more we can fix!

@ncw ncw closed this as completed Aug 22, 2019