
dropbox: please report rejected attempts to upload larger objects. #5008

Open
ivandeex opened this issue Feb 9, 2021 · 34 comments

Comments

@ivandeex (Member) commented Feb 9, 2021

Originally reported by @borkd at #2158 (comment)

The Dropbox site states a 50 GB upload limit by default, 250 GB with some add-ons (and 350 GB through the API, but that's unverified). It would be a good idea to reject attempts to upload larger objects.

https://help.dropbox.com/installs-integrations/sync-uploads/upload-limitations
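A client-side guard of the kind requested could look like the sketch below. The 350 GiB constant reflects the API figure discussed in this thread and should be treated as an assumption, not a verified limit; the function and file names are illustrative:

```go
package main

import (
	"fmt"
	"os"
)

// maxDropboxUpload is the per-file API limit discussed in this issue
// (350 GiB); treat the exact figure as an assumption.
const maxDropboxUpload = 350 * 1024 * 1024 * 1024

// checkUploadSize rejects a file up front instead of letting the
// upload fail partway through, which is what this issue asks for.
func checkUploadSize(size int64) error {
	if size > maxDropboxUpload {
		return fmt.Errorf("file is %d bytes, over the %d byte Dropbox limit", size, maxDropboxUpload)
	}
	return nil
}

func main() {
	// "bigfile.dat" is a placeholder path for illustration.
	info, err := os.Stat("bigfile.dat")
	if err != nil {
		fmt.Println("stat:", err)
		return
	}
	if err := checkUploadSize(info.Size()); err != nil {
		fmt.Println("refusing upload:", err)
		return
	}
	fmt.Println("size OK, starting upload")
}
```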

@rajatgoel (Contributor)

We (Dropbox) internally rolled out an API perf test daemon and tested the performance of uploading a 350GB file using API endpoints. It confirmed we can upload a 350GB file in ~50 mins using concurrent upload sessions (the client is not network/disk bound, so the real experience may be a little slower). This upload attempt may hit API rate-limit errors, but if you just back off and retry, the upload will succeed.

@ivandeex (Member, Author)

@rajatgoel Thank you for the info.

@borkd You might be interested. See the comment from the Dropbox devs above. As stated in #4656 (comment) (AFAICS), rclone plans to test the new Dropbox SDK with improved upload concurrency. Retrying at several levels is one of rclone's intrinsic features.

@borkd commented Mar 6, 2021

Looking forward to testing this

@ivandeex (Member, Author)

Blocks on #4656 (comment)

@ncw (Member) commented Mar 23, 2021

Can you try this beta?

v1.55.0-beta.5331.64427951b.fix-dropbox-batch-sync on branch fix-dropbox-batch-sync (uploaded in 15-30 mins)

I'm collecting together dropbox upload improvements in #5156

@borkd commented Apr 1, 2021

What's the expected performance impact (if any) for lots of small files with this beta? Haven't gotten to large files yet.

@ncw (Member) commented Apr 3, 2021

@borkd

What's the expected performance impact (if any) for lots of small files with this beta? Haven't gotten to large files yet.

It should be much improved, as it won't trip the Dropbox namespace locks and timeouts.

@borkd commented Apr 9, 2021

I've tried it a couple of times, still seeing Failed to copy: upload failed: too_many_requests/., but at a lower frequency than before. When small files are in the mix the overall upload speed is glacially slow.

When rclone ncdu is used:

[...]
<5>NOTICE: too_many_requests/: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/..: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/.: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/.: Too many requests or write operations. Trying again in 300 seconds.

If copy is running it gets affected, too:

2021-04-09 12:02:29 NOTICE: too_many_requests/..: Too many requests or write operations. Trying again in 300 seconds.
2021-04-09 12:02:29 NOTICE: too_many_requests/...: Too many requests or write operations. Trying again in 300 seconds.
8.509M / 21.749 GBytes, 0%, 6.221 kBytes/s, ETA 6w9h47m2s (xfr#195/10207)

@borkd commented Apr 9, 2021

Also, I have just run into an internal error with a subsequent segmentation violation, which I have never encountered with rclone before. This is running on a server with ECC RAM, and the source data is fetched from a ZFS pool with no problems there.

@ncw (Member) commented Apr 10, 2021

Also, I have just run into an internal error with a subsequent segmentation violation, which I have never encountered with rclone before.

If you could post the full backtrace that would be helpful - it might be an rclone bug.

I've tried it a couple of times, still seeing Failed to copy: upload failed: too_many_requests/., but at a lower frequency than before. When small files are in the mix the overall upload speed is glacially slow.

Are you doing uploads from more than one rclone process at once? Dropbox doesn't like that.

When rclone ncdu is used:

[...]
<5>NOTICE: too_many_requests/: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/..: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/.: Too many requests or write operations. Trying again in 300 seconds.
<5>NOTICE: too_many_requests/.: Too many requests or write operations. Trying again in 300 seconds.

So just rclone ncdu triggered the problem.

If copy is running it gets affected, too:

2021-04-09 12:02:29 NOTICE: too_many_requests/..: Too many requests or write operations. Trying again in 300 seconds.
2021-04-09 12:02:29 NOTICE: too_many_requests/...: Too many requests or write operations. Trying again in 300 seconds.
8.509M / 21.749 GBytes, 0%, 6.221 kBytes/s, ETA 6w9h47m2s (xfr#195/10207)

Yes, it will.

Dropbox likes all operations on one namespace to be serialized into a single upload thread. If you don't do this it still works, but if you go too fast you get those "too_many_requests" errors and some long timeouts.
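One client-side way to honour the "one upload thread per namespace" constraint is to funnel every write through a single worker goroutine. This is an illustrative sketch, not rclone's actual mechanism:

```go
package main

import "fmt"

// serializer runs all submitted operations one at a time in a single
// goroutine, mirroring the one-upload-thread-per-namespace rule.
type serializer struct {
	ops chan func()
}

func newSerializer() *serializer {
	s := &serializer{ops: make(chan func())}
	go func() {
		for op := range s.ops {
			op() // strictly one operation at a time
		}
	}()
	return s
}

// Do queues op on the worker and blocks until it has finished running.
func (s *serializer) Do(op func()) {
	done := make(chan struct{})
	s.ops <- func() { op(); close(done) }
	<-done
}

func main() {
	s := newSerializer()
	count := 0
	for i := 0; i < 3; i++ {
		s.Do(func() { count++ }) // safe: only the worker touches count
	}
	fmt.Println("ran", count, "operations")
}
```

Callers on any number of goroutines can share one serializer; their operations interleave but never overlap.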

@borkd commented Apr 10, 2021

$ rclone-batch copy data/set/ dropbox:set/ --transfers=4 -P --stats-one-line
2021-04-09 16:54:50 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 1: wait for batch completion: check failed: internal_error/...
384.552M / 22.614 GBytes, 2%, 250.631 kBytes/s, ETA 1d1h50m41s (xfr#241/10253)panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xec3141]

goroutine 82 [running]:
github.com/rclone/rclone/backend/dropbox.(*batcher).commitBatch(0xc000141320, 0x213fce8, 0xc000128000, 0xc00204e010, 0x1, 0x1, 0xc00204e018, 0x1, 0x1, 0x0, ...)
        github.com/rclone/rclone/backend/dropbox/batcher.go:201 +0x241
github.com/rclone/rclone/backend/dropbox.(*batcher).commitLoop.func1()
        github.com/rclone/rclone/backend/dropbox/batcher.go:246 +0x9c
github.com/rclone/rclone/backend/dropbox.(*batcher).commitLoop(0xc000141320, 0x213fce8, 0xc000128000)
        github.com/rclone/rclone/backend/dropbox/batcher.go:277 +0x2b1
created by github.com/rclone/rclone/backend/dropbox.newBatcher
        github.com/rclone/rclone/backend/dropbox/batcher.go:100 +0x245

@borkd commented Apr 10, 2021

For reference, the size of the dataset in question is ~2.8T, and ~800k files.

@ncw (Member) commented Apr 10, 2021

I can fix the bug from that backtrace, thanks.

Are you doing uploads from more than one rclone process at once? Dropbox doesn't like that.

Thoughts on that?

@borkd commented Apr 10, 2021

Are you doing uploads from more than one rclone process at once? Dropbox doesn't like that.

Thoughts on that?

I know how fragile the Dropbox API is when it comes to call frequency, and I do try to serialize uploads and any other lifecycle-management operations when possible. That can easily become a roulette without global locking and/or scheduling.

Having to manage serialization on the customer side becomes a serious problem when daily changes within any one namespace start to take a week or longer to sync.

@ncw (Member) commented Apr 11, 2021

Are you doing uploads from more than one rclone process at once? Dropbox doesn't like that.

Thoughts on that?

I know how fragile the Dropbox API is when it comes to call frequency, and I do try to serialize uploads and any other lifecycle-management operations when possible. That can easily become a roulette without global locking and/or scheduling.

Having to manage serialization on the customer side becomes a serious problem when daily changes within any one namespace start to take a week or longer to sync.

This should be less of a problem when using batch mode as it is only the batch commits which shouldn't overlap, not the actual uploads.
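The shape being described — per-file uploads in parallel, batch commits strictly serialized — can be sketched like this. It is illustrative only, not rclone's actual batcher; the type and endpoint names in the comments are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// uploader serializes only the batch commit, not the uploads themselves.
type uploader struct {
	commitMu sync.Mutex
}

// uploadBatch uploads each file concurrently, then commits the whole
// batch under a mutex so two commits can never overlap. It returns the
// number of files committed.
func (u *uploader) uploadBatch(files []string) int {
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// stand-in for the upload_session/start + append calls,
			// which may safely proceed in parallel
			fmt.Println("uploaded", name)
		}(f)
	}
	wg.Wait()

	u.commitMu.Lock() // only one finish_batch in flight at a time
	defer u.commitMu.Unlock()
	fmt.Println("committed batch of", len(files))
	return len(files)
}

func main() {
	u := &uploader{}
	u.uploadBatch([]string{"a.txt", "b.txt", "c.txt"})
}
```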

$ rclone-batch copy data/set/ dropbox:set/ --transfers=4 -P --stats-one-line
2021-04-09 16:54:50 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 1: wait for batch completion: check failed: internal_error/...
384.552M / 22.614 GBytes, 2%, 250.631 kBytes/s, ETA 1d1h50m41s (xfr#241/10253)panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xec3141]

goroutine 82 [running]:
github.com/rclone/rclone/backend/dropbox.(*batcher).commitBatch(0xc000141320, 0x213fce8, 0xc000128000, 0xc00204e010, 0x1, 0x1, 0xc00204e018, 0x1, 0x1, 0x0, ...)
        github.com/rclone/rclone/backend/dropbox/batcher.go:201 +0x241
github.com/rclone/rclone/backend/dropbox.(*batcher).commitLoop.func1()
        github.com/rclone/rclone/backend/dropbox/batcher.go:246 +0x9c
github.com/rclone/rclone/backend/dropbox.(*batcher).commitLoop(0xc000141320, 0x213fce8, 0xc000128000)
        github.com/rclone/rclone/backend/dropbox/batcher.go:277 +0x2b1
created by github.com/rclone/rclone/backend/dropbox.newBatcher
        github.com/rclone/rclone/backend/dropbox/batcher.go:100 +0x245

I can't work out which version of rclone you are using here. It doesn't look like the latest from here: https://beta.rclone.org/branch/fix-dropbox-batch-sync/v1.55.0-beta.5334.ac7acb9f0.fix-dropbox-batch-sync/

Can you try again with that one please?

@borkd commented Apr 12, 2021

I have built my binary from the branch suggested earlier in the comments:

$ rclone-batch version
rclone v1.55.0-beta.5331.64427951b.fix-dropbox-batch-sync
- os/type: linux
- os/arch: amd64
- go/version: go1.16.2
- go/linking: static
- go/tags: cmount

@borkd commented Apr 12, 2021

Can try this beta?

v1.55.0-beta.5331.64427951b.fix-dropbox-batch-sync on branch fix-dropbox-batch-sync (uploaded in 15-30 mins)

I'm collecting together dropbox upload improvments in #5156

Built shortly after this was referenced ^^^

@ncw (Member) commented Apr 12, 2021

6442795

That is what I needed to know - the commit number of the build you were using.

The crash happened here

github.com/rclone/rclone/backend/dropbox.(*batcher).commitBatch(0xc000141320, 0x213fce8, 0xc000128000, 0xc00204e010, 0x1, 0x1, 0xc00204e018, 0x1, 0x1, 0x0, ...)
github.com/rclone/rclone/backend/dropbox/batcher.go:201 +0x241

entries := complete.Entries

Which looks like complete was nil at that point.

It was called from

github.com/rclone/rclone/backend/dropbox.(*batcher).commitLoop.func1()
github.com/rclone/rclone/backend/dropbox/batcher.go:246 +0x9c

err := b.commitBatch(ctx, items, results)

So why was complete nil?

I think it was most likely because the batch didn't complete in 120 seconds and this was called

return batchStatus.Complete, nil

This is fixed in the current version, so hopefully it won't happen again.
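A defensive version of that spot would treat a nil result as a retryable failure instead of dereferencing it. The types below are stubs standing in for the Dropbox SDK's real ones, and this is a sketch of the idea, not the actual rclone patch:

```go
package main

import (
	"errors"
	"fmt"
)

// Stub types standing in for the Dropbox SDK's batch result.
type Entry struct{ Name string }
type FinishBatchResult struct{ Entries []Entry }

// entriesOf guards against the nil result that caused the panic at
// batcher.go:201: if the batch never completed, return an error the
// caller can retry instead of dereferencing nil.
func entriesOf(res *FinishBatchResult) ([]Entry, error) {
	if res == nil {
		return nil, errors.New("batch did not complete: no result returned")
	}
	return res.Entries, nil
}

func main() {
	if _, err := entriesOf(nil); err != nil {
		fmt.Println("handled safely:", err)
	}
}
```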

@borkd commented Apr 12, 2021

Testing 5394 now

@borkd commented Apr 13, 2021

@ncw - no segfault this time, but some more internal_error feedback. Note I have specified a max age to copy only changes from the last 24 hours. I was careful not to issue any potentially competing operations against the Dropbox namespace/bucket, or indeed the entire account, so API rate limiting is not accidentally triggered.

$ rclone-batch version
rclone v1.56.0-beta.5394.d16702566.fix-dropbox-batch-sync
- os/version: debian 10.6 (64 bit)
- os/kernel: 5.4.65-1-pve (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.16.3
- go/linking: static
- go/tags: none

$ rclone-batch copy data/set/ dropbox:set/ --transfers=8 -P --stats --one-line --max-age 1d
2021-04-12 19:53:03 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 5: wait for batch failed after 120 tries: internal_error/...
2021-04-12 20:09:00 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 2: wait for batch failed after 120 tries: batch didn't complete
2021-04-12 20:11:17 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 1: wait for batch failed after 120 tries: batch didn't complete
15.435G / 46.646 GBytes, 33%, 260.075 kBytes/s, ETA 1d10h57m17s (xfr#5237/15253)

@ncw (Member) commented Apr 17, 2021

@ncw - no segfault this time, but some more internal_error feedback. Note I have specified max age to copy changes from last 24 hours. I was careful to not issue any potentially competing operations against the dropbox namespace/bucket, and actually the entire account so API rate limiting is not accidentally triggered.

$ rclone-batch version
rclone v1.56.0-beta.5394.d16702566.fix-dropbox-batch-sync
- os/version: debian 10.6 (64 bit)
- os/kernel: 5.4.65-1-pve (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.16.3
- go/linking: static
- go/tags: none

$ rclone-batch copy data/set/ dropbox:set/ --transfers=8 -P --stats --one-line --max-age 1d
2021-04-12 19:53:03 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 5: wait for batch failed after 120 tries: internal_error/...
2021-04-12 20:09:00 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 2: wait for batch failed after 120 tries: batch didn't complete
2021-04-12 20:11:17 ERROR : Dropbox root 'set': sync batch commit: failed to commit batch length 1: wait for batch failed after 120 tries: batch didn't complete
15.435G / 46.646 GBytes, 33%, 260.075 kBytes/s, ETA 1d10h57m17s (xfr#5237/15253)

No segfault is an improvement.

However, the batch not completing within 2 minutes is a problem, as is internal_error/..., which is a message from Dropbox rather than rclone, indicating something has gone wrong at their end.

Can you tell me something about the number and size distribution of the upload?

If you could do a copy with -vv --dump responses and attach the (very large!) log that would be very useful - then we can see what is happening hopefully - thanks.

@borkd commented Apr 18, 2021

Sure. This particular set being synced is a relatively small but constantly evolving backup repository. Current footprint is 3TB in 1.4M files scattered across a number of sparsely populated directories. Data there changes multiple times a day.

Rclone can handle it, though it takes quite some time. For example, with the defaults a simple check without any data transfer to a nearby swiftstack backend takes about 95 minutes. With tens of checkers this time gets cut in half. A bunch of transfers need to be specified for the actual copy to happen at an acceptable overall speed. The Dropbox and Box backends are where the pain and suffering get real...

The set above is the smallest of what I'm trying to push through. The next one changes almost as often but is about an order of magnitude larger. Attempts to keep it up to date with Dropbox and/or Box have never been successful.

@ncw (Member) commented Apr 19, 2021

If you can get me a log with -vv I'll attempt to figure out what is wrong! If you don't want to make it public then email it (or a link to it) to nick@craig-wood.com - put a link to this github issue in the email for context - thanks

@borkd commented Apr 22, 2021

The very verbose session with --dump responses has been running since then. What kind of overhead/slowdown does that impose compared to normal operation? I'm curious whether we are altering the transaction profile enough to not trigger anything on the Dropbox side.

@ncw (Member) commented Apr 25, 2021

A good question. Some overhead, certainly. Are you logging it straight to a file? That will be the quickest way. It does change the processing path of HTTP requests, so I guess it could mean that the bug doesn't show.

@borkd commented Apr 27, 2021

It's been running, or rather crawling along, for 172 hrs and reports another 3 weeks and 5 days to go. It took easily more than a month to find and reproduce the original issue with large uploads, and this is becoming time-consuming as well. Do you have any ideas on how to make this troubleshooting more interactive?

@ncw (Member) commented Apr 27, 2021

It's been running, or rather crawling along for 172hrs and reports another 3 weeks and 5 days to go.

Wow!

It took easily more than a month to find and reproduce the original issue with large uploads, this is becoming time consuming as well.

:-(

Do you have any ideas how to make this troubleshooting more interactive?

Is there a smaller subset of the data?

Can you grep the logs so far and see if the problem has occurred?

@borkd commented Apr 27, 2021

There are some rate limiting messages in the log, so I will send you the incomplete compressed log with whatever accumulated so far.

@ncw (Member) commented Apr 28, 2021

Thanks for the log.

I can see two events in here...

First, a batch failed to complete after 120 seconds - everything here looks completely normal except that the batch (with 1 file) never finished within those 120 seconds.

2021/04/19 20:10:19 DEBUG : HTTP REQUEST (req 0xc001103900)
2021/04/19 20:10:19 DEBUG : POST /2/files/upload_session/finish_batch/check HTTP/1.1
Host: api.dropboxapi.com
User-Agent: rclone/v1.56.0-beta.5394.d16702566.fix-dropbox-batch-sync
Content-Length: 113
Authorization: XXXX
Content-Type: application/json
Accept-Encoding: gzip

2021/04/19 20:10:19 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/04/19 20:10:19 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/04/19 20:10:19 DEBUG : HTTP RESPONSE (req 0xc001103900)
2021/04/19 20:10:19 DEBUG : HTTP/2.0 200 OK
Content-Length: 23
Accept-Encoding: identity,gzip
Cache-Control: no-cache
Content-Type: application/json
Date: Tue, 20 Apr 2021 00:10:19 GMT
Server: envoy
X-Content-Type-Options: nosniff
X-Dropbox-Request-Id: f746ed86c87a4fdcaf660a12d19f1a24
X-Dropbox-Response-Origin: far_remote
X-Frame-Options: SAMEORIGIN
X-Server-Response-Time: 109

{".tag": "in_progress"}
2021/04/19 20:10:20 ERROR : Dropbox root 'jbod2': sync batch commit: failed to commit batch length 1: wait for batch failed after 120 tries: batch didn't complete

So rclone gives up here and tries to finish the next batch

2021/04/19 20:10:21 DEBUG : HTTP REQUEST (req 0xc001373b00)
2021/04/19 20:10:21 DEBUG : POST /2/files/upload_session/finish_batch HTTP/1.1
Host: api.dropboxapi.com
User-Agent: rclone/v1.56.0-beta.5394.d16702566.fix-dropbox-batch-sync
Content-Length: 907
Authorization: XXXX
Content-Type: application/json
Accept-Encoding: gzip

2021/04/19 20:10:21 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2021/04/19 20:10:21 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2021/04/19 20:10:21 DEBUG : HTTP RESPONSE (req 0xc001373b00)
2021/04/19 20:10:21 DEBUG : HTTP/2.0 500 Internal Server Error
Content-Length: 0
Accept-Encoding: identity,gzip
Content-Security-Policy: sandbox allow-forms allow-scripts
Content-Type: text/plain; charset=utf-8
Date: Tue, 20 Apr 2021 00:10:21 GMT
Server: envoy
X-Dropbox-Request-Id: c787b53ec12e42eba8e0d650a28086ff
X-Dropbox-Response-Origin: far_remote

And immediately gets a 500 error. Rclone retries this 10 times then gives up.

The two problems look related...

I will ask on the dropbox developer forum to see if they have any hints.

@borkd commented Jun 23, 2021

@ncw - did you ever get some feedback on this from the Dropbox developers?

@ncw (Member) commented Jul 5, 2021

The feedback was: try for longer than 2 minutes. I haven't implemented that yet though.

And the 500 error was a "transient availability" problem.

@borkd commented Jul 28, 2021

Rate-limit errors do not show with the latest release, but I have run into internal_error/ and a complete stall for many hours during uploads. I have opened a new issue, #5491, but if the problems are related it might be a duplicate.

@DurvalMenezes (Contributor)

We (Dropbox) internally rolled out an API perf test daemon and tested the performance of uploading a 350GB file using API endpoints. It confirmed we can upload a 350GB file in ~50 mins using concurrent upload sessions (the client is not network/disk bound, so the real experience may be a little slower). This upload attempt may hit API rate-limit errors, but if you just back off and retry, the upload will succeed.

@rajatgoel: the 350GB you mentioned, does it mean 350 GiB (i.e., 350 * 1024 * 1024 * 1024) or 350 decimal GB (i.e., 350 * 1000 * 1000 * 1000)?

TIA!

@rajatgoel (Contributor)

does it mean 350GiB (ie, 350 * 1024 * 1024 * 1024) or 350 decimal GB (ie, 350 * 1000 * 1000 * 1000)?

It's 350 * 1024 * 1024 * 1024 according to the code. I have asked the team to clarify this in the documentation.
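For reference, the binary and decimal figures differ by roughly 7% at this size, about 26 decimal GB:

```go
package main

import "fmt"

func main() {
	const gib = int64(1 << 30)      // 1 GiB (binary)
	const gb = int64(1_000_000_000) // 1 decimal GB
	fmt.Println(350 * gib)          // 375809638400 bytes
	fmt.Println(350 * gb)           // 350000000000 bytes
	fmt.Println(350*gib - 350*gb)   // 25809638400 bytes of difference
}
```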
