Concurrent uploads #7

Closed
procmail opened this issue Apr 16, 2015 · 12 comments

Comments

@procmail

Hi,

Can this do concurrent uploads? If so, how many?

Thanks!

@yadayada
Owner

Strictly speaking, no. But it's possible to launch multiple instances, e.g. to upload different directories.

@chrisidefix
Contributor

Have you tested this (running multiple instances in parallel)?
I wonder whether this won't cause conflicts when multiple processes try to write to the same database - or is there some magic that keeps the database consistent even when two or more processes write to it at the same time?

PS: this is very high on my wish-list as well, since the ACD Desktop App can't even do that, but then again - what can it do 💥 ?

@chrisidefix
Contributor

To be clear - this is a great addition, but you have to consider that it will only be useful for people who have large upload bandwidth available. In my experience the server caps the speed at 10 MB/s (80 Mbit/s) per connection (e.g. you can open a few browser tabs and upload large files in parallel to try this out). Many people may have fast download connections, but nowhere near these upload rates.

Imagine your ISP limits your upload to 10 MB/s. What use is it really to upload 10 files at the same time with each file uploading at 1 MB/s, when you are just as fast uploading them one after the other with each file uploading at 10 MB/s?

As I said, this only becomes interesting when you have multiples of 10 MB/s in upload speed.

If you have a proper glass-fibre connection, you might get there, but with anything else, forget about it.
Example plans:

Example US Providers

V.Fios: you need a plan above the 75/75 Mbit/s option and you also need to actually reach those speeds for it to make any sense (matching upload & download speeds are nice, but not always common). They seem to offer up to 500/500 Mbit/s, which should be enough for just over 6 parallel connections.
C.Xfinity: only offers asymmetric options with upload speeds around 20 Mbit/s, which is still 4 times slower than the max. bandwidth available per connection.
G.Fibre: here you get 1000 Mbit/s, so you could maintain 12.5 connections at the same time (see the quick check below).
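
A quick back-of-the-envelope check of those counts (nothing provider-specific, just the arithmetic, assuming the ~80 Mbit/s per-connection cap from above):

```python
# Parallel connections a plan can saturate, assuming the ~80 Mbit/s
# per-connection server-side cap observed above.
PER_CONNECTION_MBIT = 80

for plan_mbit in (75, 500, 1000):
    print('%4d Mbit/s upload: %.2f connections'
          % (plan_mbit, plan_mbit / PER_CONNECTION_MBIT))
# 75 -> 0.94, 500 -> 6.25, 1000 -> 12.50
```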

Alright, I am drifting off-topic, I'm afraid, but my point was simply to show that you need access to a very fast connection for this to be beneficial.

@chrisidefix
Contributor

I wrote a tiny wrapper using import multiprocessing just to test what would happen if acd_cli.py runs in multiple parallel instances (sketched below). So far it seems to work fine, but I could imagine problems occurring if at some point two data transfers complete at exactly the same time...? I haven't seen this happen yet, but it would be good to test this properly.
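
The idea, roughly - a minimal sketch, with placeholder directory names and a placeholder acd_cli invocation rather than my exact commands:

```python
# Hypothetical test wrapper: spawn one acd_cli process per local directory.
import multiprocessing
import subprocess

def upload(local_dir):
    # Each worker is a full, independent acd_cli instance; the processes
    # share nothing except the on-disk sqlite cache.
    subprocess.call(['acd_cli', 'upload', local_dir, '/remote_folder'])

if __name__ == '__main__':
    dirs = ['./part1', './part2', './part3']  # placeholder local directories
    with multiprocessing.Pool(processes=len(dirs)) as pool:
        pool.map(upload, dirs)
```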

@chrisidefix
Contributor

This is turning into a bit of a lonesome conversation 👽 but since I am testing this, I thought I should share my findings.

After good results uploading 2 files in parallel, I realised that the performance will also heavily depend on the disk read speeds. If you are syncing files from an external USB 2.0 hard drive, for example, you're probably not going to exceed 30 MB/s (depending on your drive), which means you will only want to read 2-3 files in parallel off that disk.

I am currently testing 12 files in parallel, but peak transfer rates are stalling at 30 MB/s, even though the connection would support much more than that. I guess I should try with a faster drive to get better results.

Also noteworthy - CPU times are quite reasonable. Every process uses about 5% to 6% of a single CPU core, which leaves you plenty of headroom if you are running on a somewhat modern multi-core CPU.

@yadayada
Owner

yadayada commented May 4, 2015

@chrisidefix If there are two "overlapping" writes to the sqlite database, the instance that tries to write later should crash because it cannot acquire a lock.
There may also be background hashing going on, which lowers net disk transfer speeds; this applies to files larger than 500 MB.
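
For illustration, the locking behavior in isolation - plain sqlite3 only; the file and table names are made up and this is not acd_cli's actual code:

```python
# Two connections to one database file; the second writer fails once the
# first holds the write lock.
import sqlite3

a = sqlite3.connect('nodes.db', timeout=0.1, isolation_level=None)
b = sqlite3.connect('nodes.db', timeout=0.1, isolation_level=None)

a.execute('CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY)')
a.execute('BEGIN IMMEDIATE')                             # a takes the write lock
a.execute("INSERT OR REPLACE INTO nodes VALUES ('x')")

try:
    b.execute("INSERT OR REPLACE INTO nodes VALUES ('y')")  # waits, then gives up
except sqlite3.OperationalError as e:
    print(e)                                             # "database is locked"

a.execute('COMMIT')
```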

However, there may also be unnecessary auth token refreshes.

PS: My maximum upload speed is about 8 Mbit/s, so this isn't very high on my priority list.

@chrisidefix
Contributor

It may become interesting again when you either:

(1) want to upload many small files
(2) start downloading files

Download speed for many folks could very well be above the limit, but you are right, it's a nice-to-have feature whose implementation effort may outweigh its actual benefit.

yadayada added a commit that referenced this issue May 16, 2015
* added QueuedLoader
  - concurrent transfers (#7)
  - retry on error (disabled by default)
* retry_on decorator added for transfer functions (jobs)
* api: add multiple read/write callbacks api for ul/dl
* api: progress printing removed
* api: fix for resuming of incomplete downloads
* db conn thread check disabled
* single file progress wrapper FileProgress added
* progress aggregator MultiProgress added
* progress speed determination improved
* download behavior changed to skip existing files
@dansku

dansku commented May 16, 2015

How does the concurrent download work?

@yadayada
Owner

There is now an -x argument. E.g. acd_cli dl -x 4 /my_remote_folder for 4 simultaneous connections. Same thing for uploads.

@chrisidefix
Contributor

@yadayada Thanks for implementing this. This commit shows a significant performance improvement - I tested this feature and can confirm that even for slower upload speeds it is well worth using parallel uploads. It allows a much more continuous use of the available bandwidth. The only downside is an elevated use of CPU resources. Previously, when I ran 4 processes in parallel, CPU use would be at about 20% (~ 5% per process). Now for -x 4 CPU use is at 80% (~ 20% per thread) 😞 but at least there should be no issues with DB locks.

All in all, you could consider making -x 2 the default - it should make upload/download faster in any case.

UPDATE: I have been continuously uploading large files (at least 2 GB in size) from an external USB drive with acd_cli.py for several hours. Python 3.4 is maxing out one of my cores at 100% by now (running on OS X 10.10.3).

@procmail
Author

I am also finding this a great feature, especially when uploading directories with many small files. With small files, uploading serially won't maximize the bandwidth usage.

However, I'm not seeing a big jump in CPU usage with -x 8. Python's CPU usage is currently 20%-50%, hovering at 20+% most of the time.

@yadayada
Owner

I'm keeping one thread as the default for now, because I'm currently not sure whether it's safe to insert into sqlite from different threads under all conditions. Unfortunately, sqlite3 seems to be safe to use from multiple processes, but not necessarily from multiple threads.
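
Roughly the issue, in plain sqlite3 rather than acd_cli's actual code: Python's sqlite3 module by default refuses to share one connection between threads, and disabling that check (as the "db conn thread check disabled" changelog entry above does) leaves serializing the writes to the caller:

```python
# check_same_thread=False disables Python's per-thread connection guard;
# an explicit lock then serializes the writes. File/table names are made up.
import sqlite3
import threading

conn = sqlite3.connect('nodes.db', check_same_thread=False)
conn.execute('CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY)')
write_lock = threading.Lock()

def insert(node_id):
    with write_lock:  # without this, concurrent writes may raise or interleave
        conn.execute('INSERT OR REPLACE INTO nodes VALUES (?)', (node_id,))
        conn.commit()

threads = [threading.Thread(target=insert, args=('n%d' % i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```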
