Subpipeline data transfer timeout #237

Closed
3 of 6 tasks
mattwarkentin opened this issue Nov 30, 2020 · 17 comments

@mattwarkentin
Contributor

mattwarkentin commented Nov 30, 2020

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

Hi @wlandau,

I'm running into an issue transferring some probably "too big" data to the HPC via SSH using clustermq. For some additional context, the data are torch tensors, in case that matters. There are two files with the same format/structure (torch tensors), but they vary in size in the _targets/ data store.

One file is 221 MB on disk, the other is 1.1 GB. Admittedly, maybe I shouldn't be sending files that big over SSH, but they really aren't THAT big. I ran a test where all I did was check the class of the objects on the worker (just a lightweight job that forces the data to be transferred over). The 221 MB tensor took about a minute to transfer and complete, while the larger file "ran" until it hit my clustermq timeout of 20 minutes with no master/worker communication.

I really don't even know how to diagnose the issue. Is this subpipeline related? clustermq related? It shouldn't take more than 20x longer to transfer the target value of a file that is only 4x larger, and 1 GB just isn't that big. When I run locally, the whole job takes about 20 seconds to complete.

Happy to run any tests I can to diagnose this issue.

@mattwarkentin
Contributor Author

I know it's not the same, but when I transfer the larger file by scp it takes 26 seconds. I don't know enough about what the sub-pipeline is doing with serialization, etc., but something is bogging down and I hope to sort it out.

@mattwarkentin
Contributor Author

The same timeout also happens when transferring a file that is 800 MB in the _targets/ store. These files are larger when loaded into memory; I think the 1.1 GB file is roughly 3.5 GB in memory.

@wlandau
Member

wlandau commented Nov 30, 2020

Maybe @mschubert could confirm, but I think clustermq tries to avoid transferring data that large. At this point, I would recommend running the main process in targets directly on the login node with retrieval = "worker".

I guess one thing to confirm is that the subpipeline is about the same size as the torch data you are sending. If the subpipeline duplicates data (which I doubt it does), then that would be a bug in targets.
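
A minimal sketch of that setup, assuming the pipeline is launched from the login node so the workers can see the _targets/ store (the file path and target names below are made up):

# _targets.R, run from the login node via tar_make_clustermq()
library(targets)

# workers read their own dependencies from _targets/ and write their own
# results back, so large values never pass through the main R process
tar_option_set(storage = "worker", retrieval = "worker")

list(
  # hypothetical stand-ins for the torch tensors discussed above
  tar_target(tensor_path, "data/tensor.pt", format = "file"),
  tar_target(tensor_class, class(torch::torch_load(tensor_path)))
)

With that in place, tar_make_clustermq(workers = 1) keeps all of the data movement on the cluster's shared file system.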

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

Would retrieval = "worker" actually work? The HPC doesn't have access to the local data store, unless it actually does have access via SSH?

Also, I thought clustermq just threw a warning for larger data, but didn't strictly prohibit it. I could be wrong.

@wlandau
Member

wlandau commented Nov 30, 2020

Would retrieval = "worker" actually work? The HPC doesn't have access to the local data store, unless it actually does have access via SSH?

I know it's inconvenient, but you would have to manually SSH into the cluster and run the pipeline from the login node instead of running from your local computer using the clustermq SSH connector. Your choice, obviously.

Also, I thought clustermq just threw a warning for larger data, but didn't strictly prohibit it. I could be wrong.

Yeah, there's no hard limit. You may even be able to increase the timeout. That said, 20 minutes seems like a lot, especially since 221 MB only took a minute.
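
To make the two setups concrete, a sketch of the clustermq options involved (the host name and template file are placeholders):

# current setup: pipeline driven from the local machine, cluster reached over SSH
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host  = "user@login.cluster.example"  # placeholder host
)

# suggested setup: SSH in yourself and run the pipeline on the login node
# against the scheduler directly (SLURM assumed here), so target data stays
# on the cluster file system instead of crossing the SSH link
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm_clustermq.tmpl"  # placeholder template file
)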

@mattwarkentin
Contributor Author

I've hit the frustration level where I'm going to split the data processing part of the pipeline into its own targets project locally, since it needs access to data that has to live locally.

And then the model building targets will be a separate targets project on the HPC so I don't have to deal with any SSH. I was hoping to keep it all together in one project, but I'm spending more time debugging SSH deployment than it's worth, I think.

@mschubert

mschubert commented Nov 30, 2020

@mattwarkentin Can you please try to run just the transfer via clustermq and see how long it takes?

It shouldn't take longer than scp plus a slight overhead for serialization (maybe 5-10 seconds). If it does, that's a bug.

edit: I should clarify that it is only supposed to be that fast if the file size is the same as the object size; otherwise it scales with the object size.

@mschubert

Something like:

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)

should do
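
To time the transfer itself, as suggested above, one could wrap a similar call (this wrapper is a sketch, not from the thread; 1e8 doubles is roughly 800 MB in memory):

x <- rnorm(1e8)  # roughly 800 MB in memory
system.time(
  clustermq::Q(function(y) length(y), y = list(x), n_jobs = 1)
)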

@mattwarkentin
Contributor Author

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found

@mschubert

That's clearly wrong. I'll look into it 😅
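
(Purely as an untested guess while that is investigated: the error may come from object.size() losing its namespace environment when it is shipped to the worker, so wrapping the call in an anonymous function might sidestep it.)

# possible workaround sketch: call object.size() inside a closure so the symbol
# is resolved on the worker instead of being serialized with the function
clustermq::Q(function(x) utils::object.size(x), x = list(rnorm(1e8)), n_jobs = 1)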

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

The same error is produced when using very small sample sizes in rnorm() (e.g. rnorm(1e1)), so it doesn't seem to be a size issue and probably (?) isn't related to the original issue, right?

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

@wlandau So far it seems like this is unrelated to targets or the subpipeline. After a few tests (mschubert/clustermq#222), it looks like the SSH data transfer time may scale nearly cubically (or worse) with in-memory object size.

@wlandau
Member

wlandau commented Nov 30, 2020

Thanks for tracking that down, @mattwarkentin!

@liutiming
Contributor

I realised mschubert/clustermq#229 is probably relevant to this issue, too. Jobs could still be sent when the data was about 1 GB, but when it is larger, all my attempts to complete the job fail... I am not using SSH here, though - I am running directly on the cluster.

@wlandau
Member

wlandau commented Jan 2, 2021

@liutiming, have you tried storage = "worker" and retrieval = "worker" in tar_target() or tar_option_set()?
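
For reference, both forms look roughly like this (a sketch; the target and function names are made up):

library(targets)

# globally, in _targets.R
tar_option_set(storage = "worker", retrieval = "worker")

# or per target, for just the large objects
tar_target(
  big_fit,
  fit_model(big_tensor),
  storage = "worker",
  retrieval = "worker"
)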

@liutiming
Contributor

liutiming commented Jan 2, 2021

Thank you @wlandau! Now it is finally working!
I tried the following

  1. followed your advice by configuring storage and retrieval in the target
  2. split the large file (about 4 GB in lobstr::object_size) into 22 files
  3. used within-target disk I/O for the large file, or for the 22 files generated from it, to reduce memory usage
  4. changed %dopar% in the target to %do% to avoid clustermq submitting jobs within each target

2, 3, and 2&3 alone do not work.
2&3&4 works, but the targets are generated at a very slow pace.
1&2&3&4 works very fast.

Just to confirm: is mschubert/clustermq#229 (comment) a wrong pattern? i.e. we are not supposed to set register_dopar_cmq and then use foreach %dopar% within the function for one target? Interestingly though, I do not think any job was submitted anyway until I put register_dopar_cmq into ~/.Rprofile (removed before trying the above steps).

@mschubert so I am wondering whether I should have used registerDoParallel instead of register_dopar_cmq to take advantage of the multiple cores that each worker has: registerDoParallel uses the cores within each worker, whereas register_dopar_cmq submits jobs from within each worker, which is not recommended by @wlandau if I understand correctly (see the sketch at the end of this comment).

For reference:
There is still a large-data warning from clustermq (about 2 GB) after trying 1&2&3&4 together, same as before, but I am happy that it is running. For future reference, that target has about 1,000 branches when run against the one large file and 22,000 branches when run against the 22 files. This might have complicated the memory issues a bit, but I have always used head(map()) to test, so the number of branches tested is at most 10. I will look into creating tar_group, since each branch now takes only about 10 seconds to finish, so it is wise to lengthen each branch's runtime and reduce the total overhead.
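
A sketch of the registerDoParallel pattern mentioned above (the chunk function and core count are made up; the point is to parallelize across the cores of the current worker rather than register clustermq, which would submit jobs from within a job):

library(foreach)
library(doParallel)

# hypothetical per-chunk function
summarise_chunk <- function(chunk) mean(chunk)

# called from inside a single target: use the worker's own cores
process_chunks <- function(chunks, n_cores = 4) {
  registerDoParallel(cores = n_cores)
  foreach(chunk = chunks, .combine = c) %dopar% summarise_chunk(chunk)
}

# usage sketch
process_chunks(split(rnorm(1e4), rep(1:4, each = 2500)))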

@wlandau
Member

wlandau commented Jan 2, 2021

Just to confirm: is mschubert/clustermq#229 (comment) a wrong pattern? i.e. we are not supposed to set register_dopar_cmq and then use foreach %dopar% within the function for one target? Interestingly though, I do not think any job was submitted anyway until I put register_dopar_cmq into ~/.Rprofile (removed before trying the above steps).

Yes, clustermq %dopar% within a target is like job Inception: jobs within jobs. When that happens, the first layer of jobs is just waiting around for the second layer of jobs to finish. That's usually not a big deal for one or two processes, but scaled out, that strategy unnecessarily locks a large chunk of the cluster. Nobody else can access those resources even though those processes are idle.
