Subpipeline data transfer timeout #237

Closed
3 of 6 tasks
mattwarkentin opened this issue Nov 30, 2020 · 17 comments

@mattwarkentin
Contributor

mattwarkentin commented Nov 30, 2020

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

Hi @wlandau,

I'm running into an issue transferring some probably "too big" data to the HPC via SSH using clustermq. For some additional context, the data are torch tensors, in case that matters. There are two files with the same format/structure (torch tensors), but they vary in size in the _targets/ data store.

One file is 221 MB on disk, the other is 1.1 GB. Admittedly, maybe I shouldn't be sending files that big over SSH, but they really aren't THAT big. I ran a test where all I did was check the class of the objects on the worker (just a lightweight job that forces the data to be transferred over). The 221 MB tensor took about a minute to transfer and complete, while the larger file "ran" until it hit my clustermq timeout of 20 minutes with no master/worker communication.

I really don't even know how to diagnose the issue. Is this subpipeline related? clustermq related? It shouldn't take more than 20x longer to transfer the target value of a file that is only 4x larger, and 1 GB just isn't that big. When I run locally, the whole job takes about 20 seconds to complete.

Happy to run any tests I can to diagnose this issue.

@mattwarkentin
Contributor Author

I know it's not the same, but when I transfer the larger file by scp it takes 26 seconds. I don't know enough about what the sub-pipeline is doing with serialization, etc., but something is bogging down and I hope to sort it out.

@mattwarkentin
Contributor Author

The same timeout also happens when transferring a file that is 800 MB in the _targets/ store. These files are larger when loaded into memory; I think the 1.1 GB file is roughly 3.5 GB in memory.

@wlandau
Member

wlandau commented Nov 30, 2020

Maybe @mschubert could confirm, but I think clustermq tries to avoid transferring data that large. At this point, I would recommend running the main process in targets directly on the login node with retrieval = "worker".

I guess one thing to confirm is that the subpipeline is about the same size as the torch data you are sending. If the subpipeline duplicates data (which I doubt it does), then that would be a bug in targets.
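
A minimal sketch of that setup, assuming the pipeline is launched from the login node so the workers can see the _targets/ store (the file path and target names below are made up):

# _targets.R, run from the login node via tar_make_clustermq()
library(targets)

# workers read their own dependencies from _targets/ and write their own
# results back, so large values never pass through the main R process
tar_option_set(storage = "worker", retrieval = "worker")

list(
  # hypothetical stand-ins for the torch tensors discussed above
  tar_target(tensor_path, "data/tensor.pt", format = "file"),
  tar_target(tensor_class, class(torch::torch_load(tensor_path)))
)

With that in place, tar_make_clustermq(workers = 1) keeps all of the data movement on the cluster's shared file system.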

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

Would retrieval = "worker" actually work? The HPC doesn't have access to the local data store, unless it actually does have access via SSH?

Also, I thought clustermq just threw a warning for larger data, but didn't strictly prohibit it. I could be wrong.

@wlandau
Member

wlandau commented Nov 30, 2020

Would retrieval = "worker" actually work? The HPC doesn't have access to the local data store, unless it actually does have access via SSH?

I know it's inconvenient, but you would have to manually SSH into the cluster and run the pipeline from the login node instead of running from your local computer using the clustermq SSH connector. Your choice, obviously.

Also, I thought clustermq just threw a warning for larger data, but didn't strictly prohibit it. I could be wrong.

Yeah, there's no hard limit. You may even be able to increase the timeout. That said, 20 minutes seems like a lot, especially since 221 MB only took a minute.
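
To make the two setups concrete, a sketch of the clustermq options involved (the host name and template file are placeholders):

# current setup: pipeline driven from the local machine, cluster reached over SSH
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host  = "user@login.cluster.example"  # placeholder host
)

# suggested setup: SSH in yourself and run the pipeline on the login node
# against the scheduler directly (SLURM assumed here), so target data stays
# on the cluster file system instead of crossing the SSH link
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm_clustermq.tmpl"  # placeholder template file
)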

@mattwarkentin
Contributor Author

I've hit the frustration level where I'm going to split the data processing part of the pipeline into its own targets project locally, since it needs access to data that has to live locally.

And then the model building targets will be a separate targets project on the HPC so I don't have to deal with any SSH. I was hoping to keep it all together in one project, but I'm spending more time debugging SSH deployment than it's worth, I think.

@mschubert

mschubert commented Nov 30, 2020

@mattwarkentin Can you please try to run just the transfer via clustermq and see how long it takes?

It shouldn't take longer than scp plus a slight overhead for serialization (maybe 5-10 seconds). If it does, that's a bug.

edit: I should clarify that it is only supposed to be that fast if the file size is the same as the object size; otherwise it scales with the object size.

@mschubert

Something like:

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)

should do
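
To time the transfer itself, as suggested above, one could wrap a similar call (this wrapper is a sketch, not from the thread; 1e8 doubles is roughly 800 MB in memory):

x <- rnorm(1e8)  # roughly 800 MB in memory
system.time(
  clustermq::Q(function(y) length(y), y = list(x), n_jobs = 1)
)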

@mattwarkentin
Contributor Author

clustermq::Q(object.size, x=list(rnorm(1e8)), n_jobs=1)
Connecting USER@HOST via SSH ...
Sending common data ...
Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
Master: [30.5s 9.3% CPU]; Worker: [avg 13.5% CPU, max 201693041.0 Mb]
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  1/1 jobs failed (0 warnings). Stopping.
(Error #1) object 'C_objectSize' not found

@mschubert

That's clearly wrong. I'll look into it 😅
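
(Purely as an untested guess while that is investigated: the error may come from object.size() losing its namespace environment when it is shipped to the worker, so wrapping the call in an anonymous function might sidestep it.)

# possible workaround sketch: call object.size() inside a closure so the symbol
# is resolved on the worker instead of being serialized with the function
clustermq::Q(function(x) utils::object.size(x), x = list(rnorm(1e8)), n_jobs = 1)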

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

The same error is produced when using very small sample sizes in rnorm() (e.g. rnorm(1e1)), so it doesn't seem to be a size issue and probably (?) isn't related to the original issue, right?

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 30, 2020

@wlandau So far it seems like this is unrelated to targets or the subpipeline. After a few tests (mschubert/clustermq#222), it looks like the SSH data transfer time may scale nearly cubically (or worse) with in-memory object size.

@wlandau
Member

wlandau commented Nov 30, 2020

Thanks for tracking that down, @mattwarkentin!

@liutiming
Contributor

I realised mschubert/clustermq#229 is probably relevant to this issue, too. Jobs could still be sent when the data was about 1 GB, but when it is larger, all my attempts to complete the job fail... I am not using SSH here, though - I am running directly on the cluster.

@wlandau
Member

wlandau commented Jan 2, 2021

@liutiming, have you tried storage = "worker" and retrieval = "worker" in tar_target() or tar_option_set()?
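
For reference, both forms look roughly like this (a sketch; the target and function names are made up):

library(targets)

# globally, in _targets.R
tar_option_set(storage = "worker", retrieval = "worker")

# or per target, for just the large objects
tar_target(
  big_fit,
  fit_model(big_tensor),
  storage = "worker",
  retrieval = "worker"
)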

@liutiming
Contributor

liutiming commented Jan 2, 2021

Thank you @wlandau! Now it is finally working!
I tried the following

  1. followed your advice by configuring storage and retrieval in the target
  2. split the large file (about 4 GB in lobstr::object_size) into 22 files
  3. used within-target disk I/O for the large file, or for the 22 files generated from it, to reduce memory usage
  4. changed %dopar% in the target to %do% to avoid clustermq submitting jobs within each target

2, 3, and 2&3 alone do not work.
2&3&4 works, but the targets are generated at a very slow pace.
1&2&3&4 works very fast.

Just to confirm: is mschubert/clustermq#229 (comment) a wrong pattern? i.e. we are not supposed to set register_dopar_cmq and then use foreach %dopar% within the function for one target? Interestingly though, I do not think any job was submitted anyway until I put register_dopar_cmq into ~/.Rprofile (removed before trying the above steps).

@mschubert so I am wondering whether I should have used registerDoParallel instead of register_dopar_cmq to take advantage of the multiple cores that each worker has: registerDoParallel uses the cores within each worker, whereas register_dopar_cmq submits jobs from within each worker, which is not recommended by @wlandau if I understand correctly (see the sketch at the end of this comment).

For reference:
There is still a large-data warning from clustermq (about 2 GB) after trying 1&2&3&4 together, same as before, but I am happy that it is running. For future reference, that target has about 1,000 branches when run against the one large file and 22,000 branches when run against the 22 files. This might have complicated the memory issues a bit, but I have always used head(map()) to test, so the number of branches tested is at most 10. I will look into creating tar_group, since each branch now takes only about 10 seconds to finish, so it is wise to lengthen each branch's runtime and reduce the total overhead.
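
A sketch of the registerDoParallel pattern mentioned above (the chunk function and core count are made up; the point is to parallelize across the cores of the current worker rather than register clustermq, which would submit jobs from within a job):

library(foreach)
library(doParallel)

# hypothetical per-chunk function
summarise_chunk <- function(chunk) mean(chunk)

# called from inside a single target: use the worker's own cores
process_chunks <- function(chunks, n_cores = 4) {
  registerDoParallel(cores = n_cores)
  foreach(chunk = chunks, .combine = c) %dopar% summarise_chunk(chunk)
}

# usage sketch
process_chunks(split(rnorm(1e4), rep(1:4, each = 2500)))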

@wlandau
Member

wlandau commented Jan 2, 2021

Just to confirm: is mschubert/clustermq#229 (comment) a wrong pattern? i.e. we are not supposed to set register_dopar_cmq and then use foreach %dopar% within the function for one target? Interestingly though, I do not think any job was submitted anyway until I put register_dopar_cmq into ~/.Rprofile (removed before trying the above steps).

Yes, clustermq %dopar% within a target is like job Inception: jobs within jobs. When that happens, the first layer of jobs is just waiting around for the second layer of jobs to finish. That's usually not a big deal for one or two processes, but scaled out, that strategy unnecessarily locks a large chunk of the cluster. Nobody else can access those resources even though those processes are idle.
