Subpipeline data transfer timeout #237
I know it's not the same, but when I transfer the larger file by …
The same timing out also happens when transferring a file that is 800Mb in size.
Maybe @mschubert could confirm, but I guess one thing to check is whether the subpipeline is about the same size as the file on disk.
Would …? Also, I thought …
I know it's inconvenient, but you would have to manually SSH into the cluster and run the pipeline from the login node instead of running from your local computer using the `clustermq` SSH backend.
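For concreteness, here is a sketch of that setup, assuming a SLURM cluster (the thread does not say which scheduler is in use) and a hypothetical template path:

```r
# Run from the cluster's login node, not the local machine.
# With a native scheduler backend, workers exchange data with the main
# R process over the cluster network, so nothing crosses an SSH tunnel.
options(
  clustermq.scheduler = "slurm",      # assumption: the actual scheduler is not stated
  clustermq.template  = "slurm.tmpl"  # hypothetical template file
)
targets::tar_make_clustermq(workers = 2)
```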
Yeah, there's no hard limit. You may even be able to increase the timeout. That said, 20 minutes seems like a lot, especially since 221 MB only took a minute.
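As a sketch of raising that limit: `clustermq` exposes a worker-timeout setting, but the option name below is an assumption, so check the package documentation for your installed version:

```r
# Assumption: clustermq reads this option as the worker idle timeout
# (seconds of no master/worker communication before the worker shuts down).
options(clustermq.worker.timeout = 3600)  # 1 hour instead of the default
```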
I've hit the frustration level where I'm going to split the data processing part of the pipeline into its own project. And then the model-building targets will be a separate `targets` project on the HPC, so I don't have to deal with any SSH. I was hoping to keep it all together in one project, but I'm spending more time debugging SSH deployment than it's worth, I think.
@mattwarkentin Can you please try to run just the transfer via `clustermq`? It shouldn't take longer than …

edit: I should clarify that it is only supposed to be that fast if the file size is the same as the object size, and should otherwise scale with the object size.
Something like:

```r
clustermq::Q(object.size, x = list(rnorm(1e8)), n_jobs = 1)
```

should do.
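To compare the two payload sizes directly, one could also time the pure `clustermq` round trip (a sketch, not from the thread; a double is 8 bytes, so `rnorm(3e7)` is roughly 240 MB and `rnorm(1.4e8)` roughly 1.1 GB):

```r
library(clustermq)
# Same test as above, timed at payload sizes matching the two files.
system.time(Q(object.size, x = list(rnorm(3e7)),  n_jobs = 1))  # ~240 MB payload
system.time(Q(object.size, x = list(rnorm(1.4e8)), n_jobs = 1)) # ~1.1 GB payload
```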
That's clearly wrong. I'll look into it 😅
The same error is produced when using very small sample sizes in `rnorm()`.
@wlandau So far it seems like this is unrelated to `targets`.
Thanks for tracking that down, @mattwarkentin!
I realised mschubert/clustermq#229 would probably be relevant to this issue, too. Jobs could still be sent when the data was about 1GB, but when it is larger, none of my attempts to complete the job succeed... I am not using SSH here, though - I am running on the cluster directly.
@liutiming, have you tried …?
Thank you @wlandau! Now it is finally working!
Just to confirm: is mschubert/clustermq#229 (comment) a wrong pattern? i.e., are we not supposed to set …? @mschubert, so I am wondering whether I should have used … instead. For reference: …
Yes, …
Description
Hi @wlandau,

I'm running into an issue with transferring some probably "too big" data to the HPC via SSH using `clustermq`. For some additional context, the format of the data is `torch`, in case that matters. Anyway, there are two files which have the same format/structure (torch tensors) but vary in size in the `_targets/` data store. One file is 221Mb on disk, the other is 1.1Gb. Admittedly, I maybe shouldn't be sending that big of a file over SSH, but it's really not THAT big.

Anyway, I ran a test where all I did was check the class of the objects on the worker (just a lightweight job that forces the data to be transferred over). It took about a minute for the 221Mb tensor to transfer over and complete; the larger file "ran" until it hit my `clustermq` timeout of 20 minutes of no master/worker communication.

I really don't even know how to diagnose the issue. Is this subpipeline related? `clustermq` related? I don't think it should take more than 20x longer to transfer the target value for a file 4x larger. Also, 1Gb just seemingly isn't that big. When I run locally, it takes like 20 seconds to complete the whole job.

Happy to run any tests I can to diagnose this issue.