Drake plan includes external dependencies stored on the HPC file system #1295
Glad it's all working well for you! It is possible to use file_in() files that live on the cluster's file system. Does that help?
What you're describing is possible, but it depends on your system. I'm surprised that you're copying data at all, though. It's worth noting (this is something that @wlandau has suggested to me as well) that it helps to keep the project on the same file system as the data. This also helps deal with cases where the path to the data is mounted at different points. In my experience (at least with SQLite caches), the network FS lag is, or can be, significant (bandwidth and latency obviously influence this), if, for example, you have your data stored on an HPC filesystem that you mount over SSH (e.g. with sshfs). It's not a dealbreaker when I want to pull data from the remote cache for exploration and visualization, or when my analysts want to fetch the data interactively, but, at least under Australian network conditions, it significantly impacts the performance of an automated build of my plan.
@mstr3336, I agree with this approach.
Thank you @wlandau and @mstr3336 for your responses. This is very helpful and informative. Indeed, I think by mounting the cluster file-system on my local machine (e.g. with sshfs), I can point my plan at the files that live on the cluster. Perhaps I have been using file_in() incorrectly. As a follow-up question, if I have a local CSV file, is there any practical difference between these two plans?

```r
plan1 <- drake_plan(
  data = read_csv(file_in("big.csv")),
  results = do_something(data)
)
```

```r
plan2 <- drake_plan(
  data = read_csv("big.csv"),
  results = do_something(data)
)
```
In your example at the bottom of your comment, I like plan1 better: file_in() tells drake to watch big.csv and rerun the downstream targets when it changes. If the data is large, you might also consider one of the specialized storage formats, e.g.

```r
drake_plan(
  data = target(data.table::fread("big.csv"), format = "fst_dt"),
  results = do_something(data)
)
```
Hi @wlandau, I still must be doing something wrong. I have a CSV file located on my local machine, and I refer to that file in my plan with file_in():

```r
# Toy example
plan <- drake_plan(
  data = read_csv(file_in("path/on/local/machine")),
  results = do_something(data)
)
```

When I call make() with the clustermq backend, the file dependency does not behave the way I expected. This follows from my previous questions (specifically the bold part) about whether a file_in() dependency can reside on the cluster rather than on my local machine.
If you are using a cluster, it is actually best to house your project on the cluster's file system and launch make() from there (e.g. from a login node or an interactive job). Also, have a look at your template file for clustermq and make sure its settings match that setup.
Thanks for the quick response. Hmm, I was really hoping that, since the data gets copied over for the build, I could keep developing the project on my local machine. I guess I can just set the project up on the cluster and work over SSH instead, if that is the recommended approach.
The storage in drake's cache lives on whichever file system you call make() from, even when the targets themselves build on remote workers.
Naive question: why shouldn't the workflow @mattwarkentin describes work? If you load your data on the local machine, it should be serialized and sent to the workers over the network, so the file itself never needs to exist on the cluster. (I totally agree with you that workflows over SSH come with issues, but I also see the power of convenience.)
Side note: a nice, local-feeling workflow could be accomplished by calling a server-side script via ssh. It doesn't need to be over-engineered; put something like the following on the HPC side:

```sh
#!/bin/bash
cd projects/my_r_project
qsub submission_script.pbs
```
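For a rough idea of the "local feeling" part, that script could be triggered from a local R session. This is just a sketch; the host name and script name below are placeholders, not details from the thread.

```r
# Hypothetical: kick off the server-side submission script from a local R session.
# "user@hpc.example.org" and "run_plan.sh" are placeholders for your own setup.
system2("ssh", c("user@hpc.example.org", "bash run_plan.sh"))
```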
You are right, it theoretically should work, provided the file is read on the local side so that only the resulting data has to reach the workers.
Thanks @mschubert and @wlandau. Just so I'm clear, the toy example I posted previously should work, so long as the file is read on my local machine and the data is then copied over to the cluster workers when the downstream targets build?
Yes, exactly. I would just emphasize here that the "copying over" step happens totally in memory, not in storage.
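For onlookers, here is a minimal sketch of that idea. It assumes drake's optional `hpc` target setting, which the thread itself does not spell out: the file is read in the local make() session, and only the in-memory data travels to the workers.

```r
library(drake)
library(readr)

plan <- drake_plan(
  # Read the file where make() runs, so the path only needs to exist locally.
  data = target(
    read_csv(file_in("big.csv")),
    hpc = FALSE  # assumption: keep this target out of the clustermq workers
  ),
  # Heavy work can run on the cluster; `data` reaches the worker in memory.
  # do_something() is the placeholder helper used elsewhere in this thread.
  results = do_something(data)
)
```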
Thanks for clarifying. Also, I just discovered {targets}.
Background

I was not planning to formally announce some of this until later in the year, but since the topic came up, I will address it. For onlookers, {targets} is the planned long-term successor to {drake}. For July, August, and probably September, I would stick with {drake}.
To be clear, there is no time pressure for any of this. Feel free to stick with {drake}.

Response to the original comment

@mattwarkentin, you are an advanced user, so even if we have to work out a few early bugs, I think you will be pleased if you switch to {targets}. Now here's my own bias: early adopters can really help get {targets} into good shape.
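For readers curious what a switch might look like, here is a minimal, hypothetical _targets.R along the lines of the earlier {drake} examples. The file name big.csv and the helper do_something() are carried over from this thread and are not prescribed by it.

```r
# _targets.R (sketch, not the thread author's actual pipeline)
library(targets)
tar_option_set(packages = "readr")  # packages loaded when targets build

list(
  tar_target(big_file, "big.csv", format = "file"),  # track the file by content
  tar_target(data, read_csv(big_file)),              # reruns when big.csv changes
  tar_target(results, do_something(data))
)
```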
Awesome @wlandau! I will definitely early adopt and hammer you with suggestions. Other than you, I believe I account for a plurality of the issues filed on {drake}.
Fantastic, @kendonB! You helped out so much in the early days of {drake}.
Whoops! I hope I didn't play any negative role in your announcing it earlier than desired. I just happened to stumble upon the repo, and it seemed like this was the heir apparent, so I thought I might as well adopt the newer package. I quite enjoy testing out new packages and playing my small part in providing feedback towards their improvement. I am happy to be an early adopter of {targets}.
That's totally fine; I think threads like these are most likely to be seen by power users anyway. Glad you're having a good experience with {drake} so far.
You know what? I think I finally understand the purpose of clustermq's SSH connector.
YES! This is exactly how I've been using it. My R startup profile on the cluster contains:

```r
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "some/path/slurm_clustermq.tmpl"
)
```

which sets the scheduler and template for the session that talks to Slurm. This way, I can choose to send computationally intensive jobs to the HPC (via {clustermq}) or run lighter work locally.
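As a rough sketch of that toggle (the use_hpc flag is made up for illustration; the make() arguments are standard {drake} usage):

```r
library(drake)

use_hpc <- TRUE  # hypothetical switch; flip to FALSE for a purely local build

if (use_hpc) {
  # Submit targets through the scheduler configured in the clustermq options.
  make(plan, parallelism = "clustermq", jobs = 16)
} else {
  # Default local processing.
  make(plan)
}
```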
Nice!
Trying to replicate this myself on SGE. Ever had port problems like mschubert/clustermq#176 (comment)?
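For anyone replicating this, a local-side configuration for handing work to the cluster over SSH could look like the following. This is an assumption about the setup being discussed, not a quote from it, and the host name is a placeholder.

```r
# Local ~/.Rprofile (hypothetical): let clustermq reach the cluster over SSH.
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host  = "user@hpc.example.org"  # placeholder host
)
```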
Also, a target building on an ssh worker cannot require direct access to any files on the local system. This one is probably obvious. This is also true for targets that have, as dependencies, upstream targets which are files on disk rather than in-memory objects.
This would only work if the HPC has internet access, right?
Is the same software installed on the cluster?
Yes, that's what I meant. (1) and (2) are ways to work around that.
I think that's a fair assumption in most cases.
It is (version 4.2.3). Version 4.3.3 is installed locally.
Hi @wlandau,

I have graduated to the point of using {drake} and {clustermq} to run my plans on an HPC using the Slurm job scheduler. Everything works great and I am able to run my plans successfully. I love how seamlessly it all works. Big thanks to you and @mschubert for your hard work!

I know the requisite data is copied over to the cluster to build the targets, but I was wondering if it is possible for an external file dependency in my plan (i.e. file_in()) to actually reside on the cluster and not on my local machine.

I think the really nice part of using {drake} and {clustermq} is that I can develop my plan locally, but utilize external compute resources as needed. I'm imagining a use-case whereby some very large data files (in my case, genetic data) are stored in a common place on the cluster for many people in our lab to jointly access, and I want to avoid copying these data, if possible.

Is it possible to have static files located on the HPC file system that my {drake} plan can access at runtime?