Drake plan includes external dependencies stored on the HPC file system #1295

Closed
mattwarkentin opened this issue Jul 9, 2020 · 25 comments

@mattwarkentin

Hi @wlandau,

I have graduated to the point of using {drake} and {clustermq} to run my plans on an HPC using the Slurm job scheduler. Everything works great and I am able to run my plans successfully. I love how seamlessly it all works. Big thanks to you and @mschubert for your hard work!

I know the requisite data is copied over to the cluster to build the targets, but I was wondering if it is possible for an external file dependency in my plan (i.e. file_in()) to actually reside on the cluster and not on my local machine?

I think the really nice part of using {drake} and {clustermq} is that I can develop my plan locally, but utilize external compute resources as needed. I'm imagining a use-case whereby some very large data files (in my case, genetic data) are stored in a common place on the cluster for many people in our lab to jointly access, and I want to avoid copying these data, if possible.

Is it possible to have static files located on the HPC file system that my {drake} plan can access at runtime?

@wlandau
Member

wlandau commented Jul 10, 2020

Glad drake + clustermq is working so well for you.

It is possible to use file_in() on a large external file if you mount the drive where the file lives. On Linux systems at least, if you mount a drive, then all the files on that drive have regular path names as far as the user is concerned. This is certainly the case at my workplace. All the drives are abstracted away, and I don't need to actually worry about where the file lives unless I am thinking about network file system lag.
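For example, once the shared drive is mounted, a plan can point at it like any other path. A minimal sketch, assuming a hypothetical mount point and made-up helper functions:

library(drake)

plan <- drake_plan(
  # /mnt/lab_share is wherever the shared genetics drive happens to be mounted
  genotypes = read_genotypes(file_in("/mnt/lab_share/genetics/cohort.vcf.gz")),
  results = summarize_calls(genotypes)
)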

Does that help?

@strazto
Contributor

strazto commented Jul 11, 2020

What you're describing is possible, but depends on your system.
I've implemented a similar setup for my workplace's project, and, though it's not without some complexity, it works.

I'm surprised that you're copying data at all, though.

It's worth noting (something @wlandau has suggested to me as well) that file_in() can be expensive for large data inputs that rarely change or whose mount points may move, because hashing a large dataset is costly. Using trigger = trigger(change = file.mtime(large_data_path)) can be more helpful, or you can even roll your own change detection with heuristics that short-circuit full-scale hashing.
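For instance, a minimal sketch of that kind of change trigger, where the path and the process_data() helper are placeholders:

library(drake)

large_data_path <- "/path/to/big_dataset.csv"

plan <- drake_plan(
  data = target(
    process_data(large_data_path),
    # Rebuild only when the file's modification time changes,
    # instead of hashing the whole file:
    trigger = trigger(change = file.mtime(large_data_path))
  )
)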

This will also help when the data ends up mounted at different points, because file_in() takes the path to the file into account and so will trigger spurious rebuilds (unless I'm mistaken, @wlandau can confirm) if the filesystem is mounted somewhere else.

In my experience (at least with SQLite caches), the network FS lag can be significant (bandwidth and latency obviously play a big role) if, for example, your data is stored on an HPC filesystem that you mount via sshfs.

It's not a dealbreaker for when I want to pull data from the remote cache for exploration and visualization, or when my analysts want to fetch the data interactively, but, at least under Australian network conditions, it significantly impacts the performance of an automated build of my plan.

@wlandau
Member

wlandau commented Jul 11, 2020

@mstr3336, I agree with this use of the change trigger, and I agree changed mount points can trigger undesired rebuilds. It just depends on what kind of system you have. At my work, we have mature hardware with RHEL and SGE, and the mount points do not change. But if you're working on a Windows machine and the drive aliases change often (T://files one day, Z://files the next day) then fixed paths can't be trusted.

@mattwarkentin
Author

Thank you @wlandau and @mstr3336 for your responses. This is very helpful and informative.

Indeed, I think by mounting the cluster file-system using sshfs, I will be able to provide standard paths to my {drake} plan as per normal. I am going to try this out right away. I guess I will see how bad the network FS lag is when I end up testing this out on some more complex projects. I definitely need to dig more into setting up custom triggers to determine when re-builds occur. Thanks for pointing me in this direction.

Perhaps I have been using file_in() too liberally. I have been using file_in() on basically any/every external file that my targets depend on, even if they are large and very unlikely to change. It sounds like this is not only not necessary, but potentially expensive to do. Should I not be doing this? Perhaps my mental model of when to declare dependencies is not correct.

As a follow-up question, if I have a local {drake} project that contains a large CSV file that won't ever change, and a plan as shown below, how exactly are these different from {drake}'s perspective? If I were to run this job on my HPC via {clustermq}, does the CSV get "copied over" to the cluster in order to build the data target?

plan1 <- drake_plan(
  data = read_csv(file_in("big.csv")),
  results = do_something(data)
)
plan2 <- drake_plan(
  data = read_csv("big.csv"),
  results = do_something(data)
)

@wlandau
Member

wlandau commented Jul 14, 2020

Perhaps I have been using file_in() too liberally. I have been using file_in() on basically any/every external file that my targets depend on, even if they are large and very unlikely to change. It sounds like this is not only not necessary, but potentially expensive to do. Should I not be doing this? Perhaps my mental model of when to declare dependencies is not correct.

file_in() is actually fine for large files. In fact, it uses time stamps to try to avoid recomputing hashes, which should help speed. If you expect the paths to those files to change frequently, however, then that's the time to look at the change trigger.

In your example at the bottom of your comment, I like plan1 more than plan2 because it reproducibly tracks the CSV file without hitting you with a performance penalty for subsequent make()s. As far as when the data gets transferred, what matters here is the caching argument to make(). If caching = "master", the data target gets loaded on the master process and then sent to a remote worker through a ZeroMQ socket in order to build results. Not always great for large data. If you set caching = "worker", the worker itself loads data and there is no transfer over a socket. However, there is some lag because the workers also save data and results, so drake waits for the data to sync over the network file system before moving on to downstream targets.
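A sketch of the two modes side by side (the jobs value is a placeholder):

# Workers load and save targets themselves via the shared cache on the file system:
make(plan, parallelism = "clustermq", jobs = 8, caching = "worker")

# The master process does all caching and ships data to workers over ZeroMQ sockets:
make(plan, parallelism = "clustermq", jobs = 8, caching = "master")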

If results is the only place you load data, you might consider condensing them down to a single target, e.g. plan <- drake_plan(results = do_something(data.table::fread(file_in("big.csv")))). That data target probably isn't eating up too much runtime anyway, so the advice at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets would suggest condensing those down. But if you do choose to keep data, I recommend a specialized storage format to reduce storage and runtime costs, e.g.

drake_plan(
  data = target(data.table::fread("big.csv"), format = "fst_dt"),
  results = do_something(data)
)

@wlandau wlandau closed this as completed Jul 14, 2020
@mattwarkentin
Author

mattwarkentin commented Jul 15, 2020

Hi @wlandau,

I still must be doing something wrong. I have a CSV file located on my local machine, and I refer to that file in my plan with file_in() to explicitly declare this file as a dependency. To be clear, the file lives on my local machine.

# Toy example
plan <- drake_plan(
  data = read_csv(file_in("path/on/local/machine")),
  results = do_something(data)
)

When I call make(plan, parallelism = "clustermq") to build the plan on the HPC via {clustermq}, it errors out right away saying that the file does not exist. I presume that is because this path doesn't exist on the cluster. I guess I don't understand how to refer to local files if I want to be able to run on the HPC.

This follows from my previous questions (specifically the bold part):

As a follow-up question, if I have a local {drake} project that contains a large CSV file that won't ever change, and a plan as shown below, how exactly are these different from {drake}'s perspective? If I were to run this job on my HPC via {clustermq}, does the CSV get "copied over" to the cluster in order to build the data target?

@wlandau
Member

wlandau commented Jul 15, 2020

If you are using a cluster, it is actually best to house your project and launch make() from the cluster's login node (or "headnode") rather than your local machine. I think this may be where a lot of the confusion is coming from. If you run make() that way, the workers should be able to find that file_in() file. drake itself does not copy file_in() files anywhere.

Also, in your template file for clustermq, you can set a flag to make sure the remote workers operate from the same working directory as the master process on the login node. For the Sun Grid Engine, which is what I use for work, this is the -cwd flag. Other schedulers like Slurm may not need it, but it's a good idea to check whether your cluster makes the working directories agree.

#$ -cwd # use pwd as work dir

@mattwarkentin
Author

Thanks for the quick response.

Hmm, I was really hoping that since {clustermq} offers the ability to submit cluster jobs via ssh, it meant I could have the best of both worlds by developing my project/plan within my local development environment, but still be able to build the plan on the HPC when additional computational resources are required.

I guess I can just set target(expr, hpc = FALSE) for any targets directly depending on local files, and that gets me most of the way there. Those targets get built first, and then once those targets are cached, they will get copied over to the HPC to build remaining targets.

If {drake} knew to copy over any local file_in() / knitr_in() files to tempdir() on the cluster, wouldn't that make it so a local project could be run entirely on the cluster? I am just thinking out loud now.

@wlandau
Member

wlandau commented Jul 15, 2020

Hmm, I was really hoping that since {clustermq} offers the ability to submit cluster jobs via ssh, it meant I could have the best of both worlds by developing my project/plan within my local development environment, but still be able to build the plan on the HPC when additional computational resources are required.

clustermq does have an ssh backend, but I do not recommend it for most cluster-powered workflows because it takes a lot of extra work to communicate with the resource manager. Best to work entirely on the cluster if you can.

I guess I can just set target(expr, hpc = FALSE) for any targets directly depending on local files, and that gets me most of the way there. Those targets get built first, and then once those targets are cached, they will get copied over to the HPC to build remaining targets.

drake does make sure upstream targets are in memory when they need to be, but nothing gets copied in storage. If caching = "worker", then the workers try to load data from the central cache. If caching = "master", then the master process loads data from storage into memory and sends the in-memory data to the workers via ZeroMQ sockets. For you, caching = "master" makes sense for the targets where hpc = TRUE, especially if you do go with the ssh clustermq backend. (But working entirely on the cluster, e.g. launching from the headnode, is standard recommended practice.)

If {drake} knew to copy over any local file_in() / knitr_in() files to tempdir() on the cluster, wouldn't that make it so a local project could be ran entirely on the cluster? I am just thinking out loud now.

The storage in tempdir() is usually local to each compute node. Data would load into memory faster, but shuffling around files to and from temp storage could be time-consuming. For drake, I consider it out of scope because drake relies on the abstraction of clustermq and tries not to get too involved with the low-level details when it comes to managing workers.

@mschubert

Naive question: Why shouldn't the workflow @mattwarkentin describes work? If you load your file_in()s on the local process with caching = "master" and then add them to cmq's exports, I believe this would already work.

(I totally agree with you that workflows over SSH come with issues, but I also see the power of convenience)

@strazto
Contributor

strazto commented Jul 17, 2020

Side note: a nice local-feeling workflow could be accomplished by calling a server-side script via ssh.

It doesn't need to be overengineered; put something like the following on your HPC's PATH:

#!/bin/bash

cd projects/my_r_project
qsub submission_script.pbs

@ropensci ropensci deleted a comment from wlandau-lilly Jul 17, 2020
@wlandau
Member

wlandau commented Jul 17, 2020

Naive question: Why shouldn't the workflow @mattwarkentin describes work? If you load your file_in()s on the local process with caching = "master" and then add them to cmq's exports, I believe this would already work.

You are right, it theoretically should work if hpc = FALSE for all targets with file_in(), file_out(), knitr_in(), or format = "file", as long as caching is "master". @mattwarkentin, feel free to try if that sounds worth it to you. I am just skeptical about both convenience and efficiency in the long run. It seems like local files are more likely to break (which I suppose depends on how your cluster is maintained and backed up) and network lag will increase.
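A minimal sketch of that arrangement, with placeholder paths and a made-up do_something():

library(drake)

plan <- drake_plan(
  # Built locally (hpc = FALSE) because the CSV only exists on the local machine:
  data = target(
    readr::read_csv(file_in("path/on/local/machine/big.csv")),
    hpc = FALSE
  ),
  # Built on a remote worker; `data` is shipped to it in memory over ZeroMQ:
  results = do_something(data)
)

make(plan, parallelism = "clustermq", jobs = 4, caching = "master")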

@mattwarkentin
Author

Thanks @mschubert and @wlandau.

Just so I'm clear, this comment I made previously should work, so long as caching = "master":

I guess I can just set target(expr, hpc = FALSE) for any targets directly depending on local files, and that gets me most of the way there. Those targets get built first, and then once those targets are cached, they will get copied over to the HPC to build remaining targets.

@wlandau
Member

wlandau commented Jul 17, 2020

Yes, exactly. I would just emphasize here that the "copying over" step happens totally in memory, not in storage.

@mattwarkentin
Author

mattwarkentin commented Jul 17, 2020

Thanks for clarifying. Also, I just discovered {targets}. Should I just switch over now? I have only been using {drake} for a few months.

@wlandau
Member

wlandau commented Jul 18, 2020

Background

I was not planning to formally announce some of this until later in the year, but since the topic came up, I will address it.

For onlookers, targets is the long-term successor to drake, and the rationale behind the change is thoroughly discussed at https://wlandau.github.io/targets/articles/need.html. I have been slowly planning it on and off for about two years, and the closed-source implementation phase ran from early March to early July. I recently released targets as an open source work in progress, and it is just beginning its beta phase. Right now, I am trying to roll it out to power users. During September or October, I will submit it to rOpenSci for peer review, which will then kick off a CRAN submission and announcements to the wider R community.

For July, August, and probably September, I would stick with drake in the following situations:

  1. Your work requires software that is published in a formal release (e.g. CRAN), peer reviewed, and/or formally validated.
  2. Your current project already uses drake and you never want to go back and switch it over.
  3. You are not personally willing to live at the bleeding edge of new packages, either because you are relatively new to R or you just do not enjoy the software development side of data science.

To be clear, there is no time pressure for any of this. Feel free to stick with drake for years to come. Because of its widespread use, I will maintain drake indefinitely. Although I am no longer willing to pursue difficult refactoring adventures like #1191, I will spend the same amount of time and effort on things that bring the most value to users: mainly one-on-one help, bug fixes, requested features, and known performance issues. So even after targets debuts for real, you really can't go wrong either way.

Response to the original comment

@mattwarkentin, you are an advanced user, so even if we have to work out a few early bugs, I think you will be pleased if you switch to targets. The overall quality of the software is substantially improved, and personally, I already find it much easier to use. Plus, the feature set is as close to complete as I can get it prior to community feedback. It's pretty much everything that worked best in drake, plus a couple more nice things that are now possible, such as dynamic branching over dplyr::group_by() subsets of data frames (originally proposed by @AlexAxthelm in #77; also thinking of you, @kendonB).

Now here's my own bias: early adopters can really help get targets off the ground. drake already has a wealth of community feedback, and targets is just getting started. So in that sense, for those who are comfortable with it, I guess adopting targets does more long-term good for others than continuing with drake.

@kendonB
Contributor

kendonB commented Jul 18, 2020

Awesome @wlandau! I will definitely early adopt and hammer you with suggestions. Other than you, I believe I account for a plurality of the issues on drake and I'll try and keep it up with targets 😄. Excited to try it!

@wlandau
Member

wlandau commented Jul 19, 2020

Fantastic, @kendonB! You helped out so much in the early days of drake, and it will be amazing to have your help again this time.

@mattwarkentin
Author

Whoops! I hope I didn't play any negative role in you announcing it earlier than desired. I just happened to stumble upon the repo and it seemed like this was the heir apparent, so I thought I might as well adopt the newer package. I quite enjoy testing out new packages and playing my small part in helping to provide feedback towards their improvement.

I am happy to be an early adopter of {targets}. I'm sure I will have questions/feedback as I work through using {targets}, so hopefully that will serve as helpful and not onerous. For what it's worth, I cloned your customer churn example repo and played around with it on Friday, and I really like the new API. It seems much cleaner, more organized, and more intuitive. Even more incentive for me to switch over.

@wlandau
Member

wlandau commented Jul 20, 2020

That's totally fine, I think threads like these are most likely to be seen by power users anyway. Glad you're having a good experience with targets so far. This is a great time to discuss development if you have ideas.

@wlandau
Member

wlandau commented Oct 14, 2020

You know what? I think I finally understand the purpose of clustermq's ssh connector. From mschubert/clustermq#176 (comment), it seems like options(clustermq.scheduler = "ssh") tells the local process to talk to a head node / login node which then sets its own clustermq.scheduler which could be "slurm" or similar. I completely missed that and am no longer skeptical of this approach. It may even allow drake and targets to have easier access to cloud computing via AWS ParallelCluster. You guys were way ahead of me all along.
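If I understand correctly, the local side would then need something like the following (the host is a placeholder; clustermq.ssh.host is, as far as I know, the option clustermq uses to identify the login node):

# In the local session or local .Rprofile:
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host = "user@login.hpc.example.org"
)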

@mattwarkentin
Author

mattwarkentin commented Oct 14, 2020

YES! This is exactly how I've been using it. My targets project lives on the hard drive of my local machine that serves as my development environment. Locally, options(clustermq.scheduler = "ssh") tells clustermq to connect to an R session on the login node via ssh. In my home directory on the login node my .Rprofile contains:

options(
    clustermq.scheduler = "slurm",
    clustermq.template = "some/path/slurm_clustermq.tmpl"
)

This sets the clustermq job scheduler that I'm using and points clustermq to a template file called slurm_clustermq.tmpl, which defines the skeleton template/sbatch directives for submitting Slurm batch jobs.

This way, I can choose to send computationally intensive jobs to the HPC ("worker"), meanwhile quick jobs can run locally on the "main" process. It creates a seamless tunnel between my local machine and the HPC in a way that makes me smile every time!

@wlandau
Member

wlandau commented Oct 14, 2020

Nice! In targets, this should work as long as either

  1. The storage and retrieval settings are both set to "main" (see the sketch after this list), or
  2. You use AWS S3 storage: https://wlandau.github.io/targets-manual/cloud.html#storage
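For (1), a minimal sketch in _targets.R, assuming tar_option_set() accepts these storage and retrieval settings:

library(targets)

# Keep storage and retrieval on the main process so workers
# never need direct access to the local file system:
tar_option_set(storage = "main", retrieval = "main")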

Trying to replicate this myself on SGE. Ever had port problems like mschubert/clustermq#176 (comment)?

@mattwarkentin
Author

mattwarkentin commented Oct 14, 2020

The storage and retrieval settings are both set to "main", or

Also, a target building on a ssh worker cannot require direct access to any files on the local system. This one is probably obvious. This is also true for targets that have, as dependencies, upstream targets which are format = "file".

You use AWS S3 storage: https://wlandau.github.io/targets-manual/cloud.html#storage

This would only work if the HPC has internet access, right?

Trying to replicate this myself on SGE. Ever had port problems like mschubert/clustermq#176 (comment)?

Is the zeromq system library installed on your SGE cluster login node and whichever compute nodes you are requesting?

@wlandau
Member

wlandau commented Oct 14, 2020

Also, a target building on a ssh worker cannot require direct access to any files on the local system. This one is probably obvious. This is also true for targets that have, as dependencies, upstream targets which are format = "file".

Yes, that's what I meant. (1) and (2) are ways to work around that.

This would only work if the HPC has internet access, right?

I think that's a fair assumption in most cases.

Is the zeromq system library installed on your SGE cluster login node and whichever compute nodes you are requesting?

It is (version 4.2.3). Version 4.3.3 is installed locally.
