A target archetype that runs on AWS through Metaflow #8

Closed · 3 tasks done
wlandau opened this issue Aug 11, 2020 · 8 comments


wlandau commented Aug 11, 2020

Prework

  • I understand and agree to tarchetypes' code of conduct.
  • I understand and agree to tarchetypes' contributing guidelines.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider posting a "trouble" or "other" issue first so we can discuss your use case and search for existing solutions.

Proposal

targets (and drake) are currently tied to traditional HPC (Slurm, SGE, etc.). That's enough for me and my team right now, but not for the increasing number of people like @MilesMcBain who rely on AWS and other cloud platforms. There is not yet much cloud support native to the R ecosystem, and since I don't use AWS for my own work, I am not prepared to do much at a low level.

Metaflow not only runs computation in the cloud and stores the results on S3; it also abstracts away the devops overhead that comes with that, and it supports a sophisticated versioning system for code and data. I think we will gain a ton of power and convenience if we leverage Metaflow's rich and potentially complementary feature set in situations where targets needs the cloud.

Earlier, I proposed a targets-within-Metaflow approach, which I think would be useful for people with multiple {targets} pipelines in the same project. Here, I would like to explore the reverse: a target archetype that runs some R code as a single AWS Metaflow step. Sketch:

# Archetype:
tar_metaflow(some_name, some_command(dep1, dep2), cpu = 4, memory = 10000)

# Equivalent to:
tar_target(some_name, {
  metaflow::metaflow("some_name") %>%
    metaflow::step(
      step = "some_name",
      metaflow::decorator("batch", cpu = 4, memory = 10000),
      r_function = function(self) self$some_name <- some_command(dep1, dep2)
    ) %>%
    metaflow::run()
  download_artifact_from_aws("some_name") # needs to be defined
})

cc @savingoyal (for when you return) @jasonge27, @bcgalvin.

wlandau self-assigned this Aug 11, 2020
wlandau changed the title from "Target archetype that runs on AWS through Metaflow" to "A target archetype that runs on AWS through Metaflow" Aug 11, 2020

wlandau commented Aug 12, 2020

I need to figure out how to write download_artifact_from_aws(), but that should be straightforward in principle.
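
One possible shape for that helper, sketched under assumptions: it supposes the step serializes its artifact with saveRDS() and uploads it to a known S3 key, which is not necessarily how Metaflow's own datastore encodes artifacts, and the bucket and key layout are placeholders.

# Hypothetical helper, not a real metaflow or targets function.
download_artifact_from_aws <- function(name, bucket = "my-metaflow-bucket") {
  # aws.s3::s3readRDS() downloads and deserializes an RDS object from S3.
  aws.s3::s3readRDS(
    object = file.path("artifacts", paste0(name, ".rds")),
    bucket = bucket
  )
}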

A bigger issue is probably the way tar_metaflow() creates an entire new flow for each new target. This could lead to thousands of flows in practice, and I do not know if that will incur extra overhead. We could alternatively try to stick to a single flow for the entire targets pipeline, but that flow would have a completely different definition for each target, which might not bode well either.
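
To make the single-flow alternative concrete, here is a rough sketch. The flow name "targets_pipeline" is a placeholder, and targets::tar_name() just supplies the name of the target currently running; the open question above is whether Metaflow tolerates a flow whose step definition changes from run to run.

# Inside each target's command: one shared flow, one step per target.
metaflow::metaflow("targets_pipeline") %>%
  metaflow::step(
    step = targets::tar_name(),
    metaflow::decorator("batch", cpu = 4, memory = 10000),
    r_function = function(self) self$result <- some_command(dep1, dep2)
  ) %>%
  metaflow::run()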

mdneuzerling commented

Quick note on this: Metaflow doesn't support anonymous functions as written here. I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.


wlandau commented Aug 14, 2020

Quick note on this: Metaflow doesn't support anonymous functions as written here

Seems straightforward to work around if we define a named function inside the target's command.
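
As a sketch of that workaround, reusing the names from the original proposal:

tar_target(some_name, {
  # Define a named function inside the command instead of passing an
  # anonymous function straight to metaflow::step().
  run_some_name <- function(self) {
    self$some_name <- some_command(dep1, dep2)
  }
  metaflow::metaflow("some_name") %>%
    metaflow::step(
      step = "some_name",
      metaflow::decorator("batch", cpu = 4, memory = 10000),
      r_function = run_some_name
    ) %>%
    metaflow::run()
  download_artifact_from_aws("some_name") # needs to be defined
})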

I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.

Thank you so much, David! Really looking forward to this! If it works out, it could be a huge win-win.


wlandau commented Aug 20, 2020

My opinion is changing on this one. I think tar_metaflow() would still be nice for a small number of targets that need both AWS computing and S3 data versioning. However, since targets (and drake) already do distributed computing on clusters, I think AWS ParallelCluster might be a more natural fit for heavily scaled-out pipelines. Related: ropensci-books/targets#21.


wlandau commented Sep 6, 2020

I read up more on AWS ParallelCluster, AWS Batch, and Metaflow's HPC, and I no longer think ParallelCluster is something that makes sense to integrate with directly. I think a Metaflow target archetype makes more sense to start with, and the versioning could still help even after targets adopts the cloud.

Some future development ideas:

  1. For S3 storage on its own, let's try #8 ("A target archetype that runs on AWS through Metaflow"), using https://mdneuzerling.com/post/sourcing-data-from-s3-with-drake/.
  2. AWS Batch scheduling as an externalized algorithm subclass (related: targets#148, "Factor out future and clustermq scheduling algorithms"). It should look like the existing clustermq and future algorithm subclasses but be built on top of paws::batch(); a rough sketch of the paws layer follows this list.
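
A minimal sketch of the paws layer such a subclass might sit on. The job queue, job definition, and container command below are placeholders, not an existing targets interface.

library(paws)

# Submit one target's command as an AWS Batch job (illustrative names only).
batch <- paws::batch()
job <- batch$submit_job(
  jobName = "targets-some_name",
  jobQueue = "my-job-queue",             # assumed to already exist in the account
  jobDefinition = "my-r-job-definition", # assumed: a container image with R and targets
  containerOverrides = list(
    command = list("Rscript", "-e", "targets::tar_make(names = 'some_name')")
  )
)

# The scheduling algorithm would poll until the job succeeds or fails.
batch$describe_jobs(jobs = list(job$jobId))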


wlandau commented Sep 15, 2020

Just learned some neat stuff from experimenting with metaflow.org/sandbox. It gave me another idea for AWS S3 integration in targets: ropensci/targets#154.


wlandau commented Sep 28, 2020

Update: thanks to http://metaflow.org/sandbox, I think I figured out what AWS S3 integration in targets should look like, and we're off to a great start: ropensci/targets#176.

AWS Batch integration is going to be a lot harder. What we really need is a batchtools or clustermq for AWS Batch, plus a future extension on top of that. With that in place, there should be nothing more to implement in targets itself.
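
To make the gap concrete, the user-facing side could eventually look like the sketch below. The future.aws.batch package and its batch_aws plan are hypothetical placeholders for the missing backend; only future::plan() and tar_make_future() exist today.

# _targets.R (assumes a not-yet-written future backend for AWS Batch)
library(targets)
future::plan(future.aws.batch::batch_aws)  # placeholder: no such package yet
list(
  tar_target(some_name, some_command(dep1, dep2))
)

# Then, in the R console:
# tar_make_future(workers = 100)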

If/when we get that far, the value added by tar_metaflow() will just be the versioning system. But that in itself is a big deal, and it's something targets is never going to have on its own. (targets instead tries to make the data store light and readable so third-party data versioning tools have an easier time.)


wlandau commented Nov 13, 2020

On reflection, I am closing this issue. The maintainers of clustermq and future have expressed interest in supporting some form of AWS compute, which would automatically let targets deploy work to the cloud. I believe this is the best route for targets.

wlandau closed this as completed Nov 13, 2020