A target archetype that runs on AWS through Metaflow #8

Closed · 3 tasks done
wlandau opened this issue Aug 11, 2020 · 8 comments


wlandau commented Aug 11, 2020

Prework

  • I understand and agree to tarchetypes' code of conduct.
  • I understand and agree to tarchetypes' contributing guidelines.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider posting a "trouble" or "other" issue first so we can discuss your use case and search for existing solutions.

Proposal

targets (and drake) are currently tied to traditional HPC (Slurm, SGE, etc.). That's enough for me and my team right now, but not for the increasing number of people like @MilesMcBain who rely on AWS and other cloud platforms. There is not yet much cloud support native to the R ecosystem, and since I don't use AWS for my own work, I am not prepared to do much at a low level.

Metaflow not only runs computation in the cloud and stores the results on S3; it also abstracts away the devops overhead that comes with that, and it supports a sophisticated versioning system for code and data. I think we will gain a ton of power and convenience if we leverage Metaflow's rich and potentially complementary feature set in situations where targets needs the cloud.

Earlier, I proposed a targets-within-Metaflow approach, which I think would be useful for people with multiple {targets} pipelines in the same project. Here, I would like to explore the reverse: a target archetype that runs some R code as a single AWS Metaflow step. Sketch:

# Archetype:
tar_metaflow(some_name, some_command(dep1, dep2), cpu = 4, memory = 10000)

# Equivalent to:
tar_target(some_name, {
  metaflow::metaflow("some_name") %>%
    metaflow::step(
      step = "some_name",
      metaflow::decorator("batch", cpu = 4, memory = 10000),
      r_function = function(self) self$some_name <- some_command(dep1, dep2)
    ) %>%
    metaflow::run()
  download_artifact_from_aws("some_name") # needs to be defined
})

cc @savingoyal (for when you return) @jasonge27, @bcgalvin.

wlandau self-assigned this Aug 11, 2020
wlandau changed the title from "Target archetype that runs on AWS through Metaflow" to "A target archetype that runs on AWS through Metaflow" Aug 11, 2020

wlandau commented Aug 12, 2020

I need to figure out how to write download_artifact_from_aws(), but that should be straightforward in principle.
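
One possible shape for that helper, sketched under assumptions: it supposes the step serializes its artifact with saveRDS() and uploads it to a known S3 key, which is not necessarily how Metaflow's own datastore encodes artifacts, and the bucket and key layout are placeholders.

# Hypothetical helper, not a real metaflow or targets function.
download_artifact_from_aws <- function(name, bucket = "my-metaflow-bucket") {
  # aws.s3::s3readRDS() downloads and deserializes an RDS object from S3.
  aws.s3::s3readRDS(
    object = file.path("artifacts", paste0(name, ".rds")),
    bucket = bucket
  )
}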

A bigger issue is probably the way tar_metaflow() creates an entire new flow for each new target. This could lead to thousands of flows in practice, and I do not know if that will incur extra overhead. We could alternatively try to stick to a single flow for the entire targets pipeline, but that flow would have a completely different definition for each target, which might not bode well either.
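
To make the single-flow alternative concrete, here is a rough sketch. The flow name "targets_pipeline" is a placeholder, and targets::tar_name() just supplies the name of the target currently running; the open question above is whether Metaflow tolerates a flow whose step definition changes from run to run.

# Inside each target's command: one shared flow, one step per target.
metaflow::metaflow("targets_pipeline") %>%
  metaflow::step(
    step = targets::tar_name(),
    metaflow::decorator("batch", cpu = 4, memory = 10000),
    r_function = function(self) self$result <- some_command(dep1, dep2)
  ) %>%
  metaflow::run()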

mdneuzerling commented

Quick note on this: Metaflow doesn't support anonymous functions as written here. I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.


wlandau commented Aug 14, 2020

Quick note on this: Metaflow doesn't support anonymous functions as written here

Seems straightforward to work around if we define a named function inside the target's command.
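
As a sketch of that workaround, reusing the names from the original proposal:

tar_target(some_name, {
  # Define a named function inside the command instead of passing an
  # anonymous function straight to metaflow::step().
  run_some_name <- function(self) {
    self$some_name <- some_command(dep1, dep2)
  }
  metaflow::metaflow("some_name") %>%
    metaflow::step(
      step = "some_name",
      metaflow::decorator("batch", cpu = 4, memory = 10000),
      r_function = run_some_name
    ) %>%
    metaflow::run()
  download_artifact_from_aws("some_name") # needs to be defined
})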

I think it's an easy, non-breaking change and I've drafted some code. I'll clean it up and submit that as a pull request to the repo.

Thank you so much, David! Really looking forward to this! If it works out, it could be a huge win-win.


wlandau commented Aug 20, 2020

My opinion is changing on this one. I think tar_metaflow() would still be nice for a small number of targets that need both AWS computing and S3 data versioning. However, since targets (and drake) already do distributed computing on clusters, I think AWS ParallelCluster might be a more natural fit for heavily scaled-out pipelines. Related: ropensci-books/targets#21.


wlandau commented Sep 6, 2020

I read up more on AWS ParallelCluster, AWS Batch, and Metaflow's HPC, and I no longer think ParallelCluster is something that makes sense to integrate with directly. I think a Metaflow target archetype makes more sense to start with, and the versioning could still help even after targets adopts the cloud.

Some future development ideas:

  1. For S3 storage on its own, let's try #8 ("A target archetype that runs on AWS through Metaflow"), using https://mdneuzerling.com/post/sourcing-data-from-s3-with-drake/.
  2. AWS Batch scheduling as an externalized algorithm subclass (related: targets#148, "Factor out future and clustermq scheduling algorithms"). It should look like the existing clustermq and future algorithm subclasses but be built on top of paws::batch(); a rough sketch of the paws layer follows this list.
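
A minimal sketch of the paws layer such a subclass might sit on. The job queue, job definition, and container command below are placeholders, not an existing targets interface.

library(paws)

# Submit one target's command as an AWS Batch job (illustrative names only).
batch <- paws::batch()
job <- batch$submit_job(
  jobName = "targets-some_name",
  jobQueue = "my-job-queue",             # assumed to already exist in the account
  jobDefinition = "my-r-job-definition", # assumed: a container image with R and targets
  containerOverrides = list(
    command = list("Rscript", "-e", "targets::tar_make(names = 'some_name')")
  )
)

# The scheduling algorithm would poll until the job succeeds or fails.
batch$describe_jobs(jobs = list(job$jobId))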


wlandau commented Sep 15, 2020

Just learned some neat stuff from experimenting with metaflow.org/sandbox. It gave me another idea for AWS S3 integration in targets: ropensci/targets#154.


wlandau commented Sep 28, 2020

Update: thanks to http://metaflow.org/sandbox, I think I figured out what AWS S3 integration in targets should look like, and we're off to a great start: ropensci/targets#176.

AWS Batch integration is going to be a lot harder. What we really need is a batchtools or clustermq for AWS Batch, plus a future extension on top of that. With that in place, there should be nothing more to implement in targets itself.
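
To make the gap concrete, the user-facing side could eventually look like the sketch below. The future.aws.batch package and its batch_aws plan are hypothetical placeholders for the missing backend; only future::plan() and tar_make_future() exist today.

# _targets.R (assumes a not-yet-written future backend for AWS Batch)
library(targets)
future::plan(future.aws.batch::batch_aws)  # placeholder: no such package yet
list(
  tar_target(some_name, some_command(dep1, dep2))
)

# Then, in the R console:
# tar_make_future(workers = 100)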

If/when we get that far, the value added by tar_metaflow() will just be the versioning system. But that in itself is a big deal, and it's something targets is never going to have on its own. (targets instead tries to make the data store light and readable so third-party data versioning tools have an easier time.)


wlandau commented Nov 13, 2020

On reflection, I am closing this issue. The maintainers of clustermq and future have expressed interest in supporting some form of AWS compute, which would automatically let targets deploy work to the cloud. I believe this is the best route for targets.

wlandau closed this as completed Nov 13, 2020