EmrEtlRunner: add support for running steps on a persistent EMR cluster #3930

Closed
jbeemster opened this issue Nov 21, 2018 · 2 comments

Comments

@jbeemster
Member

Using dataflow-runner to run the Snowflake DB transform and load has worked really well, but replicating this behaviour for Redshift loading is much more complicated because of the way the applications for this flow have been configured over time.

As a stop-gap measure, and to take advantage of the speed that persistent clusters would give, is there a possibility of adding the ability to run EmrEtlRunner against an already-created cluster, in the same way that dataflow-runner works?

As it seems that we are going to be using this application for a while longer, it would be much easier to add this feature here than to write a custom logic layer on top of base dataflow-runner steps to do exactly the same thing.

Thoughts?

cc/ @alexanderdean @BenFradet @chuwy

@jbeemster
Member Author

jbeemster commented Nov 21, 2018

Looking at the code, this is what we would need to add to support this:

  1. At this line we check whether a JobFlow already exists; if it does, we use Elasticity::JobFlow.from_jobflow_id('jobflow ID', 'region') rather than Elasticity::JobFlow.new
  2. We could quite easily write detection logic to fetch out a JobFlow ID using the unique name
  3. We would need to be able to change the steps' "action_on_failure" parameter; these all look hardcoded at the moment, but could easily be edited as they have public access
  4. We would need a "down" command to terminate the persistent cluster
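
A minimal sketch of points 2 and 3, under stated assumptions: `ClusterSummary` and `Step` are hypothetical plain-Ruby stand-ins (not Elasticity or AWS SDK types), and the list of clusters is assumed to have been fetched from EMR by some other means. It only illustrates the lookup by unique name and the override of `action_on_failure`; the state names and the `CANCEL_AND_WAIT` value are standard EMR values, but whether they are the right choice for EmrEtlRunner is an assumption.

```ruby
# Hypothetical stand-in for one entry of an EMR cluster listing.
ClusterSummary = Struct.new(:id, :name, :state)

# Hypothetical stand-in for a step whose action_on_failure is writable.
Step = Struct.new(:name, :action_on_failure)

# EMR cluster states in which a cluster can still accept new steps.
REUSABLE_STATES = %w[STARTING BOOTSTRAPPING RUNNING WAITING].freeze

# Returns the ID of the first live cluster carrying the unique name,
# or nil when no such cluster exists (the caller would then create one).
def find_persistent_jobflow_id(clusters, name)
  match = clusters.find do |c|
    c.name == name && REUSABLE_STATES.include?(c.state)
  end
  match&.id
end

# Returns copies of the steps with action_on_failure replaced, so that a
# failed step need not tear down a cluster that is meant to stay up.
def with_action_on_failure(steps, action)
  steps.map { |s| Step.new(s.name, action) }
end
```

With the real gem, the returned ID would be the first argument to Elasticity::JobFlow.from_jobflow_id, and a nil result would fall back to Elasticity::JobFlow.new.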

This could work in one of two ways:

  1. We specify that we want to use a persistent cluster and EmrEtlRunner detects whether one exists already; if it does, it uses it; if it cannot find one, it creates a new one
  2. We have a distinct "up" command as with dataflow-runner

In any case we would need a distinct command to terminate a persistent cluster so that no external logic would be needed to manage this system.
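
The two modes and the terminate command could be sketched as a small planning function. This is an illustration only: the command names `up`, `run` and `down`, the `auto_create` flag and the action symbols are all hypothetical and do not correspond to existing EmrEtlRunner options.

```ruby
# Decides what to do for a command, given the ID of an already running
# persistent cluster (or nil when none was found).
#
# auto_create: true  models option 1 (detect, else create on demand).
# auto_create: false models option 2 (an explicit "up" is required first).
def plan_actions(command, existing_id, auto_create: true)
  case command
  when 'up'
    # Idempotent: reuse the cluster if it is already there.
    existing_id ? [[:reuse, existing_id]] : [[:create]]
  when 'run'
    if existing_id
      [[:reuse, existing_id], [:add_steps]]
    elsif auto_create
      [[:create], [:add_steps]]
    else
      raise ArgumentError, 'no persistent cluster found; run "up" first'
    end
  when 'down'
    # Nothing to do when no cluster exists.
    existing_id ? [[:terminate, existing_id]] : []
  else
    raise ArgumentError, "unknown command: #{command}"
  end
end
```

Keeping `down` in the same tool means the whole cluster lifecycle can be driven without external logic, which is the requirement stated above.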

@yalisassoon
Member

From my perspective this makes a lot of sense. It doesn't sound like very much functionality to add to EmrEtlRunner, but it would drive a big benefit for our users in terms of much more frequent loading of Redshift.


3 participants