EmrEtlRunner: add support for running steps on a persistent EMR cluster #3930
Comments
Looking at the code, what we would need to add to support this:
This could work in one of two ways:
In any case, we would need a distinct command to terminate a persistent cluster, so that no external logic is needed to manage this system.
From my perspective this makes a lot of sense. It doesn't sound like very much functionality to add to EmrEtlRunner, but it would bring a big benefit for our users in the form of much more frequent loading of Redshift.
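A minimal sketch of the two operations mentioned in this thread (submitting steps to an already running cluster, and a separate terminate command), assuming the AWS EMR `AddJobFlowSteps` and `TerminateJobFlows` API calls as exposed by a boto3-style client. This is Python for illustration only, not EmrEtlRunner code; the function names, the step name, and the arguments are hypothetical.

```python
from typing import Any, Dict, List


def build_step(name: str, jar: str, args: List[str]) -> Dict[str, Any]:
    """Describe one EMR step in the shape expected by AddJobFlowSteps."""
    return {
        "Name": name,
        # CANCEL_AND_WAIT leaves the cluster running if the step fails,
        # which is the behaviour a persistent cluster needs.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


def submit_steps(emr_client: Any, cluster_id: str,
                 steps: List[Dict[str, Any]]) -> List[str]:
    """Add steps to an existing cluster instead of launching a new one."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


def terminate_cluster(emr_client: Any, cluster_id: str) -> None:
    """Distinct 'terminate' command for the persistent cluster."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

With boto3 installed, `emr_client` would be `boto3.client("emr")` and `cluster_id` the ID of the already created cluster. Passing the client in as a parameter keeps the sketch testable without AWS credentials.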
Using dataflow-runner to run the Snowflake DB transform and load has worked really well, but replicating this behaviour for Redshift loading is much more complicated because of the way the applications for this flow have been configured over time.
As a stop-gap measure, and to take advantage of the speed that persistent clusters would give, is there a possibility of adding the ability to run EmrEtlRunner against an already created cluster, in the same way that dataflow-runner works?
As it seems that we are going to be using this application for a while longer, it would be much easier to add this feature here than to write a custom logic layer on top of base dataflow-runner steps to do exactly the same thing.
Thoughts?
cc/ @alexanderdean @BenFradet @chuwy