EmrEtlRunner: add support for running steps on a persistent EMR cluster #3930

jbeemster · 2018-11-21T08:58:53Z

Using dataflow-runner to run the Snowflake DB transform and load has worked really well, but replicating this behaviour for Redshift loading is much more complicated because of the way the applications for this flow have been configured over time.

As a stop-gap measure to allow for this, and to take advantage of the speed that persistent clusters would give, is there a possibility of adding the ability to run EmrEtlRunner against an already-created cluster, in the same way that dataflow-runner works?

As it seems that we are going to be using this application for a while longer, it would be much easier to add this feature here than to write a custom logic layer on top of base dataflow-runner steps to do exactly the same thing.

Thoughts?

cc/ @alexanderdean @BenFradet @chuwy

jbeemster · 2018-11-21T09:19:16Z

Looking at the code, this is what we would need to add to support this:

At this line we would check whether a JobFlow already exists; if it does, we use Elasticity::JobFlow.from_jobflow_id('jobflow ID', 'region') rather than Elasticity::JobFlow.new.
We could quite easily write detection logic to fetch a JobFlow ID using the unique cluster name.
We would need to be able to change the steps' "action_on_failure" parameter. These all look hardcoded at the moment but could easily be edited, as they have public access.
We would need a "down" command to terminate the persistent cluster.
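To make the first three points concrete, here is a rough sketch of the attach-or-create decision and the step tweak. It is not the EmrEtlRunner code: `resolve_jobflow` and `keep_cluster_on_failure` are hypothetical helpers, the `attach` and `create` lambdas stand in for `Elasticity::JobFlow.from_jobflow_id` and `Elasticity::JobFlow.new` so the snippet runs without the gem, and choosing `CANCEL_AND_WAIT` as the action is an assumption about what a persistent cluster would want.

```ruby
# Hypothetical helper. `active_clusters` maps a unique cluster name to its
# JobFlow ID (the result of the detection logic described above).
# In EmrEtlRunner, `attach` would wrap Elasticity::JobFlow.from_jobflow_id(id, region)
# and `create` would wrap Elasticity::JobFlow.new.
def resolve_jobflow(cluster_name, active_clusters, attach:, create:)
  jobflow_id = active_clusters[cluster_name]
  jobflow_id ? attach.call(jobflow_id) : create.call
end

# Hypothetical helper. On a persistent cluster a failed step should not take
# the cluster down, so every step is switched to CANCEL_AND_WAIT (assumed choice).
def keep_cluster_on_failure(steps)
  steps.each { |step| step.action_on_failure = 'CANCEL_AND_WAIT' }
end

# Stand-ins so the sketch runs on its own; the names and IDs are made up.
step_class = Struct.new(:name, :action_on_failure)
steps = [step_class.new('enrich', 'TERMINATE_JOB_FLOW')]
clusters = { 'snowplow-etl' => 'j-EXAMPLE123' }
attach = ->(id) { "attached to #{id}" }
create = -> { 'created a new cluster' }

puts resolve_jobflow('snowplow-etl', clusters, attach: attach, create: create)
puts resolve_jobflow('unknown-name', clusters, attach: attach, create: create)
puts keep_cluster_on_failure(steps).first.action_on_failure
```

Passing the two constructors in as lambdas keeps the lookup logic testable without talking to EMR.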

This could work in one of two ways:

We specify that we want to use a persistent cluster and EMR detects whether one already exists; if it does, it uses it; if it cannot find one, it creates a new one.
We have a distinct "up" command, as with dataflow-runner.

In any case we would need a distinct command to terminate a persistent cluster so that no external logic would be needed to manage this system.
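A toy model of the two lifecycles, purely to illustrate the difference. The command names "run" and the `auto_create` flag are made up for this sketch (only "up" and "down" are mentioned above), and the `clusters` hash stands in for whatever EMR reports as running.

```ruby
# Hypothetical lifecycle model; `clusters` maps cluster name => JobFlow ID.
def handle(command, name, clusters, auto_create: false)
  case command
  when 'up'
    # Explicit creation, as with dataflow-runner.
    clusters[name] ||= "j-#{name}"
  when 'run'
    # Option 1 creates the cluster on demand; option 2 insists on a prior "up".
    clusters[name] ||= "j-#{name}" if auto_create
    raise ArgumentError, "no persistent cluster named #{name}" unless clusters[name]
    clusters[name]
  when 'down'
    # Needed in both options so no external tooling has to terminate the cluster.
    clusters.delete(name)
  else
    raise ArgumentError, "unknown command #{command}"
  end
end

clusters = {}
handle('up', 'etl', clusters)    # option 2: explicit "up"
handle('run', 'etl', clusters)   # reuses the cluster created above
handle('down', 'etl', clusters)  # terminates it
handle('run', 'etl', clusters, auto_create: true) # option 1: created on demand
```

Either way the "down" branch is the same, which is why it is needed regardless of which option is picked.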

yalisassoon · 2018-11-21T10:38:24Z

From my perspective this makes a lot of sense. It doesn't sound like very much functionality to add to EmrEtlRunner, but it would deliver a big benefit for our users in terms of much more frequent loading of Redshift.

jbeemster added the 3. Enrich label Nov 21, 2018

jbeemster added a commit that referenced this issue Dec 4, 2018

EmrEtlRunner: add support for running steps on a persistent EMR clust…

55e6c69

…er (closes #3930)

BenFradet added this to the R112 Baalbek (Batch increased stability) milestone Dec 26, 2018

BenFradet pushed a commit that referenced this issue Feb 13, 2019

EmrEtlRunner: add support for running steps on a persistent EMR clust…

c34dabf

…er (closes #3930)

BenFradet closed this as completed in 70ebffb Feb 19, 2019

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 26, 2020

Add support for running steps on a persistent EMR cluster (closes sno…

66dd5bd

…wplow/snowplow#3930)

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Add support for running steps on a persistent EMR cluster (closes sno…

79dbfce

…wplow/snowplow#3930)

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Add support for running steps on a persistent EMR cluster (closes sno…

3b718ed

…wplow/snowplow#3930)

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Add support for running steps on a persistent EMR cluster (close snow…

b8df572

…plow/snowplow#3930)
