
Rebase SparkSubmitTask onto ExternalProgramTask #1525

Merged
merged 4 commits into spotify:master from rebase-spark-onto-external-program-task on Jan 30, 2016

Conversation

@ehdr (Contributor) commented Jan 25, 2016

As a continuation of #1520, this is an example of how SparkSubmitTask could be "rebased" onto ExternalProgramTask, since most of the code is shared between them. Not sure if this is a good way to go?

The public interface of SparkSubmitTask remains as-is.
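
In rough outline, the change turns SparkSubmitTask into a thin subclass that only supplies the spark-submit command line, while process handling lives in the base class. A minimal sketch of that shape (assuming ExternalProgramTask exposes a program_args() hook; the Spark-specific helpers here are illustrative, not the exact ones in the patch):

```python
import luigi
from luigi.contrib.external_program import ExternalProgramTask


class SparkSubmitTask(ExternalProgramTask):
    """Sketch: spawning spark-submit, logging output, and raising on
    failure are all inherited from ExternalProgramTask; this class
    only assembles the command line."""

    app = None  # application jar or Python file to submit

    def spark_command(self):
        # Illustrative: build ['spark-submit', '--master', ...]
        # from the task's configuration.
        return ['spark-submit']

    def app_command(self):
        if not self.app:
            raise NotImplementedError('subclass should define an app')
        return [self.app]

    def program_args(self):
        # The hook the base class calls to get the command to run.
        return self.spark_command() + self.app_command()
```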

However, there will be subtle changes to the output to stdout and logs (e.g. 'Program failed[...]' with this patch vs. 'Spark job failed[...]' before). It will also raise an ExternalProgramRunError on execution errors instead of a SparkJobError as before. Are these considered acceptable deviations from backwards compatibility?
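
For downstream code that catches the old exception, the compatibility impact would look roughly like this (a hedged sketch; the exception's module path is assumed from luigi's contrib layout, and run_spark_task is a hypothetical helper):

```python
import logging

from luigi.contrib.external_program import ExternalProgramRunError

logger = logging.getLogger(__name__)


def run_spark_task(task):
    try:
        task.run()
    except ExternalProgramRunError as err:
        # Previously this would have been luigi.contrib.spark.SparkJobError;
        # after the rebase, failures surface as ExternalProgramRunError.
        logger.error('spark-submit failed: %s', err)
        raise
```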

Finally, it will now log stdout and stderr whenever (and only when) they are non-empty, even if the process exits successfully (return code 0). Previously, stderr was only logged if the return code was non-zero. [changed back to preserve the old behavior in a later commit, thanks to a comment from @deuxpi]

@deuxpi commented Jan 25, 2016

The stdout from a Spark job can be terribly verbose, so it's nice to have that silenced in the case where there were no errors. I personally would like that behavior to stay available somehow, even if I have to add a flag parameter to each of our task classes.

```diff
@@ -189,12 +146,18 @@ def hadoop_conf_dir(self):
         return configuration.get_config().get("spark", "hadoop-conf-dir", None)

     def get_environment(self):
-        env = os.environ.copy()
+        env = super(SparkSubmitTask, self).program_environment()
```
A reviewer (Contributor) commented:

not a big fan of calling methods on the parent class. should only be doable for constructors

@ehdr (Author) replied:

Sure! It's just doing os.environ.copy() anyway. I'll change it.
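
The follow-up presumably inlines the copy rather than going through the parent class. A sketch of the resulting method, with the HADOOP_CONF_DIR handling inferred from the hadoop_conf_dir property visible in the hunk above (the patch's exact variable handling may differ):

```python
import os

from luigi import configuration
from luigi.contrib.external_program import ExternalProgramTask


class SparkSubmitTask(ExternalProgramTask):

    @property
    def hadoop_conf_dir(self):
        # As shown in the hunk above.
        return configuration.get_config().get("spark", "hadoop-conf-dir", None)

    def get_environment(self):
        env = os.environ.copy()  # inline copy; no call to the parent class
        hadoop_conf_dir = self.hadoop_conf_dir
        if hadoop_conf_dir:
            # Assumed: expose the setting to spark-submit via HADOOP_CONF_DIR.
            env['HADOOP_CONF_DIR'] = hadoop_conf_dir
        return env
```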

@ehdr (Author) commented Jan 25, 2016

@deuxpi Good idea. But stdout was always logged even before this patch, so that does not change. The only change is for stderr.

It could be a nice new feature though, to add an option to turn it off?

@deuxpi commented Jan 25, 2016

@ehdr Ack! Sorry, I meant stderr. An option would be great; I remember at least one time when I was actually interested in the output of Spark, but most of the time I'm clearly not.

@Tarrasch (Contributor) commented:

> However, there will be subtle changes to the output to stdout and logs (e.g. 'Program failed[...]' with this patch vs. 'Spark job failed[...]' before). It will also raise an ExternalProgramRunError on execution errors instead of a SparkJobError as before. Are these considered acceptable deviations from backwards compatibility?

Yes I think this is acceptable. :)

@Tarrasch (Contributor) commented:

I like this patch; I'm +1 on the idea. @ehdr, when you think you've addressed all the comments on this PR, let us know and we'll merge.

@ehdr force-pushed the rebase-spark-onto-external-program-task branch 2 times, most recently from 002d95b to db48a78 on January 26, 2016 22:25
Add a property to ExternalProgramTask, which can be overridden to
only log `stderr` when the program execution fails. The default remains
to always log `stderr`, even if execution succeeds.
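
A simplified sketch of what this commit describes (the run() body here is illustrative, using subprocess directly for self-containment; the real task raises ExternalProgramRunError rather than RuntimeError and handles output differently):

```python
import logging
import subprocess

import luigi

logger = logging.getLogger('luigi-interface')


class ExternalProgramTask(luigi.Task):

    @property
    def always_log_stderr(self):
        """Override to return False to log stderr only when the program
        fails; the default keeps logging it even on success."""
        return True

    def program_args(self):
        raise NotImplementedError

    def run(self):
        proc = subprocess.run(self.program_args(),
                              capture_output=True, text=True)
        succeeded = proc.returncode == 0
        if proc.stderr and (self.always_log_stderr or not succeeded):
            logger.info('Program stderr:\n%s', proc.stderr)
        if not succeeded:
            # The real task raises ExternalProgramRunError here.
            raise RuntimeError(
                'Program failed with return code %s' % proc.returncode)
```

SparkSubmitTask can then override the property to return False, which is how the later commit mentioned in the description restores the previous only-on-failure behavior for stderr.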
Seems to be an obsolete left-over from something.
@ehdr force-pushed the rebase-spark-onto-external-program-task branch from db48a78 to 8d7876c on January 26, 2016 22:29
The public interface of `SparkSubmitTask` remains as-is.

However, there will be subtle changes to the output to `stdout` and
logs (e.g. 'Program failed[...]' with this patch vs. 'Spark job
failed[...]' before). It will also raise an `ExternalProgramRunError`
on execution errors instead of a `SparkJobError` as before.
@ehdr force-pushed the rebase-spark-onto-external-program-task branch from 8d7876c to a0417c6 on January 26, 2016 22:39
@ehdr (Author) commented Jan 26, 2016

I think I got everything now @Tarrasch @deuxpi @erikbern. Not super elegant... but please comment! :)

@Tarrasch (Contributor) commented:

Great! All improvements are welcome and most code in this code base is not really elegant ^^

erikbern added a commit that referenced this pull request on Jan 30, 2016:

Rebase SparkSubmitTask onto ExternalProgramTask
@erikbern merged commit f771622 into spotify:master on Jan 30, 2016
@ehdr deleted the rebase-spark-onto-external-program-task branch on January 30, 2016 21:17