
Add retries when polling for the job status in BigQuery #1946

Merged — 2 commits merged into spotify:master on Dec 7, 2016

Conversation

@fabriziodemaria (Contributor) commented Dec 2, 2016

Description

Add retries when polling for the job status in BigQuery, in case specific HTTP errors are encountered.

Motivation and Context

As suggested by Google's documentation, when a 5xx error is encountered while polling for the job status via jobs.get(), it is advised to poll again [1].
Without this fix, Luigi might exit with a failing Task even though the underlying load job in BigQuery is still running.

Additionally, job success is now verified using status.errorResult instead of status.errors [2].

Have you tested this? If so, how?

  • Tested that these changes cover a success scenario in the exact same way as before.
  • Tested that critical failures are handled in the same way as before, but using errorResult:
  File "/src/luigi/luigi/contrib/bigquery.py", line 332, in run_job
    raise Exception('BigQuery job failed: {}'.format(status['status']['errorResult']))
Exception: BigQuery job failed: {u'reason': u'invalid', u'message': u'The Apache Avro library failed to parse file...
  • Unfortunately it is hard to reproduce a 5xx HTTP error to test the retrying mechanism [3], but in the worst case Luigi would simply fail instead of retrying, thus handling this scenario no worse than before.

[1] - https://cloud.google.com/bigquery/troubleshooting-errors#backendError
[2] - https://cloud.google.com/bigquery/loading-data#loading_avro_files
[3] - https://google.github.io/google-api-python-client/docs/epy/googleapiclient.http.HttpRequest-class.html#execute
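The retry-on-5xx polling described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `MAX_POLLING_RETRIES` is assumed from the diff snippets in this conversation, and a plain `IOError` plays the role of `googleapiclient.errors.HttpError` so the sketch stays self-contained.

```python
import time

MAX_POLLING_RETRIES = 3  # hypothetical limit; the PR's constant may differ

def poll_job_status(get_status, max_retries=MAX_POLLING_RETRIES):
    """Poll a job until it is DONE, retrying transient (5xx-style) errors.

    `get_status` stands in for the jobs.get(...).execute() call; here any
    IOError plays the role of googleapiclient.errors.HttpError.
    """
    retries = 0
    while True:
        try:
            status = get_status()
        except IOError:
            if retries >= max_retries:
                raise  # give up after the configured number of retries
            retries += 1
            time.sleep(retries)  # back off a little longer each time
            continue
        if status['status']['state'] == 'DONE':
            return status
        time.sleep(1)  # job still running; poll again shortly
```

With a linearly growing sleep, transient 503s are absorbed while a persistent outage still surfaces after `max_retries` attempts.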

@mention-bot

@fabriziodemaria, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mikekap, @ulzha and @mbruggmann to be potential reviewers.

try:
    status = self.client.jobs().get(projectId=project_id, jobId=job_id).execute()
    if status['status']['state'] == 'DONE':
        if status['status'].get('errors'):


Errors being present doesn't imply that the job failed; the presence of errorResult means that the job failed.
See https://cloud.google.com/bigquery/loading-data#loading_avro_files and example here: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/load_data_from_file.py#L58
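The distinction can be captured in a small predicate. This is a sketch over the status dict shape shown in the snippet above, not code from the PR:

```python
def job_failed(status):
    """Return True only when the job actually failed.

    status['status']['errors'] can list non-fatal issues even for a job
    that succeeded; status['status']['errorResult'] is present only when
    the job as a whole failed.
    """
    return bool(status.get('status', {}).get('errorResult'))
```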

@fabriziodemaria (Contributor, Author)

Good catch, will update this to use errorResult instead.

if int(httpError.resp.status) == 503 and poll_counter < self.MAX_POLLING_RETRIES:
    poll_counter = poll_counter + 1
    logger.warning('Got error 503 while requesting info for job %s:%s, retrying in %d seconds...', project_id, job_id, poll_counter)
    time.sleep(float(poll_counter))
@bergman (Contributor) commented Dec 6, 2016

It doesn't have to be a float, at least according to the docstring of time.sleep. You could remove the .0 in the other call too.

@fabriziodemaria (Contributor, Author)

+1

@mbruggmann (Member)

@fabriziodemaria Have you tried request.execute(num_retries=X)? Looks like that is a thing in the google api python client...

@fabriziodemaria (Contributor, Author)

> Have you tried request.execute(num_retries=X)? Looks like that is a thing in the google api python client...

Awesome @mbruggmann, thanks for pointing that out! I am changing the PR to make use of that instead.
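The suggestion boils down to delegating retries to the client library: num_retries is a real parameter of HttpRequest.execute() in google-api-python-client, which retries 5xx responses with exponential backoff. A sketch of the resulting call (the helper name and the retry count are illustrative, not the PR's code):

```python
def get_job_status(client, project_id, job_id, num_retries=5):
    """Fetch a BigQuery job's status, letting the API client retry 5xx errors.

    Passing num_retries to execute() makes google-api-python-client retry
    transient server errors itself, with exponential backoff, so the manual
    retry loop is no longer needed.
    """
    return (client.jobs()
                  .get(projectId=project_id, jobId=job_id)
                  .execute(num_retries=num_retries))
```

The helper works with any object exposing the `jobs().get(...).execute(...)` chain, which also makes it easy to exercise with a fake client in tests.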

@bergman (Contributor) commented Dec 6, 2016

👍

-if status['status'].get('errors'):
-    raise Exception('BigQuery job failed: {}'.format(status['status']['errors']))
+if status['status'].get('errorResult'):
+    raise Exception('BigQuery job failed: {}'.format(status['status']['errorResult']))
Contributor

Is 'errors' no longer in the dict at this point? I would make sure that the exception dumps all the information available. Why not the whole status?
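The reviewer's suggestion could look like this sketch (the function name is hypothetical; the status shape follows the snippets above). Formatting `status['status']` wholesale keeps errorResult, errors, and any other diagnostic fields in the exception text:

```python
def check_job_status(status):
    """Raise if the job failed, including the full status in the message.

    A job is considered failed only when errorResult is present, but the
    exception message dumps the whole status sub-dict so nothing is lost.
    """
    if status['status'].get('errorResult'):
        raise Exception('BigQuery job failed: {}'.format(status['status']))
```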

@ulzha (Contributor) commented Dec 6, 2016

Great work!

If we have tested that the code is capable of handling a success scenario and a failure scenario no worse than the previous code, then :shipit:

@fabriziodemaria (Contributor, Author)

@Uldis

  • I tested that these changes cover a success scenario in the exact same way as before.
  • I tested that critical failures are handled in the same way as before, but using errorResult:
  File "/src/luigi/luigi/contrib/bigquery.py", line 332, in run_job
    raise Exception('BigQuery job failed: {}'.format(status['status']['errorResult']))
Exception: BigQuery job failed: {u'reason': u'invalid', u'message': u'The Apache Avro library failed to parse file...
  • Unfortunately it is hard to reproduce a 5xx HTTP error to test the retrying mechanism, but in the worst case Luigi would simply fail instead of retrying, thus handling this scenario no worse than before.

@bergman (Contributor) commented Dec 7, 2016

👍

@bergman bergman merged commit 644d11b into spotify:master Dec 7, 2016