Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error propagation for download and upload client #2925

Closed
davidgcameron opened this issue Sep 24, 2019 · 16 comments · Fixed by #2952
Closed

Better error propagation for download and upload client #2925

davidgcameron opened this issue Sep 24, 2019 · 16 comments · Fixed by #2952

Comments

@davidgcameron
Copy link
Contributor

Motivation

Pilots using rucio mover do not report useful information on why a transfer failed. This is because on failure the download and upload client throw an exception with no information inside. The real reason for the error should be propagated better back to the caller of the client.

Modification

The dowloadclient and uploadclient should return a dictionary of file status (containing appropriate error information) instead of raising an exception.

@davidgcameron
Copy link
Contributor Author

Please assign to me :)

@mlassnig
Copy link
Contributor

avec plaisir ;-)

@bari12
Copy link
Member

bari12 commented Sep 24, 2019

I think there is a bit of duplication between #2408, #2168 and this. Would be good to clarify what is done where. 😄

@davidgcameron
Copy link
Contributor Author

Proposed changes here: davidgcameron@b97ac11

I didn't want to make a pull request yet, since we should discuss how these changes may affect other clients (if there are any?) using the download and upload clients. I've tested that the pilot (with my latest changes that were merged) and rucio download and upload CLI work ok.

@mlassnig
Copy link
Contributor

thanks!
@TWAtGH, can you please have a look

@TWAtGH
Copy link

TWAtGH commented Sep 26, 2019

Why do we need to do changes to the API at all? Are the log strings not enough?
I think it was implemented like this because Pilot expected Rucio to throw an exception if an error occurs.

@bari12
Copy link
Member

bari12 commented Sep 26, 2019

Removing the exceptions might be a bit problematic yes.
Why not put the info within the exception? Thus in the downloadclient line 1423 and 1425?
Adding it to the logging is also fine, but passing the error structurally on to the pilot is probably better within an exception?

@davidgcameron
Copy link
Contributor Author

Logs are not enough for callers of the API which need to pass error messages up to other systems (i.e. pilot passing the error back to panda to display on the panda monitor).

Exceptions are bad because in a bulk call you may not want to stop transfers just because one file failed. The exception also does not tell you which file failed unless you do some error-prone string parsing. The doc for download_pfns even says that it returns a dictionary with file states including FAILED, so I suppose there were not meant to be exceptions originally. Also I don't think that a transfer failure is an exceptional or unexpected situation so throwing an exception is not really the right thing to do.

@TWAtGH
Copy link

TWAtGH commented Sep 26, 2019

Alright, but there needs to be an aggregated error message per API call that could be passed to other systems like panda? Or how will it be displayed if one file fails because of no source found, another one fails because of checksum validation and a third one succeeds? 'Not all files downloaded' sounds reasonable for me at this point.

For the download the API will try to download all given files even if one fails. As Martin mentioned it should be possible to put one error string per file into the exception and this looks to me like the best (and non API breaking) solution. And I think this should then be done for every error and not just in case the protocol gives an error like in your code changes currently.

@davidgcameron
Copy link
Contributor Author

Summary of face to face discussion (please correct anything I got wrong): no change in API so as not to break backwards compatibility. There are two options for error propagation: pass the file status dictionary as a keyword argument in the exception, or use the information in the traces that is passed to download client (traces_copy_out) and filled with information.

davidgcameron added a commit to davidgcameron/rucio that referenced this issue Sep 27, 2019
davidgcameron added a commit to davidgcameron/rucio that referenced this issue Sep 27, 2019
@davidgcameron
Copy link
Contributor Author

The commits above revert the previous changes and instead fill the 'stateReason' of the traces with error messages. I had to add a parameter to the uploadclient to support passing a reference to a traces list (it was already in the download client).

@TWAtGH
Copy link

TWAtGH commented Sep 28, 2019

Thanks a lot! Looks fine to me except some small comments:

In the download client you have this unused variable state_reason:
https://github.com/davidgcameron/rucio/blob/8d0019545fe11cb626f0acd7e8a98fdbee1644b7/lib/rucio/client/downloadclient.py#L519

And I think you missed at least two points which raise errors quite frequently:

  1. no sources available:
    https://github.com/davidgcameron/rucio/blob/8d0019545fe11cb626f0acd7e8a98fdbee1644b7/lib/rucio/client/downloadclient.py#L475
  2. failed checksum validation:
    https://github.com/davidgcameron/rucio/blob/8d0019545fe11cb626f0acd7e8a98fdbee1644b7/lib/rucio/client/downloadclient.py#L590

Is there a reason the upload client uses the state_reason variable instead of writing it directly to the trace dict like the download client?

And one less important point: is stateReason as key name in the traces dict already used somewhere or could it might be changed to e.g., errorDescription ? That would also make clear why it's empty on success :)

@davidgcameron
Copy link
Contributor Author

Thanks for the comments. I fixed the minor issues.

Is there a reason the upload client uses the state_reason variable instead of writing it directly to the trace dict like the download client?

The download client sends a trace for each attempt whereas the upload client only sends the trace after all attempts have finished, so this is the reason to keep state_reason outside the attempt loop. I'm not sure if this difference is intentional or not.

And one less important point: is stateReason as key name in the traces dict already used somewhere or could it might be changed to e.g., errorDescription ? That would also make clear why it's empty on success :)

Maybe @tbeerman could answer that one, it could be used for example for the automatic declaration of lost files.

davidgcameron added a commit to davidgcameron/rucio that referenced this issue Oct 1, 2019
@davidgcameron
Copy link
Contributor Author

Travis tests still fail due to:

$ if [[ $SUITE == "all" ]]; then docker exec -it rucio /bin/sh -c "/opt/rucio/tools/run_tests_docker.sh" ; fi
Error: No such container: rucio
The command "if [[ $SUITE == "all" ]]; then docker exec -it rucio /bin/sh -c "/opt/rucio/tools/run_tests_docker.sh" ; fi" exited with 1.

Is that expected?

@bari12
Copy link
Member

bari12 commented Oct 1, 2019

Did you try to restart or does it consistently fail? Sometimes this one shows up but I think this is more travis related.

@davidgcameron
Copy link
Contributor Author

The tests also fail in the pull requests but with an Oracle error which I think is unrelated to my changes.

bari12 added a commit that referenced this issue Oct 7, 2019
…ion_in_download_and_upload_client

Clients: return error info in traces Fix #2925
bari12 added a commit that referenced this issue Oct 7, 2019
…ion_in_download_and_upload_client

Clients: return error info in traces Fix #2925
@bari12 bari12 modified the milestones: 1.20.7-clients, 1.20.8-clients Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants