New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rucio download watchdog thread #4153
Comments
What level of implementation needed from this ticket? Is it simple detecting that a download is stuck ? Or do we want to kill the active download and restart with the next source ? Do we allow the luxury to create an additional watchdog thread for each download_worker thread, or do I try to put all the supervision in the main thread? |
@PalNilsson what are the implications for the pilot here? |
The pilot still needs to be able to kill the main process - ie doing so should not leave this new thread hanging (as that will cause problems with the pilot to finish, as it is in turn waiting for threads to finish). Killing the main process must therefore lead to stopping the new thread as well. |
Thanks Paul, we should aim at 3 parallel downloads on average to start with. This is what we had in the past and didn't "overload" the storage. |
5 additional threads is not something I would worry about, but I'm not sure which are the functional requirements of this ticket. For what purpose do we want to monitor the download. And should I implement any actions when a failed download is detected? Considering that many "protocols" are implemented with blocking i/o; I don't even know if I can easily implement recovery from a stuck download in a protocol-agnostic way. The ticket says that we want "to better deal with timeout issues and misbehaving clients/storage". Is this behavior only needed when using the download client in pilot with the gfal protocol? or should it be implemented generically, for all protocols. In the first case, there is already the basis for some "transfer_timeout" logic which has a non-working implementation for the gfal protocol and I can try to fix it. But this is only one of 2 protocols which will support this flag. Is this the desired solution? Alternatively, in the second case, I will have to think about how to re-work the download client internals to have the watchdog function implemented in the protocol-agnostic code. However, I'm not aware of any way to terminate a thread stuck doing blocking i/o in non-async python. (Do you know any? ). Meaning I can detect a stuck download, but not sure if I'll be able to gracefully recover from it. |
Is this a good place to add the requirement to have any timeout scale with file-size. Lack of this causes problems for users downloading big files, that hit the default timeout. A short timeout on bytes moves is of course better, but the scaling global timeout is trivial to implement, e..g. assume 1MB/s. Cheers, |
Hi Rod, |
Dynamic timeout is implemented; rucio/lib/rucio/rse/protocols/gfal.py Line 429 in a106bb1
Will not continue with watchdog thread, but aim to proper handling in gfal. |
Closing this now; We can re-open should this need re-discussion. |
Supervise the download to check if actually bytes are moving. This to better deal with timeout issues and misbehaving clients/storage.
The text was updated successfully, but these errors were encountered: