rucio download watchdog thread #4153

Closed
mlassnig opened this issue Nov 24, 2020 · 9 comments

@mlassnig
Contributor

Supervise the download to check if actually bytes are moving. This to better deal with timeout issues and misbehaving clients/storage.
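One way to implement such supervision, sketched below under assumptions (the function name, polling interval, and callback are hypothetical, not Rucio API): a daemon thread polls the size of the partial download file and reports a stall when no bytes have moved for a configurable period.

```python
import os
import threading
import time

def watch_download(path, stall_timeout=60.0, poll_interval=5.0, on_stall=None):
    """Watch the partial download at `path` from a daemon thread.

    If the file's size stops growing for `stall_timeout` seconds,
    call `on_stall(path)` and exit. Returns (stop_event, thread);
    set the event to shut the watchdog down cleanly.
    """
    stop = threading.Event()

    def _watch():
        last_size = -1
        last_change = time.monotonic()
        while not stop.is_set():
            try:
                size = os.stat(path).st_size
            except OSError:
                size = -1  # file may not exist yet; counts as "no progress"
            now = time.monotonic()
            if size != last_size:
                # Bytes are moving: remember the new size and reset the clock.
                last_size, last_change = size, now
            elif now - last_change > stall_timeout:
                if on_stall is not None:
                    on_stall(path)
                return
            stop.wait(poll_interval)

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return stop, thread
```

Polling the on-disk size keeps the watchdog independent of the transfer protocol; it only detects a stall, leaving any recovery action to the caller.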

@rcarpa
Contributor

rcarpa commented Feb 17, 2021

What level of implementation is needed for this ticket? Is it simply detecting that a download is stuck? Or do we want to kill the active download and restart with the next source? Can we afford the luxury of an additional watchdog thread for each download_worker thread, or should I try to put all the supervision in the main thread?

@mlassnig
Contributor Author

@PalNilsson what are the implications for the pilot here?

@PalNilsson

The pilot still needs to be able to kill the main process, i.e. doing so should not leave this new thread hanging (that would prevent the pilot from finishing, since it in turn waits for its threads to finish). Killing the main process must therefore stop the new thread as well.
Also, I'll "soon" start testing parallel downloads, so it may not be a good idea to spawn too many additional threads. (How many simultaneous downloads can we handle? 5?)
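Paul's constraint maps naturally onto Python daemon threads: the interpreter does not wait for them on exit, so killing the main process cannot leave the watchdog hanging. A minimal sketch (names are illustrative, not Rucio's):

```python
import threading

def supervise(stop_event, interval=1.0):
    """Placeholder supervision loop; a real watchdog would check
    download progress here on every iteration."""
    while not stop_event.wait(interval):
        pass  # periodic progress check would go here

stop = threading.Event()
# daemon=True: the interpreter exits without waiting for this thread,
# so killing the main process can never leave the watchdog hanging.
watchdog = threading.Thread(target=supervise, args=(stop,), daemon=True)
watchdog.start()
```

The stop event additionally allows a clean, cooperative shutdown when the main process exits normally rather than being killed.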

@mlassnig
Contributor Author

Thanks Paul, we should aim for 3 parallel downloads on average to start with. This is what we had in the past, and it didn't "overload" the storage.

@rcarpa
Contributor

rcarpa commented Feb 17, 2021

Five additional threads is not something I would worry about, but I'm not sure what the functional requirements of this ticket are. For what purpose do we want to monitor the download, and should I implement any actions when a failed download is detected? Considering that many "protocols" are implemented with blocking I/O, I don't even know if I can easily implement recovery from a stuck download in a protocol-agnostic way. The ticket says that we want "to better deal with timeout issues and misbehaving clients/storage". Is this behavior only needed when using the download client in the pilot with the gfal protocol, or should it be implemented generically, for all protocols?

In the first case, there is already the basis for some "transfer_timeout" logic, with a non-working implementation for the gfal protocol that I can try to fix. But gfal is only one of the two protocols that will support this flag. Is this the desired solution?

Alternatively, in the second case, I would have to think about how to rework the download client internals so that the watchdog function lives in the protocol-agnostic code. However, I'm not aware of any way to terminate a thread stuck in blocking I/O in non-async Python. (Do you know one?) That means I can detect a stuck download, but I'm not sure I'll be able to gracefully recover from it.
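One protocol-agnostic workaround, since a thread stuck in blocking I/O cannot be forcibly terminated in CPython: run the download in a child process, which can always be killed from the outside. A hedged sketch (this is not how Rucio's download client is structured; the function is hypothetical):

```python
import multiprocessing

def run_with_watchdog(download_fn, args, timeout):
    """Run `download_fn(*args)` in a child process and give up after
    `timeout` seconds. Unlike a thread stuck in blocking I/O, a child
    process can always be terminated from the outside.

    Returns True on success, False if the download stalled or failed,
    in which case the caller is free to retry with the next source.
    """
    proc = multiprocessing.Process(target=download_fn, args=args)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # SIGTERM; could escalate to proc.kill()
        proc.join()
        return False
    return proc.exitcode == 0
```

The trade-off is the process-spawn overhead per download and the need to pass results back through a pipe or the filesystem instead of shared memory.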

@rodwalker

Is this a good place to add the requirement that any timeout scale with file size? The lack of this causes problems for users downloading big files, which hit the default timeout. A short timeout on byte movement is of course better, but a scaling global timeout is trivial to implement, e.g. assuming 1 MB/s.

Cheers,
Rod.
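Rod's proposal amounts to a one-liner: the global timeout becomes a fixed overhead plus the file size divided by an assumed worst-case rate. A sketch with illustrative numbers (the defaults are assumptions for this example, not Rucio's actual values):

```python
def scaled_transfer_timeout(file_size_bytes, base_timeout=60.0,
                            min_rate_bytes_per_s=1_000_000):
    """Global timeout that grows with file size: a fixed overhead plus
    the time the transfer would take at an assumed worst-case rate
    (1 MB/s here, per Rod's suggestion). All defaults are illustrative."""
    return base_timeout + file_size_bytes / min_rate_bytes_per_s
```

For example, a 1 GB file would get the 60 s base plus 1000 s of transfer budget, while small files keep a short timeout.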

@rcarpa
Contributor

rcarpa commented Mar 8, 2021

Hi Rod,
Solving this issue is probably the best way to allow enforcing dynamic timeout. However, as you noticed yourself, it will be difficult and take time. Very recently, Otilia opened an issue on this exact topic: #4374. We intend to implement a compromise solution similar to the one you proposed.
Cheers,
Radu

@bari12
Member

bari12 commented Jun 3, 2021

Dynamic timeout is implemented; we still need to check with the gfal developers why params.timeout is not always honored:

params.timeout = int(transfer_timeout)

We will not continue with the watchdog thread, but instead aim for proper handling in gfal.

@bari12
Member

bari12 commented Oct 13, 2021

Closing this now; we can re-open should this need re-discussion.

@bari12 bari12 closed this as completed Oct 13, 2021