rucio download watchdog thread #4153

Closed
mlassnig opened this issue Nov 24, 2020 · 9 comments

@mlassnig
Contributor

Supervise the download to check if actually bytes are moving. This to better deal with timeout issues and misbehaving clients/storage.
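One way to implement such supervision, sketched below under assumptions (the function name, polling interval, and callback are hypothetical, not Rucio API): a daemon thread polls the size of the partial download file and reports a stall when no bytes have moved for a configurable period.

```python
import os
import threading
import time

def watch_download(path, stall_timeout=60.0, poll_interval=5.0, on_stall=None):
    """Watch the partial download at `path` from a daemon thread.

    If the file's size stops growing for `stall_timeout` seconds,
    call `on_stall(path)` and exit. Returns (stop_event, thread);
    set the event to shut the watchdog down cleanly.
    """
    stop = threading.Event()

    def _watch():
        last_size = -1
        last_change = time.monotonic()
        while not stop.is_set():
            try:
                size = os.stat(path).st_size
            except OSError:
                size = -1  # file may not exist yet; counts as "no progress"
            now = time.monotonic()
            if size != last_size:
                # Bytes are moving: remember the new size and reset the clock.
                last_size, last_change = size, now
            elif now - last_change > stall_timeout:
                if on_stall is not None:
                    on_stall(path)
                return
            stop.wait(poll_interval)

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return stop, thread
```

Polling the on-disk size keeps the watchdog independent of the transfer protocol; it only detects a stall, leaving any recovery action to the caller.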

@rcarpa
Contributor

rcarpa commented Feb 17, 2021

What level of implementation is needed for this ticket? Is it simply detecting that a download is stuck? Or do we want to kill the active download and restart with the next source? Can we afford the luxury of an additional watchdog thread for each download_worker thread, or should I try to put all the supervision in the main thread?

@mlassnig
Contributor Author

@PalNilsson what are the implications for the pilot here?

@PalNilsson

The pilot still needs to be able to kill the main process, i.e. doing so should not leave this new thread hanging (that would prevent the pilot from finishing, since it in turn waits for its threads to finish). Killing the main process must therefore stop the new thread as well.
Also, I'll "soon" start testing parallel downloads, so it may not be a good idea to spawn too many additional threads. (How many simultaneous downloads can we handle? 5?)
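Paul's constraint maps naturally onto Python daemon threads: the interpreter does not wait for them on exit, so killing the main process cannot leave the watchdog hanging. A minimal sketch (names are illustrative, not Rucio's):

```python
import threading

def supervise(stop_event, interval=1.0):
    """Placeholder supervision loop; a real watchdog would check
    download progress here on every iteration."""
    while not stop_event.wait(interval):
        pass  # periodic progress check would go here

stop = threading.Event()
# daemon=True: the interpreter exits without waiting for this thread,
# so killing the main process can never leave the watchdog hanging.
watchdog = threading.Thread(target=supervise, args=(stop,), daemon=True)
watchdog.start()
```

The stop event additionally allows a clean, cooperative shutdown when the main process exits normally rather than being killed.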

@mlassnig
Contributor Author

Thanks Paul, we should aim for 3 parallel downloads on average to start with. This is what we had in the past, and it didn't "overload" the storage.

@rcarpa
Contributor

rcarpa commented Feb 17, 2021

Five additional threads is not something I would worry about, but I'm not sure what the functional requirements of this ticket are. For what purpose do we want to monitor the download, and should I implement any actions when a failed download is detected? Considering that many "protocols" are implemented with blocking I/O, I don't even know if I can easily implement recovery from a stuck download in a protocol-agnostic way. The ticket says that we want "to better deal with timeout issues and misbehaving clients/storage". Is this behavior only needed when using the download client in the pilot with the gfal protocol, or should it be implemented generically, for all protocols?

In the first case, there is already the basis for some "transfer_timeout" logic, with a non-working implementation for the gfal protocol that I can try to fix. But gfal is only one of the two protocols that will support this flag. Is this the desired solution?

Alternatively, in the second case, I would have to think about how to rework the download client internals so that the watchdog function lives in the protocol-agnostic code. However, I'm not aware of any way to terminate a thread stuck in blocking I/O in non-async Python. (Do you know one?) That means I can detect a stuck download, but I'm not sure I'll be able to gracefully recover from it.
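One protocol-agnostic workaround, since a thread stuck in blocking I/O cannot be forcibly terminated in CPython: run the download in a child process, which can always be killed from the outside. A hedged sketch (this is not how Rucio's download client is structured; the function is hypothetical):

```python
import multiprocessing

def run_with_watchdog(download_fn, args, timeout):
    """Run `download_fn(*args)` in a child process and give up after
    `timeout` seconds. Unlike a thread stuck in blocking I/O, a child
    process can always be terminated from the outside.

    Returns True on success, False if the download stalled or failed,
    in which case the caller is free to retry with the next source.
    """
    proc = multiprocessing.Process(target=download_fn, args=args)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # SIGTERM; could escalate to proc.kill()
        proc.join()
        return False
    return proc.exitcode == 0
```

The trade-off is the process-spawn overhead per download and the need to pass results back through a pipe or the filesystem instead of shared memory.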

@rodwalker

Is this a good place to add the requirement that any timeout scale with file size? The lack of this causes problems for users downloading big files, which hit the default timeout. A short timeout on byte movement is of course better, but a scaling global timeout is trivial to implement, e.g. assuming 1 MB/s.

Cheers,
Rod.
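Rod's proposal amounts to a one-liner: the global timeout becomes a fixed overhead plus the file size divided by an assumed worst-case rate. A sketch with illustrative numbers (the defaults are assumptions for this example, not Rucio's actual values):

```python
def scaled_transfer_timeout(file_size_bytes, base_timeout=60.0,
                            min_rate_bytes_per_s=1_000_000):
    """Global timeout that grows with file size: a fixed overhead plus
    the time the transfer would take at an assumed worst-case rate
    (1 MB/s here, per Rod's suggestion). All defaults are illustrative."""
    return base_timeout + file_size_bytes / min_rate_bytes_per_s
```

For example, a 1 GB file would get the 60 s base plus 1000 s of transfer budget, while small files keep a short timeout.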

@rcarpa
Contributor

rcarpa commented Mar 8, 2021

Hi Rod,
Solving this issue is probably the best way to allow enforcing dynamic timeout. However, as you noticed yourself, it will be difficult and take time. Very recently, Otilia opened an issue on this exact topic: #4374. We intend to implement a compromise solution similar to the one you proposed.
Cheers,
Radu

@bari12
Member

bari12 commented Jun 3, 2021

Dynamic timeout is implemented; we still need to check with the gfal developers why params.timeout is not always honored:

params.timeout = int(transfer_timeout)

We will not continue with the watchdog thread, but instead aim for proper handling in gfal.

@bari12
Member

bari12 commented Oct 13, 2021

Closing this now; we can re-open should this need re-discussion.

@bari12 bari12 closed this as completed Oct 13, 2021