x-rate/total operation timeout for xrdcl #1439
Comments
@adriansev : just to make sure I understand correctly, there are two enhancements needed:
is this right? |
@simonmichal yes, that's exactly right! thanks a lot! |
@adriansev : I've just added a global cp timeout for non tpc transfers: fd723da, you should be able to enable this by using |
@adriansev : the transfer rate threshold is now implemented, the feature is available in |
@simonmichal Hi, and sorry for the late reply! Using the env var XRD_CPTIMEOUT does not seem to work; context file here |
@adriansev : are you using the latest? I ran a quick test and it seems to work with your file:
Regarding the xrate threshold, sure I can add an envar, stay tuned ;-) |
@simonmichal well, the first line of the xrd dump says so :) .. but i am still investigating, as i also get:
where that start is printed from here: https://github.com/adriansev/jalien_py/blob/master/alienpy/alien.py#L1762 (my subclassing of client.utils.CopyProgressHandler) |
@adriansev : you should be able now to enable xrate-threshold with |
@simonmichal so, the bad alloc seems to be happening only when the file is zip component, see the logs here |
well, have you tried values other than 1? This might be an artefact of the in-flight requests (I will have a look) |
@adriansev : I cannot reproduce the
Can you point me to the file this happened with? |
@simonmichal so, in the order of messages :)
|
@simonmichal so, in the same directory https://asevcenc.web.cern.ch/asevcenc/xrd_timeout/bad_alloc/ i put all the logs i can get ... the error is almost always reproducible (there are cases where the process gets stuck for more than 1 min, at which point i TERM it) |
OK, I managed to reproduce the problem ... |
The xrootd/src/XrdCl/XrdClClassicCopyJob.cc Lines 992 to 996 in c9bed0b
it should be like in the xrootd/src/XrdCl/XrdClClassicCopyJob.cc Lines 676 to 685 in c9bed0b
Actually, the bug has been there forever, thanks for discovering it!!! |
This should fix the problem: 577812c |
@simonmichal i can confirm that i no longer get bad_alloc errors. i still observe an offset between the timeout setting and the actual begin-->end duration of the copy job, but it is of the order of 1-2 seconds and the feature is already useful as it is now, so this part is verified. i will check the xratethreshold right away as well .. by the way, what is the unit? |
@simonmichal before checking xratethreshold i verified whether setting the env var from within the application works .. and it does not .. i'm doing a very basic setup of the env var, see https://github.com/adriansev/jalien_py/blob/master/alienpy/alien.py#L1609 |
@adriansev : thanks for verifying the cptimeout! If you use the envar you need to give the threshold in B/s. Now that I write this, I realise that this might be cumbersome; I can tweak it if needed so that it understands K, M and G suffixes |
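To illustrate the suffix handling proposed above, here is a minimal sketch of how a threshold such as `10K` or `2M` could be turned into B/s. It is not the XRootD implementation: the function name `parse_rate` is hypothetical, and treating K, M and G as binary multiples (1024-based) is an assumption, since the thread does not say whether decimal or binary multiples are meant.

```python
def parse_rate(text: str) -> int:
    """Parse a rate such as '500', '10K', '2M' or '1G' into bytes per second.

    Assumption: K/M/G are binary multiples (1024-based); the thread does not
    specify this, so adjust the table if decimal multiples are intended.
    """
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = text.strip().upper()
    if not value:
        raise ValueError("empty rate")
    factor = 1
    # A trailing K, M or G scales the numeric part; no suffix means plain B/s.
    if value[-1] in multipliers:
        factor = multipliers[value[-1]]
        value = value[:-1]
    # int() raises ValueError on anything that is not a whole number.
    return int(value) * factor
```

With such a helper an application could accept `parse_rate("10K")` from its own configuration and pass the resulting integer on as the B/s value.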
@adriansev : well, the envars are resolved only once when the static xrootd/bindings/python/libs/client/env.py Line 41 in 1913644
the respective keys are: |
@simonmichal great! that did the trick! (by the way why is there a need for separate Int and String methods?) for myself i did this wrapper .. i'm not sure how to do it for EnvGet as i see that there are some cpp calls. |
@adriansev : thanks a lot for testing!!! Regarding the Int / String methods, thanks for pointing it out (I guess my mindset gravitates too much towards C++ ;-)), I will make it nicer ;-) |
@simonmichal i completely forgot! is it possible (and IMHO it would be healthy) that, if a transfer fails, the incomplete file is deleted? (like |
@adriansev : yes, I think that's doable, but let me explore what the possible side effects are; in the meantime, could you cut a separate issue for this |
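The wrapper mentioned in this exchange is not shown in the thread. As a rough sketch of the idea, a single helper can dispatch on the Python type of the value and call the matching setter. The name `env_put` is hypothetical, and the two setters are passed in as parameters so the sketch does not depend on the bindings being installed; it assumes they take `(key, value)` and return a bool, as `EnvPutInt` and `EnvPutString` from the Python bindings are understood to do.

```python
from typing import Callable, Union


def env_put(key: str,
            value: Union[int, str],
            put_int: Callable[[str, int], bool],
            put_string: Callable[[str, str], bool]) -> bool:
    """Set a client environment key, choosing the setter by value type.

    put_int / put_string are injected; in real use they would presumably be
    the bindings' EnvPutInt / EnvPutString (an assumption, not verified here).
    """
    # bool is a subclass of int in Python, so reject it explicitly
    # instead of silently storing True/False as 1/0.
    if isinstance(value, bool):
        raise TypeError("bool is not accepted; use 0/1 or a string")
    if isinstance(value, int):
        return put_int(key, value)
    if isinstance(value, str):
        return put_string(key, value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```

Injecting the setters keeps the helper easy to test; an application would bind them once and then call `env_put(key, value, ...)` with whatever keys the bindings document.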
@simonmichal sure, done @#1448 , thanks a lot! |
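The request to delete an incomplete file after a failed transfer can be pictured, at the application level, as a thin wrapper around whatever performs the copy. This is only a sketch for a local destination path: `copy_with_cleanup` and its `copy` parameter are hypothetical names, and it says nothing about how xrdcp itself would implement the feature or which side effects the maintainers need to consider.

```python
import os
from typing import Callable


def copy_with_cleanup(copy: Callable[[str, str], None], src: str, dst: str) -> None:
    """Run copy(src, dst); if it raises, remove the partial local destination.

    Caveat: if dst already existed before the copy, it is removed as well.
    """
    try:
        copy(src, dst)
    except BaseException:
        # Best-effort removal of whatever was written so far.
        try:
            os.remove(dst)
        except FileNotFoundError:
            pass
        # Re-raise so the caller still sees the original failure.
        raise
```

Re-raising keeps the failure visible, so the caller can still decide to try the next replica.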
Hi! Following my question here @abh3 suggested to add it as a ticket so here i go :)
It would be great if a transfer could be cancelled once it crosses a threshold on transfer rate or on elapsed time.
The threshold would be applied per individual transfers so in case of metafiles the next replica will be tried.
The x-rate could be computed based on XRD_TIMEOUTRESOLUTION; the total timeout would be the equivalent of XRD_CPTPCTIMEOUT for a normal copy.
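The two thresholds requested here can be sketched as a small watchdog that is polled at a fixed resolution while a transfer runs. This is an illustration of the idea only, not the XrdCl implementation: the class `TransferWatchdog`, its parameters, and the choice of the average rate since the start of the transfer (rather than a windowed rate) are all assumptions.

```python
import time
from typing import Callable, Optional


class TransferWatchdog:
    """Report when a transfer exceeds a total timeout or falls below a minimum rate."""

    def __init__(self,
                 timeout_s: float = 0,
                 min_rate_bps: float = 0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s        # 0 disables the total timeout
        self.min_rate_bps = min_rate_bps  # 0 disables the rate threshold
        self._clock = clock
        self._start = clock()

    def check(self, bytes_done: int) -> Optional[str]:
        """Return 'timeout', 'rate', or None if the transfer may continue."""
        elapsed = self._clock() - self._start
        if self.timeout_s and elapsed > self.timeout_s:
            return "timeout"
        # Average rate since the start of the transfer (an assumption;
        # a sliding window would react faster to a stall).
        if self.min_rate_bps and elapsed > 0 and bytes_done / elapsed < self.min_rate_bps:
            return "rate"
        return None
```

A caller would invoke `check()` once per resolution tick with the number of bytes transferred so far and, on a non-`None` result, cancel that transfer and move on to the next replica listed in the metafile.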