x-rate/total operation timeout for xrdcl #1439

Closed
adriansev opened this issue Apr 8, 2021 · 25 comments

@adriansev
Contributor

Hi! Following my question here, @abh3 suggested adding it as a ticket, so here I go :)
It would be great if a transfer could be cancelled when it crosses a threshold on transfer rate or on elapsed time.
The threshold would apply to each individual transfer, so in the case of metafiles the next replica would be tried.
The x-rate could be computed based on XRD_TIMEOUTRESOLUTION,
and the total timeout would be the equivalent of XRD_CPTPCTIMEOUT for a normal copy.

@simonmichal
Contributor

@adriansev : just to make sure I understand correctly, there are two enhancements needed:

  1. A global timeout for non-TPC transfers, similar to XRD_CPTPCTIMEOUT

  2. A setting that forces the copy to move to another replica, or to fail if no replica is available, when the transfer rate drops below a given threshold

Is this right?
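To make the two requested behaviours concrete, here is a minimal simulation in plain Python. It does not use XRootD at all: `Replica`, `try_replicas`, and the one-second sampling are hypothetical names and assumptions made for this sketch, not the client's actual design.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Replica:
    """Hypothetical replica description used only for this sketch."""
    url: str
    rates: List[float]  # bytes transferred in each one-second interval


def try_replicas(replicas: Iterable[Replica], size: int,
                 min_rate: float, max_seconds: int) -> Optional[str]:
    """Return the URL of the first replica that completes, else None.

    A replica is abandoned when an interval's rate drops below
    ``min_rate``, when ``max_seconds`` elapse before ``size`` bytes
    have arrived, or when its samples run out.
    """
    for replica in replicas:
        copied = 0.0
        for elapsed, rate in enumerate(replica.rates, start=1):
            if rate < min_rate:         # enhancement 2: rate threshold
                break
            copied += rate
            if copied >= size:
                return replica.url
            if elapsed >= max_seconds:  # enhancement 1: total timeout
                break
    return None
```

Both limits are applied per replica here, which matches the request that the threshold apply to each individual transfer.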

@adriansev
Contributor Author

@simonmichal yes, that's exactly right! thanks a lot!

@simonmichal
Contributor

@adriansev : I've just added a global cp timeout for non-TPC transfers: fd723da. You should be able to enable it with XRD_CPTIMEOUT or by passing the cptimeout argument to the CopyProcess in Python. Please give it a try!

@simonmichal
Contributor

@adriansev : the transfer rate threshold is now implemented. The feature is available in xrdcp through the --xrate-threshold option, or from the Python bindings by passing the xratethreshold argument to the CopyProcess. Please give it a try!
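For callers that shell out to xrdcp, the two settings could be wired up roughly as follows. This is only a sketch: `xrdcp_invocation` is a hypothetical helper, the threshold is passed through verbatim because this thread does not state the unit the command-line option expects, and XRD_CPTIMEOUT is assumed to be in seconds.

```python
import os
from typing import Dict, List, Optional, Tuple


def xrdcp_invocation(source: str, target: str,
                     cp_timeout: Optional[int] = None,
                     xrate_threshold: Optional[str] = None
                     ) -> Tuple[List[str], Dict[str, str]]:
    """Build (argv, env) for an xrdcp call without running it."""
    env = dict(os.environ)   # copy, so the caller's environment is untouched
    argv = ["xrdcp"]
    if cp_timeout is not None:
        env["XRD_CPTIMEOUT"] = str(cp_timeout)   # assumed to be seconds
    if xrate_threshold is not None:
        argv += ["--xrate-threshold", xrate_threshold]
    argv += [source, target]
    return argv, env
```

The result can be handed to `subprocess.run(argv, env=env)`. Setting the variable in the child's environment means it is in place before the xrdcp process starts.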

@adriansev
Contributor Author

@simonmichal Hi, and sorry for the late reply! So, using the env var XRD_CPTIMEOUT does not seem to work .. context file here
Also, because it took too long (see the cmd_out file) I killed the process and got a partially downloaded file (so the transfer was in fact ongoing).
It would be very useful to me if you could also add an env var for xratethreshold: with an env var I do not have to adapt the function signature depending on the client version.
Thanks a lot!!

@simonmichal
Contributor

@adriansev : are you using the latest HEAD from git?

I ran a quick test and it seems to work with your file:

$ export XRD_CPTIMEOUT=1
$ ./xrdcp -f "root://alicegrid2.recas.ba.infn.it:1094//15/32077/d7903a02-9195-11e7-9755-3b36a265d99d?xrd.wantprot=unix&authz=
-----BEGIN SEALED CIPHER-----
fLni9LEtpp+y4YLRzMh960CfmdwqFD50XbGJj0WyHeNRKpkQEvGRiNuZBUOvq1NWMAlRmQno6Ink
CSSxEJclU50-9u84eZbAuh0mgZn6nnc4KjK9XPrvmIVNTZoouOp6Vrinsf556ljTHdiF2Ha4-sTB
v-6Gvh8vVx4epvQkjOk=
-----END SEALED CIPHER-----
-----BEGIN SEALED ENVELOPE-----
AAAAgJknT55jOMG9g+S76ZrdtH3g9J54UvJG-SruYIdZmXAocnAh0PUhCGfXnz3ullxXk9rUIqcY
EY+p9W3SU-DrAPykPwgTef4dQ-LNmPKguPmAz42X8RuJDImVF07n802wTL7NiYuzQHgB6mTaMfNP
1jJI3KiONaw-wtrGBdwOX2DYgO629jH-jtobAsi1be9tj5tX+8bCU3UAOWyWDtbkPnH0LBNaZyN+
hhsH-DkowiTDG5jSw8f7koZUKTO+ivCOacGhRv+1+qLgj+K8CLTRhLdQzDsT4YgFWdUfqCftT9Fo
qIcPSQ5jdr9UkMM0om0QgLKHCC0dbAOIUOGLT13ydIs0j6M72jzyh4hTfRpmvUksZ0jNa0y8tlOX
rjz1AUv86ycueXTwjMm8vPw0rnqOKxeYIe4mRhERxCckdV5Ct3h32TJRsverUma+lFyq-xhcog5u
6dBG43YCyGGGQW8rNuwQ-qp7SnUPP8so23ENQSnRRWedKsBHXWrGiRr6Xdk-uo2VanIjwMRHrQOo
G2mcFqh8C23DjcPtR+S7gzKaf3DYPibSaEo-LdGYvpDwXTv4LtnMkGINcE4Llrofe62yu70FhMm2
SBl0B4XfLQSjul8pkE90lr0xBIK-GM1IKxD8n6zJvTwQnLzHj9R4D6fXWWK8D8Xazo+Q5vmO2wFx
7B5UCTp1M-RuNcXS46gdGvJCCdGN93VCKDxKbME5RhyS4KZIG1wJIdsEPYgn4VWoHNOMF4C6fQcf
QmWjTJBKw5eCHiRMKcClqRb2lNlkFlXZJSJb-3OjFwS6zpJujhWNGSmTkEWNMJDXMMDo7HYjsR8K
aRqsahDEen6fjSnUFN9vXS2nzYW3qRlOmVTc-fK7gQ0NCpqndzUgv3ea4CirME+4+IruIKMDcMd2
ff8LNuadsaGQbe+KKSonSmpKvq341fKTQvDgHQL4-gSpJG5T5+l1PsgW8-DclnyRylhVpCGPyxp6
P8q1gjkbH3TDYxrLIeFRDk6DjF9LV7xx-R3Y3D4sCo8i-11RnSOh6Vswud78q5g=
-----END SEALED ENVELOPE-----" .
[72MB/731MB][  9%][====>                                             ][36MB/s]  
Run: [ERROR] Operation expired: CPTimeout exceeded.

Regarding the xrate threshold, sure I can add an envar, stay tuned ;-)

@adriansev
Contributor Author

@simonmichal well, the first line of the xrd dump says so :) .. but I will continue to investigate as I also get:

jobID: 1/1 >>> Start
terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc
Aborted (core dumped)

where that Start is printed from here: https://github.com/adriansev/jalien_py/blob/master/alienpy/alien.py#L1762 (my subclass of client.utils.CopyProgressHandler)
I will get back with the stack trace.

@simonmichal
Contributor

@adriansev : you should be able now to enable xrate-threshold with XRD_XRATETHRESHOLD: e0af76e

@adriansev
Contributor Author

@simonmichal so, the bad_alloc seems to happen only when the file is a zip component, see the logs here.
For normal files the process is stopped, but only after what seems to be twice the timeout value. To test with a big file, I uploaded one to our EOS storage; the benchmark for the transfer was an average of 107 MiB/s (at least 5 tries).
Then, with XRD_CPTIMEOUT=1, the downloaded portion of the file (that was kept locally) was consistently at 208 MiB (and it never threw any error).
And thanks a lot for XRD_XRATETHRESHOLD, I will try it.
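The "twice the value" estimate follows from the two reported figures, assuming the transfer ran at the benchmarked average rate for its whole duration. A quick check (the helper name is made up for this sketch):

```python
MIB = 1024 * 1024


def implied_elapsed_seconds(bytes_on_disk: int, avg_rate: float) -> float:
    """Time needed to write bytes_on_disk at a constant avg_rate (bytes/s)."""
    return bytes_on_disk / avg_rate


# 208 MiB kept on disk, 107 MiB/s average from the benchmark runs
elapsed = implied_elapsed_seconds(208 * MIB, 107 * MIB)
print(f"{elapsed:.2f} s")  # prints "1.94 s"
```

So with XRD_CPTIMEOUT=1 the copy appears to have run for a little under two seconds before it was stopped.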

@simonmichal
Contributor

Well, have you tried values other than 1? This might be an artefact of the in-flight requests (I will have a look).

@simonmichal
Contributor

@adriansev : I cannot reproduce the std::bad_alloc, I did try with a zip file, e.g.:

$ ./xrdcp -f root://localhost//tmp/input.zip?xrdcl.unzip=AliESDs.root  .
[16MB/26.47MB][ 60%][==============================>                   ][2.667MB/s]  
Run: [ERROR] Operation expired: CPTimeout exceeded.

Can you point me to the file this happened with?

@adriansev
Contributor Author

@simonmichal so, in the order of messages :)

  1. I do not know exactly which time to measure for the cp process; I put a delta time computation in begin() and end() of client.utils.CopyProgressHandler and got this (5 consecutive tries each, no outliers removed):
     • 1s with an average of 2.256
     • 2s with an average of 3.348
     • 3s with an average of 5.2 (but there was an outlier, 4 values around 4.3 and one at 8.62)
     • 4s with an average of 5.28
     • 5s with an average of 6.418
  2. the file was the one that I provided you with the metafile; I will get a fresh set of logs right away
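A begin()/end() timing like the one described in item 1 could look roughly like this. It is a sketch, not the code from alien.py: the class is a plain object so that it runs without XRootD installed, and the callback signatures are assumed to mirror those of client.utils.CopyProgressHandler, which it would subclass in real use.

```python
import time


class TimedCopyHandler:
    """Records the wall-clock time between begin() and end() for each job."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = {}
        self.elapsed = {}   # jobId -> seconds between begin() and end()

    def begin(self, jobId, total, source, target):
        self._started[jobId] = self._clock()

    def end(self, jobId, results):
        self.elapsed[jobId] = self._clock() - self._started.pop(jobId)

    def update(self, jobId, processed, total):
        pass  # progress callbacks are not needed for timing
```

The interval measured this way may cover more than the data transfer itself, which could account for part of the gap between the configured timeout and the averages above.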

@adriansev
Contributor Author

@simonmichal so, in the same directory https://asevcenc.web.cern.ch/asevcenc/xrd_timeout/bad_alloc/ I put all the logs I can get ... the error is almost always reproducible (there are cases where the process gets stuck for more than 1 min, at which point I TERM it)

@simonmichal
Contributor

OK, I managed to reproduce the problem ...

@simonmichal
Contributor

The XRootDSourceZip destructor is missing a CleanUpChunks invocation:

~XRootDSourceZip()
{
  XrdCl::XRootDStatus status = pZipArchive->Close();
  delete pZipArchive;
}

it should be like in the XRootDSource destructor:

virtual ~XRootDSource()
{
  if( pDataConnCB )
    pDataConnCB->Cancel();
  CleanUpChunks();
  if( pFile->IsOpen() )
    XrdCl::XRootDStatus status = pFile->Close();
  delete pFile;
}

Actually, the bug has been there forever, thanks for discovering it!!!

@simonmichal
Contributor

This should fix the problem: 577812c

@adriansev
Contributor Author

@simonmichal I can confirm that I no longer get bad_alloc errors. I still observe an offset between the timeout setting and the actual begin-->end duration of the copy job, but it is on the order of 1-2 seconds and the feature is already useful as it is, so this part is verified. I will check xratethreshold right away as well .. by the way, what is the unit?

@adriansev
Contributor Author

@simonmichal before checking xratethreshold I verified whether setting the env var from within the application works .. and it does not .. I'm doing a very basic setup of the env var, see https://github.com/adriansev/jalien_py/blob/master/alienpy/alien.py#L1609
Any idea why these values are not picked up?
Thanks a lot!

@simonmichal
Contributor

@adriansev : thanks for verifying the cptimeout!

If you use the envar you need to give the threshold in B/s. Now that I write this I realise it might be cumbersome; I can tweak it if needed so that it understands K, M and G suffixes.
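Until such suffixes are supported, a caller can convert a human-friendly value to B/s before exporting it. The helper below is hypothetical, and it assumes binary multiples (K = 1024); the thread does not say which convention the client would adopt.

```python
_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_rate(text: str) -> int:
    """Convert a string such as '512', '10M' or '2k' to bytes per second."""
    text = text.strip().upper()
    if text and text[-1] in _MULTIPLIERS:
        number, suffix = text[:-1], text[-1]
    else:
        number, suffix = text, ""
    return int(number) * _MULTIPLIERS[suffix]   # ValueError on bad input
```

For example, `str(parse_rate("10M"))` gives a plain B/s value suitable for XRD_XRATETHRESHOLD.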

@simonmichal
Contributor

@adriansev : well, the envars are resolved only once, when the static XrdCl::Env object is created, so setting them from the source code is risky. Better use EnvPutInt:

def EnvPutInt( key, value ):

the respective keys are: CPTimeout and XRateThreshold
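A thin helper around EnvPutInt for these two keys might look like the sketch below. The setter is injected so that the snippet runs without XRootD installed; `apply_copy_limits` is a made-up name, and the setter is assumed to return a truthy value on success.

```python
from typing import Callable, Dict, Optional


def apply_copy_limits(put_int: Callable[[str, int], object],
                      cp_timeout: Optional[int] = None,
                      xrate_threshold: Optional[int] = None
                      ) -> Dict[str, bool]:
    """Set the copy limits through put_int; return {key: success flag}."""
    wanted = {"CPTimeout": cp_timeout, "XRateThreshold": xrate_threshold}
    return {key: bool(put_int(key, int(value)))
            for key, value in wanted.items() if value is not None}
```

With the Python bindings available, the call would presumably be `apply_copy_limits(client.EnvPutInt, cp_timeout=60)` after `from XRootD import client`.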

@adriansev
Contributor Author

@simonmichal great! that did the trick! (by the way, why is there a need for separate Int and String methods?) For myself I did this wrapper .. I'm not sure how to do it for EnvGet as I see that there are some cpp calls.
So, the features are working, and this can be closed. Thanks a lot!!

@simonmichal
Contributor

@adriansev : thanks a lot for testing!!!

Regarding the Int / String methods, thanks for pointing it out (I guess my mindset gravitates too much towards C++ ;-) ), I will make it nicer ;-)

@adriansev
Contributor Author

@simonmichal I completely forgot! Is it possible (and IMHO it would be healthy) for the incomplete file to be deleted if the transfer fails? (like --posc but also for downloads :) ) Or make --posc work both ways :) Thanks a lot!!

@simonmichal
Contributor

@adriansev : yes, I think that's doable, but let me explore what the possible side effects are. In the meantime, could you open a separate issue for this?

@adriansev
Contributor Author

@simonmichal sure, done as #1448, thanks a lot!
