
New directive request for http internal buffer sizes #1532

Open
snafus opened this issue Oct 12, 2021 · 12 comments
@snafus
Contributor

snafus commented Oct 12, 2021

Hi @abh3, @bbockelm, @ffurano,

I'd like to ask whether a new (esoteric?) directive could be added to set the size of the internal HTTP buffers used for 'normal' and TPC davs transfers.
Currently there appear to be hard-coded 1 MiB buffers in the XrdHttp and XrdTpc code.
For RAL (with Echo and its Ceph object store), it would be very helpful to be able to increase this to some configurable, sensible multiple of that size.
For read requests we can mitigate this to some extent with a memcache proxy and appropriate async sizes.
For writes, however, I don't think we can make improvements without code changes.

From tests with a privately patched build, this does show significant improvements, but there may be other implications (side effects, etc.) that I wouldn't be aware of. Having this as a configurable option in the official codebase would give demonstrable improvements for us.

Some of the locations I spotted (essentially, references to 1024*1024) are listed below, followed by a rough sketch of the kind of change intended:

if (offset == m_offset && (force || (size && !(size % (1024*1024))))) {

size_t TPCHandler::m_small_block_size = 1*1024*1024;

hp->myBuff = BPool->Obtain(1024 * 1024);

l = (long)min(filesize-writtenbytes, (long long)1024*1024);

l = min(rwOps[0].byteend - rwOps[0].bytestart + 1 - writtenbytes, (long long)1024*1024);
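
(As mentioned above, a rough sketch of the kind of change intended; the directive name and variable below are purely illustrative, not an actual patch:)

// Illustrative only: replace the 1 MiB literal with a value that a new directive could set.
static size_t httpBufSize = 1024 * 1024;              // current default

// during configuration parsing, something along the lines of:
//   if (!strcmp(var, "http.bufsize")) httpBufSize = parsedSize;   // "http.bufsize" is hypothetical

// the buffer acquisition above would then become:
hp->myBuff = BPool->Obtain(httpBufSize);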

Do you think this is something that can be accommodated?
Kind regards,
James

@ffurano
Contributor

ffurano commented Oct 12, 2021 via email

@bbockelm
Contributor

Hi James -- I think this sounds like a great idea.

For TPC, you found the correct location for increasing the buffer size. Note TPCHandler::m_block_size is the size of the HTTP request generated when HTTP-TPC is doing multistreams and must be a multiple of the block size (otherwise you may hit deadlocks). My recommendation is to make it no smaller than 16MB.
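
(A minimal sketch of that constraint, assuming the multiple refers to the 1 MiB m_small_block_size quoted above; the helper name is hypothetical:)

#include <algorithm>
#include <cstddef>

// Hypothetical helper: clamp a configured HTTP-TPC block size so it is no
// smaller than the recommended 16 MB and stays a multiple of the 1 MiB block.
size_t clampTpcBlockSize(size_t requested) {
    const size_t small   = 1024 * 1024;        // TPCHandler::m_small_block_size today
    const size_t minimum = 16 * 1024 * 1024;   // recommended lower bound
    size_t value = std::max(requested, minimum);
    if (value % small) value += small - (value % small);   // round up to the next multiple
    return value;
}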

I never did much exploration into optimal block sizes; without that buffering, libcurl would generate extremely small writes (16 KB or smaller, depending on how often you poll the TCP socket) and the Ceph RADOS integration would fail outright on unaligned writes. So, I'd welcome your work!

The hard decision point is the buffering in XrdHttp. Right now, the size of the buffer for PUT appears to simply be whatever comes off the socket. You may put the buffer in the XrdHttp layer ... or it might make more sense at the filesystem layer. We already "paid the price" in XrdTpc in terms of code complexity but it's not immediately clear the same decision makes sense for XrdHttp.

@xrootd-dev

xrootd-dev commented Oct 12, 2021 via email

@ffurano
Contributor

ffurano commented Oct 13, 2021 via email

@abh3
Member

abh3 commented Oct 13, 2021 via email

@ffurano
Contributor

ffurano commented Jan 7, 2022

I see the points. Trying to find the best solution.

@abh3
Member

abh3 commented Jan 8, 2022

Well, the issue is that the buffers are fixed for TPC and that is causing them some issues. I can see that, but @bbockelm would need to address it, as that part of the code is rather complicated and it's not exactly clear you can just change the buffer size without side effects. As for the http buffer size, I agree with you: there is no reason to change that. They should use the xrd.buffers directive to enable maxi-buffers via a special option (which I will relay to them) and then use the xrootd.async segsize option to get the size they want for async I/O.

So, for normal reads/writes we already have a solution. What we need is a solution for HTTP TPC, and @bbockelm has to weigh in on that. But even then, they still need a memcache layer to handle small application reads. All this will do is speed up file copies (i.e. xrdcp and TPC).
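
(For orientation, the rough shape of that existing tuning; the values are placeholders and the special, undocumented xrd.buffers option for maxi-buffers is deliberately not spelled out here:)

# placeholder sketch, not a working configuration
xrd.buffers <memory-size> [<options>]     # plus the undocumented maxi-buffer option
xrootd.async segsize <size>               # async I/O segment size actually used per request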

@ffurano
Contributor

ffurano commented Jan 10, 2022

I agree that a buffering layer has to be added somewhere to work around the slowness of the ofs being used (Ceph?).
Changing the XrdHttp buffers can be good for a quick and dirty test, but I don't consider it a solution, as the drawbacks can be quite heavy.
I am confident that a short discussion will clarify the best place to buffer this data before flushing.

@snafus
Contributor Author

snafus commented Jan 10, 2022

Hi,
Thanks for the updated comments. We have been working on a buffering component for the XrdCeph plugin that sits between Ceph (and libradosstriper) and the XRootD calls. It still requires some further AIO functionality, but in a lightly loaded system it appears to recover good performance.
Once the system is reasonably loaded, we still observe a drop in the per-file transfer speed (say to O(5) MB/s) and some increased variation in transfer speeds. Identifying if and where that bottleneck lies is hopefully the final step in the performance improvements.
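
(To make the idea concrete, a hypothetical illustration of such a coalescing layer, not the actual RAL code; the names are invented and the flush would call into libradosstriper in practice:)

#include <sys/types.h>
#include <cstdint>
#include <vector>

// Hypothetical illustration: coalesce small sequential XRootD writes into
// chunk-sized pieces before handing them to libradosstriper.
class WriteCoalescer {
public:
    explicit WriteCoalescer(size_t chunk) : chunkSize(chunk) { buf.reserve(chunk); }

    // Append an incoming write (offsets assumed sequential here); flush whenever
    // a full chunk has accumulated.
    ssize_t Write(const char *data, size_t len) {
        buf.insert(buf.end(), data, data + len);
        while (buf.size() >= chunkSize) flushChunk(chunkSize);
        return static_cast<ssize_t>(len);
    }

    // Flush whatever remains when the file is closed.
    void Close() { if (!buf.empty()) flushChunk(buf.size()); }

private:
    void flushChunk(size_t n) {
        // The real component would issue an aligned (A)IO write to the striper here.
        flushedBytes += n;
        buf.erase(buf.begin(), buf.begin() + n);
    }

    size_t chunkSize;
    uint64_t flushedBytes = 0;
    std::vector<char> buf;
};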

On the side of changing things in XrdTpc/XrdHttp, it could still be interesting; at some point I tried hard-coded values with reasonable results. If there are tuning / optimisation studies that might still be useful, I'd be interested in helping. I also observed some multistream / out-of-order issues in TPC, but have not yet had a chance to see where exactly they appear, and will report back when I find out more.
Thanks,
James

@ffurano
Contributor

ffurano commented Jan 10, 2022

Hi,

I see, so I understand that you implemented AIO in the XrdCeph layer, correct? Do you think it would still be beneficial to receive larger chunks in the write call?

@bbockelm
Contributor

FWIW - I'm still fine with making the existing buffering in XrdTpc configurable. As I said, we already paid the price on that one and we're talking about a very modest config knob only for RAL.

However, for the XrdHttp components, it seems we're all in agreement that XrdCeph is really the best place to land this.

@abh3
Member

abh3 commented Jan 10, 2022

Yes, but (there is always a but). Since http hands off the reads/writes to the xroot layer, we already have tuning knobs there that allow for very large buffers. We don't document that feature because we've seen problems with people not understanding what that tuning knob is for -- and it's for very specific use cases like CTA and now RAL. I will start an offline thread on this. Since TPC bypasses the xroot layer, there is no knob. So, yes, we would need one there.
