
Add an HTTP/2-aware application type supporting trailers and pushed streams. #426

Merged (67 commits, Sep 13, 2015)

Conversation (5 participants)

awpr (Member) commented Aug 29, 2015

There's a lot of prose in the commit messages if you find yourself wondering about the reasons behind some particular change.

I still need to try the Windows-specific stuff -- I haven't installed GHC on my Windows partition yet and I've heard it's rather difficult.

Fixes #402, relevant to #415.

awpr added some commits Jul 25, 2015

  • Refactor headerContinue to take a Builder: this allows it to be used for trailers, which will need to avoid inserting status and other headers automatically.
  • Add initial HTTP/2-aware Application type: this is expected to continue to evolve; for now it's nearly identical to Wai.Application, because that's the interface currently supported by Warp's HTTP/2 connection handler.
  • Move queue creation from 'response' to 'worker': this will allow 'worker' to send the real trailers when Http2Application is modified to return them as its final result.
  • Add 'promoteApplication'; use it in 'serveConnection': this is the identity function for now, but eventually it will handle discarding the irrelevant functionality of the HTTP/2 application type.
  • Add Http2Application param to all 'run' flavors: each one is renamed to mention "Http2", and a wrapper with the old name and type signature is added to preserve compatibility.
  • Generalize HTTP/2 run variants to ServeConnection: this will allow for many different protocol-level configurations, rather than just the two represented by 'run' and 'runHttp2'. In particular, HTTP/2-only could be supported through these variants, and HTTP/1-with-upgrade could be a ServeConnection that starts off like HTTP/1 but hands the connection over to HTTP/2 when the appropriate Upgrade header is present.
  • Clone the response type for HTTP2: this is currently identical to the normal WAI response type, but it will change drastically over the next few commits.
  • Allow stream bodies to have a result type: this allows trailers to be computed based on what happened during streaming, e.g. checksums or signatures.
  • Curry Http2Application's responder function: since Response was not a multi-constructor datatype under HTTP/2, it could be turned into curried arguments to a respond function.
  • Change "Http2" to "HTTP2" globally: this seems to be the accepted naming style for initialisms here; previously the two styles were clashing in the same identifiers in warp-tls.

awpr (Member) commented Sep 10, 2015

better to pass it a ByteArray# and we could avoid the global lock

I think there's some more confusion here: the global lock is for allocating pinned memory, right? But we don't do that; there's no pinned memory allocation just to wrap up an existing Ptr Word8 into a ByteString, which is what happens when Warp HTTP/2 writes its buffer.

I don't know enough about GHC internals to comment authoritatively on changing ByteString itself, but I suspect people will object on the grounds that this would change the contract of the ByteString type (namely that its buffer location won't change out from under it).
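
Roughly what that wrapping amounts to (a minimal sketch using bytestring's unsafePackCStringLen; 'wrapBuffer' is a hypothetical name):

import Data.ByteString (ByteString)
import Data.ByteString.Unsafe (unsafePackCStringLen)
import Data.Word (Word8)
import Foreign.Ptr (Ptr, castPtr)

-- O(1): builds a finalizer-less ForeignPtr around memory that already
-- exists; no new allocation (pinned or otherwise) and no copy.
wrapBuffer :: Ptr Word8 -> Int -> IO ByteString
wrapBuffer ptr len = unsafePackCStringLen (castPtr ptr, len)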

AaronFriel (Contributor) commented Sep 10, 2015

say "Trust me, this ByteString is large enough"

Builders can do this; it's called 'insertByteString'.

But as you said, there's no way to communicate this to WAI. Yes, I know that means two writes, but I really think it's important that the API support expressing this. If I'm sending a 128KiB chunk down the wire, there will be a lot more than two system calls to send it (with HTTP/2.0).

What I would like is for the API to support expressing, "Yes, I'm really sure that I want WAI to not buffer this separately." That could be done with Builder, I suppose, but it needs to be a parameter that can be tuned. As you wrote,

Currently Warp will copy the yielded chunk anyway, but that's an implementation detail; arguably it'd be better to eat the double write call (header; data) when the fragment is bigger than some threshold. That's totally orthogonal to whether we use Builders as the interface, though.

My response is then: fine, that should be something that the user can opt into. If that means I have to be careful about using insertByteString and using the appropriate WAI parameters, I won't think it's ideal, but I do think it's worthwhile finding the right sizes for both. And perhaps that means getting some real world telemetry and finding the right sizes for bytestrings that shouldn't copy by default, and how large of a chunk WAI should decide to send as two separate writes.

I'm not sure about this -- my intuition is that the big cost of ForeignPtrs is mostly associated with their finalizers, and ForeignPtrs that don't have a finalizer will have much lower cost. The send buffer that we're writing to spends most of its life as a 'Ptr Word8', and is only briefly wrapped in a finalizer-less ForeignPtr when we call the actual write function. Intuition is often wrong, though -- I should defer to Kazu for deeper knowledge of this.

I'd have to dig into this too, the way I read it is that the GC still has to do extra work on the nursery that's linear in the number of ForeignPtrs, though now I'm not sure.

better to pass it a ByteArray# and we could avoid the global lock

I think there's some more confusion here: the global lock is for allocating pinned memory, right? But we don't do that; there's no pinned memory allocation just to wrap up an existing Ptr Word8 into a ByteString, which is what happens when Warp HTTP/2 writes its buffer.

The motivation for changing WAI in the past has been to avoid the global lock. If applications using the API cause ByteStrings to be created that are unnecessarily pinned by default, that still hurts WAI performance; an unpinned alternative is something many more Haskell applications would benefit from.

I don't know enough about GHC internals to comment authoritatively on changing ByteString itself, but I suspect people will object on the grounds that this would change the contract of the ByteString type (namely that its buffer location won't change out from under it).

I'm also unconvinced now that the memory needs to be pinned in many cases. I'm trying to think of what sort of libraries rely on that contract, and it occurs to me that the number is probably small. I agree, though, that for backwards compatibility, the default ByteString functions should keep using pinned memory.

But if the cost of pinning memory is a global lock that isn't necessary in the majority of use cases, then that is an unnecessary additional cost to using ByteString.

My stance is not that ByteString and Builder are always bad; a lot of good work went into them exploring the domain of efficient string representations. But I think there is an over-reliance on them for performance-sensitive APIs where the assumptions implicit in ByteString and Builder do not hold up: for example, that strings will never be contiguous (so appends should always produce copies) and that it is almost always cheaper to copy memory than to return a small chunk (where "small" has been arbitrarily defined as less than 4 KiB).

My hesitance about urging the HTTP/2.0 patches to be merged into WAI is based on this: I think there are some assumptions being made which should be carefully considered. Are we making users pay for unnecessary copying? It sounds like with the buffer-based approach, it is mandatory that Builders create copies. That's fine, but are we creating two copies in the case of strings less than 4 KiB? This is something I'll have to trace through the Builder library to check, but it looks like chunks under 4 KiB written to a builder are copied even if the desired end result is to write them to a third buffer.

Another assumption is that it's always cheaper to allocate and copy memory than to perform two syscalls. There are many networking solutions, which are becoming more popular, that significantly change that dynamic. So even if we can't agree on a size, I think it should be a tunable parameter.

Regarding changing ByteString, if I were to develop a patch and ask them to merge it, it would be to switch the constructor to use ByteArray# directly and add a new module which permits creating unpinned ByteStrings. In other words: maintain the current contract, but add an option for unpinned memory, which I think will work well for the majority of FFI cases.
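
The primitive being referred to exists in bytestring as byteStringInsert (as awpr notes further down, the name above has the words reversed). A minimal sketch of opting a chunk out of copying:

import Data.ByteString (ByteString)
import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra (byteStringInsert)

-- Yields the chunk by reference instead of copying it into the build
-- buffer; whether the consumer (e.g. Warp) honors that is up to it.
sendLargeChunk :: ByteString -> Builder
sendLargeChunk = byteStringInsert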

awpr (Member) commented Sep 10, 2015

chunks written to a builder are copied even if the desired end result is to write them to a third buffer

Actually... it suddenly occurs to me that maybe you're mixing up Builder with something like Java's StringBuilder. A Builder isn't a mutable/growable buffer, and it doesn't keep any direct representation in memory as a sequence of bytes -- it's just a function to put bytes into a buffer and/or yield existing buffers. That can be a closure that (when called against a buffer) copies the bytes out of a strict ByteString (like what you'd get by calling 'byteString "a"'), a function that serializes an int (like 'intDec 42'), a function that always writes one fixed byte (something equivalent to '\ptr -> poke ptr 0x55'), or a closure that always returns an existing block of memory (such as 'insertByteString "a"'). None of these involve copying data "on the way in", but some of them are implemented via memcpy on the way out.

This mixup would explain... well, everything, really. If a Builder were a StringBuilder, all of your arguments would be true and it would indeed be a bad choice for use in a streaming interface, and impose single-or-double copies on everything, and be expected to have zero-cost conversion to and from ByteStrings, and be unable to communicate to WAI that a chunk should be inserted vs. copied, and rely on the assumption that copies are cheaper than small chunks.

I had written a bunch of individual replies to parts of the latest message, but I've deleted them because I think this is actually the heart of the matter.
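
A small way to see these closures in action, under the assumption that chunk boundaries in the resulting lazy ByteString reveal copy vs. insert:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder (byteString, toLazyByteString)
import Data.ByteString.Builder.Extra (byteStringInsert)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
  -- both small literals are copied into one build buffer: ["foobar"]
  print (L.toChunks (toLazyByteString (byteString "foo" <> byteString "bar")))
  -- the inserted literal passes through by reference: ["foo","bar"]
  print (L.toChunks (toLazyByteString (byteString "foo" <> byteStringInsert "bar")))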

awpr (Member) commented Sep 10, 2015

Oh, right, I forgot about this comment in all the chaos -- @kazu-yamamoto I was using the "~" syntax as type equality, as enabled by the DataKinds extension and some other extension I can't think of right now. I can change the Haddocks to avoid it if you think it's liable to confuse readers.

AaronFriel (Contributor) commented Sep 10, 2015

Let me get this straight then:

  1. Builder as utilized by WAI is told to write to a buffer no matter what. Right now there's no signalling path for that to avoid writing to a buffer?
  2. I think you're right that I'm confused, but I don't think it is from confusion with Java -- I'm actually not very familiar with Java's StringBuilder, but what you describe sounds like the C# equivalent. Rather, it seems to me that the Builder API guarantees small strings are copied.
-- quoted from bytestring's builder internals:
byteStringThreshold :: Int -> S.ByteString -> Builder
byteStringThreshold maxCopySize =
    \bs -> builder $ step bs
  where
    step !bs@(S.PS _ _ len) !k br@(BufferRange !op _)
      | len <= maxCopySize = byteStringCopyStep bs k br
      | otherwise          = return $ insertChunk op bs k

I'm trying to wrap my head around the Builder internals; the continuation-passing style is wonderful for performance and for the optimizer, but it does take some time to understand. If I am understanding you correctly, you're saying that byteStringThreshold, which uses byteStringCopyStep, will not create an intermediate copy for small strings?

That is: if I pass to a Response constructor the result of byteString ("AUniqueString" :: ByteString), the Response will write that builder to its own buffer, and so there will be at most two copies of "AUniqueString" in memory? The first being the source ByteString, and the second being the buffer WAI fills? If that is the case, then Builder is fine as an API surface and avoids that extra allocation.

It was my understanding that byteStringCopyStep always allocates an intermediate buffer, so that there would be three locations in memory containing the unique string. I think we would both agree that an extra copy there is completely unnecessary if it's just going to be copied again to the final buffer that WAI will send. That is the source of my confusion.

I still would like to urge that there be a Response API that does as you said earlier:

Currently Warp will copy the yielded chunk anyway, but that's an implementation detail; arguably it'd be better to eat the double write call (header; data) when the fragment is bigger than some threshold. That's totally orthogonal to whether we use Builders as the interface, though.

If that threshold was a tunable parameter in the Response, I would be satisfied.

The rest of this comment is beside the point, and more for my own education than anything else.

My issues with Builder and ByteString remain, but they are orthogonal to changing the WAI API. I am still interested in determining whether there is room for improving performance by tweaking the Builder's threshold size, and there is definitely room to improve ByteString. Simon Marlow has commented on Reddit about the weak pointer cost, and it looks like ByteString does avoid creating weak pointers. Still, I am going to look into submitting a patch to ByteString to optimize appends of adjacent strings.

It sounds like you're more familiar with Builder than I am. Given the continuation passing style of Builder, is it possible to optimize writing adjacent strings without a significant amount of work? It looks like the source (ptr, offset, length) of the previous BuildStep isn't visible to subsequent BuildSteps, and this makes sense. If Builder can manufacture bytes out of nowhere with its numeric and other APIs, not every step comes from a definite location in memory. It would be nonsense to test for adjacency when there is no source location.

awpr (Member) commented Sep 11, 2015

Let me get this straight then:

  1. Builder as utilized by WAI is told to write to a buffer no matter what. Right now there's no signalling path for that to avoid writing to a buffer?

Mostly -- there's a signal produced to insert chunks directly, but Warp HTTP/2 currently handles it by copying the chunk into the buffer anyway (see https://github.com/yesodweb/wai/blob/master/warp/Network/Wai/Handler/Warp/HTTP2/Sender.hs#L303). This is arguably bad and we should probably open an issue to track it.

Rather, it seems to me that the Builder API guarantees small strings are copied.

Well, some Builder-creating APIs guarantee that small strings are copied -- namely 'byteStringCopy' which always copies the input, 'byteStringThreshold n' which copies inputs smaller than n bytes, and 'byteString', which is 'byteStringThreshold 256'. (And the lazy* variants of those).

there will be at most two copies of "AUniqueString" in memory

Yes -- the Builder value 'byteString "AUniqueString"' is effectively a closure with a reference to "AUniqueString" (which presumably lives in the data segment since it's a string literal, but it could just as well have come from I/O), which takes a buffer (void* / Ptr Word8), calls memcpy, and says "I wrote 13 bytes into your buffer, and I'm done now". This ends up happening at https://github.com/yesodweb/wai/blob/master/warp/Network/Wai/Handler/Warp/HTTP2/Sender.hs#L335, where the provided buffer is the connection-wide output buffer.

The Builder value 'byteStringInsert "AUniqueString"' (... sorry, I've had the name backwards this whole time) is also a closure with the same reference, except when you give it a buffer it says "I wrote 0 bytes into your buffer. Use this ByteString instead". What Warp does currently is then take that ByteString and copy it into the connection send buffer anyway. So (currently) either way exactly 2 copies exist -- the source and the send buffer. It should be possible to take note of the length of that ByteString, add it to any bytes we previously buffered, write the frame header, and send both the buffer and the ByteString chunk, so that there would only ever be one copy of that data. (This is the aforementioned new issue).

that byteStringCopyStep always allocates an intermediate buffer

I really hope that's not the case, since I've been asserting it's not this whole time -- I should probably go check now. <...> Yeah, it won't make an intermediate buffer. The call to 'copyBytes' there happens inside the BuildStep, i.e. at output time, while we're writing it to the send buffer. Until then it's just a closure with a reference to the original ByteString (albeit one that GHC knows might call 'touchForeignPtr' on the ByteString -- I don't totally know how this works internally, but it keeps the ByteString's memory alive until it's used).

If that threshold was a tunable parameter

I totally agree that it's important to be able to tune the threshold of copy vs. insert for these Builders. However I don't think such a threshold should actually be represented in Warp at all; it would be a property of the code creating the Builder -- i.e. by defining at module level "myByteString = byteStringThreshold 3" in your application, and using that instead of 'byteString' to construct responses. Of course, Warp could have some canned advice about this, with experimentally determined values, like: "On Linux 4.2, we suggest using 'byteStringThreshold 256' to optimize for HTTP/1.*, and 'byteStringThreshold 512' for HTTP/2. This is because inserting a chunk costs 2 syscalls instead of just 1 in HTTP/2. If you have a MagicNIC and use MagicOS, you should try setting the threshold to 3 instead because of the Magic Hardware Buffer and cheap syscalls. If for some reason your main memory is still a rotating drum, use 0 because copying a byte is obscenely expensive."
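
A sketch of that application-side tuning; the threshold value is an arbitrary placeholder, not a recommendation:

import Data.ByteString (ByteString)
import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra (byteStringThreshold)

-- Defined once at module level; used instead of 'byteString' when
-- constructing responses.
myByteString :: ByteString -> Builder
myByteString = byteStringThreshold 512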

is it possible to optimize writing adjacent strings without a significant amount of work?

Maybe? But probably not in the way you're imagining. You're right that a BuildStep can't very well look at where the last BuildStep's data came from. But it should be possible to have something like 'byteStringCoalesce :: [ByteString] -> Builder' that picks apart the ByteStrings and turns spans of adjacent ranges into a single 'byteStringThreshold'. Of course, it'd decompose nicely into 'foldMap (byteStringThreshold n) . coalesce', where coalesce :: [ByteString] -> [ByteString].
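
A hedged sketch of what such a 'coalesce' might look like against bytestring-0.10's internals (PS is the constructor quoted earlier in the thread):

import Data.ByteString.Internal (ByteString (PS))

-- Merge runs of ByteStrings that are adjacent slices of the same
-- underlying buffer into a single slice; leave everything else alone.
coalesce :: [ByteString] -> [ByteString]
coalesce (PS fp off len : PS fp' off' len' : rest)
  | fp == fp' && off + len == off' = coalesce (PS fp off (len + len') : rest)
coalesce (bs : rest) = bs : coalesce rest
coalesce [] = []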

One more thing: I don't fully understand the properties of ByteArray#, but I think changing ByteString to use that would outlaw a lot of existing ByteString uses. I believe that change would require all ByteStrings to be allocated in a particular way by GHC itself. But ByteStrings are often allocated and freed in arbitrary ways -- I've personally written a few modules that construct ByteStrings around memory allocated by C++ code, and I believe things like X11 and GTK bindings (libraries which have their own allocators) require this ability too. So you might see significant resistance to the change you're suggesting.

In fact (I just Googled this...) ShortByteString sounds like exactly what you're proposing to change ByteString to: https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/src/Data.ByteString.Short.Internal.html#ShortByteString

kazu-yamamoto (Contributor) commented Sep 11, 2015

@AaronFriel Are you aware that small values can be converted directly to a Builder, without going through ByteString? E.g. http://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString-Builder.html#v:word8
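
For instance (a minimal sketch; 'statusLine' is a hypothetical helper):

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder (Builder, byteString, intDec, word8)

-- intDec and word8 serialize directly into the output buffer; no
-- intermediate ByteString is allocated for the small values.
statusLine :: Int -> Builder
statusLine code = byteString "HTTP/1.1 " <> intDec code <> word8 13 <> word8 10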

kazu-yamamoto (Contributor) commented Sep 11, 2015

Oh, right, I forgot about this comment in all the chaos -- @kazu-yamamoto I was using the "~" syntax as type equality, as enabled by the DataKinds extension and some other extension I can't think of right now. I can change the Haddocks to avoid it if you think it's liable to confuse readers.

In my opinion, = is better than ~.

awpr (Member) commented Sep 11, 2015

Updated to use words instead of symbols -- here, on yesodweb/wai:http2-push, and on awpr/wai:wai_http2_staging.

kazu-yamamoto (Contributor) commented Sep 11, 2015

Let me say this again: ByteStrings are pinned objects, but the global lock is taken only when a large ByteString (>= 3,272 bytes on 64-bit machines) is allocated. These are called large objects in GC terminology; no global lock is taken to allocate small ByteStrings.
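
Concretely, with the sizes from this comment:

import qualified Data.ByteString as S

small, large :: S.ByteString
small = S.replicate 3271 0  -- pinned, but below the threshold: no global lock
large = S.replicate 3272 0  -- a GC "large object"; its allocation takes the global lock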

kazu-yamamoto (Contributor) commented Sep 11, 2015

That is: if I pass to a Response constructor the result of byteString ("AUniqueString" :: ByteString), the Response will write that builder to its own buffer, and so there will be at most two copies of "AUniqueString" in memory? The first being the source ByteString, and the second being the buffer WAI fills? If that is the case, then Builder is fine as an API surface and avoids that extra allocation.

Yes. Two copies now.

Reading this discussion, it occurred to me how to eliminate one copy:

  • For plain HTTP, writev() can avoid copying. Warp has a writev() API that is currently unused.
  • For HTTP over TLS, zero copy (in Warp) is still possible: sendData in the tls library takes a lazy ByteString, so we can convert the Builder to a lazy ByteString (a list of strict ByteStrings) without copying and pass it to sendData. The tls library will copy the lazy ByteString, but we still save one copy this way.

It's worth trying.
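
The second bullet as a sketch, assuming an established tls Context ('sendBuilderTLS' is a hypothetical name):

import Data.ByteString.Builder (Builder, toLazyByteString)
import qualified Network.TLS as TLS

-- toLazyByteString produces a list of strict chunks without an extra
-- copy; the one remaining copy happens inside the tls library.
sendBuilderTLS :: TLS.Context -> Builder -> IO ()
sendBuilderTLS ctx = TLS.sendData ctx . toLazyByteString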

kazu-yamamoto (Contributor) commented Sep 11, 2015

For this, we should study the internals of toLazyByteString.
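
Part of that internal machinery is already exposed: toLazyByteStringWith takes the allocation strategy that toLazyByteString merely defaults. A minimal sketch:

import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra
    (toLazyByteStringWith, untrimmedStrategy, smallChunkSize, defaultChunkSize)
import qualified Data.ByteString.Lazy as L

-- Same shape as toLazyByteString, but with an explicit chunk strategy
-- (first-chunk size, then growth size) and an explicit tail.
runBuilder :: Builder -> L.ByteString
runBuilder = toLazyByteStringWith
    (untrimmedStrategy smallChunkSize defaultChunkSize) L.empty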

kazu-yamamoto (Contributor) commented Sep 11, 2015

Updated to use words instead of symbols -- here, on yesodweb/wai:http2-push, and on awpr/wai:wai_http2_staging.

Nice!

kazu-yamamoto (Contributor) commented Sep 11, 2015

@awpr In type HTTP2Application = Request -> PushFunc -> Responder, why is PushFunc the second argument? Do you use partial application somewhere?

It seems to me that type HTTP2Application = PushFunc -> Request -> Responder is more understandable.

awpr (Member) commented Sep 11, 2015

It was somewhat arbitrary, but initially Responder was inlined there, so the PushFunc was adjacent to the respond function.

I'm wondering whether it would make sense to give Responder a push function too, since it's possible for the associated stream of a PUSH_PROMISE frame to be itself a pushed stream. In that case it would be just type HTTP2Application = Request -> Responder. I don't mind swapping the argument order, but I'd like to consider whether this is better.

awpr (Member) commented Sep 11, 2015

Regarding writev() and sending lazy ByteStrings to tls, I think Warp can actually do better than toLazyByteString because we also have the guarantee that the buffer is safe to reuse as soon as the write call returns. I'm sure there's still insight to be gained from its implementation, though.

I think I might be able to reuse part of a failed experiment to make this complexity manageable -- using a bespoke Monad stack to automatically track the write offset and flush function, so that individual steps of Builders become composable. It seems like it'd be easy enough to add a FIFO queue of strict ByteStrings to the state, to be turned into a writev() call and/or lazy ByteString when the buffer space is exhausted. (I implemented that while trying to move data frame writing into per-stream buffers, but it turned out that was a dead end because headers must be sent in the same order they're compressed, and the priority queue can't guarantee that if they're sent concurrently.)
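
A rough shape for that idea (all names hypothetical; a sketch, not Warp code):

import Control.Monad.Trans.State.Strict (StateT, modify')
import Data.ByteString (ByteString)

data SendState = SendState
  { writeOffset   :: !Int          -- current offset into the shared send buffer
  , pendingChunks :: [ByteString]  -- FIFO of inserted chunks awaiting writev()
  }

type SendM = StateT SendState IO

-- Queue a chunk for vectored output instead of copying it into the buffer.
queueChunk :: ByteString -> SendM ()
queueChunk bs = modify' (\s -> s { pendingChunks = pendingChunks s ++ [bs] })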

kazu-yamamoto (Contributor) commented Sep 11, 2015

I'm wondering whether it would make sense to give Responder a push function too, since it's possible for the associated stream of a PUSH_PROMISE frame to be itself a pushed stream. In that case it would be just type HTTP2Application = Request -> Responder. I don't mind swapping the argument order, but I'd like to consider whether this is better.

Do you mean the following?

type HTTP2Application = Request -> Responder
type Responder = PushFunc -> (forall a. Status -> ResponseHeaders -> Body a -> IO a) -> IO Trailers
type PushFunc = PushPromise -> Responder -> IO Bool

In this definition, PushFunc and Responder are mutually recursive. Is this possible?

awpr (Member) commented Sep 11, 2015

Ah, yes, I think GHC would complain about infinite types there. However, if Responder becomes a newtype, as in awpr@211c75c, it would be fine:

type HTTP2Application = Request -> Responder
newtype Responder = Responder { runResponder :: forall s. PushFunc -> (forall a. RespondFunc s a) -> IO s }
type PushFunc = PushPromise -> Responder -> IO Bool

kazu-yamamoto (Contributor) commented Sep 11, 2015

@awpr I feel that the original one is much easier to understand.

kazu-yamamoto (Contributor) commented Sep 11, 2015

In your example, indexPage pushes two files. Is it done concurrently? Or sequentially?

If this is sequential, I would implement concurrent pushing.

awpr (Member) commented Sep 11, 2015

It's definitely easier to understand, but based on the anecdotal evidence that I managed to write a call-the-wrong-respond-function bug in the one HTTP2Application I've ever written using PushPromise, I think the conceptual overhead is justified -- that's a 100% error rate!

In indexPage, each call to push blocks just long enough to know the sender has dequeued the push Output, but the actual stream body is run concurrently (in its own thread). That allows applications to adhere to the HTTP/2 spec in section 8.2.1: "The server SHOULD send PUSH_PROMISE (Section 6.6) frames prior to sending any frames that reference the promised responses."

kazu-yamamoto (Contributor) commented Sep 11, 2015

@awpr OK. Would you push 211c75c to wai repo? I will continue the review based on that.

awpr (Member) commented Sep 11, 2015

Done (as well as the usability changes that followed it).

kazu-yamamoto (Contributor) commented Sep 11, 2015

Thank you.

Is the definition of HTTP2Application still the old one?

Also, I would like to have the gist example updated.

awpr (Member) commented Sep 11, 2015

HTTP2Application isn't missing any changes that I've actually implemented -- I just haven't made the change to move PushFunc into Responder yet.

Updated the example.

AaronFriel (Contributor) commented Sep 11, 2015

I totally agree that it's important to be able to tune the threshold of copy vs. insert for these Builders. However I don't think such a threshold should actually be represented in Warp at all; it would be a property of the code creating the Builder -- i.e. by defining at module level "myByteString = byteStringThreshold 3" in your application, and using that instead of 'byteString' to construct responses. Of course, Warp could have some canned advice about this, with experimentally determined values...

Ah, so will WAI change to hew more closely to the Builder spec of not copying when using byteStringInsert? That would be excellent and allow finer-grained control.

I would say this again. ByteStrings are pinned objects. But global locks are taken only when large ByteString (>= 3,272 bytes on 64 bit machines) are allocated. They are called large objects in the GC terminology. No global locks are taken to allocate small ByteStrings.

I am not sure if we need to use pinned objects for WAI, ever. I understand that the current ByteString API contract states that all backing memory for them is pinned, but that, strictly speaking, doesn't have to be the case. Pinned objects cause heap fragmentation and make the garbage collector's work more difficult, but I think it's the lock more than anything else that is the issue.

There are plenty of cases where unpinned memory works great for ByteString, and it makes sense that the default is pinned (it ensures that complex FFI won't break because of a moved value), but one-off FFI, where you just need to pass a string once, doesn't require pinned values.

I don't think ShortByteString is what you want either, because there is no Builder for it, and that takes us back to all of the previously discussed objections to using a ByteString API. What you gain in not having a global lock, you lose in having to memcpy everything or write your own (brittle) wrapper around ShortByteString to act as a Builder (for filling buffers) and taking substrings.

Regarding writev() and sending lazy ByteStrings to tls, I think Warp can actually do better than toLazyByteString because we also have the guarantee that the buffer is safe to reuse as soon as the write call returns. I'm sure there's still insight to be gained from its implementation, though.

I would agree with this analysis. We can't use tls without having to write the encrypted string somewhere, but I think we can at least be zero-copy until that point.
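As a sketch of what that reuse guarantee allows, here is a hypothetical sender loop (not Warp's actual code) that drains a Builder through one fixed buffer, assuming a 'send' callback and a buffer large enough for any 'More' request:

```haskell
import Control.Monad (when)
import qualified Data.ByteString as BS
import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra (BufferWriter, Next (..), runBuilder)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)

-- Hand each filled portion of the buffer to 'send'; because 'send'
-- has returned before the next fill begins, the same buffer is safe
-- to reuse immediately.
drain :: (Ptr Word8 -> Int -> IO ())  -- assumed: send from the raw buffer
      -> (BS.ByteString -> IO ())     -- assumed: send an inserted chunk
      -> Int                          -- buffer size
      -> Builder
      -> IO ()
drain send sendChunk size b = allocaBytes size (go (runBuilder b))
  where
    go :: BufferWriter -> Ptr Word8 -> IO ()
    go writer buf = do
      (written, next) <- writer buf size
      when (written > 0) (send buf written)
      case next of
        Done             -> return ()
        More _ writer'   -> go writer' buf
        -- a chunk the Builder declined to copy (e.g. byteStringInsert):
        Chunk bs writer' -> sendChunk bs >> go writer' buf
```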

Member

awpr commented Sep 11, 2015

Builder spec of not copying when using byteStringInsert

There's not really any such spec -- byteStringInsert promises that running the Builder will give 'Chunk bs rest' rather than writing the contents to the given buffer, but it doesn't (and can't) guarantee anything about what the caller will do with that ByteString value.
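Concretely, that behaviour can be observed with runBuilder from Data.ByteString.Builder.Extra (a small sketch; the expected output is an assumption about the current bytestring implementation):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder.Extra (Next (..), byteStringInsert, runBuilder)
import Foreign.Marshal.Alloc (allocaBytes)

-- One step of the Builder: nothing lands in our buffer; the chunk
-- comes back verbatim in 'Chunk'.
main :: IO ()
main = allocaBytes 16 $ \buf -> do
  (written, next) <- runBuilder (byteStringInsert "hello") buf 16
  case next of
    Chunk bs _ -> print (written, bs)  -- expected: (0,"hello")
    _          -> putStrLn "copied or deferred instead"
```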

That said I do think Warp's HTTP/2 sender (not WAI, which has no control over this) should potentially change to avoid copying ByteString chunks yielded by the builder. This could conceivably cause poor performance for existing Builders that yield small chunks, though. Maybe it could be a boolean setting -- "yes, I'm sure my Builders are tuned; please don't copy them" -- if there's a large performance cost for naive applications.

I am not sure if we need to use pinned objects for WAI, ever

The send buffer already doesn't use pinned memory; it's allocated by Run.hs calling 'allocateBuffer', which is defined in Buffer.hs as 'mallocBytes', which allocates "outside of the area maintained by the Haskell storage manager". Pinned memory is special in that it's garbage-collected but not movable; this memory is not garbage-collected and therefore doesn't need any special treatment to be sure it won't move.
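Paraphrasing the definitions described above (the real code lives in Warp's Buffer.hs):

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr)

-- Plain malloc'd memory, outside the GHC heap, so the GC neither
-- moves nor scans it.
allocateBuffer :: Int -> IO (Ptr Word8)
allocateBuffer = mallocBytes

freeBuffer :: Ptr Word8 -> IO ()
freeBuffer = free
```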

that, strictly speaking, doesn't have to be the case

With the current ByteString representation (namely a pointer in the form of ForeignPtr Word8), it does, because a ByteString pointing at movable memory would become a dangling reference if that memory were moved. Changing the ByteString constructor to use ByteArray# is almost certainly a non-starter because it would prevent using memory allocated by anything other than GHC's heap allocator.
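For reference, the representation in question, from Data.ByteString.Internal (bytestring 0.10):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr)

-- A base pointer plus offset and length.  Whatever the ForeignPtr
-- addresses must therefore stay put for the ByteString's lifetime.
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length
```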

one-off FFI where you just need to pass a string once don't require pinned values

I don't think this is true anyway, because there's nothing preventing the GC from running during an FFI call. If the FFI call is marked as 'safe', then the runtime will have released the Capability before entering the call, and a collection can happen in the middle of the call. Even if you can make it safe by marking the FFI call as 'unsafe' (which I'm not sure you can), you'd be better off using something like ShortByteString anyway.
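For illustration, the two import flavours side by side (memset chosen arbitrarily):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Word (Word8)
import Foreign.C.Types (CInt (..), CSize (..))
import Foreign.Ptr (Ptr)

-- A 'safe' import releases the capability for the duration of the
-- call, so a GC (which may move unpinned heap objects) can run while
-- the C function executes:
foreign import ccall safe "string.h memset"
  c_memset_safe :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)

-- An 'unsafe' import keeps the capability, so no stop-the-world GC
-- can start while the call is in progress -- the only variant under
-- which passing movable memory could even be contemplated:
foreign import ccall unsafe "string.h memset"
  c_memset_unsafe :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
```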

Since Warp wraps up its buffers in ByteStrings for I/O and uses them in FFI'd calls like pread, they really do need to be immovable, so they couldn't be ShortByteStrings even if we wanted them to be. But I still don't see where the talk of a global lock comes from -- each HTTP/2 session gets one send buffer, which isn't pinned heap memory in the first place, and sending individual chunks of data for a stream shouldn't allocate any ByteStrings, large or otherwise.

Contributor

AaronFriel commented Sep 11, 2015

I am not sure if we need to use pinned objects for WAI, ever

The send buffer already doesn't use pinned memory; it's allocated by Run.hs calling 'allocateBuffer', which is defined in Buffer.hs as 'mallocBytes', which allocates "outside of the area maintained by the Haskell storage manager". Pinned memory is special in that it's garbage-collected but not movable; this memory is not garbage-collected and therefore doesn't need any special treatment to be sure it won't move.

Oh, I agree. I think we're talking past each other, though. The issue is that ByteString always uses pinned memory, and Builder is built on ByteString, so when web applications create ByteStrings -- for example, when encoding from Text to ByteString -- they cause unnecessary fragmentation. That is: even if WAI always does the right thing and doesn't unnecessarily pin memory, applications are forced to pin because there's no way to opt out of it.

This is why I think it might be useful to rewrite ByteString in terms of ByteArray#, using pinned memory by default (to preserve existing contracts) and offering an Unpinned module (to avoid the global lock).
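A rough sketch of what that might look like, using the primitive package's wrappers rather than raw primops (entirely hypothetical -- not real bytestring code):

```haskell
import Data.Primitive.ByteArray
  (ByteArray, newByteArray, newPinnedByteArray, unsafeFreezeByteArray)

-- Hypothetical payload type: offset and length into a ByteArray that
-- may or may not be pinned.
data HeapBS = HeapBS !ByteArray !Int !Int

-- Default constructor: pinned, preserving the existing contract that
-- the payload is safe to hand to arbitrary FFI.
newPinnedBS :: Int -> IO HeapBS
newPinnedBS n = do
  mba <- newPinnedByteArray n
  ba  <- unsafeFreezeByteArray mba
  return (HeapBS ba 0 n)

-- The opt-out lives in a separate "Unpinned" module: ordinary heap
-- allocation, so no large-object lock and no fragmentation, at the
-- cost of the payload being movable.
newUnpinnedBS :: Int -> IO HeapBS
newUnpinnedBS n = do
  mba <- newByteArray n
  ba  <- unsafeFreezeByteArray mba
  return (HeapBS ba 0 n)
```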

Changing the ByteString constructor to use ByteArray# is almost certainly a non-starter because it would prevent using memory allocated by anything other than GHC's heap allocator.

I'm not sure why that would be the case? I would have to look into the GHC primops, but if we receive memory via FFI allocated outside of GHC, I think there is a way to treat it as pinned -- it wouldn't be in the GHC heap anyhow. I know ByteArray# is a supported type for FFI, so it would surprise me if this didn't work.

That would require significant digging into the FFI - but I want to reiterate:

I am not suggesting changing the default contract of ByteString from using pinned memory. I want to look into improving perf by avoiding the lock when it's unnecessary. That is: when the memory is allocated by GHC and it will be sent to one-off FFI functions that do not hold any reference to it after the call completes.

one-off FFI where you just need to pass a string once don't require pinned values

I don't think this is true anyway, because there's nothing preventing the GC from running during an FFI call. If the FFI call is marked as 'safe', then the runtime will have released the Capability before entering the call, and a collection can happen in the middle of the call. Even if you can make it safe by marking the FFI call as 'unsafe' (which I'm not sure you can), you'd be better off using something like ShortByteString anyway.

The manual specifically says that ByteArray# is FFI-safe. If the GC might move the memory during a call, that seems like something that should be noted.

8.1.1. Unboxed types

The following unboxed types may be used as basic foreign types (see FFI Addendum, Section 3.2): Int#, Word#, Char#, Float#, Double#, Addr#, StablePtr# a, MutableByteArray#, ForeignObj#, and ByteArray#.

(From GHC Manual).
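For what it's worth, such an import would look something like this under UnliftedFFITypes (my_hash is a made-up C function):

```haskell
{-# LANGUAGE MagicHash, UnliftedFFITypes #-}
import Foreign.C.Types (CSize (..))
import GHC.Exts (ByteArray#)

-- The ByteArray# argument is passed as a pointer to its payload.
-- This is only sound when the array is pinned or the call is
-- 'unsafe', since a GC could otherwise move the payload mid-call.
foreign import ccall unsafe "my_hash"
  c_hash :: ByteArray# -> CSize -> IO CSize
```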

Member

awpr commented Sep 11, 2015

I have no authority on matters of ByteString internals whatsoever, so I'm not going to spend any more time debating that change, but my reference for ByteArray#s moving during FFI calls is https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/Pinned. I've seen some really stale information on that wiki before, so maybe it's not true anymore.

As for Builders, it's entirely possible to use Builders without ever allocating a ByteString. Yielding a 'ByteString' chunk is an optimization for when a ByteString already exists. Needing to use a ByteString as part of a Builder is a symptom of either a missing library function (like Text -> Builder that will encode the Text's codepoints directly into the given buffer), or some upstream API that gave you a ByteString.

Contributor

AaronFriel commented Sep 12, 2015

As for Builders, it's entirely possible to use Builders without ever allocating a ByteString. Yielding a 'ByteString' chunk is an optimization for when a ByteString already exists. Needing to use a ByteString as part of a Builder is a symptom of either a missing library function (like Text -> Builder that will encode the Text's codepoints directly into the given buffer), or some upstream API that gave you a ByteString.

I understand; I'm just looking for ways to avoid the ByteString intermediate (or really: the pinning of memory and the global lock for large chunks) when it's unnecessary. I'm not sure whether I should file a Trac ticket, ask for clarification on the wiki article, or just bug people in #haskell on IRC until someone can clarify.

Member

awpr commented Sep 12, 2015

I'm not sure why I couldn't find this on Hoogle yesterday, but I stumbled upon it browsing the 'text' Haddocks: http://hackage.haskell.org/package/text-1.2.1.3/docs/Data-Text-Encoding.html#g:6. If you're concerned about the Text case in particular, this is your answer -- it transcodes the internal UTF-16 directly into UTF-8 in the output buffer without allocating intermediate ByteStrings.
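For example (a trivial usage sketch):

```haskell
import Data.ByteString.Builder (Builder)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8Builder)

-- Using the function linked above: the Text's contents are encoded
-- straight into the Builder's output buffer, with no intermediate
-- ByteString (and hence no pinned allocation) along the way.
greeting :: Builder
greeting = encodeUtf8Builder (T.pack "Hello, HTTP/2!")
```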

Member

awpr commented Sep 12, 2015

Well that's it, I'm switching to Stackage Hoogle -- over there it's the first result for "Text -> Builder". On haskell.org Hoogle it doesn't seem to be indexed at all ("encodeUtf8Builder +text" gives nothing).

kazu-yamamoto added a commit that referenced this pull request Sep 13, 2015

Merge pull request #426 from awpr/wai_http2
Add an HTTP/2-aware application type supporting trailers and pushed streams.

@kazu-yamamoto kazu-yamamoto merged commit 7645791 into yesodweb:master Sep 13, 2015

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

Contributor

kazu-yamamoto commented Sep 13, 2015

@awpr OK. My review is done. The quality of this patch is really high. I have merged this pull request. Thank you for your contribution.

Please update TBD in the doc of Network.WAI.HTTP2.

kazu-yamamoto added a commit that referenced this pull request Sep 14, 2015

Merge pull request #432 from yesodweb/http2-push
Commits lost in the confusion of #426

Contributor

AaronFriel commented Sep 14, 2015

@awpr Good catch. Yeah, I wasn't able to find it on Hoogle either and assumed it didn't exist. I'm glad that it does! And it would be great to see more such functions.
