Buffer's queue does not remove canceled requests #71

Open
seanmonstar opened this issue May 10, 2018 · 5 comments

@seanmonstar
Collaborator

Because Buffer uses a futures::sync::mpsc channel, any ResponseFutures that have been dropped continue to consume space in the queue until the underlying service has worked through the requests in front of them. A Buffer could be wrapped in Timeout, which would cancel requests that wait too long; since the corresponding oneshot::Sender is still somewhere in the queue, the buffer's capacity can fill up with canceled requests.
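A minimal sketch of why that happens (illustrative types, not Buffer's actual internals, assuming a bounded futures::sync::mpsc channel): the request is enqueued together with the oneshot::Sender for its response, and dropping the ResponseFuture only drops the Receiver end, so the queued message keeps its slot until the worker reaches it.

```rust
use futures::sync::{mpsc, oneshot};

// Illustrative message type: the response channel's Sender rides along with
// the request in the buffer's queue.
struct Message<Req, Rsp> {
    request: Req,
    // Cancelation is only observable from this Sender (via `is_canceled` /
    // `poll_cancel`) once the message has been dequeued by the worker.
    tx: oneshot::Sender<Rsp>,
}

fn enqueue<Req, Rsp>(
    queue: &mut mpsc::Sender<Message<Req, Rsp>>,
    request: Req,
) -> Result<oneshot::Receiver<Rsp>, ()> {
    let (tx, rx) = oneshot::channel();
    // Even if the caller drops `rx` right away (timeout, closed connection),
    // this message still counts against the buffer's capacity until the
    // worker reaches it.
    queue.try_send(Message { request, tx }).map_err(|_| ())?;
    Ok(rx)
}
```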

There's the additional issue that a wrapped Reconnect may wish to retry a failed connect only if there are still response futures waiting, but because those waiters are hidden inside the queue, there's no way to tell.

This is kind of the "other" half of a pool. hyper does have a queue of waiters internally, and can check when they are canceled, since they are actually in a VecDeque. To allow new requests to enter this queue, it's wrapped in an Arc<Mutex>. While perhaps not the best thing in the world, it does work, and people still get excellent performance from hyper's client, so we could consider that as a first pass.
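A hedged first-pass sketch in that spirit (names are illustrative, not hyper's or tower's actual types): keep the waiters in an Arc<Mutex<VecDeque<_>>> that both producers and the worker can reach, so canceled waiters can be pruned eagerly instead of lingering until they reach the front.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

use futures::sync::oneshot;

// Shared waiter list: producers push, and the worker (or poll_ready) can
// iterate it because it is an ordinary VecDeque behind a Mutex.
type Waiters<Req, Rsp> = Arc<Mutex<VecDeque<(Req, oneshot::Sender<Rsp>)>>>;

fn push_waiter<Req, Rsp>(waiters: &Waiters<Req, Rsp>, request: Req) -> oneshot::Receiver<Rsp> {
    let (tx, rx) = oneshot::channel();
    waiters.lock().unwrap().push_back((request, tx));
    rx
}

fn prune_canceled<Req, Rsp>(waiters: &Waiters<Req, Rsp>) {
    // `is_canceled` reports whether the matching Receiver (the response
    // future handed back to the caller) has been dropped, so those slots
    // can be reclaimed immediately instead of waiting their turn.
    waiters
        .lock()
        .unwrap()
        .retain(|&(_, ref tx)| !tx.is_canceled());
}
```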

Related Conduit issue: linkerd/linkerd2#899

@carllerche
Member

I believe that this describes two separate issues.

  • Only reconnecting when there is a request to send.
  • The inability to purge canceled requests from the queue.

Reconnect

Thinking about it more, I am leaning towards refining Service::poll_ready with the following:

  • An error is terminal. The service will never work again.
  • The fn should only be called when there is a request to send. This means that if the request "goes away", poll_ready should not be called anymore.

This means that, if there are no requests in the buffer, poll_ready would not be called and the connection would not be established.

The reason I like this is that it models what would happen if you did a select over the request being canceled and the service's ready future. It also encapsulates the poll_ready concept and avoids leaking it out of the abstraction.

The problem is that the connection logic might end up halfway there (socket connected and TLS handshake in progress) when poll_ready stops being called, so the handshake never finishes. Either this is OK (the remote will time out), or a task could be spawned to drive the connection to completion regardless of whether poll_ready is called. I would also say that this problem exists today with the current Service API.
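A minimal sketch of the proposed contract (using a stand-in service trait for illustration, not tower's real API): readiness is only driven while a request is actually waiting, and polling stops as soon as that request is canceled, so nothing is connected on behalf of a caller that has gone away.

```rust
use futures::sync::oneshot;
use futures::{Async, Future, Poll};

// Stand-in service trait for illustration; the real tower trait differs.
trait Svc {
    type Request;
    type Response;
    type Error;
    type Future: Future<Item = Self::Response, Error = Self::Error>;
    fn poll_ready(&mut self) -> Poll<(), Self::Error>;
    fn call(&mut self, req: Self::Request) -> Self::Future;
}

fn poll_dispatch<S: Svc>(
    svc: &mut S,
    waiting: &mut Option<(S::Request, oneshot::Sender<S::Response>)>,
) -> Poll<(), S::Error> {
    // No request waiting: per the proposal, poll_ready is not called, so an
    // idle Reconnect never dials.
    let canceled = match waiting.as_mut() {
        None => return Ok(Async::NotReady),
        Some(&mut (_, ref mut tx)) => match tx.poll_cancel() {
            Ok(Async::Ready(())) => true,
            _ => false,
        },
    };
    if canceled {
        // The request "went away": stop driving the inner service.
        *waiting = None;
        return Ok(Async::NotReady);
    }
    // Only now is the inner service pushed toward readiness; under the
    // proposed contract an error here would be terminal.
    match svc.poll_ready()? {
        Async::NotReady => return Ok(Async::NotReady),
        Async::Ready(()) => {}
    }
    let (req, _tx) = waiting.take().expect("checked above");
    let _response = svc.call(req);
    // ...driving `_response` and forwarding the result into `_tx` is elided...
    Ok(Async::Ready(()))
}
```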

Buffer

The second problem is that the queue used by Buffer does not eagerly release resources for canceled requests. I'm not exactly sure what you are proposing.

There are ways to deal with it without resorting to a Mutex<Vec<_>>, but of course we shouldn't over-optimize for performance without numbers. Is the queue holding on to memory a real issue today? I would think it isn't a ton of memory. It is also worth noting that the mpsc queue algorithm can support iteration on the consumer side without much trouble.

@seanmonstar
Collaborator Author

This morning after thinking about it some more, I started to lean more in this direction:

  • Reconnect: In the proxy, we can treat the error from Reconnect::poll_ready as terminal, just as you mentioned. We would log that it happened, and then create a fresh Reconnect service.
  • Buffer: The memory waste itself isn't the larger issue, just something I noticed while reading through the source. However, as long as we can't iterate the waiters, we don't know if we should keep polling the inner Service::poll_ready.

Trying to combine the two needs a bit of a dance. So that the inner service in the proxy doesn't keep creating new Reconnect services, polling them, and thrashing on error over and over, we would only want to poll the new Reconnect if there is still a request waiting. However, we don't really know whether there is, so what do we return from poll_ready?

If Buffer were to iterate its waiters in poll_ready and pop any canceled ones, then we could potentially just task::current().notify() in the inner service and let the Buffer poll it again only if there are still requests waiting.
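A hedged sketch of that combination (illustrative signature only, with the inner service's readiness abstracted as a closure): poll_ready prunes canceled waiters first and drives the inner service only while someone is still waiting.

```rust
use std::collections::VecDeque;

use futures::sync::oneshot;
use futures::{Async, Poll};

fn buffer_poll_ready<Req, Rsp, E, F>(
    waiters: &mut VecDeque<(Req, oneshot::Sender<Rsp>)>,
    mut poll_inner: F,
) -> Poll<(), E>
where
    F: FnMut() -> Poll<(), E>,
{
    // Eagerly drop waiters whose response future has gone away.
    waiters.retain(|&(_, ref tx)| !tx.is_canceled());

    if waiters.is_empty() {
        // No live requests: don't poll the inner service, so a fresh
        // Reconnect isn't driven (or thrashed on error) with nothing to send.
        return Ok(Async::NotReady);
    }
    // Someone is still waiting, so keep driving the inner service. If it
    // called task::current().notify() before returning NotReady, it will be
    // polled again, but only while demand remains.
    poll_inner()
}
```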

@carllerche
Member

Ah, sorry, there was another bit that I forgot to say in my previous comment.

The Buffer task implementation would always pop one request from the queue, then start calling poll_ready on the inner service. It would also call poll_cancel (or whatever it is) on the response oneshot. This way, it knows if the request is canceled. If the request is canceled, it pops another request from the queue. If all requests are canceled, this will effectively drain the queue.
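A hedged sketch of that worker loop (simplified types, not Buffer's actual implementation; readiness polling and dispatch of the live request are elided): hold at most one popped request at a time, and use poll_cancel on its response oneshot to discard canceled ones and pull the next, which drains a fully-canceled queue.

```rust
use futures::sync::{mpsc, oneshot};
use futures::{Async, Poll, Stream};

struct Worker<Req, Rsp> {
    rx: mpsc::UnboundedReceiver<(Req, oneshot::Sender<Rsp>)>,
    // At most one request is held out of the channel at a time.
    current: Option<(Req, oneshot::Sender<Rsp>)>,
}

impl<Req, Rsp> Worker<Req, Rsp> {
    /// Yields the next live request, skipping any whose response future has
    /// been dropped while it sat in the queue.
    fn poll_next_live(&mut self) -> Poll<Option<(Req, oneshot::Sender<Rsp>)>, ()> {
        loop {
            if self.current.is_none() {
                match self.rx.poll()? {
                    Async::Ready(msg) => self.current = msg,
                    Async::NotReady => return Ok(Async::NotReady),
                }
                if self.current.is_none() {
                    // Channel closed and fully drained.
                    return Ok(Async::Ready(None));
                }
            }
            // Ask the response oneshot whether its receiver was dropped.
            let canceled = {
                let &mut (_, ref mut tx) = self.current.as_mut().unwrap();
                match tx.poll_cancel() {
                    Ok(Async::Ready(())) => true,
                    _ => false,
                }
            };
            if canceled {
                // Discard it and pop the next one; a queue full of canceled
                // requests gets drained right here.
                self.current = None;
                continue;
            }
            return Ok(Async::Ready(self.current.take()));
        }
    }
}
```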

@seanmonstar
Collaborator Author

Yes, that would help when a uniform timeout is applied to all requests, since it's unlikely that a request at the front of the queue has not timed out while one further back has.

However, it doesn't account for requests canceled for other reasons: in the proxy, the server connection could be closed (since we coalesce requests to the same target from different connections), or we could have received a RST_STREAM frame for the request.

@jonhoo
Collaborator

jonhoo commented Mar 31, 2020

@seanmonstar has this since been fixed? If not, could you reiterate the issues now that #72 has landed?

@jonhoo added the A-buffer, A-reconnect, C-bug, E-needs-mcve, I-hang, S-waiting-on-author, and T-middleware labels on Mar 31, 2020