Negative acknowledgments #17

Closed
martinthomson opened this issue Apr 7, 2015 · 13 comments

Comments

@martinthomson
Contributor

We need to consider this.

@brianraymor

The original proposal for context:

The push server MUST generate a 504 (Gateway Timeout) if the user agent fails to acknowledge the receipt of the push message or the push server fails to deliver the message prior to its expiration.

@martinthomson
Contributor Author

Action for Martin to choose the right status code to use for NACKs.

@martinthomson
Contributor Author

OK, here's the problem: there is no status code that works for this scenario.

We could try to be clever and attach different semantics to 404 and 410, or we could mint a new status code. As for whether this is part of -00, we need text soon if it is going to be in.

We could use 418 (I'm A Teapot), but IANA still doesn't have it registered.

@brianraymor

I still believe that 504 is closest to what we need:

The 504 (Gateway Timeout) status code indicates that the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request.

@martinthomson
Contributor Author

The server (the push service in this case) is not acting as a gateway or proxy. It is the authority for the resource, so using 504 would be nonsensical. A status code for "the contents of the resource expired" would be more appropriate.

@brianraymor

503 is also close. Pick one for 00 and discuss -or- mint a new one?

@martinthomson
Contributor Author

I think that we need a new one. Close doesn't work.

@martinthomson
Contributor Author

We just had a discussion about this issue. The concern raised was that there isn't a great deal of information that is provided to the application server with the NACK. This means that the application server is unable to distinguish between all the error cases: push service failure, subscription removal, TTL expiry, error in the application at the user agent, intentional NACK by the application at the user agent.

The two obvious options for an application server in response to a NACK are to resend or to escalate. Resending has the problem that you could end up in a tight loop. Escalating might work, but escalation options are quickly exhausted. For instance, you can send a different message that has stronger semantics, like sending a reset state message if the state update message fails.
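
A minimal sketch of the resend-or-escalate choice, assuming a hypothetical application server that counts NACKs per message. `MAX_RESENDS`, `BASE_DELAY_SECONDS`, `Action`, and `next_action` are illustrative names and values, not part of any draft. The cap and the backoff are one possible way to avoid the tight loop, not the only one.

```python
from dataclasses import dataclass

MAX_RESENDS = 3          # hypothetical cap to avoid a tight resend loop
BASE_DELAY_SECONDS = 2   # hypothetical base for exponential backoff


@dataclass
class Action:
    kind: str             # "resend" or "escalate"
    delay_seconds: float  # how long to wait before acting


def next_action(nack_count: int) -> Action:
    """Decide what to do after the nack_count-th NACK for a message (1-based)."""
    if nack_count <= MAX_RESENDS:
        # Back off exponentially between resends: 2s, 4s, 8s.
        return Action("resend", BASE_DELAY_SECONDS * 2 ** (nack_count - 1))
    # Resends exhausted: send a message with stronger semantics,
    # e.g. a reset-state message in place of the failed state update.
    return Action("escalate", 0)
```

With these values, the fourth NACK for the same message is the point at which the server stops resending and escalates; after that single escalation there is nothing further to try, which is the "quickly exhausted" problem.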

Other than that, the primary advantage provided to the application server is the running of the TTL timer. I know that Elio thought that it was pretty important to run timers-as-a-service, but I still think that this is better left to the application server. See https://en.wikipedia.org/wiki/End-to-end_principle :

Put in economics terms, the marginal cost of additional reliability in the network exceeds the marginal cost of obtaining the same additional reliability by measures in the end hosts. The economically efficient level of reliability improvement inside the network depends on the specific circumstances; however, it is certainly nowhere near zero:[Ref 2] Clearly, some effort at the lower levels to improve network reliability can have a significant effect on application performance. (p. 281)

@brianraymor

This means that the application server is unable to distinguish between all the error cases: push service failure, subscription removal, TTL expiry, error in the application at the user agent, intentional NACK by the application at the user agent.

This suggests that the push message needs to include data with the status to distinguish between the error cases.

@martinthomson
Contributor Author

It was not my intent to suggest that. You will note that at least one of those failures results in no feedback at all.

The intent was to try to lay out why building something like this isn't simple. And why the end-to-end solution is perhaps superior in all respects - other than having the push service run a timer that the application server could run.

Here's another:
My push service manages message TTL in the most efficient way possible. It stores messages against a subscription and delivers them when a user agent requests them. It only expires messages in one of two ways: If it is delivering messages to the user agent, it filters out all the messages that have expired; otherwise, it performs a regular, continuous sweep of all stored messages in the system. This cleanup sweep can take as much as a day to run depending on load and other conditions. The consequence is that a NACK delivery might be sent almost a day late in the worst case.
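
A rough sketch of the storage model described above, under simplifying assumptions: messages live in memory, delivery removes them without waiting for an acknowledgment, and the caller passes the current time. `MessageStore`, `StoredMessage`, and the method names are hypothetical and do not come from any push service implementation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StoredMessage:
    body: bytes
    expires_at: float  # absolute expiry time, in seconds


@dataclass
class MessageStore:
    # subscription id -> messages waiting for the user agent
    pending: dict = field(default_factory=dict)

    def store(self, subscription, body, ttl, now=None):
        now = time.time() if now is None else now
        self.pending.setdefault(subscription, []).append(
            StoredMessage(body, now + ttl))

    def deliver(self, subscription, now=None):
        """Path 1: on delivery, filter out expired messages, return the rest."""
        now = time.time() if now is None else now
        messages = self.pending.pop(subscription, [])
        return [m.body for m in messages if m.expires_at > now]

    def sweep(self, now=None):
        """Path 2: background sweep; returns how many messages it expired."""
        now = time.time() if now is None else now
        expired = 0
        for subscription in list(self.pending):
            stored = self.pending[subscription]
            live = [m for m in stored if m.expires_at > now]
            expired += len(stored) - len(live)
            if live:
                self.pending[subscription] = live
            else:
                del self.pending[subscription]
        return expired
```

In this model nothing happens at the instant a message expires. An expired message for an idle subscription is first noticed inside `sweep`, so a NACK could be generated no earlier than the next sweep.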

Requiring NACKs in this way makes this sort of push service architecture more expensive to run. And the only benefit it delivers is that application servers can avoid running timers. Well, that is, application servers that care about reliability, but they can't care too much about it, or they are back to running timers again because only they can account for push service failures.

@martinthomson
Contributor Author

We have the following error scenarios to consider:

  • a message times out (?)
  • the push service gives up on delivery (either because it was not acknowledged in time, or otherwise) (504?)
  • the subscription is removed or deleted (410?)

MS have another state, where the app can send a negative signal indicating "yes, I received your message, but it caused an error". That might be a positive acknowledgment, but some applications want to have additional information carried back.

@martinthomson
Contributor Author

Elio suggested that we might want to leave some latitude for the push service to generate a range of status codes. Some suggestions might be sensible though.

@brianraymor

Proposal discussed at IETF 93:

Prior to TTL expiry, if the push service gives up on a message (or an acknowledgment), signal the error to the application server.

Even (or especially) with full reliability, there is no point in signaling the expiration of the TTL, because the application server might be offline.
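
The proposal reduces to a single rule, sketched here under the assumption that the push service records the time at which it abandoned a message. `should_signal_failure` and its parameters are hypothetical names, not taken from the draft.

```python
def should_signal_failure(gave_up_at: float, expires_at: float) -> bool:
    """Return True if the push service should report the failure.

    Only giving up before the TTL expires is signaled to the application
    server; expiry of the TTL itself is never signaled.
    """
    return gave_up_at < expires_at
```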
