Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sending outbox messages is fraught with issues #3881

Open
rk-for-zulip opened this issue Feb 7, 2020 · 17 comments
Open

Sending outbox messages is fraught with issues #3881

rk-for-zulip opened this issue Feb 7, 2020 · 17 comments
Assignees
Labels
a-compose/send Compose box, autocomplete, camera/upload, outbox, sending P1 high-priority

Comments

@rk-for-zulip
Copy link
Contributor

rk-for-zulip commented Feb 7, 2020

There is a small constellation of intertwined problems with message-sending.

  • All outbox messages are sent immediately, with no throttling.
  • Outbox messages may be sent in any order, regardless of their input order.
  • Server-rejected messages are not marked as such and will be repeatedly re-sent, wasting bandwidth and battery.
  • Outbox messages may be successfully sent multiple times.
  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.
  • (... etc.?)
@gnprice
Copy link
Member

gnprice commented Feb 7, 2020

  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

All of these would be good to address. I imagine a single solution may address several of them at once.

@gnprice gnprice added the a-compose/send Compose box, autocomplete, camera/upload, outbox, sending label Feb 7, 2020
@gnprice
Copy link
Member

gnprice commented Feb 7, 2020

See also #3829, #3731, #3584, #2374. And #3247 is an issue with a similar nature on a different codepath: double-sends of posting an emoji reaction to a message.

@gnprice
Copy link
Member

gnprice commented Feb 7, 2020

OK, and in particular:

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

This one is #2374. There's some nice discussion on that thread, too.

  • Server-rejected messages are not marked as such and will be repeatedly re-sent, wasting bandwidth and battery.

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

@rk-for-zulip
Copy link
Contributor Author

rk-for-zulip commented Feb 7, 2020

  • If the app is interrupted and killed while trying to send messages, it may become permanently stuck, and will never send messages again unless it is uninstalled or its data is wiped.

Huh fascinating! How does this one happen?

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

On reread, I think that's actually not possible at the moment... but only thanks to other bugs. tryUntilSuccessful doesn't await (it's not even async), and since it always returns true, sendOutbox will never await either. I don't think it's currently possible for the dehydration code to run while outboxSending is true.

At least, not unless there's an exception-generating bug in tryUntilSuccessful somewhere. That could do it.

  • Outbox messages may be successfully sent multiple times.

Definitely possible in principle: I think the scenario would be that a send request succeeds on the server, but the response doesn't make it to us; and the new-message event (which we make into EVENT_NEW_MESSAGE) doesn't reach us either before we retry.

Is that the kind of scenario you have in mind, or is there another path?

There is, sadly. The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

This can all be fixed on our end, at least in theory. The scenario you described is another matter, and may require API changes to prevent. (Though EVENT_NEW_MESSAGE should provide some mitigation.)

All of these would be good to address. I imagine a single solution may address several of them at once.

I would be very wary of any solution that did not.

@rk-for-zulip
Copy link
Contributor Author

rk-for-zulip commented Feb 7, 2020

I think this is the same issue as #3731, describing another aspect of the symptoms. (Which would be a nice point to add to that thread.)

They're certainly closely related, but only because they'll need essentially the same data to be stored in order to fix. (See below.) It's otherwise technically possible to fix either without the other – and in particular I wouldn't want the fix for the network behavior to be blocked on having an appropriate display form.

It's probably appropriate to copy my comment there over here, though:

From offline brainstorming: perhaps there should be an optional httpResponse field in Outbox. [...]

We could also give httpResponse a special non-numeric "too old to send" value [...] and retain messages that would have been dropped due to age [...].

@gnprice
Copy link
Member

gnprice commented Feb 29, 2020

outboxSending is stored in Redux. If the app is dehydrated with that set to true, then killed, there is no mechanism that will clear it.

Aha. There's actually a good reason this can't happen. That flag specifically appears at state.session.outboxSending... and state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

Quoting from src/boot/store.js:

/**
 * Properties on the global store which we explicitly choose not to persist.
 *
 * All properties on the global store should appear either here or in the
 * lists of properties we do persist, below.
 */
// prettier-ignore
export const discardKeys: Array<$Keys<GlobalState>> = [
  'alertWords', 'caughtUp', 'fetching',
  'nav', 'presence', 'session', 'topics', 'typing', 'userStatus',
];

That makes state.session an appropriate spot for information, like this, that's about what the current live app process is actively doing.

@gnprice
Copy link
Member

gnprice commented Feb 29, 2020

The message-sending tasks are completely loose – their conclusion is unordered with the rest of the program, so there's no sequencing between them and outboxSending being unset. In particular, if we get two calls to sendOutbox in synchronous succession, the message-sending tasks will all be fired twice before any send-attempts are made.

I see. I think you're referring to the fact that this await:

export const trySendMessages = (dispatch: Dispatch, getState: GetState): boolean => {
  // ...
  try {
    outboxToSend.forEach(async item => {
      // ...
      await api.sendMessage(auth, {

is in a forEach callback, and so nothing actually awaits the promise that that async function returns. And as a result here:

export const sendOutbox = () => async (dispatch: Dispatch, getState: GetState) => {
  // ...
  dispatch(toggleOutboxSending(true));
  while (!trySendMessages(dispatch, getState)) {
    await progressiveTimeout(); // eslint-disable-line no-await-in-loop
  }
  dispatch(toggleOutboxSending(false));
};

we set outboxSending back to false without actually waiting for any of those api.sendMessage promises to complete.

One nuance:

the message-sending tasks will all be fired twice before any send-attempts are made.

I don't think this is quite right. They'll all be fired twice before any responses to the send-attempts are received. But remember that await foo() is await (foo()), i.e. the foo() function call happens before any awaiting is needed... and that when a JS async function is called, the function body starts executing immediately, just like a non-async function, and yields only at await or return.

So it should be the case that when you call api.sendMessage(..), it synchronously gets all the way to a fetch call. (If not, that's probably a bug in our code, and kind of a surprising one.) And that, similarly, will fire off an actual network request -- then return a promise which will get resolved (or rejected) when the request concludes, and ultimately api.sendMessage(..) returns some promise chained off of that one.

Still, firing twice before seeing any responses is clearly not the desired behavior here either.

@gnprice
Copy link
Member

gnprice commented Feb 29, 2020

Looking again at this code, I think the basic problem is that that outboxToSend.forEach loop is a forEach loop, and the code (as quoted in my previous comment) was clearly written with the expectation it would behave like a for loop.

In particular the boolean return from trySendMessages is almost completely useless given the way the code actually works -- the only way to make sense of the intention here is that it wires up a failed request in api.sendMessage to cause us to do an await progressiveTimeout(). An exception in the top half of that forEach callback, before reaching the await, (a) is a lot less likely, (b) would be a bug in our code, not an external condition like a bad network connection, (c) would therefore make no sense to retry and less sense to back off before retrying.

But also, if that loop is changed to be a real JS loop so that the awaits actually happen in sequence, then I believe that fixes three of the points mentioned in the OP:

  • All outbox messages are sent immediately, with no throttling.
  • Outbox messages may be sent in any order, regardless of their input order.
  • Outbox messages may be successfully sent multiple times.

(For the third, it fixes the cause of it you had in mind, though not the other cause I then mentioned above.)

So, seems like we should do that!

I also went and took a look at the history of this file, following my speculations above about what the authors were thinking. 😉 The logic became the way it is in #3272 -- which in particular was supposed to fix #3259, sending messages out of order. Before that, #1079 introduced the forEach(async ..) confusion.

@rk-for-zulip
Copy link
Contributor Author

So, seems like we should do that!

Except that if we do – or at least, if we just do that – then that causes another bug to surface: the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week, after which all sendable messages (up to the next unsendable message, anyway) will suddenly be flushed to the server at once.

(This would have been worse before the recent 'discard unsendable messages after a week' change. Back then it would just have silently prevented any further messages from ever being sent at all.)

@gnprice
Copy link
Member

gnprice commented Feb 29, 2020

Aha. There's actually a good reason this can't happen. [...] state.session is among the parts of our Redux state we specifically don't have redux-persist put into persistent storage.

This is a pretty subtle aspect of how the app works. I've just pushed 664ee09, which adds a bit more discussion of this in a place which hopefully helps make it somewhat easier to discover.

@gnprice
Copy link
Member

gnprice commented Feb 29, 2020

the presence of an unsendable message in the Outbox will prevent any other messages from being sent for a week

Yeah, we should behave differently on

  • a 4xx status, indicating there's a problem with the specific request; and
  • a 5xx status or lack of response, either way indicating there's a problem with the server and/or network.

In the latter case, we should retry with backoff, and keep other outbox messages in the queue behind this one, along the lines of what this code was intended to do.

In the former, we should treat the message as permanently unsendable. One more-immediate thing we could do, to unblock fixing the rest of the logic without regressing anything, would be: on a 4xx status, move on to the next outbox message in the queue, but leave this one there, basically like we (accidentally) do in the current code.

rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 20, 2020
Write a complete replacement for `trySendMessages`.

This is not yet hooked in.

Will fix: zulip#3881, (others...)
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 11, 2020
This loop doesn't do what it looks like it does.  It looks like a
loop that tries to send one message, awaits that, then tries the
next, and so on.  In fact, because Array#forEach does nothing with
the return values of its callback -- in particular it doesn't notice
if they're promises or anything else, and doesn't await them -- the
actual behavior is to fire off a bunch of requests in parallel, wait
for nothing, and ignore errors.

That behavior isn't good, and we should fix it; that's zulip#3881.

For a start, just make the code more explicit about what it actually
does: desugar the async/await keywords to the equivalent use of
Promise#then, and add a comment for good measure.

This also smooths the way to rewriting Array#forEach here as part of
an automated sweep across our code.
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 12, 2020
This loop doesn't do what it looks like it does.  It looks like a
loop that tries to send one message, awaits that, then tries the
next, and so on.  In fact, because Array#forEach does nothing with
the return values of its callback -- in particular it doesn't notice
if they're promises or anything else, and doesn't await them -- the
actual behavior is to fire off a bunch of requests in parallel, wait
for nothing, and ignore errors.

That behavior isn't good, and we should fix it; that's zulip#3881.

For a start, just make the code more explicit about what it actually
does: desugar the async/await keywords to the equivalent use of
Promise#then, and add a comment for good measure.

This also smooths the way to rewriting Array#forEach here as part of
an automated sweep across our code.
gnprice added a commit to gnprice/zulip-mobile that referenced this issue Nov 14, 2020
This loop doesn't do what it looks like it does.  It looks like a
loop that tries to send one message, awaits that, then tries the
next, and so on.  In fact, because Array#forEach does nothing with
the return values of its callback -- in particular it doesn't notice
if they're promises or anything else, and doesn't await them -- the
actual behavior is to fire off a bunch of requests in parallel, wait
for nothing, and ignore errors.

That behavior isn't good, and we should fix it; that's zulip#3881.

For a start, just make the code more explicit about what it actually
does: desugar the async/await keywords to the equivalent use of
Promise#then, and add a comment for good measure.

This also smooths the way to rewriting Array#forEach here as part of
an automated sweep across our code.
chrisbobbe added a commit to agrawal-d/zulip-mobile that referenced this issue Dec 7, 2020
It'll be easier to write a test for it, when we eventually attempt
that, probably pending resolution of zulip#3881. In particular, without
this change, Flow would complain if we tried to pass
`store.dispatch` -- where `store` is mocked using redux-mock-store
-- as the first argument, for which our custom `Dispatch` type is
expected. I suspect this could be fixed by tightening up the
redux-mock-store libdef's `Dispatch` type. But, for consistency's
sake, `trySendMessages` should probably be a thunk action creator in
any case.

In e5268bb (and confirmed again with logging just now), we
observed that the return value of our function (in this case a
boolean) will be passed through and returned by the `dispatch` call.
chrisbobbe added a commit to agrawal-d/zulip-mobile that referenced this issue Dec 7, 2020
It'll be easier to write a test for it, when we eventually attempt
that, probably pending resolution of zulip#3881. In particular, without
this change, Flow would complain if we tried to pass
`store.dispatch` -- where `store` is mocked using redux-mock-store
-- as the first argument, for which our custom `Dispatch` type is
expected. I suspect this could be fixed by tightening up the
redux-mock-store libdef's `Dispatch` type. But, for consistency's
sake, `trySendMessages` should probably be a thunk action creator in
any case.

In e5268bb (and confirmed again with logging just now), we
observed that the return value of our function (in this case a
boolean) will be passed through and returned by the `dispatch` call.
@chrisbobbe
Copy link
Contributor

There's been quite a bit of productive discussion around here. The main piece of prep work we'd like to get done for our solution is #4193, which introduces a handy promiseTimeout function. We can use that function to time out a send-message request.

Marking as blocked by #4193, then, as noted in chat.

@gnprice
Copy link
Member

gnprice commented Mar 17, 2022

Marking as blocked by #4193, then

That PR was merged (after being split in two: #4753, #4754), so unblocking.

@gnprice gnprice removed the blocked on other work To come back to after another related PR, or some other task. label Mar 17, 2022
@gnprice
Copy link
Member

gnprice commented Mar 17, 2022

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design. (@chrisbobbe sent a design up at the top of the thread, and then there was further discussion but nobody has compiled a revised draft all in one place.)

That will include a state diagram, akin to this one, and a type definition, like this one crossed with the full draft from the start of the thread.

@chrisbobbe
Copy link
Contributor

chrisbobbe commented Jul 13, 2022

I think the next step, when we pick this thread of work up again, will be to take that chat thread from 2020-12 and make a fresh draft of the proposed design.

I've picked this back up again, here. I've written a new state diagram but I'd like to go through some concerns with it before doing the other parts.

@chrisbobbe
Copy link
Contributor

chrisbobbe commented Jul 21, 2022

Here's our latest state diagram, from discussion:

            User cancels the scheduled send.
           ┌────────────────────────────────────────────────────────────────────────┐
           │                                                                        │
           │                                                                        │
           │                                   User cancels during send (#4170).    │
           │                                  ┌───────────────────────────────────┐ │
           │                                  │                                   │ │
           │                                  │                                   │ │
           │                                  │                Event received,    │ │
(create)   │      Time for the scheduled      │                or we abandoned    │ │
  │        │      send (a try/retry).         │    200.        the queue.         ▼ ▼
  └► should-send ───────────────────────► sending ─────► sent ────────────────► (delete)
       │ ▲ ▲                                │ │                                     ▲
       │ │ │                                │ │                                     │
       │ │ │ App quit: schedule auto-retry. │ │                                     │
       │ │ │                                │ │                                     │
       │ │ │ 5xx, network error, or (with   │ │                                     │
       │ │ │ #4170) 60s network timeout:    │ │                                     │
       │ │ │ schedule auto-retry with       │ │                                     │
       │ │ │ backoff.                       │ │ 4xx.                                │
       │ │ └────────────────────────────────┘ └───────────────────────┐             │
       │ │                                                            │             │
       │ │                                                            │    User     │
       │ │ User requested a retry; schedule to run immediately.       ▼    cancels. │
       │ └──────────────────────────────────────────────────────── failed ──────────┘
       │                                                              ▲
       │                                                              │
       │ Too old: message is "better never than late". Too much time  │
       │ from creation or last user-requested retry.                  │
       └──────────────────────────────────────────────────────────────┘

@sevmonster
Copy link

I just had a situation where a user sending 4 messages turned into 20 due to a connectivity issue with the server... I assume that #5525 is also related to this.

chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this issue Jul 10, 2024
…Error

This is our minimal support for zulip#5870 in this legacy codebase.
Supporting it properly with a client-side check of the setting is
more effort than we can spare right now.

Probably more error handling is called for, but zulip#3881 ("Sending
outbox messages is fraught with issues") is complicated and it's
probably best to leave it be.

Fixes: zulip#5870
chrisbobbe added a commit to chrisbobbe/zulip-mobile that referenced this issue Jul 10, 2024
…Error

This is our minimal support for zulip#5870 in this legacy codebase.
Supporting it properly with a client-side check of the setting is
more effort than we can spare right now.

Probably more error handling is called for in general (like for
network or server issues), but zulip#3881 ("Sending outbox messages is
fraught with issues") is complicated and it's probably best to leave
it be.

Fixes: zulip#5870
gnprice pushed a commit to chrisbobbe/zulip-mobile that referenced this issue Jul 10, 2024
…Error

This is our minimal support for zulip#5870 in this legacy codebase.
Supporting it properly with a client-side check of the setting is
more effort than we can spare here, because it requires implementing
the group-based permissions system.

Probably more error handling is called for in general (like for
network or server issues), but zulip#3881 ("Sending outbox messages is
fraught with issues") is complicated and it's probably best to leave
it be.

Fixes: zulip#5870
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a-compose/send Compose box, autocomplete, camera/upload, outbox, sending P1 high-priority
Projects
None yet
Development

No branches or pull requests

4 participants