Get and maintain a handy optimized data structure of recent PM conversations #4392

gnprice · 2021-01-08T04:46:46Z

This fixes #3133 -- making a lot more PM conversations visible on the "PM conversations" screen -- by using the nice recent_private_conversations data structure that the server introduced in version 2.1, precisely in order to support fixing #3133.

This PR replaces #3535, and #4104 which was partly based on a rebase of #3535. There's one commit in here which derives from those two PRs; but upon rebasing #4104 and studying it, I decided it'd be better to take a different approach for most of what it needs to do.

Most fundamentally: instead of maintaining the data in the exact same format as it came over the wire as JSON, we arrange it on receipt into a data structure that's optimized for efficiently maintaining and using. The "store it like it came over the wire" approach is the one we use for a lot of the data in the app... but the needs of an in-memory data structure for efficiently updating data, and efficiently finding things in it, are very different from the needs of a simple, highly portable, wire format like JSON. Our #3339 and #3949 already describe how we should be using more appropriate data structures for a lot of our data.

This data structure is another example of that -- and as we're adding it now from scratch, it makes a perfect occasion to go straight for a data structure designed for the purpose, without any of the additional considerations or work required to migrate something we already have and use.

In doing so, we make good use of our new (since #4201) support for Immutable.js data structures in our Redux state: without Immutable (and short of basically reimplementing large parts of Immutable ourselves), it would have been quite difficult to make these data structures as efficient as they should be in a way compatible with Redux's requirement that the state not be mutated.

This also builds on #4385 yesterday, and on the long series of work sorting out the different ways we identify PM conversations (#4035).

Fixes: #3133

chrisbobbe

Thanks! I'm eager to get #3133 fixed, and I'm glad to see fetchPrivateMessages marked for future deletion; it will make doInitialFetch cleaner. Just a few comments below.

chrisbobbe · 2021-01-08T16:05:08Z

src/message/fetchActions.js

@@ -251,6 +263,7 @@ const fetchPrivateMessages = () => async (dispatch: Dispatch, getState: GetState
      numAfter: 0,
      foundNewest: found_newest,
      foundOldest: found_oldest,
+      ownUserId: getOwnUserId(getState()),


nit (here and in fetchMessages, above): From one perspective, this isn't NFC, right? getOwnUserId will throw with "No server data found" if the user has logged out, which might have happened while the API request was in progress. (Before this commit, that error would not have been thrown.)

Still, I don't think it's a very bad thing to throw here; the alternative is allowing race condition where the result of this message fetch is applied when the account it's meant for is no longer logged in.

Still, I don't think it's a very bad thing to throw here; the alternative is allowing race condition where the result of this message fetch is applied when the account it's meant for is no longer logged in.

Er, well, of course it's bad in that it would (I think) cause a crash—but having corrupted data is also quite bad and possibly harder to debug. 🙂 This is the kind of thing that we hope to work out with logic for cancelling in-progress requests when one logs out.

Hmm true! Yeah, we should really fix this possible case in general by cancelling these requests (#4170), but we haven't yet.

I'll note this in the commit message.

chrisbobbe · 2021-01-08T18:40:11Z

src/pm-conversations/__tests__/pmConversationsModel-test.js

+});
+
+describe('reducer', () => {
+  const initialState = reducer(undefined, ({ type: eg.randString() }: $FlowFixMe));


Interesting, I hadn't thought of this way of getting the initial/previous state, i.e., running a dummy action through the reducer starting from an undefined state.

It looks like we've often just spelled out the initial/previous state literally, although in my current revision of #4390 I take advantage of our new eg.makeMessagesState to set the initial/previous state in the tests for messagesReducer. That approach lets me minimize the diff when we change over to Immutable.Map for state.messages, while making it easier to avoid the Immutable.Map({ 1: 'foo' }) (vs. Immutable.Map([[1, 'foo']])) mistake. I think your approach here would work just as well for both of those things, so I'd be happy to give it a try over there if you'd like. (In that branch there are a few uses of eg.makeMessagesState in other test files, like unreadHuddlesReducer-test, that should probably stay, to get those advantages.)

In 612ec04, I renamed some initialStates to prevState. At first, I might have suggested that we make the same rename here for consistency. But I see that the initialState here really is kind of an initial state; it's never the variable that gets passed directly to the reducer; that variable is called state or someState, with state sometimes getting reassigned as the test progresses. I think that makes sense as a new pattern to try.

I noticed that we lose just a bit of coverage by not calling deepFreeze on state objects we pass to the reducer; in theory, I suppose, the reducer could get away with improperly mutating the object that holds map and sorted.

this way of getting the initial/previous state, i.e., running a dummy action through the reducer starting from an undefined state.

Yep! One nice thing about this is that it exercises the API that Redux already requires a reducer to support -- in fact it's mimicking what Redux itself does to initialize the state on creating the store.

In 612ec04, I renamed some initialStates to prevState. At first, I might have suggested that we make the same rename here for consistency. But I see that the initialState here really is kind of an initial state

Yep. I do mean this as the initial state, in the same sense we use that name in our reducers: a given test case may or may not choose it as the "previous" state for e.g. applying the reducer with some action, but in any case initialState is the one constant state that is the one the given reducer initializes its state to.

In that commit 612ec04, a few of the things called initialState happened to actually equal the reducer's initial state, but most of them didn't; they were all local to a particular test, and what they had in common is that they were the starting state for their respective tests. So prevState was a much better name, to avoid the collision with the reducer's own actual initial state.

It looks like we've often just spelled out the initial/previous state literally, although in my current revision of #4390 I take advantage of our new eg.makeMessagesState to set the initial/previous state in the tests for messagesReducer. That approach lets me minimize the diff when we change over to Immutable.Map for state.messages, while making it easier to avoid the Immutable.Map({ 1: 'foo' }) (vs. Immutable.Map([[1, 'foo']])) mistake.

Yeah. One thing I like about this is that it's a step toward making the tests not care about the internals of how we implement a particular part of the state, treating it more as a black box. That's good for being able to change the data structures; I think it also makes the tests more readable and informative, by having them act at a level that's more like how we want to think about what the app is doing: send events and other actions in, and select meaningful data out.

I think your approach here would work just as well for both of those things, so I'd be happy to give it a try over there if you'd like. (In that branch there are a few uses of eg.makeMessagesState in other test files, like unreadHuddlesReducer-test, that should probably stay, to get those advantages.)

I think in fact a good way to think about eg.makeMessagesState may be as a shorthand for starting from the initial state and iterating through the given messages to add them. It's good to have such a shorthand, because it helps keep the tests short and focused on the interesting parts.

It should be exactly equivalent to that alternate version; for pure black-box testing, it could even be implemented that way. But I think it's fine to have it implemented directly, as it is -- it's simple, probably more efficient, and it's one central piece of code to update when we change the representation so it's not a burden there.

I noticed that we lose just a bit of coverage by not calling deepFreeze on state objects we pass to the reducer; in theory, I suppose, the reducer could get away with improperly mutating the object that holds map and sorted.

Yeah. I'm not sure our use of deepFreeze has ever actually caught a bug -- I think that kind of bug is largely pretty avoidable just by writing the code in the style we're accustomed to writing it in. So the approach I've been inclined to is to use it in central places like exampleData.js, where one call can cover a large swath of tests (see also 45308e7); and to generally skip it in individual test files, to avoid the clutter.

chrisbobbe · 2021-01-08T19:15:14Z

src/pm-conversations/pmConversationsSelectors.js

+        allUsersById,
+        ownUserId,
+      );
+      if (keyRecipients === null) {


if (keyRecipients === null) {

I'm having a little trouble figuring out under which circumstances keyRecipients is null.

That's what pmKeyRecipientsFromIds returns if there's a user ID it can't find in the given allUsersById map. By returning null, it forces (via the type-checker) its caller to deal with it -- either by throwing, or as I've done here by ignoring the conversation and moving on.

I see its jsdoc doesn't mention that, though! I'll add that.

That's what pmKeyRecipientsFromIds returns if there's a user ID it can't find in the given allUsersById map.

True, but I guess I was thinking, why would there be a user ID that can't be found in the given allUsersById map? 😛

Probably shouldn't be possible 🙂 -- there aren't even the scenarios there are for emails, where someone may have changed their email. So probably we should emit a warning to Sentry from pmKeyRecipientsFromIds in this case.

If it does happen, then at least at this callsite of pmKeyRecipientsFromIds it seems pretty clear how best to proceed: ignore that one conversation and carry on. If there weren't a clear reasonable way to proceed, then we'd want to throw instead.

chrisbobbe · 2021-01-08T19:24:43Z

src/pm-conversations/pmConversationsModel.js

+  } else if (prev >= msgId) {
+    // The conversation already has a newer message.  Do nothing.
+    return state;
+    //


I wanted a blank line here so that the three cases would be visually separated; but Prettier wants to push them back together again. This was a way of making space... though not a great way, as the // really diminishes the visual separation.

I'll try just prettier-ignoreing this section.

chrisbobbe · 2021-01-08T19:25:50Z

src/pm-conversations/pmConversationsModel.js

+    } else {
+      // It wasn't the latest.  Just handle the general case.
+      sorted = sorted.delete(i); // linear time, ouch
+      sorted = insertSorted(sorted, map, key, msgId);


insertSorted here is linear in writes here, too, at least in the common case where it does an .insert, I think. Does it equally deserve the "ouch" label at this callsite, then? (Or maybe it's enough that "O(n)" gets a mention in the comments above its definition.)

At least in cases where we do a .delete followed by an .insert, I wonder if it would be just a bit more efficient to do a single .update instead; what do you think?

I suppose that would mean writing code that overlaps with some of what insertSorted does. But then I wonder if insertSorted might be pared back so that it's only responsible for finding the proper index. That would also give its interface the nice feature that it's at worst linear only in reads, not in reads and writes (if that matters).

Yeah, in general both of these lines can take linear time.

The common case, though, should be that the message that just came in is the newest message we've ever seen. In fact, for EVENT_NEW_MESSAGE, if it's ever not the newest then something odd has happened. (This could potentially happen if we go fetch the messages in some narrow, and right around the same time someone sends a message into that narrow, and the fetch sees that message and returns sooner than we get the reply from the event queue to tell us about the message. Or a similar scenario with two different messages sent right around the time we fetch: one of them in the fetch results, and a slightly older one that comes as an event shortly after. But I think that kind of race condition is basically the only way it can happen.) Meanwhile for messages we learn about through REALM_INIT or MESSAGE_FETCH_COMPLETE, if we want to optimize those they probably won't be going through this codepath anyway.

If it is the newest message we've ever seen, then the optimization described at the top of insertSorted applies, and it'll take constant time.

At least in cases where we do a .delete followed by an .insert, I wonder if it would be just a bit more efficient to do a single .update instead; what do you think?

The trouble is that this bit of code's job (the stanza manipulating sorted inside this "The conversation needs to be updated" case) is to rearrange the order of the list's elements. The conversation was at some position in the list, and we need to move it to the new position it belongs at. In general those are two different positions -- which also means we need to change which values appear at all the positions in between. So an .update won't be enough.

It can happen that we end up doing a .delete followed by .insert at the same position. In that case an .update would indeed be enough... but we'd be finding the place where this conversation's key appears, and updating it to write the conversation's key there, i.e. the same value that's already there. So better yet in that case is to do nothing.

That's what this commit toward the end of the branch is about:
16d2d7b recent-pms: Optimize the case of a new message in the latest thread.
It does it for the case where we'd be doing that at position 0, because that's expected to be common. To do it in general would take more logic -- and for this to happen at any other position would mean we're seeing a message that isn't the newest message we've ever seen, which is rare, see above, so it isn't worth adding complexity for.

I'll add some more comments trying to explain all this more than the code currently does.

The trouble is that this bit of code's job (the stanza manipulating sorted inside this "The conversation needs to be updated" case) is to rearrange the order of the list's elements.

Ohh, OK, right.

I'll add some more comments trying to explain all this more than the code currently does.

I think it's pretty well explained as-is; I just missed the mark here, oops. 🙂 Thanks for explaining. Looking back, I think I may have seen the i in insert and the i in insertSorted (totally reasonable to give them the same name), and short-circuited into thinking they'd end up being the same value, when of course they sometimes won't be, as you've said.

chrisbobbe · 2021-01-08T19:43:26Z

src/pm-conversations/pmConversationsModel.js

+  });
+
+  if (i === 0) {
+    return sorted.unshift(key);


This is the case where you're saying it takes O(1) time, right?

I don't yet follow how it's O(1). I would expect .unshift to be O(n): when one new item is inserted at index 0, each later item will have its index increased by one. By contrast, I would expect .push (where the item gets added at the end, not the beginning) to be O(1) because no pre-existing item would have to have its index rewritten.

That's based on this bit from the Immutable.List docs:

Lists are immutable and fully persistent with O(log32 N) gets and sets, and O(1) push and pop.

Lists implement Deque, with efficient addition and removal from both the end (push, pop) and beginning (unshift, shift).

Ah OK, got it. That doesn't get a mention at the entry for .unshift, and I guess I didn't look closely enough at that discussion at the top.

Indeed, that'd be good to clarify in the Immutable docs 🙂

That doesn't get a mention at the entry for .unshift, and I guess I didn't look closely enough at that discussion at the top.

Oh, one hint I just noticed there in the .unshift entry -- this doesn't change that it should be made clearer, but it could be a helpful thing to notice -- is that it says this:

Note: unshift can be used in withMutations.

In general the things that can be used in withMutations are the operations that efficiently act on just a small bit of the collection, with cost like O(1) or O(log N), and the things that can't are those that act on potentially the whole collection or a large swath of it. (I think the idea is that if you're doing one of the latter inside withMutations, then you're getting no benefit from that and might as well do so outside of withMutations, and this serves as a warning in case you thought you were doing something nice and efficient.) So when you see that note on an Immutable method, that's a signal that that's a cheap local operation and not an expensive whole-collection operation.

Note: unshift can be used in withMutations.

Huh, interesting! It makes sense, but I wouldn't have been looking for it if you hadn't mentioned it; I'll keep that in mind. 🙂

We'll start using these shortly, as part of the data structure for `recent_private_conversations`.

We'll use this in a new reducer shortly. There's a case where this can potentially change behavior: the thunk actions that dispatch these actions do so after an `await`, which means that by the time they get to calling `getOwnUserId` (after their HTTP requests have returned) the user may have logged out, and it might throw with "No server data found". This is a case where we already have a worse bug, in that if the user has switched to a *different* account and completed the initial fetch for it by the time this request returns, then we'll be adding these messages to our state to display alongside the messages that belong in that account. (Or try to, anyway: if the user IDs and stream IDs and so on don't exist in the other account's realm, then we'll likely just crash.) We're aware of that bug and intend to fix it, but haven't done the work for it yet. See discussion: zulip#4392 (comment)

…uest. The server added in zulip/zulip#11944 support for providing a summary of the most recent PM conversations that a user was involved in, as part of the `/register` response. This commit adds only the minimum machinery necessary to request this data from the server at `/register` time. [ray: Split original initial commit into two parts, this being the first. Minor changes to Flow types.] [greg: Squashed in a few bits of second commit; updated a comment; shortened commit message.]

This takes the data from `recent_private_conversations` (which we just started requesting in the previous commit) in the initial fetch, and then maintains it as new messages arrive. We make good use of our new support for Immutable.js data structures. Instead of maintaining the data in the exact same format as it came over the wire as JSON, we arrange it on receipt into a data structure that's optimized for efficiently maintaining and using. The organization of the code is slightly different from our usual existing practice: * The state type is defined right in the same file as the reducer which updates it. * Also in that file are some related stateless helper functions: in particular `usersOfKey`, which code elsewhere in the app can use to help interpret the data provided by this model. * The file is named in a general way like "thingModel", rather than our usual "thingReducer", and the reducer is a named rather than default export. I think this is helpful here for keeping together code that logically goes together in that it has shared internal knowledge and invariants, like the format of PmConversationKey strings. I think something similar would be helpful for a lot of our model code (including most of what's in reducers and selectors), but I'm in no rush to do a sweeping refactor; we'll want to experiment first and find the arrangement that works best. This version doesn't put any selector code in this file. The existing selector in the neighboring `pmConversationsSelectors.js` doesn't really belong here, because it mixes together a number of different areas of the app model, notably the unread counts. After some future refactoring to separate that out, there may be a more focused selector that naturally lives here... or it may be that no selector is needed, because it'd just be accessing a property on the state type, and we can simplify by just saying that directly.

In this version, the "modern" implementation is a stub that just invokes the "legacy" implementation. Next, we'll actually write a modern implementation.

We'll want to do the same thing in the upcoming "modern" implementation of this selector (because this is using another data structure unrelated to the one we're modernizing); so factor it out as a function we can call there.

This is only a little shorter in line count; but I think it has a good bit less complexity to take in. It's also doing quite a lot less work: instead of iterating through all the PMs (messages!) we know about, it iterates through just one item per PM *conversation*, in data that we've efficiently maintained as the messages came in. That's the big payoff of this server API feature. Or, relatedly: because this is so much more efficient, we're able to provide much more thorough data. The old way was fed primarily by `fetchPrivateMessages`, where we fetch the latest 100 PMs -- which for someone who uses PMs a lot could leave out conversations as recent as yesterday, and include just a handful of conversations. With current servers, the new way gets initialized from the user's last 1000 PMs. As a result, if you look at the PM-conversations screen with the app before this change and then after it, you'll probably see a lot more conversations. For me on czo, I see about 4x as many, extending far back into the past. Fixes: zulip#3133

This seems likely to be a common case -- probably the majority of the EVENT_NEW_MESSAGE events. We already had it taking constant time (aka O(1) time) thanks to the `i === 0` optimization in `insertSorted`... but we can do better by not mutating `sorted` at all. Also add a test that this optimization is present. In general we can't easily test the performance, or the asymptotic complexity, of our data structures and algorithms... but when the intended behavior is "the new thing is identical to the old thing", that we can.

With the spiffy new recent-pms data structure, this hack is no longer needed!

gnprice · 2021-01-09T00:46:12Z

Thanks for the review! New revision pushed.

chrisbobbe · 2021-01-09T01:12:10Z

Thanks for the revision and replies! Merged.

gnprice added a-data-sync Zulip's event system, event queues, staleness/liveness P1 high-priority labels Jan 8, 2021

gnprice requested a review from chrisbobbe January 8, 2021 04:46

This was referenced Jan 8, 2021

Add recent_private_conversations key to /register. #3535

Closed

Use recent_private_conversations on PMs screen, when available #4104

Closed

chrisbobbe reviewed Jan 8, 2021

View reviewed changes

gnprice and others added 11 commits January 8, 2021 16:03

recipient [nfc]: Document pmKeyRecipientsFromIds's error-handling.

cbcf00c

recipient: Log a warning if pmKeyRecipientsFromIds has to bail.

232444a

replaceRevive: Add support for Immutable.List.

4fb1587

We'll start using these shortly, as part of the data structure for `recent_private_conversations`.

recent-pms [nfc]: Add a version switch in getRecentConversations.

cea4a44

In this version, the "modern" implementation is a stub that just invokes the "legacy" implementation. Next, we'll actually write a modern implementation.

recent-pms: Skip fetchPrivateMessages on recent servers.

02d869b

With the spiffy new recent-pms data structure, this hack is no longer needed!

gnprice force-pushed the pr-recent-pms branch from 41b826c to 02d869b Compare January 9, 2021 00:45

chrisbobbe merged commit 02d869b into zulip:master Jan 9, 2021

gnprice deleted the pr-recent-pms branch January 9, 2021 01:22

gnprice mentioned this pull request Jan 12, 2021

unreads: Prepare for converting to efficient data structures #4400

Merged

This was referenced Jan 27, 2021

Make "unread" model efficient at marking a message as read #4438

Closed

Use good data structures in Redux state, to avoid quadratic work #3949

Open

Consolidate internal details of each of our Redux data structures #3934

Open

WesleyAC mentioned this pull request May 9, 2021

UX : Restarting the mobile apps makes PMs disappear for a second or two #4006

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get and maintain a handy optimized data structure of recent PM conversations #4392

Get and maintain a handy optimized data structure of recent PM conversations #4392

gnprice commented Jan 8, 2021

chrisbobbe left a comment

chrisbobbe Jan 8, 2021 •

edited

Loading

chrisbobbe Jan 8, 2021

chrisbobbe Jan 8, 2021 •

edited

Loading

gnprice Jan 8, 2021

chrisbobbe Jan 8, 2021

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021

gnprice Jan 8, 2021

gnprice Jan 8, 2021

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021 •

edited

Loading

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021

chrisbobbe Jan 8, 2021 •

edited

Loading

gnprice Jan 8, 2021

chrisbobbe Jan 9, 2021 •

edited

Loading

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021

chrisbobbe Jan 8, 2021

gnprice Jan 8, 2021

gnprice Jan 9, 2021

chrisbobbe Jan 9, 2021

gnprice commented Jan 9, 2021

chrisbobbe commented Jan 9, 2021

Get and maintain a handy optimized data structure of recent PM conversations #4392

Get and maintain a handy optimized data structure of recent PM conversations #4392

Conversation

gnprice commented Jan 8, 2021

chrisbobbe left a comment

Choose a reason for hiding this comment

chrisbobbe Jan 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chrisbobbe Jan 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice Jan 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chrisbobbe Jan 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chrisbobbe Jan 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Jan 9, 2021

chrisbobbe commented Jan 9, 2021

chrisbobbe Jan 8, 2021 •

edited

Loading

chrisbobbe Jan 8, 2021 •

edited

Loading

gnprice Jan 8, 2021 •

edited

Loading

chrisbobbe Jan 8, 2021 •

edited

Loading

chrisbobbe Jan 9, 2021 •

edited

Loading