Skip to content
This repository has been archived by the owner on Sep 3, 2021. It is now read-only.

Unexpected TimeTravel in SongBook Studio #272

Closed
aboodman opened this issue Nov 15, 2020 · 14 comments
Closed

Unexpected TimeTravel in SongBook Studio #272

aboodman opened this issue Nov 15, 2020 · 14 comments
Assignees

Comments

@aboodman
Copy link
Contributor

aboodman commented Nov 15, 2020

@elsigh got stuck on his Replicache integration because push and pull were failing with auth errors, but the getDataLayerAuth callback was also not getting called.

I looked at the logs and saw "TimeTravelProhibited" in the console (sadly I don't have a screencap of this -- @elsigh ?)

When we looked at the Pull response we saw that the ClientView response was HTTP 401, and that was correctly getting sent down to client, and the lastMutationID was 0.

But the client had synced before so its base lastMutationID was > 0, thus time travel, and therefore sync was failing before it got to reauthing.

It seems like the only way this can happen is if either the client ID changes on the client somehow, or if the client data gets dropped on the server. It's a mystery how either of these could occur.

@aboodman
Copy link
Contributor Author

aboodman commented Nov 15, 2020

cc @phritz. @elsigh still has the problematic client state alive for us to investigate.

@aboodman aboodman changed the title getDataLayerAuth not getting called for some reason Unexpected TimeTravel in SongStudio Nov 15, 2020
@aboodman
Copy link
Contributor Author

aboodman commented Nov 15, 2020

Also, for the record, I remember one other time that @elsigh and I noticed more clientIDs talking to the server than we expected. We though we only had one live client, but we saw two client IDs in Firestore db (or something like that). So there might be a bug in the creation of client IDs.

@aboodman aboodman transferred this issue from rocicorp/replicache Nov 15, 2020
@aboodman aboodman changed the title Unexpected TimeTravel in SongStudio Unexpected TimeTravel in SongBook Studio Nov 15, 2020
@phritz
Copy link
Contributor

phritz commented Nov 16, 2020

But the client had synced before so its base lastMutationID was > 0, thus time travel, and therefore sync was failing before it got to reauthing.
It seems like the only way this can happen is if either the client ID changes on the client somehow, or if the client data gets dropped on the server. It's a mystery how either of these could occur.

Possibly more likely, or one of the above in combination with: the client has fresher data than the diffserver. Client sends pull request with lastmutationid N and stateid S. Diffserver does clientview request, gets a 401. Diffserver is now going to try to send the client the best diff it has: https://github.com/rocicorp/diff-server/blob/6528d96191aa0f633e2b4df3e8fd9afd78f96702/serve/pull.go#L99. If it can't find the client's state, it does a full sync: https://github.com/rocicorp/diff-server/blob/6528d96191aa0f633e2b4df3e8fd9afd78f96702/db/diff.go#L59. But... if the diffserver's head for this client is empty, it is going to try to sync back to mutationid 0 and the initial server state. Doh. The client rightfully disallows going backwards in time from N to 0 and beginsync errors. When it errors the fact that the client view returned 401 is lost. (This last part is a classic example of why errors should be signaled within the rpc response, rather than outside of it. More grist for the #119 mill.)

This is a bug in the diffserver. The diffserver should refuse to send a patch if its head state lastmutationid is < the client's. This gets tricky though: what to do when they are equal? On one hand: we have to send the patch because the client view could have returned us new state with the same lastmutationid (state changed out of band). On the other hand suppose the client gets (LMID1, SSID1) from the diffserver. Then the client pulls new state (LMID1, SSID2) from the diffserver. Something happens in the diffserver where it forgets about (LMID1, SSID2). Client pulls again; the diffserver thinks (LMID1, SSID1) is head and so sends a diff to SSID1. Client has gone backwards. We happen to know that this kind of failure is unlikely in the diffserver so for now we can ignore it and always return the state the diffserver has if last mutation id is >= what the client has. But we should address it once the current dust settles.

I am going to timebox 45m to try to fix the diffserver issue rn on my personal laptop, will see how that goes.

BTW aaron as you know logging is hardcoded to info:

if let Err(e) = console_log::init_with_level(log::Level::Info) {
. We're not going to get any useful information at that level. Suggest setting to debug and rebuilding. (Assume head of master is in an incompatible state.) And then we really need to get on #243

@phritz
Copy link
Contributor

phritz commented Nov 16, 2020

Note: seems like the diffserver needs an explicit way to tell the client nop. I'm going to have to reflect the client's lmid and ssid back to it for now.

please keep this issue open even after the proximate error has been fixed until i can properly spin out followups we need

@aboodman
Copy link
Contributor Author

aboodman commented Nov 16, 2020 via email

@phritz
Copy link
Contributor

phritz commented Nov 16, 2020

how could this happen?

because the diffserver treats what the clientview says as the source of truth, and will happily overwrite its head with whatever the clientview says, no matter whether it is sensible or not. so like if diffserver has state with lmid 22 and the client view returns state with lmid 10, the diffserver will happily treat that as canonical. the client will reject it.

plausible explanation for what happened here:

  1. client successfully pushes data with lmid n > 0 to the data layer
  2. client successfully pulls data with lmid n with checksum cdf6acf8 through the diffserver
  3. data layer decides for some reason to start returning lastmutationid 0 at least some of the time
  4. client pulls. clientview returns data with lmid 0 and checksum cdf6acf8. this data is written to diffserver's head. the client rejects the data as a time traveling error, but it is now enshrined in head on the diffserver.
  5. data layer decides it wants auth, goddammit
  6. client attempts to pull. diffserver gets 401 from clientview and computes a diff to head. client and head have same checksum cdf6acf8 so the patch is empty, however head has lastmutationid 0 which it includes in the response. client rejects response as going backwards in time. since it errors we dont get the clientviewinfo which contains the re-auth signal so re-auth never happens.

i'm a little fuzzy on how the client view could return a lmid > 0 at some point and then start returning 0. if that didn't happen, i'm worried because how did the client get a lmid > 0?

clearly, we need to proof the protocol better against data layer mistakes. above i said that the diffserver should not attempt to send the client a diff if the client's lmid is > what the diffserver has. that will require the client to send along its lmid which is both a repc and diffserver change. seems like we maybe should also not let the diffserver go back in time, dont accept a clientview with lmid < what the diffserver has.

medium term we need to have structured errors at dispatch at least so we can signal the 401 regardless of sync success.

@phritz
Copy link
Contributor

phritz commented Nov 16, 2020

(read the edited version of my last comment, corrected a few errors)

BTW this issue is not going to be sorted today, or maybe even tomorrow. But the good news is that the problems here only happen if the clientview does crazy stuff. If you nuke the client and reset server lmid to 0 and the server returns the lmid faithfully then replicache should work fine. (until you hit the next bug ;)

@phritz phritz self-assigned this Nov 16, 2020
@aboodman
Copy link
Contributor Author

Thank you for this analysis, it is amazing.

It is definitely true that @elsigh's data layer went from returning 0 for lmid to returning >0. However, I can't think of any reason it would have flapped back. @elsigh is it possible you deployed a client view that returned zero at some point after the one that returned the correct value? Otherwise I buy this.

@phritz
Copy link
Contributor

phritz commented Nov 16, 2020

ok mystery solved: the clientview did return lmid 0 after 43 for some reason (old instance, stale data, dunno). here are followups:

  • have repc send lastmutationid it its pullrequest and have the diffserver refuse to try to send repc back in time
  • have the diffserver refuse to accept a clientview that goes back in time
  • add notes on using noms for debugging diffserver. i think adding them to the diffserver repo is fine even though it is public yes?
  • implement setloglevel dispatch rpc :) #243
  • improve diffserver logging to truncate long log lines

for discussion (comments please):

  • would have been useful to have repc logs. apparently elsigh didn't have 'persist logs' enabled, maybe make it a stronger recommendation? how long are they preserved? might it be worthwhile to find a way to log persistently or to have repc hold a copy of logs (a rotating buffer, eg) and have a way to get the logs out by command?
  • should we make a simple web interface on the diffserver to show commit history for a given clientid? command line works i guess, but web interface would be easy.
  • if we added a dataLayerVersion string to the clientviewresponse that the data layer could populate with a version string do we think that customers would use it? we could store it in the commit in the diffserver, could be helpful? i worry they would just not update it.
  • i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.

@aboodman
Copy link
Contributor Author

here are followups:
...

👊🏽

would have been useful to have repc logs. apparently elsigh didn't have 'persist logs' enabled, maybe make it a stronger recommendation? how long are they preserved? might it be worthwhile to find a way to log persistently or to have repc hold a copy of logs (a rotating buffer, eg) and have a way to get the logs out by command?

Seems worthwhile. Have to be careful to not slow down Replicache itself :-/.

should we make a simple web interface on the diffserver to show commit history for a given clientid? command line works i guess, but web interface would be easy.

Kind of ambivalent on this. I'm fine with just using Noms, but if you'd rather have a webview that's fine with me, you're going to be the main customer.

if we added a dataLayerVersion string to the clientviewresponse that the data layer could populate with a version string do we think that customers would use it? we could store it in the commit in the diffserver, could be helpful? i worry they would just not update it.

-1 from me. I don't think most customers will have such a version available.

i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.

noms ds -d <db-spec>::<client-id>

@phritz
Copy link
Contributor

phritz commented Nov 17, 2020

i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.
noms ds -d ::

OK you're right, easy for a given client id. But we couldn't do that for all clients of a given customer because we don't keep that association anywhere. I think we should keep that association both ways (account id => client ids and also clientid => account id). The account service would be the place for that if it was running but for now the diffserver could could keep it in a separate map in noms.

@aboodman
Copy link
Contributor Author

aboodman commented Nov 17, 2020 via email

@phritz
Copy link
Contributor

phritz commented Nov 17, 2020

We don't have an easy way to go from client id -> account id

We get both in the pull request (client id comes in the body and account id comes in the Authorization header).

@phritz
Copy link
Contributor

phritz commented Nov 18, 2020

spun out rocicorp/diff-server#78 to store clientid/account association. we get get to it later.

think we are done here

@phritz phritz closed this as completed Nov 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants