Unexpected TimeTravel in SongBook Studio #272

aboodman · 2020-11-15T23:04:20Z

@elsigh got stuck on his Replicache integration because push and pull were failing with auth errors, but the getDataLayerAuth callback was also not getting called.

I looked at the logs and saw "TimeTravelProhibited" in the console (sadly I don't have a screencap of this -- @elsigh ?)

When we looked at the Pull response we saw that the ClientView response was HTTP 401, and that was correctly getting sent down to client, and the lastMutationID was 0.

But the client had synced before so its base lastMutationID was > 0, thus time travel, and therefore sync was failing before it got to reauthing.

It seems like the only way this can happen is if either the client ID changes on the client somehow, or if the client data gets dropped on the server. It's a mystery how either of these could occur.

The text was updated successfully, but these errors were encountered:

aboodman · 2020-11-15T23:06:07Z

cc @phritz. @elsigh still has the problematic client state alive for us to investigate.

aboodman · 2020-11-15T23:12:21Z

Also, for the record, I remember one other time that @elsigh and I noticed more clientIDs talking to the server than we expected. We though we only had one live client, but we saw two client IDs in Firestore db (or something like that). So there might be a bug in the creation of client IDs.

phritz · 2020-11-16T00:11:28Z

But the client had synced before so its base lastMutationID was > 0, thus time travel, and therefore sync was failing before it got to reauthing.
It seems like the only way this can happen is if either the client ID changes on the client somehow, or if the client data gets dropped on the server. It's a mystery how either of these could occur.

Possibly more likely, or one of the above in combination with: the client has fresher data than the diffserver. Client sends pull request with lastmutationid N and stateid S. Diffserver does clientview request, gets a 401. Diffserver is now going to try to send the client the best diff it has: https://github.com/rocicorp/diff-server/blob/6528d96191aa0f633e2b4df3e8fd9afd78f96702/serve/pull.go#L99. If it can't find the client's state, it does a full sync: https://github.com/rocicorp/diff-server/blob/6528d96191aa0f633e2b4df3e8fd9afd78f96702/db/diff.go#L59. But... if the diffserver's head for this client is empty, it is going to try to sync back to mutationid 0 and the initial server state. Doh. The client rightfully disallows going backwards in time from N to 0 and beginsync errors. When it errors the fact that the client view returned 401 is lost. (This last part is a classic example of why errors should be signaled within the rpc response, rather than outside of it. More grist for the #119 mill.)

This is a bug in the diffserver. The diffserver should refuse to send a patch if its head state lastmutationid is < the client's. This gets tricky though: what to do when they are equal? On one hand: we have to send the patch because the client view could have returned us new state with the same lastmutationid (state changed out of band). On the other hand suppose the client gets (LMID1, SSID1) from the diffserver. Then the client pulls new state (LMID1, SSID2) from the diffserver. Something happens in the diffserver where it forgets about (LMID1, SSID2). Client pulls again; the diffserver thinks (LMID1, SSID1) is head and so sends a diff to SSID1. Client has gone backwards. We happen to know that this kind of failure is unlikely in the diffserver so for now we can ignore it and always return the state the diffserver has if last mutation id is >= what the client has. But we should address it once the current dust settles.

I am going to timebox 45m to try to fix the diffserver issue rn on my personal laptop, will see how that goes.

BTW aaron as you know logging is hardcoded to info:

repc/src/wasm.rs

Line 28 in d2e03db

if let Err(e) = console_log::init_with_level(log::Level::Info) {

. We're not going to get any useful information at that level. Suggest setting to debug and rebuilding. (Assume head of master is in an incompatible state.) And then we really need to get on #243

phritz · 2020-11-16T00:17:35Z

Note: seems like the diffserver needs an explicit way to tell the client nop. I'm going to have to reflect the client's lmid and ssid back to it for now.

please keep this issue open even after the proximate error has been fixed until i can properly spin out followups we need

aboodman · 2020-11-16T00:20:41Z

On Sun, Nov 15, 2020 at 2:11 PM Phritz ***@***.***> wrote: If it can't find the client's state, it does a full sync: https://github.com/rocicorp/diff-server/blob/6528d96191aa0f633e2b4df3e8fd9afd78f96702/db/diff.go#L59. How could this happen? The client has a previous lastMutationID. So it

successfully synced before.

…

But... if the diffserver's head for this client is empty, it is going to try to sync back to mutationid 0 and the initial server state. How could that happen? Where did the head go? Doh. The client rightfully disallows going backwards in time from N to 0 and beginsync errors. When it errors the fact that the client view returned 401 is lost. (This last part is a classic example of why errors should be signaled *within* the rpc response, rather than outside of it. More grist for the #119 <#119> mill.) This is a bug in the diffserver. The diffserver should refuse to send a patch if its head state lastmutationid is < the client's. This gets tricky though: what to do when they are equal? On one hand: we have to send the patch because the client view could have returned us new state with the same lastmutationid (state changed out of band). On the other hand suppose the client gets (LMID1, SSID1) from the diffserver. Then the client pulls new state (LMID1, SSID2) from the diffserver. Something happens in the diffserver where it forgets about (LMID1, SSID2). Client pulls again; the diffserver thinks (LMID1, SSID1) is head and so sends a diff to SSID1. Client has gone backwards. We happen to know that this kind of failure is unlikely in the diffserver so for now we can ignore it and always return the state the diffserver has if last mutation id is >= what the client has. But we should address it once the current dust settles. I am going to timebox 45m to try to fix the diffserver issue rn on my personal laptop, will see how that goes. BTW aaron as you know logging is hardcoded to info: https://github.com/rocicorp/repc/blob/d2e03db36a9a436d32da60abc9a863543384ba18/src/wasm.rs#L28. We're not going to get any useful information at that level. Suggest setting to debug and rebuilding. (Assume head of master is in an incompatible state.) And then we really need to get on #243 <#243> — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#272 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAATUBBMNQ77UZBWAVIQAYDSQBU3ZANCNFSM4TWPTECA> .

phritz · 2020-11-16T01:22:05Z

how could this happen?

because the diffserver treats what the clientview says as the source of truth, and will happily overwrite its head with whatever the clientview says, no matter whether it is sensible or not. so like if diffserver has state with lmid 22 and the client view returns state with lmid 10, the diffserver will happily treat that as canonical. the client will reject it.

plausible explanation for what happened here:

client successfully pushes data with lmid n > 0 to the data layer
client successfully pulls data with lmid n with checksum cdf6acf8 through the diffserver
data layer decides for some reason to start returning lastmutationid 0 at least some of the time
client pulls. clientview returns data with lmid 0 and checksum cdf6acf8. this data is written to diffserver's head. the client rejects the data as a time traveling error, but it is now enshrined in head on the diffserver.
data layer decides it wants auth, goddammit
client attempts to pull. diffserver gets 401 from clientview and computes a diff to head. client and head have same checksum cdf6acf8 so the patch is empty, however head has lastmutationid 0 which it includes in the response. client rejects response as going backwards in time. since it errors we dont get the clientviewinfo which contains the re-auth signal so re-auth never happens.

i'm a little fuzzy on how the client view could return a lmid > 0 at some point and then start returning 0. if that didn't happen, i'm worried because how did the client get a lmid > 0?

clearly, we need to proof the protocol better against data layer mistakes. above i said that the diffserver should not attempt to send the client a diff if the client's lmid is > what the diffserver has. that will require the client to send along its lmid which is both a repc and diffserver change. seems like we maybe should also not let the diffserver go back in time, dont accept a clientview with lmid < what the diffserver has.

medium term we need to have structured errors at dispatch at least so we can signal the 401 regardless of sync success.

phritz · 2020-11-16T01:30:07Z

(read the edited version of my last comment, corrected a few errors)

BTW this issue is not going to be sorted today, or maybe even tomorrow. But the good news is that the problems here only happen if the clientview does crazy stuff. If you nuke the client and reset server lmid to 0 and the server returns the lmid faithfully then replicache should work fine. (until you hit the next bug ;)

aboodman · 2020-11-16T07:36:32Z

Thank you for this analysis, it is amazing.

It is definitely true that @elsigh's data layer went from returning 0 for lmid to returning >0. However, I can't think of any reason it would have flapped back. @elsigh is it possible you deployed a client view that returned zero at some point after the one that returned the correct value? Otherwise I buy this.

phritz · 2020-11-16T21:36:55Z

ok mystery solved: the clientview did return lmid 0 after 43 for some reason (old instance, stale data, dunno). here are followups:

have repc send lastmutationid it its pullrequest and have the diffserver refuse to try to send repc back in time
have the diffserver refuse to accept a clientview that goes back in time
add notes on using noms for debugging diffserver. i think adding them to the diffserver repo is fine even though it is public yes?
implement setloglevel dispatch rpc :) #243
improve diffserver logging to truncate long log lines

for discussion (comments please):

would have been useful to have repc logs. apparently elsigh didn't have 'persist logs' enabled, maybe make it a stronger recommendation? how long are they preserved? might it be worthwhile to find a way to log persistently or to have repc hold a copy of logs (a rotating buffer, eg) and have a way to get the logs out by command?
should we make a simple web interface on the diffserver to show commit history for a given clientid? command line works i guess, but web interface would be easy.
if we added a dataLayerVersion string to the clientviewresponse that the data layer could populate with a version string do we think that customers would use it? we could store it in the commit in the diffserver, could be helpful? i worry they would just not update it.
i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.

aboodman · 2020-11-16T21:42:45Z

here are followups:
...

👊🏽

would have been useful to have repc logs. apparently elsigh didn't have 'persist logs' enabled, maybe make it a stronger recommendation? how long are they preserved? might it be worthwhile to find a way to log persistently or to have repc hold a copy of logs (a rotating buffer, eg) and have a way to get the logs out by command?

Seems worthwhile. Have to be careful to not slow down Replicache itself :-/.

should we make a simple web interface on the diffserver to show commit history for a given clientid? command line works i guess, but web interface would be easy.

Kind of ambivalent on this. I'm fine with just using Noms, but if you'd rather have a webview that's fine with me, you're going to be the main customer.

if we added a dataLayerVersion string to the clientviewresponse that the data layer could populate with a version string do we think that customers would use it? we could store it in the commit in the diffserver, could be helpful? i worry they would just not update it.

-1 from me. I don't think most customers will have such a version available.

i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.

noms ds -d <db-spec>::<client-id>

phritz · 2020-11-17T18:22:14Z

i can imagine different kinds of data layer mistakes a customer could make once their app is deployed with replicache. like eg if they returned a very high lmid by mistake once it would freeze the protocol bc the diffserver (soon) wont go back in time. i think we are going to want a way to blow away the diffserver commit history for a given client id and maybe for an entire account (all its client ids) so they can start fresh.
noms ds -d ::

OK you're right, easy for a given client id. But we couldn't do that for all clients of a given customer because we don't keep that association anywhere. I think we should keep that association both ways (account id => client ids and also clientid => account id). The account service would be the place for that if it was running but for now the diffserver could could keep it in a separate map in noms.

aboodman · 2020-11-17T19:00:26Z

On Tue, Nov 17, 2020 at 8:22 AM Phritz ***@***.***> wrote: OK you're right, easy for a given client id. But we couldn't do that for all clients of a given customer because we don't keep that association anywhere. noms ds <customer-db> | xargs -I{} noms ds -d <customer-db>::{}

:)

I think we should keep that association both ways (account id => client ids and also clientid => account id). The account service would be the place for that if it was running but for now the diffserver could could keep it in a separate map in noms.

We don't have an easy way to go from client id -> account id because Noms is non-relational :-/. Actually given just a client ID there's no good way (in just the diff server) to even find it. You need to know the account id too.

…

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#272 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAATUBDX33JHCNCHNT6OKSLSQK5ORANCNFSM4TWPTECA> .

phritz · 2020-11-17T21:08:31Z

We don't have an easy way to go from client id -> account id

We get both in the pull request (client id comes in the body and account id comes in the Authorization header).

phritz · 2020-11-18T19:55:54Z

spun out rocicorp/diff-server#78 to store clientid/account association. we get get to it later.

think we are done here

aboodman changed the title ~~getDataLayerAuth not getting called for some reason~~ Unexpected TimeTravel in SongStudio Nov 15, 2020

aboodman transferred this issue from rocicorp/replicache Nov 15, 2020

aboodman changed the title ~~Unexpected TimeTravel in SongStudio~~ Unexpected TimeTravel in SongBook Studio Nov 15, 2020

phritz self-assigned this Nov 16, 2020

phritz added this to the 2: Repc ready for launch 🤾🏻‍♂️🏹🚀↗️ milestone Nov 16, 2020

phritz mentioned this issue Nov 17, 2020

implement setLogLevel #273

Merged

This was referenced Nov 17, 2020

send lastMutationID in PullRequest #274

Merged

don't allow diffserver to go backwards in time rocicorp/diff-server#77

Merged

store account / client id association rocicorp/diff-server#78

Open

phritz closed this as completed Nov 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected TimeTravel in SongBook Studio #272

Unexpected TimeTravel in SongBook Studio #272

aboodman commented Nov 15, 2020 •

edited

Loading

aboodman commented Nov 15, 2020 •

edited

Loading

aboodman commented Nov 15, 2020 •

edited

Loading

phritz commented Nov 16, 2020

phritz commented Nov 16, 2020

aboodman commented Nov 16, 2020 via email

phritz commented Nov 16, 2020 •

edited

Loading

phritz commented Nov 16, 2020

aboodman commented Nov 16, 2020

phritz commented Nov 16, 2020 •

edited

Loading

aboodman commented Nov 16, 2020

phritz commented Nov 17, 2020

aboodman commented Nov 17, 2020 via email

phritz commented Nov 17, 2020

phritz commented Nov 18, 2020

Unexpected TimeTravel in SongBook Studio #272

Unexpected TimeTravel in SongBook Studio #272

Comments

aboodman commented Nov 15, 2020 • edited Loading

aboodman commented Nov 15, 2020 • edited Loading

aboodman commented Nov 15, 2020 • edited Loading

phritz commented Nov 16, 2020

phritz commented Nov 16, 2020

aboodman commented Nov 16, 2020 via email

phritz commented Nov 16, 2020 • edited Loading

phritz commented Nov 16, 2020

aboodman commented Nov 16, 2020

phritz commented Nov 16, 2020 • edited Loading

aboodman commented Nov 16, 2020

phritz commented Nov 17, 2020

aboodman commented Nov 17, 2020 via email

phritz commented Nov 17, 2020

phritz commented Nov 18, 2020

aboodman commented Nov 15, 2020 •

edited

Loading

aboodman commented Nov 15, 2020 •

edited

Loading

aboodman commented Nov 15, 2020 •

edited

Loading

phritz commented Nov 16, 2020 •

edited

Loading

phritz commented Nov 16, 2020 •

edited

Loading