
%ames reload on ~zod didn't sync to network #501

Closed

belisarius222 opened this issue Dec 14, 2017 · 16 comments
Labels: bug · cause known (root cause of the issue is clear) · %clay · %gall · reproduction known (well-specified repro case)

Comments

@belisarius222 (Contributor)

Trying a |reset now to force the issue, but this feels broken. Not sure what went wrong here.

@ohAitch (Contributor) commented Dec 14, 2017 via email

@belisarius222 (Contributor, Author)

Also, it should be noted that performing a |reset on ~zod will not actually cause anything to sync to other ships. Only Clay changes will cause that to happen, and |reset doesn't modify Clay.

@Fang- (Member) commented Dec 14, 2017

If talk's just printing a stacktrace pointing to line 461, then that is a known issue. Shouldn't actually break anything though.

Didn't see this stack trace in ~zod's scrollback, and its talk seems to be working fine (it can send messages to its own inbox and have them show up), so I'm not sure what's keeping the sync from going through. Will investigate further.

@Fang- (Member) commented Dec 14, 2017

It seems ~marzod and ~samzod got the update. (I can type ;sources %something into their talks.)
|merge-ing their %kids desk into an old comet applies the update just fine. I'm assuming their planets got the update as well.

~wanzod and ~binzod have not gotten the update, and are both showing the mentioned stacktrace along with the following:

  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
crude: punted

I'll get to work on finding a reproduction case and fixing that talk error.

@Fang- (Member) commented Dec 18, 2017

Breaking syncs in four easy steps:

  1. Boot up a fakezod and accompanying star. Latest arvo is fine.
  2. On the star, send a message. Default audience (inbox) is fine.
  3. On the star, run :talk 'screw' in dojo. This should print 'screwing things up...', and it will make the change in the next step produce a runtime error because of an out-of-bounds snag.
  4. On the fakezod, make the following change to /app/talk.hoon. In ++prep, on or around line 123,

  [~ ..prep(+<+ u.old)]

needs to become:

  :_  ..prep(+<+ u.old)
  :~  [0 %pull /server/client server ~]
      [0 %pull /server/inbox server ~]
      peer-client
      peer-inbox
  ==

Then watch as the fakezod pushes the change to the star, and the star spits some errors. Note how changing a file on fakezod now no longer pushes it down to the star.

For completeness, the errors it spits out (excluding the talk runtime error):

  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
crude: punted
sync succeeded from %base on ~palzod to %kids

This isn't exactly the same as what happened on the stars. The error messages were slightly different there (%c %writ instead of %g %unto, plus %not-actually-merging), and the stars were left in a state where |syncs didn't give the usual 3-item list output.
Since this also stops syncs from working properly, though, I figured it might be a case worth investigating further.

It may be interesting to note that running |merge %base ~zod %kids on the star causes an %already-merging-from-somewhere error. Running |unsync %base ~zod %kids and then |sync-ing it again causes an endless stream of those errors.

@Fang- (Member) commented Dec 18, 2017

And to get the printfs that also appeared on the stars:

  1. Do the above
  2. :talk 'rebuild' from dojo to resolve the talk stacktrace
  3. |cancel %base to cancel the merge that apparently got stuck
  4. Wait for syncing to happen. Should print:
sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
crude: punted
sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]

@belisarius222 (Contributor, Author)

Should we separate the file-change notification into a new event? That way the sync would still succeed; it would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because its result caused an error in a different part of the system.

@ohAitch (Contributor) commented Dec 19, 2017 via email

@belisarius222 (Contributor, Author)

Ok, let's see if I understand you correctly:

In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would fail. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child syncing to succeed even if it causes userspace code to fail.

Did I get that right? If so, could you explain your reasoning behind wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync can happen after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, it could read from both Clay and the other part of the system that's supposed to match it, and see the two in an inconsistent state.

For example, let's say there was an app that kept as part of its state a tally of the number of files in a Clay desk. It's subscribed to notifications on this desk so it can rerun this tally whenever the desk changes. If a sync succeeds that changes the number of files in a desk, then before the app gets notified of the sync, some other event could run that accesses the app's tally and also accesses Clay. In that case, the tally wouldn't match the number of files in Clay.

It's a contrived example, so I'm not sure if this is actually something to worry about.
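
To make that ordering concrete, here is a minimal sketch of the worry, modeled in plain Python rather than Hoon; the event names and state layout are invented purely for illustration. The point is the window between the merge event persisting Clay's change and the notification event updating the app:

  from collections import deque

  state = {
      "clay_files": 3,   # files in the desk, as Clay sees them
      "app_tally": 3,    # the app's cached count, refreshed on notification
  }
  queue = deque()

  def merge_event():
      """The sync lands: Clay's new state is persisted."""
      state["clay_files"] = 5
      # The subscriber notification is scheduled as a separate, later event.
      queue.append(notify_event)

  def intruding_event():
      """Some unrelated event that reads both Clay and the app's tally."""
      print("clay:", state["clay_files"], "tally:", state["app_tally"])
      # prints "clay: 5 tally: 3", i.e. the inconsistent window

  def notify_event():
      """The app hears about the change and recomputes its tally."""
      state["app_tally"] = state["clay_files"]

  queue.extend([merge_event, intruding_event])
  while queue:
      queue.popleft()()

  print("after all events:", state)   # consistent again

Once the notification event runs, the two counts agree again; the question is whether anything harmful can run inside that window.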

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@ohAitch (Contributor) commented Dec 19, 2017 via email

@belisarius222 (Contributor, Author)

My worry is about the case where the sync succeeds, but another (previously scheduled) event gets run before the event that runs the notifications of the Clay change. At that point, the Clay change itself has been persisted, but not the effects of the Clay change. That wouldn't necessarily cause a nock error in the intruding event; it might just do something subtly wrong because it expected the Gall state to be up to date with the Clay state.

Arguably, the guarantee that all callbacks for a given Clay revision have already run at any given moment is simply one the system doesn't provide, so we could mention it in the docs as a gotcha. Unless that would break something ...

Anton says because of this potentially out-of-sync state, we should consider the callbacks to be part of the merge transaction; if they fail, the merge fails. This is a coherent argument, but I'm not so sure it's the best approach in practice, because it's possible for a sync to cause an error that's only tenuously related to the initial file merge. It could be thought of as a matter of priority: if we have to pick either the app updating successfully, or the merge succeeding, which do we choose? In the case of stars, it's definitely annoying that some pesky userspace code could clog up network-wide updates.

The reason to have transactional semantics is to avoid inconsistent states. For the past several days, @Fang- and I have been struggling to push updates to a network that's in an inconsistent state -- it's not about to corrupt data, but different machines are running different code, even though ~zod thinks its syncing job is done. The network is stuck mid-sync, in such a way that ~zod is now incapable of saving its children by pushing down new code. Since we use filesystem syncing to rescue children, this suggests to me either that the success of filesystem syncs should take precedence over app code running successfully, or that we need an extra layer of transaction around syncing, so that ~zod's push to its %kids desk fails when the onward sync to another machine fails. The latter feels brittle to me, though, so I think we'd be better off limiting the filesystem syncing transaction so that userspace errors don't stop it halfway.

If there's a better way to think about this, I'm open to ideas.

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@Fang- (Member) commented Dec 20, 2017

I'm going to be picking this up. Just discussed with @belisarius222 to make sure I understand everything correctly. For the record, we settled on simply having Clay set a Behn timer for itself, zero seconds in the future, whenever it wants to send out file-change notifications, and then sending those notifications from the corresponding ++wake instead.
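
As a rough model of that plan (plain Python, not the actual clay.hoon change; all names here are illustrative), the merge commits in its own event, the timer fires as a separate event, and a crashing subscriber only kills the notification event:

  from collections import deque

  clay = {"files": 3, "committed": False}
  events = deque()          # the main event queue
  timers = deque()          # stands in for behn: callbacks due "now"

  def merge_event():
      """Commit the merge; leave notification to a later event."""
      clay["files"] = 5
      clay["committed"] = True
      # Instead of notifying subscribers inline, set a zero-second timer.
      timers.append(wake_event)

  def wake_event():
      """++wake: deliver the file-change notifications from here."""
      notify_subscriber()

  def notify_subscriber():
      """Suppose the subscribing app (e.g. talk) chokes on the update."""
      raise RuntimeError("subscriber crashed while handling the change")

  events.append(merge_event)
  while events or timers:
      nxt = events.popleft() if events else timers.popleft()
      try:
          nxt()
      except RuntimeError as err:
          # Only the notification event fails; the merge already committed.
          print("event failed:", err)

  print("merge committed:", clay["committed"])   # True: the sync still landed

The trade-off is the one discussed above: the subscriber's failure is isolated, at the cost of the window in which Gall state lags the committed Clay state.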

The file tally example @belisarius222 mentioned should probably be documented as a gotcha.

@Fang- (Member) commented Feb 9, 2018

See #611 for further discussion of a related issue.
