
%ames reload on ~zod didn't sync to network #501

Closed

belisarius222 opened this issue Dec 14, 2017 · 16 comments
Labels: bug · cause known (root cause of the issue is clear) · %clay · %gall · reproduction known (well-specified repro case)

Comments

@belisarius222 (Contributor)

Trying a |reset now to force the issue, but this feels broken. Not sure what went wrong here.

@ohAitch (Contributor) commented Dec 14, 2017 via email

@belisarius222 (Contributor, Author)

Also, it should be noted that performing a |reset on ~zod will not actually cause anything to sync to other ships. Only Clay changes will cause that to happen, and |reset doesn't modify Clay.

@Fang- (Member) commented Dec 14, 2017

If talk's just printing a stacktrace pointing to line 461, then that is a known issue. Shouldn't actually break anything though.

Didn't see this stack trace in ~zod's scrollback, and its talk seems to be working fine (it can send messages to its own inbox and have them show up), so I'm not sure what's keeping the sync from going through. Will investigate further.

@Fang- (Member) commented Dec 14, 2017

It seems ~marzod and ~samzod got the update. (I can type ;sources %something into their talks.)
|merge-ing their %kids desk into an old comet applies the update just fine. I'm assuming their planets got the update as well.

~wanzod and ~binzod have not gotten the update, and are both showing the mentioned stacktrace along with the following:

  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~wanzod/use/talk/~wanzod/out/hall/server/inbox]
crude: punted

I'll get to work on finding a reproduction case and fixing that talk error.

@Fang- (Member) commented Dec 18, 2017

Breaking syncs in four easy steps:

  1. Boot up a fakezod and accompanying star. Latest arvo is fine.
  2. On the star, send a message. Default audience (inbox) is fine.
  3. On the star, run :talk 'screw' in dojo. This should print 'screwing things up...', and it will make the change in the next step produce a runtime error because of an out-of-bounds snag.
  4. On the fakezod, make the following change to /app/talk.hoon. In ++prep, on or around line 123,

  [~ ..prep(+<+ u.old)]

needs to become:

  :_  ..prep(+<+ u.old)
  :~  [0 %pull /server/client server ~]
      [0 %pull /server/inbox server ~]
      peer-client
      peer-inbox
  ==

Then watch as the fakezod pushes the change to the star, and the star spits some errors. Note how changing a file on fakezod now no longer pushes it down to the star.

For completeness, the errors it spits out (excluding the talk runtime error):

  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %g %unto ~]
  [%gall-take /~palzod/use/talk/~palzod/out/hall/server/inbox]
crude: punted
sync succeeded from %base on ~palzod to %kids

This isn't exactly the same as what happened on the stars. The error messages were slightly different there (%c %writ instead of %g %unto, plus %not-actually-merging), and the stars were left in a state where |syncs didn't give the usual 3-item list output.
Since this also stops syncs from working properly, though, I figured it might be a case worth investigating further.

It may be interesting to note that running |merge %base ~zod %kids on the star causes an %already-merging-from-somewhere error. Running |unsync %base ~zod %kids and then |sync-ing it again causes an endless stream of those errors.

@Fang- (Member) commented Dec 18, 2017

And to get the printfs that also appeared on the stars:

  1. Do the above
  2. :talk 'rebuild' from dojo to resolve the talk stacktrace
  3. |cancel %base to cancel the merge that apparently got stuck
  4. Wait for syncing to happen. Should print:
sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]
crude: all delivery failed!
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
crude: punted
sync succeeded from %base on ~palzod to %kids
sync succeeded from %base on ~palzod to %home
sync succeeded from %kids on ~zod to %base
  [%swim-take-vane %c %writ ~]
  %not-actually-merging
[%fail-ack 0]

@belisarius222 (Contributor, Author)

Should we separate the file-change notification into a new event? That way the sync would still succeed; it would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because its result caused an error in a different part of the system.

@ohAitch (Contributor) commented Dec 19, 2017 via email

@belisarius222 (Contributor, Author)

Ok, let's see if I understand you correctly:

In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would fail. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child syncing to succeed even if it causes userspace code to fail.

Did I get that right? If so, could you explain your reasoning behind wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync can happen after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, it could read from both Clay and the other part of the system that's supposed to match it, and see the two in an inconsistent state.

For example, let's say there was an app that kept as part of its state a tally of the number of files in a Clay desk. It's subscribed to notifications on this desk so it can rerun this tally whenever the desk changes. If a sync succeeds that changes the number of files in a desk, then before the app gets notified of the sync, some other event could run that accesses the app's tally and also accesses Clay. In that case, the tally wouldn't match the number of files in Clay.

It's a contrived example, so I'm not sure if this is actually something to worry about.
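
To make that ordering concrete, here is a minimal sketch of the worry, modeled in plain Python rather than Hoon; the event names and state layout are invented purely for illustration. The point is the window between the merge event persisting Clay's change and the notification event updating the app:

  from collections import deque

  state = {
      "clay_files": 3,   # files in the desk, as Clay sees them
      "app_tally": 3,    # the app's cached count, refreshed on notification
  }
  queue = deque()

  def merge_event():
      """The sync lands: Clay's new state is persisted."""
      state["clay_files"] = 5
      # The subscriber notification is scheduled as a separate, later event.
      queue.append(notify_event)

  def intruding_event():
      """Some unrelated event that reads both Clay and the app's tally."""
      print("clay:", state["clay_files"], "tally:", state["app_tally"])
      # prints "clay: 5 tally: 3", i.e. the inconsistent window

  def notify_event():
      """The app hears about the change and recomputes its tally."""
      state["app_tally"] = state["clay_files"]

  queue.extend([merge_event, intruding_event])
  while queue:
      queue.popleft()()

  print("after all events:", state)   # consistent again

Once the notification event runs, the two counts agree again; the question is whether anything harmful can run inside that window.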

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@ohAitch (Contributor) commented Dec 19, 2017 via email

@belisarius222 (Contributor, Author)

My worry is about the case where the sync succeeds, but another (previously scheduled) event gets run before the event that runs the notifications of the Clay change. At that point, the Clay change itself has been persisted, but not the effects of the Clay change. That wouldn't necessarily cause a nock error in the intruding event; it might just do something subtly wrong because it expected the Gall state to be up to date with the Clay state.

Arguably, the guarantee that all callbacks for a given Clay revision have already run at any given moment is simply one the system doesn't provide, so we could mention it in the docs as a gotcha. Unless that would break something ...

Anton says because of this potentially out-of-sync state, we should consider the callbacks to be part of the merge transaction; if they fail, the merge fails. This is a coherent argument, but I'm not so sure it's the best approach in practice, because it's possible for a sync to cause an error that's only tenuously related to the initial file merge. It could be thought of as a matter of priority: if we have to pick either the app updating successfully, or the merge succeeding, which do we choose? In the case of stars, it's definitely annoying that some pesky userspace code could clog up network-wide updates.

The reason to have transactional semantics is to avoid inconsistent states. For the past several days, @Fang- and I have been struggling to push updates to a network that's in an inconsistent state -- it's not about to corrupt data, but different machines are running different code, even though ~zod thinks its syncing job is done. The network is stuck mid-sync, in such a way that ~zod is now incapable of saving its children by pushing down new code. Since we use filesystem syncing to rescue children, this suggests to me either that the success of filesystem syncs should take precedence over app code running successfully, or that we need an extra layer of transaction around syncing, so that ~zod's push to its %kids desk fails when the onward sync to another machine fails. The latter feels brittle to me, though, so I think we'd be better off limiting the filesystem syncing transaction so that userspace errors don't stop it halfway.

If there's a better way to think about this, I'm open to ideas.

@cgyarvin (Contributor) commented Dec 19, 2017 via email

@Fang- (Member) commented Dec 20, 2017

I'm going to be picking this up. Just discussed with @belisarius222 to make sure I understand everything correctly. For the record, we settled on simply having Clay set a Behn timer for itself, zero seconds in the future, whenever it wants to send out file-change notifications, and then sending those notifications from the corresponding ++wake instead.
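
As a rough model of that plan (plain Python, not the actual clay.hoon change; all names here are illustrative), the merge commits in its own event, the timer fires as a separate event, and a crashing subscriber only kills the notification event:

  from collections import deque

  clay = {"files": 3, "committed": False}
  events = deque()          # the main event queue
  timers = deque()          # stands in for behn: callbacks due "now"

  def merge_event():
      """Commit the merge; leave notification to a later event."""
      clay["files"] = 5
      clay["committed"] = True
      # Instead of notifying subscribers inline, set a zero-second timer.
      timers.append(wake_event)

  def wake_event():
      """++wake: deliver the file-change notifications from here."""
      notify_subscriber()

  def notify_subscriber():
      """Suppose the subscribing app (e.g. talk) chokes on the update."""
      raise RuntimeError("subscriber crashed while handling the change")

  events.append(merge_event)
  while events or timers:
      nxt = events.popleft() if events else timers.popleft()
      try:
          nxt()
      except RuntimeError as err:
          # Only the notification event fails; the merge already committed.
          print("event failed:", err)

  print("merge committed:", clay["committed"])   # True: the sync still landed

The trade-off is the one discussed above: the subscriber's failure is isolated, at the cost of the window in which Gall state lags the committed Clay state.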

The file tally example @belisarius222 mentioned should probably be documented as a gotcha.

@Fang- (Member) commented Feb 9, 2018

See #611 for further discussion of a related issue.
