%ames reload on ~zod didn't sync to network #501
Talk seems to be crashing in response to an update hall is sending in
response to the merge(?), thereby preventing the merge from occurring.
@Fang- help
On Wed, Dec 13, 2017 at 6:32 PM, Ted Blackman wrote:
Trying a |reset now to force the issue, but this feels broken. Not sure
what went wrong here.
|
Also, note that running |reset on ~zod will not cause anything to sync to other ships. Only Clay changes trigger a sync, and |reset doesn't modify Clay. |
If talk is just printing a stack trace pointing to line 461, that's a known issue; it shouldn't actually break anything, though. I didn't see this stack trace in ~zod's scrollback, and its talk seems to be working fine (it can send messages to its own inbox and have them show up), so I'm not sure what's keeping the sync from going through. Will investigate further. |
It seems ~marzod and ~samzod got the update. (I can type ~wanzod and ~binzod have not gotten the update, and are both showing the mentioned stacktrace along with the following:
I'll get to work on finding a reproduction case and fixing that talk error. |
Breaking syncs in four easy steps:
in ++prep, on or around line 123, needs to become:
Then watch as the fakezod pushes the change to the star, and the star spits some errors. Note how changing a file on fakezod now no longer pushes it down to the star. For completeness, the errors it spits out (excluding the talk runtime error):
This is not the exact same thing as what happened on the stars. The error messages were slightly different there ( Maybe interesting to note that doing an |
And to get the printfs that also appeared on the stars:
|
Should we move the file-change notifications into a separate event? That way the sync would still succeed; the notifications would just cause the next event to fail, without leaving Clay in a weird state. Conceptually, I'm not sure a filesystem merge should fail just because its result caused an error in a different part of the system. |
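A minimal sketch of the difference between the two options, written in Python rather than Hoon. It rests on a toy model of our own in which every event is all-or-nothing. `ToyKernel`, `merge`, and `broken_subscriber` are hypothetical names and are not Arvo interfaces. When notifications run inline, a crashing subscriber rolls back the merge. When they are emitted as a separate event, the merge commits and only the follow-up event fails.

```python
import copy
from collections import deque


class ToyKernel:
    """Single-threaded event loop where every event is all-or-nothing.

    Hypothetical model for illustration; not Arvo's real interfaces.
    """

    def __init__(self):
        self.state = {"desk": {}}
        self.queue = deque()
        self.failed = []  # names of events whose effects were discarded

    def run(self):
        while self.queue:
            name, handler = self.queue.popleft()
            scratch = copy.deepcopy(self.state)
            emitted = []
            try:
                handler(scratch, emitted.append)
            except Exception:
                # Roll back: drop both the state changes and any emitted events.
                self.failed.append(name)
            else:
                self.state = scratch
                self.queue.extend(emitted)


def broken_subscriber(state):
    # Stands in for userspace code (e.g. talk) crashing on the update.
    raise RuntimeError("userspace crash")


def merge(files, notify_inline):
    def handler(state, emit):
        state["desk"].update(files)
        if notify_inline:
            # Callbacks run inside the merge's own transaction.
            broken_subscriber(state)
        else:
            # Callbacks are deferred to a separate, later event.
            emit(("notify", lambda s, e: broken_subscriber(s)))
    return handler


def simulate(notify_inline):
    kernel = ToyKernel()
    kernel.queue.append(("merge", merge({"app.hoon": "v2"}, notify_inline)))
    kernel.run()
    return kernel.state["desk"], kernel.failed


print(simulate(notify_inline=True))   # ({}, ['merge'])
print(simulate(notify_inline=False))  # ({'app.hoon': 'v2'}, ['notify'])
```

In the deferred case the desk keeps the new file. The failure is still recorded, but it is attributed to the `notify` event rather than to the merge.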
IMO the merge from kids to base should succeed and the merge from base to home should fail, but I guess we already crossed this bridge when we disabled the similar userspace behavior of "non-compiling app source code changes are not accepted".
(In reply to Ted Blackman, Mon, Dec 18, 2017 at 4:15 PM)
|
Ok, let's see if I understand you correctly: In an older version of the system, if you tried to merge in app source code that failed to compile, the merge would fail. That was changed, and now if you merge in app source code that fails to compile, the merge still succeeds. In keeping with this more permissive attitude, it would be consistent to also allow the parent->child sync to succeed even if it causes userspace code to fail. Did I get that right? If so, could you explain your reasoning for wanting the merge from base to home to fail?

I also have a slight worry about potential race conditions if the callbacks for a filesystem sync could run after some other events. What if those callbacks need to update some other part of the Arvo state? If some other event runs between the merge and those callbacks, it could read from both Clay and the other part of the system that was supposed to match, but see them in an inconsistent state.

For example, say there's an app that keeps, as part of its state, a tally of the number of files in a Clay desk. It subscribes to notifications on this desk so it can recompute the tally whenever the desk changes. Suppose a sync succeeds and changes the number of files in the desk. Before the app gets notified, some other event could run that reads both the app's tally and Clay. In that case, the tally wouldn't match the number of files in Clay. It's a contrived example, so I'm not sure if this is actually something to worry about.
|
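The tally race described above can be reproduced in a toy single-threaded event queue. This is a hypothetical Python model and not Gall or Clay code. Here `clay` is a set of file names, the app's state is a counter derived from that set, and `intruder` stands for an event that was scheduled before the merge ran.

```python
from collections import deque


def run_scenario():
    """Toy model of the tally race (all names hypothetical)."""
    clay = {"a.txt", "b.txt"}       # files in the Clay desk
    app = {"tally": len(clay)}      # app state derived from Clay
    seen = []                       # what the intruding event observed
    queue = deque()

    def merge():
        clay.add("c.txt")           # the sync commits immediately...
        queue.append(notify_app)    # ...but the app is notified in a later event

    def notify_app():
        app["tally"] = len(clay)

    def intruder():
        # A previously scheduled event that reads both Clay and the app.
        seen.append((len(clay), app["tally"]))

    queue.extend([merge, intruder])  # intruder was queued before the merge ran
    while queue:
        queue.popleft()()
    return seen, app["tally"]


seen, final_tally = run_scenario()
print(seen)         # [(3, 2)]: Clay has 3 files, the tally still says 2
print(final_tally)  # 3, once the deferred notification has run
```

The system ends up consistent. However, the event that ran in between observed a mismatch without any crash, which is the "subtly wrong" case this worry describes.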
Ted, that’s exactly right.
|
The inserted event is a transaction. It can fail. If it fails, the C layer must generate a fail event. So we at least know %gall is out of sync.
(In reply to Ted Blackman, Mon, Dec 18, 2017 at 4:37 PM)
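One way to picture the point that the inserted event is a transaction is shown below. The runtime applies each event as a unit. When an event crashes, the runtime leaves the state untouched and records a fail event in its place. This is a hypothetical Python model. `run_events`, the `("fail", ...)` log entries, and the `clay_rev`/`gall_rev` fields are illustrative stand-ins and not the C layer's actual API. After a failed notification, comparing the two revision counters is enough to show that %gall is behind Clay.

```python
def run_events(state, events):
    """Apply events in order; a crashed event leaves state untouched.

    Hypothetical model of the runtime's role, not the real C-layer API.
    """
    log = []
    for name, handler in events:
        try:
            new_state = handler(dict(state))  # work on a copy
        except Exception as err:
            # Discard the copy and record a fail event in its place.
            log.append(("fail", name, str(err)))
        else:
            state = new_state
            log.append(("done", name))
    return state, log


def sync(s):
    s["clay_rev"] = 2   # the Clay change commits
    return s


def gall_notify(s):
    raise RuntimeError("app crashed on new code")


state, log = run_events({"clay_rev": 1, "gall_rev": 1},
                        [("sync", sync), ("gall-notify", gall_notify)])
print(state)  # {'clay_rev': 2, 'gall_rev': 1}
print(log)    # [('done', 'sync'), ('fail', 'gall-notify', 'app crashed on new code')]
```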
|
Yeah, and the actually correct behavior here would be to abort the transaction *for the merge*, bouncing all the way back to kiln (which is trying to sync things) and getting printed there. But over the past few years this capability of the system has lapsed considerably, and the best remaining choice is permissiveness.
(In reply to cgyarvin, Mon, Dec 18, 2017 at 4:40 PM)
|
My worry is about the case where the sync succeeds, but another (previously scheduled) event gets run before the event that sends out the notifications of the Clay change. At that point the Clay change itself has been persisted, but its effects have not. That wouldn't necessarily cause a Nock error in the intruding event. The event might just do something subtly wrong because it expected the Gall state to be up to date with the Clay state.

Arguably, a guarantee that all callbacks for a given Clay revision have run by any particular time is simply one the system doesn't provide, so we could document it as a gotcha. Unless that would break something...

Anton says that because of this potentially out-of-sync state, we should consider the callbacks part of the merge transaction: if they fail, the merge fails. This is a coherent argument, but I'm not sure it's the best approach in practice, because a sync can cause an error that's only tenuously related to the initial file merge. It could be thought of as a matter of priority: if we have to choose between the app updating successfully and the merge succeeding, which do we choose? In the case of stars, it's definitely annoying that some pesky userspace code can clog up network-wide updates.

The reason to have transactional semantics is to avoid inconsistent states. For the past several days, @Fang- and I have been struggling to push updates to a network that's in an inconsistent state. It's not about to corrupt data, but different machines are running different code, even though ~zod thinks its syncing job is done. The network is stuck mid-sync in such a way that ~zod can no longer save its children by pushing down new code. We use filesystem syncing to rescue children. To me, this suggests one of two things. Either the success of filesystem syncs should take precedence over app code running successfully, or we need an extra layer of transaction around syncing so that ~zod's push to its %kids desk fails when the resulting sync to another machine fails. The latter feels brittle to me, though, so I think we'd be better off limiting the filesystem-sync transaction so that userspace errors can't stop it halfway.

If there's a better way to think about this, I'm open to ideas.
|
I agree that the sync should work no matter what. If a sync pushes code that breaks, the old code should just keep running, NBD.
That means we do have to treat a “build broken” state as normal. Just try again on the next desk change.
(In reply to Ted Blackman, Mon, Dec 18, 2017 at 5:15 PM)
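The policy of always letting the sync commit can be sketched as follows. Under this policy, a broken build simply leaves the old code running until the next desk change. This is a hypothetical Python model and not Gall's implementation. `AppHost` and `build` are illustrative names, and an "app" here is just a Python expression compiled with the built-in `compile`.

```python
def build(source):
    """Toy 'compiler': an app is a Python expression; bad source raises."""
    return eval(compile(source, "<app>", "eval"))


class AppHost:
    """Keeps the last good build running; a failed build is a normal state.

    Hypothetical model for illustration, not Gall's actual implementation.
    """

    def __init__(self, source):
        self.running = build(source)  # assume the initial build succeeds
        self.broken = False

    def on_desk_change(self, source):
        # Runs after the sync has already committed; it never undoes the sync.
        try:
            new_build = build(source)
        except Exception:
            # Keep serving the old build and wait for the next desk change.
            self.broken = True
        else:
            self.running = new_build
            self.broken = False


host = AppHost("'v1'")
for source in ["'v2", "'v3'"]:  # the first push has a syntax error
    host.on_desk_change(source)
    print(host.running, host.broken)
# v1 True
# v3 False
```

In this design, "broken" is ordinary state rather than an error that propagates. Recovery needs no special handling, because the next desk change triggers a fresh build.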
|
I'm going to be picking this up. I just discussed it with @belisarius222 to make sure I understand everything correctly. For the record, we settled on having Clay set a Behn timer for itself 0 seconds in the future whenever it wants to send out file-change notifications, and then sending those notifications from the corresponding ++wake instead. The file-tally example @belisarius222 mentioned should probably be documented as a gotcha. |
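A minimal sketch of the zero-delay timer approach, assuming a toy model in Python. `ToyBehn` and `ToyClay` are hypothetical stand-ins for Behn and Clay, and each fired timer runs as its own event. The toy does not model rollback. It only shows the ordering: the merge event finishes and schedules a wakeup, the notifications go out later from `wake`, and an error raised while notifying is attributed to `wake` rather than to `merge`.

```python
import heapq
import itertools


class ToyBehn:
    """Toy timer service; each fired timer runs as its own event (hypothetical)."""

    def __init__(self):
        self.now = 0
        self.timers = []              # heap of (fire_time, seq, handler)
        self.seq = itertools.count()  # tie-breaker: equal fire times stay FIFO
        self.errors = []

    def set_timer(self, delay, handler):
        heapq.heappush(self.timers, (self.now + delay, next(self.seq), handler))

    def run(self):
        while self.timers:
            self.now, _, handler = heapq.heappop(self.timers)
            try:
                handler()
            except Exception as err:
                # Only this event fails; earlier events have already finished.
                self.errors.append(f"{handler.__name__}: {err}")


class ToyClay:
    def __init__(self, behn, subscribers):
        self.behn = behn
        self.subscribers = subscribers
        self.revision = 1

    def merge(self):
        self.revision += 1
        # Don't notify inline; ask for a wakeup 0 seconds from now instead.
        self.behn.set_timer(0, self.wake)

    def wake(self):
        for notify in self.subscribers:
            notify(self.revision)


seen = []


def good_app(rev):
    seen.append(rev)


def broken_app(rev):
    raise RuntimeError(f"cannot build against revision {rev}")


behn = ToyBehn()
clay = ToyClay(behn, [good_app, broken_app])
behn.set_timer(0, clay.merge)
behn.run()
print(clay.revision, seen, behn.errors)
# 2 [2] ['wake: cannot build against revision 2']
```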
See #611 for further discussion of a related issue. |