Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filewatch: avoid accepting stale/duplicate events #4397

Merged
merged 3 commits into from Apr 2, 2021

Conversation

milas
Copy link
Member

@milas milas commented Apr 2, 2021

This fixes the somewhat legitimate flakiness experienced by one of
the tests (TestCrashRebuildTwoContainersTwoImages).

Previously, we added logic to avoid missed live updates, but the
safer logic means that it's possible to get a duplicate change:
in the case of the test that was failing intermittently, it had two
image targets that watched overlapping sets of files. If foo is
changed, once img1 target sees the event, it adds a record to its
pending file set for time.Now(), and a build is triggered.
Then, img2 target sees the same event, and adds a record to its
pending file set for time.Now(), which will intentionally not
be cleared after the build finishes because it appears to have been
a change that came from after the build was started. As a result,
it'll end up triggering another build.

The revised logic in this commit tries to handle cases safely but
also avoid redundant triggered builds:
1) The event pre-dates the start time of the last finished build, so
just drop it. (In practice, this is exceedingly unlikely to happen
because it would mean an event was very delayed.)

2) There's a current build in progress
* If it was started after the file was seen AND already knows
about the file, just ignore it. (This handles the case of the
overlapping target watch sets as in the flaky test.)
* Otherwise, use time.Now() so that it won't be erroneously
cleared at the finish of the build. (This is the missed live
update case.)

NOTE: The first case was removed because it's causing issues with Windows tests due to timing issues -- will fix & re-add in its own PR because it's really a "new" optimization

It's still possible to get a redundant build in some cases, but that
is impossible to avoid reliably given the current logic. Since we do
make an effort to batch file events (within a target), the most likely
cause would be two image targets with non-overlapping changed files,
where the later-to-be-processed file change happened before the build
was started, but we use time.Now() since it isn't in the list of
affected files for the build. The likelihood of multiple almost
instantaneous changes to files across targets is low enough that this
is still an acceptable trade-off vs the alternative of missing updates.

If extra builds become a common complaint, we could add a slight
waiting period of ~100ms or so after the latest file change before we
would start a build to increase the likelihood that changes across
targets have successfully propagated.

This fixes the somewhat legitimate flakiness experienced by one of
the tests (`TestCrashRebuildTwoContainersTwoImages`).

Previously, we added logic to avoid missed live updates, but the
safer logic means that it's possible to get a duplicate change:
in the case of the test that was failing intermittently, it had two
image targets that watched overlapping sets of files. If `foo` is
changed, once `img1` target sees the event, it adds a record to its
pending file set for `time.Now()`, and a build is triggered.
Then, `img2` target sees the *same* event, and adds a record to its
pending file set for `time.Now()`, which will _intentionally_ not
be cleared after the build finishes because it appears to have been
a change that came from _after_ the build was started. As a result,
it'll end up triggering another build.

The revised logic in this commit tries to handle cases safely but
also avoid redundant triggered builds:
 1) The event pre-dates the start time of the last finished build, so
    just drop it. (In practice, this is exceedingly unlikely to happen
    because it would mean an event was very delayed.)
 2) There's a current build in progress
      * If it was started _after_ the file was seen AND already knows
        about the file, just ignore it. (This handles the case of the
        overlapping target watch sets as in the flaky test.)
      * Otherwise, use `time.Now()` so that it won't be erroneously
        cleared at the finish of the build. (This is the missed live
        update case.)

It's still possible to get a redundant build in some cases, but that
is impossible to avoid reliably given the current logic. Since we do
make an effort to batch file events (within a target), the most likely
cause would be two image targets with non-overlapping changed files,
where the later-to-be-processed file change happened before the build
was started, but we use `time.Now()` since it isn't in the list of
affected files for the build. The likelihood of multiple almost
instantaneous changes to files across targets is low enough that this
is still an acceptable trade-off vs the alternative of missing updates.

If extra builds become a common complaint, we could add a slight
waiting period of ~100ms or so after the latest file change before we
would start a build to increase the likelihood that changes across
targets have successfully propagated.
@milas milas requested review from nicks and landism April 2, 2021 14:55
@milas
Copy link
Member Author

milas commented Apr 2, 2021

For context, the failure in the TestCrashRebuildTwoContainersTwoImages test would arise here:

assert.Equal(t, model.BuildReasonFlagCrash, ms.LastBuild().Reason)

We expect the flag to be exactly BuildReasonFlagCrash, which should be reliably true given the test conditions, but it was actually BuildReasonFlagCrash | BuildReasonFlagChangedFiles since the second target's change got missed in the prior build.

I actually originally tried to just put a wait further up in the test until all pending changes had been consumed, but they were getting "stuck" because buildcontrol fails the IsLiveUpdateTargetWaitingOnDeploy check - I think this might actually be a deficiency in our mocks? I tracked it pretty deep and it was bailing out due to deployment UID logic:

// If there was a recent deploy, the runtime state might not have the
// new pods yet. We check the PodAncestorID and see if it's in the most
// recent deploy set. If it's not, then we can should ignore these pods.
ancestorUID := runtimeState.PodAncestorUID
if ancestorUID != "" && !runtimeState.DeployedUIDSet.Contains(ancestorUID) {
return nil, nil
}

That said, while I'm sure it's possible to band-aid that test, it is indicative of a potential cause for redundant builds due to timing issues, so seemed worth trying to improve the pending file change logic further.

@milas
Copy link
Member Author

milas commented Apr 2, 2021

FYI I had to drop out the first case (event received that's older than the last finished build) because the Windows tests are hitting timing issues. That was a totally new optimization, so it's not a huge surprise. I'll see if I can patch up the tests and re-add that in its own PR later.

Realistically that case should never hit anyway 🤷

(I wanted to get this merged because the changes to exit controller in #4367 seem to exacerbate the flaky test failing due to timing changes)

@milas milas merged commit 3ed62b5 into master Apr 2, 2021
@milas milas deleted the milas/filewatch-avoid-redundant-builds branch April 2, 2021 20:13
nicks added a commit that referenced this pull request May 18, 2021
PendingConfigFileChanges had its own change tracking,
which didn't get the fixes we applied in
#4397

This removes that state completely and unifies
Tiltfile change tracking with other stuff
nicks added a commit that referenced this pull request May 19, 2021
PendingConfigFileChanges had its own change tracking,
which didn't get the fixes we applied in
#4397

This removes that state completely and unifies
Tiltfile change tracking with other stuff
nicks added a commit that referenced this pull request May 19, 2021
PendingConfigFileChanges had its own change tracking,
which didn't get the fixes we applied in
#4397

This removes that state completely and unifies
Tiltfile change tracking with other stuff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants