filewatch: avoid accepting stale/duplicate events #4397

milas · 2021-04-02T14:55:15Z

This fixes the somewhat legitimate flakiness experienced by one of
the tests (TestCrashRebuildTwoContainersTwoImages).

Previously, we added logic to avoid missed live updates, but the
safer logic means that it's possible to get a duplicate change:
in the case of the test that was failing intermittently, it had two
image targets that watched overlapping sets of files. If foo is
changed, once img1 target sees the event, it adds a record to its
pending file set for time.Now(), and a build is triggered.
Then, img2 target sees the same event, and adds a record to its
pending file set for time.Now(), which will intentionally not
be cleared after the build finishes because it appears to have been
a change that came from after the build was started. As a result,
it'll end up triggering another build.

The revised logic in this commit tries to handle cases safely but
also avoid redundant triggered builds:
1) The event pre-dates the start time of the last finished build, so
just drop it. (In practice, this is exceedingly unlikely to happen
because it would mean an event was very delayed.)
2) There's a current build in progress
* If it was started after the file was seen AND already knows
about the file, just ignore it. (This handles the case of the
overlapping target watch sets as in the flaky test.)
* Otherwise, use time.Now() so that it won't be erroneously
cleared at the finish of the build. (This is the missed live
update case.)

NOTE: The first case was removed because it's causing issues with Windows tests due to timing issues -- will fix & re-add in its own PR because it's really a "new" optimization

It's still possible to get a redundant build in some cases, but that
is impossible to avoid reliably given the current logic. Since we do
make an effort to batch file events (within a target), the most likely
cause would be two image targets with non-overlapping changed files,
where the later-to-be-processed file change happened before the build
was started, but we use time.Now() since it isn't in the list of
affected files for the build. The likelihood of multiple almost
instantaneous changes to files across targets is low enough that this
is still an acceptable trade-off vs the alternative of missing updates.

If extra builds become a common complaint, we could add a slight
waiting period of ~100ms or so after the latest file change before we
would start a build to increase the likelihood that changes across
targets have successfully propagated.

This fixes the somewhat legitimate flakiness experienced by one of the tests (`TestCrashRebuildTwoContainersTwoImages`). Previously, we added logic to avoid missed live updates, but the safer logic means that it's possible to get a duplicate change: in the case of the test that was failing intermittently, it had two image targets that watched overlapping sets of files. If `foo` is changed, once `img1` target sees the event, it adds a record to its pending file set for `time.Now()`, and a build is triggered. Then, `img2` target sees the *same* event, and adds a record to its pending file set for `time.Now()`, which will _intentionally_ not be cleared after the build finishes because it appears to have been a change that came from _after_ the build was started. As a result, it'll end up triggering another build. The revised logic in this commit tries to handle cases safely but also avoid redundant triggered builds: 1) The event pre-dates the start time of the last finished build, so just drop it. (In practice, this is exceedingly unlikely to happen because it would mean an event was very delayed.) 2) There's a current build in progress * If it was started _after_ the file was seen AND already knows about the file, just ignore it. (This handles the case of the overlapping target watch sets as in the flaky test.) * Otherwise, use `time.Now()` so that it won't be erroneously cleared at the finish of the build. (This is the missed live update case.) It's still possible to get a redundant build in some cases, but that is impossible to avoid reliably given the current logic. Since we do make an effort to batch file events (within a target), the most likely cause would be two image targets with non-overlapping changed files, where the later-to-be-processed file change happened before the build was started, but we use `time.Now()` since it isn't in the list of affected files for the build. The likelihood of multiple almost instantaneous changes to files across targets is low enough that this is still an acceptable trade-off vs the alternative of missing updates. If extra builds become a common complaint, we could add a slight waiting period of ~100ms or so after the latest file change before we would start a build to increase the likelihood that changes across targets have successfully propagated.

milas · 2021-04-02T15:08:05Z

For context, the failure in the TestCrashRebuildTwoContainersTwoImages test would arise here:

tilt/internal/engine/buildcontroller_test.go

Line 451 in 228fba6

assert.Equal(t, model.BuildReasonFlagCrash, ms.LastBuild().Reason)

We expect the flag to be exactly BuildReasonFlagCrash, which should be reliably true given the test conditions, but it was actually BuildReasonFlagCrash | BuildReasonFlagChangedFiles since the second target's change got missed in the prior build.

I actually originally tried to just put a wait further up in the test until all pending changes had been consumed, but they were getting "stuck" because buildcontrol fails the IsLiveUpdateTargetWaitingOnDeploy check - I think this might actually be a deficiency in our mocks? I tracked it pretty deep and it was bailing out due to deployment UID logic:

tilt/internal/store/build_result.go

Lines 472 to 478 in 228fba6

    
           // If there was a recent deploy, the runtime state might not have the 
        
           // new pods yet. We check the PodAncestorID and see if it's in the most 
        
           // recent deploy set. If it's not, then we can should ignore these pods. 
        
           ancestorUID := runtimeState.PodAncestorUID 
        
           if ancestorUID != "" && !runtimeState.DeployedUIDSet.Contains(ancestorUID) { 
        
           	return nil, nil 
        
           }

That said, while I'm sure it's possible to band-aid that test, it is indicative of a potential cause for redundant builds due to timing issues, so seemed worth trying to improve the pending file change logic further.

this might be breaking Windows tests?

milas · 2021-04-02T20:12:51Z

FYI I had to drop out the first case (event received that's older than the last finished build) because the Windows tests are hitting timing issues. That was a totally new optimization, so it's not a huge surprise. I'll see if I can patch up the tests and re-add that in its own PR later.

Realistically that case should never hit anyway 🤷

(I wanted to get this merged because the changes to exit controller in #4367 seem to exacerbate the flaky test failing due to timing changes)

PendingConfigFileChanges had its own change tracking, which didn't get the fixes we applied in #4397 This removes that state completely and unifies Tiltfile change tracking with other stuff

milas requested review from nicks and landism April 2, 2021 14:55

nicks approved these changes Apr 2, 2021

View reviewed changes

milas added 2 commits April 2, 2021 14:00

tests: remove duplicative & unnecessary file change logic

2e14e4e

filewatch: don't drop even if older than last build

7110ecf

this might be breaking Windows tests?

milas merged commit 3ed62b5 into master Apr 2, 2021

milas deleted the milas/filewatch-avoid-redundant-builds branch April 2, 2021 20:13

nicks mentioned this pull request May 18, 2021

engine: fix a race condition in Tiltfile reload #4565

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filewatch: avoid accepting stale/duplicate events #4397

filewatch: avoid accepting stale/duplicate events #4397

milas commented Apr 2, 2021 •

edited

milas commented Apr 2, 2021

milas commented Apr 2, 2021 •

edited

filewatch: avoid accepting stale/duplicate events #4397

filewatch: avoid accepting stale/duplicate events #4397

Conversation

milas commented Apr 2, 2021 • edited

milas commented Apr 2, 2021

milas commented Apr 2, 2021 • edited

milas commented Apr 2, 2021 •

edited

milas commented Apr 2, 2021 •

edited