flaky test: TestFileCache_{compiler, wazevo} #2039
Comments
relevant: #1815 |
Unrelated, but I noticed that if the function returns early (line 36), the finalizer is not set: wazero/internal/engine/compiler/engine_cache.go, lines 35 to 54 in 99c057b
As for the actual issue, I still don't know. One thing I notice is that before this commit the original code was a little more defensive (see here, and especially here), e.g. originally:
and I wonder if the defensive check against nil was actually needed. WDYT @achille-roussel, could this be related at all? |
My understanding of this code is that we don't need to set the finalizer on values we got from the in-memory table of compiled modules, because they already have a finalizer attached if needed (e.g., if they were obtained from the cache). I think the current code still has the same validation, it's just been moved into Unmap:
func (seg *CodeSegment) Unmap() error {
	if seg.code != nil {
		if err := platform.MunmapCodeSegment(seg.code[:cap(seg.code)]); err != nil {
			return err
		}
		seg.code = nil
		seg.size = 0
	}
	return nil
} |
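To make the point about the nil check concrete, here is a minimal sketch of why such a guard makes a release method safe to call twice (for example once explicitly and once from a finalizer). The codeSegment type and its unmaps counter are hypothetical stand-ins, and a plain byte slice replaces the real mmap'ed region, so this is not the wazero implementation.

```go
package main

import "fmt"

// codeSegment is a hypothetical stand-in for the CodeSegment quoted above.
// A plain byte slice replaces the mmap'ed region so the sketch runs anywhere.
type codeSegment struct {
	code   []byte
	unmaps int // counts how many times the memory was actually released
}

// Unmap releases the backing memory once. Later calls are no-ops because
// the nil check fails after the first release.
func (seg *codeSegment) Unmap() error {
	if seg.code != nil {
		seg.unmaps++ // a real implementation would munmap here
		seg.code = nil
	}
	return nil
}

func main() {
	seg := &codeSegment{code: make([]byte, 16)}
	_ = seg.Unmap() // explicit release
	_ = seg.Unmap() // e.g. a finalizer firing later
	fmt.Println("releases:", seg.unmaps)
}
```

The guard protects against a double release only. It does not help if the region is released while something is still executing from it.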
I can reproduce the same error (although I don't know if I am reproducing the exact same conditions) if I try to
So my guess is the finalizer might have to be guarded to check whether we are unmapping a region that is still in use because of some race condition:
EDIT: I admit this has nothing to do with the file cache though, so this might be still misguided... |
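One way to picture the guard suggested above is a use count: the release path only frees the region when nothing is executing from it, and otherwise defers the free to the last user. This is a rough sketch under that assumption. guardedSegment and all of its methods are hypothetical names, not wazero APIs, and a byte slice stands in for the mapped memory.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSegment is a hypothetical wrapper around a mapped region.
type guardedSegment struct {
	mu       sync.Mutex
	code     []byte // stands in for an mmap'ed region
	inUse    int    // number of calls currently executing from code
	released bool   // set once a release has been requested
}

// acquire marks the segment as in use. It fails once release was requested.
func (g *guardedSegment) acquire() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.released {
		return false
	}
	g.inUse++
	return true
}

// done ends one use and frees the memory if a release was deferred.
func (g *guardedSegment) done() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inUse--
	if g.released && g.inUse == 0 {
		g.code = nil // a real implementation would munmap here
	}
}

// release is what a finalizer or Close would call. It frees the memory
// immediately only when no call is in flight.
func (g *guardedSegment) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.released = true
	if g.inUse == 0 {
		g.code = nil
	}
}

// mapped reports whether the region is still available.
func (g *guardedSegment) mapped() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.code != nil
}

func main() {
	g := &guardedSegment{code: make([]byte, 16)}
	g.acquire() // a call starts executing
	g.release() // the finalizer fires while the call is running
	fmt.Println("mapped while in use:", g.mapped())
	g.done() // the call returns, the deferred release happens now
	fmt.Println("mapped after last use:", g.mapped())
}
```

Whether the extra locking on every call is acceptable for a hot path is a separate question. The sketch only shows the ordering the guard has to enforce.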
An update, although nothing is really working. I have added a bunch of logging statements in a few strategic places to observe the behavior while running I also tried to run multiple times with hammer/-race and a high count... No clue, I'll continue tomorrow. Maybe by playing with docker/cgroups I will be able to reproduce the same conditions as on CI. |
more observations that might be useful.
Also they both fail possibly because of a finalizer (likely because one goroutine is in
Notice that on Linux
means we're in traceback.go#L266 and segmentation violation code is while on macOS we're just in runtime/panic.go:1047 (i.e. panic()) and segmentation violation code is On Linux (but also macOS):
Now, because tests are compiled by package:
it means that the executable has run for a relatively short time, thus it can't have accumulated "a lot" of garbage, and that garbage is only related to the tests in that sub-package. So this rules out any interference from "other" tests. This is how I am running
For completeness, I am running my tests on Fedora using buildah/podman:
(--cpuset-cpus= might need to be configured) and then I am creating a stress test on the same CPUs
obviously one can fiddle with the parameters at will. Anyway, no luck yet. |
While I still haven't reproduced the errors in a natural way, I can confirm that a simulated is compatible with one of the two problems, i.e. the one on the old compiler. Essentially the trace containing code snippet 👇
func TestEngine_Call(t *testing.T) {
	cache := filecache.New(t.TempDir())
	e := NewEngine(testCtx, api.CoreFeaturesV2, cache).(*engine)

	// A single exported function "1" whose body is an infinite loop.
	module := &wasm.Module{
		TypeSection:     []wasm.FunctionType{{}},
		FunctionSection: []wasm.Index{0},
		CodeSection: []wasm.Code{
			{Body: []byte{
				wasm.OpcodeLoop, 0,
				wasm.OpcodeBr, 0,
				wasm.OpcodeEnd,
				wasm.OpcodeEnd,
			}},
		},
		ExportSection: []wasm.Export{
			{Name: "1", Type: wasm.ExternTypeFunc, Index: 0},
		},
		Exports: map[string]*wasm.Export{
			"1": {Name: "1", Type: wasm.ExternTypeFunc, Index: 0},
		},
		NameSection: &wasm.NameSection{
			FunctionNames: wasm.NameMap{{Index: 0, Name: "1"}},
			ModuleName:    "test",
		},
		ID: wasm.ModuleID{},
	}

	ctx, cancelFunc := context.WithTimeout(testCtx, 100*time.Millisecond)
	t.Cleanup(cancelFunc)
	store := wasm.NewStore(api.CoreFeaturesV2, e)
	sysctx := sys.DefaultContext(nil)

	// Check each error before using the value it guards.
	err := e.CompileModule(ctx, module, nil, true)
	require.NoError(t, err)
	cm := e.codes[module.ID]

	modInst, err := store.Instantiate(ctx, module, "test", sysctx, []wasm.FunctionTypeID{0})
	require.NoError(t, err)
	newModuleEngine, err := e.NewModuleEngine(module, modInst)
	require.NoError(t, err)
	modInst.Engine = newModuleEngine

	go modInst.ExportedFunction("1").Call(ctx)
	releaseCompiledModule(cm) // invoke the finalizer with the infinite loop still running
}
The same example, however, does not seem compatible with the trace for wazevo (it would contain
EDIT: huh, actually I didn't notice that it might also be
EDIT2: for wazevo, failing in |
I added a follow-up comment to #2088 (comment). Essentially, I think that might be a solution, and I might have found a reproducer (but I have to clean it up, because for now I just hard-patched a code path in the cache file reader to make it blow up). The solution would be a combination of that PR plus an extra fix to the file cache so that we validate the contents of the file when we mmap it (i.e. with a checksum or a similar mechanism) |
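As an illustration of the "checksum or similar mechanism" idea, here is a minimal sketch that prefixes a cache entry with its length and a CRC-32, and rejects the entry on read if either does not match. The 8-byte header layout and the encodeEntry/decodeEntry names are assumptions made for this example and are not the wazero file cache format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// errCorrupt is returned when a cache entry fails validation.
var errCorrupt = errors.New("cache entry corrupt")

// encodeEntry prefixes payload with its length and CRC-32 (hypothetical
// layout) so a reader can detect truncated or modified files.
func encodeEntry(payload []byte) []byte {
	buf := make([]byte, 8, 8+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(payload))
	return append(buf, payload...)
}

// decodeEntry validates the header before handing the payload to the caller,
// which would be the point where the bytes are about to be mapped as code.
func decodeEntry(entry []byte) ([]byte, error) {
	if len(entry) < 8 {
		return nil, errCorrupt
	}
	size := binary.LittleEndian.Uint32(entry[0:4])
	sum := binary.LittleEndian.Uint32(entry[4:8])
	payload := entry[8:]
	if uint64(len(payload)) != uint64(size) || crc32.ChecksumIEEE(payload) != sum {
		return nil, errCorrupt
	}
	return payload, nil
}

func main() {
	want := []byte("machine code bytes")
	entry := encodeEntry(want)

	payload, err := decodeEntry(entry)
	fmt.Println("intact entry ok:", err == nil && bytes.Equal(payload, want))

	entry[len(entry)-1] ^= 0xff // simulate on-disk corruption
	_, err = decodeEntry(entry)
	fmt.Println("corruption detected:", errors.Is(err, errCorrupt))
}
```

A CRC catches accidental truncation and bit flips. It is not a defense against deliberate tampering, which would need a cryptographic hash or signature instead.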
https://github.com/tetratelabs/wazero/actions/runs/8077516060/job/22067963967?pr=2099 this hasn't been resolved yet I think |
it's interesting that in both cases when the failure was in the old compiler it was amd64+macos-12 🧐 |
I think I got it in #2102. tl;dr: there's a call_indirect in a spec test that points to a function that does not exist anymore, because its module has been collected earlier |
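The lifetime problem described here can be sketched in plain Go: if a table slot only stores a raw function reference, nothing keeps the defining module reachable, so its finalizer may run while the slot is still callable. Storing a reference to the owner in the slot is one way to avoid that. The module and tableEntry types and the helpers below are hypothetical and do not claim to reflect the fix in #2102.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// module is a hypothetical owner of compiled code.
type module struct {
	name string
}

// tableEntry is a hypothetical function-table slot. The owner field keeps
// the defining module reachable for as long as the table itself is.
type tableEntry struct {
	funcIndex int
	owner     *module
}

// newTrackedModule returns a module whose finalizer sets *finalized to 1.
func newTrackedModule(name string, finalized *int32) *module {
	m := &module{name: name}
	runtime.SetFinalizer(m, func(*module) { atomic.StoreInt32(finalized, 1) })
	return m
}

// collect forces a garbage collection cycle.
func collect() { runtime.GC() }

// isFinalized reports whether the tracked module's finalizer has run.
func isFinalized(finalized *int32) bool { return atomic.LoadInt32(finalized) == 1 }

func main() {
	var finalized int32
	table := []tableEntry{{funcIndex: 0, owner: newTrackedModule("exporter", &finalized)}}

	collect() // the module is still reachable through table[0].owner
	fmt.Println("finalized while referenced by table:", isFinalized(&finalized))
	runtime.KeepAlive(table)
}
```

Only the negative case is shown, because a reachable object is never finalized, while the moment an unreachable one gets finalized is up to the runtime.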
We have recently been seeing a flaky segfault, as in
Since the failure is happening for both wazevo and the old compiler, I am pretty confident that
this has something to do with a lifecycle issue around mmap/finalizers in the file cache.