
flaky test: TestFileCache_{compiler, wazevo} #2039

Closed
mathetake opened this issue Feb 10, 2024 · 12 comments · Fixed by #2139
Labels
bug Something isn't working

@mathetake
Member

We have recently been seeing a flaky seg fault in CI. Since the failure is happening for both wazevo and the old compiler, I am pretty confident that this has something to do with a lifecycle issue around mmap/finalizers in the file caches.

@mathetake mathetake added the bug Something isn't working label Feb 10, 2024
@mathetake
Member Author

relevant: #1815

@mathetake mathetake changed the title flaky test: TestFileCache_{compiler, wazeo} flaky test: TestFileCache_{compiler, wazevo} Feb 18, 2024
@evacchi
Contributor

evacchi commented Feb 19, 2024

Unrelated, but I am noticing that if the function returns early (line 36), then the finalizer is not set.

func (e *engine) getCompiledModule(module *wasm.Module, listeners []experimental.FunctionListener) (cm *compiledModule, ok bool, err error) {
	cm, ok = e.getCompiledModuleFromMemory(module)
	if ok {
		return
	}
	cm, ok, err = e.getCompiledModuleFromCache(module)
	if ok {
		e.addCompiledModuleToMemory(module, cm)
		if len(listeners) > 0 {
			// Files do not contain the actual listener instances (it's impossible to cache them as files!), so assign each here.
			for i := range cm.functions {
				cm.functions[i].listener = listeners[i]
			}
		}
		// As this uses mmap, we need to munmap on the compiled machine code when it's GCed.
		e.setFinalizer(cm, releaseCompiledModule)
	}
	return
}

As for the actual issue: I still don't know, but one thing I notice is that before this commit the original code was a little more defensive (see here, and especially here).

e.g. originally:

	e := cm.executable
	if e == nil {
		return // already released
	}

	// Setting this to nil allows tests to know the correct finalizer function was called.
	cm.executable = nil
	if err := platform.MunmapCodeSegment(e); err != nil {
		// munmap failure cannot recover, and happen asynchronously on the finalizer thread. While finalizer
		// functions can return errors, they are ignored.
		panic(fmt.Errorf("compiler: failed to munmap code segment: %w", err))
	}

and I wonder whether the defensive nil check was actually needed. WDYT @achille-roussel, could this be related at all?

@achille-roussel
Collaborator

My understanding of this code is that we don't need to set the finalizer on values obtained from the in-memory table of compiled modules, because a finalizer was already attached to them when needed (e.g., when they were obtained from the cache).

I think the current code still has the same validation; it has just been moved into the asm.(*CodeSegment).Unmap function: 9780f0f#diff-7257d290f32f0a9c96c45985d8131d1a37533bfb68bb6cff425cf11d15f25757R66-R75

func (seg *CodeSegment) Unmap() error {
	if seg.code != nil {
		if err := platform.MunmapCodeSegment(seg.code[:cap(seg.code)]); err != nil {
			return err
		}
		seg.code = nil
		seg.size = 0
	}
	return nil
}

@evacchi
Contributor

evacchi commented Feb 20, 2024

I can reproduce the same error (although I don't know whether I am reproducing the exact same conditions) if I call releaseCompiledModule(cm) while that module is being executed. Essentially:

		go modInst.ExportedFunction("1").Call(testCtx)
		releaseCompiledModule(cm)

So my guess is that the finalizer might have to be guarded, to check whether we are unmapping a region that is still in use because of a race condition like this (a rough sketch follows after the list):

  1. we schedule a region to be released
  2. the module is no longer being executed, or it has been deleted, so eventually the GC will want to collect it
  3. the module is retrieved again from the file cache because we reloaded it concurrently
  4. the finalizer is still scheduled to unmap, so it kills the mmap'd region while we are inside it

EDIT: I admit this has nothing to do with the file cache, though, so this might still be misguided...
EDIT2: another thought: maybe it makes sense to register the finalizer on the segment instead of the compiled module; we don't really care about the lifecycle of the CompiledModule, do we?
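
For concreteness, a minimal sketch of that guard idea. The inUse counter and the flat executable field are hypothetical (not the real wazero layout), and syscall.Munmap stands in for platform.MunmapCodeSegment:

package sketch

import (
	"runtime"
	"sync/atomic"
	"syscall"
)

// compiledModule is a stand-in for the real type; inUse is a hypothetical
// counter that callers would increment/decrement around every call into the
// machine code.
type compiledModule struct {
	executable []byte       // the mmap'd machine code
	inUse      atomic.Int64 // >0 while some goroutine is executing the code
}

// releaseCompiledModuleGuarded refuses to unmap code that is still running:
// it re-arms itself (calling SetFinalizer from inside a finalizer is legal,
// as the object is resurrected for the duration of the call) and retries on
// a later GC cycle. Note this narrows, but does not fully close, the race
// described in the list above.
func releaseCompiledModuleGuarded(cm *compiledModule) {
	if cm.inUse.Load() != 0 {
		runtime.SetFinalizer(cm, releaseCompiledModuleGuarded)
		return
	}
	if e := cm.executable; e != nil {
		cm.executable = nil
		_ = syscall.Munmap(e) // stand-in for platform.MunmapCodeSegment
	}
}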

@evacchi
Contributor

evacchi commented Feb 21, 2024

An update, although nothing is really working. I have added a bunch of logging statements in a few strategic places to observe the behavior while running TestFileCache_compiler / TestFileCache_wazevo. I tried to strangle the executable with GOGC/GOMEMLIMIT. I also tried a combo of taskset --cpu-list 0-3 nice -n 19 ./filecache.test to see whether running at low priority with some noisy neighbours would trigger the bug. Nada.

I also tried running multiple times with hammer and -race at a high count... No clue; I'll continue tomorrow. Maybe by playing with docker/cgroups I can reproduce the same conditions as CI.
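
Another knob for this kind of hunt is forcing collections in a tight loop so finalizers fire as promptly as possible; a tiny sketch (nothing wazero-specific):

package stress

import "runtime"

// churnGC forces n full, blocking collections. After each cycle, finalizers
// of objects that became unreachable are queued on the runtime's dedicated
// finalizer goroutine; Gosched gives that goroutine a chance to run them.
func churnGC(n int) {
	for i := 0; i < n; i++ {
		runtime.GC()
		runtime.Gosched()
	}
}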

@evacchi
Contributor

evacchi commented Feb 22, 2024

More observations that might be useful:

  • The first failure is on the old compiler, on macOS amd64 + Go 1.20, during a normal run of the test suite.
  • The second failure is on wazevo, on Linux amd64 + Go 1.20, but inside a scratch container: the test binary is compiled first, then the container is created.

Also, they both fail possibly because of a finalizer (likely, since one goroutine is in the finalizer wait state), but in slightly different ways:

  • macOS:
        require.go:331: expected no error, but was exit status 2: unexpected fault address 0x2c879000
            fatal error: fault
            [signal SIGSEGV: segmentation violation code=0x1 addr=0x2c879000 pc=0x2c879000]
            
            goroutine 16348 [running]:
            runtime.throw({0x1514656?, 0xc000e401a6?})
            	/Users/runner/hostedtoolcache/go/1.20.12/x64/src/runtime/panic.go:1047 +0x5d fp=0xc000106f00 sp=0xc000106ed0 pc=0x1087d9d
            runtime.sigpanic()
            	/Users/runner/hostedtoolcache/go/1.20.12/x64/src/runtime/signal_unix.go:855 +0x28a fp=0xc000106f60 sp=0xc000106f00 pc=0x10a060a
            github.com/tetratelabs/wazero/internal/engine/compiler.(*callEngine).execWasmFunction(0xc000414000, {0x19ce548, 0xc00010e028}, 0x6417b0?, 0xe0?)

  • Linux:
        require.go:331: expected no error, but was exit status 2: unexpected fault address 0xc000c8ed00
            fatal error: fault
            [signal SIGSEGV: segmentation violation code=0x2 addr=0xc000c8ed00 pc=0xc000c8ed00]
            
            goroutine 18274 [running]:
            runtime.throw({0x726384?, 0x0?})
            	/opt/hostedtoolcache/go/1.20.13/x64/src/runtime/panic.go:1047 +0x5d fp=0xc000c8cba0 sp=0xc000c8cb70 pc=0x43699d
            runtime: g 18274: unexpected return pc for runtime.sigpanic called from 0xc000c8ed00
            stack: frame={sp:0xc000c8cba0, fp:0xc000c8cc00} stack=[0xc00008c000,0xc00008e000)
            
            runtime.sigpanic()
            	/opt/hostedtoolcache/go/1.20.13/x64/src/runtime/signal_unix.go:855 +0x28a fp=0xc000c8cc00 sp=0xc000c8cba0 pc=0x44d90a
            created by testing.(*T).Run
            	/opt/hostedtoolcache/go/1.20.13/x64/src/testing/testing.go:1629 +0x3ea

Notice that on Linux

 runtime: g 18274: unexpected return pc for runtime.sigpanic called from 0xc000c8ed00

means we're in traceback.go#L266, and the segmentation violation code is 0x2,

while on macOS we're just in runtime/panic.go:1047 (i.e. panic()), and the segmentation violation code is 0x1.

On Linux (but also macOS):

#define SEGV_MAPERR	1	/* address not mapped to object */
#define SEGV_ACCERR	2	/* invalid permissions for mapped object */
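
To make the two codes concrete, here is a small self-contained sketch (Linux/macOS; nothing wazero-specific) that provokes code=0x2, and, with the Mprotect swapped for a Munmap, code=0x1:

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

func main() {
	page, err := syscall.Mmap(-1, 0, 4096,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	p := (*byte)(unsafe.Pointer(&page[0]))

	// Revoke all permissions: the address stays mapped, so touching it now
	// faults with SEGV_ACCERR (code=0x2), like the Linux failure above.
	if err := syscall.Mprotect(page, syscall.PROT_NONE); err != nil {
		panic(err)
	}
	// Unmapping instead (syscall.Munmap(page)) would make the same access
	// fault with SEGV_MAPERR (code=0x1), like the macOS failure above.
	fmt.Println(*p) // faults here
}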

Now, because tests are compiled per package:

# use CGO_ENABLED=0 to ensure the build does not link against glibc
CGO_ENABLED=0 go test ./internal/integration_test/filecache/... -c -o filecache.test

it means that the executable has run for a relatively short time, so it can't have accumulated "a lot" of garbage, and that garbage is only related to the tests in that sub-package. This rules out any interference from "other" tests.

This is how I am running filecache.test, similarly to how it runs on CI (case 2):

# Dockerfile
FROM scratch
CMD ["/test"]

For completeness, I am running my tests on Fedora using buildah/podman:

buildah build -t wazero:test  .
podman run --cpus=2 --cpuset-cpus=0,1 --memory=300m  -v $PWD/filecache.test:/test --tmpfs /tmp -it localhost/wazero:test /test

(--cpuset-cpus= might need to be configured)

and then I am creating a stress test on the same CPUs:

taskset -c 0,1 stress --cpu 4 --vm 4 --vm-hang 1

Obviously one can fiddle with the parameters at will. Anyway, no luck yet.

@evacchi
Contributor

evacchi commented Feb 22, 2024

While I still haven't reproduced the errors in a natural way, I can confirm that a simulated one is compatible with one of the two problems, i.e. the one on the old compiler. Essentially, a trace containing execWasmFunction() and SIGSEGV code=0x1 is compatible with unmapping a function body while it is being executed.

code snippet 👇
func TestEngine_Call(t *testing.T) {
	cache := filecache.New(t.TempDir())
	e := NewEngine(testCtx, api.CoreFeaturesV2, cache).(*engine)

	module := &wasm.Module{
		TypeSection:     []wasm.FunctionType{{}},
		FunctionSection: []wasm.Index{0},
		CodeSection: []wasm.Code{
			{Body: []byte{
				wasm.OpcodeLoop, 0,
				wasm.OpcodeBr, 0,
				wasm.OpcodeEnd,
				wasm.OpcodeEnd,
			}},
		},
		ExportSection: []wasm.Export{
			{Name: "1", Type: wasm.ExternTypeFunc, Index: 0},
		},
		Exports: map[string]*wasm.Export{
			"1": {Name: "1", Type: wasm.ExternTypeFunc, Index: 0},
		},
		NameSection: &wasm.NameSection{
			FunctionNames: wasm.NameMap{{Index: 0, Name: "1"}},
			ModuleName:    "test",
		},
		ID: wasm.ModuleID{},
	}

	ctx, cancelFunc := context.WithTimeout(testCtx, 100*time.Millisecond)
	t.Cleanup(cancelFunc)

	store := wasm.NewStore(api.CoreFeaturesV2, e)
	sysctx := sys.DefaultContext(nil)
	err := e.CompileModule(ctx, module, nil, true)
	cm := e.codes[module.ID]
	require.NoError(t, err)
	modInst, err := store.Instantiate(ctx, module, "test", sysctx, []wasm.FunctionTypeID{0})
	require.NoError(t, err)

	newModuleEngine, err := e.NewModuleEngine(module, modInst)
	modInst.Engine = newModuleEngine
	require.NoError(t, err)

	go modInst.ExportedFunction("1").Call(ctx)
	releaseCompiledModule(cm) // invoke the finalizer with the infinite loop still running
}

The same example, however, does not seem compatible with the trace for wazevo (it would contain entrypoint() and code=0x1, while on CI code=0x2 and the trace did not include entrypoint()), which makes me think that, if finalizers are involved, it might have to do with sharedFunctionsFinalizer.

EDIT: huh, actually I hadn't noticed that it might also be entryPreambles, but neither that nor sharedFunctions would explain the different code.

EDIT2: for wazevo, failing in traceback.go means that it was not possible to generate a valid traceback; that might still mean we're inside generated code that the Go runtime does not know about.

@evacchi
Contributor

evacchi commented Feb 23, 2024

I added a follow-up comment to #2088 (comment)

Essentially, I think that might be a solution, and I might have found a reproducer (but I have to clean it up, because for now I just hard-patched a code path in the cache file reader to make it blow up). The solution would be a combo of that PR plus some extra hardening of the file cache, so that we validate the contents of the file when we mmap it (i.e. a checksum or another similar mechanism).
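
For illustration, a minimal sketch of what validate-on-read could look like; the trailing-checksum layout and readVerified are hypothetical, not the actual wazero cache format:

package cache

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"os"
)

// readVerified assumes the cached compiled-module file ends with a SHA-256
// checksum of everything before it, and fails fast on a mismatch instead of
// handing corrupt bytes to mmap as machine code.
func readVerified(path string) ([]byte, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(raw) < sha256.Size {
		return nil, errors.New("cache file too short")
	}
	body, sum := raw[:len(raw)-sha256.Size], raw[len(raw)-sha256.Size:]
	if got := sha256.Sum256(body); !bytes.Equal(got[:], sum) {
		return nil, errors.New("cache file checksum mismatch")
	}
	return body, nil
}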

@evacchi
Contributor

evacchi commented Feb 25, 2024

#2091 follows up on #2088 and hardens the cache

@mathetake
Member Author

This hasn't been resolved yet, I think: https://github.com/tetratelabs/wazero/actions/runs/8077516060/job/22067963967?pr=2099

@mathetake mathetake reopened this Feb 28, 2024
@evacchi
Contributor

evacchi commented Feb 28, 2024

It's interesting that in both cases where the failure was in the old compiler, it was amd64 + macos-12 🧐

@evacchi
Contributor

evacchi commented Feb 28, 2024

I think I got it in #2102

tl;dr: there's a call_indirect in a spec test that points to a function that does not exist because its module has been collected earlier
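
To gloss why that bites (hypothetical types, not wazero's actual layout): a table entry that stores only a raw code pointer is invisible to the GC, so nothing keeps the owning module, and hence its mmap'd code, alive:

// Hypothetical illustration, not wazero's actual types.
type tableEntry struct {
	// A raw pointer into another module's mmap'd code. uintptr is invisible
	// to the GC, so holding it does not keep that module's compiledModule
	// reachable; once the module is collected, its finalizer munmaps the
	// very code this entry points into, and the next call_indirect faults.
	codePtr uintptr

	// Keeping a real Go reference to the owner would pin the module (and
	// hence the mapping) for as long as this entry lives:
	// owner *compiledModule
}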
