DebugAllocator: Segfault in stack unwinder when reporting leak from exited thread #25405

@ezzieyguywuf

Description:

When running tests for a complex multithreaded application, we've identified a
reproducible segfault within the DebugAllocator's stack unwinding logic.

The crash occurs at process shutdown when the DebugAllocator attempts to
report a memory leak from a worker thread that has already exited. The unwinder
tries to read a frame pointer from the thread's now-invalid (and likely
unmapped) stack, leading to a segmentation fault. This appears to be a race
condition that is reliably triggered by our application's shutdown sequence,
which involves waking worker threads from blocking syscalls by closing file
descriptors.
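
For readers unfamiliar with this shutdown pattern, here is a minimal sketch of
it. It is not code from websocket.zig and is not expected to reproduce the
crash on its own. The names worker and wakeWorkerByClosingPipe are hypothetical,
a pipe stands in for the real sockets, and the std.posix and std.Thread calls
assume the 0.15.x standard library:

```zig
const std = @import("std");
const posix = std.posix;

// Hypothetical worker: blocks in read() until the write end of the pipe is
// closed, then records how many bytes the read returned (0 means EOF).
fn worker(read_fd: posix.fd_t, bytes_read: *usize) void {
    var buf: [1]u8 = undefined;
    // For brevity, a read error is recorded the same way as EOF.
    bytes_read.* = posix.read(read_fd, &buf) catch 0;
}

// Hypothetical shutdown sequence: spawn a blocked worker, wake it by closing
// a file descriptor, then wait for the thread to exit.
fn wakeWorkerByClosingPipe() !usize {
    const fds = try posix.pipe();
    defer posix.close(fds[0]);

    var bytes_read: usize = 1; // sentinel, overwritten by the worker
    const thread = std.Thread.spawn(.{}, worker, .{ fds[0], &bytes_read }) catch |err| {
        posix.close(fds[1]);
        return err;
    };

    // Closing the write end makes the blocked read() return 0 (EOF).
    posix.close(fds[1]);

    // After join() the worker has exited, so any addresses captured from its
    // stack may no longer refer to mapped memory.
    thread.join();
    return bytes_read;
}

pub fn main() !void {
    const n = try wakeWorkerByClosingPipe();
    std.debug.print("worker woke up; read returned {d}\n", .{n});
}
```

In the real application the worker threads also allocate through the
DebugAllocator, and leak reporting runs after they have exited.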

Affected Versions:

  • Broken: 0.16.0-dev.457+f90510b08
  • Working: 0.15.1

Steps to Reproduce:

The issue was found in the websocket.zig project.

  1. Clone the repository.
  2. Check out the dev branch.
  3. Using Zig version 0.16.0-dev.457+f90510b08, run the test suite:
    zig build test

Observed Behavior (Segfault):

The test command crashes with a segmentation fault. The stack trace points to
the next_internal function in the stack iterator (lib/std/debug.zig).

❯ zig build test
Segmentation fault at address 0xffffffffffffff20
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/debug.zig:945:52: 0x109c422 in next_internal (std.zig)
        const new_fp = math.add(usize, @as(*usize, @ptrFromInt(fp)).*, fp_bias) catch
                                                   ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/debug.zig:868:39: 0x106be51 in next (std.zig)
        var address = it.next_internal() orelse return null;
                                      ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/debug.zig:507:29: 0x1191068 in captureStackTrace (std.zig)
            addr.* = it.next() orelse {
                            ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/heap/debug_allocator.zig:515:40: 0x178cc64 in collectStackTrace (std.zig)
            std.debug.captureStackTrace(first_trace_addr, &stack_trace);
                                       ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/heap/debug_allocator.zig:333:34: 0x1782610 in captureStackTrace (std.zig)
                collectStackTrace(ret_addr, stack_addresses);
                                 ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/heap/debug_allocator.zig:801:41: 0x177e47b in alloc (std.zig)
                bucket.captureStackTrace(ret_addr, slot_count, 0, .alloc);
                                        ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/mem/Allocator.zig:142:26: 0x104e421 in allocBytesWithAlignment__anon_3127 (std.zig)
    return a.vtable.alloc(a.ptr, len, alignment, ret_addr);
                         ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/mem/Allocator.zig:282:40: 0x1780d59 in allocWithSizeAndAlignment__anon_377027 (std.zig)
    return self.allocBytesWithAlignment(alignment, byte_count, return_address);
                                       ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/mem/Allocator.zig:270:89: 0x178d9bc in alloc__anon_379231 (std.zig)
    const ptr: [*]align(a.toByteUnits()) T = @ptrCast(try self.allocWithSizeAndAlignment(@sizeOf(T), a, n, return_address));
                                                                                        ^
/home/wolfgangsanyer/Program/websocket.zig/src/server/server.zig:231:52: 0x178984d in listen__anon_379174 (websocket.zig)
                const threads = try allocator.alloc(Thread, worker_count);
                                                   ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/Thread.zig:530:21: 0x178c2ac in callFn__anon_379037 (std.zig)
                    @call(.auto, f, args) catch |err| {
                    ^
/tmp/zig-x86_64-linux-0.16.0-dev.457+f90510b08/lib/std/Thread.zig:783:30: 0x17818a9 in entryFn (std.zig)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x7f4a37d5ab7a in ??? (libc.so.6)

Analysis and Proposed Solution:

The root cause is that the stack unwinder dereferences frame pointers without
checking that they point to mapped memory. A naive local patch to
lib/std/debug.zig that reads memory through /proc/self/mem during unwinding
prevents the crash entirely.

The patch is a proof of concept rather than a proposed fix: it opens and closes
/proc/self/mem on every load, and on non-Linux targets it falls back to the
original unchecked dereference.

Diff of Naive Solution:

--- a/lib/std/debug.zig
+++ b/lib/std/debug.zig
@@ -14,6 +14,29 @@
 const Writer = std.Io.Writer;
 const tty = std.Io.tty;
 
+// This is a temporary workaround to test a theory about a crash in the stack unwinding logic.
+// It provides a way to safely read from an arbitrary memory address on Linux without crashing.
+fn safeLoad(comptime T: type, address: usize) ?T {
+    // The pread syscall takes a signed offset. If the address is too high,
+    // it will wrap around to a negative number, causing EINVAL.
+    if (address > std.math.maxInt(isize)) return null;
+
+    if (native_os != .linux) {
+        // On non-Linux platforms, fall back to the unsafe behavior for now.
+        // This could be implemented for other OSes if needed.
+        return @as(*const T, @ptrFromInt(address)).*;
+    }
+
+    var file = fs.openFileAbsolute("/proc/self/mem", .{ .mode = .read_only }) catch return null;
+    defer file.close();
+
+    var buf: T = undefined;
+    const bytes = std.mem.asBytes(&buf);
+    const res = posix.pread(file.handle, bytes, @intCast(address)) catch return null;
+    if (res != @sizeOf(T)) return null;
+    return buf;
+}
+
 pub const Dwarf = @import("debug/Dwarf.zig");
 pub const Pdb = @import("debug/Pdb.zig");
 pub const SelfInfo = @import("debug/SelfInfo.zig");
@@ -942,7 +965,7 @@
 
         // Sanity check.
         if (fp == 0 or !mem.isAligned(fp, @alignOf(usize))) return null;
-        const new_fp = math.add(usize, @as(*usize, @ptrFromInt(fp)).*, fp_bias) catch
+        const new_fp = math.add(usize, safeLoad(usize, fp) orelse return null, fp_bias) catch
             return null;
 
         // Sanity check: the stack grows down thus all the parent frames must be
@@ -950,7 +973,8 @@
         // A zero frame pointer often signals this is the last frame, that case
         // is gracefully handled by the next call to next_internal.
         if (new_fp != 0 and new_fp < it.fp) return null;
-        const new_pc = @as(*usize, @ptrFromInt(math.add(usize, fp, pc_offset) catch return null)).*;
+        const new_pc = safeLoad(usize, math.add(usize, fp, pc_offset) catch return null) orelse
+            return null;
 
         it.fp = new_fp;

Behavior with the Patch Applied:

With the above patch, the segfault is resolved. The test command now correctly
reports the memory leaks and exits gracefully.

❯ ../zig/zig-out/bin/zig build test
test
└─ run test
   └─ compile test Debug native failure
error: error(gpa): memory address 0x7fc4f7325640 leaked:
???:?:?: 0x18d5350 in ??? (exe)
???:?:?: 0x3da7dc8 in ??? (exe)
???:?:?: 0x3032aa6 in ??? (exe)
???:?:?: 0x2a4946c in ??? (exe)

error(gpa): memory address 0x7fc4f7326220 leaked:
???:?:?: 0x18d5350 in ??? (exe)
???:?:?: 0x3da7dc8 in ??? (exe)
???:?:?: 0x3032aa6 in ??? (exe)
???:?:?: 0x2a4946c in ??? (exe)

... (additional leaks) ...

31 of 31 tests passed

Additional Context:

We attempted a git bisect between the working 0.15.1 version and the broken
master branch, but the significant divergence between the branches made it
difficult to isolate the exact commit that introduced this regression.
