MetricBuffer#retrieve_updates segfault: returned objects reference freed Rust memory #396

@dacuna-ic

Description

Summary

MetricBuffer#retrieve_updates returns objects that appear to reference Rust-owned memory. When Ruby's GC runs concurrently (on another thread), it can collect native wrapper objects whose backing Rust memory is still being read by the scraper thread, causing a segfault.

Environment

  • Ruby 3.3.7 (also reproducible on 3.3.6)
  • temporalio 1.3.0 (also reproducible on 1.1.0)
  • arm64-darwin (also reproducible on aarch64-linux)

Reproduction

The script below creates a MetricBuffer, starts a background scraper thread that calls retrieve_updates and accesses all fields, then floods workflows to generate metric volume. It segfaults within seconds to minutes.

#!/usr/bin/env ruby
# frozen_string_literal: true

# Prerequisites:
#   - gem install temporalio
#   - A running Temporal server on localhost:7234
#     (e.g. `temporal server start-dev`)

require 'temporalio/client'
require 'temporalio/worker'
require 'temporalio/activity'
require 'temporalio/workflow'
require 'temporalio/runtime'
require 'securerandom'

TASK_QUEUE = "metric-buffer-segfault-repro"
NAMESPACE  = "default"

class NoOpActivity < Temporalio::Activity::Definition
  def execute(n)
    "done-#{n}"
  end
end

class BusyWorkflow < Temporalio::Workflow::Definition
  def execute(count)
    results = []
    count.times do |n|
      results << Temporalio::Workflow.execute_activity(
        NoOpActivity, n,
        start_to_close_timeout: 10
      )
    end
    results
  end
end

BUFFER_SIZE = 100_000
metrics_buffer = Temporalio::Runtime::MetricBuffer.new(BUFFER_SIZE)

runtime = Temporalio::Runtime.new(
  telemetry: Temporalio::Runtime::TelemetryOptions.new(
    logging: Temporalio::Runtime::LoggingOptions.new(
      log_filter: Temporalio::Runtime::LoggingFilterOptions.new(
        core_level: 'WARN', other_level: 'ERROR'
      )
    ),
    metrics: Temporalio::Runtime::MetricsOptions.new(
      buffer: metrics_buffer,
      attach_service_name: false
    )
  )
)
Temporalio::Runtime.default = runtime

# -- Scraper thread: retrieve_updates + access all fields --

scraper = Thread.new do
  loop do
    updates = metrics_buffer.retrieve_updates

    updates.each do |update|
      _name  = update.metric.name
      _kind  = update.metric.kind
      _unit  = update.metric.unit
      _value = update.value

      update.attributes.each do |k, v|
        _k = k.to_s
        _v = v.to_s
      end
    end

    puts "[scraper] Drained #{updates.size} metric updates" if updates.size > 0
    sleep(0.05)
  rescue => e
    puts "[scraper] ERROR: #{e.class}: #{e.message}"
    sleep(0.1)
  end
end
scraper.name = "metric_scraper"

# -- Connect client and start worker --

client = Temporalio::Client.connect("localhost:7234", NAMESPACE)

worker_thread = Thread.new do
  worker = Temporalio::Worker.new(
    client: client,
    task_queue: TASK_QUEUE,
    workflows: [BusyWorkflow],
    activities: [NoOpActivity]
  )
  worker.run(shutdown_signals: [])
end

sleep 2

puts "[main] Starting workflow flood. Watch for segfault..."
puts "[main] Ruby #{RUBY_VERSION} | temporalio #{Gem.loaded_specs['temporalio']&.version}"
puts "[main] PID: #{Process.pid}"

iteration = 0
loop do
  iteration += 1

  handles = (1..10).map do |i|
    client.start_workflow(
      BusyWorkflow, 5,
      id: "segfault-repro-#{iteration}-#{i}-#{SecureRandom.hex(4)}",
      task_queue: TASK_QUEUE
    )
  end

  handles.each { |h| h.result }

  puts "[main] Completed batch #{iteration} (#{iteration * 10} workflows)"
  sleep(0.1)
rescue Interrupt
  puts "\n[main] Interrupted."
  break
end

Expected Behavior

The script runs indefinitely, printing metric counts.

Actual Behavior

Segfaults after seconds to minutes:

repro.rb:70: [BUG] Segmentation fault at 0x0000000000000003

The crash occurs inside rb_vm_search_method_slowpath, typically when accessing update.attributes or update.metric.name — suggesting Ruby is calling a method on an object whose native backing memory has been freed.

Root Cause Analysis

The Update, Metric, and attribute objects returned by retrieve_updates appear to be thin Ruby wrappers around Rust-owned pointers rather than fully copied Ruby objects. When Ruby's GC runs on another thread and collects related native wrapper objects, the Rust side frees the backing memory, causing a use-after-free when the scraper thread subsequently reads from those pointers.
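If that analysis is right, a user-side mitigation is to eagerly deep-copy every field of each update into freshly allocated Ruby values immediately after retrieval, so nothing retained afterwards points at native memory. A minimal sketch of that idea; the `SafeMetric`/`SafeUpdate` structs and `deep_copy_update` helper are hypothetical names of mine, and the accessors are assumed from the fields the repro script reads, not taken from temporalio internals:

```ruby
# Plain-Ruby value objects; names are hypothetical, not part of temporalio.
SafeMetric = Struct.new(:name, :kind, :unit, keyword_init: true)
SafeUpdate = Struct.new(:metric, :value, :attributes, keyword_init: true)

# Copy every field of a (possibly native-backed) update into freshly
# allocated Ruby strings/numbers/hashes, so the copy owns all its data.
def deep_copy_update(update)
  SafeUpdate.new(
    metric: SafeMetric.new(
      name: update.metric.name.to_s.dup,
      kind: update.metric.kind.to_s.dup,
      unit: update.metric.unit&.to_s&.dup
    ),
    value: update.value,
    attributes: update.attributes.to_h { |k, v| [k.to_s.dup, v.dup] }
  )
end
```

In the scraper thread this would run right after `retrieve_updates`, before any other work, to shrink the window in which native-backed objects are held.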

Workaround

Disabling GC during retrieval and immediately serializing into pure Ruby objects prevents the segfault:

require 'json'

GC.disable
begin
  json = JSON.generate(metrics_buffer.retrieve_updates)
ensure
  GC.enable
end

JSON.parse(json).each do |m|
  # Safe: pure Ruby objects
end

This confirms the issue is GC-triggered: with GC disabled during the critical section, the script runs indefinitely without crashing (tested 2000+ workflows / 40+ batches).

Suggested Fix

retrieve_updates should return objects that own their data as Ruby strings/hashes/numbers, not wrappers around Rust pointers that can be invalidated by GC on other threads.
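For illustration, the conversion could happen once at the FFI boundary: while the Rust buffer is still alive, copy each field out and build a frozen, fully owned Ruby object. A sketch of the shape such return values might take; `OwnedMetric`, `OwnedUpdate`, and `build_owned_update` are hypothetical names, not the actual temporalio classes:

```ruby
# Hypothetical owned-data return types; not the actual temporalio classes.
OwnedMetric = Struct.new(:name, :kind, :unit, keyword_init: true)
OwnedUpdate = Struct.new(:metric, :value, :attributes, keyword_init: true)

# Build a frozen update from raw field values that were copied out of
# native memory before the Rust side could free it. After this, GC can
# collect any native wrappers without affecting the returned object.
def build_owned_update(name:, kind:, unit:, value:, attributes:)
  OwnedUpdate.new(
    metric: OwnedMetric.new(
      name: name.freeze, kind: kind.freeze, unit: unit&.freeze
    ).freeze,
    value: value,
    attributes: attributes.each_with_object({}) { |(k, v), h| h[k.freeze] = v }.freeze
  ).freeze
end
```

Freezing is optional but makes the ownership contract explicit: the returned objects are plain immutable Ruby data, safe to read from any thread.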

Happy to hear if I'm otherwise just doing it wrong. Do let me know!
