Summary
MetricBuffer#retrieve_updates returns objects that appear to reference Rust-owned memory. When Ruby's GC runs concurrently (on another thread), it can collect native wrapper objects whose backing Rust memory is still being read by the scraper thread, causing a segfault.
Environment
- Ruby 3.3.7 (also reproducible on 3.3.6)
- temporalio 1.3.0 (also reproducible on 1.1.0)
- arm64-darwin (also reproducible on aarch64-linux)
Reproduction
The script below creates a MetricBuffer, starts a background scraper thread that calls retrieve_updates and accesses all fields, then floods workflows to generate metric volume. It segfaults within seconds to minutes.
#!/usr/bin/env ruby
# frozen_string_literal: true
# Prerequisites:
# - gem install temporalio
# - A running Temporal server on localhost:7234
# (e.g. `temporal server start-dev --port 7234`; the dev server defaults to 7233)
require 'temporalio/client'
require 'temporalio/worker'
require 'temporalio/activity'
require 'temporalio/workflow'
require 'temporalio/runtime'
require 'securerandom'
TASK_QUEUE = "metric-buffer-segfault-repro"
NAMESPACE = "default"
class NoOpActivity < Temporalio::Activity::Definition
  def execute(n)
    "done-#{n}"
  end
end
class BusyWorkflow < Temporalio::Workflow::Definition
  def execute(count)
    results = []
    count.times do |n|
      results << Temporalio::Workflow.execute_activity(
        NoOpActivity, n,
        start_to_close_timeout: 10
      )
    end
    results
  end
end
BUFFER_SIZE = 100_000
metrics_buffer = Temporalio::Runtime::MetricBuffer.new(BUFFER_SIZE)
runtime = Temporalio::Runtime.new(
  telemetry: Temporalio::Runtime::TelemetryOptions.new(
    logging: Temporalio::Runtime::LoggingOptions.new(
      log_filter: Temporalio::Runtime::LoggingFilterOptions.new(
        core_level: 'WARN', other_level: 'ERROR'
      )
    ),
    metrics: Temporalio::Runtime::MetricsOptions.new(
      buffer: metrics_buffer,
      attach_service_name: false
    )
  )
)
Temporalio::Runtime.default = runtime
# -- Scraper thread: retrieve_updates + access all fields --
scraper = Thread.new do
  loop do
    updates = metrics_buffer.retrieve_updates
    updates.each do |update|
      _name = update.metric.name
      _kind = update.metric.kind
      _unit = update.metric.unit
      _value = update.value
      update.attributes.each do |k, v|
        _k = k.to_s
        _v = v.to_s
      end
    end
    puts "[scraper] Drained #{updates.size} metric updates" if updates.size > 0
    sleep(0.05)
  rescue => e
    puts "[scraper] ERROR: #{e.class}: #{e.message}"
    sleep(0.1)
  end
end
scraper.name = "metric_scraper"
# -- Connect client and start worker --
client = Temporalio::Client.connect("localhost:7234", NAMESPACE)
worker_thread = Thread.new do
  worker = Temporalio::Worker.new(
    client: client,
    task_queue: TASK_QUEUE,
    workflows: [BusyWorkflow],
    activities: [NoOpActivity]
  )
  worker.run(shutdown_signals: [])
end
sleep 2
puts "[main] Starting workflow flood. Watch for segfault..."
puts "[main] Ruby #{RUBY_VERSION} | temporalio #{Gem.loaded_specs['temporalio']&.version}"
puts "[main] PID: #{Process.pid}"
iteration = 0
loop do
  iteration += 1
  handles = (1..10).map do |i|
    client.start_workflow(
      BusyWorkflow, 5,
      id: "segfault-repro-#{iteration}-#{i}-#{SecureRandom.hex(4)}",
      task_queue: TASK_QUEUE
    )
  end
  handles.each { |h| h.result }
  puts "[main] Completed batch #{iteration} (#{iteration * 10} workflows)"
  sleep(0.1)
rescue Interrupt
  puts "\n[main] Interrupted."
  break
end
Expected Behavior
The script runs indefinitely, printing metric counts.
Actual Behavior
Segfaults after seconds to minutes:
repro.rb:70: [BUG] Segmentation fault at 0x0000000000000003
The crash occurs inside rb_vm_search_method_slowpath, typically when accessing update.attributes or update.metric.name. This suggests Ruby is calling a method on an object whose native backing memory has already been freed.
Root Cause Analysis
The Update, Metric, and attribute objects returned by retrieve_updates appear to be thin Ruby wrappers around Rust-owned pointers rather than fully copied Ruby objects. When Ruby's GC runs on another thread and collects related native wrapper objects, the Rust side frees the backing memory, causing a use-after-free when the scraper thread subsequently reads from those pointers.
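One way to probe this hypothesis further would be to make GC run far more often while the scraper reads the updates. The sketch below is only an illustration: `with_gc_stress` is a hypothetical helper, not part of temporalio, built on Ruby's `GC.stress` setting. If the crash is GC-triggered, wrapping the scraper's read loop in it should make the segfault appear much sooner, though I have not verified that here.

```ruby
# Hypothetical helper (not part of temporalio): run a block with GC.stress
# enabled, then restore whatever setting was in place before.
# GC.stress forces a collection at every allocation opportunity, so it is
# very slow and only suitable for debugging.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Sketch of use inside the scraper loop from the reproduction script:
#
#   with_gc_stress do
#     metrics_buffer.retrieve_updates.each do |update|
#       update.metric.name
#       update.attributes.each { |k, v| [k.to_s, v.to_s] }
#     end
#   end
```

Running the same loop with and without the wrapper gives a rough signal on whether GC frequency correlates with the crash.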
Workaround
Disabling GC during the retrieval and immediate serialization into pure Ruby objects prevents the segfault:
require 'json'

GC.disable
begin
  json = JSON.generate(metrics_buffer.retrieve_updates)
ensure
  GC.enable
end
JSON.parse(json).each do |m|
  # Safe: pure Ruby objects
end
This confirms the issue is GC-triggered: with GC disabled during the critical section, the script runs indefinitely without crashing (tested 2000+ workflows / 40+ batches).
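The workaround can be wrapped in a small helper so that it does not re-enable GC when a caller had already disabled it. This is a minimal sketch: `without_gc` is a hypothetical name, and it relies only on the documented return value of `GC.disable` (true when GC was already disabled). Note that `GC.disable` is process-wide, so this pauses collection for every thread, and the helper is not safe to use from several threads at once.

```ruby
# Hypothetical helper: disable GC for the duration of the block and
# re-enable it afterwards only if it was enabled on entry.
def without_gc
  was_disabled = GC.disable # true if GC was already disabled
  yield
ensure
  GC.enable unless was_disabled
end

# Sketch of use with the buffer from the reproduction script:
#
#   plain = without_gc do
#     metrics_buffer.retrieve_updates.map { |u| [u.metric.name.dup, u.value] }
#   end
```

Keeping the block short matters, because memory allocated while GC is disabled is not reclaimed until it is enabled again.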
Suggested Fix
retrieve_updates should return objects that own their data as Ruby strings/hashes/numbers, not wrappers around Rust pointers that can be invalidated by GC on other threads.
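To make the requested shape concrete, here is a sketch of what "owning their data" could look like from the caller's side. `snapshot_update` and `plain_value` are hypothetical helpers, and `FakeMetric` / `FakeUpdate` are stand-ins so the example runs without a Temporal server; the only accessors assumed are the ones used in the reproduction script. Copying on the Ruby side like this does not fix the crash by itself, since the copy still has to read the wrappers. It only illustrates the kind of value the binding could hand back.

```ruby
# Hypothetical helper: keep numbers and booleans, copy everything else
# into a fresh String.
def plain_value(value)
  case value
  when Numeric, true, false then value
  else value.to_s.dup
  end
end

# Hypothetical helper: copy one update into a Hash of plain Ruby values.
def snapshot_update(update)
  {
    name: update.metric.name.to_s.dup,
    kind: update.metric.kind.to_s.dup,
    unit: update.metric.unit&.to_s&.dup,
    value: update.value,
    attributes: update.attributes.to_h { |k, v| [k.to_s.dup, plain_value(v)] }
  }
end

# Stand-in objects (not the real temporalio classes).
FakeMetric = Struct.new(:name, :kind, :unit)
FakeUpdate = Struct.new(:metric, :value, :attributes)

update = FakeUpdate.new(
  FakeMetric.new("request_latency", :histogram, "ms"),
  12,
  { "task_queue" => "tq" }
)
copy = snapshot_update(update)
```

The resulting Hash holds no reference to the original objects' strings, which is the property the suggested fix asks `retrieve_updates` to guarantee.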
Happy to hear if I'm just doing it wrong. Do let me know!