Skip to content

Commit 248dc46

Browse files
authored
💥 BREAKING CHANGE - Disallow Ruby sync constructs from stdlib and provide safe alternatives (#314)
Fixes #310
1 parent 4a9740e commit 248dc46

14 files changed

+244
-91
lines changed

‎README.md‎

Lines changed: 26 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -571,9 +571,7 @@ Some things to note about the above code:
571571

572572
* A timer is represented by `Temporalio::Workflow.sleep`.
573573
* Timers are also started on `Temporalio::Workflow.timeout`.
574-
* _Technically_ `Kernel.sleep` and `Timeout.timeout` also delegate to the above calls, but the more explicit workflow
575-
forms are encouraged because they accept more options and are not subject to Ruby standard library implementation
576-
changes.
574+
* `Kernel.sleep` and `Timeout.timeout` are considered illegal by default.
577575
* Each timer accepts a `Cancellation`, but if none is given, it defaults to `Temporalio::Workflow.cancellation`.
578576
* `Temporalio::Workflow.wait_condition` accepts a block that waits until the evaluated block result is truthy, then
579577
returns the value.
@@ -586,7 +584,10 @@ Some things to note about the above code:
586584
#### Workflow Fiber Scheduling and Cancellation
587585

588586
Workflows are backed by a custom, deterministic `Fiber::Scheduler`. All fiber calls inside a workflow use this scheduler
589-
to ensure coroutines run deterministically.
587+
to ensure coroutines run deterministically. Although this means that `Kernel.sleep` and `Mutex` and such should work and
588+
since they are Fiber-aware, Temporal intentionally disables their use by default to prevent accidental use. See
589+
"Workflow Logic Constraints" and "Advanced Workflow Safety and Escaping" for more details, and see "Workflow Utilities"
590+
for alternatives.
590591

591592
Every workflow contains a `Temporalio::Cancellation` at `Temporalio::Workflow.cancellation`. This is canceled when the
592593
workflow is canceled. For all workflow calls that accept a cancellation token, this is the default. So if a workflow is
@@ -678,6 +679,9 @@ from workflows including:
678679
nil key for dynamic). `[]=` or `store` can be called on these to update the handlers, though defined handlers are
679680
encouraged over runtime-set ones.
680681

682+
There are also classes for `Temporalio::Workflow::Mutex`, `Temporalio::Workflow::Queue`, and
683+
`Temporalio::Workflow::SizedQueue` that are workflow-safe wrappers around the standard library forms.
684+
681685
`Temporalio::Workflow::ContinueAsNewError` can be raised to continue-as-new the workflow. It accepts positional args and
682686
defaults the workflow to the same as the current, though it can be changed with the `workflow` kwarg. See API
683687
documentation for other details.
@@ -714,15 +718,15 @@ Ruby workflows. This means there are several things workflows cannot do such as:
714718

715719
* Perform IO (network, disk, stdio, etc)
716720
* Access/alter external mutable state
717-
* Do any threading
721+
* Do any threading or blocking calls
718722
* Do anything using the system clock (e.g. `Time.Now`)
719723
* Make any random calls
720724
* Make any not-guaranteed-deterministic calls
721725

722-
This means you can't even call `puts` or logger calls outside of `Temporalio::Workflow.logger` because they use mutexes
723-
which may be hit during periods of high-contention, but they are not completely disabled since users may do quick
724-
debugging with them. See the [Advanced Workflow Safety and Escaping](#advanced-workflow-safety-and-escaping) section if
725-
needing to work around this.
726+
This means you can't even use logger calls outside of `Temporalio::Workflow.logger` because they use mutexes which may
727+
be hit during periods of high-contention, but they are not completely disabled since users may do quick debugging with
728+
them. See the [Advanced Workflow Safety and Escaping](#advanced-workflow-safety-and-escaping) section if needing to work
729+
around this.
726730

727731
#### Workflow Testing
728732

@@ -928,24 +932,22 @@ See the `WorkflowReplayer` API documentation for more details.
928932

929933
#### Advanced Workflow Safety and Escaping
930934

931-
Workflows use a custom fiber scheduler to make things like certain blocking calls and timeouts durable. There is also
932-
call tracing to prevent accidentally making illegal workflow calls. But sometimes in advanced situations, workarounds
933-
may be needed. This section describes advanced situations working with the workflow Fiber scheduler and illegal call
934-
tracer.
935+
Workflows use a custom fiber scheduler to make fibers durable. There is also call tracing to prevent accidentally making
936+
illegal workflow calls. But sometimes in advanced situations, workarounds may be needed. This section describes advanced
937+
situations working with the workflow Fiber scheduler and illegal call tracer.
935938

936939
##### Durable Fiber Scheduler
937940

938-
The custom fiber scheduler that powers workflows makes otherwise-local, blocking things durable. This is why `sleep` and
939-
`Timeout.timeout` and `Queue` and other things work durably. However, there are cases where it may be desired for these
940-
to work locally inside a workflow such as for logging or `puts` or other side-effecting, known-non-deterministic
941-
aspects.
941+
By default, Temporal considers `Logger`, `sleep`, `Timeout.timeout`, `Queue`, etc illegal. However, there are cases
942+
where it may be desired for these to work locally inside a workflow such as for logging or other side-effecting,
943+
known-non-deterministic aspects.
942944

943945
Users can pass a block to `Temporalio::Workflow::Unsafe.durable_scheduler_disabled` to not use the durable scheduler.
944946
This should be used any time the scheduler needs to be bypassed, e.g. for local stdout. Not doing this can cause
945-
workflows to get hung in high contention situations. For instance, if there is a `puts` or a logger (that isn't the
946-
safe-to-use `Temporalio::Workflow.logger`) in a workflow, _technically_ Ruby surrounds the IO writes with a mutex and
947-
in extreme high contention that mutex may durably block and then the workflow task may complete causing hung workflows
948-
because no event comes to wake the mutex.
947+
workflows to get hung in high contention situations. For instance, if there is a logger (that isn't the safe-to-use
948+
`Temporalio::Workflow.logger`) in a workflow, _technically_ Ruby surrounds the IO writes with a mutex and in extreme
949+
high contention that mutex may durably block and then the workflow task may complete causing hung workflows because no
950+
event comes to wake the mutex.
949951

950952
Also, by default anything that relies on IO wait that is not inside `durable_scheduler_disabled` will fail. It is
951953
recommended to put things that need this in `durable_scheduler_disabled`, but if the durable scheduler is still needed
@@ -957,9 +959,9 @@ Note `durable_scheduler_disabled` implies `illegal_call_tracing_disabled` (see n
957959

958960
##### Illegal Call Tracing
959961

960-
Ruby workflow threads employ a `TracePoint` to catch illegal calls such as `Time.now` or `Thread.new`. The set of
961-
illegal calls can be configured via the `illegal_workflow_calls` parameter when creating a worker. The default set is at
962-
`Temporalio::Worker.default_illegal_workflow_calls`.
962+
Ruby workflow threads employ a `TracePoint` to catch illegal calls such as `sleep` or `Time.now` or `Thread.new`. The
963+
set of illegal calls can be configured via the `illegal_workflow_calls` parameter when creating a worker. The default
964+
set is at `Temporalio::Worker.default_illegal_workflow_calls`.
963965

964966
When an illegal call is encountered, an exception is thrown. In advanced cases there may be a need to allow an illegal
965967
call that is known to be used deterministically. This code can be in a block passed to

‎temporalio/lib/temporalio/cancellation.rb‎

Lines changed: 16 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ class Cancellation
1818
def initialize(*parents)
1919
@canceled = false
2020
@canceled_reason = nil
21-
@canceled_mutex = Mutex.new
21+
@canceled_mutex = Workflow::Unsafe.illegal_call_tracing_disabled { Mutex.new }
2222
@canceled_cond_var = nil
2323
@cancel_callbacks = {} # Keyed by sentinel value, but value iteration still is deterministic
2424
@shield_depth = 0
@@ -28,22 +28,22 @@ def initialize(*parents)
2828

2929
# @return [Boolean] Whether this cancellation is canceled.
3030
def canceled?
31-
@canceled_mutex.synchronize { @canceled }
31+
canceled_mutex_synchronize { @canceled }
3232
end
3333

3434
# @return [String, nil] Reason for cancellation. Can be nil if not canceled or no reason provided.
3535
def canceled_reason
36-
@canceled_mutex.synchronize { @canceled_reason }
36+
canceled_mutex_synchronize { @canceled_reason }
3737
end
3838

3939
# @return [Boolean] Whether a cancel is pending but currently shielded.
4040
def pending_canceled?
41-
@canceled_mutex.synchronize { !@shield_pending_cancel.nil? }
41+
canceled_mutex_synchronize { !@shield_pending_cancel.nil? }
4242
end
4343

4444
# @return [String, nil] Reason for pending cancellation. Can be nil if not pending canceled or no reason provided.
4545
def pending_canceled_reason
46-
@canceled_mutex.synchronize { @shield_pending_cancel&.first }
46+
canceled_mutex_synchronize { @shield_pending_cancel&.first }
4747
end
4848

4949
# Raise an error if this cancellation is canceled.
@@ -71,13 +71,13 @@ def wait
7171
return
7272
end
7373

74-
@canceled_mutex.synchronize do
74+
canceled_mutex_synchronize do
7575
break if @canceled
7676

7777
# Add cond var if not present
7878
if @canceled_cond_var.nil?
7979
@canceled_cond_var = ConditionVariable.new
80-
@cancel_callbacks[Object.new] = proc { @canceled_mutex.synchronize { @canceled_cond_var.broadcast } }
80+
@cancel_callbacks[Object.new] = proc { canceled_mutex_synchronize { @canceled_cond_var.broadcast } }
8181
end
8282

8383
# Wait on it
@@ -94,10 +94,10 @@ def wait
9494
def shield
9595
raise ArgumentError, 'Block required' unless block_given?
9696

97-
@canceled_mutex.synchronize { @shield_depth += 1 }
97+
canceled_mutex_synchronize { @shield_depth += 1 }
9898
yield
9999
ensure
100-
callbacks_to_run = @canceled_mutex.synchronize do
100+
callbacks_to_run = canceled_mutex_synchronize do
101101
@shield_depth -= 1
102102
if @shield_depth.zero? && @shield_pending_cancel
103103
reason = @shield_pending_cancel.first
@@ -120,7 +120,7 @@ def shield
120120
def add_cancel_callback(&block)
121121
raise ArgumentError, 'Must provide block' unless block_given?
122122

123-
callback_to_run_immediately, key = @canceled_mutex.synchronize do
123+
callback_to_run_immediately, key = canceled_mutex_synchronize do
124124
break [block, nil] if @canceled
125125

126126
key = Object.new
@@ -135,7 +135,7 @@ def add_cancel_callback(&block)
135135
#
136136
# @param key [Object] Key returned from {add_cancel_callback}.
137137
def remove_cancel_callback(key)
138-
@canceled_mutex.synchronize do
138+
canceled_mutex_synchronize do
139139
@cancel_callbacks.delete(key)
140140
end
141141
nil
@@ -144,7 +144,7 @@ def remove_cancel_callback(key)
144144
private
145145

146146
def on_cancel(reason:)
147-
callbacks_to_run = @canceled_mutex.synchronize do
147+
callbacks_to_run = canceled_mutex_synchronize do
148148
# If we're shielding, set as pending and return nil
149149
if @shield_depth.positive?
150150
@shield_pending_cancel = [reason]
@@ -166,5 +166,9 @@ def prepare_cancel(reason:)
166166
@cancel_callbacks.clear
167167
to_return.values
168168
end
169+
170+
def canceled_mutex_synchronize(&)
171+
Workflow::Unsafe.illegal_call_tracing_disabled { @canceled_mutex.synchronize(&) }
172+
end
169173
end
170174
end

‎temporalio/lib/temporalio/internal/worker/workflow_instance/illegal_call_tracer.rb‎

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ def self.frozen_validated_illegal_calls(illegal_calls)
1616
# @type var fixed_val: :all | Worker::IllegalWorkflowCallValidator | Hash[Symbol, TrueClass | Worker::IllegalWorkflowCallValidator] # rubocop:disable Layout/LineLength
1717
fixed_val = case val
1818
when Temporalio::Worker::IllegalWorkflowCallValidator
19-
if sub_val.method_name
19+
if val.method_name
2020
raise ArgumentError,
2121
'Top level IllegalWorkflowCallValidator instances cannot have method name'
2222
end

‎temporalio/lib/temporalio/internal/worker/workflow_instance/scheduler.rb‎

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -111,7 +111,7 @@ def block(_blocker, timeout = nil)
111111
# We just yield because unblock will resume this. We will just wrap in timeout if needed.
112112
if timeout
113113
begin
114-
Timeout.timeout(timeout) { Fiber.yield }
114+
Workflow.timeout(timeout) { Fiber.yield }
115115
true
116116
rescue Timeout::Error
117117
false

‎temporalio/lib/temporalio/worker.rb‎

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -267,7 +267,11 @@ def self.default_illegal_workflow_calls
267267
#:write
268268
],
269269
'Kernel' => %i[abort at_exit autoload autoload? eval exec exit fork gets load open rand readline readlines
270-
spawn srand system test trap],
270+
sleep spawn srand system test trap],
271+
# Loggers use mutexes in ways that can hang workflows, so users need to disable the durable scheduler to use
272+
# them
273+
'Logger' => :all,
274+
'Monitor' => :all,
271275
'Net::HTTP' => :all,
272276
'Pathname' => :all,
273277
# TODO(cretz): Investigate why clock_gettime called from Timeout thread affects this code at all. Stack trace
@@ -282,9 +286,14 @@ def self.default_illegal_workflow_calls
282286
'Signal' => :all,
283287
'Socket' => :all,
284288
'Tempfile' => :all,
289+
'Timeout' => :all,
285290
'Thread' => %i[abort_on_exception= exit fork handle_interrupt ignore_deadlock= kill new pass
286291
pending_interrupt? report_on_exception= start stop initialize join name= priority= raise run
287292
terminate thread_variable_set wakeup],
293+
'Thread::ConditionVariable' => :all,
294+
'Thread::Mutex' => IllegalWorkflowCallValidator.known_safe_mutex_validator,
295+
'Thread::SizedQueue' => :all,
296+
'Thread::Queue' => :all,
288297
'Time' => IllegalWorkflowCallValidator.default_time_validators
289298
} #: Hash[String, :all | Array[Symbol]]
290299
hash.each_value(&:freeze)

‎temporalio/lib/temporalio/worker/illegal_workflow_call_validator.rb‎

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,15 @@ def self.default_time_validators
3939
]
4040
end
4141

42+
# @return [IllegalWorkflowCallValidator] Workflow call validator that is tailored to disallow most Mutex calls,
43+
# but let others through for certain situations.
44+
def self.known_safe_mutex_validator
45+
@known_safe_mutex_validator ||= IllegalWorkflowCallValidator.new do
46+
# Only Google Protobuf use of Mutex is known to be safe, fail unless any caller location path has protobuf
47+
raise 'disallowed' unless caller_locations&.any? { |loc| loc.path&.include?('google/protobuf/') }
48+
end
49+
end
50+
4251
# @return [String, nil] Method name if this validator is specific to a method.
4352
attr_reader :method_name
4453

‎temporalio/lib/temporalio/workflow.rb‎

Lines changed: 71 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -537,28 +537,72 @@ def self.replaying?
537537
# Run a block of code with illegal call tracing disabled. Users should be cautious about using this as it can
538538
# often signify unsafe code.
539539
#
540+
# If this is invoked outside of a workflow, it just runs the block.
541+
#
540542
# @yield Block to run with call tracing disabled
541543
#
542544
# @return [Object] Result of the block.
543545
def self.illegal_call_tracing_disabled(&)
544-
Workflow._current.illegal_call_tracing_disabled(&)
546+
if Workflow.in_workflow?
547+
Workflow._current.illegal_call_tracing_disabled(&)
548+
else
549+
yield
550+
end
545551
end
546552

547553
# Run a block of code with IO enabled. Specifically this allows the `io_wait` call of the fiber scheduler to work.
548554
# Users should be cautious about using this as it can often signify unsafe code. Note, this is often only
549555
# applicable to network code as file IO and most process-based IO does not go through scheduler `io_wait`.
556+
#
557+
# If this is invoked outside of a workflow, it just runs the block.
550558
def self.io_enabled(&)
551-
Workflow._current.io_enabled(&)
559+
if Workflow.in_workflow?
560+
Workflow._current.io_enabled(&)
561+
else
562+
yield
563+
end
552564
end
553565

554566
# Run a block of code with the durable/deterministic workflow Fiber scheduler off. This means fallback to default
555567
# fiber scheduler and no workflow helpers will be available in the block. This is usually only needed in advanced
556568
# situations where a third party library does something like use "Timeout" in a way that shouldn't be made
557569
# durable.
558570
#
571+
# If this is invoked outside of a workflow, it just runs the block.
572+
#
559573
# This implies {illegal_call_tracing_disabled}.
560574
def self.durable_scheduler_disabled(&)
561-
Workflow._current.durable_scheduler_disabled(&)
575+
if Workflow.in_workflow?
576+
Workflow._current.durable_scheduler_disabled(&)
577+
else
578+
yield
579+
end
580+
end
581+
582+
# @!visibility private
583+
def self._wrap_ruby_class_as_legal(target_class)
584+
Class.new do
585+
define_method(:initialize) do |*args, **kwargs, &block|
586+
@underlying = Unsafe.illegal_call_tracing_disabled do
587+
target_class.new(*args, **kwargs, &block) # steep:ignore
588+
end
589+
end
590+
591+
# @!visibility private
592+
def method_missing(name, ...)
593+
if @underlying.respond_to?(name)
594+
# Call with tracing disabled
595+
Unsafe.illegal_call_tracing_disabled { @underlying.public_send(name, ...) }
596+
else
597+
super
598+
end
599+
end
600+
601+
# @!visibility private
602+
def respond_to_missing?(name, include_all = false)
603+
@underlying.respond_to?(name, include_all) || super
604+
end
605+
end
562606
end
563607
end
564608

@@ -625,5 +669,29 @@ class InvalidWorkflowStateError < Error; end
625669
# error can still be used with configuring workflow failure exception types to change non-deterministic errors from
626670
# task failures to workflow failures.
627671
class NondeterminismError < Error; end
672+
673+
# Mutex is a workflow-safe wrapper around {::Mutex}.
674+
#
675+
# As of this writing, all methods on Mutex are safe for workflow use and are implicitly made deterministic by the
676+
# Fiber scheduler. The primary reason this is wrapped as safe is to be able to catch unintentional uses of Mutex by
677+
# non-workflow-safe code. However, users may prefer to use the more powerful {wait_condition} approach as a mutex
678+
# (e.g. wait until a certain attribute is set to false then set it to true before continuing).
679+
Mutex = Unsafe._wrap_ruby_class_as_legal(::Mutex)
680+
681+
# Queue is a workflow-safe wrapper around {::Queue}.
682+
#
683+
# As of this writing, all methods on Queue are safe for workflow use and are implicitly made deterministic by the
684+
# Fiber scheduler. The primary reason this is wrapped as safe is to be able to catch unintentional uses of Queue by
685+
# non-workflow-safe code. However, users may prefer to use the more powerful {wait_condition} approach as a queue
686+
# (e.g. wait until an array is non-empty before continuing).
687+
Queue = Unsafe._wrap_ruby_class_as_legal(::Queue)
688+
689+
# SizedQueue is a workflow-safe wrapper around {::SizedQueue}.
690+
#
691+
# As of this writing, all methods on SizedQueue are safe for workflow use and are implicitly made deterministic by
692+
# the Fiber scheduler. The primary reason this is wrapped as safe is to be able to catch unintentional uses of
693+
# SizedQueue by non-workflow-safe code. However, users may prefer to use the more powerful {wait_condition} approach
694+
# as a queue (e.g. wait until an array is non-empty before continuing).
695+
SizedQueue = Unsafe._wrap_ruby_class_as_legal(::SizedQueue)
628696
end
629697
end

‎temporalio/sig/temporalio/cancellation.rbs‎

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,11 +9,12 @@ module Temporalio
99
def check!: (?Exception err) -> void
1010
def to_ary: -> [Cancellation, Proc]
1111
def wait: -> void
12-
def shield: [T] { (?) -> untyped } -> T
12+
def shield: [T] { (?) -> T } -> T
1313
def add_cancel_callback: { -> untyped } -> Object
1414
def remove_cancel_callback: (Object key) -> void
1515

1616
private def on_cancel: (reason: Object?) -> void
1717
private def prepare_cancel: (reason: Object?) -> Array[Proc]?
18+
private def canceled_mutex_synchronize: [T] { (?) -> T } -> T
1819
end
1920
end

‎temporalio/sig/temporalio/worker/illegal_workflow_call_validator.rbs‎

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@ module Temporalio
1414
end
1515

1616
def self.default_time_validators: -> Array[IllegalWorkflowCallValidator]
17+
def self.known_safe_mutex_validator: -> IllegalWorkflowCallValidator
1718

1819
attr_reader method_name: Symbol?
1920
attr_reader block: ^(CallInfo) -> void

0 commit comments

Comments
 (0)