Issue: TBB worker threads scheduled to efficiency cores on Apple Silicon
Summary
On Apple Silicon Macs, TBB worker threads are created with the default QoS (Quality of Service) class. macOS interprets this as "not user-facing work" and may schedule these threads to efficiency (E) cores even when performance (P) cores are available. This significantly degrades Stan's parallel performance.
Description
Apple Silicon chips have heterogeneous cores:
- Performance (P) cores: Fast, for compute-intensive work
- Efficiency (E) cores: Slower (~3x), for background tasks
macOS uses QoS classes to decide core scheduling. The default QoS class signals "this work isn't urgent," allowing macOS to prefer E-cores to save power. For Stan's compute workloads, this is the wrong signal.
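To check which QoS class a given thread actually carries, a small diagnostic along these lines may help. This is a minimal sketch, not part of Stan Math or TBB: the function name `current_thread_qos_name` is hypothetical, and it relies on `qos_class_self()` from Apple's `<pthread/qos.h>`. On non-Apple platforms it simply reports that no QoS API is available.

```cpp
#include <string>

#if defined(__APPLE__)
#include <pthread.h>
#include <pthread/qos.h>
#endif

// Hypothetical helper: returns a readable name for the calling thread's
// QoS class. On platforms without the Apple QoS API it returns "unsupported".
inline std::string current_thread_qos_name() {
#if defined(__APPLE__)
  switch (qos_class_self()) {
    case QOS_CLASS_USER_INTERACTIVE: return "user-interactive";
    case QOS_CLASS_USER_INITIATED:   return "user-initiated";
    case QOS_CLASS_DEFAULT:          return "default";
    case QOS_CLASS_UTILITY:          return "utility";
    case QOS_CLASS_BACKGROUND:       return "background";
    case QOS_CLASS_UNSPECIFIED:      return "unspecified";
  }
  return "unknown";
#else
  return "unsupported";
#endif
}
```

Calling this from inside a worker thread (for example at the start of a parallel task) and logging the result would show whether the workers run at the default class described above.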
Observed behavior
- Environment: macOS 26.2 (Tahoe), Apple M3 Ultra (24 P-cores, 8 E-cores)
- Stan model using reduce_sum with 12 threads per chain, 2 chains
- Initial CPU usage: ~800% per chain (threads on P-cores)
- After ~4 minutes: CPU usage drops to ~100-300% per chain (threads demoted to E-cores)
- P-cores sit idle while E-cores are saturated
This behavior appears more aggressive in macOS 26 (Tahoe) but affects all Apple Silicon Macs.
Root cause
TBB creates worker threads without setting a QoS class, so they inherit the default. macOS sees long-running default-QoS threads as background work and demotes them to E-cores.
Proposed fix
Stan Math already has a task_scheduler_observer in stan/math/rev/core/init_chainablestack.hpp that runs when TBB worker threads are created. Adding a call to pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0) in on_scheduler_entry() would signal to macOS that these are user-initiated compute threads.
This doesn't prevent E-core usage - it tells macOS to prefer P-cores when available. If all P-cores are busy (e.g., running more threads than P-cores), macOS can still use E-cores. The fix ensures P-cores aren't left idle while work runs slowly on E-cores.
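Because the QoS call is only a request, it may be worth checking its return value. The following is a minimal sketch under stated assumptions: `request_user_initiated_qos` is a hypothetical helper name, not an existing Stan Math or TBB function. It wraps `pthread_set_qos_class_self_np`, which returns zero on success and an errno value otherwise, and compiles to a no-op on other platforms.

```cpp
#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
#include <pthread.h>
#include <pthread/qos.h>
#endif

// Hypothetical helper (not part of Stan Math or TBB).
// Asks macOS to treat the calling thread as user-initiated work.
// Returns true only if the request was made and the call reported success;
// returns false on failure and on platforms without the QoS API.
inline bool request_user_initiated_qos() {
#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
  // Zero means the QoS class was applied to the calling thread.
  return pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0) == 0;
#else
  return false;
#endif
}
```

A scheduler observer could call such a helper once per worker thread and ignore a false result, since a failed request leaves the thread at its previous QoS class.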
void on_scheduler_entry(bool worker) {
#ifdef __APPLE__
#if defined(__arm64__) || defined(__aarch64__)
  // Prefer performance cores for compute threads
  // (the call is declared in <pthread/qos.h>)
  pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
#endif
#endif
  // ... existing AD tape initialization ...
}
References
- oneTBB issue #896: Apple Silicon QoS support - TBB developers acknowledged the issue but suggested users handle it via task_scheduler_observer
- Apple QoS documentation
Environment
- macOS: 26.2 (Tahoe), also affects earlier versions
- Hardware: Apple M3 Ultra (also affects M1, M2, M4 series)
- Stan Math: 2.38.0 (TBB 2020.3)