Skip to content

fix(plugin-native): follow-up items from PR #353 FFI hardening review #365

@staging-devin-ai-integration

Description

Context

Follow-up items from PR #353 (plugin SDK FFI hardening). These were validated as real findings but deferred because they are architectural, low-risk, or would expand the PR scope significantly.

Tracked per review discussion.

Follow-up Items

FFI Safety / Soundness

  • Logger Send+Sync unsound with non-null enabled_user_data (logger.rs:34-35). Current hosts pass null, but the Send+Sync impl is a promise to plugin authors. Tighten via fn-pointer-only callbacks or Arc<dyn Send+Sync>. Breaking SDK API change — needs migration path.

  • PluginMetadataStorage Send+Sync by convention (metadata_storage.rs:52-54). Stored in OnceLock, never mutated after init, but public mutable fields technically break the "never-mutated-after-init" invariant. Consider pub(crate) visibility.

  • Post-create_instance handle leak (wrapper.rs:1023-1029). The block between create_instance returning Ok(handle) and Ok(Self { ... }) (e.g. set_log_enabled_callback) can leak the handle if a future addition panics or fails. Wrap in a scopeguard-style fallback that calls destroy_instance if the constructor doesn't reach Ok.

EnvFilter / Callsite

  • EnvFilter cache poisoning — shared callsite Identifier (wrapper.rs:781-821,896-938). All leaked per-(target,level) Metadata entries share a single PLUGIN_LOG_CALLSITE Identifier. tracing-subscriber's EnvFilter caches Interest by callsite ID, so directives like RUST_LOG=whisper=debug,kokoro=off can be masked by whichever target registers first. Fix: one leaked Callsite impl per (target,level) entry, or avoid register_callsite entirely. Add an EnvFilter integration test.

  • PluginLogCallsite not registered (noted in code). The perf win from v9's enabled-check short-circuit is at risk until callsite registration lands properly.

Timeouts / Blocking

  • begin_call rollback can run destroy_instance on arbitrary threads (wrapper.rs:286-308). Plugins may assume worker-thread context (e.g. ORT thread-locals). Document or enforce.

  • InstanceState::Drop → plugin destroy can land on tokio worker threads (wrapper.rs:359-366). Blocking destroy wedges tokio. Mitigated by worker thread detach but not guaranteed.

  • Shared timeout_warned across send and reply paths (wrapper.rs:1198-1236). One-shot dedup loses the reply-side warn when send times out first. Consider splitting into two atomics or rate-limiting (e.g. one warn/min) rather than one-shot.

  • on_upstream_hint FFI calls have no timeout (wrapper.rs:748-782). A slow/wedged hint permanently wedges the worker; all subsequent Tick/Process fail the send-side timeout with "worker likely wedged in prior FFI call". Already acknowledged in module docs (l.46-50), but the end state is node death with a confusing error. Either give hint calls their own timeout, rate-limit hints, or drop when worker occupancy is high.

Resource Cache

  • ResourceCache::clear() doesn't reset counters (resource_cache.rs:143-176). After clear, hits + misses + init_races > entries. Document as monotonic-by-design or expose reset_stats().

  • ResourceCache::clear() races with in-flight get_or_init (resource_cache.rs:143-157). Only tests use it today. Consider a generation counter before production callers appear.

  • get_or_init race path runs init twice (resource_cache.rs:194-225). Fine now, but for GB-sized ML models this wastes RAM + seconds. Add a TODO for a CondVar/Notify second-check design.

Worker / Performance

  • Worker Vec reuse lost on panic (wrapper.rs:524, 596, 672). A recurrently panicking plugin forces malloc thrashing. Minor DoS risk. Already documented.

  • input_pins()/output_pins() clone on every call (wrapper.rs:1029-1035). Hot path for sources.

CI / Tooling

  • Add Miri to CI so api_ptr_preserves_provenance test actually runs (wrapper.rs:2540-2552). Currently #[cfg(miri)] only.

  • cdylib panic-across-FFI test fixture — real loaded plugin exercising panic paths. Unit coverage documents why extern-stub panic testing aborts. Consider a dedicated test fixture crate.

SDK Ergonomics / Cleanup

  • logger.rs:78-84CString::new per log() call despite target_cstr: Option<Arc<CString>> cache field. Reuse the cached value.

  • cstring_sanitize in conversions.rs:541-543 silently falls back to empty. Align with ffi_guard::cstring_lossy which logs.

  • create_instance clones full NodeMetadata just for .kind (lib.rs:930,1148). Once storage is cached, read from it.

  • Document field drop-order invariant (library after handle) in NativePluginWrapper (wrapper.rs:212).

  • set_log_enabled_callback called during instance creation with no blocking bound (wrapper.rs:997-1003). A pathological plugin could wedge engine startup. Low risk but worth documenting.

  • error_to_c thread-local lifetime (conversions.rs:1010-1030). A nested host call can silently invalidate the plugin's captured pointer. Doc now says "copy immediately" — consider enforcing via API change in a future version.

  • Promote assert_send_sync const-assert pattern in SDK docs (parakeet_node.rs:75-79). The pattern const _: () = { const fn assert_send_sync<T: Send + Sync>() {}; assert_send_sync::<MyNode>(); }; is a good alternative to unsafe impl Send+Sync. Worth documenting as the preferred approach for plugin authors.

Metrics

  • plugin.calls can double-count post-timeout completions (metrics.rs:93-97). Consider renaming to call_observations. Already documented in code.

  • metrics.rs:100-102kind: &'static str to avoid per-instance to_string() allocation in build_labels. Low priority.

  • len() returns 0 when poisoned (resource_cache.rs:121). Conflates empty/poisoned. Consistent with std idioms, so OK — document explicitly.

Wire / ABI

  • Bytes::clone() pointer stability cross-crate invariant (conversions.rs:433-458). Regression test exists, but consider pinning bytes version or adding debug_assert.

  • v9 plugin missing optional set_log_enabled_callback (lib.rs:621-623, 771-773). Default logger_mut() -> None means v9 plugins silently get no enabled callback benefit.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions