Skip to content

Commit b3382ef

Browse files
committed
Bugfix: Fix orphaned llama-server processes
- `configure_ai` had three separate lock/unlock cycles, creating a race window where a spawned process existed but `child_pid` wasn't set yet. Rapid provider switching (Local → OpenAI → Local → OpenAI) could orphan processes that survived app quit. - Fix: spawn + `child_pid` assignment now happen synchronously inside a single MANAGER lock. Only the health check (up to 60s) runs async. - Split `start_server_inner` into `spawn_and_track_server` (sync) + `wait_for_server_health` (async) + `cleanup_failed_server`. - `wait_for_server_health` now kills the process on timeout or early death instead of returning `Err` and leaving it running. - Belt-and-suspenders: `kill_stale_llama_servers` uses `pgrep -f` to find and stop any llama-server processes from our AI directory, called on startup and before every spawn.
1 parent 52afe37 commit b3382ef

3 files changed

Lines changed: 167 additions & 107 deletions

File tree

apps/desktop/src-tauri/src/ai/CLAUDE.md

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ Three provider modes:
1515
| `manager.rs` | Central coordinator. Global `Mutex<Option<ManagerState>>` singleton. Most Tauri commands live here. Stores provider + OpenAI config in `ManagerState`. |
1616
| `download.rs` | HTTP streaming download with Range-based resume. Emits `ai-download-progress` events (200ms throttle). Cooperative cancellation via function parameter (`Fn() -> bool`). |
1717
| `extract.rs` | Copies bundled `llama-server` binary + dylibs from `resources/ai/` to the AI data dir. Sets Unix permissions, handles symlinks. |
18-
| `process.rs` | Spawns child process with `DYLD_LIBRARY_PATH` set. Instant SIGKILL to stop (llama-server is stateless; macOS reclaims all GPU/mmap resources). `kill_process` for fire-and-forget (quit, orphans), `kill_and_reap_in_background` for normal operation (reaps zombie in bg thread). Port discovery via `bind(:0)`. Takes `ctx_size` param. |
18+
| `process.rs` | Spawns child process with `DYLD_LIBRARY_PATH` set. Instant SIGKILL to stop (llama-server is stateless; macOS reclaims all GPU/mmap resources). `kill_process` for fire-and-forget (quit, orphans), `kill_and_reap_in_background` for normal operation (reaps zombie in bg thread). `kill_stale_llama_servers` for belt-and-suspenders orphan cleanup by process name. Port discovery via `bind(:0)`. |
1919
| `client.rs` | reqwest client with `AiBackend` enum: `Local { port }` or `OpenAi { api_key, base_url, model }`. Routes requests accordingly. |
2020
| `suggestions.rs` | Builds few-shot prompt from listing cache, routes to configured backend, sanitizes response. |
2121

@@ -37,7 +37,8 @@ Frontend loads
3737
openaiApiKey, openaiBaseUrl, openaiModel
3838
})
3939
-> backend: if provider === 'local' && model installed && local AI supported
40-
-> start_server_inner(ctx_size)
40+
-> spawn_and_track_server() (sync, inside lock — PID tracked immediately)
41+
-> wait_for_server_health() (async — polls up to 60s)
4142
-> emit 'ai-server-ready' when healthy
4243
```
4344

@@ -111,8 +112,14 @@ The frontend (`AiSection.svelte`) tracks `installStep` state and displays "Step
111112
**Gotcha**: `get_folder_suggestions` returns `Ok(Vec::new())` on AI errors, not `Err`.
112113
**Why**: AI suggestions are a nice-to-have enhancement. Returning empty gracefully hides the failure.
113114

114-
**Gotcha**: `configure_ai` must NOT block. Server start is spawned in background via `tauri::async_runtime::spawn`.
115-
**Why**: `start_server_inner` takes 5-60s for health check polling. Blocking would freeze the frontend on startup.
115+
**Gotcha**: `configure_ai` must NOT block. Only the health check runs async via `tauri::async_runtime::spawn`.
116+
**Why**: Health check polling takes 5-60s. Blocking would freeze the frontend on startup.
117+
118+
**Gotcha**: The process spawn and `child_pid` assignment must happen synchronously inside the MANAGER lock.
119+
**Why**: Previously, spawn happened inside an async task, creating a race window where a process existed but wasn't tracked in `child_pid`. Rapid provider switching (Local → OpenAI → Local → OpenAI) could orphan processes that survived app quit. Fixed by splitting into `spawn_and_track_server` (sync, inside lock) + `wait_for_server_health` (async).
120+
121+
**Gotcha**: `wait_for_server_health` kills the process on timeout or early death — don't remove that cleanup.
122+
**Why**: Without it, a process that fails health check would be orphaned (PID tracked but never cleaned up until explicit stop).
116123

117124
## Dependencies
118125

apps/desktop/src-tauri/src/ai/manager.rs

Lines changed: 128 additions & 103 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ use super::download::{cleanup_partial, download_file};
99
use super::extract::{LLAMA_SERVER_BINARY, REQUIRED_DYLIB, extract_bundled_llama_server};
1010
use super::process::{
1111
SERVER_LOG_FILENAME, find_available_port, is_process_alive, kill_and_reap_in_background, kill_process,
12-
log_diagnostics, read_log_tail, spawn_llama_server,
12+
kill_stale_llama_servers, log_diagnostics, read_log_tail, spawn_llama_server,
1313
};
1414
use super::{AiState, AiStatus, ModelInfo, get_default_model, get_model_by_id, is_local_ai_supported};
1515
use crate::ignore_poison::IgnorePoison;
@@ -73,16 +73,16 @@ pub fn init<R: Runtime>(app: &AppHandle<R>) {
7373
openai_model: String::from("gpt-4o-mini"),
7474
});
7575

76-
// Clean up stale PID from a previous session (crash, force-quit, or normal restart)
76+
// Belt-and-suspenders: stop ALL llama-server processes from our AI directory,
77+
// not just the tracked PID. Catches orphans from race conditions or crashes.
78+
if let Some(ref m) = *manager {
79+
kill_stale_llama_servers(&m.ai_dir);
80+
}
81+
82+
// Clean up tracked PID from a previous session
7783
if let Some(ref mut m) = *manager
78-
&& let Some(pid) = m.state.pid
84+
&& m.state.pid.is_some()
7985
{
80-
if is_process_alive(pid) {
81-
log::info!("AI manager: stopping orphaned llama-server (PID {pid}) from previous session");
82-
kill_process(pid); // Can't reap — it's a child of the old app process, not ours
83-
} else {
84-
log::debug!("AI manager: clearing dead PID {pid} from previous session");
85-
}
8686
m.state.pid = None;
8787
m.state.port = None;
8888
save_state(&m.ai_dir, &m.state);
@@ -490,57 +490,59 @@ pub fn configure_ai<R: Runtime>(
490490
"AI configure: provider={provider}, context_size={context_size}, base_url={openai_base_url}, model={openai_model}"
491491
);
492492

493-
let should_start_local;
494-
let should_stop;
493+
// Single lock: decide, stop, spawn — no race window for orphan processes
494+
let spawn_result;
495495
{
496496
let mut manager = MANAGER.lock_ignore_poison();
497-
if let Some(ref mut m) = *manager {
498-
// Stop server if switching away from local while one is running
499-
should_stop = provider != "local" && m.child_pid.is_some();
500-
501-
m.provider = provider.clone();
502-
m.context_size = context_size;
503-
m.openai_api_key = openai_api_key;
504-
m.openai_base_url = openai_base_url;
505-
m.openai_model = openai_model;
506-
507-
should_start_local =
508-
provider == "local" && is_local_ai_supported() && is_fully_installed(m) && m.child_pid.is_none();
509-
} else {
510-
return;
511-
}
512-
}
497+
let Some(ref mut m) = *manager else { return };
513498

514-
// Stop local server if provider changed away from local
515-
if should_stop {
516-
log::info!("AI configure: provider changed away from local, stopping server");
517-
let mut manager = MANAGER.lock_ignore_poison();
518-
if let Some(ref mut m) = *manager
499+
// Stop server if switching away from local while one is running
500+
if provider != "local"
519501
&& let Some(pid) = m.child_pid.take()
520502
{
503+
log::info!("AI configure: provider changed away from local, stopping server");
521504
kill_and_reap_in_background(pid);
522505
m.state.port = None;
523506
m.state.pid = None;
524507
save_state(&m.ai_dir, &m.state);
525508
}
509+
510+
m.provider = provider.clone();
511+
m.context_size = context_size;
512+
m.openai_api_key = openai_api_key;
513+
m.openai_base_url = openai_base_url;
514+
m.openai_model = openai_model;
515+
516+
// Spawn server synchronously so child_pid is set before the lock is released.
517+
// Only the health check (up to 60s) runs async.
518+
spawn_result =
519+
if provider == "local" && is_local_ai_supported() && is_fully_installed(m) && m.child_pid.is_none() {
520+
match spawn_and_track_server(m) {
521+
Ok((pid, port)) => {
522+
m.server_starting = true;
523+
Some((pid, port))
524+
}
525+
Err(e) => {
526+
log::error!("AI configure: couldn't spawn server: {e}");
527+
None
528+
}
529+
}
530+
} else {
531+
None
532+
};
526533
}
527534

528-
if should_start_local {
529-
let app_clone = app.clone();
535+
// Health check asynchronously (the slow part, up to 60s)
536+
if let Some((pid, port)) = spawn_result {
530537
let _ = app.emit("ai-starting", ());
531-
{
532-
let mut manager = MANAGER.lock_ignore_poison();
533-
if let Some(ref mut m) = *manager {
534-
m.server_starting = true;
535-
}
536-
}
538+
let ai_dir = get_ai_dir(&app);
537539
tauri::async_runtime::spawn(async move {
538-
match start_server_inner(&app_clone).await {
540+
match wait_for_server_health(&ai_dir, pid, port).await {
539541
Ok(()) => {
540542
log::info!("AI: server ready");
541-
let _ = app_clone.emit("ai-server-ready", ());
543+
let _ = app.emit("ai-server-ready", ());
542544
}
543-
Err(e) => log::error!("AI manager: failed to start server: {e}"),
545+
Err(e) => log::error!("AI manager: server didn't start: {e}"),
544546
}
545547
let mut manager = MANAGER.lock_ignore_poison();
546548
if let Some(ref mut m) = *manager {
@@ -573,33 +575,44 @@ pub fn start_ai_server<R: Runtime>(app: AppHandle<R>, ctx_size: u32) -> Result<(
573575
return Err(String::from("Local AI not supported on this hardware"));
574576
}
575577

576-
let should_start;
578+
// Recovery: re-extract binary if missing (before acquiring lock)
579+
let ai_dir = get_ai_dir(&app);
580+
let binary_path = ai_dir.join(LLAMA_SERVER_BINARY);
581+
if !binary_path.exists() {
582+
log::debug!("AI manager: binary missing, attempting re-extraction...");
583+
extract_bundled_llama_server(&app, &ai_dir)?;
584+
}
585+
586+
let spawn_result;
577587
{
578588
let mut manager = MANAGER.lock_ignore_poison();
579-
if let Some(ref mut m) = *manager {
580-
m.context_size = ctx_size;
581-
should_start = is_fully_installed(m) && m.child_pid.is_none();
582-
} else {
589+
let Some(ref mut m) = *manager else {
583590
return Err(String::from("AI manager not initialized"));
584-
}
591+
};
592+
m.context_size = ctx_size;
593+
594+
spawn_result = if is_fully_installed(m) && m.child_pid.is_none() {
595+
match spawn_and_track_server(m) {
596+
Ok((pid, port)) => {
597+
m.server_starting = true;
598+
Some((pid, port))
599+
}
600+
Err(e) => return Err(e),
601+
}
602+
} else {
603+
None
604+
};
585605
}
586606

587-
if should_start {
588-
let app_clone = app.clone();
607+
if let Some((pid, port)) = spawn_result {
589608
let _ = app.emit("ai-starting", ());
590-
{
591-
let mut manager = MANAGER.lock_ignore_poison();
592-
if let Some(ref mut m) = *manager {
593-
m.server_starting = true;
594-
}
595-
}
596609
tauri::async_runtime::spawn(async move {
597-
match start_server_inner(&app_clone).await {
610+
match wait_for_server_health(&ai_dir, pid, port).await {
598611
Ok(()) => {
599612
log::info!("AI: server ready");
600-
let _ = app_clone.emit("ai-server-ready", ());
613+
let _ = app.emit("ai-server-ready", ());
601614
}
602-
Err(e) => log::error!("AI manager: failed to start server: {e}"),
615+
Err(e) => log::error!("AI manager: server didn't start: {e}"),
603616
}
604617
let mut manager = MANAGER.lock_ignore_poison();
605618
if let Some(ref mut m) = *manager {
@@ -919,9 +932,16 @@ async fn do_download<R: Runtime>(app: &AppHandle<R>) -> Result<(), String> {
919932
// Emit installing event so UI shows "Setting up AI..." while server starts
920933
let _ = app.emit("ai-installing", ());
921934

922-
// Start the server FIRST, then emit install complete
923-
// This ensures the server is healthy before showing "AI ready"
924-
start_server_inner(app).await?;
935+
// Start the server FIRST, then emit install complete.
936+
// Spawn synchronously so PID is tracked immediately, then health-check async.
937+
let (pid, port) = {
938+
let mut manager = MANAGER.lock_ignore_poison();
939+
let Some(ref mut m) = *manager else {
940+
return Err(String::from("AI manager not initialized"));
941+
};
942+
spawn_and_track_server(m)?
943+
};
944+
wait_for_server_health(&ai_dir, pid, port).await?;
925945

926946
// Emit install complete only after server is healthy
927947
let _ = app.emit("ai-install-complete", ());
@@ -934,58 +954,49 @@ fn is_cancel_requested() -> bool {
934954
manager.as_ref().is_some_and(|m| m.cancel_requested)
935955
}
936956

937-
async fn start_server_inner<R: Runtime>(app: &AppHandle<R>) -> Result<(), String> {
938-
let ai_dir = get_ai_dir(app);
939-
let model = get_current_model();
957+
/// Spawns llama-server and immediately tracks its PID in manager state.
958+
/// Must be called while holding the MANAGER lock.
959+
/// Returns (pid, port) for the caller to health-check asynchronously.
960+
fn spawn_and_track_server(m: &mut ManagerState) -> Result<(u32, u16), String> {
961+
let model = get_model_by_id(&m.state.installed_model_id).unwrap_or_else(get_default_model);
962+
let port = find_available_port().ok_or("No available port")?;
940963

941-
// Recovery: if model exists but binary is missing, try re-extraction
942-
let binary_path = ai_dir.join(LLAMA_SERVER_BINARY);
943-
if !binary_path.exists() {
944-
log::debug!("AI manager: binary missing, attempting re-extraction...");
945-
extract_bundled_llama_server(app, &ai_dir)?;
946-
}
964+
log::debug!("AI server: starting llama-server on port {port} with context size {}", m.context_size);
947965

948-
// Read context size from manager state
949-
let ctx_size = {
950-
let manager = MANAGER.lock_ignore_poison();
951-
manager.as_ref().map(|m| m.context_size).unwrap_or(4096)
952-
};
966+
// Belt-and-suspenders: stop any stale llama-servers before spawning a new one
967+
kill_stale_llama_servers(&m.ai_dir);
953968

954-
// Find an available port
955-
let port = find_available_port().ok_or("No available port")?;
956-
log::debug!("AI server: starting llama-server on port {port} with context size {ctx_size}");
969+
let pid = spawn_llama_server(&m.ai_dir, model.filename, port, m.context_size)?;
970+
971+
// Track PID immediately — no race window where a process exists untracked
972+
m.child_pid = Some(pid);
973+
m.state.port = Some(port);
974+
m.state.pid = Some(pid);
975+
save_state(&m.ai_dir, &m.state);
957976

958-
// Spawn the server process
959-
let pid = spawn_llama_server(&ai_dir, model.filename, port, ctx_size)?;
977+
Ok((pid, port))
978+
}
960979

961-
// Brief pause to let the process initialize, then check if it's still alive
980+
/// Waits for the server to become healthy (polls every 500ms, up to 60s).
981+
/// On failure, kills the process and clears state.
982+
async fn wait_for_server_health(ai_dir: &Path, pid: u32, port: u16) -> Result<(), String> {
983+
// Brief pause to let the process initialize
962984
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
963985
if !is_process_alive(pid) {
964-
let last_lines = read_log_tail(&ai_dir, 20);
986+
cleanup_failed_server(pid);
987+
let last_lines = read_log_tail(ai_dir, 20);
965988
log::error!("AI server: process died immediately. Last log lines:\n{last_lines}");
966989
let log_path = ai_dir.join(SERVER_LOG_FILENAME);
967990
return Err(format!("llama-server crashed on startup. Check log at: {log_path:?}"));
968991
}
969992

970-
// Update state
971-
{
972-
let mut manager = MANAGER.lock_ignore_poison();
973-
if let Some(ref mut m) = *manager {
974-
m.state.port = Some(port);
975-
m.state.pid = Some(pid);
976-
m.child_pid = Some(pid);
977-
save_state(&m.ai_dir, &m.state);
978-
}
979-
}
980-
981-
// Wait for health check (poll every 500ms, up to 60s)
982993
log::debug!("AI server: waiting for health check on port {port}...");
983994
for i in 0..120 {
984995
tokio::time::sleep(std::time::Duration::from_millis(500)).await;
985996

986-
// Check if process is still alive
987997
if !is_process_alive(pid) {
988-
let last_lines = read_log_tail(&ai_dir, 20);
998+
cleanup_failed_server(pid);
999+
let last_lines = read_log_tail(ai_dir, 20);
9891000
log::error!("AI server: process died during startup. Last log lines:\n{last_lines}");
9901001
return Err(format!("llama-server process (PID {pid}) died during startup"));
9911002
}
@@ -995,22 +1006,36 @@ async fn start_server_inner<R: Runtime>(app: &AppHandle<R>) -> Result<(), String
9951006
return Ok(());
9961007
}
9971008

998-
// Log progress every 5 seconds
9991009
if i % 10 == 9 {
10001010
log::debug!("AI server: still waiting for health check ({}s)...", (i + 1) / 2);
1001-
if let Some((line_count, last_line)) = log_diagnostics(&ai_dir) {
1011+
if let Some((line_count, last_line)) = log_diagnostics(ai_dir) {
10021012
log::debug!("AI server: log has {line_count} lines, last: {last_line}");
10031013
}
10041014
}
10051015
}
10061016

1007-
// Timed out - read the log for diagnostics
1008-
let last_lines = read_log_tail(&ai_dir, 20);
1017+
// Timed out — kill the process instead of leaving it orphaned
1018+
cleanup_failed_server(pid);
1019+
let last_lines = read_log_tail(ai_dir, 20);
10091020
log::error!("AI server: health check timed out. Last log lines:\n{last_lines}");
1010-
10111021
Err(String::from("llama-server failed to become healthy within 60s"))
10121022
}
10131023

1024+
/// Kills a server process and clears its tracking state.
1025+
/// Only clears state if the tracked PID still matches (avoids clobbering a newer spawn).
1026+
fn cleanup_failed_server(pid: u32) {
1027+
kill_and_reap_in_background(pid);
1028+
let mut manager = MANAGER.lock_ignore_poison();
1029+
if let Some(ref mut m) = *manager
1030+
&& m.child_pid == Some(pid)
1031+
{
1032+
m.child_pid = None;
1033+
m.state.port = None;
1034+
m.state.pid = None;
1035+
save_state(&m.ai_dir, &m.state);
1036+
}
1037+
}
1038+
10141039
#[cfg(test)]
10151040
mod tests {
10161041
use super::*;

0 commit comments

Comments
 (0)