Skip to content

fix(http): log upload byte size on complete#24

Merged
webbertakken merged 53 commits into
mainfrom
maintenance
Jun 1, 2026
Merged

fix(http): log upload byte size on complete#24
webbertakken merged 53 commits into
mainfrom
maintenance

Conversation

@webbertakken
Copy link
Copy Markdown
Owner

  • fix(http): log upload byte size on complete
  • Rolls up the maintenance branch (53 commits): logging, reliability, and doc upkeep across ws/http/engine/ui surfaces

All quality gates verified locally: fmt, clippy (-D warnings), cargo test, cargo audit.

poll_existing dropped config::save errors with let _ = on all
three studio responses (404, approved, rejected), so a read-only
or full disk left no operator-visible breadcrumb.

- approved credentials save failure now logs at error (a Sentry
  event via sentry-tracing): the session runs in memory but the
  worker silently re-registers on the next restart otherwise.
- 404 + rejected cleanup save failures log at warn with the
  config path, matching the existing pattern in this module.
- regression tests force a save failure (config path under a
  regular file) and assert the breadcrumb fires without leaking
  the auth token.
- WS client emitted zero tracing despite being the only comms channel
- add structured breadcrumbs on target `studio_worker::ws::client`
  (op, worker_id, elapsed_ms; debug on success, warn on fault),
  mirroring the `studio_worker::http` contract
- surfaces genuine silent failures: session maps recv errors to a
  generic Disconnected(_) and fires `let _ = sender.send(...)` for
  accept/reject/fail/completeJson
- extract `classify_incoming` so split + non-split recv loops share
  one logging site and stay identical; behaviour preserved
- cover with capture()-based unit tests; doc target list updated
- AGENTS.md/README/overview no longer present GradioEngine or the `gradio` engine; document sd-cpp + MultiEngine instead
- README config example now matches the Config struct: drop engine, gradio_endpoint_url, label, auto_enabled, supported_models_override; add models_root
- remove dead tests/gradio_engine.rs reference; Cargo keyword gradio -> inference
- keep intentional 'legacy gone' notes (overview 383/517) and historical plan records
- Windows minted the install id + 256-bit registration_secret from a
  SHA-256-mixed timestamp, not the OS RNG: the secret was predictable.
- The doc comment claimed BCryptGenRandom on Windows; it was never
  implemented (only unix read /dev/urandom).
- Replace the per-OS hand-rolled RNG + weak fallback with one
  getrandom::fill call (getrandom 0.3, already transitively in tree).
- Panic loudly if the OS entropy source is unavailable instead of
  minting a guessable secret; Sentry captures it.
- Add entropy-primitive contract tests (distinctness + bit coverage)
  that now cover the formerly-untested Windows path.
- Promote src/ui/autostart.rs to top-level src/autostart.rs (always
  compiled, like service.rs) so the deferred disk-write round-trip is
  testable in the default CI matrix without the ui feature or HOME
  mutation.
- enable/disable now emit structured tracing on target
  studio_worker::autostart (info ok / warn fail), matching http.rs +
  service.rs; a failed toggle is no longer silently swallowed.
- Config tab surfaces the failure in the UI instead of let _ = ...
- Drop the module-doc claim of a RegistryWriter trait that was never
  built; document the honest Windows marker-file status.
- Add disk-write round-trip + tracing-emission tests.
- The video cargo-feature engine emits ext=gif, but complete()'s
  inline MIME match only knew png/webp/wav/mp3/mp4, so real GIFs were
  uploaded as application/octet-stream (wrong stored content-type).
- Extract a single-source-of-truth mime_for_ext() and add the missing
  gif => image/gif arm; complete() now routes through it.
- Add fast unit tests (none existed for the mapping); one locks the
  contract that every engine-emitted ext maps to a real MIME type.
- config::save logged a debug breadcrumb on success but emitted
  nothing on failure, relying on each caller to log the Err. The UI
  Save button discards it (let _ = draft.save(...)), so a failed
  persist was invisible in journalctl / Sentry.
- Emit a warn! on the failure branch (path + error only, no secret
  fields) so every save failure is surfaced regardless of caller.
- Split the IO into write_config() so both branches log without
  duplicating the happy path.
- Add tracing-contract tests for the failure path incl. secret
  redaction (portable file-as-parent trigger, no /proc reliance).
- run_cli emits an INFO startup breadcrumb (target studio_worker::cli) naming AGENT_VERSION + the subcommand about to run
- with no SENTRY_DSN the worker logged nothing at launch; this anchors which version/subcommand a process is running, key right after an auto-update re-exec or service restart
- add Command::name() stable kebab-case labels, exhaustively unit-tested; breadcrumb proven via a capture()-based test
- Under `cargo test --features ui` the capture-based config_tracing
  tests intermittently saw empty buffers ("expected ... got: ").
- Root cause: a tracing callsite first evaluated by a parallel test in
  the narrow window around the one-time global-subscriber install can
  cache its Interest as `never` and then silently drop the captured
  event.  `try_init` rebuilds the cache once on install, which only
  helps the caller that wins the race.
- Fix: call `tracing::callsite::rebuild_interest_cache()` at the start
  of every `capture()` so each caller re-evaluates all callsites
  against the now-stable subscriber before emitting.  Idempotent + cheap.
- Verified: 2 consecutive `--features ui` runs (act) that previously
  flaked now pass clean; default suite green.
- Tray icon is the operator's at-a-glance health indicator, but the
  variant transitions (idle/busy/disconnected) were unlogged and the
  `set_icon`/`set_tooltip` updates were swallowed with `let _ =`, so a
  stale indicator left the operator with zero diagnostics.
- app.rs: log idle/busy/disconnected transitions on
  studio_worker::ui::app and warn when a tray icon/tooltip update
  fails; log the hide-to-tray interception.
- mod.rs: log the open-window and quit tray-menu clicks (pause was
  already logged), completing the operator audit trail.
- Add a unit test for the transition-logging helper; verified under
  --features ui via act (clippy + check + test green).
- download_file renamed the .part into place with no length check,
  so a short read (reset/flaky CDN) cached a corrupt model that every
  later job failed to load with no retry - despite the doc claiming
  HEAD-checked length.
- Add pure verify_download_len() comparing streamed bytes to the
  response Content-Length; mismatch removes the .part and bails clearly
  so the next tick re-downloads. None (chunked) => accept, as before.
- Clean up the .part on a mid-stream io error (was leaked); drop the
  handle before remove/rename so cleanup works on Windows.
- Correct the stale module doc to match actual behaviour.
- worker only handled SIGINT; systemd/launchd stop sends SIGTERM, so
  the graceful stop path never fired under the documented deployment
- now waits on both SIGINT + SIGTERM (Ctrl-C off-Unix) and logs a
  named shutdown breadcrumb, mirroring the startup banner
- WS session now closes cleanly + last log batch flushes on stop
- TDD: request_shutdown unit-tested (flag + breadcrumb); signal wait
  excluded from coverage like other OS-signal code
- worker never recorded what it offered the studio; only config (startup banner) and per-job picks were logged, so 'why won't it claim X jobs' was unanswerable from the logs
- add pure runtime::summarize_capabilities(kinds/models/vram/auto_enabled), unit-tested
- log it via push_log_with_observers right before Hello, so it ships to the studio and shows in the UI Logs tab
- prove it end-to-end: new ws_session_full_loop handshake test asserts the summary lands in recent_logs
- Add path-injectable enable_at/disable_at/is_enabled_at seams,
  mirroring the existing write_entry/remove_entry design; public
  enable/disable/is_enabled now delegate (signatures unchanged).
- Cover the enable->is_enabled->disable round-trip, the unresolved-path
  branch, platform artefact selection, and the Linux XDG path layout
  against a tempdir, with no env mutation or real-FS side effects.
- Lifts autostart.rs coverage: lines 76% -> 91%, functions 64% -> 84%.
- route the four best-effort `remove_file` cleanups (per-job sd-cli
  output, init image, half-written .part downloads) through a new
  `remove_temp_file` helper that warns on failure instead of silently
  swallowing the error
- a removal that keeps failing leaks temp files and can quietly fill a
  long-running worker's disk; now it shows up in the logs
- NotFound stays silent (already the desired end state)
- covers the helper with TDD: quiet success, ignored-missing, and a
  surfaced failure that names the path + cleanup op
- About tab "Check for updates" left no trace on success/error
- Emit structured INFO (up-to-date / newer) + WARN (failure)
- Stable target studio_worker::ui::about for journal/Sentry filtering
- Cover all three outcomes with capture-based tests
- JSON (LLM/STT) completion path now logs a breadcrumb on success,
  symmetric with the binary path's "binary upload ok"
- branch on the completeJson send result so a dropped frame records
  Failed instead of a false-positive Completed
- cover both with a full-loop test asserting the per-job breadcrumb
- dispatch_with_source bailed without a breadcrumb on the
  studio_worker::engine::multi target when the studio asked for an
  engine this binary lacks
- pick_for already warns on its no-match path; mirror that so every
  routing decision (selected or rejected) is filterable on one target
- add a tracing-capture regression test for the rejection breadcrumb
- RealRunner::download streamed the installer with no length check, so a
  truncated fetch (reset / flaky CDN) wrote a partial script that the next
  step hands straight to sh / powershell.
- Add pure verify_download_len() comparing streamed bytes to the response
  Content-Length; mismatch bails before run_installer (tempdir drop cleans
  the partial file). None (chunked) => accept, mirroring the sdcpp guard.
- Cover the previously-untested real download path end-to-end via wiremock
  (happy + truncation-rejection); update.rs line cov 85 -> 90%.
- emit structured breadcrumbs for DesktopNotifier: debug on success
  (previously silent), warn with an `error` field on failure
  (previously string-interpolated, off-convention)
- extract a testable `log_show_outcome` seam and add unit tests for
  both branches via the shared tracing capture helper
- correct the stale NotificationPrefs doc: the toggles are session
  only, not persisted to Config
- mp4 crate was added speculatively for video output but the video engine encodes GIF instead (no H.264 encoder shipped)
- never referenced in source; removes 4 crates from the resolve (mp4, num-bigint, num-integer, num-rational)
- shrinks build graph + supply-chain surface
- base64 was used by the old Gradio engine to decode response images; that engine was removed in the stable-diffusion.cpp migration
- no longer referenced in source; manifest now reflects actual direct deps
- base64 stays available transitively via reqwest, so the lockfile resolve is unchanged
- build_client_options was untested: lock that a valid DSN yields options
  carrying release / environment / server_name with perf tracing disabled.
- Lock the invalid-DSN path: a malformed SENTRY_DSN must warn and return
  None (init propagates it via ?), never panic the worker at startup.
- telemetry.rs line cov 90.8 -> 93.2%, region 86.5 -> 91.0%.
- surface dropped accept/reject offer-response sends at session level
- reaches UI Logs tab + studio-shipped logs with the job_id attached
- previously only the transport layer logged these, locally and
  without job context
- pure offer_response_breadcrumb helper, unit-tested for both outcomes
- surface dropped Fail-frame sends at session level (upload error,
  dispatch error, dispatch panic arms of run_offered_job)
- previously swallowed by let _ = sender.send(...): a dropped Fail
  left the studio holding the job until timeout with no local record
- reaches UI Logs tab + studio-shipped logs with the job_id attached
- pure fail_send_breadcrumb helper + record_fail_send glue, unit-tested
  for both outcomes
- a prior edit inserted remove_temp_file inside verify_download_len's
  doc comment, splitting it and leaving remove_temp_file documented as
  download verification
- move each doc block back above its own function so both read
  coherently again (no behaviour change)
- auto-updater slept a full auto_update_tick (60s) before
  re-checking stop, so run_loops' join blocked shutdown for up
  to a tick after SIGTERM/SIGINT, undoing graceful shutdown
- idle wait now re-polls stop every AUTO_UPDATE_SHUTDOWN_TICK
  (250ms) via the shared wait_with_stop helper
- consolidate wait_with_stop into runtime; ws::session reuses it
- tests: prompt-stop during idle wait + helper short-circuit
- Add dispatch_with_source to the Engine trait block
- Drop the removed enable/disable subcommands; register has no label flag
- MultiEngine routes strictly by ModelSource.engine (no kind fallback)
- video.rs is an animated-GIF stand-in (not ffmpeg); tts.rs is formant-synth (not Piper)
- Video TaskResult is webp/gif, never mp4
- load() read/parse failures now emit a source-level WARN breadcrumb (op=load, config_path), mirroring save(); previously only main()'s generic error surfaced them, lacking the structured studio_worker::config fields
- parse breadcrumb omits the toml span on purpose: it can echo a secret on the failing line (e.g. an unterminated auth_token)
- tests cover read-fail, parse-fail, and secret redaction
- build() now emits an info breadcrumb naming the backends this worker routes across (op=build, engine_count, engines=...) on the studio_worker::engine target
- previously the roster was implicit: operators could not tell from the logs whether e.g. the sdcpp backend self-registered or was skipped for a missing sd-cli
- roster logging split into a pure helper, unit-tested for shape; the build-level test stays environment-tolerant (no count pin)
- replace a dead empty if-block (computed values, did nothing) that
  silently ignored prompt + max_tokens overflowing the context window
- emit a structured warn so a truncated long chat is diagnosable from
  the logs; we don't trim the operator's prompt (no behaviour change)
- compare against n_ctx (the KV-cache size), not n_batch, so the guard
  fires on real overflow rather than crying wolf on normal prompts
- pure exceeds_context_window helper, unit-tested at the boundary
- config.toml holds worker identity + secrets (auth_token,
  registration_secret) but used a plain fs::write
- non-atomic: a crash/power-loss/full-disk mid-write truncated
  the file, wiping registration and forcing re-approval
- world-readable: fs::write honours umask (typically 0644),
  exposing secrets to other local users
- now stream to a same-dir temp file, fsync, then rename over
  the target; tempfile gives 0600 and persist keeps it
- tests: owner-only mode (unix) + atomic replace without litter
- Status-tab Pause/Resume button flipped the flag silently; the tray
  menu toggle already logs. Now both paths emit a structured breadcrumb.
- Extract toggle_pause() so the logging is unit-testable without egui.
- Add a both-directions test (target/message/paused field, flag flip).
- add missing src modules: cli, auto_register, telemetry, update, autostart, test_support
- expand tests/ list from 6 to all 22 files, grouped by area
- note real_* E2E tests are feature-gated and off on free-tier CI
- push_log breadcrumbs now carry job_id as a structured field so
  operators can pivot shipped studio logs / Sentry by job
- Option<&str> records the field only when present; jobless logs
  (startup, heartbeat, auto-update) stay free of an empty job_id
- covered by two capture-based tests (present + absent)
- Reserve the worker atomically before accepting an offer so a
  second concurrent offer is rejected instead of clobbering the
  in-flight job.
- Send Accept inside the job task and bail if it fails, releasing
  the busy flag.
- Report the actual current job id in heartbeats instead of a
  placeholder; thread SessionSchedule through LoopSchedule.
- Parse api_base_url, drop query/fragment, and strip a trailing
  /graphics/api so a configured base that already includes the API
  prefix no longer double-prefixes request paths.
- Validate the installer download URL before fetching; only https
  is allowed, with loopback http permitted for tests.
- Resolve model cache files through a guard that rejects nested,
  absolute, or ../ filenames, keeping downloads inside models_root.
- config tab render now returns whether a save succeeded; the live
  Arc<Mutex<Config>> is only updated on a successful save so unsaved
  draft edits never leak into the running loops.
- Debug builds position the window at the left monitor origin;
  release builds defer to the window manager (behaviour unchanged).
  STUDIO_WORKER_WINDOW_POS=x,y overrides in any build.
- macOS plist is gg.minis.studio-worker.plist and the Windows task
  template is minis-studio-worker.task.xml; sync overview with code.
- run cargo audit --deny warnings on Cargo manifest changes + weekly cron
- pin 11 accepted informational advisories in .cargo/audit.toml (ui GTK3 + build-macro transitive deps)
- document the workflow in AGENTS.md CI section
- drain_notifications compared ring.len() to a seen-count, so once
  the capped recent_jobs ring saturated (50 jobs) its length stopped
  changing and desktop notifications silently stopped for the rest of
  the session
- track the last-notified job by (job_id, finished_at) identity and
  walk newest-first until that marker, so new arrivals are detected
  regardless of ring saturation
- add regression tests: per-job, idempotent re-drain, post-saturation
- guard out + init-image scratch via RAII so every exit path removes them

- prevents temp-dir leak on sd-cli failure / unreadable output
- ImageParams gains maskUrl; sdcpp downloads the mask alongside the init image
  and passes sd-cli --mask so only the white region is repainted.
- Powers the Find the Differences studio pipeline (draw red shapes -> inpaint
  the differences variant). Unit-tested arg building (mask + no-mask).
- cover run_cli's Ui arm + the non-ui run_ui bail branch
- lifts lib.rs coverage from 88.6% to 100% lines
- sdcpp: a 'model' role file is a full checkpoint, loaded via sd-cli
  -m/--model; a 'diffusion-model' role stays on --diffusion-model (split
  weights). Lets the studio hand SDXL-inpaint as a single checkpoint.
- Unit-tested both flag paths.
- number_prefix + paste come from the image-candle feature (hf-hub/candle), not the ui GTK3 stack
- proc-macro-error is the ui stack's glib-macros build macro
- header + review guidance now cover both ui and image-candle trees so stale ignores can not hide
- The reconnect attempt counter never reset after a successful connect, so a
  long-lived worker was killed once it accumulated DEFAULT_RECONNECT_ATTEMPTS
  (5) lifetime disconnects — even healthy ones. Now only consecutive
  failures-to-connect count toward the cap; a welcomed session resets it.
- Fixes workers silently going stale after transient drops (e.g. local
  wrangler-dev WS recycles).
- Patch/minor bumps within existing constraints (cargo update)
- Includes openssl 0.10.80, hyper 1.10.1, http 1.4.1, reqwest 0.13.4
- egui/eframe 0.34.3, serde_json 1.0.150
- fmt, clippy, tests, cargo-audit all green
The session only learned of a drop from the reader, which blocks forever on a
half-open socket (post-job WS drops, dead peers). Two root-cause fixes so the
worker reconnects on its own instead of hanging until an external restart:

- Read-idle-timeout: the reader treats no-frames-within-N-seconds as a
  disconnect. The studio acks every heartbeat (~5s) so a live link always
  yields a frame; only a dead socket elapses it.
- Abort the heartbeat/log-shipper/observer pumps on teardown: they only broke
  on the global stop flag, so on a silent-but-open socket (sends still buffer)
  they looped forever and blocked run_one_session from returning -> reconnect.
- Connect timeout so a stalled upgrade can't hang the reconnect loop either.

Covered by a new silent-server integration test (was hanging, now recovers)
plus a connect-timeout unit test.
@webbertakken webbertakken changed the title Maintenance: studio-worker upkeep fix(http): log upload byte size on complete Jun 1, 2026
@webbertakken webbertakken merged commit 08846b1 into main Jun 1, 2026
19 of 20 checks passed
@webbertakken webbertakken deleted the maintenance branch June 1, 2026 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant