
Bug: Race Condition in _record_reliability_event Corrupts Metrics #198

@codec404

Summary: _record_reliability_event performs a read-modify-write on the reliability summary without holding a lock across the whole operation. It calls _load_reliability_summary() (reads from disk), modifies the in-memory dict, then calls _maybe_save_reliability_summary() (writes to disk). Two concurrent script executions can both read the same old summary and each apply its own result; the second write then silently overwrites the first, so one execution's metrics are lost entirely.
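To make the lost update concrete, here is a minimal, self-contained sketch of the interleaving described above. It is not the code from app.py: the in-memory `store`, the helper names, and the `threading.Barrier` that forces both threads to read before either writes are all assumptions made for illustration.

```python
import threading

# Hypothetical in-memory stand-in for the summary that app.py keeps on disk.
store = {"total_runs": 0}

# Used only to force the bad interleaving deterministically.
both_loaded = threading.Barrier(2)


def load_summary():
    # Each caller gets its own copy, like a fresh read from disk.
    return dict(store)


def save_summary(summary):
    # Overwrites whatever is stored, like a full rewrite of the file.
    store.update(summary)


def record_event_unsynchronized():
    summary = load_summary()
    both_loaded.wait()  # both threads have now read the same old summary
    summary["total_runs"] += 1
    save_summary(summary)


def run_two_concurrent_events():
    store["total_runs"] = 0
    threads = [threading.Thread(target=record_event_unsynchronized) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store["total_runs"]


if __name__ == "__main__":
    # Two events were recorded, but only one increment survives.
    print(run_two_concurrent_events())
```

In the real server the interleaving depends on timing rather than a barrier, so the loss would show up only intermittently.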

Location: app.py, lines 1493–1568 (_record_reliability_event function).

Steps to reproduce:

  1. Start the DevShell server with a multi-threaded WSGI server (or trigger two rapid concurrent script executions via /api/scripts/run).
  2. Execute two scripts simultaneously so both finish at nearly the same time.
  3. Check the reliability summary (GET /api/reliability/summary); observe that total_runs incremented by 1 instead of 2, or that one execution's failure/success is not reflected.

Expected: Every completed execution is atomically recorded - concurrent executions each contribute their outcome to the summary without overwriting each other.

Actual: Under concurrent load the read-modify-write cycle is unsynchronized, so one update can silently overwrite another, leaving total_runs, failures, and reliability_score incorrect.

Suggested fix: Wrap the entire read-modify-write body of _record_reliability_event inside the existing _reliability_cache_lock mutex (or introduce a dedicated write lock), so only one thread at a time performs the load → modify → save cycle.
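A minimal sketch of the suggested fix, under stated assumptions: the JSON file layout, the `total_runs` and `failures` keys as plain counters, the helper names, and the dedicated `_reliability_write_lock` are hypothetical stand-ins rather than the actual app.py code. The point is only that load, modify, and save all happen inside one `with` block.

```python
import json
import os
import tempfile
import threading

# Hypothetical dedicated lock; the issue suggests reusing _reliability_cache_lock instead.
_reliability_write_lock = threading.Lock()


def _load_summary(path):
    """Read the summary from disk, or start from zeroed counters."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"total_runs": 0, "failures": 0}


def _save_summary(path, summary):
    """Write to a temp file, then rename it, so readers never see a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(summary, fh)
    os.replace(tmp_path, path)


def record_reliability_event(path, success):
    """Load, modify, and save under one lock so no update is lost."""
    with _reliability_write_lock:
        summary = _load_summary(path)
        summary["total_runs"] += 1
        if not success:
            summary["failures"] += 1
        _save_summary(path, summary)


def demo(n_threads=8):
    """Record n_threads events concurrently and return the final summary."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reliability_summary.json")
        threads = [
            threading.Thread(target=record_reliability_event, args=(path, i % 2 == 0))
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return _load_summary(path)


if __name__ == "__main__":
    print(demo())
```

Two caveats if this pattern is applied to app.py. If _load_reliability_summary() or _maybe_save_reliability_summary() already acquire _reliability_cache_lock internally, wrapping the caller in that same lock would deadlock unless it is a `threading.RLock`; a dedicated write lock avoids that. A `threading` lock also only serializes threads within one process, so a multi-process WSGI deployment would additionally need a file lock or similar.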

@siddu-k Assign me this issue
