Skip to content

fix(session): prevent concurrent commit re-committing old messages#783

Open
deepakdevp wants to merge 3 commits intovolcengine:mainfrom
deepakdevp:fix/session-commit-race
Open

fix(session): prevent concurrent commit re-committing old messages#783
deepakdevp wants to merge 3 commits intovolcengine:mainfrom
deepakdevp:fix/session-commit-race

Conversation

@deepakdevp
Copy link
Contributor

Summary

  • Adds an asyncio.Lock to Session to serialize concurrent commit_async() calls
  • Reorders Phase 1 of commit_async() to clear live messages before the slow LLM summary generation, closing the race window where a second commit could see stale data
  • Includes rollback logic if the file-clear fails, so messages aren't lost

Fixes #580

Root Cause

commit_async() had no synchronization. When called concurrently, both calls would copy the same self._messages, generate separate archives, and trigger duplicate memory extraction. The race window spanned the entire LLM summary generation (seconds), during which the live messages.jsonl still contained the old messages.

Changes Made

  • openviking/session/session.py:
    • Added self._commit_lock = asyncio.Lock() to Session.__init__
    • Wrapped Phase 1 (copy + clear + file write) in async with self._commit_lock
    • Moved empty-check inside the lock to prevent both callers from passing it
    • Lock released before slow operations (LLM summary, memory extraction)
    • Added rollback: if file-clear fails, messages are restored from the copy
  • tests/session/test_session_commit_race.py (new): 2 tests
    • Concurrent commits produce exactly 1 archive (dedup)
    • Messages added during commit are not lost

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • 2 new race condition tests
  • Ruff lint + format clean
  • Existing commit tests unaffected (pre-existing VLM config requirement)

🤖 Generated with Claude Code

deepakdevp and others added 2 commits March 19, 2026 22:27
commit_async() now acquires an asyncio.Lock during Phase 1 (copy +
clear + file write). This prevents concurrent commits from re-committing
the same messages. The lock is released before the slow LLM summary and
memory extraction, so it doesn't block other operations.

The phase order is changed: live messages are cleared BEFORE the archive
summary is generated, closing the race window where a second commit
could see stale data. If the file-clear fails, messages are rolled back.

Fixes volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify that:
- Two concurrent commit_async() calls on the same session produce
  exactly one archive (the other returns early)
- Messages added while a commit is running are preserved in the
  session and not lost or re-committed

Part of fix for volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qin-ctx
Copy link
Collaborator

qin-ctx commented Mar 20, 2026

Thanks for the contribution! The race condition analysis is spot-on — there is indeed a window during commit_async() where concurrent calls can re-commit the same messages. However, we'd like to suggest a different approach for the locking mechanism.

Existing protections

The session commit path already has several layers of protection:

  1. HTTP API layer — TaskTracker (server/routers/sessions.py): create_if_no_running("session_commit", session_id) atomically rejects duplicate background commits for the same session at the API entry point.
  2. Phase 2 crash recovery — RedoLog: On restart, LockManager._recover_pending_redo() replays incomplete memory extraction tasks.

What's missing is exactly what you identified: Phase 1 atomicity — the gap between copying messages and clearing them, with a slow LLM summary call in between.

Why asyncio.Lock() is not the right fit here

The asyncio.Lock you added is an in-process, in-memory lock. It only protects concurrent calls within the same Python process on the same Session object. It won't help when:

  • The HTTP server runs multiple workers (each has its own process/memory)
  • Multiple service instances handle the same session
  • Direct Python client usage across threads

Suggested approach: use existing PathLock

We already have a filesystem-based distributed lock infrastructure in openviking/storage/transaction/PathLock + LockManager + LockContext. It:

  • Locks via .path.ovlock files on AGFS — works across processes and instances
  • Has fencing tokens, stale lock cleanup, and timeout support
  • Is already used elsewhere in the codebase for write coordination

The session has self._session_uri which maps to an AGFS path, so it naturally fits into this locking scheme. The Phase 1 critical section would look roughly like:

from openviking.storage.transaction import LockContext, get_lock_manager

session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
async with LockContext(get_lock_manager(), [session_path], "point"):
    if not self._messages:
        return result
    self._compression.compression_index += 1
    messages_to_archive = self._messages.copy()
    self._messages.clear()
    await self._write_to_agfs_async(messages=[])
# Lock released — slow LLM summary + archive write proceeds without holding the lock

What we do agree on

The reordering of Phase 1 operations (copy + clear before the LLM summary call) is the right call — it closes the race window regardless of lock implementation. The rollback logic on _write_to_agfs_async failure is also a good idea.

Would you be open to reworking this PR to use PathLock instead of asyncio.Lock? Happy to help if you have questions about the transaction module.

Replaces the in-process asyncio.Lock with the existing PathLock
(distributed filesystem lock via LockContext) for Phase 1 of
commit_async(). This ensures commit serialization works across
multiple HTTP workers and service instances, not just within a
single Python process.

Addresses review feedback from qin-ctx on PR volcengine#783.
@deepakdevp
Copy link
Contributor Author

Thanks @qin-ctx for the thorough review and the suggestion! You're absolutely right that asyncio.Lock is insufficient for multi-worker deployments.

I've replaced it with LockContext(get_lock_manager(), [session_path], lock_mode="point") — same pattern used in the content router and resource processor. The Phase 1 reordering and rollback logic remain unchanged.

Changes in the latest push:

  • Removed import asyncio and self._commit_lock = asyncio.Lock()
  • Phase 1 now uses async with LockContext(...) with the session's AGFS path
  • Docstring updated to reflect PathLock

Please take another look when you get a chance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Bug: async session commit can re-commit old messages before live session switch completes

2 participants