Skip to content

Conversation

@ovaistariq
Copy link
Contributor

@ovaistariq ovaistariq commented Sep 27, 2025

This fixes issue #37 where tigrisfs would deadlock during concurrent directory operations, particularly when listing directories.

The deadlock occurred because:

  1. listObjectsFlat() held dh.mu while calling sealDir()
  2. sealDir() -> removeExpired() -> removeChildUnlocked() tried to lock ALL directory handles mutexes
  3. Other threads could hold different directory handle mutexes while waiting for dh.mu, creating a circular lock dependency

The solution uses a generation counter approach:

  • Each directory maintains an atomic generation counter
  • The counter increments on structural changes (add/remove children)
  • Directory handles check the generation before operations
  • If generation changed, handles reset their position without locking

Note

Introduce generation counters for directories and handles to invalidate positions on structural changes, removing cross-handle locking and preventing deadlocks during listing.

  • Directory generation tracking:
    • Add DirInodeData.generation and DirHandle.generation; initialize from atomic.LoadUint64(&inode.dir.generation) in NewDirHandle, FUSE OpenDir, and on seek/reset paths.
    • Increment generation on structural updates: sealDir(), insertChildUnlocked(), removeChildUnlocked(), removeAllChildrenUnlocked().
    • DirHandle.checkDirPosition() resets position when generation changes; readDir/seek paths reload generation at offset 0.
  • Deadlock avoidance:
    • Remove per-handle mutex resets during child insert/remove; handles self-invalidate via generation.
    • In listing: release dh.mu before sealDir() and refresh handle generation after sealing to avoid lock cycles.
  • FUSE integration:
    • When creating remote DirHandles, store initial generation alongside inode in dirHandles.

Written by Cursor Bugbot for commit 2d6df29. This will update automatically on new commits. Configure here.

…invalidation

This fixes issue #37 where tigrisfs would deadlock during concurrent directory
operations, particularly when listing directories.

The deadlock occurred because:
1. listObjectsFlat() held dh.mu while calling sealDir()
2. sealDir() -> removeExpired() -> removeChildUnlocked() tried to lock ALL
   directory handles' mutexes
3. Other threads could hold different directory handle mutexes while waiting
   for dh.mu, creating a circular lock dependency

The solution uses a generation counter approach:
- Each directory maintains an atomic generation counter
- The counter increments on structural changes (add/remove children)
- Directory handles check the generation before operations
- If generation changed, handles reset their position without locking
Copilot AI review requested due to automatic review settings September 27, 2025 01:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a deadlock issue in concurrent directory operations by implementing a generation-based invalidation system. The deadlock occurred when listObjectsFlat() held a directory handle mutex while calling sealDir(), which attempted to lock all directory handle mutexes, creating circular dependencies with other threads.

  • Replace direct mutex locking of directory handles with atomic generation counters
  • Add generation tracking to both DirInodeData and DirHandle structures
  • Implement lock-free invalidation mechanism for directory handle position tracking

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR implements a generation-based invalidation mechanism to resolve deadlocks in concurrent directory operations, specifically fixing issue #37 where ls operations would hang indefinitely.

Key Changes:

  • Added generation counters to DirInodeData and DirHandle structures to track directory structure changes
  • Replaced direct handle locking with atomic generation counter increments during structural changes
  • Modified listObjectsFlat() to release locks before calling sealDir(), preventing circular lock dependencies
  • Directory handles now check generation counters to detect stale state and reset positions without cross-handle locking

Problem Solved:
The original deadlock occurred when listObjectsFlat() held dh.mu while calling sealDir(), which tried to lock all directory handle mutexes. If other threads held different directory handle mutexes while waiting for dh.mu, this created a circular dependency. The generation-based approach eliminates this by making handle invalidation lockless.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The fix correctly addresses the root cause of the deadlock using a well-designed generation-based approach. One minor race condition was identified but doesn't affect correctness, only efficiency. The solution is elegant and maintains POSIX compliance for concurrent readdir operations.
  • No files require special attention beyond the minor race condition fix suggested

Important Files Changed

File Analysis

Filename        Score        Overview
core/dir.go 4/5 Implements generation-based invalidation to fix deadlock in directory operations. Clean solution with one minor race condition.

Now we read the actual generation value after sealDir() returns to ensure
we have the correct current generation.
cursor[bot]

This comment was marked as outdated.

@ovaistariq
Copy link
Contributor Author

@greptileai

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR successfully addresses the critical deadlock issue (#37) in directory operations by implementing a generation counter approach instead of explicit mutex coordination between directory handles.

Key Changes:

  • Added atomic generation counters to DirInodeData and DirHandle structures to track structural changes
  • Replaced the problematic lock coordination pattern in listObjectsFlat() with generation-based invalidation
  • Modified all structural change operations (sealDir, insertChildUnlocked, removeChildUnlocked, removeAllChildrenUnlocked) to increment the generation counter
  • Implemented self-validating directory handles that reset their position when generation changes are detected

Issues Found:

  • Race conditions exist in generation synchronization logic, particularly in listObjectsFlat() and checkDirPosition()
  • Missing initialization of generation field in NewDirHandle() constructor
  • Potential for missed generation updates due to timing issues in unlock/lock patterns

Recommendation:
The approach is sound and addresses the deadlock correctly, but the implementation needs refinement to handle race conditions properly before merging.

Confidence Score: 3/5

  • This PR resolves a critical deadlock but introduces race conditions that need fixing
  • The generation counter approach correctly addresses the deadlock by avoiding circular locking, but the implementation has race conditions in generation synchronization and missing initialization that could lead to inconsistent state
  • Pay close attention to core/dir.go for race conditions in generation handling

Important Files Changed

File Analysis

Filename        Score        Overview
core/dir.go 3/5 Implements generation-based deadlock resolution but has race conditions in generation synchronization and missing initialization

…validations

The tests were failing because new DirHandles started with generation=0,
but the directory might already have a non-zero generation from prior
modifications. This caused the handle to immediately think it was out
of date and reset its position, leading to missing entries in readdir.

Changes:
- Initialize generation when creating new DirHandles
- Reset generation to current when seeking to offset 0
- Ensure consistent generation tracking across all handle creation paths

This fixes the test failures where directory listings were returning
incomplete results.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ovaistariq
Copy link
Contributor Author

@greptileai

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR implements a generation-based invalidation system to resolve deadlocks in concurrent directory operations. The approach replaces cross-handle locking with atomic generation counters that track structural changes, allowing directory handles to self-invalidate without creating lock dependencies.

Key changes:

  • Add atomic generation counter to DirInodeData and DirHandle structs
  • Increment generation on structural changes (sealDir, insertChildUnlocked, removeChildUnlocked)
  • Directory handles check generation in checkDirPosition() and reset position if changed
  • Critical deadlock fix in listObjectsFlat(): release dh.mu before calling sealDir()

Issues found:

  • Generation not incremented in all insertChildUnlocked code paths (empty directory and append cases)
  • Race condition in listObjectsFlat() where generation could be outdated after reload
  • Missing generation increments could cause directory handles to miss structural changes

Confidence Score: 2/5

  • This PR has critical logic bugs that need resolution before merge
  • While the generation-based approach is sound for resolving the deadlock, there are critical bugs in the generation increment logic that could cause directory handles to miss structural changes. The missing generation increments in insertChildUnlocked and the race condition in listObjectsFlat need to be fixed to ensure correct behavior.
  • core/dir.go requires careful attention due to generation increment bugs and race conditions

Important Files Changed

File Analysis

Filename        Score        Overview
core/dir.go 2/5 Implements generation-based invalidation for directory operations, but has critical bugs in generation increment logic for insertChildUnlocked and race conditions in listObjectsFlat
core/cluster_fs.go 5/5 Simple addition to initialize directory handle generation when offset is reset to 0
core/cluster_fs_fuse.go 5/5 Correctly initializes DirHandle with current generation when opening directory remotely via FUSE

Move the generation load to immediately after sealDir() returns but
before reacquiring dh.mu. This ensures we capture the generation at
the point when sealDir completes, avoiding a race where another thread
could call sealDir() between our unlock and the generation load.

The sequence is now:
1. Unlock dh.mu
2. Call sealDir()
3. Load generation (while dh.mu is still unlocked)
4. Reacquire dh.mu
5. Update dh.generation

This ensures we get the most accurate generation value while still
avoiding the deadlock.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ovaistariq
Copy link
Contributor Author

@greptileai

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR successfully addresses the deadlock issue in directory operations by introducing a generation-based invalidation system. The solution replaces the problematic cross-handle mutex locking with atomic generation counters that allow directory handles to self-invalidate when structural changes occur.

Key Changes:

  • Added generation field to DirInodeData and DirHandle structures
  • Implemented atomic generation increments on structural changes (sealDir, insertChildUnlocked, removeChildUnlocked, removeAllChildrenUnlocked)
  • Replaced direct handle invalidation loops with generation-based self-checking in checkDirPosition()
  • Modified listObjectsFlat() to unlock dh.mu before calling sealDir() to prevent deadlock cycles

Deadlock Resolution:
The original deadlock occurred when listObjectsFlat() held dh.mu while calling sealDir(), which in turn tried to lock all directory handle mutexes through removeExpired(). This created circular lock dependencies when other threads held different handle mutexes while waiting for dh.mu. The new approach eliminates this cross-handle locking entirely.

Race Condition:
There's a minor race condition in listObjectsFlat() where the generation could change between unlocking dh.mu and loading the current generation. While this is acknowledged as "benign" in comments, it could cause a handle to miss an invalidation signal in rare timing scenarios.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it fixes a critical deadlock while introducing only minor race conditions
  • Score reflects successful deadlock resolution with well-designed generation-based approach, offset by a manageable race condition that needs attention
  • Pay attention to the race condition in core/dir.go listObjectsFlat() unlock/lock pattern

Important Files Changed

File Analysis

Filename        Score        Overview
core/dir.go 4/5 Implements generation-based directory invalidation to fix deadlock; contains race condition in listObjectsFlat unlock/lock pattern

@ovaistariq ovaistariq merged commit b6eba91 into main Sep 27, 2025
7 checks passed
@ovaistariq ovaistariq deleted the fix-deadlock-with-generation branch September 27, 2025 17:02
ovaistariq added a commit that referenced this pull request Sep 29, 2025
This fixes potential deadlocks in the loadListing() function where
unlock/relock patterns could cause circular lock dependencies.

The deadlock could occur in two places:
1. When calling slurpOnce() at line 718-725
2. When calling listObjectsFlat() at line 737-742

Both cases released parent.mu (and dh.mu in case 1) while other
goroutines could be holding different locks and waiting, creating
circular dependencies.

The solution uses the generation counter approach from PR #39:
- Check generation before unlocking
- After relocking, detect if generation changed
- If changed, reset directory handle position

This ensures proper synchronization without risking deadlocks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
ovaistariq added a commit that referenced this pull request Sep 29, 2025
* fix: Prevent deadlock in loadListing unlock/relock pattern

This fixes potential deadlocks in the loadListing() function where
unlock/relock patterns could cause circular lock dependencies.

The deadlock could occur in two places:
1. When calling slurpOnce() at line 718-725
2. When calling listObjectsFlat() at line 737-742

Both cases released parent.mu (and dh.mu in case 1) while other
goroutines could be holding different locks and waiting, creating
circular dependencies.

The solution uses the generation counter approach from PR #39:
- Check generation before unlocking
- After relocking, detect if generation changed
- If changed, reset directory handle position

This ensures proper synchronization without risking deadlocks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants