Conversation

@ndyakov ndyakov commented Oct 25, 2025

While working on #3559 I noticed a degradation in pool performance. This PR tries to reduce that degradation with:

  • a faster pool semaphore that uses atomics for counting (a counting semaphore)
  • a lock-free hook manager backed by an atomic pointer
  • predefined slices of states to reduce allocations in hot paths
  • inlining of state machine calls in hot paths
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Pool Get/Put speed | 403-931 ns/op | 205-305 ns/op | 47-67% faster |
| Allocations | 48 B/op, 4 allocs/op | 32 B/op, 2 allocs/op | 33% less memory, 50% fewer allocs |
| State machine | 26.36 ns/op | 15.32 ns/op | 42% faster |
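For context, a minimal sketch of what an atomics-based counting semaphore can look like. The names (FastSemaphore, TryAcquire, Release) follow the wording of this PR, but the actual code in internal/semaphore.go may well differ (for example, it likely also supports blocking waits with timeouts):

```go
package internal

import "sync/atomic"

// FastSemaphore is a counting semaphore backed by a single atomic counter.
// Acquire attempts use a CAS loop instead of sending on a buffered channel,
// which avoids channel/scheduler overhead on the hot path.
type FastSemaphore struct {
	count atomic.Int64 // number of currently held permits
	limit int64        // maximum number of permits
}

func NewFastSemaphore(limit int64) *FastSemaphore {
	return &FastSemaphore{limit: limit}
}

// TryAcquire takes a permit without blocking. It returns false if the
// semaphore is already at its limit.
func (s *FastSemaphore) TryAcquire() bool {
	for {
		cur := s.count.Load()
		if cur >= s.limit {
			return false
		}
		if s.count.CompareAndSwap(cur, cur+1) {
			return true
		}
	}
}

// Release returns a permit.
func (s *FastSemaphore) Release() {
	s.count.Add(-1)
}
```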

@ndyakov ndyakov self-assigned this Oct 25, 2025
@ndyakov ndyakov force-pushed the ndyakov/pool-performance branch 3 times, most recently from fce8679 to 3478f7a Compare October 25, 2025 11:37
…redefined state slices

- Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
- Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
- Add Clone() method to PoolHookManager for atomic updates
- Update AddPoolHook/RemovePoolHook to use copy-on-write pattern
- Update all hookManager access points to use atomic Load()

Performance improvements:
- Eliminates RWMutex contention in Get/Put/Remove hot paths
- Reduces allocations by reusing predefined state slices
- Lock-free reads allow better CPU cache utilization
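A rough sketch of the copy-on-write pattern this commit describes. The PoolHook interface is reduced to a single hypothetical OnGet method and the ConnPool fields are simplified, so this is illustrative rather than the actual internal/pool code:

```go
package pool

import (
	"sync"
	"sync/atomic"
)

// PoolHook is a simplified stand-in for the real hook interface.
type PoolHook interface {
	OnGet(connID uint64)
}

// PoolHookManager holds the registered hooks. Readers never take a lock:
// they load the current manager through an atomic.Pointer.
type PoolHookManager struct {
	hooks []PoolHook
}

// Clone returns a copy of the manager so writers can apply
// copy-on-write updates without disturbing concurrent readers.
func (m *PoolHookManager) Clone() *PoolHookManager {
	cp := &PoolHookManager{hooks: make([]PoolHook, len(m.hooks))}
	copy(cp.hooks, m.hooks)
	return cp
}

type ConnPool struct {
	hookManager atomic.Pointer[PoolHookManager]
	hooksMu     sync.Mutex // serializes writers only
}

// AddPoolHook installs a hook using copy-on-write: clone the current
// manager, mutate the clone, then atomically publish it.
func (p *ConnPool) AddPoolHook(h PoolHook) {
	p.hooksMu.Lock()
	defer p.hooksMu.Unlock()
	next := &PoolHookManager{}
	if cur := p.hookManager.Load(); cur != nil {
		next = cur.Clone()
	}
	next.hooks = append(next.hooks, h)
	p.hookManager.Store(next)
}

// Hot path: a single atomic load, no RWMutex.
func (p *ConnPool) notifyGet(connID uint64) {
	if m := p.hookManager.Load(); m != nil {
		for _, h := range m.hooks {
			h.OnGet(connID)
		}
	}
}
```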
@ndyakov ndyakov force-pushed the ndyakov/pool-performance branch from c914088 to 1b0168d Compare October 25, 2025 12:01
The state machine was calling notifyWaiters() on EVERY Get/Put operation,
which acquired a mutex even when no waiters were present (the common case).

Fix: Use atomic waiterCount to check for waiters BEFORE acquiring mutex.
This eliminates mutex contention in the hot path (Get/Put operations).

Implementation:
- Added atomic.Int32 waiterCount field to ConnStateMachine
- Increment when adding waiter, decrement when removing
- Check waiterCount atomically before acquiring mutex in notifyWaiters()

Performance impact:
- Before: mutex lock/unlock on every Get/Put (even with no waiters)
- After: lock-free atomic check, only acquire mutex if waiters exist
- Expected improvement: ~30-50% for Get/Put operations
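A simplified sketch of the pattern; the field and method names (waiterCount, notifyWaiters, ConnStateMachine) come from the commit message, while the internals are assumed:

```go
package pool

import (
	"sync"
	"sync/atomic"
)

// ConnStateMachine (simplified) tracks connection state and a set of
// waiters blocked until a particular state is reached.
type ConnStateMachine struct {
	state       atomic.Uint32
	waiterCount atomic.Int32 // number of goroutines currently waiting

	mu      sync.Mutex
	waiters []chan struct{}
}

// addWaiter registers a waiter and bumps the atomic counter.
func (sm *ConnStateMachine) addWaiter(ch chan struct{}) {
	sm.mu.Lock()
	sm.waiters = append(sm.waiters, ch)
	sm.mu.Unlock()
	sm.waiterCount.Add(1)
}

// notifyWaiters wakes blocked goroutines. The atomic check means the
// common case (no waiters) never touches the mutex.
func (sm *ConnStateMachine) notifyWaiters() {
	if sm.waiterCount.Load() == 0 {
		return // hot path: lock-free
	}
	sm.mu.Lock()
	defer sm.mu.Unlock()
	for _, ch := range sm.waiters {
		close(ch)
	}
	sm.waiters = sm.waiters[:0]
	sm.waiterCount.Store(0)
}
```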
@ndyakov ndyakov force-pushed the ndyakov/pool-performance branch from 079c217 to 78bcfbb Compare October 25, 2025 16:29
…ot path

The pool was creating new slice literals on EVERY Get/Put operation:
- popIdle(): []ConnState{StateCreated, StateIdle}
- putConn(): []ConnState{StateInUse}
- CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
- MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}

These allocations were happening millions of times per second in the hot path.

Fix: Use predefined global slices defined in conn_state.go:
- validFromInUse
- validFromCreatedOrIdle
- validFromCreatedInUseOrIdle

Performance impact:
- Before: 4 slice allocations per Get/Put cycle
- After: 0 allocations (use predefined slices)
- Expected improvement: ~30-40% reduction in allocations and GC pressure
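An illustrative sketch: the slice names come from the commit message, while the ConnState values shown are assumed. Declaring the slices once at package level removes the per-call []ConnState literal allocations:

```go
package pool

// ConnState is the connection lifecycle state used by the state machine.
type ConnState uint32

const (
	StateCreated ConnState = iota
	StateIdle
	StateInUse
	StateUnusable
)

// Predefined valid-source-state slices. Hot-path calls such as
// sm.TryTransition(validFromCreatedOrIdle, StateInUse) can reuse these
// instead of allocating a fresh slice literal on every Get/Put.
var (
	validFromInUse              = []ConnState{StateInUse}
	validFromCreatedOrIdle      = []ConnState{StateCreated, StateIdle}
	validFromCreatedInUseOrIdle = []ConnState{StateCreated, StateInUse, StateIdle}
)
```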
Further optimize the hot path by:
1. Remove redundant GetState() call in the loop
2. Only check waiterCount after successful CAS (not before loop)
3. Inline the waiterCount check to avoid notifyWaiters() call overhead

This reduces atomic operations from 4-5 per Get/Put to 2-3:
- Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
- After: CAS + waiterCount.Load() (only if CAS succeeds)

Performance impact:
- Eliminates 1-2 atomic operations per Get/Put
- Expected improvement: ~10-15% for Get/Put operations
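A sketch of what the trimmed-down transition might look like, reusing the simplified ConnStateMachine and ConnState types sketched above (the real method presumably reports an error on failure, which the later fast-path commit avoids allocating):

```go
// TryTransition CASes directly over the allowed source states; there is
// no separate GetState() load, and waiters are only checked after a
// successful CAS.
func (sm *ConnStateMachine) TryTransition(validFrom []ConnState, target ConnState) bool {
	for _, from := range validFrom {
		// Single CAS per candidate source state.
		if sm.state.CompareAndSwap(uint32(from), uint32(target)) {
			// Inlined waiter check: only pay for notification when
			// someone is actually waiting.
			if sm.waiterCount.Load() > 0 {
				sm.notifyWaiters()
			}
			return true
		}
	}
	return false
}
```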
Introduced TryTransitionFast() for the hot path (Get/Put operations):
- Single CAS operation (same as master's atomic bool)
- No waiter notification overhead
- No loop through valid states
- No error allocation

Hot path flow:
1. popIdle(): Try IDLE → IN_USE (fast), fallback to CREATED → IN_USE
2. putConn(): Try IN_USE → IDLE (fast)

This matches master's performance while preserving state machine for:
- Background operations (handoff/reauth use UNUSABLE state)
- State validation (TryTransition still available)
- Waiter notification (AwaitAndTransition for blocking)

Performance comparison per Get/Put cycle:
- Master: 2 atomic CAS operations
- State machine (before): 5 atomic operations (2.5x slower)
- State machine (after): 2 atomic CAS operations (same as master!)

Expected improvement: Restore to baseline ~11,373 ops/sec
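Continuing the same sketch, TryTransitionFast reduces to a single CAS; the usage comments mirror the popIdle/putConn flow described above and are illustrative only:

```go
// TryTransitionFast is the single-CAS hot-path variant: one expected
// source state, one target, no loop, no waiter notification, and no
// error allocation.
func (sm *ConnStateMachine) TryTransitionFast(from, to ConnState) bool {
	return sm.state.CompareAndSwap(uint32(from), uint32(to))
}

// Sketch of the hot-path flow described in the commit message:
//
//	// popIdle: IDLE → IN_USE (fast), falling back to CREATED → IN_USE
//	if !sm.TryTransitionFast(StateIdle, StateInUse) &&
//		!sm.TryTransitionFast(StateCreated, StateInUse) {
//		// connection is not available for checkout
//	}
//
//	// putConn: IN_USE → IDLE
//	_ = sm.TryTransitionFast(StateInUse, StateIdle)
```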
@ndyakov ndyakov changed the base branch from master to ndyakov/state-machine-conn October 25, 2025 18:21
@ndyakov ndyakov marked this pull request as ready for review October 25, 2025 21:39
Copilot AI left a comment

Pull Request Overview

This PR aims to improve pool performance by replacing synchronization primitives with more efficient atomic-based implementations and optimizing hot path operations through inlining and allocation reduction.

Key Changes:

  • Introduced FastSemaphore using atomic operations instead of channel-based semaphores for connection limiting
  • Replaced mutex-protected hook manager with lock-free atomic pointer operations
  • Added predefined state slices and inline methods to reduce allocations in critical paths

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/semaphore.go | New FastSemaphore implementation using atomics for high-performance connection limiting |
| internal/pool/pool.go | Replaced channel-based queue with FastSemaphore and converted hook manager to atomic pointer |
| internal/pool/hooks_test.go | Updated tests to use atomic loads for hook manager access |
| internal/pool/hooks.go | Added Clone() method to support lock-free hook manager updates |
| internal/pool/export_test.go | Updated test helper to use semaphore instead of channel queue |
| internal/pool/conn_state.go | Added predefined state slices, optimized waiter tracking with atomic counter, and fast transition method |
| internal/pool/conn.go | Added inline TryAcquire() and Release() methods with direct state machine access for hot paths |
| internal/auth/streaming/pool_hook.go | Migrated from channel-based semaphore to FastSemaphore for re-auth worker limiting |
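Based on the conn.go description above, a hypothetical sketch of those inline helpers, reusing the state machine types sketched earlier in this thread; the real methods in conn.go may look different:

```go
// Conn (simplified) embeds the state machine and exposes thin wrappers
// that the pool can call directly on the hot path.
type Conn struct {
	stateMachine ConnStateMachine
}

// TryAcquire marks the connection as in use via the fast CAS path.
func (cn *Conn) TryAcquire() bool {
	return cn.stateMachine.TryTransitionFast(StateIdle, StateInUse) ||
		cn.stateMachine.TryTransitionFast(StateCreated, StateInUse)
}

// Release marks the connection as idle again.
func (cn *Conn) Release() bool {
	return cn.stateMachine.TryTransitionFast(StateInUse, StateIdle)
}
```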


ndyakov and others added 4 commits October 26, 2025 00:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ndyakov ndyakov force-pushed the ndyakov/pool-performance branch from e825db8 to da5fe33 Compare October 27, 2025 06:05
@ndyakov ndyakov force-pushed the ndyakov/pool-performance branch from 6928e52 to 471a828 Compare October 27, 2025 06:21
@ndyakov ndyakov changed the title from "(wip) pool performance" to "fix(pool): pool performance" Oct 27, 2025
@ndyakov ndyakov merged commit 080a33c into ndyakov/state-machine-conn Oct 27, 2025
34 checks passed
@ndyakov ndyakov deleted the ndyakov/pool-performance branch October 27, 2025 13:06