Skip to content

refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2392

Open
Bewinxed wants to merge 6 commits into
supabase:masterfrom
Bewinxed:bewinxed/authjs-lockless
Open

refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2392
Bewinxed wants to merge 6 commits into
supabase:masterfrom
Bewinxed:bewinxed/authjs-lockless

Conversation

@Bewinxed
Copy link
Copy Markdown
Contributor

@Bewinxed Bewinxed commented May 21, 2026

Description

What changed?

GoTrueClient now defaults to lockless coordination. The navigator.locks-based mutex no longer runs by default — most users get the lockless path. Callers who explicitly pass a custom lock (typically React Native processLock or Node multi-process setups) keep the old behavior on an opt-in legacy path so the change is backwards-compatible.

The lockless default uses:

  • refreshingDeferred (already existed) to single-flight in-instance concurrent refreshes.
  • A commit guard inside _callRefreshToken that snapshots storage before the rotated-token fetch and re-reads after, then discards the rotated tokens if a non-null snapshot was cleared (typical case: a concurrent signOut ran _removeSession). The discarded path returns { data: null, error: new AuthRefreshDiscardedError() }. _recoverAndRefresh recognises this error and skips its own _removeSession() call so no duplicate SIGNED_OUT event fires.
  • A new exported error class, AuthRefreshDiscardedError (with an isAuthRefreshDiscardedError type guard), is what the commit guard returns when it throws away a successfully-rotated token. Distinct from AuthRetryableFetchError (transient network) and AuthApiError (server rejection). Surfaces through refreshSession() and getSession() results.
  • A new client.auth.dispose() tears down the auto-refresh interval, the visibilitychange listener, the BroadcastChannel, and registered onAuthStateChange subscribers. Idempotent. Designed for React Strict Mode and HMR cleanup hooks. In-flight fetch calls are not aborted — they run to completion.
  • GoTrue handles cross-tab refresh races server-side via the v1 parent-of-active path at internal/tokens/service.go:376-385, so the client doesn't need to coordinate. The current default is v1 (RefreshTokenAlgorithmVersion = 0); v2 (counter-based) is opt-in via RefreshTokenUpgradePercentage and gated rollout. This PR's reasoning relies on v1 behaviour, which is what customers actually run today.

The legacy opt-in path (settings.lock != null):

  • _acquireLock, pendingInLock, lockAcquired, lockAcquireTimeout, and all 14 call-site wrappers are preserved verbatim from master and gated on this.lock != null. The default path is untouched by them.
  • Every gated site is marked // TODO(v3): remove legacy lock path so the eventual v3 cleanup is a mechanical search-and-delete. A separate Linear ticket on the supabase-js v3 project tracks that work.
  • The lock and lockAcquireTimeout constructor options are accepted and honored when supplied; both are @deprecated in JSDoc with "Custom locks still work in v2.x for backwards compatibility. Will be removed in v3."
  • navigatorLock, processLock, LockAcquireTimeoutError, NavigatorLockAcquireTimeoutError, ProcessLockAcquireTimeoutError, and internals in lib/locks.ts remain exported. Still @deprecated.

Stale JSDoc referencing the lock on getSession, onAuthStateChange, _useSession, _challengeAndVerify, and _listFactors now matches the new default behaviour. The @deprecated tag on the async onAuthStateChange overload is kept, with its reason updated to point at the one residual reentry hazard (refreshSession called from inside a TOKEN_REFRESHED handler — refreshingDeferred.resolve happens after _notifyAllSubscribers returns).

Why was this change needed?

The shared mutex has caused seven failure classes on the issue tracker for 18+ months. Each earlier patch fixed one symptom and created another:

  • #1962 configurable timeout
  • #2106 steal fallback
  • #2178 steal-cascade guard

#2235 (primitive swap to processLock) was an earlier attempt at moving off navigator.locks and is no longer being actively pursued. Community PRs #2016 and #2019 fixed individual symptoms and were closed waiting on a structural fix.

Failure classes resolved by this PR's default (lockless) path:

  1. Strict Mode / HMR orphan deadlocks. Nothing to orphan on the default path.
  2. iOS Safari INITIAL_SESSION / getSession() hangs with a persisted session (#936 @bugprone). The default path no longer touches navigator.locks.
  3. Sentry flooding with AbortError: Lock broken by another request with the 'steal' option (#2013 @cpannwitz). The { steal: true } fallback that produced these errors is bypassed on the default path.
  4. Subscriber re-entry deadlock (the one at the old GoTrueClient.ts:3824). The cycle needed the lock; a callback that awaits getUser() now just runs on the default path.
  5. Visibility/tick race (#936 @JTBrinkmann). The visibility handler and the auto-refresh tick no longer share a lock to fight over on the default path.
  6. Stale tickers/listeners on un-disposed clients. dispose() cleans them up.
  7. signOut blocked behind in-flight refresh. signOut now runs concurrently with the refresh on the default path, and the commit guard prevents the refresh from writing rotated tokens after _removeSession() cleared storage.

Callers on the legacy opt-in path (explicit settings.lock) keep the old serialization semantics and the failure modes that come with them. They accepted those when they opted in; they can drop the option to migrate to the lockless default at any time.

Why default lockless: cross-tab races are handled on the server (GoTrue's parent-of-active), in-instance refresh dedup is already handled by refreshingDeferred, and the only job the lock did beyond that was serializing subscriber callbacks — which is the deadlock we're fixing.

Closes (test target): #2013, #936, #2111

Examples

dispose() in a React app

import { useEffect } from 'react'
import { createClient } from '@supabase/supabase-js'

export function useSupabase() {
  useEffect(() => {
    const supabase = createClient(URL, KEY)
    return () => {
      supabase.auth.dispose()
    }
  }, [])
}

Subscriber callbacks can call other auth methods now (default path)

// Previously: deadlocked. Callback held the lock; getUser tried to acquire it.
// Now (default path): works.
supabase.auth.onAuthStateChange(async (event, session) => {
  if (event === 'SIGNED_IN') {
    const { data } = await supabase.auth.getUser()
    // ...
  }
})

One residual hazard remains: calling refreshSession from inside a TOKEN_REFRESHED handler still deadlocks via refreshingDeferred. The async overload of onAuthStateChange keeps its @deprecated marker for that reason.

AuthRefreshDiscardedError for the signOut-during-refresh race

import { isAuthRefreshDiscardedError } from '@supabase/auth-js'

const { data, error } = await supabase.auth.refreshSession()
if (isAuthRefreshDiscardedError(error)) {
  // A concurrent signOut cleared storage between when this refresh
  // started and when it came back. The rotated tokens were discarded.
  // The SIGNED_OUT event already fired from the concurrent signOut.
}

Legacy lock opt-in (existing custom-lock users)

// React Native processLock setup or Node multi-process setup
import { processLock } from '@supabase/auth-js'

const supabase = createClient(URL, KEY, {
  auth: { lock: processLock }, // legacy path, still works, deprecated for v3
})

Breaking changes

  • This PR contains no breaking changes

The earlier draft of this PR silently dropped custom lock functions, which was behaviorally breaking for opt-in users. The current design preserves that behavior on a gated legacy path so the change ships as a v2 minor. Custom-lock users have time to migrate before v3 removes the option entirely.

Behaviour changes worth flagging for review

  • Default coordination is lockless. A client constructed without a lock option no longer acquires navigator.locks or any in-process lock. For most users this is invisible (the lock was internal coordination). For users who depended on observing the lock indirectly (rare), the change is observable.
  • _autoRefreshTokenTick may now run concurrently with signOut / setSession / getUser on the default path. Previously _acquireLock(0, ...) made the tick skip whenever any auth op held the lock. The lockless equivalent only skips when refreshingDeferred is set. The commit guard keeps storage consistent under the new concurrency. The legacy lock opt-in path retains the old skip-on-any-lock behavior.
  • Subscriber timing on the default path: subscribers stay awaited; same as before. What changes is that signOut no longer waits for an in-flight refresh's HTTP and continuation to finish before its own fetch goes out. Both fetches now run concurrently, and the commit guard keeps storage consistent.

Checklist

  • I have read the Contributing Guidelines
  • My PR title follows the conventional commit format: <type>(<scope>): <description>
  • I have run npx nx format to ensure consistent code formatting
  • I have added tests for new functionality (commit guard, dispose(), subscriber re-entry no-deadlock, default vs. legacy path)
  • I have updated documentation (JSDoc on affected methods, type-level deprecation notices)

Additional notes

Alternatives considered

Continuing to patch navigator.locks (better steal recovery, lower timeouts, smarter error filtering). Each prior patch in this direction (#1962, #2106, #2178) fixed one symptom and produced another. The Web Locks API has no recovery for orphaned holders other than { steal: true }, which leaves the previous fn() running concurrently with the new holder and creates the steal-cascade error storm. Each fix here swaps one failure for another instead of removing the contention that causes them.

Swapping navigatorLock for processLock as the default browser lock (the direction #2235 explored). Removes the cross-process orphan failure but keeps the same shape of the bug: one primitive serializing operations that don't all need the same synchronization.

Non-blocking subscriber notifications alone (#2016). Fixes subscriber re-entry. The other six failure classes still bite.

Removing the lock entirely (no legacy opt-in). The earlier draft of this PR did this. It silently dropped custom locks supplied via settings.lock, which broke React Native processLock and Node multi-process setups. The current design preserves those callers on a gated legacy path so the change is non-breaking; the legacy path is slated for removal in v3.

Adding an AbortController layer for in-flight operation cancellation. Deferred to a follow-up, not included here. The commit guard catches the one race that affects correctness (signOut overwriting cleared storage with rotated tokens). AbortController would be a UX improvement on top: cancel the in-flight refresh as soon as signOut runs, instead of letting it finish and discarding the result. Out of scope for this PR.

Deferring the commit guard and accepting eventual consistency. Considered and rejected. Without a guard, a refresh that completes after _removeSession() cleared storage will write the rotated tokens back. Subscribers then see SIGNED_OUT followed by TOKEN_REFRESHED, with stale tokens in storage until the next refresh tick (~30s) fails against the server and clears them. The guard is about 15 lines. Better to land it with this PR than to ship the bug and patch it later.

Server-side context

Why cross-tab is safe on the default path: GoTrue's parent-of-active path at internal/tokens/service.go:376-385 (the v1 branch, *models.RefreshToken). When a request arrives with a revoked refresh token whose child is the currently-active token, the server returns the active token instead of rejecting. Two tabs that POST the same refresh token concurrently both receive the same rotated token under DB row locking. This is the production-default behaviour (RefreshTokenAlgorithmVersion = 0, v1). v2 (counter-based, gated on RefreshTokenUpgradePercentage) is safe under the same N-tab concurrency, covered by TestConcurrentReuse.

Test coverage

Updated tests in packages/core/auth-js/test/GoTrueClient.test.ts:

  • 'Lockless coordination (default) and legacy lock opt-in' describe:
    • default path: client with no lock option, assert _acquireLock is never invoked
    • legacy path: client with custom lock, assert lock IS invoked (.toHaveBeenCalled())
    • lockAcquireTimeout accepted on both paths (lockless ignores it at runtime; legacy uses it)
    • subscriber callback may call other auth methods from inside onAuthStateChange without deadlock
  • 'dispose() lifecycle' describe: idempotency, subscriber clearing, ticker stopping.
  • 'Refresh commit guard (signOut-during-refresh race)' describe (4 tests): cleared-mid-flight discard, empty-storage acceptance, different-session acceptance, different-session-written-mid-flight discard.

Updated tests in packages/core/auth-js/test/GoTrueClient.browser.test.ts:

  • 'Lockless coordination: navigator.locks should NOT be invoked': default path no longer touches the browser API.
  • 'Legacy lock opt-in: custom lock function is invoked when supplied': custom lock IS called on the legacy path.

Updated tests in packages/core/supabase-js/test/unit/SupabaseAuthClient.test.ts:

  • 'should pass through lockAcquireTimeout option' and 'should accept auth.lockAcquireTimeout and wire it to auth client': assert the option flows from createClient through to the auth client instance (via a targeted as unknown as { lockAcquireTimeout: number } cast). Comment in-source explains the cast is greppable for the v3 cleanup.

Open questions for review

  1. AbortSignal.any plumbing in lib/fetch.ts isn't done here (no AbortController layer to plumb yet). If we add the AbortController follow-up later, this becomes a prerequisite. Just so it's on the radar.
  2. @supabase/supabase-js types deprecation: lock and lockAcquireTimeout are @deprecated in the supabase-js types file too, since they delegate to the auth client. Confirm that's the right surface to deprecate.

v3 cleanup

A separate Linear ticket on the SDK team / supabase-js v3 project tracks the mechanical removal of the legacy lock path in v3. Every gated site in GoTrueClient.ts has a // TODO(v3): remove legacy lock path marker; the ticket lists the full file-by-file scope.

Status

Ready for review. Replaces #2387 (which was opened against develop before the v3 branching update).

@Bewinxed Bewinxed force-pushed the bewinxed/authjs-lockless branch 3 times, most recently from 0b30f2d to 36261d5 Compare May 22, 2026 21:21
@mandarini mandarini force-pushed the bewinxed/authjs-lockless branch from 36261d5 to 026e1ee Compare May 25, 2026 10:58
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 25, 2026

Open in StackBlitz

@supabase/auth-js

npm i https://pkg.pr.new/@supabase/auth-js@2392

@supabase/functions-js

npm i https://pkg.pr.new/@supabase/functions-js@2392

@supabase/postgrest-js

npm i https://pkg.pr.new/@supabase/postgrest-js@2392

@supabase/realtime-js

npm i https://pkg.pr.new/@supabase/realtime-js@2392

@supabase/storage-js

npm i https://pkg.pr.new/@supabase/storage-js@2392

@supabase/supabase-js

npm i https://pkg.pr.new/@supabase/supabase-js@2392

commit: 80eec56

@mandarini mandarini self-assigned this May 25, 2026
Bewinxed and others added 5 commits May 25, 2026 17:42
…guard + dispose()

Removes the `_acquireLock` mutex that wrapped every auth operation and the
underlying `navigator.locks` / `processLock` machinery. Replaces it with two
lighter primitives that target the specific synchronization needs each
operation actually has:

- `refreshingDeferred` (already existed) continues to single-flight the
  refresh path within an instance.
- A storage-level commit guard in `_callRefreshToken` re-reads the storage
  refresh_token between the rotated-tokens response and `_saveSession`.
  If storage changed under us (e.g. a concurrent `signOut` ran
  `_removeSession`), the rotated tokens are discarded rather than written
  back. Returns `AuthRefreshDiscardedError` on the result.

Cross-tab refresh races are handled server-side by GoTrue's v1
parent-of-active mechanism at `internal/tokens/service.go:376-385`, so no
client-side coordination is needed.

New `client.dispose()` tears down the auto-refresh interval, the
`visibilitychange` listener, and the BroadcastChannel; clears registered
`onAuthStateChange` subscribers. Idempotent. Call from cleanup hooks in
React Strict Mode / HMR contexts to prevent stale tickers from outliving
the client.

The `lock` and `lockAcquireTimeout` constructor options are silently
ignored for backwards compatibility; both are marked `@deprecated`.
`navigatorLock`, `processLock`, and the `LockAcquireTimeoutError` family
remain exported from `./lib/locks` for one major version.

Stale lock references in JSDoc on `getSession`, `onAuthStateChange`,
`_challengeAndVerify`, and `_listFactors` updated to match the new model.

Test branch only, not for merge yet. See RFC `lockless_auth_coordination`.

Resolves (test target): supabase#2013, supabase#936, supabase#2111
…ispose()

_useSession: explain that concurrent callers can both reach `__loadSession`
because storage reads are idempotent and the only write path (refresh) is
single-flighted downstream by `refreshingDeferred` in `_callRefreshToken`.
No serialization is needed at this layer.

dispose(): add a lifecycle caveat clarifying that in-flight refreshes are
not aborted, so a disposed instance can still persist a rotated session to
storage after `dispose()` returns. A subsequent `createClient` against the
same `storageKey` will pick that session up. Notes the mitigation (await
pending ops before dispose, or use a fresh `storageKey`).

Doc-only changes; no runtime behaviour change.
The two tests at `SupabaseAuthClient.test.ts:61-77` were asserting that
`lockAcquireTimeout` is stored as a runtime field on the auth client
(`expect((authClient as any).lockAcquireTimeout).toBe(30_000)`). That field
no longer exists after the lockless refactor — the option is accepted by
the type for backwards compatibility but is silently ignored at runtime
because the client doesn't acquire a lock around auth operations.

Rewrite both tests to verify the new contract:
- `_initSupabaseAuthClient` accepts the option without throwing.
- `createClient` accepts it through `auth.lockAcquireTimeout`, the
  auth client is still constructed, but the value is not stored as a
  runtime field (`toBeUndefined()`).

Fixes the CI failure in `Supabase-JS Integration CI / Unit + Type Check`
(2 failed, 102 passed → now 104 passed) on all three platforms.
@mandarini mandarini force-pushed the bewinxed/authjs-lockless branch from 265c2a9 to 3ca0099 Compare May 25, 2026 14:42
@mandarini mandarini marked this pull request as ready for review May 26, 2026 10:18
@mandarini mandarini requested review from a team as code owners May 26, 2026 10:18
@mandarini mandarini added the do-not-merge Do not merge this PR. label May 26, 2026
@mandarini
Copy link
Copy Markdown
Contributor

Do not merge until we ask users to test, and dogfood.

if (!data.session) throw new AuthSessionMissingError()

const storedAfter = (await getItemAsync(this.storage, this.storageKey)) as Session | null
const storageChangedUnderUs =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Severity: MEDIUM

TOCTOU gap between guard evaluation and _saveSession: storageChangedUnderUs is computed synchronously after the storedAfter read, but the subsequent _saveSession call contains its own await setItemAsync(...) yield points. A concurrent signOut → _removeSession that clears storage and emits SIGNED_OUT can interleave inside those awaits, causing the rotated tokens to be written back after signOut completed — leaving a live authenticated session the user intended to revoke.
Helpful? Add 👍 / 👎

💡 Fix Suggestion

Suggestion: Close the TOCTOU gap by introducing a monotonic 'session removal epoch' counter that _removeSession increments before it clears storage. _callRefreshToken then captures the epoch immediately before calling _saveSession and re-checks it immediately after. If the epoch advanced during the save, a concurrent signOut → _removeSession ran inside one of _saveSession's await points and wrote rotated tokens back over a cleared slot. In that case:

  1. Add a class field (near line 290 alongside refreshingDeferred):

    protected _sessionRemovalEpoch = 0
  2. Increment it at the very top of _removeSession (before any await):

    private async _removeSession() {
      this._sessionRemovalEpoch += 1
      // ... rest of existing body
    }
  3. Add a post-save epoch check in _callRefreshToken, replacing the current lines 4793–4798:

    const epochBeforeSave = this._sessionRemovalEpoch
    await this._saveSession(data.session)
    
    // Post-save TOCTOU guard: if _removeSession ran inside _saveSession's awaits,
    // undo the write without calling _removeSession (which would emit a duplicate SIGNED_OUT).
    if (this._sessionRemovalEpoch !== epochBeforeSave) {
      await removeItemAsync(this.storage, this.storageKey)
      if (this.userStorage) {
        await removeItemAsync(this.userStorage, this.storageKey + '-user')
      }
      const discarded: CallRefreshTokenResult = { data: null, error: new AuthRefreshDiscardedError() }
      this.refreshingDeferred.resolve(discarded)
      return discarded
    }
    
    await this._notifyAllSubscribers('TOKEN_REFRESHED', data.session)

Use removeItemAsync directly (not _removeSession) to undo the write, because _removeSession already fired SIGNED_OUT — calling it again would emit a duplicate event. The increment-before-await ordering in step 2 ensures the check in step 3 is not subject to its own race.

Note that this fix requires coordinated changes in three locations (class field, _removeSession, and _callRefreshToken), which is why no single-location code suggestion is provided.

debugName,
'commit guard: storage changed since refresh started, discarding rotated tokens',
{
startedWith: `${storedAtStart.refresh_token?.substring(0, 5)}...`,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: LOW

The commit-guard debug block logs the first 5 characters of both the pre- and post-fetch refresh tokens. When debug: true is configured, these fragments are emitted to console.log and may be forwarded to log-aggregation or error-monitoring services (e.g. Sentry, Datadog). Refresh tokens are long-lived bearer credentials; partial values should not appear in log output.
Helpful? Add 👍 / 👎

💡 Fix Suggestion

Suggestion: Replace the partial refresh token substrings (substring(0, 5)) in the commit-guard debug block with a redacted placeholder. Since the block is already inside if (storageChangedUnderUs), the fact that the tokens differ is already established — the actual token prefixes add no diagnostic value and expose credential fragments to any log-aggregation or error-monitoring backend that collects console.log output when debug: true is set. Use '[redacted]' (or a boolean presence indicator) for both startedWith and nowHolds, and apply the same treatment to the debugName construction that also logs refreshToken.substring(0, 5) at the top of _callRefreshToken.

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change
startedWith: `${storedAtStart.refresh_token?.substring(0, 5)}...`,
startedWith: storedAtStart.refresh_token ? '[redacted]' : null,
nowHolds: storedAfter?.refresh_token ? '[redacted]' : null,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge Do not merge this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants