Skip to content

refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2387

Draft
Bewinxed wants to merge 1 commit into
supabase:developfrom
Bewinxed:bewinxed/auth-lockless
Draft

refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2387
Bewinxed wants to merge 1 commit into
supabase:developfrom
Bewinxed:bewinxed/auth-lockless

Conversation

@Bewinxed
Copy link
Copy Markdown
Contributor

@Bewinxed Bewinxed commented May 21, 2026

Description

What changed?

GoTrueClient no longer uses a shared mutex to serialize auth operations. _acquireLock, its pendingInLock queue, and all 14 call-site wrappers are gone. In their place:

  • refreshingDeferred (already existed) keeps single-flighting in-instance concurrent refreshes.
  • A commit guard inside _callRefreshToken re-reads storage after the rotated-token response and before _saveSession. If the input refresh_token isn't in storage anymore (because a concurrent signOut ran _removeSession between fetch start and continuation), the rotated tokens are discarded and the result resolves to { data: null, error: new AuthRefreshDiscardedError() }. _recoverAndRefresh recognises this error and skips its own _removeSession() call, so no duplicate SIGNED_OUT event fires.
  • A new exported error class, AuthRefreshDiscardedError (with an isAuthRefreshDiscardedError type guard), is what the commit guard returns when it throws away a successfully-rotated token. It signals "the server gave us new tokens but the client decided not to keep them," distinct from AuthRetryableFetchError (transient network) and AuthApiError (server rejection). Surfaces through refreshSession() and getSession() results.
  • A new client.dispose() tears down the auto-refresh interval, the visibilitychange listener, the BroadcastChannel, and registered onAuthStateChange subscribers. Idempotent. Designed for React Strict Mode and HMR cleanup hooks.
  • GoTrue handles cross-tab refresh races server-side via the v1 parent-of-active path at internal/tokens/service.go:376-385, so the client doesn't need to coordinate. The current default is v1 (RefreshTokenAlgorithmVersion = 0); v2 (counter-based) is opt-in via RefreshTokenUpgradePercentage and gated rollout. This PR's reasoning relies on v1 behaviour, which is what customers actually run today.
  • The lock and lockAcquireTimeout constructor options are accepted but silently ignored for backwards compatibility. Both are now @deprecated in JSDoc, including the @supabase/supabase-js re-exported fields.
  • navigatorLock, processLock, LockAcquireTimeoutError, NavigatorLockAcquireTimeoutError, ProcessLockAcquireTimeoutError, and internals in lib/locks.ts remain exported for one major version. All @deprecated.
  • Stale JSDoc referencing the lock on getSession, onAuthStateChange, _useSession, _challengeAndVerify, and _listFactors now matches the new behaviour. The @deprecated tag on the async onAuthStateChange overload (its only justification was the deadlock that no longer exists) is removed.

Why was this change needed?

The shared mutex has caused seven failure classes on the issue tracker for 18+ months. Each earlier patch fixed one symptom and created another:

  • #1962 configurable timeout
  • #2106 steal fallback
  • #2178 steal-cascade guard

#2235 (primitive swap to processLock) was an earlier attempt at moving off navigator.locks and is no longer being actively pursued. Community PRs #2016 and #2019 fixed individual symptoms and were closed waiting on a structural fix.

Failure classes resolved by this PR:

  1. Strict Mode / HMR orphan deadlocks. Nothing to orphan now that the lock is gone.
  2. iOS Safari INITIAL_SESSION / getSession() hangs with a persisted session (#936 @bugprone). The client no longer touches navigator.locks.
  3. Sentry flooding with AbortError: Lock broken by another request with the 'steal' option (#2013 @cpannwitz). The { steal: true } fallback that produced these errors is gone with the lock.
  4. Subscriber re-entry deadlock (the one at the old GoTrueClient.ts:3824). The cycle needed the lock; a callback that awaits getUser() now just runs.
  5. Visibility/tick race (#936 @JTBrinkmann). The visibility handler and the auto-refresh tick no longer share a lock to fight over.
  6. Stale tickers/listeners on un-disposed clients. dispose() cleans them up.
  7. signOut blocked behind in-flight refresh. signOut now runs concurrently with the refresh, and the commit guard prevents the refresh from writing rotated tokens after _removeSession() cleared storage.

Why no lock at all: cross-tab races are handled on the server (GoTrue's parent-of-active), in-instance refresh dedup is already handled by refreshingDeferred, and the only job left for the lock was serializing subscriber callbacks. That's the deadlock this PR fixes. A narrower lock would still do that one job.

Closes (test target, not for merge yet): #2013, #936, #2111

Related: lockless_auth_coordination RFC (to be opened separately).

Examples

dispose() in a React app

import { useEffect } from 'react'
import { createClient } from '@supabase/supabase-js'

export function useSupabase() {
  useEffect(() => {
    const supabase = createClient(URL, KEY)
    return () => {
      supabase.auth.dispose()
    }
  }, [])
}

Subscriber callbacks can call other auth methods now

// Previously: deadlocked. Callback held the lock; getUser tried to acquire it.
// Now: works.
supabase.auth.onAuthStateChange(async (event, session) => {
  if (event === 'SIGNED_IN') {
    const { data } = await supabase.auth.getUser()
    // ...
  }
})

AuthRefreshDiscardedError for the signOut-during-refresh race

import { isAuthRefreshDiscardedError } from '@supabase/auth-js'

const { data, error } = await supabase.auth.refreshSession()
if (isAuthRefreshDiscardedError(error)) {
  // A concurrent signOut cleared storage between when this refresh
  // started and when it came back. The rotated tokens were discarded.
  // The SIGNED_OUT event already fired from the concurrent signOut.
}

Breaking changes

  • This PR contains no breaking changes

Behaviour changes worth flagging for review

  • The JSDoc on onAuthStateChange no longer warns against async callbacks. Calling other auth methods inside the callback used to deadlock; now it works. The @deprecated marker on the async overload signature is removed (its only justification was the deadlock).
  • lock and lockAcquireTimeout are accepted but ignored. A custom lock function passed via constructor is never invoked. Code that depended on a custom lock being called will silently lose that behaviour. The @deprecated JSDoc on both options flags this for the developer.
  • Subscriber timing: subscribers stay awaited; same as before. What changes is that signOut no longer waits for an in-flight refresh's HTTP and continuation to finish before its own fetch goes out. Both fetches now run concurrently, and the commit guard keeps storage consistent.

Checklist

  • I have read the Contributing Guidelines
  • My PR title follows the conventional commit format: <type>(<scope>): <description>
  • I have run npx nx format to ensure consistent code formatting
  • I have added tests for new functionality (commit guard, dispose(), subscriber re-entry no-deadlock, custom lock no-op)
  • I have updated documentation (JSDoc on affected methods, type-level deprecation notices)

Additional notes

Alternatives considered

Continuing to patch navigator.locks (better steal recovery, lower timeouts, smarter error filtering). Each prior patch in this direction (#1962, #2106, #2178) fixed one symptom and produced another. The Web Locks API has no recovery for orphaned holders other than { steal: true }, which leaves the previous fn() running concurrently with the new holder and creates the steal-cascade error storm. Each fix here swaps one failure for another instead of removing the contention that causes them.

Swapping navigatorLock for processLock as the default browser lock (the direction #2235 explored). Removes the cross-process orphan failure but keeps the same shape of the bug: one primitive serializing operations that don't all need the same synchronization. Subscriber re-entry deadlock, visibility-tick contention, init-tick contention, and in-process orphans of the new primitive all remain.

Non-blocking subscriber notifications alone (#2016). Fixes subscriber re-entry. The other six failure classes still bite.

A "smaller, refactored" lock with narrower scope (e.g. lock only around storage writes, not around HTTP). Tempting on paper. Cross-tab refresh is already lockless on the server (parent-of-active returns the same rotated token to N tabs), and refreshingDeferred already single-flights refreshes inside a tab. The visibility-tick race is the lock fighting itself. The only thing a narrower lock could still do is serialize subscriber callbacks within a tab, which is the deadlock this PR fixes. No smaller version of this PR still fixes the seven failures above.

Adding an AbortController layer for in-flight operation cancellation. Deferred to a follow-up, not included here. The commit guard catches the one race that affects correctness (signOut overwriting cleared storage with rotated tokens). AbortController would be a UX improvement on top: cancel the in-flight refresh as soon as signOut runs, instead of letting it finish and discarding the result. It also requires plumbing signal through lib/fetch.ts and a decision on AbortSignal.any runtime support (Node ≥ 20.3, Hermes ≥ 0.74, Safari ≥ 17.4). Out of scope for v1.

Deferring the commit guard and accepting eventual consistency. Considered and rejected. Without a guard, a refresh that completes after _removeSession() cleared storage will write the rotated tokens back. Subscribers then see SIGNED_OUT followed by TOKEN_REFRESHED, with stale tokens in storage until the next refresh tick (~30s) fails against the server and clears them. The guard is about 15 lines. Better to land it with this PR than to ship the bug and patch it later.

Server-side context

Why cross-tab is safe: GoTrue's parent-of-active path at internal/tokens/service.go:376-385 (the v1 branch, *models.RefreshToken). When a request arrives with a revoked refresh token whose child is the currently-active token, the server returns the active token instead of rejecting. Two tabs that POST the same refresh token concurrently both receive the same rotated token under DB row locking. This is the production-default behaviour (RefreshTokenAlgorithmVersion = 0, v1). v2 (counter-based, gated on RefreshTokenUpgradePercentage) is safe under the same N-tab concurrency, covered by TestConcurrentReuse.

Open questions for review

  1. Should _setSession's expired branch also filter AuthRefreshDiscardedError? Currently it propagates the error to the caller, which I think is correct (setSession was given new tokens; if a concurrent signOut just ran, the caller should be told). Confirming we're OK with that.
  2. AbortSignal.any plumbing in lib/fetch.ts isn't done here (no AbortController layer to plumb yet). If we add the AbortController follow-up later, this becomes a prerequisite. Just so it's on the radar.
  3. @supabase/supabase-js types deprecation: I marked lock and lockAcquireTimeout @deprecated in the supabase-js types file too, since they delegate to the auth client. Confirm that's the right surface to deprecate.

Test coverage

New tests in packages/core/auth-js/test/GoTrueClient.test.ts:

  • 'Lockless coordination' describe: deprecated lock and lockAcquireTimeout options accepted-but-ignored; subscriber callback may call other auth methods without deadlock.
  • 'dispose() lifecycle' describe: idempotency, subscriber clearing, ticker stopping.
  • 'Refresh commit guard (signOut-during-refresh race)' describe: simulates the race, asserts AuthRefreshDiscardedError is returned and storage stays cleared.

New tests in packages/core/auth-js/test/GoTrueClient.browser.test.ts:

  • 'Lockless coordination: navigator.locks should NOT be invoked': confirms the browser API is no longer touched.
  • 'Lockless backwards-compatibility: deprecated lock option': a custom lock function passed via constructor is never invoked.

Removed: 'Lock functionality', 'Browser locks functionality', 'Lock Mechanism Branches' (tested a mechanism that no longer exists). Updated the existing 'should handle custom lock implementation' test to assert not.toHaveBeenCalled() instead of toHaveBeenCalled().

Status

This is on a test branch for now. Ready for review and discussion. Once the design is signed off, I'll open it against develop.

@Bewinxed Bewinxed force-pushed the bewinxed/auth-lockless branch 9 times, most recently from f698e42 to d248bb2 Compare May 21, 2026 07:38
…guard + dispose()

Removes the `_acquireLock` mutex that wrapped every auth operation and the
underlying `navigator.locks` / `processLock` machinery. Replaces it with two
lighter primitives that target the specific synchronization needs each
operation actually has:

- `refreshingDeferred` (already existed) continues to single-flight the
  refresh path within an instance.
- A storage-level commit guard in `_callRefreshToken` re-reads the storage
  refresh_token between the rotated-tokens response and `_saveSession`.
  If storage changed under us (e.g. a concurrent `signOut` ran
  `_removeSession`), the rotated tokens are discarded rather than written
  back. Returns `AuthRefreshDiscardedError` on the result.

Cross-tab refresh races are handled server-side by GoTrue's v1
parent-of-active mechanism at `internal/tokens/service.go:376-385`, so no
client-side coordination is needed.

New `client.dispose()` tears down the auto-refresh interval, the
`visibilitychange` listener, and the BroadcastChannel; clears registered
`onAuthStateChange` subscribers. Idempotent. Call from cleanup hooks in
React Strict Mode / HMR contexts to prevent stale tickers from outliving
the client.

The `lock` and `lockAcquireTimeout` constructor options are silently
ignored for backwards compatibility; both are marked `@deprecated`.
`navigatorLock`, `processLock`, and the `LockAcquireTimeoutError` family
remain exported from `./lib/locks` for one major version.

Stale lock references in JSDoc on `getSession`, `onAuthStateChange`,
`_challengeAndVerify`, and `_listFactors` updated to match the new model.

Test branch only, not for merge yet. See RFC `lockless_auth_coordination`.

Resolves (test target): supabase#2013, supabase#936, supabase#2111
@Bewinxed Bewinxed force-pushed the bewinxed/auth-lockless branch from d248bb2 to f3d4e71 Compare May 21, 2026 07:42
@mandarini mandarini self-assigned this May 21, 2026
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 21, 2026

Open in StackBlitz

@supabase/auth-js

npm i https://pkg.pr.new/@supabase/auth-js@2387

@supabase/functions-js

npm i https://pkg.pr.new/@supabase/functions-js@2387

@supabase/postgrest-js

npm i https://pkg.pr.new/@supabase/postgrest-js@2387

@supabase/realtime-js

npm i https://pkg.pr.new/@supabase/realtime-js@2387

@supabase/storage-js

npm i https://pkg.pr.new/@supabase/storage-js@2387

@supabase/supabase-js

npm i https://pkg.pr.new/@supabase/supabase-js@2387

commit: f3d4e71

@mandarini mandarini changed the base branch from develop to master May 21, 2026 11:51
@mandarini mandarini changed the base branch from master to develop May 21, 2026 11:51
Copy link
Copy Markdown
Contributor

@mandarini mandarini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments in the code, and also:

  1. "No breaking changes" checkbox is incorrect. There are two real breaking changes.
  • Custom lock functions are silently ignored. React Native users with processLock and Node multi-process setups with custom locks now race silently with no warning. Add a console.warn (gated on an opt-out) at construction when a non-null lock is supplied.
  • setSession/refreshSession({ refresh_token }) with externally-supplied tokens against empty storage now error. Goes away once #1 is fixed.

Let's update the PR description.

  1. _autoRefreshTokenTick behavior change not flagged.

Old code: _acquireLock(0, ...) made the tick skip whenever any auth op held the lock. New code: only skips when refreshingDeferred is set. The tick can now fire concurrently with signOut/setSession/getUser/etc. for the first time. Add to "Behaviour changes worth flagging for review".

Side note: the early-return if (this.refreshingDeferred !== null) at the top of the tick duplicates dedup that _callRefreshToken already does. Keep as perf opt or drop, your call.

  1. RFC references in code.

Code comments and test names reference lockless_auth_coordination. If the RFC isn't landing as a separate doc, scrub the references; otherwise link it.

Thank you so much @Bewinxed !!! :D

// discard the rotated tokens. Writing them back would undo what
// signOut just cleared.
const currentStored = (await getItemAsync(this.storage, this.storageKey)) as Session | null
if (!currentStored || currentStored.refresh_token !== refreshToken) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit guard fires false-positives on empty storage.

To repro: fresh client, empty memoryLocalStorageAdapter, no signOut anywhere, mocked _refreshAccessToken returning rotated tokens without touching storage:

await client.setSession({ access_token: expiredAT, refresh_token: 'R-external' })
// returns: { data: { session: null }, error: AuthRefreshDiscardedError }
// error.message: "Refresh result discarded: session state changed mid-flight (e.g., concurrent signOut)"

Breaks setSession and refreshSession({ refresh_token }) for any caller supplying an expired AT plus a refresh token that isn't already in storage. That's the entire SSR cookie-handoff / external-token-hydration flow.

Fix: snapshot storage before the fetch, only discard when a non-null snapshot got cleared.

const storedAtStart = await getItemAsync(this.storage, this.storageKey)
// ... _refreshAccessToken ...
const storedAfter = await getItemAsync(this.storage, this.storageKey)
const wasClearedMidFlight =
  storedAtStart !== null &&
  (storedAfter === null || storedAfter.refresh_token !== refreshToken)

Add a test for the empty-storage path. Your existing 'discards rotated tokens when storage was cleared mid-flight' test plants R1 first, so it doesn't catch this.

*
* Call this from cleanup hooks when the client is being replaced before
* its JS realm is destroyed. React Strict Mode and HMR are the common
* cases. After dispose, network operations will reject. Storage reads
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispose() docstring claims behavior the code doesn't have. dispose() stops the ticker, closes the broadcast channel, removes the visibility listener, clears subscribers. In-flight fetches still complete and _saveSession still writes. New getSession/getUser calls after dispose() work fine. Per our earlier alignment to defer AbortController plumbing, delete the sentence.

Comment on lines -3844 to 3846
* @deprecated Due to the possibility of deadlocks with async functions as callbacks, use the version without an async function.
*/
onAuthStateChange(callback: (event: AuthChangeEvent, session: Session | null) => Promise<void>): {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing @deprecated on the async onAuthStateChange overload is premature. One reentry deadlock remains: a subscriber that calls refreshSession (or anything routing through _callRefreshToken) from inside a TOKEN_REFRESHED handler. refreshingDeferred.resolve happens after _notifyAllSubscribers returns:

await this._notifyAllSubscribers('TOKEN_REFRESHED', data.session)  // subscriber runs here
// ...
this.refreshingDeferred.resolve(result)  // resolved AFTER

Subscriber's inner refreshSession dedups onto refreshingDeferred, which won't resolve until the outer finishes awaiting subscribers. Deadlock.

Let's keep the @deprecated on the async overload, rewrite the reason to point at residual refreshingDeferred reentry instead of the old lock. The common cases (getUser, setSession, reading session from a callback) are genuinely fixed and worth saying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants