refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2392
refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2392Bewinxed wants to merge 6 commits into
Conversation
0b30f2d to
36261d5
Compare
36261d5 to
026e1ee
Compare
@supabase/auth-js
@supabase/functions-js
@supabase/postgrest-js
@supabase/realtime-js
@supabase/storage-js
@supabase/supabase-js
commit: |
…guard + dispose() Removes the `_acquireLock` mutex that wrapped every auth operation and the underlying `navigator.locks` / `processLock` machinery. Replaces it with two lighter primitives that target the specific synchronization needs each operation actually has: - `refreshingDeferred` (already existed) continues to single-flight the refresh path within an instance. - A storage-level commit guard in `_callRefreshToken` re-reads the storage refresh_token between the rotated-tokens response and `_saveSession`. If storage changed under us (e.g. a concurrent `signOut` ran `_removeSession`), the rotated tokens are discarded rather than written back. Returns `AuthRefreshDiscardedError` on the result. Cross-tab refresh races are handled server-side by GoTrue's v1 parent-of-active mechanism at `internal/tokens/service.go:376-385`, so no client-side coordination is needed. New `client.dispose()` tears down the auto-refresh interval, the `visibilitychange` listener, and the BroadcastChannel; clears registered `onAuthStateChange` subscribers. Idempotent. Call from cleanup hooks in React Strict Mode / HMR contexts to prevent stale tickers from outliving the client. The `lock` and `lockAcquireTimeout` constructor options are silently ignored for backwards compatibility; both are marked `@deprecated`. `navigatorLock`, `processLock`, and the `LockAcquireTimeoutError` family remain exported from `./lib/locks` for one major version. Stale lock references in JSDoc on `getSession`, `onAuthStateChange`, `_challengeAndVerify`, and `_listFactors` updated to match the new model. Test branch only, not for merge yet. See RFC `lockless_auth_coordination`. Resolves (test target): supabase#2013, supabase#936, supabase#2111
…ispose() _useSession: explain that concurrent callers can both reach `__loadSession` because storage reads are idempotent and the only write path (refresh) is single-flighted downstream by `refreshingDeferred` in `_callRefreshToken`. No serialization is needed at this layer. dispose(): add a lifecycle caveat clarifying that in-flight refreshes are not aborted, so a disposed instance can still persist a rotated session to storage after `dispose()` returns. A subsequent `createClient` against the same `storageKey` will pick that session up. Notes the mitigation (await pending ops before dispose, or use a fresh `storageKey`). Doc-only changes; no runtime behaviour change.
The two tests at `SupabaseAuthClient.test.ts:61-77` were asserting that `lockAcquireTimeout` is stored as a runtime field on the auth client (`expect((authClient as any).lockAcquireTimeout).toBe(30_000)`). That field no longer exists after the lockless refactor — the option is accepted by the type for backwards compatibility but is silently ignored at runtime because the client doesn't acquire a lock around auth operations. Rewrite both tests to verify the new contract: - `_initSupabaseAuthClient` accepts the option without throwing. - `createClient` accepts it through `auth.lockAcquireTimeout`, the auth client is still constructed, but the value is not stored as a runtime field (`toBeUndefined()`). Fixes the CI failure in `Supabase-JS Integration CI / Unit + Type Check` (2 failed, 102 passed → now 104 passed) on all three platforms.
265c2a9 to
3ca0099
Compare
|
Do not merge until we ask users to test, and dogfood. |
| if (!data.session) throw new AuthSessionMissingError() | ||
|
|
||
| const storedAfter = (await getItemAsync(this.storage, this.storageKey)) as Session | null | ||
| const storageChangedUnderUs = |
There was a problem hiding this comment.
🟡 Severity: MEDIUM
TOCTOU gap between guard evaluation and _saveSession: storageChangedUnderUs is computed synchronously after the storedAfter read, but the subsequent _saveSession call contains its own await setItemAsync(...) yield points. A concurrent signOut → _removeSession that clears storage and emits SIGNED_OUT can interleave inside those awaits, causing the rotated tokens to be written back after signOut completed — leaving a live authenticated session the user intended to revoke.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Close the TOCTOU gap by introducing a monotonic 'session removal epoch' counter that _removeSession increments before it clears storage. _callRefreshToken then captures the epoch immediately before calling _saveSession and re-checks it immediately after. If the epoch advanced during the save, a concurrent signOut → _removeSession ran inside one of _saveSession's await points and wrote rotated tokens back over a cleared slot. In that case:
-
Add a class field (near line 290 alongside
refreshingDeferred):protected _sessionRemovalEpoch = 0
-
Increment it at the very top of
_removeSession(before anyawait):private async _removeSession() { this._sessionRemovalEpoch += 1 // ... rest of existing body }
-
Add a post-save epoch check in
_callRefreshToken, replacing the current lines 4793–4798:const epochBeforeSave = this._sessionRemovalEpoch await this._saveSession(data.session) // Post-save TOCTOU guard: if _removeSession ran inside _saveSession's awaits, // undo the write without calling _removeSession (which would emit a duplicate SIGNED_OUT). if (this._sessionRemovalEpoch !== epochBeforeSave) { await removeItemAsync(this.storage, this.storageKey) if (this.userStorage) { await removeItemAsync(this.userStorage, this.storageKey + '-user') } const discarded: CallRefreshTokenResult = { data: null, error: new AuthRefreshDiscardedError() } this.refreshingDeferred.resolve(discarded) return discarded } await this._notifyAllSubscribers('TOKEN_REFRESHED', data.session)
Use removeItemAsync directly (not _removeSession) to undo the write, because _removeSession already fired SIGNED_OUT — calling it again would emit a duplicate event. The increment-before-await ordering in step 2 ensures the check in step 3 is not subject to its own race.
Note that this fix requires coordinated changes in three locations (class field, _removeSession, and _callRefreshToken), which is why no single-location code suggestion is provided.
| debugName, | ||
| 'commit guard: storage changed since refresh started, discarding rotated tokens', | ||
| { | ||
| startedWith: `${storedAtStart.refresh_token?.substring(0, 5)}...`, |
There was a problem hiding this comment.
⚪ Severity: LOW
The commit-guard debug block logs the first 5 characters of both the pre- and post-fetch refresh tokens. When debug: true is configured, these fragments are emitted to console.log and may be forwarded to log-aggregation or error-monitoring services (e.g. Sentry, Datadog). Refresh tokens are long-lived bearer credentials; partial values should not appear in log output.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Replace the partial refresh token substrings (substring(0, 5)) in the commit-guard debug block with a redacted placeholder. Since the block is already inside if (storageChangedUnderUs), the fact that the tokens differ is already established — the actual token prefixes add no diagnostic value and expose credential fragments to any log-aggregation or error-monitoring backend that collects console.log output when debug: true is set. Use '[redacted]' (or a boolean presence indicator) for both startedWith and nowHolds, and apply the same treatment to the debugName construction that also logs refreshToken.substring(0, 5) at the top of _callRefreshToken.
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| startedWith: `${storedAtStart.refresh_token?.substring(0, 5)}...`, | |
| startedWith: storedAtStart.refresh_token ? '[redacted]' : null, | |
| nowHolds: storedAfter?.refresh_token ? '[redacted]' : null, |
Description
What changed?
GoTrueClientnow defaults to lockless coordination. Thenavigator.locks-based mutex no longer runs by default — most users get the lockless path. Callers who explicitly pass a customlock(typically React NativeprocessLockor Node multi-process setups) keep the old behavior on an opt-in legacy path so the change is backwards-compatible.The lockless default uses:
refreshingDeferred(already existed) to single-flight in-instance concurrent refreshes._callRefreshTokenthat snapshots storage before the rotated-token fetch and re-reads after, then discards the rotated tokens if a non-null snapshot was cleared (typical case: a concurrentsignOutran_removeSession). The discarded path returns{ data: null, error: new AuthRefreshDiscardedError() }._recoverAndRefreshrecognises this error and skips its own_removeSession()call so no duplicateSIGNED_OUTevent fires.AuthRefreshDiscardedError(with anisAuthRefreshDiscardedErrortype guard), is what the commit guard returns when it throws away a successfully-rotated token. Distinct fromAuthRetryableFetchError(transient network) andAuthApiError(server rejection). Surfaces throughrefreshSession()andgetSession()results.client.auth.dispose()tears down the auto-refresh interval, thevisibilitychangelistener, theBroadcastChannel, and registeredonAuthStateChangesubscribers. Idempotent. Designed for React Strict Mode and HMR cleanup hooks. In-flightfetchcalls are not aborted — they run to completion.internal/tokens/service.go:376-385, so the client doesn't need to coordinate. The current default is v1 (RefreshTokenAlgorithmVersion = 0); v2 (counter-based) is opt-in viaRefreshTokenUpgradePercentageand gated rollout. This PR's reasoning relies on v1 behaviour, which is what customers actually run today.The legacy opt-in path (
settings.lock != null):_acquireLock,pendingInLock,lockAcquired,lockAcquireTimeout, and all 14 call-site wrappers are preserved verbatim frommasterand gated onthis.lock != null. The default path is untouched by them.// TODO(v3): remove legacy lock pathso the eventual v3 cleanup is a mechanical search-and-delete. A separate Linear ticket on the supabase-js v3 project tracks that work.lockandlockAcquireTimeoutconstructor options are accepted and honored when supplied; both are@deprecatedin JSDoc with "Custom locks still work in v2.x for backwards compatibility. Will be removed in v3."navigatorLock,processLock,LockAcquireTimeoutError,NavigatorLockAcquireTimeoutError,ProcessLockAcquireTimeoutError, andinternalsinlib/locks.tsremain exported. Still@deprecated.Stale JSDoc referencing the lock on
getSession,onAuthStateChange,_useSession,_challengeAndVerify, and_listFactorsnow matches the new default behaviour. The@deprecatedtag on the asynconAuthStateChangeoverload is kept, with its reason updated to point at the one residual reentry hazard (refreshSessioncalled from inside aTOKEN_REFRESHEDhandler —refreshingDeferred.resolvehappens after_notifyAllSubscribersreturns).Why was this change needed?
The shared mutex has caused seven failure classes on the issue tracker for 18+ months. Each earlier patch fixed one symptom and created another:
#2235 (primitive swap to
processLock) was an earlier attempt at moving offnavigator.locksand is no longer being actively pursued. Community PRs #2016 and #2019 fixed individual symptoms and were closed waiting on a structural fix.Failure classes resolved by this PR's default (lockless) path:
INITIAL_SESSION/getSession()hangs with a persisted session (#936 @bugprone). The default path no longer touchesnavigator.locks.AbortError: Lock broken by another request with the 'steal' option(#2013 @cpannwitz). The{ steal: true }fallback that produced these errors is bypassed on the default path.GoTrueClient.ts:3824). The cycle needed the lock; a callback that awaitsgetUser()now just runs on the default path.dispose()cleans them up.signOutblocked behind in-flight refresh.signOutnow runs concurrently with the refresh on the default path, and the commit guard prevents the refresh from writing rotated tokens after_removeSession()cleared storage.Callers on the legacy opt-in path (explicit
settings.lock) keep the old serialization semantics and the failure modes that come with them. They accepted those when they opted in; they can drop the option to migrate to the lockless default at any time.Why default lockless: cross-tab races are handled on the server (GoTrue's parent-of-active), in-instance refresh dedup is already handled by
refreshingDeferred, and the only job the lock did beyond that was serializing subscriber callbacks — which is the deadlock we're fixing.Closes (test target): #2013, #936, #2111
Examples
dispose()in a React appSubscriber callbacks can call other auth methods now (default path)
One residual hazard remains: calling
refreshSessionfrom inside aTOKEN_REFRESHEDhandler still deadlocks viarefreshingDeferred. The async overload ofonAuthStateChangekeeps its@deprecatedmarker for that reason.AuthRefreshDiscardedErrorfor the signOut-during-refresh raceLegacy lock opt-in (existing custom-lock users)
Breaking changes
The earlier draft of this PR silently dropped custom
lockfunctions, which was behaviorally breaking for opt-in users. The current design preserves that behavior on a gated legacy path so the change ships as a v2 minor. Custom-lock users have time to migrate before v3 removes the option entirely.Behaviour changes worth flagging for review
lockoption no longer acquiresnavigator.locksor any in-process lock. For most users this is invisible (the lock was internal coordination). For users who depended on observing the lock indirectly (rare), the change is observable._autoRefreshTokenTickmay now run concurrently withsignOut/setSession/getUseron the default path. Previously_acquireLock(0, ...)made the tick skip whenever any auth op held the lock. The lockless equivalent only skips whenrefreshingDeferredis set. The commit guard keeps storage consistent under the new concurrency. The legacy lock opt-in path retains the old skip-on-any-lock behavior.signOutno longer waits for an in-flight refresh's HTTP and continuation to finish before its own fetch goes out. Both fetches now run concurrently, and the commit guard keeps storage consistent.Checklist
<type>(<scope>): <description>npx nx formatto ensure consistent code formattingdispose(), subscriber re-entry no-deadlock, default vs. legacy path)Additional notes
Alternatives considered
Continuing to patch
navigator.locks(better steal recovery, lower timeouts, smarter error filtering). Each prior patch in this direction (#1962, #2106, #2178) fixed one symptom and produced another. The Web Locks API has no recovery for orphaned holders other than{ steal: true }, which leaves the previousfn()running concurrently with the new holder and creates the steal-cascade error storm. Each fix here swaps one failure for another instead of removing the contention that causes them.Swapping
navigatorLockforprocessLockas the default browser lock (the direction #2235 explored). Removes the cross-process orphan failure but keeps the same shape of the bug: one primitive serializing operations that don't all need the same synchronization.Non-blocking subscriber notifications alone (#2016). Fixes subscriber re-entry. The other six failure classes still bite.
Removing the lock entirely (no legacy opt-in). The earlier draft of this PR did this. It silently dropped custom locks supplied via
settings.lock, which broke React NativeprocessLockand Node multi-process setups. The current design preserves those callers on a gated legacy path so the change is non-breaking; the legacy path is slated for removal in v3.Adding an
AbortControllerlayer for in-flight operation cancellation. Deferred to a follow-up, not included here. The commit guard catches the one race that affects correctness (signOut overwriting cleared storage with rotated tokens). AbortController would be a UX improvement on top: cancel the in-flight refresh as soon as signOut runs, instead of letting it finish and discarding the result. Out of scope for this PR.Deferring the commit guard and accepting eventual consistency. Considered and rejected. Without a guard, a refresh that completes after
_removeSession()cleared storage will write the rotated tokens back. Subscribers then seeSIGNED_OUTfollowed byTOKEN_REFRESHED, with stale tokens in storage until the next refresh tick (~30s) fails against the server and clears them. The guard is about 15 lines. Better to land it with this PR than to ship the bug and patch it later.Server-side context
Why cross-tab is safe on the default path: GoTrue's parent-of-active path at
internal/tokens/service.go:376-385(the v1 branch,*models.RefreshToken). When a request arrives with a revoked refresh token whose child is the currently-active token, the server returns the active token instead of rejecting. Two tabs that POST the same refresh token concurrently both receive the same rotated token under DB row locking. This is the production-default behaviour (RefreshTokenAlgorithmVersion = 0, v1). v2 (counter-based, gated onRefreshTokenUpgradePercentage) is safe under the same N-tab concurrency, covered byTestConcurrentReuse.Test coverage
Updated tests in
packages/core/auth-js/test/GoTrueClient.test.ts:'Lockless coordination (default) and legacy lock opt-in'describe:lockoption, assert_acquireLockis never invokedlock, assertlockIS invoked (.toHaveBeenCalled())lockAcquireTimeoutaccepted on both paths (lockless ignores it at runtime; legacy uses it)onAuthStateChangewithout deadlock'dispose() lifecycle'describe: idempotency, subscriber clearing, ticker stopping.'Refresh commit guard (signOut-during-refresh race)'describe (4 tests): cleared-mid-flight discard, empty-storage acceptance, different-session acceptance, different-session-written-mid-flight discard.Updated tests in
packages/core/auth-js/test/GoTrueClient.browser.test.ts:'Lockless coordination: navigator.locks should NOT be invoked': default path no longer touches the browser API.'Legacy lock opt-in: customlockfunction is invoked when supplied': custom lock IS called on the legacy path.Updated tests in
packages/core/supabase-js/test/unit/SupabaseAuthClient.test.ts:'should pass through lockAcquireTimeout option'and'should accept auth.lockAcquireTimeout and wire it to auth client': assert the option flows fromcreateClientthrough to the auth client instance (via a targetedas unknown as { lockAcquireTimeout: number }cast). Comment in-source explains the cast is greppable for the v3 cleanup.Open questions for review
AbortSignal.anyplumbing inlib/fetch.tsisn't done here (no AbortController layer to plumb yet). If we add the AbortController follow-up later, this becomes a prerequisite. Just so it's on the radar.@supabase/supabase-jstypes deprecation:lockandlockAcquireTimeoutare@deprecatedin the supabase-js types file too, since they delegate to the auth client. Confirm that's the right surface to deprecate.v3 cleanup
A separate Linear ticket on the SDK team /
supabase-js v3project tracks the mechanical removal of the legacy lock path in v3. Every gated site inGoTrueClient.tshas a// TODO(v3): remove legacy lock pathmarker; the ticket lists the full file-by-file scope.Status
Ready for review. Replaces #2387 (which was opened against
developbefore the v3 branching update).