Skip to content

fix(social): distinguish platform-down from token-expired#49

Merged
paulocastellano merged 14 commits into
mainfrom
fix/platform-unavailable-vs-token-expired
May 19, 2026
Merged

fix(social): distinguish platform-down from token-expired#49
paulocastellano merged 14 commits into
mainfrom
fix/platform-unavailable-vs-token-expired

Conversation

@paulocastellano
Copy link
Copy Markdown
Contributor

Summary

When a social provider's API was down (5xx, timeout, DNS), the hourly RefreshSocialToken job and the daily VerifyWorkspaceConnections job were treating the failure as "token revoked" — marking accounts as expired and emailing the user to reconnect. Bluesky going offline triggered false-positive disconnect notifications because Bluesky access tokens are short-lived (2h), so every hourly refresh failed during the outage.

The fix introduces a separate exception for transient platform-down failures and routes all OAuth refresh HTTP calls through a small, typed client.

Changes

  • PlatformUnavailableException — new. Raised for ConnectionException, 5xx. Means "API is unreachable, retry later".
  • TokenExpiredException — unchanged. Reserved for 4xx (token is provably bad).
  • TokenRefreshClient — new. Normalizes failure semantics for OAuth refresh calls across all 8 providers. Takes a Platform enum so typos are compile-time errors and the user-facing label comes from a single source (Platform::label()).
  • ConnectionVerifier — all refresh*Token methods route through the new client. Hardcoded OAuth URLs (LinkedIn `accessToken`, YouTube `oauth2.googleapis.com`, Bluesky default PDS) moved into `config/trypost.php` alongside the existing per-platform entries.
  • RefreshSocialToken job — `PlatformUnavailableException` → log warning and stop. Does not `markAsTokenExpired`, does not notify. Next scheduled tick retries.
  • VerifyWorkspaceConnections job — `PlatformUnavailableException` from the inner refresh is treated as a transient skip (no disconnect, no email).

Why a class and not just a helper / macro

Considered three approaches:

  1. Private helper method on ConnectionVerifier (the first cut).
  2. Http::macro('forTokenRefresh', ...) with ->throw($callback).
  3. Dedicated TokenRefreshClient class.

The macro approach (2) is clean for mapping 4xx/5xx via the throw callback, but doesn't intercept ConnectionException — that's raised by Guzzle before the response exists, so every caller would still need its own try/catch. The class encapsulates both code paths in send(), callers stay at one level of nesting, and the class is independently testable.

Test plan

  • `php artisan test --compact --parallel` — full suite 1565 passing
  • Bluesky 5xx during refresh raises `PlatformUnavailable`, not `TokenExpired` (incl. when password fallback is stored)
  • LinkedIn / X / YouTube / Pinterest / TikTok / Threads / Instagram path: 5xx → unavailable, 4xx → expired, ConnectionException → unavailable
  • `RefreshSocialToken` job: PlatformUnavailable leaves status `Connected`, no `SendNotification` dispatched
  • `VerifyWorkspaceConnections` job: PlatformUnavailable leaves status `Connected`, no email queued
  • Manual: stop Bluesky in fake/test mode for >2h and confirm no reconnect email is sent

🤖 Generated with Claude Code

When a provider's API was down (5xx, timeout, DNS), the hourly
RefreshSocialToken job and daily VerifyWorkspaceConnections job were
treating it as "token revoked" and emailing the user to reconnect.
Bluesky going offline triggered false-positive disconnect notifications
because Bluesky access tokens are short-lived (2h) so every hourly
refresh failed during the outage.

- New PlatformUnavailableException: API unreachable / 5xx, transient.
  TokenExpiredException stays for 4xx (token is provably bad).
- New TokenRefreshClient: normalizes failure semantics for OAuth
  refresh HTTP calls across all providers. Takes a Platform enum so
  typos fail at compile time and the user-facing label comes from
  one source.
- ConnectionVerifier: all 8 refresh*Token methods route through the
  new client. Hardcoded OAuth URLs (LinkedIn, YouTube) and Bluesky's
  default PDS host moved into config/trypost.php alongside the
  existing per-platform entries.
- RefreshSocialToken job: PlatformUnavailableException → log warning
  and stop. Do NOT markAsTokenExpired, do NOT notify the user. Next
  scheduled tick retries.
- VerifyWorkspaceConnections job: PlatformUnavailableException from
  the inner refresh propagates and is treated as a transient skip.
Review follow-ups:

- verifyMastodon was the last hardcoded host left after the PR moved
  LinkedIn/YouTube/Bluesky to config. Adds trypost.platforms.mastodon
  .default_instance (env MASTODON_DEFAULT_INSTANCE) and reads from it.
- refreshToken() docblock now declares @throws PlatformUnavailableException
  (the whole point of the PR was missing from its contract).
- Strip the new explanatory comments inside catch blocks and tests —
  rationale lives in the commit / PR, not inline. The two comments
  inside empty `catch (TokenExpiredException) {}` blocks stay because
  there the comment is the only thing telling the reader why the
  exception is swallowed.
Earlier in the PR the new configs (linkedin.oauth_api, youtube.oauth_api,
bluesky.default_service, mastodon.default_instance) were only read by
ConnectionVerifier. The same URLs were still hardcoded in the publishers,
analytics and the Bluesky auth controller — meaning a self-hosted user
setting BLUESKY_DEFAULT_SERVICE or MASTODON_DEFAULT_INSTANCE in env would
get split behavior: refresh/verify honor the override, publish/analytics
don't.

Routes all 10 remaining call sites through the same config values so the
overrides actually work end-to-end.
LinkedInPageAnalytics and MastodonAnalytics were the only two of the 11
files refactored to read OAuth host / default instance from config that
had zero test coverage. Adds smoke tests that assert the HTTP request
hits the configured URL, so a typo in the config key (e.g. linkedin.api
vs linkedin.oauth_api) would now fail loudly.
The previous commits in this PR closed the loophole on the hourly /
daily token-refresh jobs. The same loophole remained on the publish
path: every per-platform publisher (LinkedIn, X, YouTube, TikTok,
Threads, Instagram, Pinterest, Bluesky and their Analytics siblings)
has its own refreshToken() called before publishing a scheduled post,
and all of those treated any non-2xx as TokenExpired — including 5xx.

Result before this commit: a Bluesky outage that coincided with a
scheduled publish would mark the account as expired and fail the post.

Changes:
- Route every refreshToken() in the 16 publisher / analytics classes
  through TokenRefreshClient::for(Platform::X)->send(...).
- TokenRefreshClient now also fills platformErrorCode from the HTTP
  status and pulls error_description / error.message from the JSON
  body, preserving the richer info LinkedIn / X / TikTok / Pinterest /
  Threads used to put on their TokenExpiredException.
- PublishToSocialPlatform catches PlatformUnavailableException
  explicitly: the post is marked failed (category: platform_unavailable,
  with http_status in error_context) but the account stays Connected.
  No retry inside this job — the scheduler reattempts the next run.

Test added: publish flow does NOT mark account expired when the
publisher throws PlatformUnavailable. Full suite: 1569 passing.
Every per-platform publisher and analytics class had its own private
refreshToken() implementation, all doing essentially the same thing as
ConnectionVerifier's per-platform refresh*Token methods. ~17 copies of
near-identical OAuth refresh logic across the codebase, which is how
the 5xx→TokenExpired bug got planted in the publish path even after we
fixed it on the hourly/daily jobs.

Consolidation:
- All publish/analytics paths call app(ConnectionVerifier::class)
  ->refreshToken($account) instead of their own implementation.
- 17 private refreshToken() methods deleted.
- refreshTokenWithLock helper removed from HasSocialHttpClient trait
  (its lock semantics are duplicated by ConnectionVerifier's per-
  account lock).
- ConnectionVerifier::refreshLinkedInToken passes $account->platform
  to TokenRefreshClient so the "LinkedIn" vs "LinkedIn Page" label
  in error messages is preserved.
- TokenRefreshClient now treats HTTP 429 the same as 5xx (raises
  PlatformUnavailableException) — replaces the retry-on-429 behavior
  that socialHttp() provided to the deleted refresh methods.

Net: -460 LOC. Single per-platform refresh implementation. Every fix
or new platform now lands in exactly one place.
Same shape of bug as the refreshToken duplication: the regex that strips
access_token / Bearer headers from logged HTTP bodies existed in three
near-identical copies (HasSocialHttpClient trait, TokenRefreshClient,
SocialPublishException), drifting subtly — SocialPublishException was
missing the JSON "token" pattern.

Extracts a TokenRedactor::redact(string) helper and routes all callers
through it. Adding a new token format now means one regex in one file.
Lets the call sites drop the explicit null check ternary. The redactor
itself owns the null handling — one less conditional at every caller.
- TokenRedactorTest (7 unit tests): all four regex patterns, multiple
  secrets in one body, null input, and the no-op pass-through case.
  Guards against silent regression of the redaction regexes (which
  previously drifted across three duplicated copies).
- ConnectionVerifierTest: HTTP 429 during refresh raises
  PlatformUnavailableException, not TokenExpiredException. Locks in
  the rate-limit-as-transient behavior added when consolidating the
  refresh logic.
Same rule applied to production code in earlier commits now applies to
tests: Http::fake patterns and assertions read from
config('trypost.platforms.*.oauth_api' / '.api' / '.default_service')
instead of hardcoded strings. Hardcoded URLs in tests drift silently
when the config changes.

Also documents the rule in CLAUDE.md under "External Service URLs" so
new code (and tests) start in the right place — only the host comes
from config, path/RPC segments stay inline next to the call.
Edge case from the prior commits: if the publisher throws TokenExpired
(401 path), PublishToSocialPlatform attempts refreshAccountToken() to
recover. That internally goes through ConnectionVerifier::verify, which
can now raise PlatformUnavailable (5xx). The old catch (\Throwable)
swallowed it but the loop still fell through to markAsTokenExpired —
meaning a transient platform outage during a retry could still flip the
account to expired.

Adds an explicit PlatformUnavailable catch in the retry block: marks
the post failed with category platform_unavailable and breaks before
touching the account status.
Every refresh*Token method inside ConnectionVerifier already finishes
with \$account->update(...) (mutates the model) + \$account->refresh()
(reloads from DB to pick up sibling updates like LinkedInTokenSynchronizer).
The lock-contention branch in refreshToken() also calls \$account->refresh()
before returning.

So a second \$account->refresh() in the caller was always a redundant
SELECT — no scenario where it actually pulled a different value than
what CV already left in memory. Removed from all 15 call sites, plus
fixed the duplicated "Mastodon tokens don't expire" comment.
…iling

Before: a scheduled post hitting a platform outage was marked Failed —
user had to manually retry. Now the job reschedules itself for 10
minutes later and the PostPlatform shows status "Retrying". Loops
indefinitely until the platform accepts the post.

- Adds PostPlatformStatus::Retrying (existing string column, no migration)
- PublishToSocialPlatform: PlatformUnavailable catch now calls
  rescheduleForRetry() which (a) updates the row to Retrying with
  retry_count + next_attempt_at in error_context, and (b) dispatches
  itself with a 10-minute delay. updatePostStatus() naturally leaves
  the parent Post in Publishing because Retrying is neither Published
  nor Failed.
- Same treatment for the retry-refresh edge case (publisher throws
  TokenExpired, refresh subsequently fails with PlatformUnavailable).
- i18n + frontend status config updated (en, pt-BR, es) for both
  posts.status.retrying and posts.edit.status.retrying.
- Tests: 3 new tests covering the dispatched job, the edge case path,
  and retry_count increment across attempts.
Audit found three behaviors with no explicit assertion:

- Job is dispatched with a 10-minute delay (would silently regress if
  the duration changed). Uses Carbon::setTestNow + Bus::assertDispatched
  inspecting \$job->delay.
- error_context.last_attempt_at is recorded at the moment of failure.
- updatePostStatus does NOT finalize the parent Post while any of its
  platforms is in Retrying — covers the central invariant of the
  feature (the post must stay Publishing until every platform lands
  in Published or Failed).
- A platform currently in Retrying transitions to Published when the
  next attempt succeeds — proves the loop terminates.
@paulocastellano paulocastellano merged commit e2caefe into main May 19, 2026
2 checks passed
@paulocastellano paulocastellano deleted the fix/platform-unavailable-vs-token-expired branch May 19, 2026 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant