Skip to content

Add Quarantine recipes companion page covering 4 clustered gaps#59

Draft
samgutentag wants to merge 1 commit into
mainfrom
sam-gutentag/quarantine-recipes
Draft

Add Quarantine recipes companion page covering 4 clustered gaps#59
samgutentag wants to merge 1 commit into
mainfrom
sam-gutentag/quarantine-recipes

Conversation

@samgutentag
Copy link
Copy Markdown
Member

Summary

  • New page: flaky-tests/quarantining/recipes.mdx
  • docs.json nav entry added under the existing Quarantining group, slotted ahead of quarantine-service-availability
  • Cross-link added from quarantining/index.mdx ("Recipes and operational patterns" pointing at the new page)
  • Covers: semantics cheat sheet, bulk operations (incl. post-incident bulk-unquarantine via pass-on-retry recovery threshold), conditional/per-PR limitations, fix verification without un-quarantining, Events tab + history clarity, cross-repo + private-fork pattern

Why

Sourced from customer feedback mining — bundle of 4 adjacent quarantine clusters (26 pairs / 8+ distinct customers total) all targeting the quarantining surface. The existing quarantining/index.mdx is already long and doesn't address the recipe-style questions customers consistently ask. Splitting recipe content into a companion page keeps the index page focused on what quarantining is and how to enable it, and gives operational answers a single home.

Items flagged for review

  • Page placement — chose flaky-tests/quarantining/recipes.mdx (the actual on-disk structure is flaky-tests/quarantining/, not flaky-tests/management/quarantining/ as the cluster files suggested). Verify the nav placement (currently above quarantine-service-availability) is the right order.
  • State propagation timing — cheat sheet says "up to a minute" for manual override propagation. The cluster examples reference 30-60s and "a minute" interchangeably. Confirm against current eng config or soften further if needed.
  • Bulk-unquarantine via pass-on-retry recovery threshold — recipe lifted from the trunk-gusto Slack answer. Verify this is still the recommended workaround and that the threshold setting path (Settings > Repositories > repo > Flaky Tests > Pass-on-Retry Monitor) is current.
  • Per-PR quarantine workaround — the recommended pattern is "leave the test quarantined and read the underlying test result from CI logs / Test History." Confirm this is what eng recommends, and that the Quarantined: Only filter on Test History is the right surface to point at.
  • Cross-repo dry-run pattern--repo-url + --dry-run=true (CLI) and repo-url: + dry-run: true (action). Verify both flag names against the current CLI and uploader action.
  • Events tab scoping — the "Quarantine Event" filter is described as surfacing manual ALWAYS/NEVER only, not auto-quarantine history (which lives under Flake Detection events). Confirm against the current Events tab UI — the cluster file notes the UX is still being refined.
  • Bulk "Never Quarantine" — described as a tracked roadmap item. Confirm phrasing — current cluster answer from Q2-2026 was "API tune-ups planned, by-hand is the only option for now."

Customer signal

Cluster bundle: 4 quarantine-adjacent clusters consolidated for shared IA.

  • quarantine-automation-behavior (16 pairs / 8 customers, verdict partial)
  • conditional-quarantine-by-pr-or-branch (3 pairs / 1 customer, verdict missing)
  • quarantine-history-and-events-clarity (4 pairs / 3 customers, verdict partial)
  • cross-repo-quarantine-and-private-fork (3 pairs / 1 customer, verdict partial)

Total: 26 pairs across 8+ distinct customers.

Representative permalinks (full lists in each cluster file at findings/clusters/<id>.json):

…onal, cross-repo, fix verification

Sourced from customer feedback mining — bundle of 4 adjacent quarantine
clusters (quarantine-automation-behavior, conditional-quarantine-by-pr-or-branch,
quarantine-history-and-events-clarity, cross-repo-quarantine-and-private-fork)
all targeting the quarantining surface. New companion page keeps the main
quarantining index focused on semantics while covering recipe-style content
in one place.
@samgutentag samgutentag added the needs review PR sourced from customer-feedback-mining; needs human scrutiny for accuracy before merge label May 20, 2026
@mintlify
Copy link
Copy Markdown
Contributor

mintlify Bot commented May 20, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
trunk 🟢 Ready View Preview May 20, 2026, 11:56 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

@samgutentag
Copy link
Copy Markdown
Member Author

Code verification (2026-05-21): 3 confirmed / 3 contradicted / 0 ambiguous / 1 unverifiable

# Claim Verdict Source
1 State propagation takes "up to a minute" after a manual override unverifiable no numeric constant found
2 Bulk-unquarantine recipe uses "recovery threshold" in Settings > Repositories > repo > Flaky Tests > Pass-on-Retry Monitor contradicted pass-on-retry-monitor-dialogs.tsx#L318
3 Test History has a Quarantined: Only filter confirmed test-run-history-filters.tsx#L21-L25
4 Action uploader input is repo-url: (CLI is --repo-url) contradicted action.yaml#L68
5 Events tab "Quarantine Event" filter shows manual overrides only; auto-quarantine history is under "Flake Detection" confirmed (with label nits) events-tab.tsx#L200-L205, audit_trail.ts#L569-L605
6 No bulk "Never Quarantine" API endpoint confirmed routes_protected.ts (no override write route)
7 Broken tests do not auto-quarantine, only flaky ones do contradicted (in docstring) but confirmed (in execution path) shared_schema.ts#L200-L210 vs test_case_status_clickhouse.ts#L93-L117

Summary of items the author should resolve before publishing:

  1. Claim 2 (recovery threshold) — The actual UI label is "Resolution period", not "recovery threshold". The settings path is wrong too: the Pass-on-Retry monitor is edited from the Monitors page under Flaky Tests > repo > Monitors (not under a "Settings > Repositories > repo > Flaky Tests > Pass-on-Retry Monitor" sub-page). Recommend rewriting the steps to: open the Monitors page for the repo, edit the Pass-on-Retry monitor, lower the Resolution period.

  2. Claim 4 (action YAML) — The action input is gh-repo-url:, not repo-url:. The current YAML example will silently no-op because there is no repo-url input on analytics-uploader@v1. The CLI flag --repo-url is correct. Fix the action snippet to use gh-repo-url:.

  3. Claim 7 (broken vs flaky) — There is a meaningful discrepancy in source: the public-api docstring for the quarantined field says auto-quarantine applies to tests whose status is flaky or broken, but the actual execution path (getFlakyTestCaseIdsFromClickHouse) filters by status = "FLAKY" only. The docs page's claim matches the execution path, which is the behavior customers actually see. The schema description is the outlier. Worth flagging to engineering, since either the description should be tightened or the broken path should be added (this is a real product-truth question). For the docs page itself, the claim is correct.

  4. Claim 5 (events tab labels) — Filter labels in the UI are "Quarantine" and "Monitors", not "Quarantine Event" and "Flake Detection". The internal AuditEventCategory enum uses QUARANTINE and FLAKE_DETECTION, but the visible labels are different. Recommend matching the user-visible labels in prose.

  5. Claim 1 (propagation timing) — Could not find a 60_000-style constant or documented TTL/polling interval in the override write path (it is a synchronous Prisma upsert) or in getQuarantineConfig (no caching layer visible in the public-api or controller). The "up to a minute" softening may be conservative or based on read-replica lag (PrismaFlakyReadOnlyClient is used in getQuarantinedTests), but there is no numeric source. Recommend either dropping the specific "up to a minute" figure or confirming the source with engineering.


Source #1 — Propagation timing "up to a minute" (unverifiable)

File: trunk-io/trunk/trunk/services/metrics-flaky-tests/controllers/index.ts#L1151-L1208

async setTestQuarantiningOverride({
  runInfoId,
  repoId,
  userId,
  quarantiningOverride,
  annotation,
}: {
  runInfoId: string;
  repoId: string;
  userId: string;
  quarantiningOverride: TestQuarantiningOverride;
  annotation?: string;
}) {
  // ... (read previous override)
  const [{ testCaseV2 }, quarantineAuditLog] = await this.prismaFlaky.$transaction([
    this.prismaFlaky.runInfoMeta.upsert({
      create: { runInfoId, quarantiningOverride, repoId },
      update: { quarantiningOverride, repoId },
      where: { runInfoId_repoId: { runInfoId, repoId } },
      select: { testCaseV2: true },
    }),
    // ... (audit log)
  ]);

Reasoning: The override write is a synchronous Prisma upsert against the primary database. The CLI's getQuarantineConfig route in routes_protected.ts calls getQuarantinedTestIds which reads from the same data with no documented cache. The read path does use PrismaFlakyReadOnlyClient (see quarantined_tests_controller.ts line 45, 78, 203), so read-replica replication lag could account for some delay, but there is no 60_000-style constant or documented TTL anywhere in the override or fetch paths that would justify the "up to a minute" figure. The claim might be conservatively-phrased real-world UX, but it is not backed by an identifiable source constant. Recommend either softening to "may not be immediate" or confirming the figure with the flaky-tests team before publishing.

Source #2 — Recovery threshold setting and Settings path (contradicted)

File: trunk-io/trunk2/ts/apps/frontend/src/app/(everything-else)/[orgSlug]/flaky-tests/repo/[repoId]/monitor/pass-on-retry-monitor-dialogs.tsx#L315-L322

<div className="space-y-1.5">
  <div className="flex items-center justify-between gap-4">
    <div>
      <Label>Resolution period</Label>
      <p className="text-xs text-muted-foreground mt-0.5">
        How many days without pass-on-retry behavior before the monitor
        resolves. Shorter = faster resolution, longer = more confidence.
      </p>

File: trunk-io/trunk2/ts/apps/frontend/src/app/(everything-else)/[orgSlug]/flaky-tests/repo/[repoId]/monitor/monitors-header.tsx#L24-L27

<div>
  <h1 className="text-2xl font-bold">Monitors</h1>
  <p className="text-muted-foreground mt-1">
    Configure how tests are monitored and classified for {repoName}

Reasoning: The user-visible label on the stepper is "Resolution period", not "recovery threshold". The underlying GraphQL field is recoveryDays (Int, days), which the dialog steps by 1 day at a time. The settings path described in the docs ("Settings > Repositories > repo > Flaky Tests > Pass-on-Retry Monitor") does not match the actual route. The pass-on-retry monitor is configured from the repo-level Monitors page (URL pattern: /<orgSlug>/flaky-tests/repo/<repoId>/monitor), where the user clicks into the existing pass-on-retry monitor row to open this dialog. There is no "Pass-on-Retry Monitor" sub-tab under Settings. Recommend rewriting the recipe steps as: open Flaky Tests > [repo] > Monitors, find the Pass-on-Retry monitor row, click to edit, lower the Resolution period, save. Note the underlying value is in days (minimum 1 day), so "3 days" in the example is fine.

Source #3 — Quarantined: Only filter on Test History (confirmed)

File: trunk-io/trunk2/ts/apps/frontend/src/app/(everything-else)/[orgSlug]/flaky-tests/repo/[repoId]/test/[testId]/components/test-run-history-filters.tsx#L21-L25

const QUARANTINED_OPTIONS: { value: QuarantinedFilter; label: string }[] = [
  { value: "include", label: "Include" },
  { value: "exclude", label: "Exclude" },
  { value: "only", label: "Only" },
];

Reasoning: The Test History view's Quarantined filter is a toggle group with three options: Include, Exclude, Only. Setting it to Only displays exactly the runs where Trunk overrode the exit code due to quarantine state. This matches the recipe in the docs page. The component is mounted via the Test History tab (test-detail-tabs.tsx declares the <TabsTrigger value="history">Test History</TabsTrigger>), so the user-facing path "Test History tab > Quarantined: Only" is accurate.

Source #4 — Action uploader `repo-url:` input (contradicted)

File: trunk-io/analytics-uploader/action.yaml#L68-L70

gh-repo-url:
  description: The URL of the GitHub repo
  default: ${{ github.event.pull_request.head.repo.html_url }}

File: trunk-io/analytics-uploader/src/inputs.ts#L37-L46

prTitle: core.getInput("pr-title"),
ghRepoUrl: core.getInput("gh-repo-url"),
ghRepoHeadSha: core.getInput("gh-repo-head-sha"),
// ...
dryRun: parseBoolean(core.getInput("dry-run")),

File: trunk-io/analytics-cli/cli/src/upload_command.rs#L95-L100

#[arg(
    long,
    env = constants::TRUNK_REPO_URL_ENV,
    help = "Override the repository URL (normally from git config remote.origin.url)."
)]
pub repo_url: Option<String>,

Reasoning: The CLI flag --repo-url is correct (analytics-cli line 100, derived from TRUNK_REPO_URL_ENV). The action input is gh-repo-url, not repo-url. The action wrapper translates gh-repo-url into the GH_REPO_URL env var (analytics-uploader src/args.ts line 115), which the CLI picks up. The action YAML example in the recipe (repo-url: "https://github.com/your-org/main-repo.git") is a no-op input that will be silently ignored by core.getInput(), and the override will not take effect. The action snippet needs to be gh-repo-url: to match the declared input. Also worth noting: that input defaults to ${{ github.event.pull_request.head.repo.html_url }}, so overriding it in this scenario is a legitimate use case. The dry-run: input name is correct.

Source #5 — Events tab Quarantine vs Flake Detection scoping (confirmed with label nits)

File: trunk-io/trunk2/ts/apps/frontend/src/app/(everything-else)/[orgSlug]/flaky-tests/repo/[repoId]/test/[testId]/components/events-tab.tsx#L200-L205

const FILTER_OPTIONS = [
  { value: AuditEventCategory.All, label: "All" },
  { value: AuditEventCategory.FlakeDetection, label: "Monitors" },
  { value: AuditEventCategory.Quarantine, label: "Quarantine" },
  { value: AuditEventCategory.Ticketing, label: "Ticketing" },
] as const;

File: trunk-io/trunk2/ts/apps/frontend/src/lib/services/flaky-tests/test-case-event-audit-trail.ts#L569-L605

async function fetchQuarantineEvents({
  repoId,
  testCaseId,
  pageSize,
}: { /* ... */ }): Promise<{ events: QuarantineAuditTrailEvent[]; totalCount: number }> {
  try {
    const prismaFlaky = await prisma();
    const [rows, totalCount] = await Promise.all([
      prismaFlaky.flaky_tests_audit_log.findMany({
        where: { repoId, testCaseId },
        orderBy: { createdAt: "desc" as const },
        take: pageSize,
      }),
      // ...
    ]);
    const events = rows.map(
      ({ id, createdAt, description, userId, annotation }) => ({
        id,
        kind: "QUARANTINE" as const,
        // ...
      }),
    );

Reasoning: The Quarantine filter sources from flaky_tests_audit_log, which is populated by setTestQuarantiningOverride (manual ALWAYS / NEVER override actions, see controllers/index.ts#L1193-L1200). It does not include auto-quarantine consequence events; those live in the flake-detection events channel under the FlakeDetection category, which sources from ClickHouse test_detection_events via getTestCaseAllDetectionEvents. So the semantic distinction in the docs page (manual-only on the Quarantine filter, auto-quarantine history under Flake Detection) is correct. The only nit: the user-visible filter labels are "Quarantine" and "Monitors" (not "Quarantine Event" and "Flake Detection" as the docs prose says). Recommend updating to match the visible labels.

Source #6 — No bulk "Never Quarantine" API endpoint (confirmed)

File: trunk-io/trunk/trunk/services/public-api/src/routes_protected.ts

The complete list of public quarantine-related routes in routes_protected.ts:

// Line 2244
path: "/flaky-tests/list-quarantined-tests",
// Line 2991
router.post("/metrics/getQuarantineConfig", async (req, res) => { /* ... */ });

Reasoning: Of the wrapped public-API routes in routes_protected.ts, only list-quarantined-tests and the internal getQuarantineConfig (used by the analytics CLI to fetch the current quarantine state) involve quarantine. There is no set-quarantining-override or update-quarantining-override route in the public API. The internal gRPC SetTestQuarantiningOverride exists (see trunk/services/edge/src/edge_service.ts and flaky_tests_service.ts#L723) but it operates on a single runInfoId per call and is not exposed via the customer-facing REST API. So both the narrower claim ("no bulk Never Quarantine API") and the broader claim ("API support for bulk overrides is on the roadmap") are accurate for the public surface.

Source #7 — Broken tests do not auto-quarantine (contradicted in docstring, confirmed in execution)

File: trunk-io/trunk/trunk/services/public-api-shared/schema.ts#L200-L210

quarantined: z.boolean().openapi({
  description: `Whether the test is quarantined.

This is \`true\` when quarantining is enabled for the repo and either of the following applies:

- The quarantine override is set to \`ALWAYS_QUARANTINE\` for this test
- Automatic quarantining is enabled for the repo, and this test's status is either \`flaky\` or \`broken\`

If this is \`true\`, the next test run will be marked as passed even if the test run conclusion is
failed.`,
}),

File: trunk-io/trunk/trunk/services/metrics-flaky-tests/controllers/test_case_status_clickhouse.ts#L93-L117

export const getFlakyTestCaseIdsFromClickHouse = async (
  client: ClickHouseClient,
  repoId: string,
  excludeTestIds: string[] = [],
  lookbackDays: number = QUARANTINED_TESTS_LOOKBACK_DAYS,
): Promise<FlakyTestCaseWithUpdatedAt[]> => {
  // ...
  const result = await client.query({
    query: `SELECT tcs.test_case_id, tcs.updated_at FROM ... WHERE tcs.repo_id = {repoId:String} AND tcs.status = {status:String}${excludeClause} ORDER BY tcs.updated_at DESC, tcs.test_case_id ASC`,
    query_params:
      excludeTestIds.length > 0
        ? { repoId, status: "FLAKY", excludeIds: excludeTestIds, lookbackDays }
        : { repoId, status: "FLAKY", lookbackDays },
    format: "JSONEachRow",
  });

File: trunk-io/trunk/trunk/services/metrics-flaky-tests/controllers/quarantined_tests_controller.ts#L107-L155

/**
 * Get all quarantined test ids ordered by the last status updated at.
 *
 * 1. get quarantine overrides
 * 2. get auto-quarantined tests (flaky, non-never-quarantine) from ClickHouse
 * 3. combine always-quarantine and auto-quarantined tests and sort
 */
async getAllQuarantinedTestIdsOrderedByLastStatusUpdatedAtDesc({
  repoId,
  autoQuarantineEnabled,
}: { /* ... */ }) {
  // ...
  if (autoQuarantineEnabled) {
    const autoQuarantinedTestsFromClickHouse = await getFlakyTestCaseIdsFromClickHouse(
      this.nativeClickHouseClientFlaky,
      repoId,
      quarantineOverridesIds,

Reasoning: This is the most interesting finding. There are two sources telling different stories.

The public-API description string (shared_schema.ts#L206) says auto-quarantine applies when a test's status is flaky or broken. But the actual execution path that selects auto-quarantined tests (getFlakyTestCaseIdsFromClickHouse) filters strictly by status = "FLAKY" (lines 116-117). The function name itself ("FlakyTestCaseIds"), the inline comment ("auto-quarantined tests (flaky, non-never-quarantine)"), and the ClickHouse query parameter all confirm that only FLAKY tests are picked up. Broken tests are excluded from auto-quarantine in production behavior.

The docs page's claim ("broken tests never auto-quarantine") matches the execution path, which is what customers actually experience. So the docs page is correct about product behavior. The public-api schema description is the outlier and should likely be tightened to drop the "or broken" clause, since it does not reflect the actual implementation. Worth raising with the flaky-tests team separately — this is a real product-truth discrepancy that the docs change happens to surface. For PR #59 itself, no change needed; the claim aligns with reality.

@samgutentag samgutentag marked this pull request as draft May 21, 2026 17:51
@samgutentag samgutentag added the needs eng review verify-docs-against-code: at least one claim contradicts source. label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs eng review verify-docs-against-code: at least one claim contradicts source. needs review PR sourced from customer-feedback-mining; needs human scrutiny for accuracy before merge

Development

Successfully merging this pull request may close these issues.

1 participant