workos · nicknisi · Feb 3, 2026 · Feb 2, 2026
diff --git a/.gitignore b/.gitignore
@@ -31,3 +31,5 @@ dist/
 
 # Eval results
 tests/eval-results/
+.next/
+.react-router/
diff --git a/skills/workos-authkit-tanstack-start/SKILL.md b/skills/workos-authkit-tanstack-start/SKILL.md
@@ -43,6 +43,7 @@ From README, extract:
 ## Directory Structure Detection
 
 **Modern TanStack Start (v1.132+)** uses `src/`:
+
 ```
 src/
 ├── start.ts              # Middleware config (CRITICAL)
@@ -54,6 +55,7 @@ src/
 ```
 
 **Legacy (vinxi-based)** uses `app/`:
+
 ```
 app/
 ├── start.ts or router.tsx
@@ -62,6 +64,7 @@ app/
 ```
 
 **Detection:**
+
 ```bash
 ls src/routes 2>/dev/null && echo "Modern (src/)" || echo "Legacy (app/)"
 ```
@@ -94,6 +97,7 @@ export default {
 ```
 
 Alternative pattern with createStart:
+
 ```typescript
 import { createStart } from '@tanstack/react-start';
 import { authkitMiddleware } from '@workos/authkit-tanstack-react-start';
@@ -132,6 +136,7 @@ export const Route = createFileRoute('/api/auth/callback')({
 ```
 
 **Key points:**
+
 - Use `handleCallbackRoute()` - do not write custom OAuth logic
 - Route path string must match the URI path exactly
 - This is a server-only route (no component needed)
@@ -221,6 +226,7 @@ function Profile() {
 
 **Cause:** Route file path doesn't match WORKOS_REDIRECT_URI
 **Fix:**
+
 - URI `/api/auth/callback` → file `src/routes/api.auth.callback.tsx` (flat) or `app/routes/api/auth/callback.tsx` (nested)
 - Route path string in `createFileRoute()` must match exactly
 
@@ -242,6 +248,7 @@ function Profile() {
 ## SDK Exports Reference
 
 **Server (main export):**
+
 - `authkitMiddleware()` - Request middleware
 - `handleCallbackRoute()` - OAuth callback handler
 - `getAuth()` - Get current session
@@ -250,6 +257,7 @@ function Profile() {
 - `switchToOrganization()` - Change org context
 
 **Client (`/client` subpath):**
+
 - `AuthKitProvider` - Context provider
 - `useAuth()` - Auth state hook
 - `useAccessToken()` - Token management
diff --git a/tests/evals/README.md b/tests/evals/README.md
@@ -1,6 +1,6 @@
 # Installer Evaluations
 
-Automated evaluation framework for testing WorkOS AuthKit installer skills against realistic project scenarios.
+Automated evaluation framework for testing WorkOS AuthKit installer skills.
 
 ## Quick Start
 
@@ -11,72 +11,137 @@ pnpm eval
 # Run specific framework
 pnpm eval --framework=nextjs
 
-# Run specific scenario
-pnpm eval --framework=react --state=example-auth0
+# Run with quality grading
+pnpm eval --quality
 ```
 
+## Success Criteria
+
+The eval framework validates against these thresholds:
+
+| Metric                  | Threshold |
+| ----------------------- | --------- |
+| First-attempt pass rate | ≥90%      |
+| With-retry pass rate    | ≥95%      |
+
+Use `--no-fail` to run without exit code validation.
+
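As a rough illustration of the Success Criteria above, the threshold check could be sketched as follows. This is not the framework's actual code: `ScenarioResult`, `checkThresholds`, and `exitCodeFor` are hypothetical names, and the assumption that a first-attempt pass means `attempts === 1` is ours.

```typescript
// Hypothetical result shape; the real eval runner's types may differ.
interface ScenarioResult {
  scenario: string;
  passed: boolean; // passed on any attempt
  attempts: number; // attempts used (assumed: 1 means no retry was needed)
}

interface ThresholdReport {
  firstAttemptRate: number;
  withRetryRate: number;
  ok: boolean;
}

// Thresholds taken from the Success Criteria table.
const FIRST_ATTEMPT_MIN = 0.9;
const WITH_RETRY_MIN = 0.95;

function checkThresholds(results: ScenarioResult[]): ThresholdReport {
  if (results.length === 0) {
    // No scenarios ran, so nothing can be claimed as passing.
    return { firstAttemptRate: 0, withRetryRate: 0, ok: false };
  }
  const firstAttempt = results.filter((r) => r.passed && r.attempts === 1).length;
  const withRetry = results.filter((r) => r.passed).length;
  const firstAttemptRate = firstAttempt / results.length;
  const withRetryRate = withRetry / results.length;
  return {
    firstAttemptRate,
    withRetryRate,
    ok: firstAttemptRate >= FIRST_ATTEMPT_MIN && withRetryRate >= WITH_RETRY_MIN,
  };
}

// Assumed behaviour of --no-fail: always exit 0, even below threshold.
function exitCodeFor(report: ThresholdReport, noFail: boolean): number {
  return report.ok || noFail ? 0 : 1;
}
```

Under these assumptions, a run with 18 of 20 first-attempt passes and 19 of 20 passes overall sits exactly on both thresholds and would exit 0.
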
 ## Test Matrix
 
-The framework tests 10 scenarios (5 frameworks × 2 project states):
+**Scenarios: 24 total (5 frameworks × 4-5 states)**
 
-| State           | Description                                          |
-| --------------- | ---------------------------------------------------- |
-| `example`       | Project with routes, components, custom config       |
-| `example-auth0` | Project with Auth0 authentication already integrated |
+| State                    | Description                       |
+| ------------------------ | --------------------------------- |
+| `example`                | Clean project, no existing auth   |
+| `example-auth0`          | Project with Auth0 to migrate     |
+| `partial-install`        | Half-completed AuthKit attempt    |
+| `typescript-strict`      | Strict TypeScript configuration   |
+| `conflicting-middleware` | Existing middleware to merge      |
 
-| Framework        | Skill                         | Key Checks                                     |
-| ---------------- | ----------------------------- | ---------------------------------------------- |
-| `nextjs`         | workos-authkit-nextjs         | middleware.ts, callback route, AuthKitProvider |
-| `react`          | workos-authkit-react          | AuthKitProvider, callback component, useAuth   |
-| `react-router`   | workos-authkit-react-router   | Auth loader, protected routes                  |
-| `tanstack-start` | workos-authkit-tanstack-start | Server functions, callback route               |
-| `vanilla-js`     | workos-authkit-vanilla-js     | Auth script, callback page                     |
+| Framework        | Skill                         | Key Checks                                      |
+| ---------------- | ----------------------------- | ----------------------------------------------- |
+| `nextjs`         | workos-authkit-nextjs         | middleware.ts, callback route, AuthKitProvider  |
+| `react`          | workos-authkit-react          | AuthKitProvider, callback component, useAuth    |
+| `react-router`   | workos-authkit-react-router   | Auth loader, protected routes                   |
+| `tanstack-start` | workos-authkit-tanstack-start | Server functions, callback route                |
+| `vanilla-js`     | workos-authkit-vanilla-js     | Auth script, callback page                      |
 
 ## CLI Options
 
 ```
---framework=<name>  Filter by framework
+--framework=<name>  Filter by framework (nextjs, react, react-router, tanstack-start, vanilla-js)
 --state=<state>     Filter by project state
---verbose, -v       Show agent tool calls and detailed output
+--quality, -q       Enable LLM-based quality grading
+--verbose, -v       Show agent output and tool calls
 --debug             Extra verbose, preserve temp dirs on failure
 --keep-on-fail      Don't cleanup temp directory when scenario fails
---retry=<n>         Number of retry attempts (default: 2)
+--retry=<n>         Retry attempts (default: 2)
 --no-retry          Disable retries
---json              Output results as JSON
+--no-fail           Don't exit 1 on threshold failure
+--sequential        Run scenarios sequentially (disable parallelism)
+--no-dashboard      Disable live dashboard, use sequential logging
+--json              Output as JSON
 --help, -h          Show help
 ```
 
-## Debugging Failures
+## Quality Grading
 
-### 1. Inspect the failure details
+When enabled with `--quality`, passing scenarios are graded on:
 
-```bash
-pnpm eval --framework=react --state=example-auth0 --verbose
-```
+| Dimension      | Description                         |
+| -------------- | ----------------------------------- |
+| Code Style     | Adherence to project conventions    |
+| Minimalism     | Changes are focused, no extras      |
+| Error Handling | Proper error handling and messages  |
+| Idiomatic      | Follows framework best practices    |
 
-### 2. Preserve the temp directory
+Each dimension is scored 1-5. See `quality-rubrics.ts` for detailed rubrics.
 
-```bash
-pnpm eval --framework=react --state=example-auth0 --keep-on-fail
-# Output will show: "Temp directory preserved: /tmp/eval-react-xxxxx"
-```
+## Latency Metrics
 
-### 3. Manually inspect the project state
+Every run tracks:
 
-```bash
-cd /tmp/eval-react-xxxxx
-ls -la
-cat middleware.ts
-```
+- **TTFT**: Time to first token
+- **Agent Thinking**: Time spent deliberating
+- **Tool Execution**: Time in tool calls
+- **Tokens/sec**: Output throughput
 
-### 4. Compare with previous runs
+## Comparing Runs
 
 ```bash
 # List recent runs
 pnpm eval:history
 
+# Show more runs
+pnpm eval:history --limit=20
+
 # Compare two runs
-pnpm eval:compare 2024-01-15T10-30-00 2024-01-16T14-45-00
+pnpm eval:diff 2024-01-15T10-30-00 2024-01-16T14-45-00
+
+# Use 'latest' as alias for most recent run
+pnpm eval:diff latest 2024-01-15T10-30-00
+```
+
+The diff command shows:
+
+- Pass rate changes (first-attempt and with-retry)
+- Skill version changes (with correlation analysis)
+- Scenario regressions/improvements
+- Latency changes (p50, p95)
+- Quality score changes
+
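The p50/p95 latency comparison listed above can be illustrated with a nearest-rank percentile. This is only a sketch: the README does not say which percentile method the diff command uses, and `percentile` and `latencyDelta` are hypothetical helpers.

```typescript
// Nearest-rank percentile over a list of samples; p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b); // copy so the input is not mutated
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Difference in p50 and p95 between two runs (positive = slower in `after`).
function latencyDelta(before: number[], after: number[]): { p50: number; p95: number } {
  return {
    p50: percentile(after, 50) - percentile(before, 50),
    p95: percentile(after, 95) - percentile(before, 95),
  };
}
```
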
+### Correlation Analysis
+
+When skill files change AND scenarios regress, the diff command highlights likely causes:
+
+```
+Likely Causes:
+  ⚠ nextjs skill changed (03133745 → a1b2c3d4) and 2 scenario(s) regressed
+```
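One way such a correlation could be computed is sketched below. It assumes each stored run records a skill hash per framework and a pass/fail flag per `framework/state` scenario; `RunSummary` and `likelyCauses` are hypothetical and not taken from the eval framework's source.

```typescript
// Hypothetical shape for the parts of a stored run used here.
interface RunSummary {
  skillVersions: Record<string, string>; // framework -> skill content hash
  scenarios: Record<string, boolean>; // "framework/state" -> passed
}

// Flag frameworks whose skill hash changed AND that have regressed scenarios.
function likelyCauses(before: RunSummary, after: RunSummary): string[] {
  const causes: string[] = [];
  for (const [framework, newHash] of Object.entries(after.skillVersions)) {
    const oldHash = before.skillVersions[framework];
    if (oldHash === undefined || oldHash === newHash) continue; // new or unchanged skill
    const regressed = Object.keys(after.scenarios).filter(
      (name) =>
        name.startsWith(`${framework}/`) &&
        before.scenarios[name] === true &&
        after.scenarios[name] === false,
    );
    if (regressed.length > 0) {
      causes.push(
        `${framework} skill changed (${oldHash} → ${newHash}) and ${regressed.length} scenario(s) regressed`,
      );
    }
  }
  return causes;
}
```

In this sketch a regression with no matching skill change produces no entry, which corresponds to the "external factors" case in the troubleshooting steps.
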
+
+## Results Storage
+
+Results are saved to `tests/eval-results/`:
+
+- `{timestamp}.json` - Full results with metadata
+- `latest.json` - Symlink to most recent
+
+Each result file includes:
+
+- Summary (pass rates, scenario counts)
+- Per-scenario results with checks
+- Latency metrics (TTFT, tool breakdown)
+- Quality grades (if enabled)
+- Metadata (skill versions, CLI version, model version)
+
+Prune old results:
+
+```bash
+# Keep only 10 most recent (default)
+pnpm eval:prune
+
+# Keep specific number
+pnpm eval:prune --keep=5
 ```
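The pruning rule can be pictured as a pure function over file names. This sketch assumes result files are named `{timestamp}.json` with timestamps like `2024-01-15T10-30-00`, which sort chronologically as plain strings; `filesToPrune` is a hypothetical helper, not the implementation behind `pnpm eval:prune`.

```typescript
// Returns the result files that pruning would delete, keeping the `keep` newest.
function filesToPrune(fileNames: string[], keep = 10): string[] {
  const runs = fileNames
    .filter((name) => name.endsWith('.json') && name !== 'latest.json') // never touch the symlink
    .sort() // ISO-like timestamps sort chronologically as strings
    .reverse(); // newest first
  return runs.slice(Math.max(keep, 0));
}
```
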
 
 ## Adding a New Fixture
@@ -135,16 +200,29 @@ checks.push(await this.buildGrader.checkBuild());
 return { passed: checks.every((c) => c.passed), checks };
 ```
 
-## Results Storage
+## Troubleshooting
 
-Results are saved to `tests/eval-results/`:
+### "Build failed" but files look correct
 
-- Each run creates `{timestamp}.json`
-- `latest.json` symlinks to most recent
-- Use `pnpm eval:history` to list runs
-- Use `pnpm eval:compare` to diff runs
+Use `--keep-on-fail` to preserve the temp directory, then inspect it:
 
-## Troubleshooting
+```bash
+pnpm eval --framework=nextjs --keep-on-fail
+cd /tmp/eval-nextjs-xxxxx && pnpm build
+```
+
+### Flaky passes/failures
+
+Increase retries: `pnpm eval --retry=3`
+
+If a scenario is consistently flaky, check whether the skill instructions are ambiguous.
+
+### Pass rate regression
+
+1. Run `pnpm eval:diff latest <previous-run>`
+2. Check the "Likely Causes" section
+3. Review the skill file changes listed
+4. If no skill files changed, check for external factors (API changes, dependency updates)
 
 ### "pnpm install failed"
 
@@ -155,21 +233,13 @@ cd tests/fixtures/{framework}/{state}
 pnpm install
 ```
 
-### "Build failed" but files look correct
+### High latency
 
-The agent may have created correct files but with syntax errors. Use `--keep-on-fail` to inspect:
+Check the tool breakdown in the summary output to identify bottlenecks:
 
-```bash
-pnpm eval --framework=nextjs --keep-on-fail
-# Then run build manually in temp dir to see full error
 ```
-
-### Flaky passes/failures
-
-LLM responses vary. Use `--retry=3` for more attempts:
-
-```bash
-pnpm eval --retry=3
+Tool Time Breakdown (total across all scenarios):
+  Bash: 206.5s (27 calls)
+  Read: 54.3s (14 calls)
+  ...
 ```
-
-If a scenario is consistently flaky, check if the skill instructions are ambiguous.
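
For reference, a breakdown like the "Tool Time Breakdown" output in the High latency section could be produced by summing per-call durations by tool name. The sketch below assumes each tool call is recorded with a name and a duration in milliseconds; `ToolCall` and `toolBreakdown` are hypothetical and not part of the eval framework.

```typescript
// Hypothetical record of a single tool invocation.
interface ToolCall {
  tool: string;
  durationMs: number;
}

// Sum durations per tool and format them, slowest tool first.
function toolBreakdown(calls: ToolCall[]): string[] {
  const totals = new Map<string, { ms: number; count: number }>();
  for (const { tool, durationMs } of calls) {
    const entry = totals.get(tool) ?? { ms: 0, count: 0 };
    entry.ms += durationMs;
    entry.count += 1;
    totals.set(tool, entry);
  }
  return Array.from(totals.entries())
    .sort((a, b) => b[1].ms - a[1].ms)
    .map(([tool, { ms, count }]) => `  ${tool}: ${(ms / 1000).toFixed(1)}s (${count} calls)`);
}
```

Sorting by total time puts the most likely bottleneck at the top of the list.
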