Merged
20 commits
0419fb7
fix: conditionally show login/logout in auth0 example nav
nicknisi Feb 2, 2026
ee87dd5
chore: add .next/ to gitignore
nicknisi Feb 2, 2026
d17d785
chore: run the formatter
nicknisi Feb 2, 2026
8baf3c6
chore: add .react-router/ to gitignore
nicknisi Feb 2, 2026
c0e0f78
fix(evals): allow AuthKitProvider in extracted provider file
nicknisi Feb 2, 2026
a25160b
feat(evals): add success criteria validation and skill versioning
nicknisi Feb 2, 2026
7f55627
feat(evals): add edge case test fixtures for all frameworks
nicknisi Feb 2, 2026
b1356aa
feat(evals): add latency tracking for performance metrics
nicknisi Feb 2, 2026
a9d6aee
feat(evals): add LLM-based quality grading with --quality flag
nicknisi Feb 2, 2026
6535d57
fix(evals): initialize git in fixtures for quality diff capture
nicknisi Feb 2, 2026
0ab044d
fix(evals): exclude node_modules and lock files from quality diff
nicknisi Feb 3, 2026
5c8b84e
fix(evals): truncate large diffs to avoid rate limits
nicknisi Feb 3, 2026
ff1bd57
fix(evals): exclude build directories from quality diff
nicknisi Feb 3, 2026
a71063a
refactor(evals): capture only source files in quality diff
nicknisi Feb 3, 2026
2576f0e
feat(evals): show quality reasoning in verbose mode
nicknisi Feb 3, 2026
2541d67
fix(evals): exclude node_modules from source file diff
nicknisi Feb 3, 2026
e12c712
refactor(evals): use key files instead of raw diff for quality grading
nicknisi Feb 3, 2026
43f8ed6
feat(evals): add diff, history, and prune commands
nicknisi Feb 3, 2026
4c96be2
refactor(evals): use chain-of-thought prompting for quality grading
nicknisi Feb 3, 2026
d09f6f5
fix(evals): show all fixture states in results matrix
nicknisi Feb 3, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -31,3 +31,5 @@ dist/

# Eval results
tests/eval-results/
.next/
.react-router/
8 changes: 8 additions & 0 deletions skills/workos-authkit-tanstack-start/SKILL.md
@@ -43,6 +43,7 @@ From README, extract:
## Directory Structure Detection

**Modern TanStack Start (v1.132+)** uses `src/`:

```
src/
├── start.ts # Middleware config (CRITICAL)
@@ -54,6 +55,7 @@ src/
```

**Legacy (vinxi-based)** uses `app/`:

```
app/
├── start.ts or router.tsx
@@ -62,6 +64,7 @@ app/
```

**Detection:**

```bash
ls src/routes 2>/dev/null && echo "Modern (src/)" || echo "Legacy (app/)"
```
@@ -94,6 +97,7 @@ export default {
```

Alternative pattern with createStart:

```typescript
import { createStart } from '@tanstack/react-start';
import { authkitMiddleware } from '@workos/authkit-tanstack-react-start';
@@ -132,6 +136,7 @@ export const Route = createFileRoute('/api/auth/callback')({
```

**Key points:**

- Use `handleCallbackRoute()` - do not write custom OAuth logic
- Route path string must match the URI path exactly
- This is a server-only route (no component needed)
@@ -221,6 +226,7 @@ function Profile() {

**Cause:** Route file path doesn't match WORKOS_REDIRECT_URI
**Fix:**

- URI `/api/auth/callback` → file `src/routes/api.auth.callback.tsx` (flat) or `app/routes/api/auth/callback.tsx` (nested)
- Route path string in `createFileRoute()` must match exactly

@@ -242,6 +248,7 @@ function Profile() {
## SDK Exports Reference

**Server (main export):**

- `authkitMiddleware()` - Request middleware
- `handleCallbackRoute()` - OAuth callback handler
- `getAuth()` - Get current session
@@ -250,6 +257,7 @@ function Profile() {
- `switchToOrganization()` - Change org context

**Client (`/client` subpath):**

- `AuthKitProvider` - Context provider
- `useAuth()` - Auth state hook
- `useAccessToken()` - Token management
186 changes: 128 additions & 58 deletions tests/evals/README.md
@@ -1,6 +1,6 @@
# Installer Evaluations

Automated evaluation framework for testing WorkOS AuthKit installer skills against realistic project scenarios.
Automated evaluation framework for testing WorkOS AuthKit installer skills.

## Quick Start

@@ -11,72 +11,137 @@ pnpm eval
# Run specific framework
pnpm eval --framework=nextjs

# Run specific scenario
pnpm eval --framework=react --state=example-auth0
# Run with quality grading
pnpm eval --quality
```

## Success Criteria

The eval framework validates against these thresholds:

| Metric | Threshold |
| ----------------------- | --------- |
| First-attempt pass rate | ≥90% |
| With-retry pass rate | ≥95% |

Use `--no-fail` to run without exit code validation.
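
As a rough illustration of how the thresholds above could be checked, here is a minimal TypeScript sketch. The `ScenarioOutcome` shape and the function names are hypothetical and are not taken from the eval framework; only the 90% and 95% thresholds and the `--no-fail` behavior come from this section.

```typescript
// Hypothetical result shape; the framework's real types may differ.
interface ScenarioOutcome {
  passed: boolean; // final result after any retries
  attempts: number; // 1 means the result was reached on the first try
}

// Thresholds from the Success Criteria table.
const FIRST_ATTEMPT_THRESHOLD = 0.9;
const WITH_RETRY_THRESHOLD = 0.95;

// Compute both pass rates as fractions in [0, 1].
function passRates(outcomes: ScenarioOutcome[]): { firstAttempt: number; withRetry: number } {
  const total = outcomes.length;
  if (total === 0) return { firstAttempt: 0, withRetry: 0 };
  const firstAttempt = outcomes.filter((o) => o.passed && o.attempts === 1).length / total;
  const withRetry = outcomes.filter((o) => o.passed).length / total;
  return { firstAttempt, withRetry };
}

// True when both rates meet their thresholds.
function meetsThresholds(outcomes: ScenarioOutcome[]): boolean {
  const { firstAttempt, withRetry } = passRates(outcomes);
  return firstAttempt >= FIRST_ATTEMPT_THRESHOLD && withRetry >= WITH_RETRY_THRESHOLD;
}

// Exit code 1 on threshold failure, unless --no-fail was passed.
function exitCode(outcomes: ScenarioOutcome[], noFail: boolean): number {
  return noFail || meetsThresholds(outcomes) ? 0 : 1;
}
```

Under this sketch, a scenario that only passes on a retry counts toward the with-retry rate but not the first-attempt rate.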

## Test Matrix

The framework tests 10 scenarios (5 frameworks × 2 project states):
**Scenarios: 24 total (5 frameworks × 4-5 states)**

| State | Description |
| --------------- | ---------------------------------------------------- |
| `example` | Project with routes, components, custom config |
| `example-auth0` | Project with Auth0 authentication already integrated |
| State | Description |
| ------------------------ | --------------------------------- |
| `example` | Clean project, no existing auth |
| `example-auth0` | Project with Auth0 to migrate |
| `partial-install` | Half-completed AuthKit attempt |
| `typescript-strict` | Strict TypeScript configuration |
| `conflicting-middleware` | Existing middleware to merge |

| Framework | Skill | Key Checks |
| ---------------- | ----------------------------- | ---------------------------------------------- |
| `nextjs` | workos-authkit-nextjs | middleware.ts, callback route, AuthKitProvider |
| `react` | workos-authkit-react | AuthKitProvider, callback component, useAuth |
| `react-router` | workos-authkit-react-router | Auth loader, protected routes |
| `tanstack-start` | workos-authkit-tanstack-start | Server functions, callback route |
| `vanilla-js` | workos-authkit-vanilla-js | Auth script, callback page |
| Framework | Skill | Key Checks |
| ---------------- | ----------------------------- | ----------------------------------------------- |
| `nextjs` | workos-authkit-nextjs | middleware.ts, callback route, AuthKitProvider |
| `react` | workos-authkit-react | AuthKitProvider, callback component, useAuth |
| `react-router` | workos-authkit-react-router | Auth loader, protected routes |
| `tanstack-start` | workos-authkit-tanstack-start | Server functions, callback route |
| `vanilla-js` | workos-authkit-vanilla-js | Auth script, callback page |

## CLI Options

```
--framework=<name> Filter by framework
--framework=<name> Filter by framework (nextjs, react, react-router, tanstack-start, vanilla-js)
--state=<state> Filter by project state
--verbose, -v Show agent tool calls and detailed output
--quality, -q Enable LLM-based quality grading
--verbose, -v Show agent output and tool calls
--debug Extra verbose, preserve temp dirs on failure
--keep-on-fail Don't cleanup temp directory when scenario fails
--retry=<n> Number of retry attempts (default: 2)
--retry=<n> Retry attempts (default: 2)
--no-retry Disable retries
--json Output results as JSON
--no-fail Don't exit 1 on threshold failure
--sequential Run scenarios sequentially (disable parallelism)
--no-dashboard Disable live dashboard, use sequential logging
--json Output as JSON
--help, -h Show help
```

## Debugging Failures
## Quality Grading

### 1. Inspect the failure details
When enabled with `--quality`, passing scenarios are graded on:

```bash
pnpm eval --framework=react --state=example-auth0 --verbose
```
| Dimension | Description |
| -------------- | ----------------------------------- |
| Code Style | Adherence to project conventions |
| Minimalism | Changes are focused, no extras |
| Error Handling | Proper error handling and messages |
| Idiomatic | Follows framework best practices |

### 2. Preserve the temp directory
Each dimension is scored 1-5. See `quality-rubrics.ts` for detailed rubrics.

```bash
pnpm eval --framework=react --state=example-auth0 --keep-on-fail
# Output will show: "Temp directory preserved: /tmp/eval-react-xxxxx"
```
## Latency Metrics

### 3. Manually inspect the project state
Every run tracks:

```bash
cd /tmp/eval-react-xxxxx
ls -la
cat middleware.ts
```
- **TTFT**: Time to first token
- **Agent Thinking**: Time spent deliberating
- **Tool Execution**: Time in tool calls
- **Tokens/sec**: Output throughput

### 4. Compare with previous runs
## Comparing Runs

```bash
# List recent runs
pnpm eval:history

# Show more runs
pnpm eval:history --limit=20

# Compare two runs
pnpm eval:compare 2024-01-15T10-30-00 2024-01-16T14-45-00
pnpm eval:diff 2024-01-15T10-30-00 2024-01-16T14-45-00

# Use 'latest' as alias for most recent run
pnpm eval:diff latest 2024-01-15T10-30-00
```

The diff command shows:

- Pass rate changes (first-attempt and with-retry)
- Skill version changes (with correlation analysis)
- Scenario regressions/improvements
- Latency changes (p50, p95)
- Quality score changes
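
The README does not say how p50 and p95 are derived. One common choice is the nearest-rank method, sketched below in TypeScript. The function names and the idea of feeding in per-scenario durations in milliseconds are assumptions for illustration, not the framework's implementation.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Assumes 0 < p <= 100.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical helper: change in p50/p95 between two runs (current minus previous).
function latencyDelta(previousMs: number[], currentMs: number[]): { p50: number; p95: number } {
  return {
    p50: percentile(currentMs, 50) - percentile(previousMs, 50),
    p95: percentile(currentMs, 95) - percentile(previousMs, 95),
  };
}
```

A positive delta in this sketch means the newer run was slower at that percentile.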

### Correlation Analysis

When skill files change AND scenarios regress, the diff command highlights likely causes:

```
Likely Causes:
⚠ nextjs skill changed (03133745 → a1b2c3d4) and 2 scenario(s) regressed
```
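
The correlation described above can be sketched as a join between skill version changes and scenario regressions. The `RunSummary` shape, the `framework/state` key format, and the function name below are hypothetical; the actual result file schema may differ.

```typescript
// Hypothetical per-run summary; not the real result file schema.
interface RunSummary {
  skillVersions: Record<string, string>; // framework -> skill version hash
  scenarios: Record<string, boolean>; // "framework/state" -> passed
}

// Report frameworks whose skill hash changed AND that have at least one
// scenario that passed in the previous run but fails in the current one.
function likelyCauses(previous: RunSummary, current: RunSummary): string[] {
  const causes: string[] = [];
  for (const [framework, newHash] of Object.entries(current.skillVersions)) {
    const oldHash = previous.skillVersions[framework];
    if (oldHash === undefined || oldHash === newHash) continue;
    const regressed = Object.keys(current.scenarios).filter(
      (key) =>
        key.startsWith(`${framework}/`) &&
        previous.scenarios[key] === true &&
        current.scenarios[key] === false,
    );
    if (regressed.length > 0) {
      causes.push(
        `${framework} skill changed (${oldHash} → ${newHash}) and ${regressed.length} scenario(s) regressed`,
      );
    }
  }
  return causes;
}
```

Regressions in a framework whose skill hash did not change are deliberately left out of this list, since they point to external factors rather than skill edits.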

## Results Storage

Results saved to `tests/eval-results/`:

- `{timestamp}.json` - Full results with metadata
- `latest.json` - Symlink to most recent

Each result file includes:

- Summary (pass rates, scenario counts)
- Per-scenario results with checks
- Latency metrics (TTFT, tool breakdown)
- Quality grades (if enabled)
- Metadata (skill versions, CLI version, model version)

Prune old results:

```bash
# Keep only 10 most recent (default)
pnpm eval:prune

# Keep specific number
pnpm eval:prune --keep=5
```
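
A minimal sketch of the selection step behind pruning, assuming result files are named `{timestamp}.json` with timestamps like `2024-01-15T10-30-00` that sort chronologically as strings. The function name is hypothetical, and the real command may choose files differently.

```typescript
// Given the file names in tests/eval-results/, return the ones a prune
// would delete: everything except the `keep` most recent runs.
// `latest.json` is skipped because it only points at the most recent run.
function filesToPrune(fileNames: string[], keep = 10): string[] {
  const runsNewestFirst = fileNames
    .filter((name) => name.endsWith('.json') && name !== 'latest.json')
    .sort() // fixed-width timestamps sort chronologically as plain strings
    .reverse();
  return runsNewestFirst.slice(keep);
}
```

For example, with three result files and `keep` set to 2, only the oldest file would be returned for deletion.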

## Adding a New Fixture
@@ -135,16 +200,29 @@ checks.push(await this.buildGrader.checkBuild());
return { passed: checks.every((c) => c.passed), checks };
```

## Results Storage
## Troubleshooting

Results are saved to `tests/eval-results/`:
### "Build failed" but files look correct

- Each run creates `{timestamp}.json`
- `latest.json` symlinks to most recent
- Use `pnpm eval:history` to list runs
- Use `pnpm eval:compare` to diff runs
Use `--keep-on-fail` to preserve the temp directory, then run the build manually to inspect the failure:

## Troubleshooting
```bash
pnpm eval --framework=nextjs --keep-on-fail
cd /tmp/eval-nextjs-xxxxx && pnpm build
```

### Flaky passes/failures

Increase retries: `pnpm eval --retry=3`

If consistently flaky, check if skill instructions are ambiguous.

### Pass rate regression

1. Run `pnpm eval:diff latest <previous-run>`
2. Check "Likely Causes" section
3. Review skill file changes listed
4. If no skill changes, check for external factors (API changes, dependency updates)

### "pnpm install failed"

@@ -155,21 +233,13 @@ cd tests/fixtures/{framework}/{state}
pnpm install
```

### "Build failed" but files look correct
### High latency

The agent may have created correct files but with syntax errors. Use `--keep-on-fail` to inspect:
Check the tool breakdown in the summary output to identify bottlenecks:

```bash
pnpm eval --framework=nextjs --keep-on-fail
# Then run build manually in temp dir to see full error
```

### Flaky passes/failures

LLM responses vary. Use `--retry=3` for more attempts:

```bash
pnpm eval --retry=3
Tool Time Breakdown (total across all scenarios):
Bash: 206.5s (27 calls)
Read: 54.3s (14 calls)
...
```

If a scenario is consistently flaky, check if the skill instructions are ambiguous.