Wire schema inference into dataset wizard and redesign review step#25
Conversation
Frontend calls the real backend (Fastify + Claude Sonnet 4.6 via OpenRouter) to generate schemas from natural language prompts. The review step is redesigned: dataset name at top with a hover-expand prompt chip, cadence picker moved here from the describe step, auto-sizing description textareas, custom type selector with chevron, and an inline "+ New column" row at the table bottom. Backend changes: add POST /infer-schema (Clerk-protected), add Mastra as the orchestration layer with its own Docker service, remove the one-shot CLI entry point. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
Disabled knowledge base sources:
📝 WalkthroughWalkthroughThis PR migrates schema inference from a CLI-based tool to an authenticated backend API powered by the Mastra workflow framework. It adds Clerk secret and publishable key support, defines a Mastra workflow that wraps the existing inference pipeline, exposes a protected POST /infer-schema endpoint, updates the frontend dataset wizard to call this API, and introduces a Mastra dev service plus related environment and doc updates. Sequence DiagramsequenceDiagram
participant Frontend as Frontend (NewDatasetPage)
participant Clerk as Clerk Auth
participant Backend as Fastify /infer-schema
participant Mastra as Mastra Workflow
participant OpenRouter as OpenRouter API
Frontend->>Clerk: useAuth().getToken()
Clerk-->>Frontend: Bearer token
Frontend->>Backend: POST /infer-schema { prompt } (Bearer)
Backend->>Backend: requireAuth preHandler
Backend->>Mastra: execute inferSchemaWorkflow(prompt)
Mastra->>OpenRouter: generateText(SYSTEM_PROMPT, prompt)
OpenRouter-->>Mastra: model output (schema)
Mastra-->>Backend: validated DatasetSchema
Backend-->>Frontend: InferredSchema JSON
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/env.ts (1)
21-25:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale comment about protected routes.
The comment claims "Optional today because no protected routes exist yet," but this PR introduces the
/infer-schemaprotected route. The Clerk keys are now required for the application to function correctly.📝 Suggested update
- // Used by ./clerk-auth.ts to verify JWTs on user-facing routes. Optional - // today because no protected routes exist yet; required as soon as one is - // added. + // Used by ./clerk-auth.ts to verify JWTs on user-facing routes (e.g. + // /infer-schema). Required for protected endpoints. CLERK_SECRET_KEY: process.env.CLERK_SECRET_KEY, CLERK_PUBLISHABLE_KEY: process.env.CLERK_PUBLISHABLE_KEY,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/env.ts` around lines 21 - 25, The comment above CLERK_SECRET_KEY and CLERK_PUBLISHABLE_KEY is stale — the app now exposes a protected route (/infer-schema) so these Clerk keys are required; update the comment in env.ts to state that CLERK_SECRET_KEY and CLERK_PUBLISHABLE_KEY are required for JWT verification on protected routes (e.g., /infer-schema) and must be provided in the environment rather than optional.
🧹 Nitpick comments (3)
backend/src/mastra/workflows/infer-schema.ts (1)
4-8: ⚡ Quick winRemove unused imports.
columnDefinitionSchemaandretrievalStrategySchemaare imported but never referenced in this file. OnlydatasetSchemaSchemais actually used.♻️ Proposed cleanup
import { datasetSchemaSchema, - columnDefinitionSchema, - retrievalStrategySchema, } from "../../pipeline/types.js";🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/workflows/infer-schema.ts` around lines 4 - 8, Remove the unused imports columnDefinitionSchema and retrievalStrategySchema from the import list and keep only datasetSchemaSchema; update the import statement that currently references datasetSchemaSchema, columnDefinitionSchema, retrievalStrategySchema so it only imports datasetSchemaSchema to eliminate unused symbols and related linter errors.backend/.env.example (1)
11-12: ⚡ Quick winConsider ordering publishable key before secret key.
By convention, publishable (public) keys are typically listed before secret (private) keys in environment files for consistency.
📝 Suggested reordering
# Required for user-facing protected routes (JWT verification). # Same values as the frontend's Clerk keys. +CLERK_PUBLISHABLE_KEY= CLERK_SECRET_KEY= -CLERK_PUBLISHABLE_KEY=🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/.env.example` around lines 11 - 12, Swap the order of the environment variables so the public key appears before the private key; specifically, place CLERK_PUBLISHABLE_KEY above CLERK_SECRET_KEY in the env example. Update the block that currently lists CLERK_SECRET_KEY then CLERK_PUBLISHABLE_KEY to instead list CLERK_PUBLISHABLE_KEY first, then CLERK_SECRET_KEY, keeping the variable names and empty values intact.backend/src/pipeline/schema-inference.ts (1)
26-33: ⚡ Quick winPrefer centralized env module over direct process.env access.
This function reads
process.env.OPENROUTER_API_KEYdirectly, diverging from the established pattern wherebackend/src/env.tscentralizes all environment variable access. Theenvmodule already exportsOPENROUTER_API_KEYwith the same validation logic. Using it consistently across the codebase improves maintainability, type safety, and testability.♻️ Proposed change
+import { env } from "../env.js"; + function getModel() { - const apiKey = process.env.OPENROUTER_API_KEY; - if (!apiKey) { - throw new Error("Missing required environment variable: OPENROUTER_API_KEY"); - } + const apiKey = env.OPENROUTER_API_KEY; + if (!apiKey) { + throw new Error("Missing required environment variable: OPENROUTER_API_KEY"); + } const openrouter = createOpenRouter({ apiKey }); return openrouter("anthropic/claude-sonnet-4-6"); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/pipeline/schema-inference.ts` around lines 26 - 33, Replace direct process.env access in getModel with the centralized env export: import and use the exported OPENROUTER_API_KEY from the project's env module instead of reading process.env.OPENROUTER_API_KEY; keep the existing createOpenRouter call and returned model ("anthropic/claude-sonnet-4-6") but rely on env.OPENROUTER_API_KEY (or the exported constant name) for validation and type-safety consistent with backend/src/env.ts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker-compose.dev.yml`:
- Line 33: The docker-compose mapping for CLERK_PUBLISHABLE_KEY currently only
reads NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY so the backend can be empty if only
CLERK_PUBLISHABLE_KEY is supplied; update the service environment entry so
CLERK_PUBLISHABLE_KEY resolves first from the host CLERK_PUBLISHABLE_KEY and
falls back to NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY (i.e., prefer
CLERK_PUBLISHABLE_KEY, then NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY) so both
deployment styles work; locate the CLERK_PUBLISHABLE_KEY entry in
docker-compose.dev.yml to apply this change.
In `@frontend/app/dataset/new/page.tsx`:
- Around line 227-229: The error banner rendered when schema generation fails
(the JSX block that checks {error && ( ... )}) needs live-region semantics so
screen readers announce it; update that div to include role="alert" (or
aria-live="assertive" with aria-atomic="true") so the message is announced
immediately. Locate the conditional render around the {error && (...)} in
page.tsx and add the ARIA attributes to that element (keep existing className
and styling unchanged) to ensure generation failures are surfaced to assistive
tech.
- Around line 295-306: The cadence selector rendered from CADENCE_OPTIONS using
the button mapping currently conveys state only via color; update the rendered
controls so assistive tech can read the selected option by adding accessible
state attributes or using native radios — e.g., when mapping CADENCE_OPTIONS in
the component (where setCadence and cadence are used) either replace the buttons
with proper radio inputs (grouped with a fieldset/legend or role="radiogroup")
or keep the buttons and add aria-pressed={cadence === opt.value} (or
role="radio" + aria-checked={cadence === opt.value}) plus an accessible
container role="radiogroup"; ensure onClick still calls setCadence(opt.value)
and include unique key={opt.value} as before.
- Around line 49-56: The BACKEND_TYPE_MAP forcibly converts inferred enum
columns to "text", which loses semantics; update handling so enums are
preserved: change BACKEND_TYPE_MAP to map InferredColumn type "enum" to
ColumnType "enum" (don't coerce to "text"), add "enum" to the ColumnType
union/type definition, extend the TypeSelector component to display and accept
"enum" as a choice, and update handleConfirm and the create payload logic to
accept and persist ColumnType "enum" (or alternatively throw an explicit error
in inferSchema/handleConfirm if enum is unsupported) so inferred enum columns
keep their type instead of being silently rewritten.
---
Outside diff comments:
In `@backend/src/env.ts`:
- Around line 21-25: The comment above CLERK_SECRET_KEY and
CLERK_PUBLISHABLE_KEY is stale — the app now exposes a protected route
(/infer-schema) so these Clerk keys are required; update the comment in env.ts
to state that CLERK_SECRET_KEY and CLERK_PUBLISHABLE_KEY are required for JWT
verification on protected routes (e.g., /infer-schema) and must be provided in
the environment rather than optional.
---
Nitpick comments:
In `@backend/.env.example`:
- Around line 11-12: Swap the order of the environment variables so the public
key appears before the private key; specifically, place CLERK_PUBLISHABLE_KEY
above CLERK_SECRET_KEY in the env example. Update the block that currently lists
CLERK_SECRET_KEY then CLERK_PUBLISHABLE_KEY to instead list
CLERK_PUBLISHABLE_KEY first, then CLERK_SECRET_KEY, keeping the variable names
and empty values intact.
In `@backend/src/mastra/workflows/infer-schema.ts`:
- Around line 4-8: Remove the unused imports columnDefinitionSchema and
retrievalStrategySchema from the import list and keep only datasetSchemaSchema;
update the import statement that currently references datasetSchemaSchema,
columnDefinitionSchema, retrievalStrategySchema so it only imports
datasetSchemaSchema to eliminate unused symbols and related linter errors.
In `@backend/src/pipeline/schema-inference.ts`:
- Around line 26-33: Replace direct process.env access in getModel with the
centralized env export: import and use the exported OPENROUTER_API_KEY from the
project's env module instead of reading process.env.OPENROUTER_API_KEY; keep the
existing createOpenRouter call and returned model
("anthropic/claude-sonnet-4-6") but rely on env.OPENROUTER_API_KEY (or the
exported constant name) for validation and type-safety consistent with
backend/src/env.ts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f1dad1fd-f417-4e1c-b3a7-2c8eaa6b9b12
⛔ Files ignored due to path filters (1)
backend/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (16)
.gitignorebackend/.env.examplebackend/Dockerfile.mastrabackend/package.jsonbackend/src/clerk-auth.tsbackend/src/cli.tsbackend/src/env.tsbackend/src/index.tsbackend/src/mastra/index.tsbackend/src/mastra/workflows/infer-schema.tsbackend/src/pipeline/schema-inference.tsdocker-compose.dev.ymlfrontend/.env.examplefrontend/app/dataset/new/page.tsxfrontend/lib/backend.tsmakefiles/Makefile
💤 Files with no reviewable changes (1)
- backend/src/cli.ts
| CONVEX_SELF_HOSTED_ADMIN_KEY: ${CONVEX_SELF_HOSTED_ADMIN_KEY:-} | ||
| # Optional today; required once protected routes are added. | ||
| CLERK_SECRET_KEY: ${CLERK_SECRET_KEY:-} | ||
| CLERK_PUBLISHABLE_KEY: ${NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY:-} |
There was a problem hiding this comment.
Fix backend Clerk publishable key env mapping.
CLERK_PUBLISHABLE_KEY is sourced from NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY, which can leave backend config empty when only CLERK_PUBLISHABLE_KEY is provided.
Suggested fix
- CLERK_PUBLISHABLE_KEY: ${NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY:-}
+ CLERK_PUBLISHABLE_KEY: ${CLERK_PUBLISHABLE_KEY:-}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| CLERK_PUBLISHABLE_KEY: ${NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY:-} | |
| CLERK_PUBLISHABLE_KEY: ${CLERK_PUBLISHABLE_KEY:-} |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docker-compose.dev.yml` at line 33, The docker-compose mapping for
CLERK_PUBLISHABLE_KEY currently only reads NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY so
the backend can be empty if only CLERK_PUBLISHABLE_KEY is supplied; update the
service environment entry so CLERK_PUBLISHABLE_KEY resolves first from the host
CLERK_PUBLISHABLE_KEY and falls back to NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY (i.e.,
prefer CLERK_PUBLISHABLE_KEY, then NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY) so both
deployment styles work; locate the CLERK_PUBLISHABLE_KEY entry in
docker-compose.dev.yml to apply this change.
There was a problem hiding this comment.
Intentional — our root .env (and .env.example) only defines NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY since the frontend needs the NEXT_PUBLIC_ prefix. The backend doesn't have its own separate CLERK_PUBLISHABLE_KEY var. The mapping here correctly bridges the two naming conventions. No change needed.
There was a problem hiding this comment.
@simantak-dabhade, understood — thanks for the context! The single NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY in the root .env serving as the canonical name makes sense, with Docker Compose bridging it to the backend's CLERK_PUBLISHABLE_KEY. I'll note this convention to avoid re-flagging it in future reviews.
(´・ω・`) 🐇
✏️ Learnings added
Learnt from: simantak-dabhade
Repo: tinyfish-io/bigset PR: 25
File: docker-compose.dev.yml:33-33
Timestamp: 2026-05-21T18:52:10.158Z
Learning: In the `tinyfish-io/bigset` repo, the root `.env` (and `.env.example`) defines only `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` as the canonical Clerk publishable key variable (required for the Next.js frontend's `NEXT_PUBLIC_` prefix convention). The backend does not have its own separate `CLERK_PUBLISHABLE_KEY` env var. In `docker-compose.dev.yml`, `CLERK_PUBLISHABLE_KEY: ${NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY:-}` is intentional — it bridges the single frontend-named variable to the backend service. Do not flag this mapping as an error in future reviews.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| const BACKEND_TYPE_MAP: Record<InferredColumn["type"], ColumnType> = { | ||
| string: "text", | ||
| enum: "text", | ||
| url: "url", | ||
| date: "date", | ||
| number: "number", | ||
| boolean: "boolean", | ||
| }; |
There was a problem hiding this comment.
Don't silently coerce inferred enum columns to text.
inferSchema() can return type: "enum", but this map downgrades it immediately, and handleConfirm() later persists that downgraded value. Prompts that infer categorical fields like status/stage/category will lose their schema semantics before the user even reviews them. Either add enum support to ColumnType/TypeSelector and the create payload, or fail fast on unsupported backend types instead of rewriting them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/app/dataset/new/page.tsx` around lines 49 - 56, The BACKEND_TYPE_MAP
forcibly converts inferred enum columns to "text", which loses semantics; update
handling so enums are preserved: change BACKEND_TYPE_MAP to map InferredColumn
type "enum" to ColumnType "enum" (don't coerce to "text"), add "enum" to the
ColumnType union/type definition, extend the TypeSelector component to display
and accept "enum" as a choice, and update handleConfirm and the create payload
logic to accept and persist ColumnType "enum" (or alternatively throw an
explicit error in inferSchema/handleConfirm if enum is unsupported) so inferred
enum columns keep their type instead of being silently rewritten.
There was a problem hiding this comment.
Intentional — Convex's dataset schema (frontend/convex/schema.ts) stores column types as text | number | boolean | url | date. There's no enum type in the data model. The coercion to text is correct for now. When we add enum support to the Convex schema (with allowed values, validation, etc.), we'll propagate it through the full stack.
There was a problem hiding this comment.
@simantak-dabhade, makes sense — with Convex's column type union fixed to text | number | boolean | url | date, coercing enum to text is the right call for now. Would you like me to open a follow-up GitHub issue to track adding full enum support (allowed values, validation, Convex schema update, TypeSelector extension, etc.) when you're ready to tackle it?
✏️ Learnings added
Learnt from: simantak-dabhade
Repo: tinyfish-io/bigset PR: 25
File: frontend/app/dataset/new/page.tsx:49-56
Timestamp: 2026-05-21T18:52:14.885Z
Learning: In `frontend/app/dataset/new/page.tsx`, the `BACKEND_TYPE_MAP` intentionally coerces `InferredColumn` type `"enum"` to `ColumnType` `"text"` because Convex's dataset schema (`frontend/convex/schema.ts`) only supports column types `text | number | boolean | url | date`. There is no `enum` type in the data model. Full enum support (with allowed values, validation, etc.) is planned for a future PR that will propagate it through the full stack.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| {CADENCE_OPTIONS.map((opt) => ( | ||
| <button | ||
| key={opt.value} | ||
| onClick={() => setCadence(opt.value)} | ||
| className={`rounded-lg border px-3 py-1.5 text-xs font-medium transition-colors ${ | ||
| cadence === opt.value | ||
| ? "border-foreground bg-foreground text-accent-text" | ||
| : "border-border bg-surface text-foreground hover:border-foreground/30" | ||
| }`} | ||
| > | ||
| {opt.label} | ||
| </button> |
There was a problem hiding this comment.
Expose the selected cadence to assistive tech.
These act like toggle/radio controls, but the active option is only expressed by color. Add aria-pressed/aria-checked semantics or switch to actual radio inputs so the current schedule is announced correctly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/app/dataset/new/page.tsx` around lines 295 - 306, The cadence
selector rendered from CADENCE_OPTIONS using the button mapping currently
conveys state only via color; update the rendered controls so assistive tech can
read the selected option by adding accessible state attributes or using native
radios — e.g., when mapping CADENCE_OPTIONS in the component (where setCadence
and cadence are used) either replace the buttons with proper radio inputs
(grouped with a fieldset/legend or role="radiogroup") or keep the buttons and
add aria-pressed={cadence === opt.value} (or role="radio" +
aria-checked={cadence === opt.value}) plus an accessible container
role="radiogroup"; ensure onClick still calls setCadence(opt.value) and include
unique key={opt.value} as before.
There was a problem hiding this comment.
Good catch — will add aria-pressed in a follow-up accessibility pass. Low priority for now since this is an internal tool, but noted.
There was a problem hiding this comment.
@simantak-dabhade, totally reasonable for an internal tool — noted! Would you like me to open a GitHub issue to track the accessibility pass (adding aria-pressed/role="radiogroup" to the cadence selector) so it doesn't get lost?
|
Thanks @manav-tf |
Root: add Mastra Studio port, OpenRouter setup step, env var plumbing section, schema inference pipeline description, frontend→backend auth flow. Backend: document Fastify routes, Clerk JWT auth, schema inference pipeline, Mastra workflows, and required env vars. Remove stale claim that backend does not handle auth. Frontend: document lib/backend.ts API client and backend type mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix stale comment in env.ts (Clerk keys are now required, not optional) - Remove unused imports in infer-schema workflow - Add role="alert" to error banner for screen reader announcements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docker-compose passes this to backend and mastra but it wasn't in the example file — new contributors would miss it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Re: stale comment in Fixed in f76d5f3 — updated the comment to reflect that Clerk keys are now required for protected routes like Re: unused imports in Also fixed in f76d5f3 — removed |
Summary
POST /infer-schema, Clerk JWT auth) → Claude Sonnet 4.6 via OpenRouter → editable schema in the wizardmastra devon :4111) withinferSchemaWorkflow; removed the one-shot CLI entry point (cli.ts)frontend/lib/backend.ts: Typed API client for backend calls with Clerk token forwardingFiles
src/index.ts,src/env.ts,src/clerk-auth.tsPOST /infer-schemaroute,CLERK_PUBLISHABLE_KEYenv varsrc/mastra/,Dockerfile.mastrasrc/pipeline/schema-inference.tsapp/dataset/new/page.tsxlib/backend.ts/infer-schemadocker-compose.dev.yml,makefiles/Makefilesrc/cli.ts(deleted),.gitignore,.env.examplefilesTest plan
make dev— all services start (frontend :3500, backend :3501, Mastra :4111)/dataset/new, enter a prompt, click "Generate Schema"/dataset/{id}with "Building..." status/infer-schemashould return 401🤖 Generated with Claude Code