Skip to content

feat: add file attachment support for chat agent with document parsing#12

Merged
larryro merged 1 commit into
mainfrom
chat-agent-files-support
Dec 12, 2025
Merged

feat: add file attachment support for chat agent with document parsing#12
larryro merged 1 commit into
mainfrom
chat-agent-files-support

Conversation

@larryro
Copy link
Copy Markdown
Collaborator

@larryro larryro commented Dec 12, 2025

  • Add file attachment handling to chat interface with paste support for images
  • Implement multi-modal message support: images sent inline, documents parsed via tools
  • Create file parser service in crawler for PDF, DOCX, PPTX text extraction
  • Add parse operation to pdf_tool, docx_tool, and pptx_tool
  • Add shared parseFile helper to download and extract text from document URLs
  • Register files with agent component for proper tracking
  • Update chat mutation and action to pass attachments through the flow

Summary by CodeRabbit

  • New Features
    • Added support for parsing PDF, DOCX, and PPTX documents with automatic text and metadata extraction.
    • Users can now attach files and images to chat conversations for AI processing.
    • Added image paste functionality—users can paste images directly from clipboard into the chat.
    • AI chat now supports multi-modal interactions, processing both text and attached files/images together.

✏️ Tip: You can customize this high-level summary in your review settings.

- Add file attachment handling to chat interface with paste support for images
- Implement multi-modal message support: images sent inline, documents parsed via tools
- Create file parser service in crawler for PDF, DOCX, PPTX text extraction
- Add parse operation to pdf_tool, docx_tool, and pptx_tool
- Add shared parseFile helper to download and extract text from document URLs
- Register files with agent component for proper tracking
- Update chat mutation and action to pass attachments through the flow
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 12, 2025

📝 Walkthrough

Walkthrough

This PR introduces file parsing and multi-modal attachment support across the application stack. The crawler service adds a new FileParserService class that parses PDF, DOCX, and PPTX files and exposes them via a new /api/v1/parse/file REST endpoint. The platform side extends the chat UI with clipboard paste support for images and an attachment button, integrates file parsing into existing PDF/DOCX/PPTX Convex tools, and enhances the chat agent to handle attachments through the message pipeline, including optimistic UI updates and multi-modal prompt construction for AI processing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • services/platform/convex/model/chat_agent/generate_agent_response.ts: Multi-modal attachment handling with file registration, SHA256 computation, and complex promptContent construction for AI integration requires careful logic review.
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx: Optimistic message clearing logic with conditional content matching (exact or prefix-based) needs verification for edge cases.
  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts: File download, FormData construction, and error handling for crawler integration requires validation of abort timeout and response parsing.
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx: Image paste handler and clipboard processing logic should be reviewed for browser compatibility and file handling correctness.
  • services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts and pptx_tool.ts: Union type refactoring and conditional operation handling (parse vs. generate/analyze) need verification for completeness.

Possibly related PRs

  • talecorp/poc2#29: Directly implements the backend file parsing service (FileParserService) and /api/v1/parse/file endpoint that the parse_file.ts helper and Convex tool changes invoke.
  • tale-project/poc2#497: Introduces the same crawler FileParserService and ParseFileResponse API contract used by the parse_file.ts helper and new PDF/DOCX/PPTX tool parse operations.

Comment @coderabbitai help to get the list of available commands and usage tips.

@larryro larryro merged commit 92b6cc6 into main Dec 12, 2025
1 of 2 checks passed
@larryro larryro deleted the chat-agent-files-support branch December 12, 2025 07:04
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (2)

12-30: Avoid the as GeneratePdfResult escape hatch; keep result typing honest
return { operation: 'generate', ...result } as GeneratePdfResult; bypasses compile-time verification that internal.documents.generateDocumentInternal still returns the full GeneratePdfResult shape. Prefer explicit mapping (or satisfies) so missing fields fail fast at build time.


63-118: Use z.discriminatedUnion('operation', [...]) to enforce operation-specific required fields

The schema currently marks all fields as optional, but validates them at runtime based on the operation value. This defers validation errors until the handler executes. Use Zod's discriminatedUnion() to split the schema by operation and enforce required fields upfront:

-    args: z.object({
-      operation: z
-        .enum(['generate', 'parse'])
-        .optional()
-        .describe("Operation: 'generate' (default) or 'parse'"),
-      // For generate operation
-      fileName: z.string().optional().describe(...),
-      sourceType: z.enum(['markdown', 'html', 'url']).optional().describe(...),
-      content: z.string().optional().describe(...),
-      ...
-      // For parse operation
-      url: z.string().optional().describe(...),
-      filename: z.string().optional().describe(...),
-    }),
+    args: z.discriminatedUnion('operation', [
+      z.object({
+        operation: z.literal('generate').optional(),
+        fileName: z.string().min(1),
+        sourceType: z.enum(['markdown', 'html', 'url']),
+        content: z.string().min(1),
+        pdfOptions: z
+          .object({
+            format: z.string().optional(),
+            landscape: z.boolean().optional(),
+            marginTop: z.string().optional(),
+            marginBottom: z.string().optional(),
+            marginLeft: z.string().optional(),
+            marginRight: z.string().optional(),
+            printBackground: z.boolean().optional(),
+          })
+          .optional(),
+        urlOptions: z
+          .object({
+            waitUntil: z.enum(['load', 'domcontentloaded', 'networkidle', 'commit']).optional(),
+          })
+          .optional(),
+        extraCss: z.string().optional(),
+        wrapInTemplate: z.boolean().optional(),
+      }),
+      z.object({
+        operation: z.literal('parse'),
+        url: z.string().url(),
+        filename: z.string().min(1),
+      }),
+    ]),

This catches missing fields at validation time with precise error messages instead of throwing at runtime (lines 124–143).

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx (1)

252-272: Replace <img> + hardcoded blues with next/image + semantic colors.
Repo guidelines: use Next.js Image (not native img) and avoid hardcoded colors like from-blue-100, text-blue-600.

+import Image from 'next/image';

-                        <img
-                          src={attachment.previewUrl}
-                          alt={attachment.fileName}
-                          className="size-full object-cover"
-                        />
+                        <Image
+                          src={attachment.previewUrl}
+                          alt={attachment.fileName}
+                          fill
+                          className="object-cover"
+                          unoptimized
+                        />

-                        <div className="size-full bg-gradient-to-br from-blue-100 to-blue-200 flex items-center justify-center">
-                          <span className="text-xs text-blue-600">IMG</span>
+                        <div className="size-full bg-muted flex items-center justify-center">
+                          <span className="text-xs text-muted-foreground">IMG</span>
                         </div>
services/platform/convex/chat_agent.ts (1)

24-48: Deduplicate the attachment validator to prevent schema drift.

+const FILE_ATTACHMENT_VALIDATOR = v.object({
+  fileId: v.id('_storage'),
+  fileName: v.string(),
+  fileType: v.string(),
+  fileSize: v.number(),
+});

 export const chatWithAgent = mutation({
   args: {
@@
-    attachments: v.optional(
-      v.array(
-        v.object({
-          fileId: v.id('_storage'),
-          fileName: v.string(),
-          fileType: v.string(),
-          fileSize: v.number(),
-        }),
-      ),
-    ),
+    attachments: v.optional(v.array(FILE_ATTACHMENT_VALIDATOR)),
   },
@@
 export const generateAgentResponse = internalAction({
   args: {
@@
-    attachments: v.optional(
-      v.array(
-        v.object({
-          fileId: v.id('_storage'),
-          fileName: v.string(),
-          fileType: v.string(),
-          fileSize: v.number(),
-        }),
-      ),
-    ),
+    attachments: v.optional(v.array(FILE_ATTACHMENT_VALIDATOR)),
     messageText: v.optional(v.string()),
   },

Also applies to: 108-125

services/platform/convex/model/chat_agent/generate_agent_response.ts (1)

384-396: Don’t log responseBody by default (possible sensitive data leak).
Even if “sanitized”, responseBody/err.data can include provider payloads that may contain user content. Gate it behind debugLog or drop it / truncate heavily.

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (1)

200-218: startsWith optimistic clearing can false-positive and hide the user message.
Guard the prefix-match so it only triggers for the “attachments appended” pattern (e.g., next chars are \n\n).

-        return (
-          m.content === optimisticMessage.content ||
-          m.content.startsWith(optimisticMessage.content)
-        );
+        if (m.content === optimisticMessage.content) return true;
+        if (!m.content.startsWith(optimisticMessage.content)) return false;
+        // Only treat as “same message” if server appended content (attachments section).
+        const suffix = m.content.slice(optimisticMessage.content.length);
+        return suffix.startsWith('\n\n');
services/platform/convex/model/chat_agent/chat_with_agent.ts (1)

50-75: Dedup logic drops attachments when message text repeats (attachments must bypass dedup).
Right now attachments are ignored if messageAlreadyExists is true, even if the attachments differ.

-  const {
+  const {
     latestMessage,
     lastUserMessage,
     messageAlreadyExists,
     trimmedMessage,
   } = computeDeduplicationState(existingMessages, message);
 
   const hasAttachments = attachments && attachments.length > 0;
+  const treatAsDuplicate = messageAlreadyExists && !hasAttachments;

@@
-  if (!messageAlreadyExists) {
+  if (!treatAsDuplicate) {
     const { messageId } = await saveMessage(ctx, components.agent, {
       threadId,
       message: { role: 'user', content: messageContent },
     });
     promptMessageId = messageId;
   } else {
     promptMessageId = lastUserMessage!._id;
   }

@@
-  const actionAttachments = !messageAlreadyExists && hasAttachments
+  const actionAttachments = !treatAsDuplicate && hasAttachments
     ? attachments.map((a) => ({
         fileId: a.fileId,
         fileName: a.fileName,
         fileType: a.fileType,
         fileSize: a.fileSize,
       }))
     : undefined;

Also applies to: 114-136

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 17e07ff and 80b318c.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (13)
  • services/crawler/app/file_parser_service.py (1 hunks)
  • services/crawler/app/main.py (2 hunks)
  • services/crawler/app/models.py (1 hunks)
  • services/crawler/requirements.txt (1 hunks)
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx (4 hunks)
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (2 hunks)
  • services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (6 hunks)
  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (1 hunks)
  • services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (4 hunks)
  • services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (6 hunks)
  • services/platform/convex/chat_agent.ts (2 hunks)
  • services/platform/convex/model/chat_agent/chat_with_agent.ts (4 hunks)
  • services/platform/convex/model/chat_agent/generate_agent_response.ts (7 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Use English only for ALL user-facing content including UI components, labels, buttons, dialogs, forms, toast messages, error messages, success messages, comments, documentation, README files, variable names, function names, and type names

Files:

  • services/crawler/requirements.txt
  • services/platform/convex/chat_agent.ts
  • services/platform/convex/model/chat_agent/generate_agent_response.ts
  • services/platform/convex/model/chat_agent/chat_with_agent.ts
  • services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
  • services/crawler/app/models.py
  • services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
  • services/crawler/app/file_parser_service.py
  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
  • services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
  • services/crawler/app/main.py
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

**/*.{ts,tsx,js,jsx}: Use Vercel AI SDK with OpenAI - import from 'ai' and '@ai-sdk/openai', never use raw OpenAI SDK or OpenRouter
Never hallucinate API keys - always use environment variables and existing .env configuration
Use camelCase for function names (e.g., getUserData)
Use SCREAMING_SNAKE_CASE for constants (e.g., API_BASE_URL, MAX_RETRIES)
Use feature flags with enums (TypeScript) or const objects (JavaScript) with UPPERCASE_WITH_UNDERSCORE naming
Implement error handling with try-catch pattern: check for result.error and display descriptive toast messages using result.error as title

Files:

  • services/platform/convex/chat_agent.ts
  • services/platform/convex/model/chat_agent/generate_agent_response.ts
  • services/platform/convex/model/chat_agent/chat_with_agent.ts
  • services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
  • services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
  • services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

**/*.{ts,tsx}: Use kebab-case for file names (e.g., user-profile.tsx)
Use PascalCase for component names (e.g., UserProfile)
Use descriptive messages as toast title (never generic 'Error'), with optional description for additional context only
Follow component structure: 'use client' directive, imports, interface Props, hooks, effects, event handlers, then render
Prioritize data fetching methods in order: Server Actions (preferred), Route Handlers (when needed), Client-side (minimal use)
Use React.memo for expensive components to optimize performance
Use Next.js Image component for all images instead of native img tags
Use dynamic imports for code splitting

Files:

  • services/platform/convex/chat_agent.ts
  • services/platform/convex/model/chat_agent/generate_agent_response.ts
  • services/platform/convex/model/chat_agent/chat_with_agent.ts
  • services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
  • services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
  • services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
services/*/convex/*.ts

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Thin wrapper API modules (like services/platform/convex/documents.ts) may export multiple Convex functions as wrappers that delegate to model helpers, but must not contain business logic and must only perform argument/return validation and delegation

Files:

  • services/platform/convex/chat_agent.ts
**/*.{tsx,jsx}

📄 CodeRabbit inference engine (.cursor/rules/figma_rules.mdc)

**/*.{tsx,jsx}: Avoid specifying font-['Inter:Regular',_sans-serif] as it should be the default
Only specify font-family when using non-default fonts like font-['Inter:Medium',_sans-serif]
Ensure font-family matches font-weight (Inter:Regular with font-normal, Inter:Medium with font-medium)
Use leading-normal instead of leading-[normal] in Tailwind classes
Use standard font size classes instead of arbitrary values: text-[12px]text-xs, text-[14px]text-sm, text-[16px]text-base, text-[18px]text-lg, text-[20px]text-xl, text-[24px]text-2xl
Use semantic spacing classes: p-[4px]p-1, p-[8px]p-2, m-[4px]m-1, m-[8px]m-2
Convert pixel values to rem using the 16px base for width and height measurements: w-[278px]w-[17.375rem], h-[48px]h-[3rem], min-w-[120px]min-w-[7.5rem], max-w-[400px]max-w-[25rem]
NEVER use hardcoded colors like text-gray-500, bg-gray-100, border-gray-200; ALWAYS use design system semantic colors: text-foreground for primary text, text-muted-foreground for secondary text and icons, bg-background for main backgrounds, bg-muted for subtle backgrounds and hover states, border-border for borders
ALWAYS use the Table component instead of custom flex layouts; use Table, TableHeader, TableBody, TableRow, TableHead, TableCell components with proper column widths using rem units and semantic colors

Files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
**/app/**/*.tsx

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

In Next.js App Router, use page.tsx as server components by default; use 'use client' only for interactions and state

Files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
🧠 Learnings (18)
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/app/api/**/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Import the Vercel AI SDK directly (e.g., `import { generateObject } from 'ai'; import { openrouter } from 'openrouter/ai-sdk-provider';`)

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-08-21T14:59:56.034Z
Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Always use the Vercel AI SDK to access large language models

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-08-21T14:59:56.034Z
Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Import the Vercel AI SDK directly (e.g., `import { generateObject } from 'ai'; import { openrouter } from 'openrouter/ai-sdk-provider';`)

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-08-21T14:59:56.034Z
Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Do not use the direct OpenAI client without the Vercel AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-08-21T14:59:56.034Z
Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Avoid custom AI wrappers that do not add meaningful functionality beyond the Vercel AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Always use the Vercel AI SDK to access large language models

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/lib/**/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-07-20T08:40:24.693Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Never use direct OpenAI client without the Vercel AI SDK

Applied to files:

  • services/platform/convex/model/chat_agent/generate_agent_response.ts
📚 Learning: 2025-11-30T03:53:00.316Z
Learnt from: CR
Repo: tale-project/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-11-30T03:53:00.316Z
Learning: Applies to convex/**/*.ts : Use one file per Convex function under the `convex/` directory, with file name and exported function name identical (snake_case for file names, camelCase for function names)

Applied to files:

  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
📚 Learning: 2025-12-02T08:13:24.290Z
Learnt from: CR
Repo: tale-project/tale PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-12-02T08:13:24.290Z
Learning: Applies to convex/**/*.{ts,tsx} : When touching existing modules that export multiple functions, split them into separate files so each file exports exactly one function matching the file name, unless the module is an explicitly designated thin wrapper file.

Applied to files:

  • services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
📚 Learning: 2025-07-03T08:43:49.346Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/next-best-practice.mdc:0-0
Timestamp: 2025-07-03T08:43:49.346Z
Learning: Applies to **/*.{tsx,jsx} : Use useCallback for stable function references in components

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
📚 Learning: 2025-07-19T15:29:09.401Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursorrules:0-0
Timestamp: 2025-07-19T15:29:09.401Z
Learning: Applies to app/components/**/*.{js,jsx,ts,tsx} : Use useCallback for stable function references in React components.

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
📚 Learning: 2025-07-19T15:30:00.886Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/core-rules.mdc:0-0
Timestamp: 2025-07-19T15:30:00.886Z
Learning: Applies to **/*.{tsx,jsx} : Follow the required React component structure: 'use client' (when needed), imports, Props interface, function definition, hooks, effects, event handlers, and render in order

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
📚 Learning: 2025-07-19T15:30:00.886Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/core-rules.mdc:0-0
Timestamp: 2025-07-19T15:30:00.886Z
Learning: Applies to **/*.{tsx,jsx} : Use memoization (e.g., useMemo) for expensive operations and cache repeated calculations (e.g., useRef with Map) in React components

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
📚 Learning: 2025-07-03T08:43:49.346Z
Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/next-best-practice.mdc:0-0
Timestamp: 2025-07-03T08:43:49.346Z
Learning: Applies to **/*.{tsx,jsx} : Consistent component structure: hooks, effects, event handlers, render

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
📚 Learning: 2025-08-21T14:59:36.874Z
Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursorrules:0-0
Timestamp: 2025-08-21T14:59:36.874Z
Learning: Applies to **/*.{tsx,jsx} : Use React.memo for expensive components

Applied to files:

  • services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
🧬 Code graph analysis (7)
services/platform/convex/model/chat_agent/generate_agent_response.ts (1)
services/platform/convex/model/chat_agent/chat_with_agent.ts (1)
  • FileAttachment (26-31)
services/platform/convex/model/chat_agent/chat_with_agent.ts (1)
services/platform/convex/lib/debug_log.ts (1)
  • createDebugLog (27-39)
services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (1)
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)
  • ParseFileResult (11-25)
  • parseFile (34-102)
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)
services/platform/convex/lib/debug_log.ts (1)
  • createDebugLog (27-39)
services/platform/convex/agent_tools/convex_tools/crawler/helpers/get_crawler_service_url.ts (1)
  • getCrawlerServiceUrl (7-15)
services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (2)
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)
  • ParseFileResult (11-25)
  • parseFile (34-102)
services/platform/convex/model/documents/generate_docx.ts (1)
  • GenerateDocxResult (43-50)
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (2)
services/platform/convex/chat_agent.ts (1)
  • chatWithAgent (24-48)
services/platform/convex/model/chat_agent/chat_with_agent.ts (1)
  • chatWithAgent (46-181)
services/crawler/app/main.py (4)
services/crawler/app/models.py (1)
  • ParseFileResponse (283-294)
services/crawler/app/converter_service.py (1)
  • get_converter_service (356-361)
services/crawler/app/template_service.py (1)
  • get_template_service (117-122)
services/crawler/app/file_parser_service.py (2)
  • FileParserService (17-147)
  • parse_file (131-147)
🪛 Ruff (0.14.8)
services/crawler/app/file_parser_service.py

12-12: typing.Dict is deprecated, use dict instead

(UP035)


50-50: Do not catch blind exception: Exception

(BLE001)


51-51: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


51-51: Logging statement uses f-string

(G004)


86-86: Do not catch blind exception: Exception

(BLE001)


87-87: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


87-87: Logging statement uses f-string

(G004)


127-127: Do not catch blind exception: Exception

(BLE001)


128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


128-128: Logging statement uses f-string

(G004)

services/crawler/app/main.py

908-908: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


926-929: Abstract raise to an inner function

(TRY301)


941-941: Do not catch blind exception: Exception

(BLE001)


946-946: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build Docker Image
🔇 Additional comments (8)
services/crawler/app/models.py (1)

280-294: LGTM!

The ParseFileResponse model is well-structured, follows the existing patterns in the file, and aligns with the ParseFileResult interface in the TypeScript layer. The field descriptions are clear and appropriate.

services/crawler/app/main.py (1)

48-57: LGTM! Consistent with existing service patterns.

The global service singleton pattern matches the existing get_converter_service() and get_template_service() implementations. The FileParserService doesn't require async initialization like the crawler or converter services, so no lifespan changes are needed.

services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (1)

281-292: LGTM! Parse operation implementation is consistent.

The parse operation follows the established patterns of other tools and correctly validates required inputs. The handler delegates to the shared parseFile helper, maintaining DRY principles.

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (1)

11-25: LGTM! Interface aligns with backend model.

The ParseFileResult interface correctly mirrors the ParseFileResponse Pydantic model from the crawler service, ensuring type consistency across the stack.

services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (2)

223-234: LGTM! Consistent with other file tools.

The parse operation implementation follows the same pattern as pptx_tool.ts and pdf_tool.ts, maintaining consistency across all file parsing tools. The input validation and error handling are appropriate.


41-43: Type definitions are well-structured.

The ParseDocxResult type correctly extends ParseFileResult with the operation discriminator, and the DocxResult union properly includes all three operation result types.

services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (1)

146-182: Generate path looks solid; just remove the unsafe cast and consider logging a stable error code
Flow, debug logging, and action invocation look consistent. Main concern is the as GeneratePdfResult cast (see earlier).

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (1)

294-307: Attachment mapping looks correct (strips UI-only fields).

Comment on lines +12 to +15
from typing import Dict, Any

logger = logging.getLogger(__name__)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Switch Dict[...] to dict[...] (and standardize response typing)
Ruff is right: prefer built-in generics for py3.9+.

-from typing import Dict, Any
+from typing import Any
...
-    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]:
+    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> dict[str, Any]:
...
-    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
+    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> dict[str, Any]:
...
-    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]:
+    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> dict[str, Any]:
...
-    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]:
+    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> dict[str, Any]:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.8)

12-12: typing.Dict is deprecated, use dict instead

(UP035)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 12 to 15, replace
the use of typing.Dict[...] with the built-in generic dict[...] (Python 3.9+)
and standardize the function/response type annotations to use dict[str, Any] (or
the appropriate key type) throughout the file; remove the unused Dict import,
keep Any imported from typing (or add it if missing), and update all signatures
and return annotations so they consistently use dict[...] instead of Dict[...].

Comment on lines +20 to +53
def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]:
"""Extract text content from a PDF file."""
import fitz # PyMuPDF

try:
doc = fitz.open(stream=file_bytes, filetype="pdf")
pages = []
full_text = []

for page_num, page in enumerate(doc, start=1):
text = page.get_text("text")
pages.append({"page_number": page_num, "text": text.strip()})
full_text.append(text)

metadata = doc.metadata or {}
doc.close()

return {
"success": True,
"filename": filename,
"file_type": "application/pdf",
"page_count": len(pages),
"pages": pages,
"full_text": "\n\n".join(full_text).strip(),
"metadata": {
"title": metadata.get("title", ""),
"author": metadata.get("author", ""),
"subject": metadata.get("subject", ""),
},
}
except Exception as e:
logger.error(f"Error parsing PDF: {e}")
return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -path "*crawler*" -name "*.py" | grep -i parser | head -10

Repository: tale-project/tale

Length of output: 178


🏁 Script executed:

find . -type f -name "file_parser_service.py"

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | head -100

Repository: tale-project/tale

Length of output: 4484


🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | sed -n '54-110p'

Repository: tale-project/tale

Length of output: 112


🏁 Script executed:

tail -60 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2992


🏁 Script executed:

find . -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | head -3 | xargs grep -l "python" 2>/dev/null | head -1

Repository: tale-project/tale

Length of output: 147


Ensure PDF resources close on all paths and use logger.exception() instead of f-string logging

The doc.close() call at line 35 is not guaranteed to run if an exception occurs during page iteration or metadata access (lines 29–34), causing a resource leak. Additionally, using logger.error(f"...") loses the exception stack trace; logger.exception() should be used instead to preserve debugging information.

         try:
-            doc = fitz.open(stream=file_bytes, filetype="pdf")
-            pages = []
-            full_text = []
+            doc = fitz.open(stream=file_bytes, filetype="pdf")
+            try:
+                pages: list[dict[str, Any]] = []
+                full_text: list[str] = []
             
-            for page_num, page in enumerate(doc, start=1):
-                text = page.get_text("text")
-                pages.append({"page_number": page_num, "text": text.strip()})
-                full_text.append(text)
+                for page_num, page in enumerate(doc, start=1):
+                    text = page.get_text("text")
+                    pages.append({"page_number": page_num, "text": text.strip()})
+                    full_text.append(text)
             
-            metadata = doc.metadata or {}
-            doc.close()
+                metadata = doc.metadata or {}
+            finally:
+                doc.close()
             
             return {
                 "success": True,
                 "filename": filename,
                 "file_type": "application/pdf",
                 "page_count": len(pages),
                 "pages": pages,
                 "full_text": "\n\n".join(full_text).strip(),
                 "metadata": {
                     "title": metadata.get("title", ""),
                     "author": metadata.get("author", ""),
                     "subject": metadata.get("subject", ""),
                 },
             }
-        except Exception as e:
-            logger.error(f"Error parsing PDF: {e}")
-            return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)}
+        except Exception as e:
+            logger.exception("Error parsing PDF")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/pdf",
+                "error": str(e),
+            }
🧰 Tools
🪛 Ruff (0.14.8)

50-50: Do not catch blind exception: Exception

(BLE001)


51-51: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


51-51: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 20 to 53, the PDF
document is opened without guaranteeing it gets closed on all code paths and the
exception is logged with logger.error which drops the traceback; update the code
to open the PyMuPDF document inside a try/finally or use a context manager so
doc.close() always runs (even if page iteration or metadata access raises), and
replace logger.error(...) with logger.exception(...) in the except block so the
full stack trace is preserved.

Comment on lines +54 to +89
def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
"""Extract text content from a DOCX file."""
from docx import Document

try:
doc = Document(BytesIO(file_bytes))
paragraphs = []
for para in doc.paragraphs:
text = para.text.strip()
if text:
paragraphs.append({"text": text, "style": para.style.name if para.style else None})

tables = []
for table in doc.tables:
table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
if table_data:
tables.append(table_data)

full_text = "\n".join(p["text"] for p in paragraphs)
core_props = doc.core_properties

return {
"success": True,
"filename": filename,
"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"paragraph_count": len(paragraphs),
"table_count": len(tables),
"paragraphs": paragraphs,
"tables": tables,
"full_text": full_text,
"metadata": {"title": core_props.title or "", "author": core_props.author or ""},
}
except Exception as e:
logger.error(f"Error parsing DOCX: {e}")
return {"success": False, "filename": filename, "error": str(e)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

git ls-files | grep -i "file_parser" | head -20

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

find . -path "*/crawler/app/*" -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 445


🏁 Script executed:

git ls-files "**/*file_parser*" || rg -l "file_parser_service" -t py || find . -name "*file_parser*" -type f

Repository: tale-project/tale

Length of output: 104


🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "def parse_docx" 2>/dev/null | head -5

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

rg "def parse_docx" -A 20 -t py

Repository: tale-project/tale

Length of output: 1849


🏁 Script executed:

rg "def parse_pdf" -A 20 -t py

Repository: tale-project/tale

Length of output: 1754


🏁 Script executed:

rg "parse_docx\|parse_pdf" -t py | head -30

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

find . -type d -name "crawler" 2>/dev/null

Repository: tale-project/tale

Length of output: 191


🏁 Script executed:

ls -la

Repository: tale-project/tale

Length of output: 1145


🏁 Script executed:

find . -type f -name "*.py" | wc -l && find . -type f -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 781


🏁 Script executed:

rg "application/vnd.openxmlformats-officedocument.wordprocessingml" -t py

Repository: tale-project/tale

Length of output: 208


🏁 Script executed:

git log --oneline -5 2>/dev/null || echo "Not a git repo"

Repository: tale-project/tale

Length of output: 139


🏁 Script executed:

rg "Extract text content from a DOCX file" -t py

Repository: tale-project/tale

Length of output: 157


🏁 Script executed:

rg "logger.exception\|logger.error" -t py -A 2 | head -40

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

rg "success.*False" -t py | head -20

Repository: tale-project/tale

Length of output: 1029


🌐 Web query:

Python logger.exception vs logger.error exception trace stacktrace

💡 Result:

Short answer

  • logger.exception(msg) == logger.error(msg, exc_info=True). It logs at ERROR level and automatically adds the current exception traceback from sys.exc_info(). [1][2]

Key differences / gotchas

  • logger.error(...) by itself does NOT include a traceback — pass exc_info=True (or exc_info=sys.exc_info() or an exc tuple) to include it. [1][2]
  • Use logger.exception only inside an except block (or pass an explicit exc_info) — otherwise sys.exc_info() is empty and you get no traceback (or None info). [3]

Minimal examples

  • inside except:
    try: ...
    except Exception:
    logger.exception("failed") # same as logger.error("failed", exc_info=True) [1][2]
  • logging a saved exception later:
    logger.error("failed", exc_info=my_exc_info) # where my_exc_info is sys.exc_info() or (type, value, tb). [3]

Sources
[1] docs / blog: explanation that logger.exception is error + exc_info=True.
[2] Loggly / tutorials: exc_info=True includes full stack trace; logger.exception defaults to ERROR.
[3] Stack Overflow: logger.exception relies on sys.exc_info() (must be in except or supply exc_info).


🌐 Web query:

Python API response consistency error handling payload fields

💡 Result:

Here are concise, practical guidelines for consistent API responses, error handling, and payload fields in Python APIs — plus a minimal example.

Key rules

  • Use proper HTTP status codes (4xx for client, 5xx for server). Avoid always returning 200 for errors. [1][2]
  • Adopt a standard machine-readable error format (IETF Problem Details — RFC 9457 / RFC 7807). Include type, title, status, detail, instance; extend with application-specific fields. [1][3]
  • Provide a consistent success envelope (data + meta) so clients can parse uniformly.
  • Include a stable error code (machine-readable), human message, and optional field-level errors for validation failures.
  • Always return a correlation/trace id in responses and logs to link client errors to server logs.
  • Don’t leak internal stack traces or sensitive data; log full details server-side and return safe summary messages to clients.
  • Version your API and include response version in metadata to enable non‑breaking evolution.
  • For transient upstream failures use 5xx and include Retry-After when appropriate; for rate limits use 429 plus Retry-After. [2][4]

Minimal JSON patterns

  • Success:
    {
    "data": { ... },
    "meta": { "request_id":"...", "version":"v1", "timestamp":"2025-12-12T12:34:56Z" }
    }
  • Error (RFC-style + extras):
    {
    "type": "https://example.com/problems/validation-error",
    "title": "Validation error",
    "status": 400,
    "detail": "One or more fields failed validation",
    "instance": "/orders/1234",
    "code": "INVALID_INPUT", // app-specific code
    "errors": { // field-level details
    "email": ["missing", "invalid format"],
    "quantity": ["must be >= 1"]
    },
    "request_id": "abc123" // correlation id
    }

Python implementation notes

  • Centralize error handling (middleware/exception handler) that:
    • Maps exceptions to HTTP statuses and application error codes.
    • Formats all responses using the agreed envelope (success or RFC-style error).
    • Attaches/generates request_id and logs it with full stack traces.
  • Framework examples:
    • FastAPI: use exception handlers and Pydantic validation to populate field errors, then return Problem Details JSON.
    • Flask/Django: use error handlers/middleware and serializers to enforce the response shape.
  • Validate inputs early (schema libs: Pydantic, Marshmallow) and translate validation errors into the errors array field. [2][5]

Mapping examples (brief)

  • Validation error -> 400 + code INVALID_INPUT + errors (field list).
  • Auth failure -> 401 + code UNAUTHORIZED.
  • Permission -> 403 + code FORBIDDEN.
  • Not found -> 404 + code NOT_FOUND.
  • Conflict (duplicate) -> 409 + code CONFLICT.
  • Upstream timeout -> 504 + code UPSTREAM_TIMEOUT + Retry-After when possible.

References

  1. RFC 9457 — Problem Details for HTTP APIs (IETF). [1]
  2. Best practices / error handling guides (Treblle, Zuplo) — practical tips on status codes, not exposing internals, Retry-After, etc. [2][4]
  3. RFC 7807 — original Problem Details specification (still referenced widely). [3]

If you want, I can:

  • produce a ready-to-drop-in FastAPI or Flask middleware/exception-handler implementing this pattern, or
  • generate a JSON Schema / OpenAPI example for your API. Which do you prefer?

🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240


🏁 Script executed:

rg "def parse_docx" -B 2 -A 40 services/crawler/app/file_parser_service.py | head -50

Repository: tale-project/tale

Length of output: 1922


🏁 Script executed:

sed -n '54,90p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1707


🏁 Script executed:

rg "def parse_pdf" -B 2 -A 45 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1973


🏁 Script executed:

grep -n "except Exception" services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 163


🏁 Script executed:

sed -n '96,145p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2534


🏁 Script executed:

rg "except Exception as e:" services/crawler/app/file_parser_service.py -A 3

Repository: tale-project/tale

Length of output: 583


🏁 Script executed:

rg "parse_docx\|parse_pdf" services/crawler/app/main.py -B 3 -A 3

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

head -20 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 569


Standardize DOCX failure response to include file_type + capture exception trace in logs

parse_docx success includes file_type, but failure omits it—unlike parse_pdf, which includes it consistently. This breaks callers expecting a stable response structure. Additionally, logger.error() without traceback loses debugging context; logger.exception() automatically captures the full stack.

-        except Exception as e:
-            logger.error(f"Error parsing DOCX: {e}")
-            return {"success": False, "filename": filename, "error": str(e)}
+        except Exception as e:
+            logger.exception("Error parsing DOCX")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                "error": str(e),
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
"""Extract text content from a DOCX file."""
from docx import Document
try:
doc = Document(BytesIO(file_bytes))
paragraphs = []
for para in doc.paragraphs:
text = para.text.strip()
if text:
paragraphs.append({"text": text, "style": para.style.name if para.style else None})
tables = []
for table in doc.tables:
table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
if table_data:
tables.append(table_data)
full_text = "\n".join(p["text"] for p in paragraphs)
core_props = doc.core_properties
return {
"success": True,
"filename": filename,
"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"paragraph_count": len(paragraphs),
"table_count": len(tables),
"paragraphs": paragraphs,
"tables": tables,
"full_text": full_text,
"metadata": {"title": core_props.title or "", "author": core_props.author or ""},
}
except Exception as e:
logger.error(f"Error parsing DOCX: {e}")
return {"success": False, "filename": filename, "error": str(e)}
def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
"""Extract text content from a DOCX file."""
from docx import Document
try:
doc = Document(BytesIO(file_bytes))
paragraphs = []
for para in doc.paragraphs:
text = para.text.strip()
if text:
paragraphs.append({"text": text, "style": para.style.name if para.style else None})
tables = []
for table in doc.tables:
table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
if table_data:
tables.append(table_data)
full_text = "\n".join(p["text"] for p in paragraphs)
core_props = doc.core_properties
return {
"success": True,
"filename": filename,
"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"paragraph_count": len(paragraphs),
"table_count": len(tables),
"paragraphs": paragraphs,
"tables": tables,
"full_text": full_text,
"metadata": {"title": core_props.title or "", "author": core_props.author or ""},
}
except Exception as e:
logger.exception("Error parsing DOCX")
return {
"success": False,
"filename": filename,
"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"error": str(e),
}
🧰 Tools
🪛 Ruff (0.14.8)

86-86: Do not catch blind exception: Exception

(BLE001)


87-87: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


87-87: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 54 to 89, the DOCX
failure path omits the file_type field and logs only the error message; update
the except block to (1) use logger.exception(...) to capture the full traceback
and (2) return a failure dict that includes the same "file_type":
"application/vnd.openxmlformats-officedocument.wordprocessingml.document" key as
the success response (alongside success: False, filename and error), so callers
receive a stable response shape.

Comment on lines +90 to +130
def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]:
"""Extract text content from a PPTX file."""
from pptx import Presentation

try:
prs = Presentation(BytesIO(file_bytes))
slides = []
full_text_parts = []

for slide_num, slide in enumerate(prs.slides, start=1):
slide_text = []
for shape in slide.shapes:
if shape.has_text_frame:
for paragraph in shape.text_frame.paragraphs:
text = paragraph.text.strip()
if text:
slide_text.append(text)
if shape.has_table:
for row in shape.table.rows:
for cell in row.cells:
text = cell.text.strip()
if text:
slide_text.append(text)

slides.append({"slide_number": slide_num, "text_content": slide_text, "full_text": "\n".join(slide_text)})
full_text_parts.extend(slide_text)

core_props = prs.core_properties
return {
"success": True,
"filename": filename,
"file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
"slide_count": len(slides),
"slides": slides,
"full_text": "\n\n".join(full_text_parts),
"metadata": {"title": core_props.title or "", "author": core_props.author or ""},
}
except Exception as e:
logger.error(f"Error parsing PPTX: {e}")
return {"success": False, "filename": filename, "error": str(e)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "file_parser_service.py" -type f 2>/dev/null

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

cd /tmp && git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 129


🏁 Script executed:

cd /tmp && git ls-files | head -20

Repository: tale-project/tale

Length of output: 129


🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

find . -type f -name "*.py" | grep -i parser | head -20

Repository: tale-project/tale

Length of output: 106


🏁 Script executed:

wc -l ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 110


🏁 Script executed:

cat -n ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240


Standardize PPTX failure payload shape (file_type missing) + use logger.exception() for traceback

PPTX failure response omits file_type while the success response includes it (line 121). DOCX has the same inconsistency (line 88 failure vs line 78 success), while PDF correctly includes it in both success and failure responses (line 52). Additionally, use logger.exception() to preserve the full exception traceback in logs.

        except Exception as e:
-            logger.error(f"Error parsing PPTX: {e}")
-            return {"success": False, "filename": filename, "error": str(e)}
+            logger.exception("Error parsing PPTX")
+            return {
+                "success": False,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                "error": str(e),
+            }
🧰 Tools
🪛 Ruff (0.14.8)

127-127: Do not catch blind exception: Exception

(BLE001)


128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


128-128: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 90 to 130, the PPTX
error path logs with logger.error and returns a payload missing the file_type
key while the success path includes it; update the except block to call
logger.exception("Error parsing PPTX") to record the full traceback and return a
failure dict that mirrors the success shape by including "file_type":
"application/vnd.openxmlformats-officedocument.presentationml.presentation"
along with success False, filename and error; also make the analogous change for
the DOCX handler so both formats have consistent failure payloads and
tracebacks.

Comment on lines +131 to +147
def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]:
"""Parse a file based on its content type or filename extension."""
filename_lower = filename.lower()
content_type_lower = content_type.lower() if content_type else ""

if filename_lower.endswith(".pdf") or "pdf" in content_type_lower:
return self.parse_pdf(file_bytes, filename)
elif filename_lower.endswith(".docx") or "wordprocessingml" in content_type_lower:
return self.parse_docx(file_bytes, filename)
elif filename_lower.endswith(".pptx") or "presentationml" in content_type_lower:
return self.parse_pptx(file_bytes, filename)
else:
return {
"success": False,
"filename": filename,
"error": f"Unsupported file type: {filename} ({content_type}). Supported: PDF, DOCX, PPTX.",
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add guardrails for untrusted uploads (size limits / zip-bomb risk)
DOCX/PPTX are ZIP-based; parsing arbitrary bytes without size/complexity limits can be an easy DoS vector. Consider enforcing a max byte size (and ideally request timeouts) at the API boundary or here before dispatch.

🤖 Prompt for AI Agents
In services/crawler/app/file_parser_service.py around lines 131 to 147, the
parser dispatches DOCX/PPTX (zip-based) and other uploads without any size or
zip-complexity guards; add an early defensive check before dispatch: reject
files above a configured MAX_FILE_BYTES with a clear error return, and for
zip-based formats (docx/pptx) additionally inspect the bytes as a zip (without
extracting) to enforce limits on number of entries (MAX_ZIP_ENTRIES) and total
uncompressed size (MAX_UNCOMPRESSED_BYTES); if any limit is exceeded return the
same error structure with a descriptive message and do not call
parse_docx/parse_pptx. Ensure the limits are configurable constants and applied
before any parsing to mitigate DoS/zip-bomb risk.

Comment on lines +48 to +54
// Download the file from the URL
const fileResponse = await fetch(url);
if (!fileResponse.ok) {
throw new Error(`Failed to download file: ${fileResponse.status} ${fileResponse.statusText}`);
}

const fileBlob = await fileResponse.blob();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add timeout for file download to prevent hanging requests.

The initial fetch(url) call has no timeout, which could cause the function to hang indefinitely if the source server is slow or unresponsive. The crawler service call has a 120-second timeout, but the download doesn't.

+    // Download the file from the URL with timeout
+    const downloadController = new AbortController();
+    const downloadTimeoutId = setTimeout(() => downloadController.abort(), 60000);
+
-    const fileResponse = await fetch(url);
+    const fileResponse = await fetch(url, { signal: downloadController.signal });
+    clearTimeout(downloadTimeoutId);
+
     if (!fileResponse.ok) {
       throw new Error(`Failed to download file: ${fileResponse.status} ${fileResponse.statusText}`);
     }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
around lines 48 to 54, the fetch(url) call has no timeout and can hang; add an
AbortController, pass its signal to fetch, start a timeout (match the crawler's
120000 ms or a chosen value), and on timeout call controller.abort() to cancel
the request; ensure you clear the timeout timer after fetch completes and catch
the abort error to throw a clear timeout-specific Error so downloads don’t hang
indefinitely.

Comment on lines +54 to +58
const fileBlob = await fileResponse.blob();

// Create FormData and upload to crawler service
const formData = new FormData();
formData.append('file', fileBlob, filename);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding file size validation to prevent resource exhaustion.

Large files could exhaust memory or cause timeouts. Consider adding a size check before uploading to the crawler service.

     const fileBlob = await fileResponse.blob();

+    // Validate file size (e.g., 50MB limit)
+    const MAX_FILE_SIZE = 50 * 1024 * 1024;
+    if (fileBlob.size > MAX_FILE_SIZE) {
+      throw new Error(`File too large: ${fileBlob.size} bytes exceeds ${MAX_FILE_SIZE} byte limit`);
+    }
+
     // Create FormData and upload to crawler service
     const formData = new FormData();
     formData.append('file', fileBlob, filename);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const fileBlob = await fileResponse.blob();
// Create FormData and upload to crawler service
const formData = new FormData();
formData.append('file', fileBlob, filename);
const fileBlob = await fileResponse.blob();
// Validate file size (e.g., 50MB limit)
const MAX_FILE_SIZE = 50 * 1024 * 1024;
if (fileBlob.size > MAX_FILE_SIZE) {
throw new Error(`File too large: ${fileBlob.size} bytes exceeds ${MAX_FILE_SIZE} byte limit`);
}
// Create FormData and upload to crawler service
const formData = new FormData();
formData.append('file', fileBlob, filename);

Comment on lines 119 to +134
handler: async (ctx: ToolCtx, args): Promise<PdfResult> => {
debugLog('tool:pdf start', {
const operation = args.operation ?? 'generate';

// Handle parse operation
if (operation === 'parse') {
if (!args.url) {
throw new Error("Missing required 'url' for parse operation");
}
if (!args.filename) {
throw new Error("Missing required 'filename' for parse operation");
}

const result = await parseFile(args.url, args.filename, 'pdf');
return { operation: 'parse', ...result };
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file mentioned in the review
find . -type f -name "pdf_tool.ts" | head -20

Repository: tale-project/tale

Length of output: 130


🏁 Script executed:

# Get the file size to determine how to read it
wc -l services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts 2>/dev/null || echo "File not found in expected location"

Repository: tale-project/tale

Length of output: 132


🏁 Script executed:

# Search for the pdf_tool.ts file more broadly
fd "pdf_tool.ts" --type f

Repository: tale-project/tale

Length of output: 128


🏁 Script executed:

# Search for parseFile function definition
rg "parseFile" --type ts --type tsx -A 5 -B 2

Repository: tale-project/tale

Length of output: 88


🏁 Script executed:

# Check for URL validation patterns in the codebase
rg "allowlist|allowList|SSRF|validateUrl|URL.*valid" --type ts --type tsx

Repository: tale-project/tale

Length of output: 88


🏁 Script executed:

# Read the pdf_tool.ts file
cat -n services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts

Repository: tale-project/tale

Length of output: 7551


🏁 Script executed:

# Search for parseFile with simpler pattern
rg "parseFile" -A 5 -B 2

Repository: tale-project/tale

Length of output: 12733


🏁 Script executed:

# Search for URL validation patterns more broadly
rg "validate|allowlist|allowList" -A 2 -B 2 | head -100

Repository: tale-project/tale

Length of output: 8635


🏁 Script executed:

# Read the parseFile helper
cat -n services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts

Repository: tale-project/tale

Length of output: 3594


🏁 Script executed:

# Check if there's any URL validation or allowlist patterns used elsewhere
rg "storage.convex|allowlist|CONVEX.*DOMAIN|URL.*validation" --type ts -B 2 -A 2

Repository: tale-project/tale

Length of output: 2566


🏁 Script executed:

# Check the documentation comment for the parse operation to understand URL expectations
grep -A 10 "parse - Extract" services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts

Repository: tale-project/tale

Length of output: 623


Add try/catch and operation-scoped logging to parse operation; enforce URL domain allowlist before fetch

The parse operation invokes parseFile() without wrapping it in try/catch or operation-scoped logging, leaving failures untagged at the tool level. More critically, parseFile accepts any caller-provided URL and fetches it directly (line 49 in parse_file.ts) without validation—a classic SSRF vector. While the tool description mentions Convex storage, the parameter accepts "accessible HTTP URL", which is too permissive. Add an allowlist (e.g., Convex storage domain) before the fetch, and wrap the parseFile call with try/catch and debugLog calls matching the pattern used in the generate operation.

🤖 Prompt for AI Agents
In services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts around
lines 119 to 134, the parse branch needs an allowlist check and operation-scoped
logging with error handling: before calling parseFile, validate args.url against
a small allowlist of permitted hosts (e.g., your Convex storage domain(s)) and
throw a clear Error if the URL host is not allowed; wrap the parseFile call in
try/catch and surround it with debugLog calls similar to the generate operation
(e.g., debugLog('pdf:parse:start', { url, filename }),
debugLog('pdf:parse:success', result) on success, and
debugLog('pdf:parse:error', { error })) and rethrow or throw a new Error after
logging so failures are surfaced consistently.

Comment on lines +82 to +90
// For parse operation
url: z
.string()
.optional()
.describe("For 'parse': URL of the PPTX file to parse"),
filename: z
.string()
.optional()
.describe("For 'parse': Original filename (e.g., 'presentation.pptx')"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding file extension validation hint.

The url and filename parameters for the parse operation are correctly defined. However, consider adding a .describe() hint that clarifies the expected file type to help the AI agent make better decisions.

   url: z
     .string()
     .optional()
-    .describe("For 'parse': URL of the PPTX file to parse"),
+    .describe("For 'parse': URL of the PPTX file to parse. Must be a .pptx file."),
   filename: z
     .string()
     .optional()
-    .describe("For 'parse': Original filename (e.g., 'presentation.pptx')"),
+    .describe("For 'parse': Original filename. Must end with .pptx extension."),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// For parse operation
url: z
.string()
.optional()
.describe("For 'parse': URL of the PPTX file to parse"),
filename: z
.string()
.optional()
.describe("For 'parse': Original filename (e.g., 'presentation.pptx')"),
// For parse operation
url: z
.string()
.optional()
.describe("For 'parse': URL of the PPTX file to parse. Must be a .pptx file."),
filename: z
.string()
.optional()
.describe("For 'parse': Original filename. Must end with .pptx extension."),
🤖 Prompt for AI Agents
In services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts around
lines 82 to 90, the .describe() texts for the url and filename should explicitly
state the expected .pptx file type and hint about extension validation; update
the descriptions to mention "must be a .pptx file (e.g., presentation.pptx)"
and, if feasible, add a simple validation (e.g., require the filename or URL to
end with .pptx) so the agent and runtime can detect incorrect file types early.

Comment on lines +77 to +113
// Build message content with markdown references for all attachments
let messageContent = trimmedMessage;
if (hasAttachments) {
const imageMarkdowns: string[] = [];
const fileMarkdowns: string[] = [];

for (const attachment of attachments) {
const url = await ctx.storage.getUrl(attachment.fileId);
if (!url) continue;

if (attachment.fileType.startsWith('image/')) {
// Images: Use markdown image syntax for inline display
imageMarkdowns.push(`![${attachment.fileName}](${url})`);
} else {
// Other files: Use markdown link with file info
const sizeKB = Math.round(attachment.fileSize / 1024);
const sizeDisplay = sizeKB >= 1024
? `${(sizeKB / 1024).toFixed(1)} MB`
: `${sizeKB} KB`;
fileMarkdowns.push(`📎 [${attachment.fileName}](${url}) (${attachment.fileType}, ${sizeDisplay})`);
}
}

// Build the attachment section
const attachmentParts: string[] = [];
if (imageMarkdowns.length > 0) {
attachmentParts.push(imageMarkdowns.join('\n'));
}
if (fileMarkdowns.length > 0) {
attachmentParts.push(fileMarkdowns.join('\n'));
}

if (attachmentParts.length > 0) {
messageContent = `${trimmedMessage}\n\n${attachmentParts.join('\n\n')}`;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for markdown rendering patterns
rg -n --hidden -S "react-markdown|remark-gfm|rehype-raw|dangerouslySetInnerHTML|sanitize" .

Repository: tale-project/tale

Length of output: 9429


🏁 Script executed:

# Find where messageContent from chat_with_agent is used/rendered
rg -n "messageContent|chat_with_agent" --max-count 20

Repository: tale-project/tale

Length of output: 2148


🏁 Script executed:

# Look for the function definition to see full context of message handling
grep -n "export.*chat_with_agent\|async.*chat_with_agent" services/platform/convex/model/chat_agent/chat_with_agent.ts

Repository: tale-project/tale

Length of output: 43


🏁 Script executed:

# Search for markdown libraries and renderer setup in the frontend
fd -e "package.json" -e "tsconfig.json" | head -5 && rg -l "markdown|unified|remark|rehype" . | head -20

Repository: tale-project/tale

Length of output: 1057


🏁 Script executed:

# Read the sanitize-chat.ts file to understand sanitization
cat -n services/platform/lib/utils/sanitize-chat.ts

Repository: tale-project/tale

Length of output: 1885


🏁 Script executed:

# Read message-bubble.tsx to see how content is rendered with react-markdown
cat -n services/platform/app/\(app\)/dashboard/\[id\]/chat/components/message-bubble.tsx | head -100

Repository: tale-project/tale

Length of output: 2829


🏁 Script executed:

# Check the rendering part where sanitizedContent is used
sed -n '380,465p' "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx"

Repository: tale-project/tale

Length of output: 2660


🏁 Script executed:

# Check react-markdown version and look for any security plugins or configurations
cat services/platform/package.json | grep -A 2 -B 2 "react-markdown"

Repository: tale-project/tale

Length of output: 204


🏁 Script executed:

# Search for any sanitization happening before sanitizeChatMessage is called
rg -n "sanitizeChatMessage|chat_with_agent" services/platform/app/\(app\)/dashboard/\[id\]/chat/components/chat-interface.tsx -A 5

Repository: tale-project/tale

Length of output: 419


🏁 Script executed:

# Check how attachment data flows - is fileName ever escaped before reaching message-bubble?
sed -n '77,120p' services/platform/convex/model/chat_agent/chat_with_agent.ts

Repository: tale-project/tale

Length of output: 1619


🏁 Script executed:

# Check if there's URL validation for markdown links or any additional security measures
rg -n "javascript:|protocol|href.*validation" services/platform/lib/ services/platform/app/ --max-count 20

Repository: tale-project/tale

Length of output: 1231


🏁 Script executed:

# Check the full context of how messages are stored and retrieved to understand the data flow
rg -n "saveMessage|message.content" services/platform/convex/model/chat_agent/chat_with_agent.ts -B 3 -A 3

Repository: tale-project/tale

Length of output: 989


🌐 Web query:

react-markdown 10.1.0 security markdown link injection javascript protocol vulnerability

💡 Result:

Short answer: No known JavaScript-protocol link-injection (javascript:) vulnerability in react-markdown v10.1.0 as of Dec 12, 2025. Sources (react-markdown docs and Snyk) show react-markdown is “secure by default” and list no direct vulnerabilities for v10.1.0; most reported markdown link/URL XSS issues affect other libraries (e.g., markdown-to-jsx, markdown-it). [1][2]

Recommendation: Keep react-markdown updated, do not enable raw HTML unless you sanitize it, and use rehype-sanitize (or your own URL-sanitizer/urlTransform) to block javascript: and other dangerous schemes when rendering untrusted markdown. See the react-markdown security guidance. [1][2]

Sources:
[1] react-markdown (npm) — Security and version info.
[2] Snyk — react-markdown security/scan results.


🏁 Script executed:

# Check what rehypeRaw is used for in the codebase and if it's necessary
rg -n "rehypeRaw\|rehype-raw" services/platform/app/\(app\)/dashboard/\[id\]/chat/components/message-bubble.tsx -B 2 -A 5

Repository: tale-project/tale

Length of output: 43


🌐 Web query:

rehype-raw security risks markdown injection 2024 2025

💡 Result:

Short summary — risk, recent context (2024–2025), and mitigations.

  • Risk: rehype-raw allows embedded/raw HTML in Markdown to be parsed into the HTML AST; used on untrusted input it can enable XSS/“markdown injection” because arbitrary HTML (including <script>, event handlers, data: URLs, etc.) may be passed through. Use of allowDangerousHtml + rehype-raw is explicitly called out as unsafe in the project docs. [1][4][5]

  • Recent/context (2024–2025): there are no high-profile CVEs on rehype-raw itself, but related tooling and common Markdown parsers have had XSS advisories in 2024–2025 (examples: reports and CVEs involving markdown rendering in apps using rehype/remark pipelines and a CVE published for markdown-it in 2025), showing the practical risk when raw HTML is allowed and not sanitized. This reinforces that the attack surface is real even if rehype-raw is not itself a packaged “vulnerability.” [5][6][2]

  • Recommended mitigations

    1. Never run rehype-raw over untrusted user content without sanitization. If you must render untrusted Markdown with embedded HTML, sanitize AFTER rehype-raw (e.g., rehype-sanitize / hast-util-sanitize) — rehype docs and rehype-sanitize explicitly recommend this ordering. [2][1]
    2. Prefer disallowing raw HTML: avoid allowDangerousHtml in remark-rehype or strip/escape HTML before processing. [1][4]
    3. Use a strict sanitize schema (e.g., GitHub-like schema) and disallow risky attributes (on*, javascript: URLs, data: where appropriate). rehype-sanitize has options/schemas for that. [2]
    4. If you need limited HTML, consider whitelist/tag-filtering or transform untrusted HTML to a safe subset server-side before rendering. Some wrappers/options (tagfilter) can help but must be tested. [3][4]
    5. Keep unified/remark/rehype/rehype-raw/rehype-sanitize and Markdown libraries up to date; monitor advisories (npm, Snyk, NVD) for related parser/XSS fixes. [2][4][6]

If you want, I can:

  • show a minimal rehype pipeline example that safely processes untrusted Markdown (rehype-raw followed by rehype-sanitize), or
  • scan specific package versions you use and list any advisories. Which would you prefer?

Sources
[1] rehype-raw README / security note.
[2] rehype-sanitize (recommended plugin + security guidance).
[3] rehype-raw usage notes (tagfilter / ordering advice).
[4] Snyk package security page for rehype-raw (maintenance & findings).
[5] NVD advisory example showing XSS in an app rendering Markdown via rehype-raw (CVE example).
[6] NVD / advisories re: markdown-it XSS (2025) — example of related parser vulnerabilities.


🌐 Web query:

react-markdown filename markdown link syntax injection attack example

💡 Result:

Short example — attack and mitigations.

Attack (malicious filename inserted into Markdown):

  • Vulnerable template:
    Download file
  • Attacker sets filename exactly to:
    ](javascript:alert(1))<!--.pdf
  • Resulting Markdown becomes:
    Download file
    — which creates a link with a javascript: URL → XSS when clicked. [1][2]

Mitigations (pick one or more):

  • URL-encode filenames when inserting into links: use encodeURIComponent(filename) so special characters can’t break the link syntax. [2]
  • Validate/whitelist link protocols (allow only http(s), mailto, tel) or use react-markdown’s uriTransformer / transformLinkUri to filter/normalize URIs. [1][3]
  • Disable raw HTML in markdown and sanitize output (DOMPurify / rehype-sanitize) before rendering. [2][3]

References

  • Example attack discussion / link-transform protections (react-markdown / markdown libs). [1]
  • Markdown security best practices and sanitization (DOMPurify, URL validation, allowed protocols). [2]
  • Secure-react-markdown guidance (use uriTransformer / disable HTML / sanitize). [3]

Sources
[1] Medium: Avoiding XSS via Markdown in React (react-markdown link transformer). (medium.com)
[2] Markdown Security Best Practices (link/data/javascript protocols; sanitize/encode). (markdownlang.com)
[3] WSO2 / secure guidance for react-markdown (skipHtml, sanitize plugins). (security.docs.wso2.com)

Citations:


🏁 Script executed:

# Verify the exact data flow: check how react-markdown with rehypeRaw is configured
cat -n services/platform/app/\(app\)/dashboard/\[id\]/chat/components/message-bubble.tsx | grep -A 20 "Markdown"

Repository: tale-project/tale

Length of output: 4690


🏁 Script executed:

# Check if there are any link/URL validators or transformers in the react-markdown setup
rg -n "uriTransformer|transformLinkUri|href.*validate|allowDangerousHtml" services/platform/app/

Repository: tale-project/tale

Length of output: 43


🌐 Web query:

DOMPurify sanitize javascript: URL protocol links security 2025

💡 Result:

Short answer: DOMPurify blocks "javascript:" (and other unsafe schemes) in href/src by default, but you must keep DOMPurify up-to-date and follow defense‑in‑depth because past config-specific bugs (e.g., SAFE_FOR_TEMPLATES) or old versions have introduced XSS risks. See sources below. [1][2][3][4]

Recommendations (concise)

  • Upgrade to the latest DOMPurify release (patch any CVE‑fixed versions; e.g., CVE-2025-26791 affects < 3.2.4 when SAFE_FOR_TEMPLATES=true). [3]
  • Rely on DOMPurify’s default URL filtering (it blocks javascript:, data:, vbscript: by default) but also validate/normalize URLs server‑side where possible. [1][2]
  • Don’t assume custom hooks/configs are safe — review any ALLOWED_* or SAFE_FOR_TEMPLATES settings and test with current browser parsing behavior. [2][3]
  • For external links add rel="noopener noreferrer" (and prefer target behavior controlled by your code) and canonicalize hrefs before inserting. [1][4]
  • If you only need URL sanitization (not full HTML), consider dedicated URL validation/normalization in addition to DOMPurify (DOMPurify’s URL logic is internal; see related GitHub discussion). [4]

Sources

  • DOMPurify homepage / features (URI/protocol filtering). [1]
  • DOMPurify technical FAQ / “how does DOMPurify ensure…”, notes on javascript: URIs and allowed protocols. [2]
  • CVE-2025-26791 advisory (DOMPurify <3.2.4 SAFE_FOR_TEMPLATES mXSS issue). [3]
  • DOMPurify GitHub issue re: URL sanitization API (discussing internal URL logic / developer needs). [4]

🏁 Script executed:

# Check if there's any sanitization AFTER markdown rendering, before display
sed -n '380,465p' "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx" | grep -E "sanitize|DOMPurify|dangerously"

Repository: tale-project/tale

Length of output: 257


🏁 Script executed:

# Trace the exact flow - what does sanitizeChatMessage output?
# Check if it's actually sanitizing markdown or if it's just escaping HTML entities
python3 << 'EOF'
# Simulate what DOMPurify does to markdown with javascript: link
markdown_input = "![badfile](javascript:alert(1))"
print(f"Input markdown: {markdown_input}")

# DOMPurify when given markdown as text (not HTML) would:
# 1. Look for HTML tags - finds none
# 2. Look for HTML attributes - finds none
# 3. Returns the input unchanged (because it's just text, not HTML)
print(f"DOMPurify output (treating as text): {markdown_input}")

# Then react-markdown parses it:
print(f"react-markdown parses: {markdown_input}")
print(f"Renders as HTML link: <a href=\"javascript:alert(1)\">badfile</a>")

# No further sanitization, so the javascript: link is live
print("\nConclusion: The javascript: link reaches the browser unsanitized!")
EOF

Repository: tale-project/tale

Length of output: 363


🏁 Script executed:

# Verify by checking if message-bubble ever calls sanitizeChatMessage AFTER markdown rendering
grep -n "sanitize\|Markdown" "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx" | head -30

Repository: tale-project/tale

Length of output: 651


Escape fileName and fileType in markdown before embedding (markdown syntax injection risk).

If attachment.fileName contains markdown control characters like ](javascript:alert(1))<!--, it can break the markdown link syntax and inject a javascript: URL. For example, a fileName of ](javascript:alert(1))<!--.pdf transforms the markdown ![](javascript:alert(1))<!--.pdf](url) into a link with a javascript: protocol. Since sanitizeChatMessage is applied to the markdown source (not the rendered HTML), the javascript: link reaches the browser unsanitized.

Add the escapeMarkdownText function and apply it to both fileName and fileType:

+function escapeMarkdownText(value: string): string {
+  // Escape markdown syntax characters to prevent link/text injection.
+  return value.replaceAll('\\', '\\\\').replaceAll('[', '\\[').replaceAll(']', '\\]').replaceAll('(', '\\(').replaceAll(')', '\\)');
+}

@@
-        imageMarkdowns.push(`![${attachment.fileName}](${url})`);
+        imageMarkdowns.push(`![${escapeMarkdownText(attachment.fileName)}](${url})`);
@@
-        fileMarkdowns.push(`📎 [${attachment.fileName}](${url}) (${attachment.fileType}, ${sizeDisplay})`);
+        fileMarkdowns.push(`📎 [${escapeMarkdownText(attachment.fileName)}](${url}) (${escapeMarkdownText(attachment.fileType)}, ${sizeDisplay})`);

Committable suggestion skipped: line range outside the PR's diff.

larryro added a commit that referenced this pull request May 7, 2026
Three round-2-confirmed gaps in the dual-control flow.

placeLegalHold (W1 #11)
- targetType: 'org' is the nuclear "halt all retention" hold; a single
  admin placing it freezes cleanup org-wide, defeating dual-control.
  Refuse by default; deployments running a single admin opt back in via
  the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker-
  checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement)
  supersedes this; today's commit restores the contract.

approveLegalHoldRelease (W1 #12)
- 5-min minimum delay between request and approval (constant
  RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack
  where one admin requests + a second admin instantly approves via the
  same automation flow. Skipped when self-approval is in effect (the
  escape hatch already opted out of dual-control entirely).
- Re-check that the original requester is still an org admin at
  approval time. A demoted/removed requester whose request lingered
  loses the ability to retroactively gate a destructive change.

closeLegalMatter (W1 #10 + W4 #6)
- Implements the docstring promise (round-2 v08 bonus): closing a
  matter now fans out PENDING release requests for every active hold
  linked via matterRef, leveraging the new
  by_organizationId_matterRef index. Approval still requires a
  second admin via approveLegalHoldRelease — matter-close does NOT
  auto-release, dual-control survives.
- Returns releaseRequestsFiled count so the UI can confirm the
  cascade. Audit row records linkedHolds + releaseRequestsFiled.
- Skips holds with an existing pending or approved request (idempotent
  against re-running close on a half-closed matter).
larryro added a commit that referenced this pull request May 7, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, the FastAPI router was mounted but
  unreachable from platform — a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 8, 2026
Three round-2-confirmed gaps in the dual-control flow.

placeLegalHold (W1 #11)
- targetType: 'org' is the nuclear "halt all retention" hold; a single
  admin placing it freezes cleanup org-wide, defeating dual-control.
  Refuse by default; deployments running a single admin opt back in via
  the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker-
  checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement)
  supersedes this; today's commit restores the contract.

approveLegalHoldRelease (W1 #12)
- 5-min minimum delay between request and approval (constant
  RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack
  where one admin requests + a second admin instantly approves via the
  same automation flow. Skipped when self-approval is in effect (the
  escape hatch already opted out of dual-control entirely).
- Re-check that the original requester is still an org admin at
  approval time. A demoted/removed requester whose request lingered
  loses the ability to retroactively gate a destructive change.

closeLegalMatter (W1 #10 + W4 #6)
- Implements the docstring promise (round-2 v08 bonus): closing a
  matter now fans out PENDING release requests for every active hold
  linked via matterRef, leveraging the new
  by_organizationId_matterRef index. Approval still requires a
  second admin via approveLegalHoldRelease — matter-close does NOT
  auto-release, dual-control survives.
- Returns releaseRequestsFiled count so the UI can confirm the
  cascade. Audit row records linkedHolds + releaseRequestsFiled.
- Skips holds with an existing pending or approved request (idempotent
  against re-running close on a half-closed matter).
larryro added a commit that referenced this pull request May 8, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, the FastAPI router was mounted but
  unreachable from platform — a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 8, 2026
Three round-2-confirmed gaps in the dual-control flow.

placeLegalHold (W1 #11)
- targetType: 'org' is the nuclear "halt all retention" hold; a single
  admin placing it freezes cleanup org-wide, defeating dual-control.
  Refuse by default; deployments running a single admin opt back in via
  the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker-
  checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement)
  supersedes this; today's commit restores the contract.

approveLegalHoldRelease (W1 #12)
- 5-min minimum delay between request and approval (constant
  RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack
  where one admin requests + a second admin instantly approves via the
  same automation flow. Skipped when self-approval is in effect (the
  escape hatch already opted out of dual-control entirely).
- Re-check that the original requester is still an org admin at
  approval time. A demoted/removed requester whose request lingered
  loses the ability to retroactively gate a destructive change.

closeLegalMatter (W1 #10 + W4 #6)
- Implements the docstring promise (round-2 v08 bonus): closing a
  matter now fans out PENDING release requests for every active hold
  linked via matterRef, leveraging the new
  by_organizationId_matterRef index. Approval still requires a
  second admin via approveLegalHoldRelease — matter-close does NOT
  auto-release, dual-control survives.
- Returns releaseRequestsFiled count so the UI can confirm the
  cascade. Audit row records linkedHolds + releaseRequestsFiled.
- Skips holds with an existing pending or approved request (idempotent
  against re-running close on a half-closed matter).
larryro added a commit that referenced this pull request May 8, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, the FastAPI router was mounted but
  unreachable from platform — a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 9, 2026
…stodian cascade

Round-1 #10/#12/#13 + round-2 V3 confirmed three custodian-cascade gaps
(P0-6 retention guard, P0-7 deleteDocumentById, P0-8 cleanupMessageMetadata
+ cleanupChatFilterEvents). The user's observation made the deeper truth
clear: in production, only `org` and `userMembership` hold target types
are operator-facing — `thread` / `document` / `execution` were modeled in
the original fine-grained hold design but never wired into the place-hold
UI (`PICKER_TARGET_TYPES = ['userMembership', 'org']`), and `usePlaceLegalHold`
is called from a single dialog. The corresponding `holds.threadIds` /
`documentIds` / `executionIds` Sets in `ActiveHolds` therefore stayed
empty in every deployment, while ~80 LOC of dead per-row checks
proliferated through retention, erasure, restore, folders, and the
cascade helper. This commit collapses the model to its actual shape.

Schema-level narrowing (legal_hold.ts):
  - HOLD_TARGET_TYPES = ['userMembership', 'org'] (write-side)
  - ActiveHolds = { orgHeld, userMembershipIds }
  - loadActiveHolds switch reduces to 2 cases; legacy thread/document/
    execution rows in the table (if any) are intentionally ignored —
    they're surfaced in admin UI as `(legacy)` via the read-side
    validator in legal_hold_queries.ts.
  - resolveAndAssertTarget keeps only `org` + `userMembership` cases.
  - isHeld() simplified to org + custodian.

Guard simplification (legal_hold_guard.ts):
  - assertNotHeld drops the per-row branches; `targetType`/`targetId`
    are now error-message context only. The user-custodian cascade
    via `authorUserId` is the sole remaining gate beyond org-wide.

assertSafeRetentionDelete (internal_mutations_retention.ts):
  - Drops `targetType?` / `targetId?` args; adds `authorUserId?`.
  - Internal handler is now: cross-org check → TOCTOU check →
    org-wide hold → user-custodian cascade. Single, clear contract.

P0-6 cascade wiring — every retention mutation passes the row's
author user id (~15 call sites in internal_mutations_retention.ts):
  - documents → doc.createdBy
  - threads → thread.userId
  - workflowExecution → execution.userId
  - workflowTriggerLog → exec.userId via wfExecutionId join
  - messageFeedback → row.userId
  - memoryAudit → row.subjectUserId (with thread-fallback)
  - chatFilterEvent → parent thread's userId via index lookup
  - usageLedger → row.userId
  - promptTemplate → row.createdBy
  - messageMetadata → parent thread's userId (closes P0-8 directly)
  - tempFile, customer, vendor, externalConversation: org-only
    (no author concept on the row)

P0-8 cascade fix — cleanupMessageMetadata + cleanupChatFilterEvents
no longer rely on the now-empty `holds.threadIds` Set; both look up
the parent thread's `userId` for the user-membership cascade
(messageMetadata cleanup mutation was already P0-8 fixed in commit 1).

Other dead-check removals:
  - retention_cleanup.ts: ~10 dead `holds.threadIds.has` /
    `documentIds.has` / `executionIds.has` checks removed; per-row
    custodian cascade either replaces them or the mutation re-checks.
  - governance/restore.ts: drops dead per-row branches; user-custodian
    cascade in restoreSoftDeletedRow (commit 1) covers the rest.
  - threads/cascade_helpers.ts: per-thread hold check replaced with
    a thread-metadata lookup → `userMembershipIds.has(thread.userId)`.
  - threads/delete_chat_thread.ts + restore_chat_thread.ts: drop
    `holds.threadIds.has(threadId)`; ownerHeld + orgHeld cover all
    refusal cases.
  - folders/mutations.ts assertNoHeldDescendantDocs: per-document
    hold check replaced with `userMembershipIds.has(doc.createdBy)`.
  - governance/erasure.ts requestErasure: drops heldThreadIds /
    heldDocumentIds collection — refusal is now `orgHeld ||
    userCustodianHeld`. eraseSubjectDocuments drops the per-row
    held-document check (subject is by definition the only author
    in the iteration; user-custodian gate applied at request time).
  - legal_hold_internal.ts: loadActiveHoldsForOrg returns the
    narrowed ActiveHolds shape across the action/query boundary.

Tests rewritten — `governance/__tests__/retention_legal_hold_gaps.test.ts`:
  - Drops thread/document/execution per-row hold cases (behaviour
    deleted on purpose).
  - Adds 6 cases covering the simplified contract: authorUserId
    cascade, TOCTOU + cross-org pre-checks, missing authorUserId
    grandfather, org-wide hold blanket refusal.
  - Source-grep regressions: every retention mutation passes the
    right authorUserId; cleanup actions pre-filter via the right
    `holds.userMembershipIds.has(...)`; per-row Sets stay deleted
    from the cleanup / erasure / guard sources.

Note: this is a one-way migration. Any deployment with active
legacy thread/document/execution holds loses enforcement of those
holds going forward; the read-side validator in legal_hold_queries
keeps showing them as `(legacy)`. Acceptable per project memory
(demo stage, no prod use). Pre-prod check: convert to user/org
holds first if any deployment has them.

Verified: typecheck clean; 851 tests pass across affected dirs;
full vitest run shows 5641/5642 pass (1 pre-existing flake in
canvas-pane.test.tsx, unrelated).
larryro added a commit that referenced this pull request May 9, 2026
…te gates

P0-9 — Org-wide hold UI invisibility (round-1 #12, round-2 V4):
  Before: `heldByTargetValidator` declared `via: 'org'` but no producer
  ever emitted it. With only an org-wide hold active, `getLegalHoldByTarget`
  returned null and `listActiveHoldTargetIds` returned an empty list →
  no chat-sidebar lock icons, no admin-grid badges, no banner. Backend
  refused destructive actions but the UX showed nothing until the user
  attempted a delete and hit a confusing error.
  Fix:
  - `getLegalHoldByTarget`: third pass after direct + custodian-cascade
    looks up the org-wide hold row and surfaces it with `via: 'org'`.
  - `listActiveHoldTargetIds`: returns a new `orgHeld: boolean` flag
    alongside the per-target id set; the chat-sidebar ORs the flag with
    the per-thread set so every row gets the lock indicator when the
    whole org is on a hold.
  - `chat-history-sidebar.tsx`: replaces direct `heldThreadIds.has(id)`
    checks with an `isThreadHeld(id)` helper that ORs `orgWideHeld`.

P0-10 — `prepareOrganizationDeletion` bypasses holds (round-1 #13, V4):
  `cascadeOnOrgDeleted` hard-deletes userMemories + userPreferences (and
  per Commit 1, every thread-bound chat-upload file) for every member of
  the org. The mutation refused only on owner-role check; legal-hold
  status was never consulted. Adds `assertNotHeld(ctx, orgId, 'org',
  orgId, undefined, authUser._id)` before the cascade — refuses on org-wide
  hold OR on a userMembership cascade against the owner deleting.
  `GuardedTargetType` extended with `'org'` for the error-payload context.

P0-11 — `removeMember` bypasses custodian holds (round-1 #13, V4):
  `cascadeOnMemberRemoved` cascades into userMemories + userPreferences
  (+ chat-upload files) scoped to the member being removed. Mutation
  previously checked only role. Adds `assertNotHeld(ctx, orgId,
  'userMembership', member.userId, undefined, member.userId)` before
  the cascade — refuses when the org is on a hold OR when the member
  themselves is on a custodian hold. Self-author cascade uses the same
  user id for `targetId` and `authorUserId` so the gate fires regardless
  of which leg matches.

Verified: typecheck clean; 908 tests pass across affected dirs.
larryro added a commit that referenced this pull request May 9, 2026
Three round-2-confirmed gaps in the dual-control flow.

placeLegalHold (W1 #11)
- targetType: 'org' is the nuclear "halt all retention" hold; a single
  admin placing it freezes cleanup org-wide, defeating dual-control.
  Refuse by default; deployments running a single admin opt back in via
  the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker-
  checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement)
  supersedes this; today's commit restores the contract.

approveLegalHoldRelease (W1 #12)
- 5-min minimum delay between request and approval (constant
  RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack
  where one admin requests + a second admin instantly approves via the
  same automation flow. Skipped when self-approval is in effect (the
  escape hatch already opted out of dual-control entirely).
- Re-check that the original requester is still an org admin at
  approval time. A demoted/removed requester whose request lingered
  loses the ability to retroactively gate a destructive change.

closeLegalMatter (W1 #10 + W4 #6)
- Implements the docstring promise (round-2 v08 bonus): closing a
  matter now fans out PENDING release requests for every active hold
  linked via matterRef, leveraging the new
  by_organizationId_matterRef index. Approval still requires a
  second admin via approveLegalHoldRelease — matter-close does NOT
  auto-release, dual-control survives.
- Returns releaseRequestsFiled count so the UI can confirm the
  cascade. Audit row records linkedHolds + releaseRequestsFiled.
- Skips holds with an existing pending or approved request (idempotent
  against re-running close on a half-closed matter).
larryro added a commit that referenced this pull request May 9, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, the FastAPI router was mounted but
  unreachable from platform — a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 9, 2026
…stodian cascade

Round-1 #10/#12/#13 + round-2 V3 confirmed three custodian-cascade gaps
(P0-6 retention guard, P0-7 deleteDocumentById, P0-8 cleanupMessageMetadata
+ cleanupChatFilterEvents). The user's observation made the deeper truth
clear: in production, only `org` and `userMembership` hold target types
are operator-facing — `thread` / `document` / `execution` were modeled in
the original fine-grained hold design but never wired into the place-hold
UI (`PICKER_TARGET_TYPES = ['userMembership', 'org']`), and `usePlaceLegalHold`
is called from a single dialog. The corresponding `holds.threadIds` /
`documentIds` / `executionIds` Sets in `ActiveHolds` therefore stayed
empty in every deployment, while ~80 LOC of dead per-row checks
proliferated through retention, erasure, restore, folders, and the
cascade helper. This commit collapses the model to its actual shape.

Schema-level narrowing (legal_hold.ts):
  - HOLD_TARGET_TYPES = ['userMembership', 'org'] (write-side)
  - ActiveHolds = { orgHeld, userMembershipIds }
  - loadActiveHolds switch reduces to 2 cases; legacy thread/document/
    execution rows in the table (if any) are intentionally ignored —
    they're surfaced in admin UI as `(legacy)` via the read-side
    validator in legal_hold_queries.ts.
  - resolveAndAssertTarget keeps only `org` + `userMembership` cases.
  - isHeld() simplified to org + custodian.

Guard simplification (legal_hold_guard.ts):
  - assertNotHeld drops the per-row branches; `targetType`/`targetId`
    are now error-message context only. The user-custodian cascade
    via `authorUserId` is the sole remaining gate beyond org-wide.

assertSafeRetentionDelete (internal_mutations_retention.ts):
  - Drops `targetType?` / `targetId?` args; adds `authorUserId?`.
  - Internal handler is now: cross-org check → TOCTOU check →
    org-wide hold → user-custodian cascade. Single, clear contract.

P0-6 cascade wiring — every retention mutation passes the row's
author user id (~15 call sites in internal_mutations_retention.ts):
  - documents → doc.createdBy
  - threads → thread.userId
  - workflowExecution → execution.userId
  - workflowTriggerLog → exec.userId via wfExecutionId join
  - messageFeedback → row.userId
  - memoryAudit → row.subjectUserId (with thread-fallback)
  - chatFilterEvent → parent thread's userId via index lookup
  - usageLedger → row.userId
  - promptTemplate → row.createdBy
  - messageMetadata → parent thread's userId (closes P0-8 directly)
  - tempFile, customer, vendor, externalConversation: org-only
    (no author concept on the row)

P0-8 cascade fix — cleanupMessageMetadata + cleanupChatFilterEvents
no longer rely on the now-empty `holds.threadIds` Set; both look up
the parent thread's `userId` for the user-membership cascade
(messageMetadata cleanup mutation was already P0-8 fixed in commit 1).

Other dead-check removals:
  - retention_cleanup.ts: ~10 dead `holds.threadIds.has` /
    `documentIds.has` / `executionIds.has` checks removed; per-row
    custodian cascade either replaces them or the mutation re-checks.
  - governance/restore.ts: drops dead per-row branches; user-custodian
    cascade in restoreSoftDeletedRow (commit 1) covers the rest.
  - threads/cascade_helpers.ts: per-thread hold check replaced with
    a thread-metadata lookup → `userMembershipIds.has(thread.userId)`.
  - threads/delete_chat_thread.ts + restore_chat_thread.ts: drop
    `holds.threadIds.has(threadId)`; ownerHeld + orgHeld cover all
    refusal cases.
  - folders/mutations.ts assertNoHeldDescendantDocs: per-document
    hold check replaced with `userMembershipIds.has(doc.createdBy)`.
  - governance/erasure.ts requestErasure: drops heldThreadIds /
    heldDocumentIds collection — refusal is now `orgHeld ||
    userCustodianHeld`. eraseSubjectDocuments drops the per-row
    held-document check (subject is by definition the only author
    in the iteration; user-custodian gate applied at request time).
  - legal_hold_internal.ts: loadActiveHoldsForOrg returns the
    narrowed ActiveHolds shape across the action/query boundary.

Tests rewritten — `governance/__tests__/retention_legal_hold_gaps.test.ts`:
  - Drops thread/document/execution per-row hold cases (behaviour
    deleted on purpose).
  - Adds 6 cases covering the simplified contract: authorUserId
    cascade, TOCTOU + cross-org pre-checks, missing authorUserId
    grandfather, org-wide hold blanket refusal.
  - Source-grep regressions: every retention mutation passes the
    right authorUserId; cleanup actions pre-filter via the right
    `holds.userMembershipIds.has(...)`; per-row Sets stay deleted
    from the cleanup / erasure / guard sources.

Note: this is a one-way migration. Any deployment with active
legacy thread/document/execution holds loses enforcement of those
holds going forward; the read-side validator in legal_hold_queries
keeps showing them as `(legacy)`. Acceptable per project memory
(demo stage, no prod use). Pre-prod check: convert to user/org
holds first if any deployment has them.

Verified: typecheck clean; 851 tests pass across affected dirs;
full vitest run shows 5641/5642 pass (1 pre-existing flake in
canvas-pane.test.tsx, unrelated).
larryro added a commit that referenced this pull request May 9, 2026
…te gates

P0-9 — Org-wide hold UI invisibility (round-1 #12, round-2 V4):
  Before: `heldByTargetValidator` declared `via: 'org'` but no producer
  ever emitted it. With only an org-wide hold active, `getLegalHoldByTarget`
  returned null and `listActiveHoldTargetIds` returned an empty list →
  no chat-sidebar lock icons, no admin-grid badges, no banner. Backend
  refused destructive actions but the UX showed nothing until the user
  attempted a delete and hit a confusing error.
  Fix:
  - `getLegalHoldByTarget`: third pass after direct + custodian-cascade
    looks up the org-wide hold row and surfaces it with `via: 'org'`.
  - `listActiveHoldTargetIds`: returns a new `orgHeld: boolean` flag
    alongside the per-target id set; the chat-sidebar ORs the flag with
    the per-thread set so every row gets the lock indicator when the
    whole org is on a hold.
  - `chat-history-sidebar.tsx`: replaces direct `heldThreadIds.has(id)`
    checks with an `isThreadHeld(id)` helper that ORs `orgWideHeld`.

P0-10 — `prepareOrganizationDeletion` bypasses holds (round-1 #13, V4):
  `cascadeOnOrgDeleted` hard-deletes userMemories + userPreferences (and
  per Commit 1, every thread-bound chat-upload file) for every member of
  the org. The mutation refused only on owner-role check; legal-hold
  status was never consulted. Adds `assertNotHeld(ctx, orgId, 'org',
  orgId, undefined, authUser._id)` before the cascade — refuses on org-wide
  hold OR on a userMembership cascade against the owner deleting.
  `GuardedTargetType` extended with `'org'` for the error-payload context.

P0-11 — `removeMember` bypasses custodian holds (round-1 #13, V4):
  `cascadeOnMemberRemoved` cascades into userMemories + userPreferences
  (+ chat-upload files) scoped to the member being removed. Mutation
  previously checked only role. Adds `assertNotHeld(ctx, orgId,
  'userMembership', member.userId, undefined, member.userId)` before
  the cascade — refuses when the org is on a hold OR when the member
  themselves is on a custodian hold. Self-author cascade uses the same
  user id for `targetId` and `authorUserId` so the gate fires regardless
  of which leg matches.

Verified: typecheck clean; 908 tests pass across affected dirs.
larryro added a commit that referenced this pull request May 17, 2026
Closes #9, #10, #11, #12 — cascade correctness + GC durability.

- `personalization_cascade.ts:cascadeOnOrgDeleted` swaps delete order:
  `db.delete` runs FIRST, then `storage.delete` inside the try/catch.
  Matches the documented contract in `tts/cascade_helpers.ts:55-62` —
  Convex `_storage` writes are out-of-band and not rolled back on tx
  abort, so the reverse order leaves a surviving row pointing at a
  dead storageId (404 on `/api/tts-audio`).
- `threads/cascade_helpers.ts` step 7c TTS cleanup gets the same swap,
  for the same reason.
- `cascadeOnTtsForMemberRemoved` per-mutation page cap lowered from
  50 (~10K writes) to 30 (~6K writes) to stay under Convex's ~8K
  per-mutation write budget. `cascadeOnOrgDeleted` gets the same cap
  reduction. The hourly cron picks up whatever doesn't fit in a single
  pass — still well inside the 30-day Art 12(3) GDPR window.
- `gcOrgTtsChunks` now persists its org-cursor in a new singleton
  `ttsGcCursor` table between cron runs. A deployment with more orgs
  than `MAX_ORGS_PER_RUN` now advances through the full org list over
  successive hours instead of restarting from the lex-first org every
  time and starving lex-tail orgs forever. On reaching the end of the
  org list the cursor wraps to null and the next run starts over.
- `gcOrgTtsChunks` skip-empty: an org with no rows older than the
  retention cutoff no longer counts against `MAX_ORGS_PER_RUN`. Without
  this, a busy tail of stale orgs sandwiched behind quiet lex-leading
  orgs would never get reaped.
larryro added a commit that referenced this pull request May 17, 2026
Closes #9, #10, #11, #12 — cascade correctness + GC durability.

- `personalization_cascade.ts:cascadeOnOrgDeleted` swaps delete order:
  `db.delete` runs FIRST, then `storage.delete` inside the try/catch.
  Matches the documented contract in `tts/cascade_helpers.ts:55-62` —
  Convex `_storage` writes are out-of-band and not rolled back on tx
  abort, so the reverse order leaves a surviving row pointing at a
  dead storageId (404 on `/api/tts-audio`).
- `threads/cascade_helpers.ts` step 7c TTS cleanup gets the same swap,
  for the same reason.
- `cascadeOnTtsForMemberRemoved` per-mutation page cap lowered from
  50 (~10K writes) to 30 (~6K writes) to stay under Convex's ~8K
  per-mutation write budget. `cascadeOnOrgDeleted` gets the same cap
  reduction. The hourly cron picks up whatever doesn't fit in a single
  pass — still well inside the 30-day Art 12(3) GDPR window.
- `gcOrgTtsChunks` now persists its org-cursor in a new singleton
  `ttsGcCursor` table between cron runs. A deployment with more orgs
  than `MAX_ORGS_PER_RUN` now advances through the full org list over
  successive hours instead of restarting from the lex-first org every
  time and starving lex-tail orgs forever. On reaching the end of the
  org list the cursor wraps to null and the next run starts over.
- `gcOrgTtsChunks` skip-empty: an org with no rows older than the
  retention cutoff no longer counts against `MAX_ORGS_PER_RUN`. Without
  this, a busy tail of stale orgs sandwiched behind quiet lex-leading
  orgs would never get reaped.
larryro added a commit that referenced this pull request May 17, 2026
Two-round, 39-agent review (30 round-1 + 9 round-2) on top of round-5
identified 16 HIGH + 18 MEDIUM defects across the TTS pipeline. This
single commit closes every HIGH/MEDIUM that wasn't deferred:

Server (correctness / cost)
- reserveChunk schedules markChunkFailed watchdog at PENDING_STALE_MS+5s
  so an action that crashes between storage.store and mark-ready can no
  longer strand a row in `pending` until the 7-day cron
- OCC detection now layers ConvexError.data.code + constructor.name + the
  regex fallback so reserialized errors still resolve to CONTENTION
- 429 Retry-After is parsed via TtsProviderHttpError + parseRetryAfterMs
  and threaded back through the action return; classifier exposes
  retryAfterMs on ClassifiedFailure
- 4xx classifier split into PROVIDER_AUTH (401/403),
  PROVIDER_BAD_REQUEST (400/422), PROVIDER_PAYLOAD_TOO_LARGE (413), plus
  PROVIDER_4XX generic fallback
- WATCHDOG_TIMEOUT code added to literals and announcer/indicator maps
- Synth action now logs charCount/bytes/durationMs on success and on
  every provider-error path; sets Accept: audio/* on the upstream POST
- All TTS warn/error logs route through sanitizeError; http.ts:270
  brought into line with the rest of the file
- scheduleOpportunisticCleanup helper deduplicates the two
  mark-ready/mark-failed cleanup blocks

Governance + cascade
- Added policyType: 'voice_output' with config { enabled: boolean } to
  schema + upsertPolicy dispatch; org admins can now disable voice for
  the entire tenant
- getVoiceModeEffective tier order is now org_policy > thread_override
  > user_pref > false; org-policy veto added to source union
- Legacy thread fallback (no organizationId) now returns false instead
  of a non-deterministic prefix-only first match across orgs

HTTP + SSRF
- /api/tts-audio: Cross-Origin-Resource-Policy: same-origin on 200;
  Cache-Control: no-store + Vary: Cookie + WWW-Authenticate on 401
- safe_fetch: IPv6 link-local check now covers all of fe80::/10 (plus
  fec0::/10 site-local) and strips %zoneid; redirect-follow restricted
  to {301,302,303,307,308} so 304/305/306 stop misclassifying

Client lifecycle
- runNext gated on enabledRef and a dedicated useEffect clears
  queueRef/retry timers when opts.enabled flips false, so toggle-off
  mid-stream no longer drains the queue (and bills) for paragraphs the
  user already silenced
- detectChunkLocale now uses a CJK-ratio threshold (40%) instead of a
  single-character probe, so "中文 with English help" stays in fallback
- Player's onEnded detaches listeners symmetrically with onError, and
  stop() advances playEpoch so a coordinator.claim().then() callback
  racing a teardown can no longer re-flip activeRef
- VoiceOutputProvider accepts optional threadId and resets per-thread
  stores (preReservationErrors, activePlayback, announcer) on switch;
  resetAll() added to all three store factories

Markdown stripper
- ~~~ fences, 4-space indented code blocks, autolinks <https://…>, bare
  URLs, tables, $inline$ / $$block$$ math, ~~strike~~, task-list
  checkboxes, HTML tags, common HTML entities, footnote refs, intra-
  word _emphasis_ (boundary-anchored per CommonMark)
- Tests extended from 13 to 24 cases covering every regression class

Spotlight + a11y
- assistant-message-content masks fenced regions before paragraph
  split, so a fence containing a blank line no longer renders as an
  unterminated <pre> + literal backticks
- Inactive paragraphs receive aria-hidden during active playback to
  prevent SR double-read; spotlight now uses text-muted-foreground
  (WCAG-compliant) plus a left-border accent as a non-color cue

Dropdown / thread toggle a11y
- Added a 'checkbox' DropdownMenu variant rendering CheckboxItem with
  onSelect preventDefault and min-h-11 row height (44px target)
- Per-thread voice toggle migrated from a Switch-inside-custom-row to
  the new variant; arrow-key navigable, announces aria-checked,
  menuitemcheckbox role

i18n + docs
- en.json:4806 'Voice mode' → 'Voice output' (canonical per GLOSSARY)
- de.json:2067 'Sprachwiedergabe' → 'Sprachausgabe' (matches 13 other
  occurrences); 2066 trailing-space ellipsis fix
- All three voice-output.md rewritten to describe the Ellipsis-menu
  Switch (not the deleted tri-state speaker icon)
- All three providers.md TTS-fields tables gained defaultInstructions
  and instructionsByLocale rows, and clarified per-1M-character billing

Deferred (separate PRs):
- HIGH #12 header-icon discoverability restoration (per user)
- safe_fetch per-read idle timeout, undici Dispatcher pinning
- Comprehensive test backfill for synthesize/mutations/queries
- Markdown-strip → remark migration (streaming-state requirement)

bun lint: 0 errors. tsc --noEmit: clean. markdown-strip + error_codes
suites: 24/24 + 41/41 passing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant