feat: add file attachment support for chat agent with document parsing by larryro · Pull Request #12 · tale-project/tale

larryro · 2025-12-12T06:54:22Z

Add file attachment handling to chat interface with paste support for images
Implement multi-modal message support: images sent inline, documents parsed via tools
Create file parser service in crawler for PDF, DOCX, PPTX text extraction
Add parse operation to pdf_tool, docx_tool, and pptx_tool
Add shared parseFile helper to download and extract text from document URLs
Register files with agent component for proper tracking
Update chat mutation and action to pass attachments through the flow

Summary by CodeRabbit

New Features
- Added support for parsing PDF, DOCX, and PPTX documents with automatic text and metadata extraction.
- Users can now attach files and images to chat conversations for AI processing.
- Added image paste functionality—users can paste images directly from clipboard into the chat.
- AI chat now supports multi-modal interactions, processing both text and attached files/images together.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

- Add file attachment handling to chat interface with paste support for images - Implement multi-modal message support: images sent inline, documents parsed via tools - Create file parser service in crawler for PDF, DOCX, PPTX text extraction - Add parse operation to pdf_tool, docx_tool, and pptx_tool - Add shared parseFile helper to download and extract text from document URLs - Register files with agent component for proper tracking - Update chat mutation and action to pass attachments through the flow

coderabbitai · 2025-12-12T07:04:18Z

📝 Walkthrough

Walkthrough

This PR introduces file parsing and multi-modal attachment support across the application stack. The crawler service adds a new FileParserService class that parses PDF, DOCX, and PPTX files and exposes them via a new /api/v1/parse/file REST endpoint. The platform side extends the chat UI with clipboard paste support for images and an attachment button, integrates file parsing into existing PDF/DOCX/PPTX Convex tools, and enhances the chat agent to handle attachments through the message pipeline, including optimistic UI updates and multi-modal prompt construction for AI processing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

services/platform/convex/model/chat_agent/generate_agent_response.ts: Multi-modal attachment handling with file registration, SHA256 computation, and complex promptContent construction for AI integration requires careful logic review.
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx: Optimistic message clearing logic with conditional content matching (exact or prefix-based) needs verification for edge cases.
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts: File download, FormData construction, and error handling for crawler integration requires validation of abort timeout and response parsing.
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx: Image paste handler and clipboard processing logic should be reviewed for browser compatibility and file handling correctness.
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts and pptx_tool.ts: Union type refactoring and conditional operation handling (parse vs. generate/analyze) need verification for completeness.

Possibly related PRs

talecorp/poc2#29: Directly implements the backend file parsing service (FileParserService) and /api/v1/parse/file endpoint that the parse_file.ts helper and Convex tool changes invoke.
tale-project/poc2#497: Introduces the same crawler FileParserService and ParseFileResponse API contract used by the parse_file.ts helper and new PDF/DOCX/PPTX tool parse operations.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)

services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (2)

12-30: Avoid the as GeneratePdfResult escape hatch; keep result typing honest
return { operation: 'generate', ...result } as GeneratePdfResult; bypasses compile-time verification that internal.documents.generateDocumentInternal still returns the full GeneratePdfResult shape. Prefer explicit mapping (or satisfies) so missing fields fail fast at build time.

63-118: Use z.discriminatedUnion('operation', [...]) to enforce operation-specific required fields

The schema currently marks all fields as optional, but validates them at runtime based on the operation value. This defers validation errors until the handler executes. Use Zod's discriminatedUnion() to split the schema by operation and enforce required fields upfront:

-    args: z.object({
-      operation: z
-        .enum(['generate', 'parse'])
-        .optional()
-        .describe("Operation: 'generate' (default) or 'parse'"),
-      // For generate operation
-      fileName: z.string().optional().describe(...),
-      sourceType: z.enum(['markdown', 'html', 'url']).optional().describe(...),
-      content: z.string().optional().describe(...),
-      ...
-      // For parse operation
-      url: z.string().optional().describe(...),
-      filename: z.string().optional().describe(...),
-    }),
+    args: z.discriminatedUnion('operation', [
+      z.object({
+        operation: z.literal('generate').optional(),
+        fileName: z.string().min(1),
+        sourceType: z.enum(['markdown', 'html', 'url']),
+        content: z.string().min(1),
+        pdfOptions: z
+          .object({
+            format: z.string().optional(),
+            landscape: z.boolean().optional(),
+            marginTop: z.string().optional(),
+            marginBottom: z.string().optional(),
+            marginLeft: z.string().optional(),
+            marginRight: z.string().optional(),
+            printBackground: z.boolean().optional(),
+          })
+          .optional(),
+        urlOptions: z
+          .object({
+            waitUntil: z.enum(['load', 'domcontentloaded', 'networkidle', 'commit']).optional(),
+          })
+          .optional(),
+        extraCss: z.string().optional(),
+        wrapInTemplate: z.boolean().optional(),
+      }),
+      z.object({
+        operation: z.literal('parse'),
+        url: z.string().url(),
+        filename: z.string().min(1),
+      }),
+    ]),

This catches missing fields at validation time with precise error messages instead of throwing at runtime (lines 124–143).

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx (1)

252-272: Replace <img> + hardcoded blues with next/image + semantic colors.
Repo guidelines: use Next.js Image (not native img) and avoid hardcoded colors like from-blue-100, text-blue-600.

+import Image from 'next/image';

-                        <img
-                          src={attachment.previewUrl}
-                          alt={attachment.fileName}
-                          className="size-full object-cover"
-                        />
+                        <Image
+                          src={attachment.previewUrl}
+                          alt={attachment.fileName}
+                          fill
+                          className="object-cover"
+                          unoptimized
+                        />

-                        <div className="size-full bg-gradient-to-br from-blue-100 to-blue-200 flex items-center justify-center">
-                          <span className="text-xs text-blue-600">IMG</span>
+                        <div className="size-full bg-muted flex items-center justify-center">
+                          <span className="text-xs text-muted-foreground">IMG</span>
                         </div>

services/platform/convex/chat_agent.ts (1)

24-48: Deduplicate the attachment validator to prevent schema drift.

+const FILE_ATTACHMENT_VALIDATOR = v.object({
+  fileId: v.id('_storage'),
+  fileName: v.string(),
+  fileType: v.string(),
+  fileSize: v.number(),
+});

 export const chatWithAgent = mutation({
   args: {
@@
-    attachments: v.optional(
-      v.array(
-        v.object({
-          fileId: v.id('_storage'),
-          fileName: v.string(),
-          fileType: v.string(),
-          fileSize: v.number(),
-        }),
-      ),
-    ),
+    attachments: v.optional(v.array(FILE_ATTACHMENT_VALIDATOR)),
   },
@@
 export const generateAgentResponse = internalAction({
   args: {
@@
-    attachments: v.optional(
-      v.array(
-        v.object({
-          fileId: v.id('_storage'),
-          fileName: v.string(),
-          fileType: v.string(),
-          fileSize: v.number(),
-        }),
-      ),
-    ),
+    attachments: v.optional(v.array(FILE_ATTACHMENT_VALIDATOR)),
     messageText: v.optional(v.string()),
   },

Also applies to: 108-125

services/platform/convex/model/chat_agent/generate_agent_response.ts (1)

384-396: Don’t log responseBody by default (possible sensitive data leak).
Even if “sanitized”, responseBody/err.data can include provider payloads that may contain user content. Gate it behind debugLog or drop it / truncate heavily.

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (1)

200-218: startsWith optimistic clearing can false-positive and hide the user message.
Guard the prefix-match so it only triggers for the “attachments appended” pattern (e.g., next chars are \n\n).

-        return (
-          m.content === optimisticMessage.content ||
-          m.content.startsWith(optimisticMessage.content)
-        );
+        if (m.content === optimisticMessage.content) return true;
+        if (!m.content.startsWith(optimisticMessage.content)) return false;
+        // Only treat as “same message” if server appended content (attachments section).
+        const suffix = m.content.slice(optimisticMessage.content.length);
+        return suffix.startsWith('\n\n');

services/platform/convex/model/chat_agent/chat_with_agent.ts (1)

50-75: Dedup logic drops attachments when message text repeats (attachments must bypass dedup).
Right now attachments are ignored if messageAlreadyExists is true, even if the attachments differ.

-  const {
+  const {
     latestMessage,
     lastUserMessage,
     messageAlreadyExists,
     trimmedMessage,
   } = computeDeduplicationState(existingMessages, message);
 
   const hasAttachments = attachments && attachments.length > 0;
+  const treatAsDuplicate = messageAlreadyExists && !hasAttachments;

@@
-  if (!messageAlreadyExists) {
+  if (!treatAsDuplicate) {
     const { messageId } = await saveMessage(ctx, components.agent, {
       threadId,
       message: { role: 'user', content: messageContent },
     });
     promptMessageId = messageId;
   } else {
     promptMessageId = lastUserMessage!._id;
   }

@@
-  const actionAttachments = !messageAlreadyExists && hasAttachments
+  const actionAttachments = !treatAsDuplicate && hasAttachments
     ? attachments.map((a) => ({
         fileId: a.fileId,
         fileName: a.fileName,
         fileType: a.fileType,
         fileSize: a.fileSize,
       }))
     : undefined;

Also applies to: 114-136

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 17e07ff and 80b318c.

⛔ Files ignored due to path filters (1)

services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**

📒 Files selected for processing (13)

services/crawler/app/file_parser_service.py (1 hunks)
services/crawler/app/main.py (2 hunks)
services/crawler/app/models.py (1 hunks)
services/crawler/requirements.txt (1 hunks)
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx (4 hunks)
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (2 hunks)
services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (6 hunks)
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (1 hunks)
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (4 hunks)
services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (6 hunks)
services/platform/convex/chat_agent.ts (2 hunks)
services/platform/convex/model/chat_agent/chat_with_agent.ts (4 hunks)
services/platform/convex/model/chat_agent/generate_agent_response.ts (7 hunks)

🧰 Additional context used

📓 Path-based instructions (6)

**/*

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Use English only for ALL user-facing content including UI components, labels, buttons, dialogs, forms, toast messages, error messages, success messages, comments, documentation, README files, variable names, function names, and type names

Files:

services/crawler/requirements.txt
services/platform/convex/chat_agent.ts
services/platform/convex/model/chat_agent/generate_agent_response.ts
services/platform/convex/model/chat_agent/chat_with_agent.ts
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
services/crawler/app/models.py
services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
services/crawler/app/file_parser_service.py
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx
services/crawler/app/main.py

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

**/*.{ts,tsx,js,jsx}: Use Vercel AI SDK with OpenAI - import from 'ai' and '@ai-sdk/openai', never use raw OpenAI SDK or OpenRouter
Never hallucinate API keys - always use environment variables and existing .env configuration
Use camelCase for function names (e.g., getUserData)
Use SCREAMING_SNAKE_CASE for constants (e.g., API_BASE_URL, MAX_RETRIES)
Use feature flags with enums (TypeScript) or const objects (JavaScript) with UPPERCASE_WITH_UNDERSCORE naming
Implement error handling with try-catch pattern: check for result.error and display descriptive toast messages using result.error as title

Files:

services/platform/convex/chat_agent.ts
services/platform/convex/model/chat_agent/generate_agent_response.ts
services/platform/convex/model/chat_agent/chat_with_agent.ts
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

**/*.{ts,tsx}: Use kebab-case for file names (e.g., user-profile.tsx)
Use PascalCase for component names (e.g., UserProfile)
Use descriptive messages as toast title (never generic 'Error'), with optional description for additional context only
Follow component structure: 'use client' directive, imports, interface Props, hooks, effects, event handlers, then render
Prioritize data fetching methods in order: Server Actions (preferred), Route Handlers (when needed), Client-side (minimal use)
Use React.memo for expensive components to optimize performance
Use Next.js Image component for all images instead of native img tags
Use dynamic imports for code splitting

Files:

services/platform/convex/chat_agent.ts
services/platform/convex/model/chat_agent/generate_agent_response.ts
services/platform/convex/model/chat_agent/chat_with_agent.ts
services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts
services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts
services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts
services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts
services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

services/*/convex/*.ts

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Thin wrapper API modules (like services/platform/convex/documents.ts) may export multiple Convex functions as wrappers that delegate to model helpers, but must not contain business logic and must only perform argument/return validation and delegation

Files:

services/platform/convex/chat_agent.ts

**/*.{tsx,jsx}

📄 CodeRabbit inference engine (.cursor/rules/figma_rules.mdc)

**/*.{tsx,jsx}: Avoid specifying font-['Inter:Regular',_sans-serif] as it should be the default
Only specify font-family when using non-default fonts like font-['Inter:Medium',_sans-serif]
Ensure font-family matches font-weight (Inter:Regular with font-normal, Inter:Medium with font-medium)
Use leading-normal instead of leading-[normal] in Tailwind classes
Use standard font size classes instead of arbitrary values: text-[12px] → text-xs, text-[14px] → text-sm, text-[16px] → text-base, text-[18px] → text-lg, text-[20px] → text-xl, text-[24px] → text-2xl
Use semantic spacing classes: p-[4px] → p-1, p-[8px] → p-2, m-[4px] → m-1, m-[8px] → m-2
Convert pixel values to rem using the 16px base for width and height measurements: w-[278px] → w-[17.375rem], h-[48px] → h-[3rem], min-w-[120px] → min-w-[7.5rem], max-w-[400px] → max-w-[25rem]
NEVER use hardcoded colors like text-gray-500, bg-gray-100, border-gray-200; ALWAYS use design system semantic colors: text-foreground for primary text, text-muted-foreground for secondary text and icons, bg-background for main backgrounds, bg-muted for subtle backgrounds and hover states, border-border for borders
ALWAYS use the Table component instead of custom flex layouts; use Table, TableHeader, TableBody, TableRow, TableHead, TableCell components with proper column widths using rem units and semantic colors

Files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

**/app/**/*.tsx

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

In Next.js App Router, use page.tsx as server components by default; use 'use client' only for interactions and state

Files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx
services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

🧠 Learnings (18)

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/app/api/**/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Import the Vercel AI SDK directly (e.g., `import { generateObject } from 'ai'; import { openrouter } from 'openrouter/ai-sdk-provider';`)

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-08-21T14:59:56.034Z

Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Always use the Vercel AI SDK to access large language models

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-08-21T14:59:56.034Z

Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Import the Vercel AI SDK directly (e.g., `import { generateObject } from 'ai'; import { openrouter } from 'openrouter/ai-sdk-provider';`)

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-08-21T14:59:56.034Z

Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Do not use the direct OpenAI client without the Vercel AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-08-21T14:59:56.034Z

Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-08-21T14:59:56.034Z
Learning: Applies to {**/actions/*.ts,**/app/api/**/*.ts,**/lib/**/*.ts} : Avoid custom AI wrappers that do not add meaningful functionality beyond the Vercel AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Always use the Vercel AI SDK to access large language models

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/lib/**/*.ts : For text generation, use the `generateObject` or similar functions from the AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-07-20T08:40:24.693Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/ai.mdc:0-0
Timestamp: 2025-07-20T08:40:24.693Z
Learning: Applies to **/actions/*.ts : Never use direct OpenAI client without the Vercel AI SDK

Applied to files:

services/platform/convex/model/chat_agent/generate_agent_response.ts

📚 Learning: 2025-11-30T03:53:00.316Z

Learnt from: CR
Repo: tale-project/poc2 PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-11-30T03:53:00.316Z
Learning: Applies to convex/**/*.ts : Use one file per Convex function under the `convex/` directory, with file name and exported function name identical (snake_case for file names, camelCase for function names)

Applied to files:

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts

📚 Learning: 2025-12-02T08:13:24.290Z

Learnt from: CR
Repo: tale-project/tale PR: 0
File: .cursor/rules/convex_rules.mdc:0-0
Timestamp: 2025-12-02T08:13:24.290Z
Learning: Applies to convex/**/*.{ts,tsx} : When touching existing modules that export multiple functions, split them into separate files so each file exports exactly one function matching the file name, unless the module is an explicitly designated thin wrapper file.

Applied to files:

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts

📚 Learning: 2025-07-03T08:43:49.346Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/next-best-practice.mdc:0-0
Timestamp: 2025-07-03T08:43:49.346Z
Learning: Applies to **/*.{tsx,jsx} : Use useCallback for stable function references in components

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

📚 Learning: 2025-07-19T15:29:09.401Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursorrules:0-0
Timestamp: 2025-07-19T15:29:09.401Z
Learning: Applies to app/components/**/*.{js,jsx,ts,tsx} : Use useCallback for stable function references in React components.

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

📚 Learning: 2025-07-19T15:30:00.886Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/core-rules.mdc:0-0
Timestamp: 2025-07-19T15:30:00.886Z
Learning: Applies to **/*.{tsx,jsx} : Follow the required React component structure: 'use client' (when needed), imports, Props interface, function definition, hooks, effects, event handlers, and render in order

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

📚 Learning: 2025-07-19T15:30:00.886Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/core-rules.mdc:0-0
Timestamp: 2025-07-19T15:30:00.886Z
Learning: Applies to **/*.{tsx,jsx} : Use memoization (e.g., useMemo) for expensive operations and cache repeated calculations (e.g., useRef with Map) in React components

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

📚 Learning: 2025-07-03T08:43:49.346Z

Learnt from: CR
Repo: talecorp/poc PR: 0
File: .cursor/rules/next-best-practice.mdc:0-0
Timestamp: 2025-07-03T08:43:49.346Z
Learning: Applies to **/*.{tsx,jsx} : Consistent component structure: hooks, effects, event handlers, render

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

📚 Learning: 2025-08-21T14:59:36.874Z

Learnt from: CR
Repo: talecorp/lanserhof PR: 0
File: .cursorrules:0-0
Timestamp: 2025-08-21T14:59:36.874Z
Learning: Applies to **/*.{tsx,jsx} : Use React.memo for expensive components

Applied to files:

services/platform/app/(app)/dashboard/[id]/chat/components/chat-input.tsx

🧬 Code graph analysis (7)

services/platform/convex/model/chat_agent/generate_agent_response.ts (1)

services/platform/convex/model/chat_agent/chat_with_agent.ts (1)

FileAttachment (26-31)

services/platform/convex/model/chat_agent/chat_with_agent.ts (1)

services/platform/convex/lib/debug_log.ts (1)

createDebugLog (27-39)

services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (1)

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)

ParseFileResult (11-25)

parseFile (34-102)

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)

services/platform/convex/lib/debug_log.ts (1)

createDebugLog (27-39)

services/platform/convex/agent_tools/convex_tools/crawler/helpers/get_crawler_service_url.ts (1)

getCrawlerServiceUrl (7-15)

services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (2)

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (2)

ParseFileResult (11-25)

parseFile (34-102)

services/platform/convex/model/documents/generate_docx.ts (1)

GenerateDocxResult (43-50)

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (2)

services/platform/convex/chat_agent.ts (1)

chatWithAgent (24-48)

services/platform/convex/model/chat_agent/chat_with_agent.ts (1)

chatWithAgent (46-181)

services/crawler/app/main.py (4)

services/crawler/app/models.py (1)

ParseFileResponse (283-294)

services/crawler/app/converter_service.py (1)

get_converter_service (356-361)

services/crawler/app/template_service.py (1)

get_template_service (117-122)

services/crawler/app/file_parser_service.py (2)

FileParserService (17-147)

parse_file (131-147)

🪛 Ruff (0.14.8)

services/crawler/app/file_parser_service.py

12-12: typing.Dict is deprecated, use dict instead

(UP035)

50-50: Do not catch blind exception: Exception

(BLE001)

51-51: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

51-51: Logging statement uses f-string

(G004)

86-86: Do not catch blind exception: Exception

(BLE001)

87-87: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

87-87: Logging statement uses f-string

(G004)

127-127: Do not catch blind exception: Exception

(BLE001)

128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

128-128: Logging statement uses f-string

(G004)

services/crawler/app/main.py

908-908: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

926-929: Abstract raise to an inner function

(TRY301)

941-941: Do not catch blind exception: Exception

(BLE001)

946-946: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build Docker Image

🔇 Additional comments (8)

services/crawler/app/models.py (1)

280-294: LGTM!

The ParseFileResponse model is well-structured, follows the existing patterns in the file, and aligns with the ParseFileResult interface in the TypeScript layer. The field descriptions are clear and appropriate.

services/crawler/app/main.py (1)

48-57: LGTM! Consistent with existing service patterns.

The global service singleton pattern matches the existing get_converter_service() and get_template_service() implementations. The FileParserService doesn't require async initialization like the crawler or converter services, so no lifespan changes are needed.

services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts (1)

281-292: LGTM! Parse operation implementation is consistent.

The parse operation follows the established patterns of other tools and correctly validates required inputs. The handler delegates to the shared parseFile helper, maintaining DRY principles.

services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts (1)

11-25: LGTM! Interface aligns with backend model.

The ParseFileResult interface correctly mirrors the ParseFileResponse Pydantic model from the crawler service, ensuring type consistency across the stack.

services/platform/convex/agent_tools/convex_tools/files/docx_tool.ts (2)

223-234: LGTM! Consistent with other file tools.

The parse operation implementation follows the same pattern as pptx_tool.ts and pdf_tool.ts, maintaining consistency across all file parsing tools. The input validation and error handling are appropriate.

41-43: Type definitions are well-structured.

The ParseDocxResult type correctly extends ParseFileResult with the operation discriminator, and the DocxResult union properly includes all three operation result types.

services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts (1)

146-182: Generate path looks solid; just remove the unsafe cast and consider logging a stable error code
Flow, debug logging, and action invocation look consistent. Main concern is the as GeneratePdfResult cast (see earlier).

services/platform/app/(app)/dashboard/[id]/chat/components/chat-interface.tsx (1)

294-307: Attachment mapping looks correct (strips UI-only fields).

coderabbitai · 2025-12-12T07:04:21Z

+from typing import Dict, Any
+
+logger = logging.getLogger(__name__)
+


⚠️ Potential issue | 🟡 Minor

Switch Dict[...] to dict[...] (and standardize response typing)
Ruff is right: prefer built-in generics for py3.9+.

-from typing import Dict, Any +from typing import Any ... - def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]: + def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> dict[str, Any]: ... - def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]: + def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> dict[str, Any]: ... - def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]: + def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> dict[str, Any]: ... - def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]: + def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> dict[str, Any]:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.14.8)

12-12: typing.Dict is deprecated, use dict instead

(UP035)

🤖 Prompt for AI Agents

In services/crawler/app/file_parser_service.py around lines 12 to 15, replace the use of typing.Dict[...] with the built-in generic dict[...] (Python 3.9+) and standardize the function/response type annotations to use dict[str, Any] (or the appropriate key type) throughout the file; remove the unused Dict import, keep Any imported from typing (or add it if missing), and update all signatures and return annotations so they consistently use dict[...] instead of Dict[...].

coderabbitai · 2025-12-12T07:04:21Z

+    def parse_pdf(self, file_bytes: bytes, filename: str = "document.pdf") -> Dict[str, Any]:
+        """Extract text content from a PDF file."""
+        import fitz  # PyMuPDF
+
+        try:
+            doc = fitz.open(stream=file_bytes, filetype="pdf")
+            pages = []
+            full_text = []
+
+            for page_num, page in enumerate(doc, start=1):
+                text = page.get_text("text")
+                pages.append({"page_number": page_num, "text": text.strip()})
+                full_text.append(text)
+
+            metadata = doc.metadata or {}
+            doc.close()
+
+            return {
+                "success": True,
+                "filename": filename,
+                "file_type": "application/pdf",
+                "page_count": len(pages),
+                "pages": pages,
+                "full_text": "\n\n".join(full_text).strip(),
+                "metadata": {
+                    "title": metadata.get("title", ""),
+                    "author": metadata.get("author", ""),
+                    "subject": metadata.get("subject", ""),
+                },
+            }
+        except Exception as e:
+            logger.error(f"Error parsing PDF: {e}")
+            return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)}
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -path "*crawler*" -name "*.py" | grep -i parser | head -10

Repository: tale-project/tale

Length of output: 178

🏁 Script executed:

find . -type f -name "file_parser_service.py"

Repository: tale-project/tale

Length of output: 106

🏁 Script executed:

git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 104

🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | head -100

Repository: tale-project/tale

Length of output: 4484

🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py | sed -n '54-110p'

Repository: tale-project/tale

Length of output: 112

🏁 Script executed:

tail -60 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2992

🏁 Script executed:

find . -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | head -3 | xargs grep -l "python" 2>/dev/null | head -1

Repository: tale-project/tale

Length of output: 147

Ensure PDF resources close on all paths and use logger.exception() instead of f-string logging

The doc.close() call at line 35 is not guaranteed to run if an exception occurs during page iteration or metadata access (lines 29–34), causing a resource leak. Additionally, using logger.error(f"...") loses the exception stack trace; logger.exception() should be used instead to preserve debugging information.

try: - doc = fitz.open(stream=file_bytes, filetype="pdf") - pages = [] - full_text = [] + doc = fitz.open(stream=file_bytes, filetype="pdf") + try: + pages: list[dict[str, Any]] = [] + full_text: list[str] = [] - for page_num, page in enumerate(doc, start=1): - text = page.get_text("text") - pages.append({"page_number": page_num, "text": text.strip()}) - full_text.append(text) + for page_num, page in enumerate(doc, start=1): + text = page.get_text("text") + pages.append({"page_number": page_num, "text": text.strip()}) + full_text.append(text) - metadata = doc.metadata or {} - doc.close() + metadata = doc.metadata or {} + finally: + doc.close() return { "success": True, "filename": filename, "file_type": "application/pdf", "page_count": len(pages), "pages": pages, "full_text": "\n\n".join(full_text).strip(), "metadata": { "title": metadata.get("title", ""), "author": metadata.get("author", ""), "subject": metadata.get("subject", ""), }, } - except Exception as e: - logger.error(f"Error parsing PDF: {e}") - return {"success": False, "filename": filename, "file_type": "application/pdf", "error": str(e)} + except Exception as e: + logger.exception("Error parsing PDF") + return { + "success": False, + "filename": filename, + "file_type": "application/pdf", + "error": str(e), + }

🧰 Tools

🪛 Ruff (0.14.8)

50-50: Do not catch blind exception: Exception

(BLE001)

51-51: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

51-51: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents

In services/crawler/app/file_parser_service.py around lines 20 to 53, the PDF document is opened without guaranteeing it gets closed on all code paths and the exception is logged with logger.error which drops the traceback; update the code to open the PyMuPDF document inside a try/finally or use a context manager so doc.close() always runs (even if page iteration or metadata access raises), and replace logger.error(...) with logger.exception(...) in the except block so the full stack trace is preserved.

coderabbitai · 2025-12-12T07:04:22Z

+    def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:
+        """Extract text content from a DOCX file."""
+        from docx import Document
+
+        try:
+            doc = Document(BytesIO(file_bytes))
+            paragraphs = []
+            for para in doc.paragraphs:
+                text = para.text.strip()
+                if text:
+                    paragraphs.append({"text": text, "style": para.style.name if para.style else None})
+
+            tables = []
+            for table in doc.tables:
+                table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
+                if table_data:
+                    tables.append(table_data)
+
+            full_text = "\n".join(p["text"] for p in paragraphs)
+            core_props = doc.core_properties
+
+            return {
+                "success": True,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                "paragraph_count": len(paragraphs),
+                "table_count": len(tables),
+                "paragraphs": paragraphs,
+                "tables": tables,
+                "full_text": full_text,
+                "metadata": {"title": core_props.title or "", "author": core_props.author or ""},
+            }
+        except Exception as e:
+            logger.error(f"Error parsing DOCX: {e}")
+            return {"success": False, "filename": filename, "error": str(e)}
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106

🏁 Script executed:

git ls-files | grep -i "file_parser" | head -20

Repository: tale-project/tale

Length of output: 104

🏁 Script executed:

find . -path "*/crawler/app/*" -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 445

🏁 Script executed:

git ls-files "**/*file_parser*" || rg -l "file_parser_service" -t py || find . -name "*file_parser*" -type f

Repository: tale-project/tale

Length of output: 104

🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "def parse_docx" 2>/dev/null | head -5

Repository: tale-project/tale

Length of output: 106

🏁 Script executed:

rg "def parse_docx" -A 20 -t py

Repository: tale-project/tale

Length of output: 1849

🏁 Script executed:

rg "def parse_pdf" -A 20 -t py

Repository: tale-project/tale

Length of output: 1754

🏁 Script executed:

rg "parse_docx\|parse_pdf" -t py | head -30

Repository: tale-project/tale

Length of output: 43

🏁 Script executed:

find . -type d -name "crawler" 2>/dev/null

Repository: tale-project/tale

Length of output: 191

🏁 Script executed:

ls -la

Repository: tale-project/tale

Length of output: 1145

🏁 Script executed:

find . -type f -name "*.py" | wc -l && find . -type f -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 781

🏁 Script executed:

rg "application/vnd.openxmlformats-officedocument.wordprocessingml" -t py

Repository: tale-project/tale

Length of output: 208

🏁 Script executed:

git log --oneline -5 2>/dev/null || echo "Not a git repo"

Repository: tale-project/tale

Length of output: 139

🏁 Script executed:

rg "Extract text content from a DOCX file" -t py

Repository: tale-project/tale

Length of output: 157

🏁 Script executed:

rg "logger.exception\|logger.error" -t py -A 2 | head -40

Repository: tale-project/tale

Length of output: 43

🏁 Script executed:

rg "success.*False" -t py | head -20

Repository: tale-project/tale

Length of output: 1029

🌐 Web query:

Python logger.exception vs logger.error exception trace stacktrace

💡 Result:

Short answer

logger.exception(msg) == logger.error(msg, exc_info=True). It logs at ERROR level and automatically adds the current exception traceback from sys.exc_info(). [1][2]

Key differences / gotchas

logger.error(...) by itself does NOT include a traceback — pass exc_info=True (or exc_info=sys.exc_info() or an exc tuple) to include it. [1][2]

Use logger.exception only inside an except block (or pass an explicit exc_info) — otherwise sys.exc_info() is empty and you get no traceback (or None info). [3]

Minimal examples

inside except:
try: ...
except Exception:
logger.exception("failed") # same as logger.error("failed", exc_info=True) [1][2]

logging a saved exception later:
logger.error("failed", exc_info=my_exc_info) # where my_exc_info is sys.exc_info() or (type, value, tb). [3]

Sources
[1] docs / blog: explanation that logger.exception is error + exc_info=True.
[2] Loggly / tutorials: exc_info=True includes full stack trace; logger.exception defaults to ERROR.
[3] Stack Overflow: logger.exception relies on sys.exc_info() (must be in except or supply exc_info).

🌐 Web query:

Python API response consistency error handling payload fields

💡 Result:

Here are concise, practical guidelines for consistent API responses, error handling, and payload fields in Python APIs — plus a minimal example.

Key rules

Use proper HTTP status codes (4xx for client, 5xx for server). Avoid always returning 200 for errors. [1][2]

Adopt a standard machine-readable error format (IETF Problem Details — RFC 9457 / RFC 7807). Include type, title, status, detail, instance; extend with application-specific fields. [1][3]

Provide a consistent success envelope (data + meta) so clients can parse uniformly.

Include a stable error code (machine-readable), human message, and optional field-level errors for validation failures.

Always return a correlation/trace id in responses and logs to link client errors to server logs.

Don’t leak internal stack traces or sensitive data; log full details server-side and return safe summary messages to clients.

Version your API and include response version in metadata to enable non‑breaking evolution.

For transient upstream failures use 5xx and include Retry-After when appropriate; for rate limits use 429 plus Retry-After. [2][4]

Minimal JSON patterns

Success:
{
"data": { ... },
"meta": { "request_id":"...", "version":"v1", "timestamp":"2025-12-12T12:34:56Z" }
}

Error (RFC-style + extras):
{
"type": "https://example.com/problems/validation-error",
"title": "Validation error",
"status": 400,
"detail": "One or more fields failed validation",
"instance": "/orders/1234",
"code": "INVALID_INPUT", // app-specific code
"errors": { // field-level details
"email": ["missing", "invalid format"],
"quantity": ["must be >= 1"]
},
"request_id": "abc123" // correlation id
}

Python implementation notes

Centralize error handling (middleware/exception handler) that:

Maps exceptions to HTTP statuses and application error codes.

Formats all responses using the agreed envelope (success or RFC-style error).

Attaches/generates request_id and logs it with full stack traces.

Framework examples:

FastAPI: use exception handlers and Pydantic validation to populate field errors, then return Problem Details JSON.

Flask/Django: use error handlers/middleware and serializers to enforce the response shape.

Validate inputs early (schema libs: Pydantic, Marshmallow) and translate validation errors into the errors array field. [2][5]

Mapping examples (brief)

Validation error -> 400 + code INVALID_INPUT + errors (field list).

Auth failure -> 401 + code UNAUTHORIZED.

Permission -> 403 + code FORBIDDEN.

Not found -> 404 + code NOT_FOUND.

Conflict (duplicate) -> 409 + code CONFLICT.

Upstream timeout -> 504 + code UPSTREAM_TIMEOUT + Retry-After when possible.

References

RFC 9457 — Problem Details for HTTP APIs (IETF). [1]

Best practices / error handling guides (Treblle, Zuplo) — practical tips on status codes, not exposing internals, Retry-After, etc. [2][4]

RFC 7807 — original Problem Details specification (still referenced widely). [3]

If you want, I can:

produce a ready-to-drop-in FastAPI or Flask middleware/exception-handler implementing this pattern, or

generate a JSON Schema / OpenAPI example for your API. Which do you prefer?

🏁 Script executed:

cat -n services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240

🏁 Script executed:

rg "def parse_docx" -B 2 -A 40 services/crawler/app/file_parser_service.py | head -50

Repository: tale-project/tale

Length of output: 1922

🏁 Script executed:

sed -n '54,90p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1707

🏁 Script executed:

rg "def parse_pdf" -B 2 -A 45 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 1973

🏁 Script executed:

grep -n "except Exception" services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 163

🏁 Script executed:

sed -n '96,145p' services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 2534

🏁 Script executed:

rg "except Exception as e:" services/crawler/app/file_parser_service.py -A 3

Repository: tale-project/tale

Length of output: 583

🏁 Script executed:

rg "parse_docx\|parse_pdf" services/crawler/app/main.py -B 3 -A 3

Repository: tale-project/tale

Length of output: 43

🏁 Script executed:

head -20 services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 569

Standardize DOCX failure response to include file_type + capture exception trace in logs

parse_docx success includes file_type, but failure omits it—unlike parse_pdf, which includes it consistently. This breaks callers expecting a stable response structure. Additionally, logger.error() without traceback loses debugging context; logger.exception() automatically captures the full stack.

- except Exception as e: - logger.error(f"Error parsing DOCX: {e}") - return {"success": False, "filename": filename, "error": str(e)} + except Exception as e: + logger.exception("Error parsing DOCX") + return { + "success": False, + "filename": filename, + "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", + "error": str(e), + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:

"""Extract text content from a DOCX file."""

from docx import Document

try:

doc = Document(BytesIO(file_bytes))

paragraphs = []

for para in doc.paragraphs:

text = para.text.strip()

if text:

paragraphs.append({"text": text, "style": para.style.name if para.style else None})

tables = []

for table in doc.tables:

table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]

if table_data:

tables.append(table_data)

full_text = "\n".join(p["text"] for p in paragraphs)

core_props = doc.core_properties

return {

"success": True,

"filename": filename,

"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",

"paragraph_count": len(paragraphs),

"table_count": len(tables),

"paragraphs": paragraphs,

"tables": tables,

"full_text": full_text,

"metadata": {"title": core_props.title or "", "author": core_props.author or ""},

}

except Exception as e:

logger.error(f"Error parsing DOCX: {e}")

return {"success": False, "filename": filename, "error": str(e)}

def parse_docx(self, file_bytes: bytes, filename: str = "document.docx") -> Dict[str, Any]:

"""Extract text content from a DOCX file."""

from docx import Document

try:

doc = Document(BytesIO(file_bytes))

paragraphs = []

for para in doc.paragraphs:

text = para.text.strip()

if text:

paragraphs.append({"text": text, "style": para.style.name if para.style else None})

tables = []

for table in doc.tables:

table_data = [[cell.text.strip() for cell in row.cells] for row in table.rows]

if table_data:

tables.append(table_data)

full_text = "\n".join(p["text"] for p in paragraphs)

core_props = doc.core_properties

return {

"success": True,

"filename": filename,

"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",

"paragraph_count": len(paragraphs),

"table_count": len(tables),

"paragraphs": paragraphs,

"tables": tables,

"full_text": full_text,

"metadata": {"title": core_props.title or "", "author": core_props.author or ""},

}

except Exception as e:

logger.exception("Error parsing DOCX")

return {

"success": False,

"filename": filename,

"file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",

"error": str(e),

}

🧰 Tools

🪛 Ruff (0.14.8)

86-86: Do not catch blind exception: Exception

(BLE001)

87-87: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

87-87: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents

In services/crawler/app/file_parser_service.py around lines 54 to 89, the DOCX failure path omits the file_type field and logs only the error message; update the except block to (1) use logger.exception(...) to capture the full traceback and (2) return a failure dict that includes the same "file_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" key as the success response (alongside success: False, filename and error), so callers receive a stable response shape.

coderabbitai · 2025-12-12T07:04:22Z

+    def parse_pptx(self, file_bytes: bytes, filename: str = "presentation.pptx") -> Dict[str, Any]:
+        """Extract text content from a PPTX file."""
+        from pptx import Presentation
+
+        try:
+            prs = Presentation(BytesIO(file_bytes))
+            slides = []
+            full_text_parts = []
+
+            for slide_num, slide in enumerate(prs.slides, start=1):
+                slide_text = []
+                for shape in slide.shapes:
+                    if shape.has_text_frame:
+                        for paragraph in shape.text_frame.paragraphs:
+                            text = paragraph.text.strip()
+                            if text:
+                                slide_text.append(text)
+                    if shape.has_table:
+                        for row in shape.table.rows:
+                            for cell in row.cells:
+                                text = cell.text.strip()
+                                if text:
+                                    slide_text.append(text)
+
+                slides.append({"slide_number": slide_num, "text_content": slide_text, "full_text": "\n".join(slide_text)})
+                full_text_parts.extend(slide_text)
+
+            core_props = prs.core_properties
+            return {
+                "success": True,
+                "filename": filename,
+                "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                "slide_count": len(slides),
+                "slides": slides,
+                "full_text": "\n\n".join(full_text_parts),
+                "metadata": {"title": core_props.title or "", "author": core_props.author or ""},
+            }
+        except Exception as e:
+            logger.error(f"Error parsing PPTX: {e}")
+            return {"success": False, "filename": filename, "error": str(e)}
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "file_parser_service.py" -type f 2>/dev/null

Repository: tale-project/tale

Length of output: 43

🏁 Script executed:

cd /tmp && git ls-files | grep -i "file_parser"

Repository: tale-project/tale

Length of output: 129

🏁 Script executed:

cd /tmp && git ls-files | head -20

Repository: tale-project/tale

Length of output: 129

🏁 Script executed:

find . -name "file_parser_service.py" -type f

Repository: tale-project/tale

Length of output: 106

🏁 Script executed:

find . -type f -name "*.py" | grep -i parser | head -20

Repository: tale-project/tale

Length of output: 106

🏁 Script executed:

wc -l ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 110

🏁 Script executed:

cat -n ./services/crawler/app/file_parser_service.py

Repository: tale-project/tale

Length of output: 7240

Standardize PPTX failure payload shape (file_type missing) + use logger.exception() for traceback

PPTX failure response omits file_type while the success response includes it (line 121). DOCX has the same inconsistency (line 88 failure vs line 78 success), while PDF correctly includes it in both success and failure responses (line 52). Additionally, use logger.exception() to preserve the full exception traceback in logs.

except Exception as e: - logger.error(f"Error parsing PPTX: {e}") - return {"success": False, "filename": filename, "error": str(e)} + logger.exception("Error parsing PPTX") + return { + "success": False, + "filename": filename, + "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation", + "error": str(e), + }

🧰 Tools

🪛 Ruff (0.14.8)

127-127: Do not catch blind exception: Exception

(BLE001)

128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

128-128: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents

In services/crawler/app/file_parser_service.py around lines 90 to 130, the PPTX error path logs with logger.error and returns a payload missing the file_type key while the success path includes it; update the except block to call logger.exception("Error parsing PPTX") to record the full traceback and return a failure dict that mirrors the success shape by including "file_type": "application/vnd.openxmlformats-officedocument.presentationml.presentation" along with success False, filename and error; also make the analogous change for the DOCX handler so both formats have consistent failure payloads and tracebacks.

coderabbitai · 2025-12-12T07:04:22Z

+    def parse_file(self, file_bytes: bytes, filename: str, content_type: str = "") -> Dict[str, Any]:
+        """Parse a file based on its content type or filename extension."""
+        filename_lower = filename.lower()
+        content_type_lower = content_type.lower() if content_type else ""
+
+        if filename_lower.endswith(".pdf") or "pdf" in content_type_lower:
+            return self.parse_pdf(file_bytes, filename)
+        elif filename_lower.endswith(".docx") or "wordprocessingml" in content_type_lower:
+            return self.parse_docx(file_bytes, filename)
+        elif filename_lower.endswith(".pptx") or "presentationml" in content_type_lower:
+            return self.parse_pptx(file_bytes, filename)
+        else:
+            return {
+                "success": False,
+                "filename": filename,
+                "error": f"Unsupported file type: {filename} ({content_type}). Supported: PDF, DOCX, PPTX.",
+            }


🧹 Nitpick | 🔵 Trivial

Add guardrails for untrusted uploads (size limits / zip-bomb risk)
DOCX/PPTX are ZIP-based; parsing arbitrary bytes without size/complexity limits can be an easy DoS vector. Consider enforcing a max byte size (and ideally request timeouts) at the API boundary or here before dispatch.

🤖 Prompt for AI Agents

In services/crawler/app/file_parser_service.py around lines 131 to 147, the parser dispatches DOCX/PPTX (zip-based) and other uploads without any size or zip-complexity guards; add an early defensive check before dispatch: reject files above a configured MAX_FILE_BYTES with a clear error return, and for zip-based formats (docx/pptx) additionally inspect the bytes as a zip (without extracting) to enforce limits on number of entries (MAX_ZIP_ENTRIES) and total uncompressed size (MAX_UNCOMPRESSED_BYTES); if any limit is exceeded return the same error structure with a descriptive message and do not call parse_docx/parse_pptx. Ensure the limits are configurable constants and applied before any parsing to mitigate DoS/zip-bomb risk.

coderabbitai · 2025-12-12T07:04:22Z

+    // Download the file from the URL
+    const fileResponse = await fetch(url);
+    if (!fileResponse.ok) {
+      throw new Error(`Failed to download file: ${fileResponse.status} ${fileResponse.statusText}`);
+    }
+
+    const fileBlob = await fileResponse.blob();


⚠️ Potential issue | 🟡 Minor

Add timeout for file download to prevent hanging requests.

The initial fetch(url) call has no timeout, which could cause the function to hang indefinitely if the source server is slow or unresponsive. The crawler service call has a 120-second timeout, but the download doesn't.

+ // Download the file from the URL with timeout + const downloadController = new AbortController(); + const downloadTimeoutId = setTimeout(() => downloadController.abort(), 60000); + - const fileResponse = await fetch(url); + const fileResponse = await fetch(url, { signal: downloadController.signal }); + clearTimeout(downloadTimeoutId); + if (!fileResponse.ok) { throw new Error(`Failed to download file: ${fileResponse.status} ${fileResponse.statusText}`); }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts around lines 48 to 54, the fetch(url) call has no timeout and can hang; add an AbortController, pass its signal to fetch, start a timeout (match the crawler's 120000 ms or a chosen value), and on timeout call controller.abort() to cancel the request; ensure you clear the timeout timer after fetch completes and catch the abort error to throw a clear timeout-specific Error so downloads don’t hang indefinitely.

coderabbitai · 2025-12-12T07:04:22Z

+    const fileBlob = await fileResponse.blob();
+
+    // Create FormData and upload to crawler service
+    const formData = new FormData();
+    formData.append('file', fileBlob, filename);


🧹 Nitpick | 🔵 Trivial

Consider adding file size validation to prevent resource exhaustion.

Large files could exhaust memory or cause timeouts. Consider adding a size check before uploading to the crawler service.

const fileBlob = await fileResponse.blob(); + // Validate file size (e.g., 50MB limit) + const MAX_FILE_SIZE = 50 * 1024 * 1024; + if (fileBlob.size > MAX_FILE_SIZE) { + throw new Error(`File too large: ${fileBlob.size} bytes exceeds ${MAX_FILE_SIZE} byte limit`); + } + // Create FormData and upload to crawler service const formData = new FormData(); formData.append('file', fileBlob, filename);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const fileBlob = await fileResponse.blob();

// Create FormData and upload to crawler service

const formData = new FormData();

formData.append('file', fileBlob, filename);

const fileBlob = await fileResponse.blob();

// Validate file size (e.g., 50MB limit)

const MAX_FILE_SIZE = 50 * 1024 * 1024;

if (fileBlob.size > MAX_FILE_SIZE) {

throw new Error(`File too large: ${fileBlob.size} bytes exceeds ${MAX_FILE_SIZE} byte limit`);

}

// Create FormData and upload to crawler service

const formData = new FormData();

formData.append('file', fileBlob, filename);

coderabbitai · 2025-12-12T07:04:22Z

    handler: async (ctx: ToolCtx, args): Promise<PdfResult> => {
-      debugLog('tool:pdf start', {
+      const operation = args.operation ?? 'generate';
+
+      // Handle parse operation
+      if (operation === 'parse') {
+        if (!args.url) {
+          throw new Error("Missing required 'url' for parse operation");
+        }
+        if (!args.filename) {
+          throw new Error("Missing required 'filename' for parse operation");
+        }
+
+        const result = await parseFile(args.url, args.filename, 'pdf');
+        return { operation: 'parse', ...result };
+      }
+


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file mentioned in the review find . -type f -name "pdf_tool.ts" | head -20

Repository: tale-project/tale

Length of output: 130

🏁 Script executed:

# Get the file size to determine how to read it wc -l services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts 2>/dev/null || echo "File not found in expected location"

Repository: tale-project/tale

Length of output: 132

🏁 Script executed:

# Search for the pdf_tool.ts file more broadly fd "pdf_tool.ts" --type f

Repository: tale-project/tale

Length of output: 128

🏁 Script executed:

# Search for parseFile function definition rg "parseFile" --type ts --type tsx -A 5 -B 2

Repository: tale-project/tale

Length of output: 88

🏁 Script executed:

# Check for URL validation patterns in the codebase rg "allowlist|allowList|SSRF|validateUrl|URL.*valid" --type ts --type tsx

Repository: tale-project/tale

Length of output: 88

🏁 Script executed:

# Read the pdf_tool.ts file cat -n services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts

Repository: tale-project/tale

Length of output: 7551

🏁 Script executed:

# Search for parseFile with simpler pattern rg "parseFile" -A 5 -B 2

Repository: tale-project/tale

Length of output: 12733

🏁 Script executed:

# Search for URL validation patterns more broadly rg "validate|allowlist|allowList" -A 2 -B 2 | head -100

Repository: tale-project/tale

Length of output: 8635

🏁 Script executed:

# Read the parseFile helper cat -n services/platform/convex/agent_tools/convex_tools/files/helpers/parse_file.ts

Repository: tale-project/tale

Length of output: 3594

🏁 Script executed:

# Check if there's any URL validation or allowlist patterns used elsewhere rg "storage.convex|allowlist|CONVEX.*DOMAIN|URL.*validation" --type ts -B 2 -A 2

Repository: tale-project/tale

Length of output: 2566

🏁 Script executed:

# Check the documentation comment for the parse operation to understand URL expectations grep -A 10 "parse - Extract" services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts

Repository: tale-project/tale

Length of output: 623

Add try/catch and operation-scoped logging to parse operation; enforce URL domain allowlist before fetch

The parse operation invokes parseFile() without wrapping it in try/catch or operation-scoped logging, leaving failures untagged at the tool level. More critically, parseFile accepts any caller-provided URL and fetches it directly (line 49 in parse_file.ts) without validation—a classic SSRF vector. While the tool description mentions Convex storage, the parameter accepts "accessible HTTP URL", which is too permissive. Add an allowlist (e.g., Convex storage domain) before the fetch, and wrap the parseFile call with try/catch and debugLog calls matching the pattern used in the generate operation.

🤖 Prompt for AI Agents

In services/platform/convex/agent_tools/convex_tools/files/pdf_tool.ts around lines 119 to 134, the parse branch needs an allowlist check and operation-scoped logging with error handling: before calling parseFile, validate args.url against a small allowlist of permitted hosts (e.g., your Convex storage domain(s)) and throw a clear Error if the URL host is not allowed; wrap the parseFile call in try/catch and surround it with debugLog calls similar to the generate operation (e.g., debugLog('pdf:parse:start', { url, filename }), debugLog('pdf:parse:success', result) on success, and debugLog('pdf:parse:error', { error })) and rethrow or throw a new Error after logging so failures are surfaced consistently.

coderabbitai · 2025-12-12T07:04:22Z

+  // For parse operation
+  url: z
+    .string()
+    .optional()
+    .describe("For 'parse': URL of the PPTX file to parse"),
+  filename: z
+    .string()
+    .optional()
+    .describe("For 'parse': Original filename (e.g., 'presentation.pptx')"),


🧹 Nitpick | 🔵 Trivial

Consider adding file extension validation hint.

The url and filename parameters for the parse operation are correctly defined. However, consider adding a .describe() hint that clarifies the expected file type to help the AI agent make better decisions.

url: z .string() .optional() - .describe("For 'parse': URL of the PPTX file to parse"), + .describe("For 'parse': URL of the PPTX file to parse. Must be a .pptx file."), filename: z .string() .optional() - .describe("For 'parse': Original filename (e.g., 'presentation.pptx')"), + .describe("For 'parse': Original filename. Must end with .pptx extension."),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// For parse operation

url: z

.string()

.optional()

.describe("For 'parse': URL of the PPTX file to parse"),

filename: z

.string()

.optional()

.describe("For 'parse': Original filename (e.g., 'presentation.pptx')"),

// For parse operation

url: z

.string()

.optional()

.describe("For 'parse': URL of the PPTX file to parse. Must be a .pptx file."),

filename: z

.string()

.optional()

.describe("For 'parse': Original filename. Must end with .pptx extension."),

🤖 Prompt for AI Agents

In services/platform/convex/agent_tools/convex_tools/files/pptx_tool.ts around lines 82 to 90, the .describe() texts for the url and filename should explicitly state the expected .pptx file type and hint about extension validation; update the descriptions to mention "must be a .pptx file (e.g., presentation.pptx)" and, if feasible, add a simple validation (e.g., require the filename or URL to end with .pptx) so the agent and runtime can detect incorrect file types early.

coderabbitai · 2025-12-12T07:04:22Z

CVE-2025-26791

+  // Build message content with markdown references for all attachments
+  let messageContent = trimmedMessage;
+  if (hasAttachments) {
+    const imageMarkdowns: string[] = [];
+    const fileMarkdowns: string[] = [];
+
+    for (const attachment of attachments) {
+      const url = await ctx.storage.getUrl(attachment.fileId);
+      if (!url) continue;
+
+      if (attachment.fileType.startsWith('image/')) {
+        // Images: Use markdown image syntax for inline display
+        imageMarkdowns.push(`![${attachment.fileName}](${url})`);
+      } else {
+        // Other files: Use markdown link with file info
+        const sizeKB = Math.round(attachment.fileSize / 1024);
+        const sizeDisplay = sizeKB >= 1024
+          ? `${(sizeKB / 1024).toFixed(1)} MB`
+          : `${sizeKB} KB`;
+        fileMarkdowns.push(`📎 [${attachment.fileName}](${url}) (${attachment.fileType}, ${sizeDisplay})`);
+      }
+    }
+
+    // Build the attachment section
+    const attachmentParts: string[] = [];
+    if (imageMarkdowns.length > 0) {
+      attachmentParts.push(imageMarkdowns.join('\n'));
+    }
+    if (fileMarkdowns.length > 0) {
+      attachmentParts.push(fileMarkdowns.join('\n'));
+    }
+
+    if (attachmentParts.length > 0) {
+      messageContent = `${trimmedMessage}\n\n${attachmentParts.join('\n\n')}`;
+    }
+  }
+


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for markdown rendering patterns rg -n --hidden -S "react-markdown|remark-gfm|rehype-raw|dangerouslySetInnerHTML|sanitize" .

Repository: tale-project/tale

Length of output: 9429

🏁 Script executed:

# Find where messageContent from chat_with_agent is used/rendered rg -n "messageContent|chat_with_agent" --max-count 20

Repository: tale-project/tale

Length of output: 2148

🏁 Script executed:

# Look for the function definition to see full context of message handling grep -n "export.*chat_with_agent\|async.*chat_with_agent" services/platform/convex/model/chat_agent/chat_with_agent.ts

Repository: tale-project/tale

Length of output: 43

🏁 Script executed:

# Search for markdown libraries and renderer setup in the frontend fd -e "package.json" -e "tsconfig.json" | head -5 && rg -l "markdown|unified|remark|rehype" . | head -20

Repository: tale-project/tale

Length of output: 1057

🏁 Script executed:

# Read the sanitize-chat.ts file to understand sanitization cat -n services/platform/lib/utils/sanitize-chat.ts

Repository: tale-project/tale

Length of output: 1885

🏁 Script executed:

# Read message-bubble.tsx to see how content is rendered with react-markdown cat -n services/platform/app/$app$/dashboard/\[id\]/chat/components/message-bubble.tsx | head -100

Repository: tale-project/tale

Length of output: 2829

🏁 Script executed:

# Check the rendering part where sanitizedContent is used sed -n '380,465p' "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx"

Repository: tale-project/tale

Length of output: 2660

🏁 Script executed:

# Check react-markdown version and look for any security plugins or configurations cat services/platform/package.json | grep -A 2 -B 2 "react-markdown"

Repository: tale-project/tale

Length of output: 204

🏁 Script executed:

# Search for any sanitization happening before sanitizeChatMessage is called rg -n "sanitizeChatMessage|chat_with_agent" services/platform/app/$app$/dashboard/\[id\]/chat/components/chat-interface.tsx -A 5

Repository: tale-project/tale

Length of output: 419

🏁 Script executed:

# Check how attachment data flows - is fileName ever escaped before reaching message-bubble? sed -n '77,120p' services/platform/convex/model/chat_agent/chat_with_agent.ts

Repository: tale-project/tale

Length of output: 1619

🏁 Script executed:

# Check if there's URL validation for markdown links or any additional security measures rg -n "javascript:|protocol|href.*validation" services/platform/lib/ services/platform/app/ --max-count 20

Repository: tale-project/tale

Length of output: 1231

🏁 Script executed:

# Check the full context of how messages are stored and retrieved to understand the data flow rg -n "saveMessage|message.content" services/platform/convex/model/chat_agent/chat_with_agent.ts -B 3 -A 3

Repository: tale-project/tale

Length of output: 989

🌐 Web query:

react-markdown 10.1.0 security markdown link injection javascript protocol vulnerability

💡 Result:

Short answer: No known JavaScript-protocol link-injection (javascript:) vulnerability in react-markdown v10.1.0 as of Dec 12, 2025. Sources (react-markdown docs and Snyk) show react-markdown is “secure by default” and list no direct vulnerabilities for v10.1.0; most reported markdown link/URL XSS issues affect other libraries (e.g., markdown-to-jsx, markdown-it). [1][2]

Recommendation: Keep react-markdown updated, do not enable raw HTML unless you sanitize it, and use rehype-sanitize (or your own URL-sanitizer/urlTransform) to block javascript: and other dangerous schemes when rendering untrusted markdown. See the react-markdown security guidance. [1][2]

Sources:
[1] react-markdown (npm) — Security and version info.
[2] Snyk — react-markdown security/scan results.

🏁 Script executed:

# Check what rehypeRaw is used for in the codebase and if it's necessary rg -n "rehypeRaw\|rehype-raw" services/platform/app/$app$/dashboard/\[id\]/chat/components/message-bubble.tsx -B 2 -A 5

Repository: tale-project/tale

Length of output: 43

🌐 Web query:

rehype-raw security risks markdown injection 2024 2025

💡 Result:

Short summary — risk, recent context (2024–2025), and mitigations.

Risk: rehype-raw allows embedded/raw HTML in Markdown to be parsed into the HTML AST; used on untrusted input it can enable XSS/“markdown injection” because arbitrary HTML (including <script>, event handlers, data: URLs, etc.) may be passed through. Use of allowDangerousHtml + rehype-raw is explicitly called out as unsafe in the project docs. [1][4][5]

Recent/context (2024–2025): there are no high-profile CVEs on rehype-raw itself, but related tooling and common Markdown parsers have had XSS advisories in 2024–2025 (examples: reports and CVEs involving markdown rendering in apps using rehype/remark pipelines and a CVE published for markdown-it in 2025), showing the practical risk when raw HTML is allowed and not sanitized. This reinforces that the attack surface is real even if rehype-raw is not itself a packaged “vulnerability.” [5][6][2]

Recommended mitigations

Never run rehype-raw over untrusted user content without sanitization. If you must render untrusted Markdown with embedded HTML, sanitize AFTER rehype-raw (e.g., rehype-sanitize / hast-util-sanitize) — rehype docs and rehype-sanitize explicitly recommend this ordering. [2][1]

Prefer disallowing raw HTML: avoid allowDangerousHtml in remark-rehype or strip/escape HTML before processing. [1][4]

Use a strict sanitize schema (e.g., GitHub-like schema) and disallow risky attributes (on*, javascript: URLs, data: where appropriate). rehype-sanitize has options/schemas for that. [2]

If you need limited HTML, consider whitelist/tag-filtering or transform untrusted HTML to a safe subset server-side before rendering. Some wrappers/options (tagfilter) can help but must be tested. [3][4]

Keep unified/remark/rehype/rehype-raw/rehype-sanitize and Markdown libraries up to date; monitor advisories (npm, Snyk, NVD) for related parser/XSS fixes. [2][4][6]

If you want, I can:

show a minimal rehype pipeline example that safely processes untrusted Markdown (rehype-raw followed by rehype-sanitize), or

scan specific package versions you use and list any advisories. Which would you prefer?

Sources
[1] rehype-raw README / security note.
[2] rehype-sanitize (recommended plugin + security guidance).
[3] rehype-raw usage notes (tagfilter / ordering advice).
[4] Snyk package security page for rehype-raw (maintenance & findings).
[5] NVD advisory example showing XSS in an app rendering Markdown via rehype-raw (CVE example).
[6] NVD / advisories re: markdown-it XSS (2025) — example of related parser vulnerabilities.

🌐 Web query:

react-markdown filename markdown link syntax injection attack example

💡 Result:

Short example — attack and mitigations.

Attack (malicious filename inserted into Markdown):

Vulnerable template:
Download file

Attacker sets filename exactly to:
](javascript:alert(1))<!--.pdf

Resulting Markdown becomes:
Download file
— which creates a link with a javascript: URL → XSS when clicked. [1][2]

Mitigations (pick one or more):

URL-encode filenames when inserting into links: use encodeURIComponent(filename) so special characters can’t break the link syntax. [2]

Validate/whitelist link protocols (allow only http(s), mailto, tel) or use react-markdown’s uriTransformer / transformLinkUri to filter/normalize URIs. [1][3]

Disable raw HTML in markdown and sanitize output (DOMPurify / rehype-sanitize) before rendering. [2][3]

References

Example attack discussion / link-transform protections (react-markdown / markdown libs). [1]

Markdown security best practices and sanitization (DOMPurify, URL validation, allowed protocols). [2]

Secure-react-markdown guidance (use uriTransformer / disable HTML / sanitize). [3]

Sources
[1] Medium: Avoiding XSS via Markdown in React (react-markdown link transformer). (medium.com)
[2] Markdown Security Best Practices (link/data/javascript protocols; sanitize/encode). (markdownlang.com)
[3] WSO2 / secure guidance for react-markdown (skipHtml, sanitize plugins). (security.docs.wso2.com)

Citations:

1: https://medium.com/javascript-security/avoiding-xss-via-markdown-in-react-91665479900?utm_source=openai

2: https://www.markdownlang.com/advanced/security.html?utm_source=openai

3: https://security.docs.wso2.com/en/latest/security-guidelines/secure-engineering-guidelines/secure-coding-guidlines/general-recommendations-for-react-secure-coding/?utm_source=openai

🏁 Script executed:

# Verify the exact data flow: check how react-markdown with rehypeRaw is configured cat -n services/platform/app/$app$/dashboard/\[id\]/chat/components/message-bubble.tsx | grep -A 20 "Markdown"

Repository: tale-project/tale

Length of output: 4690

🏁 Script executed:

# Check if there are any link/URL validators or transformers in the react-markdown setup rg -n "uriTransformer|transformLinkUri|href.*validate|allowDangerousHtml" services/platform/app/

Repository: tale-project/tale

Length of output: 43

🌐 Web query:

DOMPurify sanitize javascript: URL protocol links security 2025

💡 Result:

Short answer: DOMPurify blocks "javascript:" (and other unsafe schemes) in href/src by default, but you must keep DOMPurify up-to-date and follow defense‑in‑depth because past config-specific bugs (e.g., SAFE_FOR_TEMPLATES) or old versions have introduced XSS risks. See sources below. [1][2][3][4]

Recommendations (concise)

Upgrade to the latest DOMPurify release (patch any CVE‑fixed versions; e.g., CVE-2025-26791 affects < 3.2.4 when SAFE_FOR_TEMPLATES=true). [3]

Rely on DOMPurify’s default URL filtering (it blocks javascript:, data:, vbscript: by default) but also validate/normalize URLs server‑side where possible. [1][2]

Don’t assume custom hooks/configs are safe — review any ALLOWED_* or SAFE_FOR_TEMPLATES settings and test with current browser parsing behavior. [2][3]

For external links add rel="noopener noreferrer" (and prefer target behavior controlled by your code) and canonicalize hrefs before inserting. [1][4]

If you only need URL sanitization (not full HTML), consider dedicated URL validation/normalization in addition to DOMPurify (DOMPurify’s URL logic is internal; see related GitHub discussion). [4]

Sources

DOMPurify homepage / features (URI/protocol filtering). [1]

DOMPurify technical FAQ / “how does DOMPurify ensure…”, notes on javascript: URIs and allowed protocols. [2]

CVE-2025-26791 advisory (DOMPurify <3.2.4 SAFE_FOR_TEMPLATES mXSS issue). [3]

DOMPurify GitHub issue re: URL sanitization API (discussing internal URL logic / developer needs). [4]

🏁 Script executed:

# Check if there's any sanitization AFTER markdown rendering, before display sed -n '380,465p' "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx" | grep -E "sanitize|DOMPurify|dangerously"

Repository: tale-project/tale

Length of output: 257

🏁 Script executed:

# Trace the exact flow - what does sanitizeChatMessage output? # Check if it's actually sanitizing markdown or if it's just escaping HTML entities python3 << 'EOF' # Simulate what DOMPurify does to markdown with javascript: link markdown_input = "![badfile](javascript:alert(1))" print(f"Input markdown: {markdown_input}") # DOMPurify when given markdown as text (not HTML) would: # 1. Look for HTML tags - finds none # 2. Look for HTML attributes - finds none # 3. Returns the input unchanged (because it's just text, not HTML) print(f"DOMPurify output (treating as text): {markdown_input}") # Then react-markdown parses it: print(f"react-markdown parses: {markdown_input}") print(f"Renders as HTML link: <a href=\"javascript:alert(1)\">badfile</a>") # No further sanitization, so the javascript: link is live print("\nConclusion: The javascript: link reaches the browser unsanitized!") EOF

Repository: tale-project/tale

Length of output: 363

🏁 Script executed:

# Verify by checking if message-bubble ever calls sanitizeChatMessage AFTER markdown rendering grep -n "sanitize\|Markdown" "services/platform/app/(app)/dashboard/[id]/chat/components/message-bubble.tsx" | head -30

Repository: tale-project/tale

Length of output: 651

Escape fileName and fileType in markdown before embedding (markdown syntax injection risk).

If attachment.fileName contains markdown control characters like ](javascript:alert(1))<!--, it can break the markdown link syntax and inject a javascript: URL. For example, a fileName of ](javascript:alert(1))<!--.pdf transforms the markdown ![](javascript:alert(1))<!--.pdf](url) into a link with a javascript: protocol. Since sanitizeChatMessage is applied to the markdown source (not the rendered HTML), the javascript: link reaches the browser unsanitized.

Add the escapeMarkdownText function and apply it to both fileName and fileType:

+function escapeMarkdownText(value: string): string { + // Escape markdown syntax characters to prevent link/text injection. + return value.replaceAll('\\', '\\\\').replaceAll('[', '\\[').replaceAll(']', '\\]').replaceAll('(', '\$').replaceAll(')', '\$'); +} @@ - imageMarkdowns.push(`![${attachment.fileName}](${url})`); + imageMarkdowns.push(`![${escapeMarkdownText(attachment.fileName)}](${url})`); @@ - fileMarkdowns.push(`📎 [${attachment.fileName}](${url}) (${attachment.fileType}, ${sizeDisplay})`); + fileMarkdowns.push(`📎 [${escapeMarkdownText(attachment.fileName)}](${url}) (${escapeMarkdownText(attachment.fileType)}, ${sizeDisplay})`);

Committable suggestion skipped: line range outside the PR's diff.

#12)

Three round-2-confirmed gaps in the dual-control flow. placeLegalHold (W1 #11) - targetType: 'org' is the nuclear "halt all retention" hold; a single admin placing it freezes cleanup org-wide, defeating dual-control. Refuse by default; deployments running a single admin opt back in via the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker- checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement) supersedes this; today's commit restores the contract. approveLegalHoldRelease (W1 #12) - 5-min minimum delay between request and approval (constant RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack where one admin requests + a second admin instantly approves via the same automation flow. Skipped when self-approval is in effect (the escape hatch already opted out of dual-control entirely). - Re-check that the original requester is still an org admin at approval time. A demoted/removed requester whose request lingered loses the ability to retroactively gate a destructive change. closeLegalMatter (W1 #10 + W4 #6) - Implements the docstring promise (round-2 v08 bonus): closing a matter now fans out PENDING release requests for every active hold linked via matterRef, leveraging the new by_organizationId_matterRef index. Approval still requires a second admin via approveLegalHoldRelease — matter-close does NOT auto-release, dual-control survives. - Returns releaseRequestsFiled count so the UI can confirm the cascade. Audit row records linkedHolds + releaseRequestsFiled. - Skips holds with an existing pending or approved request (idempotent against re-running close on a half-closed matter).

…cies Bundle of round-2-confirmed cross-tenant fixes plus the dead-code delete of the semantic LLM response cache. POLICY_TYPES drift (W6 #5) - lib/shared/schemas/governance.ts now includes 'data_classification_notice' to match the Convex enum, killing the `as const` cast at use-data-classification-notice.ts:50. documents/compare_documents.ts (W6 #8) - Convex `_storage` is a global namespace; org membership alone was not enough to gate `ctx.storage.getUrl`. Adds a JOIN through fileMetadata via the new internal query verifyStorageIdsBelongToOrg to confirm both `baseStorageId` and `comparisonStorageId` are owned by the caller's org. Refuses with a clear error otherwise. Pattern copied from agent_tools/documents/helpers/retrieve_document.ts. file_metadata/actions.ts::checkFileRagStatuses (W6 #9) - Was an unauthenticated public action that could flip any org's fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS, pre-existing on `main`). Now requires `getAuthUser` and filters storageIds to ones owned by an org the caller is a member of via the new file_metadata.internal_queries.filterStorageIdsByCallerOrg. governance/queries.ts (W6 #11) - getPolicy + listPolicies now apply a member-readable allow-list (data_classification_notice, feature_flags, pii_config, chat_filter, personalization, upload_policy, default_models). All other types — login_policy.trustedProxies, password_policy, two_factor_policy, model_access.rules, budgets, retention_policy, moderation_provider.endpoint, system_prompt — are admin-only. listPolicies silently filters those out for non-admins. semantic LLM response cache — DELETE (W6 #12 + #13) - Round-2 v05 confirmed the lookup is structurally cross-tenant (filters only on agent_name, model, expires_at, similarity; ignores user_id / organization_id even though they're stored). The platform helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO callers in the monorepo, the FastAPI router was mounted but unreachable from platform — a latent foot-gun primed for the next dev to wire up unaware. Deletes: - services/platform/convex/lib/response_cache/semantic_cache.ts - services/platform/convex/lib/response_cache/internal_actions.ts - services/rag/app/routers/llm_cache.py - services/rag/app/services/llm_response_cache.py Plus the corresponding imports in routers/__init__.py, main.py, rag_service.py. Also removes the two empty-catch violations in semantic_cache.ts (no longer applicable). The exact-key Convex `lib/response_cache/{internal_mutations, internal_queries}.ts` cache stays — it is the actually-wired one and is correctly org-scoped.

Three round-2-confirmed gaps in the dual-control flow. placeLegalHold (W1 #11) - targetType: 'org' is the nuclear "halt all retention" hold; a single admin placing it freezes cleanup org-wide, defeating dual-control. Refuse by default; deployments running a single admin opt back in via the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker- checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement) supersedes this; today's commit restores the contract. approveLegalHoldRelease (W1 #12) - 5-min minimum delay between request and approval (constant RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack where one admin requests + a second admin instantly approves via the same automation flow. Skipped when self-approval is in effect (the escape hatch already opted out of dual-control entirely). - Re-check that the original requester is still an org admin at approval time. A demoted/removed requester whose request lingered loses the ability to retroactively gate a destructive change. closeLegalMatter (W1 #10 + W4 #6) - Implements the docstring promise (round-2 v08 bonus): closing a matter now fans out PENDING release requests for every active hold linked via matterRef, leveraging the new by_organizationId_matterRef index. Approval still requires a second admin via approveLegalHoldRelease — matter-close does NOT auto-release, dual-control survives. - Returns releaseRequestsFiled count so the UI can confirm the cascade. Audit row records linkedHolds + releaseRequestsFiled. - Skips holds with an existing pending or approved request (idempotent against re-running close on a half-closed matter).

…cies Bundle of round-2-confirmed cross-tenant fixes plus the dead-code delete of the semantic LLM response cache. POLICY_TYPES drift (W6 #5) - lib/shared/schemas/governance.ts now includes 'data_classification_notice' to match the Convex enum, killing the `as const` cast at use-data-classification-notice.ts:50. documents/compare_documents.ts (W6 #8) - Convex `_storage` is a global namespace; org membership alone was not enough to gate `ctx.storage.getUrl`. Adds a JOIN through fileMetadata via the new internal query verifyStorageIdsBelongToOrg to confirm both `baseStorageId` and `comparisonStorageId` are owned by the caller's org. Refuses with a clear error otherwise. Pattern copied from agent_tools/documents/helpers/retrieve_document.ts. file_metadata/actions.ts::checkFileRagStatuses (W6 #9) - Was an unauthenticated public action that could flip any org's fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS, pre-existing on `main`). Now requires `getAuthUser` and filters storageIds to ones owned by an org the caller is a member of via the new file_metadata.internal_queries.filterStorageIdsByCallerOrg. governance/queries.ts (W6 #11) - getPolicy + listPolicies now apply a member-readable allow-list (data_classification_notice, feature_flags, pii_config, chat_filter, personalization, upload_policy, default_models). All other types — login_policy.trustedProxies, password_policy, two_factor_policy, model_access.rules, budgets, retention_policy, moderation_provider.endpoint, system_prompt — are admin-only. listPolicies silently filters those out for non-admins. semantic LLM response cache — DELETE (W6 #12 + #13) - Round-2 v05 confirmed the lookup is structurally cross-tenant (filters only on agent_name, model, expires_at, similarity; ignores user_id / organization_id even though they're stored). The platform helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO callers in the monorepo, the FastAPI router was mounted but unreachable from platform — a latent foot-gun primed for the next dev to wire up unaware. Deletes: - services/platform/convex/lib/response_cache/semantic_cache.ts - services/platform/convex/lib/response_cache/internal_actions.ts - services/rag/app/routers/llm_cache.py - services/rag/app/services/llm_response_cache.py Plus the corresponding imports in routers/__init__.py, main.py, rag_service.py. Also removes the two empty-catch violations in semantic_cache.ts (no longer applicable). The exact-key Convex `lib/response_cache/{internal_mutations, internal_queries}.ts` cache stays — it is the actually-wired one and is correctly org-scoped.

Three round-2-confirmed gaps in the dual-control flow. placeLegalHold (W1 #11) - targetType: 'org' is the nuclear "halt all retention" hold; a single admin placing it freezes cleanup org-wide, defeating dual-control. Refuse by default; deployments running a single admin opt back in via the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker- checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement) supersedes this; today's commit restores the contract. approveLegalHoldRelease (W1 #12) - 5-min minimum delay between request and approval (constant RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack where one admin requests + a second admin instantly approves via the same automation flow. Skipped when self-approval is in effect (the escape hatch already opted out of dual-control entirely). - Re-check that the original requester is still an org admin at approval time. A demoted/removed requester whose request lingered loses the ability to retroactively gate a destructive change. closeLegalMatter (W1 #10 + W4 #6) - Implements the docstring promise (round-2 v08 bonus): closing a matter now fans out PENDING release requests for every active hold linked via matterRef, leveraging the new by_organizationId_matterRef index. Approval still requires a second admin via approveLegalHoldRelease — matter-close does NOT auto-release, dual-control survives. - Returns releaseRequestsFiled count so the UI can confirm the cascade. Audit row records linkedHolds + releaseRequestsFiled. - Skips holds with an existing pending or approved request (idempotent against re-running close on a half-closed matter).

…cies Bundle of round-2-confirmed cross-tenant fixes plus the dead-code delete of the semantic LLM response cache. POLICY_TYPES drift (W6 #5) - lib/shared/schemas/governance.ts now includes 'data_classification_notice' to match the Convex enum, killing the `as const` cast at use-data-classification-notice.ts:50. documents/compare_documents.ts (W6 #8) - Convex `_storage` is a global namespace; org membership alone was not enough to gate `ctx.storage.getUrl`. Adds a JOIN through fileMetadata via the new internal query verifyStorageIdsBelongToOrg to confirm both `baseStorageId` and `comparisonStorageId` are owned by the caller's org. Refuses with a clear error otherwise. Pattern copied from agent_tools/documents/helpers/retrieve_document.ts. file_metadata/actions.ts::checkFileRagStatuses (W6 #9) - Was an unauthenticated public action that could flip any org's fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS, pre-existing on `main`). Now requires `getAuthUser` and filters storageIds to ones owned by an org the caller is a member of via the new file_metadata.internal_queries.filterStorageIdsByCallerOrg. governance/queries.ts (W6 #11) - getPolicy + listPolicies now apply a member-readable allow-list (data_classification_notice, feature_flags, pii_config, chat_filter, personalization, upload_policy, default_models). All other types — login_policy.trustedProxies, password_policy, two_factor_policy, model_access.rules, budgets, retention_policy, moderation_provider.endpoint, system_prompt — are admin-only. listPolicies silently filters those out for non-admins. semantic LLM response cache — DELETE (W6 #12 + #13) - Round-2 v05 confirmed the lookup is structurally cross-tenant (filters only on agent_name, model, expires_at, similarity; ignores user_id / organization_id even though they're stored). The platform helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO callers in the monorepo, the FastAPI router was mounted but unreachable from platform — a latent foot-gun primed for the next dev to wire up unaware. Deletes: - services/platform/convex/lib/response_cache/semantic_cache.ts - services/platform/convex/lib/response_cache/internal_actions.ts - services/rag/app/routers/llm_cache.py - services/rag/app/services/llm_response_cache.py Plus the corresponding imports in routers/__init__.py, main.py, rag_service.py. Also removes the two empty-catch violations in semantic_cache.ts (no longer applicable). The exact-key Convex `lib/response_cache/{internal_mutations, internal_queries}.ts` cache stays — it is the actually-wired one and is correctly org-scoped.

…stodian cascade Round-1 #10/#12/#13 + round-2 V3 confirmed three custodian-cascade gaps (P0-6 retention guard, P0-7 deleteDocumentById, P0-8 cleanupMessageMetadata + cleanupChatFilterEvents). The user's observation made the deeper truth clear: in production, only `org` and `userMembership` hold target types are operator-facing — `thread` / `document` / `execution` were modeled in the original fine-grained hold design but never wired into the place-hold UI (`PICKER_TARGET_TYPES = ['userMembership', 'org']`), and `usePlaceLegalHold` is called from a single dialog. The corresponding `holds.threadIds` / `documentIds` / `executionIds` Sets in `ActiveHolds` therefore stayed empty in every deployment, while ~80 LOC of dead per-row checks proliferated through retention, erasure, restore, folders, and the cascade helper. This commit collapses the model to its actual shape. Schema-level narrowing (legal_hold.ts): - HOLD_TARGET_TYPES = ['userMembership', 'org'] (write-side) - ActiveHolds = { orgHeld, userMembershipIds } - loadActiveHolds switch reduces to 2 cases; legacy thread/document/ execution rows in the table (if any) are intentionally ignored — they're surfaced in admin UI as `(legacy)` via the read-side validator in legal_hold_queries.ts. - resolveAndAssertTarget keeps only `org` + `userMembership` cases. - isHeld() simplified to org + custodian. Guard simplification (legal_hold_guard.ts): - assertNotHeld drops the per-row branches; `targetType`/`targetId` are now error-message context only. The user-custodian cascade via `authorUserId` is the sole remaining gate beyond org-wide. assertSafeRetentionDelete (internal_mutations_retention.ts): - Drops `targetType?` / `targetId?` args; adds `authorUserId?`. - Internal handler is now: cross-org check → TOCTOU check → org-wide hold → user-custodian cascade. Single, clear contract. P0-6 cascade wiring — every retention mutation passes the row's author user id (~15 call sites in internal_mutations_retention.ts): - documents → doc.createdBy - threads → thread.userId - workflowExecution → execution.userId - workflowTriggerLog → exec.userId via wfExecutionId join - messageFeedback → row.userId - memoryAudit → row.subjectUserId (with thread-fallback) - chatFilterEvent → parent thread's userId via index lookup - usageLedger → row.userId - promptTemplate → row.createdBy - messageMetadata → parent thread's userId (closes P0-8 directly) - tempFile, customer, vendor, externalConversation: org-only (no author concept on the row) P0-8 cascade fix — cleanupMessageMetadata + cleanupChatFilterEvents no longer rely on the now-empty `holds.threadIds` Set; both look up the parent thread's `userId` for the user-membership cascade (messageMetadata cleanup mutation was already P0-8 fixed in commit 1). Other dead-check removals: - retention_cleanup.ts: ~10 dead `holds.threadIds.has` / `documentIds.has` / `executionIds.has` checks removed; per-row custodian cascade either replaces them or the mutation re-checks. - governance/restore.ts: drops dead per-row branches; user-custodian cascade in restoreSoftDeletedRow (commit 1) covers the rest. - threads/cascade_helpers.ts: per-thread hold check replaced with a thread-metadata lookup → `userMembershipIds.has(thread.userId)`. - threads/delete_chat_thread.ts + restore_chat_thread.ts: drop `holds.threadIds.has(threadId)`; ownerHeld + orgHeld cover all refusal cases. - folders/mutations.ts assertNoHeldDescendantDocs: per-document hold check replaced with `userMembershipIds.has(doc.createdBy)`. - governance/erasure.ts requestErasure: drops heldThreadIds / heldDocumentIds collection — refusal is now `orgHeld || userCustodianHeld`. eraseSubjectDocuments drops the per-row held-document check (subject is by definition the only author in the iteration; user-custodian gate applied at request time). - legal_hold_internal.ts: loadActiveHoldsForOrg returns the narrowed ActiveHolds shape across the action/query boundary. Tests rewritten — `governance/__tests__/retention_legal_hold_gaps.test.ts`: - Drops thread/document/execution per-row hold cases (behaviour deleted on purpose). - Adds 6 cases covering the simplified contract: authorUserId cascade, TOCTOU + cross-org pre-checks, missing authorUserId grandfather, org-wide hold blanket refusal. - Source-grep regressions: every retention mutation passes the right authorUserId; cleanup actions pre-filter via the right `holds.userMembershipIds.has(...)`; per-row Sets stay deleted from the cleanup / erasure / guard sources. Note: this is a one-way migration. Any deployment with active legacy thread/document/execution holds loses enforcement of those holds going forward; the read-side validator in legal_hold_queries keeps showing them as `(legacy)`. Acceptable per project memory (demo stage, no prod use). Pre-prod check: convert to user/org holds first if any deployment has them. Verified: typecheck clean; 851 tests pass across affected dirs; full vitest run shows 5641/5642 pass (1 pre-existing flake in canvas-pane.test.tsx, unrelated).

…te gates P0-9 — Org-wide hold UI invisibility (round-1 #12, round-2 V4): Before: `heldByTargetValidator` declared `via: 'org'` but no producer ever emitted it. With only an org-wide hold active, `getLegalHoldByTarget` returned null and `listActiveHoldTargetIds` returned an empty list → no chat-sidebar lock icons, no admin-grid badges, no banner. Backend refused destructive actions but the UX showed nothing until the user attempted a delete and hit a confusing error. Fix: - `getLegalHoldByTarget`: third pass after direct + custodian-cascade looks up the org-wide hold row and surfaces it with `via: 'org'`. - `listActiveHoldTargetIds`: returns a new `orgHeld: boolean` flag alongside the per-target id set; the chat-sidebar ORs the flag with the per-thread set so every row gets the lock indicator when the whole org is on a hold. - `chat-history-sidebar.tsx`: replaces direct `heldThreadIds.has(id)` checks with an `isThreadHeld(id)` helper that ORs `orgWideHeld`. P0-10 — `prepareOrganizationDeletion` bypasses holds (round-1 #13, V4): `cascadeOnOrgDeleted` hard-deletes userMemories + userPreferences (and per Commit 1, every thread-bound chat-upload file) for every member of the org. The mutation refused only on owner-role check; legal-hold status was never consulted. Adds `assertNotHeld(ctx, orgId, 'org', orgId, undefined, authUser._id)` before the cascade — refuses on org-wide hold OR on a userMembership cascade against the owner deleting. `GuardedTargetType` extended with `'org'` for the error-payload context. P0-11 — `removeMember` bypasses custodian holds (round-1 #13, V4): `cascadeOnMemberRemoved` cascades into userMemories + userPreferences (+ chat-upload files) scoped to the member being removed. Mutation previously checked only role. Adds `assertNotHeld(ctx, orgId, 'userMembership', member.userId, undefined, member.userId)` before the cascade — refuses when the org is on a hold OR when the member themselves is on a custodian hold. Self-author cascade uses the same user id for `targetId` and `authorUserId` so the gate fires regardless of which leg matches. Verified: typecheck clean; 908 tests pass across affected dirs.

Three round-2-confirmed gaps in the dual-control flow. placeLegalHold (W1 #11) - targetType: 'org' is the nuclear "halt all retention" hold; a single admin placing it freezes cleanup org-wide, defeating dual-control. Refuse by default; deployments running a single admin opt back in via the existing TALE_LEGAL_HOLD_SINGLE_ADMIN_OK env. A future maker- checker placement flow (legalHoldPlaceRequests + approveLegalHoldPlacement) supersedes this; today's commit restores the contract. approveLegalHoldRelease (W1 #12) - 5-min minimum delay between request and approval (constant RELEASE_APPROVAL_MIN_DELAY_MS). Defeats the chained-call attack where one admin requests + a second admin instantly approves via the same automation flow. Skipped when self-approval is in effect (the escape hatch already opted out of dual-control entirely). - Re-check that the original requester is still an org admin at approval time. A demoted/removed requester whose request lingered loses the ability to retroactively gate a destructive change. closeLegalMatter (W1 #10 + W4 #6) - Implements the docstring promise (round-2 v08 bonus): closing a matter now fans out PENDING release requests for every active hold linked via matterRef, leveraging the new by_organizationId_matterRef index. Approval still requires a second admin via approveLegalHoldRelease — matter-close does NOT auto-release, dual-control survives. - Returns releaseRequestsFiled count so the UI can confirm the cascade. Audit row records linkedHolds + releaseRequestsFiled. - Skips holds with an existing pending or approved request (idempotent against re-running close on a half-closed matter).

…cies Bundle of round-2-confirmed cross-tenant fixes plus the dead-code delete of the semantic LLM response cache. POLICY_TYPES drift (W6 #5) - lib/shared/schemas/governance.ts now includes 'data_classification_notice' to match the Convex enum, killing the `as const` cast at use-data-classification-notice.ts:50. documents/compare_documents.ts (W6 #8) - Convex `_storage` is a global namespace; org membership alone was not enough to gate `ctx.storage.getUrl`. Adds a JOIN through fileMetadata via the new internal query verifyStorageIdsBelongToOrg to confirm both `baseStorageId` and `comparisonStorageId` are owned by the caller's org. Refuses with a clear error otherwise. Pattern copied from agent_tools/documents/helpers/retrieve_document.ts. file_metadata/actions.ts::checkFileRagStatuses (W6 #9) - Was an unauthenticated public action that could flip any org's fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS, pre-existing on `main`). Now requires `getAuthUser` and filters storageIds to ones owned by an org the caller is a member of via the new file_metadata.internal_queries.filterStorageIdsByCallerOrg. governance/queries.ts (W6 #11) - getPolicy + listPolicies now apply a member-readable allow-list (data_classification_notice, feature_flags, pii_config, chat_filter, personalization, upload_policy, default_models). All other types — login_policy.trustedProxies, password_policy, two_factor_policy, model_access.rules, budgets, retention_policy, moderation_provider.endpoint, system_prompt — are admin-only. listPolicies silently filters those out for non-admins. semantic LLM response cache — DELETE (W6 #12 + #13) - Round-2 v05 confirmed the lookup is structurally cross-tenant (filters only on agent_name, model, expires_at, similarity; ignores user_id / organization_id even though they're stored). The platform helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO callers in the monorepo, the FastAPI router was mounted but unreachable from platform — a latent foot-gun primed for the next dev to wire up unaware. Deletes: - services/platform/convex/lib/response_cache/semantic_cache.ts - services/platform/convex/lib/response_cache/internal_actions.ts - services/rag/app/routers/llm_cache.py - services/rag/app/services/llm_response_cache.py Plus the corresponding imports in routers/__init__.py, main.py, rag_service.py. Also removes the two empty-catch violations in semantic_cache.ts (no longer applicable). The exact-key Convex `lib/response_cache/{internal_mutations, internal_queries}.ts` cache stays — it is the actually-wired one and is correctly org-scoped.

…stodian cascade Round-1 #10/#12/#13 + round-2 V3 confirmed three custodian-cascade gaps (P0-6 retention guard, P0-7 deleteDocumentById, P0-8 cleanupMessageMetadata + cleanupChatFilterEvents). The user's observation made the deeper truth clear: in production, only `org` and `userMembership` hold target types are operator-facing — `thread` / `document` / `execution` were modeled in the original fine-grained hold design but never wired into the place-hold UI (`PICKER_TARGET_TYPES = ['userMembership', 'org']`), and `usePlaceLegalHold` is called from a single dialog. The corresponding `holds.threadIds` / `documentIds` / `executionIds` Sets in `ActiveHolds` therefore stayed empty in every deployment, while ~80 LOC of dead per-row checks proliferated through retention, erasure, restore, folders, and the cascade helper. This commit collapses the model to its actual shape. Schema-level narrowing (legal_hold.ts): - HOLD_TARGET_TYPES = ['userMembership', 'org'] (write-side) - ActiveHolds = { orgHeld, userMembershipIds } - loadActiveHolds switch reduces to 2 cases; legacy thread/document/ execution rows in the table (if any) are intentionally ignored — they're surfaced in admin UI as `(legacy)` via the read-side validator in legal_hold_queries.ts. - resolveAndAssertTarget keeps only `org` + `userMembership` cases. - isHeld() simplified to org + custodian. Guard simplification (legal_hold_guard.ts): - assertNotHeld drops the per-row branches; `targetType`/`targetId` are now error-message context only. The user-custodian cascade via `authorUserId` is the sole remaining gate beyond org-wide. assertSafeRetentionDelete (internal_mutations_retention.ts): - Drops `targetType?` / `targetId?` args; adds `authorUserId?`. - Internal handler is now: cross-org check → TOCTOU check → org-wide hold → user-custodian cascade. Single, clear contract. P0-6 cascade wiring — every retention mutation passes the row's author user id (~15 call sites in internal_mutations_retention.ts): - documents → doc.createdBy - threads → thread.userId - workflowExecution → execution.userId - workflowTriggerLog → exec.userId via wfExecutionId join - messageFeedback → row.userId - memoryAudit → row.subjectUserId (with thread-fallback) - chatFilterEvent → parent thread's userId via index lookup - usageLedger → row.userId - promptTemplate → row.createdBy - messageMetadata → parent thread's userId (closes P0-8 directly) - tempFile, customer, vendor, externalConversation: org-only (no author concept on the row) P0-8 cascade fix — cleanupMessageMetadata + cleanupChatFilterEvents no longer rely on the now-empty `holds.threadIds` Set; both look up the parent thread's `userId` for the user-membership cascade (messageMetadata cleanup mutation was already P0-8 fixed in commit 1). Other dead-check removals: - retention_cleanup.ts: ~10 dead `holds.threadIds.has` / `documentIds.has` / `executionIds.has` checks removed; per-row custodian cascade either replaces them or the mutation re-checks. - governance/restore.ts: drops dead per-row branches; user-custodian cascade in restoreSoftDeletedRow (commit 1) covers the rest. - threads/cascade_helpers.ts: per-thread hold check replaced with a thread-metadata lookup → `userMembershipIds.has(thread.userId)`. - threads/delete_chat_thread.ts + restore_chat_thread.ts: drop `holds.threadIds.has(threadId)`; ownerHeld + orgHeld cover all refusal cases. - folders/mutations.ts assertNoHeldDescendantDocs: per-document hold check replaced with `userMembershipIds.has(doc.createdBy)`. - governance/erasure.ts requestErasure: drops heldThreadIds / heldDocumentIds collection — refusal is now `orgHeld || userCustodianHeld`. eraseSubjectDocuments drops the per-row held-document check (subject is by definition the only author in the iteration; user-custodian gate applied at request time). - legal_hold_internal.ts: loadActiveHoldsForOrg returns the narrowed ActiveHolds shape across the action/query boundary. Tests rewritten — `governance/__tests__/retention_legal_hold_gaps.test.ts`: - Drops thread/document/execution per-row hold cases (behaviour deleted on purpose). - Adds 6 cases covering the simplified contract: authorUserId cascade, TOCTOU + cross-org pre-checks, missing authorUserId grandfather, org-wide hold blanket refusal. - Source-grep regressions: every retention mutation passes the right authorUserId; cleanup actions pre-filter via the right `holds.userMembershipIds.has(...)`; per-row Sets stay deleted from the cleanup / erasure / guard sources. Note: this is a one-way migration. Any deployment with active legacy thread/document/execution holds loses enforcement of those holds going forward; the read-side validator in legal_hold_queries keeps showing them as `(legacy)`. Acceptable per project memory (demo stage, no prod use). Pre-prod check: convert to user/org holds first if any deployment has them. Verified: typecheck clean; 851 tests pass across affected dirs; full vitest run shows 5641/5642 pass (1 pre-existing flake in canvas-pane.test.tsx, unrelated).

…te gates P0-9 — Org-wide hold UI invisibility (round-1 #12, round-2 V4): Before: `heldByTargetValidator` declared `via: 'org'` but no producer ever emitted it. With only an org-wide hold active, `getLegalHoldByTarget` returned null and `listActiveHoldTargetIds` returned an empty list → no chat-sidebar lock icons, no admin-grid badges, no banner. Backend refused destructive actions but the UX showed nothing until the user attempted a delete and hit a confusing error. Fix: - `getLegalHoldByTarget`: third pass after direct + custodian-cascade looks up the org-wide hold row and surfaces it with `via: 'org'`. - `listActiveHoldTargetIds`: returns a new `orgHeld: boolean` flag alongside the per-target id set; the chat-sidebar ORs the flag with the per-thread set so every row gets the lock indicator when the whole org is on a hold. - `chat-history-sidebar.tsx`: replaces direct `heldThreadIds.has(id)` checks with an `isThreadHeld(id)` helper that ORs `orgWideHeld`. P0-10 — `prepareOrganizationDeletion` bypasses holds (round-1 #13, V4): `cascadeOnOrgDeleted` hard-deletes userMemories + userPreferences (and per Commit 1, every thread-bound chat-upload file) for every member of the org. The mutation refused only on owner-role check; legal-hold status was never consulted. Adds `assertNotHeld(ctx, orgId, 'org', orgId, undefined, authUser._id)` before the cascade — refuses on org-wide hold OR on a userMembership cascade against the owner deleting. `GuardedTargetType` extended with `'org'` for the error-payload context. P0-11 — `removeMember` bypasses custodian holds (round-1 #13, V4): `cascadeOnMemberRemoved` cascades into userMemories + userPreferences (+ chat-upload files) scoped to the member being removed. Mutation previously checked only role. Adds `assertNotHeld(ctx, orgId, 'userMembership', member.userId, undefined, member.userId)` before the cascade — refuses when the org is on a hold OR when the member themselves is on a custodian hold. Self-author cascade uses the same user id for `targetId` and `authorUserId` so the gate fires regardless of which leg matches. Verified: typecheck clean; 908 tests pass across affected dirs.

Closes #9, #10, #11, #12 — cascade correctness + GC durability. - `personalization_cascade.ts:cascadeOnOrgDeleted` swaps delete order: `db.delete` runs FIRST, then `storage.delete` inside the try/catch. Matches the documented contract in `tts/cascade_helpers.ts:55-62` — Convex `_storage` writes are out-of-band and not rolled back on tx abort, so the reverse order leaves a surviving row pointing at a dead storageId (404 on `/api/tts-audio`). - `threads/cascade_helpers.ts` step 7c TTS cleanup gets the same swap, for the same reason. - `cascadeOnTtsForMemberRemoved` per-mutation page cap lowered from 50 (~10K writes) to 30 (~6K writes) to stay under Convex's ~8K per-mutation write budget. `cascadeOnOrgDeleted` gets the same cap reduction. The hourly cron picks up whatever doesn't fit in a single pass — still well inside the 30-day Art 12(3) GDPR window. - `gcOrgTtsChunks` now persists its org-cursor in a new singleton `ttsGcCursor` table between cron runs. A deployment with more orgs than `MAX_ORGS_PER_RUN` now advances through the full org list over successive hours instead of restarting from the lex-first org every time and starving lex-tail orgs forever. On reaching the end of the org list the cursor wraps to null and the next run starts over. - `gcOrgTtsChunks` skip-empty: an org with no rows older than the retention cutoff no longer counts against `MAX_ORGS_PER_RUN`. Without this, a busy tail of stale orgs sandwiched behind quiet lex-leading orgs would never get reaped.

Two-round, 39-agent review (30 round-1 + 9 round-2) on top of round-5 identified 16 HIGH + 18 MEDIUM defects across the TTS pipeline. This single commit closes every HIGH/MEDIUM that wasn't deferred: Server (correctness / cost) - reserveChunk schedules markChunkFailed watchdog at PENDING_STALE_MS+5s so an action that crashes between storage.store and mark-ready can no longer strand a row in `pending` until the 7-day cron - OCC detection now layers ConvexError.data.code + constructor.name + the regex fallback so reserialized errors still resolve to CONTENTION - 429 Retry-After is parsed via TtsProviderHttpError + parseRetryAfterMs and threaded back through the action return; classifier exposes retryAfterMs on ClassifiedFailure - 4xx classifier split into PROVIDER_AUTH (401/403), PROVIDER_BAD_REQUEST (400/422), PROVIDER_PAYLOAD_TOO_LARGE (413), plus PROVIDER_4XX generic fallback - WATCHDOG_TIMEOUT code added to literals and announcer/indicator maps - Synth action now logs charCount/bytes/durationMs on success and on every provider-error path; sets Accept: audio/* on the upstream POST - All TTS warn/error logs route through sanitizeError; http.ts:270 brought into line with the rest of the file - scheduleOpportunisticCleanup helper deduplicates the two mark-ready/mark-failed cleanup blocks Governance + cascade - Added policyType: 'voice_output' with config { enabled: boolean } to schema + upsertPolicy dispatch; org admins can now disable voice for the entire tenant - getVoiceModeEffective tier order is now org_policy > thread_override > user_pref > false; org-policy veto added to source union - Legacy thread fallback (no organizationId) now returns false instead of a non-deterministic prefix-only first match across orgs HTTP + SSRF - /api/tts-audio: Cross-Origin-Resource-Policy: same-origin on 200; Cache-Control: no-store + Vary: Cookie + WWW-Authenticate on 401 - safe_fetch: IPv6 link-local check now covers all of fe80::/10 (plus fec0::/10 site-local) and strips %zoneid; redirect-follow restricted to {301,302,303,307,308} so 304/305/306 stop misclassifying Client lifecycle - runNext gated on enabledRef and a dedicated useEffect clears queueRef/retry timers when opts.enabled flips false, so toggle-off mid-stream no longer drains the queue (and bills) for paragraphs the user already silenced - detectChunkLocale now uses a CJK-ratio threshold (40%) instead of a single-character probe, so "中文 with English help" stays in fallback - Player's onEnded detaches listeners symmetrically with onError, and stop() advances playEpoch so a coordinator.claim().then() callback racing a teardown can no longer re-flip activeRef - VoiceOutputProvider accepts optional threadId and resets per-thread stores (preReservationErrors, activePlayback, announcer) on switch; resetAll() added to all three store factories Markdown stripper - ~~~ fences, 4-space indented code blocks, autolinks <https://…>, bare URLs, tables, $inline$ / $$block$$ math, ~~strike~~, task-list checkboxes, HTML tags, common HTML entities, footnote refs, intra- word _emphasis_ (boundary-anchored per CommonMark) - Tests extended from 13 to 24 cases covering every regression class Spotlight + a11y - assistant-message-content masks fenced regions before paragraph split, so a fence containing a blank line no longer renders as an unterminated <pre> + literal backticks - Inactive paragraphs receive aria-hidden during active playback to prevent SR double-read; spotlight now uses text-muted-foreground (WCAG-compliant) plus a left-border accent as a non-color cue Dropdown / thread toggle a11y - Added a 'checkbox' DropdownMenu variant rendering CheckboxItem with onSelect preventDefault and min-h-11 row height (44px target) - Per-thread voice toggle migrated from a Switch-inside-custom-row to the new variant; arrow-key navigable, announces aria-checked, menuitemcheckbox role i18n + docs - en.json:4806 'Voice mode' → 'Voice output' (canonical per GLOSSARY) - de.json:2067 'Sprachwiedergabe' → 'Sprachausgabe' (matches 13 other occurrences); 2066 trailing-space ellipsis fix - All three voice-output.md rewritten to describe the Ellipsis-menu Switch (not the deleted tri-state speaker icon) - All three providers.md TTS-fields tables gained defaultInstructions and instructionsByLocale rows, and clarified per-1M-character billing Deferred (separate PRs): - HIGH #12 header-icon discoverability restoration (per user) - safe_fetch per-read idle timeout, undici Dispatcher pinning - Comprehensive test backfill for synthesize/mutations/queries - Markdown-strip → remark migration (streaming-state requirement) bun lint: 0 errors. tsc --noEmit: clean. markdown-strip + error_codes suites: 24/24 + 41/41 passing.

larryro merged commit 92b6cc6 into main Dec 12, 2025
1 of 2 checks passed

larryro deleted the chat-agent-files-support branch December 12, 2025 07:04

coderabbitai Bot requested changes Dec 12, 2025

View reviewed changes

This was referenced Dec 16, 2025

improve coding agent #19

Merged

feat: introduce dedicated vision model for image analysis #30

Merged

This was referenced Feb 3, 2026

refactor(platform): update file upload components #344

Merged

chore: unified file domain model #380

Merged

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat: add file attachment support for chat agent with document parsing (

81048d9

#12)

coderabbitai Bot mentioned this pull request Apr 11, 2026

feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops #1407

Merged

7 tasks

		from typing import Dict, Any

		logger = logging.getLogger(__name__)

Conversation

larryro commented Dec 12, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Dec 12, 2025

Walkthrough

Estimated code review effort

Possibly related PRs

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Dec 12, 2025 •

edited by coderabbitai Bot

Loading