feat(documents): enhance contract comparison with clause frequency analysis#750
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Add fileId to document list results so agents can use the correct identifier for file extraction tools (pdf, docx, excel, etc.) vs document_retrieve. Documents without a fileId are now filtered out.
…eterministic diff Refactor contract comparison workflow to use direct document comparison instead of RAG indexing, add clause frequency extraction step for tracking most-negotiated clauses, and improve report structure with executive overview, frequency table, and full localization support. Also adds markdown heading normalization in the crawler and document comparison endpoint in the RAG service.
1ff4579 to
d735691
Compare
📝 WalkthroughWalkthroughThis PR implements a file-based document comparison workflow across multiple services. It adds a new Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/platform/convex/documents/list_documents_for_agent.ts (1)
176-185:⚠️ Potential issue | 🔴 CriticalFix TypeScript type error for
fileId.The pipeline fails because TypeScript doesn't narrow the
fileIdtype through the filtering at line 118. Although documents withoutfileIdare filtered out, thematchesarray still has typeDoc<'documents'>[]wherefileIdisId<"_storage"> | undefined.🐛 Proposed fix using non-null assertion
const documents: AgentDocumentItem[] = page.map((doc) => ({ id: doc._id, - fileId: doc.fileId, + fileId: doc.fileId!, title: getDocumentTitle(doc), extension: doc.extension ?? null, folderPath: doc.folderId ? (folderPathMap.get(doc.folderId) ?? null) : null, teamId: doc.teamId ?? null, createdAt: doc._creationTime, sizeBytes: extractSize(doc.metadata), }));Alternatively, cast to
stringexplicitly since Convex IDs serialize as strings:- fileId: doc.fileId, + fileId: doc.fileId as string,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/documents/list_documents_for_agent.ts` around lines 176 - 185, The TypeScript error comes from doc.fileId being typed as Id<"_storage"> | undefined when mapping to AgentDocumentItem; update the mapping in the documents creation to guarantee a non-null string for fileId (e.g. use a non-null assertion or cast): replace fileId: doc.fileId with fileId: doc.fileId! (or fileId: String(doc.fileId) / fileId: doc.fileId as string) inside the page.map(...) so AgentDocumentItem.fileId is a string and the type error is resolved; keep the rest of the mapping (id, title via getDocumentTitle, extension, folderPath via folderPathMap, teamId, createdAt, sizeBytes via extractSize) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@services/platform/convex/agent_tools/documents/helpers/fetch_document_comparison.ts`:
- Around line 211-239: Create the AbortController and timeout before starting
the signed-URL downloads and pass its signal into the two fetch calls (the
fetch(baseFileUrl) and fetch(comparisonFileUrl)) so those downloads are bounded
by FETCH_TIMEOUT_MS; also pass the same signal into the final POST fetch to
/api/v1/documents/compare-files and ensure you clearTimeout(timeoutId) after
success (and handle abort errors) so the timeout is properly cleaned up.
In
`@services/platform/convex/workflow_engine/action_defs/document/document_action.ts`:
- Around line 46-55: The resolveFileName function currently returns a hardcoded
'Unknown' when metadata is missing and ignores any supplied baseFileName; change
resolveFileName(ctx, fileId) to accept an optional fallbackName (e.g.,
resolveFileName(ctx, fileId, fallbackName?: string)) and return
metadata?.fileName ?? fallbackName ?? '' instead of 'Unknown' so a provided
baseFileName is honoured and we avoid injecting "Unknown" that breaks extension
validation; make the same change to the other similar implementation mentioned
(the block at the other location corresponding to lines 205-211).
In `@services/rag/app/routers/documents.py`:
- Around line 511-516: The upload handling currently maps ValueError to a 422
but lets text decode errors fall through to a 500; catch UnicodeDecodeError
raised by extract_text() and raise
HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail=str(e))
from e, either by adding an explicit except UnicodeDecodeError as e: block
before the generic except Exception or by combining UnicodeDecodeError with the
existing ValueError branch; reference extract_text(), HTTPException and
status.HTTP_422_UNPROCESSABLE_ENTITY when locating where to add the new
exception handling.
In `@services/rag/app/services/rag_service.py`:
- Around line 475-508: In compare_files, ensure the service is initialized
before using the vision client: call the service's initialize() method (or the
module-level initializer used elsewhere) at the start of compare_files so
_vision_client is set when extract_text is invoked; update compare_files to
invoke initialize() (or await initialize() if async) before calling extract_text
for base and comparison files to avoid uninitialized OCR behavior in cold
workers.
---
Outside diff comments:
In `@services/platform/convex/documents/list_documents_for_agent.ts`:
- Around line 176-185: The TypeScript error comes from doc.fileId being typed as
Id<"_storage"> | undefined when mapping to AgentDocumentItem; update the mapping
in the documents creation to guarantee a non-null string for fileId (e.g. use a
non-null assertion or cast): replace fileId: doc.fileId with fileId: doc.fileId!
(or fileId: String(doc.fileId) / fileId: doc.fileId as string) inside the
page.map(...) so AgentDocumentItem.fileId is a string and the type error is
resolved; keep the rest of the mapping (id, title via getDocumentTitle,
extension, folderPath via folderPathMap, teamId, createdAt, sizeBytes via
extractSize) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a3e8c08d-033f-46b8-8b21-6c4fd6f7e52b
📒 Files selected for processing (22)
examples/workflows/contract-comparison/config.jsonservices/crawler/app/services/base_converter.pyservices/crawler/tests/test_markdown_normalize.pyservices/platform/app/features/chat/components/workflow-update-approval-card.tsxservices/platform/convex/agent_tools/documents/document_list_tool.tsservices/platform/convex/agent_tools/documents/document_retrieve_tool.tsservices/platform/convex/agent_tools/documents/helpers/fetch_document_comparison.tsservices/platform/convex/agent_tools/integrations/create_bound_integration_tool.tsservices/platform/convex/agent_tools/integrations/integration_tool.tsservices/platform/convex/agent_tools/workflows/create_bound_workflow_tool.tsservices/platform/convex/agent_tools/workflows/create_workflow_tool.tsservices/platform/convex/agent_tools/workflows/run_workflow_tool.tsservices/platform/convex/agent_tools/workflows/save_workflow_definition_tool.tsservices/platform/convex/agent_tools/workflows/update_workflow_step_tool.tsservices/platform/convex/agents/integration/agent.tsservices/platform/convex/documents/__tests__/list_documents_for_agent.test.tsservices/platform/convex/documents/list_documents_for_agent.tsservices/platform/convex/workflow_engine/action_defs/document/document_action.tsservices/platform/convex/workflow_engine/instructions/core_instructions.tsservices/rag/app/routers/documents.pyservices/rag/app/services/rag_service.pyservices/rag/tests/test_compare_files.py
| const [baseResponse, compResponse] = await Promise.all([ | ||
| fetch(baseFileUrl), | ||
| fetch(comparisonFileUrl), | ||
| ]); | ||
| if (!baseResponse.ok) { | ||
| throw new Error(`Failed to download base file: ${baseResponse.status}`); | ||
| } | ||
| if (!compResponse.ok) { | ||
| throw new Error( | ||
| `Failed to download comparison file: ${compResponse.status}`, | ||
| ); | ||
| } | ||
|
|
||
| const [baseBlob, compBlob] = await Promise.all([ | ||
| baseResponse.blob(), | ||
| compResponse.blob(), | ||
| ]); | ||
|
|
||
| const url = `${ragServiceUrl}/api/v1/documents/compare-files`; | ||
|
|
||
| const formData = new FormData(); | ||
| formData.append('base_file', baseBlob, baseFileName); | ||
| formData.append('comparison_file', compBlob, comparisonFileName); | ||
| if (maxChanges != null) { | ||
| formData.append('max_changes', String(maxChanges)); | ||
| } | ||
|
|
||
| const controller = new AbortController(); | ||
| const timeoutId = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS); |
There was a problem hiding this comment.
Bound the storage downloads with the same timeout budget.
Only the final POST is protected by FETCH_TIMEOUT_MS; the signed-URL downloads and blob() reads happen before the AbortController exists. A slow storage response can pin this synchronous path until the outer action times out.
⏱️ Suggested fix
+async function downloadBlob(
+ fileUrl: string,
+ label: 'base' | 'comparison',
+ signal: AbortSignal,
+): Promise<Blob> {
+ const response = await fetch(fileUrl, { signal });
+ if (!response.ok) {
+ throw new Error(`Failed to download ${label} file: ${response.status}`);
+ }
+ return await response.blob();
+}
+
export async function fetchDocumentComparisonByUrls(
ragServiceUrl: string,
baseFileUrl: string,
baseFileName: string,
comparisonFileUrl: string,
comparisonFileName: string,
maxChanges?: number,
): Promise<DocumentComparisonResult> {
- const [baseResponse, compResponse] = await Promise.all([
- fetch(baseFileUrl),
- fetch(comparisonFileUrl),
- ]);
- if (!baseResponse.ok) {
- throw new Error(`Failed to download base file: ${baseResponse.status}`);
- }
- if (!compResponse.ok) {
- throw new Error(
- `Failed to download comparison file: ${compResponse.status}`,
- );
- }
-
- const [baseBlob, compBlob] = await Promise.all([
- baseResponse.blob(),
- compResponse.blob(),
- ]);
-
- const url = `${ragServiceUrl}/api/v1/documents/compare-files`;
-
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
try {
+ const [baseBlob, compBlob] = await Promise.all([
+ downloadBlob(baseFileUrl, 'base', controller.signal),
+ downloadBlob(comparisonFileUrl, 'comparison', controller.signal),
+ ]);
+
+ const url = `${ragServiceUrl}/api/v1/documents/compare-files`;
+
const response = await fetch(url, {
method: 'POST',
body: formData,
signal: controller.signal,
});🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@services/platform/convex/agent_tools/documents/helpers/fetch_document_comparison.ts`
around lines 211 - 239, Create the AbortController and timeout before starting
the signed-URL downloads and pass its signal into the two fetch calls (the
fetch(baseFileUrl) and fetch(comparisonFileUrl)) so those downloads are bounded
by FETCH_TIMEOUT_MS; also pass the same signal into the final POST fetch to
/api/v1/documents/compare-files and ensure you clearTimeout(timeoutId) after
success (and handle abort errors) so the timeout is properly cleaned up.
| async function resolveFileName( | ||
| ctx: ActionCtx, | ||
| fileId: string, | ||
| ): Promise<string> { | ||
| const metadata = await ctx.runQuery( | ||
| internal.file_metadata.internal_queries.getByStorageId, | ||
| { storageId: toId<'_storage'>(fileId) }, | ||
| ); | ||
| return metadata?.fileName ?? 'Unknown'; | ||
| } |
There was a problem hiding this comment.
Resolve comparison filenames independently and stop defaulting to "Unknown".
Right now we ignore a supplied baseFileName unless comparisonFileName is also present, and a missing metadata row returns "Unknown". The compare-files endpoint validates file extensions, so that combination turns otherwise valid compares into a hard 400.
🔧 Suggested fix
async function resolveFileName(
ctx: ActionCtx,
fileId: string,
): Promise<string> {
const metadata = await ctx.runQuery(
internal.file_metadata.internal_queries.getByStorageId,
{ storageId: toId<'_storage'>(fileId) },
);
- return metadata?.fileName ?? 'Unknown';
+ if (!metadata?.fileName) {
+ throw new Error(
+ `File name not available for ${fileId}. Pass baseFileName/comparisonFileName explicitly.`,
+ );
+ }
+ return metadata.fileName;
}
@@
- const [baseFileName, compFileName] =
- params.baseFileName && params.comparisonFileName
- ? [params.baseFileName, params.comparisonFileName]
- : await Promise.all([
- resolveFileName(ctx, params.baseFileId),
- resolveFileName(ctx, params.comparisonFileId),
- ]);
+ const [baseFileName, compFileName] = await Promise.all([
+ params.baseFileName
+ ? Promise.resolve(params.baseFileName)
+ : resolveFileName(ctx, params.baseFileId),
+ params.comparisonFileName
+ ? Promise.resolve(params.comparisonFileName)
+ : resolveFileName(ctx, params.comparisonFileId),
+ ]);Also applies to: 205-211
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@services/platform/convex/workflow_engine/action_defs/document/document_action.ts`
around lines 46 - 55, The resolveFileName function currently returns a hardcoded
'Unknown' when metadata is missing and ignores any supplied baseFileName; change
resolveFileName(ctx, fileId) to accept an optional fallbackName (e.g.,
resolveFileName(ctx, fileId, fallbackName?: string)) and return
metadata?.fileName ?? fallbackName ?? '' instead of 'Unknown' so a provided
baseFileName is honoured and we avoid injecting "Unknown" that breaks extension
validation; make the same change to the other similar implementation mentioned
(the block at the other location corresponding to lines 205-211).
| except ValueError as e: | ||
| raise HTTPException( | ||
| status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, | ||
| detail=str(e), | ||
| ) from e | ||
| except Exception as e: |
There was a problem hiding this comment.
Treat text decode failures as 422s.
extract_text() can raise UnicodeDecodeError for supported text formats, so malformed .txt/.csv uploads currently fall through to the generic 500 path. Map them with the existing ValueError branch instead.
💡 Suggested handling
- except ValueError as e:
+ except (ValueError, UnicodeDecodeError) as e:
raise HTTPException(
status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
detail=str(e),
) from e📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except ValueError as e: | |
| raise HTTPException( | |
| status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, | |
| detail=str(e), | |
| ) from e | |
| except Exception as e: | |
| except (ValueError, UnicodeDecodeError) as e: | |
| raise HTTPException( | |
| status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, | |
| detail=str(e), | |
| ) from e | |
| except Exception as e: |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/app/routers/documents.py` around lines 511 - 516, The upload
handling currently maps ValueError to a 422 but lets text decode errors fall
through to a 500; catch UnicodeDecodeError raised by extract_text() and raise
HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail=str(e))
from e, either by adding an explicit except UnicodeDecodeError as e: block
before the generic except Exception or by combining UnicodeDecodeError with the
existing ValueError branch; reference extract_text(), HTTPException and
status.HTTP_422_UNPROCESSABLE_ENTITY when locating where to add the new
exception handling.
| async def compare_files( | ||
| self, | ||
| base_bytes: bytes, | ||
| base_filename: str, | ||
| comparison_bytes: bytes, | ||
| comparison_filename: str, | ||
| *, | ||
| max_changes: int = 500, | ||
| ) -> dict[str, Any]: | ||
| """Compare two uploaded files using deterministic paragraph-level diffing. | ||
|
|
||
| Extracts text directly from file bytes — no database storage or embedding. | ||
| """ | ||
| from tale_knowledge.extraction import extract_text | ||
|
|
||
| from .diff_service import compute_diff | ||
|
|
||
| base_text, _ = await extract_text( | ||
| base_bytes, | ||
| base_filename, | ||
| vision_client=self._vision_client, | ||
| ) | ||
| if not base_text or not base_text.strip(): | ||
| raise ValueError(f"No text could be extracted from base file: {base_filename}") | ||
|
|
||
| comp_text, _ = await extract_text( | ||
| comparison_bytes, | ||
| comparison_filename, | ||
| vision_client=self._vision_client, | ||
| ) | ||
| if not comp_text or not comp_text.strip(): | ||
| raise ValueError(f"No text could be extracted from comparison file: {comparison_filename}") | ||
|
|
||
| diff_result = compute_diff(base_text, comp_text, max_changes=max_changes) |
There was a problem hiding this comment.
Call initialize() before extracting the uploaded files.
compare_files() is the only public entrypoint here that skips the standard initialization guard. On a cold worker, _vision_client stays unset even when configured, so OCR-backed comparisons can silently degrade or fail until some other endpoint initializes the singleton.
🛠️ Minimal fix
async def compare_files(
self,
base_bytes: bytes,
base_filename: str,
comparison_bytes: bytes,
comparison_filename: str,
*,
max_changes: int = 500,
) -> dict[str, Any]:
"""Compare two uploaded files using deterministic paragraph-level diffing.
Extracts text directly from file bytes — no database storage or embedding.
"""
+ if not self.initialized:
+ await self.initialize()
+
from tale_knowledge.extraction import extract_text
from .diff_service import compute_diff📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def compare_files( | |
| self, | |
| base_bytes: bytes, | |
| base_filename: str, | |
| comparison_bytes: bytes, | |
| comparison_filename: str, | |
| *, | |
| max_changes: int = 500, | |
| ) -> dict[str, Any]: | |
| """Compare two uploaded files using deterministic paragraph-level diffing. | |
| Extracts text directly from file bytes — no database storage or embedding. | |
| """ | |
| from tale_knowledge.extraction import extract_text | |
| from .diff_service import compute_diff | |
| base_text, _ = await extract_text( | |
| base_bytes, | |
| base_filename, | |
| vision_client=self._vision_client, | |
| ) | |
| if not base_text or not base_text.strip(): | |
| raise ValueError(f"No text could be extracted from base file: {base_filename}") | |
| comp_text, _ = await extract_text( | |
| comparison_bytes, | |
| comparison_filename, | |
| vision_client=self._vision_client, | |
| ) | |
| if not comp_text or not comp_text.strip(): | |
| raise ValueError(f"No text could be extracted from comparison file: {comparison_filename}") | |
| diff_result = compute_diff(base_text, comp_text, max_changes=max_changes) | |
| async def compare_files( | |
| self, | |
| base_bytes: bytes, | |
| base_filename: str, | |
| comparison_bytes: bytes, | |
| comparison_filename: str, | |
| *, | |
| max_changes: int = 500, | |
| ) -> dict[str, Any]: | |
| """Compare two uploaded files using deterministic paragraph-level diffing. | |
| Extracts text directly from file bytes — no database storage or embedding. | |
| """ | |
| if not self.initialized: | |
| await self.initialize() | |
| from tale_knowledge.extraction import extract_text | |
| from .diff_service import compute_diff | |
| base_text, _ = await extract_text( | |
| base_bytes, | |
| base_filename, | |
| vision_client=self._vision_client, | |
| ) | |
| if not base_text or not base_text.strip(): | |
| raise ValueError(f"No text could be extracted from base file: {base_filename}") | |
| comp_text, _ = await extract_text( | |
| comparison_bytes, | |
| comparison_filename, | |
| vision_client=self._vision_client, | |
| ) | |
| if not comp_text or not comp_text.strip(): | |
| raise ValueError(f"No text could be extracted from comparison file: {comparison_filename}") | |
| diff_result = compute_diff(base_text, comp_text, max_changes=max_changes) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/app/services/rag_service.py` around lines 475 - 508, In
compare_files, ensure the service is initialized before using the vision client:
call the service's initialize() method (or the module-level initializer used
elsewhere) at the start of compare_files so _vision_client is set when
extract_text is invoked; update compare_files to invoke initialize() (or await
initialize() if async) before calling extract_text for base and comparison files
to avoid uninitialized OCR behavior in cold workers.
Summary
fileIdin agent document list tool for downstream workflow consumptionTest plan
languageparameter (e.g.,de,fr)test_markdown_normalize.pyandtest_compare_files.pyfileIdis returned in document list agent tool responses🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation