feat(platform,crawler): add website page embeddings with vector search #501
Introduce a website_page_embeddings module that chunks crawled page content, generates OpenAI embeddings, and stores them in Convex vector indexes. The web agent tool now performs hybrid search (vector + text) with reciprocal rank fusion (RRF) to find relevant page sections when answering questions.
Key changes:
- Add website_page_embeddings domain: chunking, content hashing, embedding generation, internal actions/mutations/queries, and RRF scoring
- Refactor crawler to use fit_markdown (density-filtered) over raw_markdown, add PruningContentFilter, exclude nav/footer/header noise, and fix seeder retry logic with fresh httpx client between sources
- Remove standalone web_assistant_tool sub-agent in favor of inline search_pages helper used directly by the web tool
- Add paginated website pages query and UI dialog for browsing pages
- Add embedding_config utility for centralized embedding model/dimension config
- Add RLS rules for website_page_embeddings table
- Add tests for chunking, content hashing, RRF, embedding config, pagination, and crawler markdown selection
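The chunking step described above can be illustrated with a minimal sketch. This is not the PR's `chunkContent` implementation: the function name `chunkText` and its signature are hypothetical, and the defaults (1500-character chunks, 200-character overlap, 50-character minimum) are taken from the values mentioned elsewhere in this review, assuming they are the configured ones.

```typescript
// Hypothetical sketch of fixed-size chunking with overlap.
// Names and defaults are illustrative assumptions, not the PR's code.
interface Chunk {
  index: number;
  content: string;
}

function chunkText(
  text: string,
  chunkSize = 1500,
  overlap = 200,
  minLength = 50,
): Chunk[] {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: Chunk[] = [];
  // Each window starts (chunkSize - overlap) characters after the previous one.
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    const content = text.slice(start, start + chunkSize).trim();
    // Drop fragments too short to carry useful meaning.
    if (content.length >= minLength) {
      chunks.push({ index: chunks.length, content });
    }
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

A real chunker would likely split on paragraph or sentence boundaries rather than raw character offsets; the sketch only shows how the size, overlap, and minimum-length parameters interact.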
📝 Walkthrough
This PR introduces semantic search over crawled website page content as a replacement for direct web operations. It removes the web_assistant sub-agent tool, converts the web tool from operation-based (fetch_url, browser_operate) to query-based semantic search, and implements a complete embedding pipeline including content chunking, multi-dimensional vector storage, and Reciprocal Rank Fusion for result ranking. Additionally, it adds UI components for browsing website pages, updates the crawler to prefer the fit_markdown content representation, and implements page count tracking with automatic embedding generation during page ingestion.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
services/platform/convex/agents/web/agent.ts (1)
17-58: ⚠️ Potential issue | 🟠 Major
Consider alternate instructions or handling for tool-less fallback mode.
The mandatory search-first rule in `WEB_AGENT_INSTRUCTIONS` conflicts with the generic retry/recovery mechanism in `lib/agent_response/generate_response.ts`, which disables tools for all agents (lines 363, 502, 586, 682) as a fallback strategy. When the web agent is created in tool-less mode during retry/recovery, the instructions become impossible to follow.
Rather than throwing an error (which would break fallback logic), either:
- Provide alternate instructions for tool-less retry scenarios, or
- Handle web agent tool-less cases specially in the retry logic to avoid the instruction conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agents/web/agent.ts` around lines 17 - 58, The WEB_AGENT_INSTRUCTIONS mandate always calling the web tool but createWebAgent supports a tool-less mode, which conflicts with the retry/recovery logic in lib/agent_response/generate_response.ts that disables tools; update the implementation so the web agent can operate when withTools is false by either (A) adding a second instruction set used when options.withTools === false that removes/relaxes the "MANDATORY SEARCH-FIRST RULE" (e.g., a WEB_AGENT_INSTRUCTIONS_TOOLLESS or internal flag) and ensures createWebAgent returns that instruction set, or (B) change the retry/recovery logic to avoid disabling tools for agents created by createWebAgent (detect via a creator flag) so WEB_AGENT_INSTRUCTIONS remains enforceable; reference WEB_AGENT_INSTRUCTIONS and createWebAgent when locating the code to alter, and adjust the retry paths in lib/agent_response/generate_response.ts so the behavior is consistent and does not cause impossible-to-follow instructions.
services/platform/convex/agents/builtin_agents.ts (1)
63-83: ⚠️ Potential issue | 🟡 Minor
Update web agent description to match indexed-site search.
The tool now searches crawled/indexed website pages, but the description still suggests live web search. This can mislead users about freshness and coverage.
✏️ Suggested copy adjustment
- description: 'Searches the web and retrieves the latest information',
+ description:
+   'Searches your indexed website pages and returns relevant content',
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agents/builtin_agents.ts` around lines 63 - 83, The web-assistant entry currently describes live web search but the tool queries crawled/indexed site pages; update the description string on the agent object with type 'web' and name 'web-assistant' to indicate it searches indexed/crawled website pages (e.g., "Searches crawled/indexed website pages for relevant information" or similar), so users understand the source and freshness limits; adjust any related displayName or description references for consistency if present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/tests/test_crawler_service.py`:
- Around line 104-107: The async helper async def gen() contains an unused "#
noqa: RUF028" directive on the unreachable "yield" line; simply remove the "#
noqa: RUF028" comment from the yield line in the async def gen() function so
Ruff warnings are resolved without changing behavior.
In `@services/platform/app/features/websites/components/website-pages-dialog.tsx`:
- Around line 96-106: The page dialog currently renders full page.content via
ReactMarkdown (in the website-pages-dialog component) which can hurt performance
for very long markdown; update the component to render the markdown inside a
container using markdownWrapperStyles but with a CSS max-height and overflow
hidden for collapsed state and add a simple "Show more"/"Show less" toggle that
flips a local expanded state to remove the max-height and overflow; detect
initial collapsed state either by measuring rendered scrollHeight or by a
character/line cutoff and ensure you still pass mdComponents into ReactMarkdown
so formatting remains intact.
- Around line 63-68: The onClose handler's type doesn't match ViewDialog's
onOpenChange signature: ViewDialog calls onOpenChange?.(false) (see
view-dialog.tsx) so update the usage in website-pages-dialog.tsx by either
changing the onClose prop type to accept a boolean (onClose: (open: boolean) =>
void) and pass it directly to ViewDialog.onOpenChange, or wrap it to ignore the
argument by passing onOpenChange={() => onClose()} so the boolean parameter is
not required; update the function/type accordingly and ensure
ViewDialog.onOpenChange vs onClose names match.
In `@services/platform/convex/agent_tools/web/helpers/search_pages.ts`:
- Around line 38-46: The call to
internal.website_page_embeddings.internal_actions.search includes an unnecessary
TypeScript cast "websiteId: undefined as Id<'websites'> | undefined"; remove the
cast and either omit the websiteId property entirely or set it to plain
undefined so the action validator's optional v.id('websites') type is
respected—update the object passed to ctx.runAction (the call site using
ctx.runAction and internal.website_page_embeddings.internal_actions.search)
accordingly.
In `@services/platform/convex/website_page_embeddings/chunk_content.test.ts`:
- Around line 28-31: The test currently allows chunk sizes up to 250 which is
too lenient given the test passes chunkSize=200; update the assertion that
inspects each chunk (the loop over result checking chunk.content.length) to
assert <= 200 so chunks respect the configured chunkSize, or if
overlap/preceding title logic in the chunking implementation (the code that
computes effectiveChunkSize) legitimately allows larger chunks, update the test
to reflect that behavior and add a brief comment explaining why a 250 bound is
expected; reference the variables/result array and the chunk.content.length
assertion when making the change.
- Around line 65-71: The test "filters out chunks shorter than minimum length"
currently asserts chunk.content.length >= 1 which doesn't verify
MIN_CHUNK_LENGTH behavior; update the assertion in the test for chunkContent to
assert chunk.content.length >= MIN_CHUNK_LENGTH (50) or explicitly assert that
known short paragraphs (e.g., "Short" and "Another really short piece") are not
present in result — locate the test that calls chunkContent and replace the
trivial length check with an assertion comparing to the MIN_CHUNK_LENGTH
constant or checking that filtered short strings are absent.
In `@services/platform/convex/website_page_embeddings/chunk_content.ts`:
- Around line 122-128: The getOverlapText function currently uses
slice.indexOf(' ') which finds the first space in the overlap slice and can
leave a partial word at the start; change the logic in getOverlapText to locate
the last word boundary inside the overlap (use slice.lastIndexOf(' ')) and start
after that boundary (or fall back to returning the full slice if no boundary is
found) so the overlap always begins at a full-word boundary; update the function
name reference getOverlapText and its comment to document this intent.
In `@services/platform/convex/website_page_embeddings/content_hash.ts`:
- Around line 8-13: computeContentHash currently implements a 32-bit DJB2 hash
which risks collisions at scale; replace it with a stronger 64-bit (or
cryptographic) hash to avoid silent content-change misses. Update the
computeContentHash function to use a 64-bit FNV-1a or Node's crypto (e.g.,
sha256 and truncate to 64 bits or emit full hex) and return a longer hex string
(e.g., 16 hex chars for 64-bit or full sha256 hex); ensure any
storage/comparison logic that reads this hash (calls to computeContentHash) is
adapted to the new string length/format so change detection continues to work
correctly.
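As an illustration of the 64-bit option suggested above, here is a minimal FNV-1a sketch using `BigInt`. The function name `fnv1a64Hex` is hypothetical and this is not the PR's `computeContentHash`; it hashes the UTF-8 bytes of the content and returns 16 hex characters. Note that FNV-1a is non-cryptographic, so a SHA-256 digest would be the safer choice if collision resistance against adversarial input matters.

```typescript
// Hypothetical sketch: 64-bit FNV-1a over UTF-8 bytes, returned as 16 hex chars.
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS = BigInt('0xcbf29ce484222325');
const FNV_PRIME = BigInt('0x100000001b3');
const MASK_64 = BigInt('0xffffffffffffffff');

function fnv1a64Hex(content: string): string {
  const bytes = new TextEncoder().encode(content);
  let hash = FNV_OFFSET_BASIS;
  for (let i = 0; i < bytes.length; i++) {
    // FNV-1a: XOR the byte first, then multiply by the prime.
    hash ^= BigInt(bytes[i]);
    // Keep only the low 64 bits, since BigInt is arbitrary precision.
    hash = (hash * FNV_PRIME) & MASK_64;
  }
  return hash.toString(16).padStart(16, '0');
}
```

Because the output length changes from the 32-bit format, any stored hashes would compare as "changed" on the first run after switching, which triggers one round of re-embedding.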
In `@services/platform/convex/website_page_embeddings/internal_actions.ts`:
- Around line 193-206: The current inline 5s blocking delay inside the Convex
action (see debugLog, setTimeout, then calling embedMany and assigning
queryEmbedding) consumes the action time budget; replace the blocking await new
Promise(setTimeout) by scheduling the retry via ctx.scheduler.runAfter to
execute embedMany asynchronously (pass the same params: userId, threadId,
values, textEmbeddingModel) and then persist or return the embedding when the
scheduled job runs, or if you must keep an inline retry reduce the delay to a
much smaller value (e.g., ~1s) and add a retry count/timeout guard around
embedMany to avoid further action time exhaustion.
- Around line 374-380: The EmbeddingRecord interface declares websiteId as
string but the code actually uses Id<'websites'>; update the EmbeddingRecord
definition in
services/platform/convex/website_page_embeddings/internal_actions.ts to type
websiteId as Id<'websites'> (or a union that includes it), and import the Id
type from the Convex types (e.g., import { Id } from 'convex') if not already
present; also scan any callers of EmbeddingRecord and adjust
casts/serializations where code assumed plain string to preserve correct typing.
In `@services/platform/convex/website_page_embeddings/internal_mutations.ts`:
- Around line 34-107: The switch on dimension currently falls through for
unsupported values and simply returns count = 0; update the switch to include a
default case that throws a clear error (e.g., throw new Error) listing allowed
dimensions (256, 512, 1024, 1536, 2048, 2560, 4096) so invalid inputs fail fast;
locate the switch that branches on the variable dimension in the deletion
mutation (the block that queries websitePageEmbeddings{N} and increments count)
and add the default case there to surface configuration errors.
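A minimal sketch of the fail-fast default case follows. The helper `embeddingsTableFor` is hypothetical; it only maps a dimension to the `websitePageEmbeddings{N}` table name mentioned above and assumes the seven listed dimensions are the full supported set.

```typescript
// Hypothetical sketch: resolve the dimension-specific table name,
// throwing on unsupported dimensions instead of silently doing nothing.
const SUPPORTED_DIMENSIONS = [256, 512, 1024, 1536, 2048, 2560, 4096] as const;

function embeddingsTableFor(dimension: number): string {
  switch (dimension) {
    case 256:
    case 512:
    case 1024:
    case 1536:
    case 2048:
    case 2560:
    case 4096:
      return `websitePageEmbeddings${dimension}`;
    default:
      // Surface configuration errors immediately.
      throw new Error(
        `Unsupported embedding dimension ${dimension}; ` +
          `allowed: ${SUPPORTED_DIMENSIONS.join(', ')}`,
      );
  }
}
```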
In `@services/platform/convex/website_page_embeddings/internal_queries.ts`:
- Around line 80-138: fullTextSearch currently forwards the incoming limit
directly to DB queries when websiteId is omitted, allowing unbounded scans;
clamp the requested limit to a safe max (e.g., const MAX_LIMIT = 256) at the
start of the fullTextSearch handler (before computing searchFilter and the
switch) and use the clamped value in .take(...); apply the same clamping logic
in the runVectorSearch function so both paths enforce a predictable maximum and
the websiteId-specific branch still respects its existing 256 cap.
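The clamping suggested above could look like the sketch below. `clampLimit` is a hypothetical helper, and treating non-finite input as "use the maximum" is an assumption made for illustration; the handler might reasonably reject such input instead.

```typescript
// Hypothetical sketch: bound a caller-supplied limit to [1, MAX_LIMIT].
const MAX_LIMIT = 256;

function clampLimit(requested: number, max: number = MAX_LIMIT): number {
  // Assumption: NaN or Infinity falls back to the maximum rather than throwing.
  if (!Number.isFinite(requested)) return max;
  return Math.min(Math.max(1, Math.floor(requested)), max);
}
```

The clamped value would then be the one passed to `.take(...)` in both the full-text and vector search paths, so neither can exceed the cap.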
In `@services/platform/convex/websites/bulk_upsert_pages.ts`:
- Around line 133-143: The current loop in bulk_upsert_pages.ts enqueues one job
per page using ctx.scheduler.runAfter(0,
internal.website_page_embeddings.internal_actions.generateForPage) for each
pageId in pageIdsToEmbed which can flood the scheduler for large sites; change
this to batch or throttle: group pageIdsToEmbed into chunks (e.g., 50–200 ids)
and either (A) call a new batched action (create
internal.website_page_embeddings.internal_actions.generateForPages that accepts
an array of pageIds) or (B) schedule generateForPage with incremental delays
(e.g., runAfter(i * delayMs)) to spread load, and update callers to use the
chunking logic so the scheduler queue is not overwhelmed. Ensure references to
pageIdsToEmbed, ctx.scheduler.runAfter, and
internal.website_page_embeddings.internal_actions.generateForPage are updated
accordingly.
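One way to combine the batching and staggering options is sketched below. `planBatches` and its defaults (100 ids per batch, 500 ms between batches) are hypothetical; the function only computes the plan and does not call Convex. Each planned batch would then be passed to `ctx.scheduler.runAfter` with its `delayMs`, targeting a batched action such as the proposed `generateForPages`.

```typescript
// Hypothetical sketch: split ids into batches and assign each a staggered delay.
interface ScheduledBatch<T> {
  delayMs: number;
  ids: T[];
}

function planBatches<T>(
  ids: readonly T[],
  batchSize = 100,
  delayStepMs = 500,
): ScheduledBatch<T>[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: ScheduledBatch<T>[] = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push({
      // Batch n runs n * delayStepMs after the mutation commits.
      delayMs: (i / batchSize) * delayStepMs,
      ids: ids.slice(i, i + batchSize),
    });
  }
  return batches;
}
```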
---
Outside diff comments:
In `@services/platform/convex/agents/builtin_agents.ts`:
- Around line 63-83: The web-assistant entry currently describes live web search
but the tool queries crawled/indexed site pages; update the description string
on the agent object with type 'web' and name 'web-assistant' to indicate it
searches indexed/crawled website pages (e.g., "Searches crawled/indexed website
pages for relevant information" or similar), so users understand the source and
freshness limits; adjust any related displayName or description references for
consistency if present.
In `@services/platform/convex/agents/web/agent.ts`:
- Around line 17-58: The WEB_AGENT_INSTRUCTIONS mandate always calling the web
tool but createWebAgent supports a tool-less mode, which conflicts with the
retry/recovery logic in lib/agent_response/generate_response.ts that disables
tools; update the implementation so the web agent can operate when withTools is
false by either (A) adding a second instruction set used when options.withTools
=== false that removes/relaxes the "MANDATORY SEARCH-FIRST RULE" (e.g., a
WEB_AGENT_INSTRUCTIONS_TOOLLESS or internal flag) and ensures createWebAgent
returns that instruction set, or (B) change the retry/recovery logic to avoid
disabling tools for agents created by createWebAgent (detect via a creator flag)
so WEB_AGENT_INSTRUCTIONS remains enforceable; reference WEB_AGENT_INSTRUCTIONS
and createWebAgent when locating the code to alter, and adjust the retry paths
in lib/agent_response/generate_response.ts so the behavior is consistent and
does not cause impossible-to-follow instructions.
Greptile Summary
Introduced comprehensive vector search infrastructure for crawled website pages, combining semantic embeddings with full-text search using reciprocal rank fusion for better relevance.
Key Changes:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| services/platform/convex/website_page_embeddings/internal_actions.ts | Implements embedding generation and vector search with retry logic and RRF merging; comprehensive error handling and proper content hashing |
| services/platform/convex/website_page_embeddings/schema.ts | Multi-dimension vector table definitions with appropriate indexes for org/page filtering and search |
| services/crawler/app/services/crawler_service.py | Improved seeder retry logic with client re-initialization; switched to fit_markdown for better content quality; added PruningContentFilter |
| services/platform/convex/agent_tools/web/web_tool.ts | Simplified web tool now focuses solely on semantic search; clear API with mandatory URL citation requirement |
| services/platform/convex/agents/web/agent.ts | Refactored web agent now specializes in crawled content search; mandatory search-first rule and source citation requirements |
| services/platform/convex/websites/bulk_upsert_pages.ts | Added content change detection and automatic embedding generation scheduling; updates page count on inserts |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User adds website] --> B[Crawler discovers URLs]
B --> C[Crawler fetches pages<br/>fit_markdown + PruningContentFilter]
C --> D[bulk_upsert_pages mutation]
D --> E{Content changed?}
E -->|Yes| F[Schedule generateForPage action]
E -->|No| G[Skip embedding generation]
F --> H[Chunk content<br/>1500 chars, 200 overlap]
H --> I[Compute content hash]
I --> J[Generate embeddings via OpenAI]
J --> K[Delete old embeddings for page]
K --> L[Insert new embeddings into<br/>dimension-specific table]
M[Agent receives query] --> N[Web tool: search pages]
N --> O[Generate query embedding]
O --> P[Parallel: Vector search +<br/>Full-text search]
P --> Q[Merge with RRF k=60]
Q --> R[Deduplicate by URL]
R --> S[Format with source URLs]
S --> T[Return to agent]
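The merge and deduplication steps in the search half of the flowchart can be sketched as follows. This is an illustrative version of reciprocal rank fusion with k = 60, not the PR's implementation: the `RankedHit` shape and the names `rrfMerge` and `dedupeByUrl` are assumptions. Each result contributes 1 / (k + rank) per list it appears in, with ranks starting at 1.

```typescript
// Hypothetical sketch of RRF merging followed by URL deduplication.
interface RankedHit {
  id: string;
  url: string;
}

interface ScoredHit extends RankedHit {
  score: number;
}

function rrfMerge(lists: RankedHit[][], k = 60): ScoredHit[] {
  const scores = new Map<string, ScoredHit>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      // index is 0-based, so the 1-based rank is index + 1.
      const increment = 1 / (k + index + 1);
      const existing = scores.get(hit.id);
      if (existing) {
        existing.score += increment;
      } else {
        scores.set(hit.id, { ...hit, score: increment });
      }
    });
  }
  // Highest fused score first.
  return Array.from(scores.values()).sort((a, b) => b.score - a.score);
}

function dedupeByUrl(hits: ScoredHit[]): ScoredHit[] {
  const seen = new Set<string>();
  // Keep only the best-scoring chunk for each page URL.
  return hits.filter((hit) => {
    if (seen.has(hit.url)) return false;
    seen.add(hit.url);
    return true;
  });
}
```

Because RRF uses only ranks, it avoids having to normalize vector similarity scores against full-text relevance scores, which are on different scales.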
Last reviewed commit: ab7170e
Summary
- Introduce a `website_page_embeddings` module that chunks crawled page content, generates OpenAI embeddings, and stores them in Convex vector indexes for semantic search
Key changes
- Refactor crawler to use `fit_markdown` (density-filtered) over `raw_markdown`, add `PruningContentFilter`, exclude nav/footer/header noise, fix seeder retry logic
- Remove standalone `web_assistant_tool` sub-agent in favor of inline `search_pages` helper with hybrid search
- Add `embedding_config` utility, RLS rules for new table
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements