feat(platform,crawler): add website page embeddings with vector search#501

Merged
larryro merged 12 commits into main from feature/website-page-embeddings
Feb 20, 2026
@larryro larryro commented Feb 20, 2026

Summary

  • Introduce a website_page_embeddings module that chunks crawled page content, generates OpenAI embeddings, and stores them in Convex vector indexes for semantic search
  • Refactor the web agent tool to perform hybrid search (vector + text) with reciprocal rank fusion (RRF) to find relevant page sections when answering questions
  • Require source URL citations in web search tool results for better traceability
  • Add paginated website pages query and UI dialog for browsing crawled pages
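As a rough illustration of the reciprocal rank fusion step mentioned above, here is a minimal sketch. The `RankedHit` shape and the `rrfMerge` name are hypothetical and not taken from this PR; `k = 60` matches the value quoted in the review summary, but the rest is an assumption about how the merge could work.

```typescript
// Hypothetical result shape; the PR's actual types are not shown on this page.
interface RankedHit {
  id: string;
}

// Reciprocal rank fusion: every list contributes 1 / (k + rank) for each item,
// with rank starting at 1. Items returned by several searches accumulate score.
function rrfMerge(
  lists: RankedHit[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: 'b' appears in both lists, so it ends up ranked first.
const vectorHits: RankedHit[] = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const textHits: RankedHit[] = [{ id: 'b' }, { id: 'd' }];
const merged = rrfMerge([vectorHits, textHits]);
```

Because only ranks are used, the vector and text scores never need to be on a comparable scale, which is the usual reason for choosing RRF in hybrid search.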

Key changes

  • website_page_embeddings domain: chunking, content hashing, embedding generation, internal actions/mutations/queries, RRF scoring, and schema
  • Crawler improvements: use fit_markdown (density-filtered) over raw_markdown, add PruningContentFilter, exclude nav/footer/header noise, fix seeder retry logic
  • Web tool refactor: remove standalone web_assistant_tool sub-agent in favor of inline search_pages helper with hybrid search
  • UI: add website pages dialog and cell components for browsing pages
  • Infrastructure: centralized embedding_config utility, RLS rules for new table

Test plan

  • Unit tests for chunking, content hashing, RRF scoring, embedding config
  • Paginated website pages query tests
  • Crawler markdown selection tests
  • Manual: verify web agent tool returns relevant results with source URL citations
  • Manual: verify website pages dialog displays crawled pages correctly

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added website pages viewer with pagination and modal dialog.
    • Introduced semantic search over indexed website content.
    • Added page count tracking for websites.
  • Improvements

    • Enhanced website crawler reliability and error handling.
    • Improved content extraction and markdown generation during crawling.
    • Replaced operation-based web tool with semantic search interface.

Introduce a website_page_embeddings module that chunks crawled page content,
generates OpenAI embeddings, and stores them in Convex vector indexes. The web
agent tool now performs hybrid search (vector + text) with reciprocal rank
fusion (RRF) to find relevant page sections when answering questions.

Key changes:
- Add website_page_embeddings domain: chunking, content hashing, embedding
  generation, internal actions/mutations/queries, and RRF scoring
- Refactor crawler to use fit_markdown (density-filtered) over raw_markdown,
  add PruningContentFilter, exclude nav/footer/header noise, and fix seeder
  retry logic with fresh httpx client between sources
- Remove standalone web_assistant_tool sub-agent in favor of inline
  search_pages helper used directly by the web tool
- Add paginated website pages query and UI dialog for browsing pages
- Add embedding_config utility for centralized embedding model/dimension config
- Add RLS rules for website_page_embeddings table
- Add tests for chunking, content hashing, RRF, embedding config, pagination,
  and crawler markdown selection

coderabbitai Bot commented Feb 20, 2026

Walkthrough

This PR introduces semantic search over crawled website page content as a replacement for direct web operations. It removes the web_assistant sub-agent tool, converts the web tool from operation-based (fetch_url, browser_operate) to query-based semantic search, and implements a complete embedding pipeline including content chunking, multi-dimensional vector storage, and Reciprocal Rank Fusion for result ranking. Additionally, it adds UI components for browsing website pages, updates the crawler to prefer fit_markdown content representation, and implements page count tracking with automatic embedding generation during page ingestion.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main feature: adding website page embeddings with vector search capability, which aligns with the PR's primary objective. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
services/platform/convex/agents/web/agent.ts (1)

17-58: ⚠️ Potential issue | 🟠 Major

Consider alternate instructions or handling for tool-less fallback mode.

The mandatory search-first rule in WEB_AGENT_INSTRUCTIONS conflicts with the generic retry/recovery mechanism in lib/agent_response/generate_response.ts, which disables tools for all agents (lines 363, 502, 586, 682) as a fallback strategy. When the web agent is created in tool-less mode during retry/recovery, the instructions become impossible to follow.

Rather than throwing an error (which would break fallback logic), either:

  • Provide alternate instructions for tool-less retry scenarios, or
  • Handle web agent tool-less cases specially in the retry logic to avoid the instruction conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/web/agent.ts` around lines 17 - 58, The
WEB_AGENT_INSTRUCTIONS mandate always calling the web tool but createWebAgent
supports a tool-less mode, which conflicts with the retry/recovery logic in
lib/agent_response/generate_response.ts that disables tools; update the
implementation so the web agent can operate when withTools is false by either
(A) adding a second instruction set used when options.withTools === false that
removes/relaxes the "MANDATORY SEARCH-FIRST RULE" (e.g., a
WEB_AGENT_INSTRUCTIONS_TOOLLESS or internal flag) and ensures createWebAgent
returns that instruction set, or (B) change the retry/recovery logic to avoid
disabling tools for agents created by createWebAgent (detect via a creator flag)
so WEB_AGENT_INSTRUCTIONS remains enforceable; reference WEB_AGENT_INSTRUCTIONS
and createWebAgent when locating the code to alter, and adjust the retry paths
in lib/agent_response/generate_response.ts so the behavior is consistent and
does not cause impossible-to-follow instructions.
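Option (A) above could look roughly like the following sketch. The instruction strings are placeholders rather than the real prompts in `agent.ts`, `selectWebAgentInstructions` is a hypothetical helper, and treating a missing `withTools` as `true` is an assumption.

```typescript
// Placeholder texts; the real prompts live in agent.ts and are not reproduced here.
const WEB_AGENT_INSTRUCTIONS =
  'MANDATORY SEARCH-FIRST RULE: always call the web tool before answering.';
const WEB_AGENT_INSTRUCTIONS_TOOLLESS =
  'Tools are unavailable. Answer from the conversation so far and state that indexed pages could not be searched.';

interface WebAgentOptions {
  withTools?: boolean;
}

// Hypothetical helper: pick the instruction set that matches the agent's mode,
// so the retry/recovery path never hands the model a rule it cannot follow.
function selectWebAgentInstructions(options: WebAgentOptions = {}): string {
  const withTools = options.withTools ?? true; // assumed default
  return withTools ? WEB_AGENT_INSTRUCTIONS : WEB_AGENT_INSTRUCTIONS_TOOLLESS;
}
```

Keeping the choice in one helper means the fallback logic in `generate_response.ts` would not need to know anything about the web agent specifically.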
services/platform/convex/agents/builtin_agents.ts (1)

63-83: ⚠️ Potential issue | 🟡 Minor

Update web agent description to match indexed-site search.

The tool now searches crawled/indexed website pages, but the description still suggests live web search. This can mislead users about freshness and coverage.

✏️ Suggested copy adjustment
-      description: 'Searches the web and retrieves the latest information',
+      description:
+        'Searches your indexed website pages and returns relevant content',
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/builtin_agents.ts` around lines 63 - 83, The
web-assistant entry currently describes live web search but the tool queries
crawled/indexed site pages; update the description string on the agent object
with type 'web' and name 'web-assistant' to indicate it searches indexed/crawled
website pages (e.g., "Searches crawled/indexed website pages for relevant
information" or similar), so users understand the source and freshness limits;
adjust any related displayName or description references for consistency if
present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/tests/test_crawler_service.py`:
- Around line 104-107: The async helper async def gen() contains an unused "#
noqa: RUF028" directive on the unreachable "yield" line; simply remove the "#
noqa: RUF028" comment from the yield line in the async def gen() function so
Ruff warnings are resolved without changing behavior.

In `@services/platform/app/features/websites/components/website-pages-dialog.tsx`:
- Around line 96-106: The page dialog currently renders full page.content via
ReactMarkdown (in the website-pages-dialog component) which can hurt performance
for very long markdown; update the component to render the markdown inside a
container using markdownWrapperStyles but with a CSS max-height and overflow
hidden for collapsed state and add a simple "Show more"/"Show less" toggle that
flips a local expanded state to remove the max-height and overflow; detect
initial collapsed state either by measuring rendered scrollHeight or by a
character/line cutoff and ensure you still pass mdComponents into ReactMarkdown
so formatting remains intact.
- Around line 63-68: The onClose handler's type doesn't match ViewDialog's
onOpenChange signature: ViewDialog calls onOpenChange?.(false) (see
view-dialog.tsx) so update the usage in website-pages-dialog.tsx by either
changing the onClose prop type to accept a boolean (onClose: (open: boolean) =>
void) and pass it directly to ViewDialog.onOpenChange, or wrap it to ignore the
argument by passing onOpenChange={() => onClose()} so the boolean parameter is
not required; update the function/type accordingly and ensure
ViewDialog.onOpenChange vs onClose names match.

In `@services/platform/convex/agent_tools/web/helpers/search_pages.ts`:
- Around line 38-46: The call to
internal.website_page_embeddings.internal_actions.search includes an unnecessary
TypeScript cast "websiteId: undefined as Id<'websites'> | undefined"; remove the
cast and either omit the websiteId property entirely or set it to plain
undefined so the action validator's optional v.id('websites') type is
respected—update the object passed to ctx.runAction (the call site using
ctx.runAction and internal.website_page_embeddings.internal_actions.search)
accordingly.

In `@services/platform/convex/website_page_embeddings/chunk_content.test.ts`:
- Around line 28-31: The test currently allows chunk sizes up to 250 which is
too lenient given the test passes chunkSize=200; update the assertion that
inspects each chunk (the loop over result checking chunk.content.length) to
assert <= 200 so chunks respect the configured chunkSize, or if
overlap/preceding title logic in the chunking implementation (the code that
computes effectiveChunkSize) legitimately allows larger chunks, update the test
to reflect that behavior and add a brief comment explaining why a 250 bound is
expected; reference the variables/result array and the chunk.content.length
assertion when making the change.
- Around line 65-71: The test "filters out chunks shorter than minimum length"
currently asserts chunk.content.length >= 1 which doesn't verify
MIN_CHUNK_LENGTH behavior; update the assertion in the test for chunkContent to
assert chunk.content.length >= MIN_CHUNK_LENGTH (50) or explicitly assert that
known short paragraphs (e.g., "Short" and "Another really short piece") are not
present in result — locate the test that calls chunkContent and replace the
trivial length check with an assertion comparing to the MIN_CHUNK_LENGTH
constant or checking that filtered short strings are absent.

In `@services/platform/convex/website_page_embeddings/chunk_content.ts`:
- Around line 122-128: The getOverlapText function currently uses
slice.indexOf(' ') which finds the first space in the overlap slice and can
leave a partial word at the start; change the logic in getOverlapText to locate
the last word boundary inside the overlap (use slice.lastIndexOf(' ')) and start
after that boundary (or fall back to returning the full slice if no boundary is
found) so the overlap always begins at a full-word boundary; update the function
name reference getOverlapText and its comment to document this intent.

In `@services/platform/convex/website_page_embeddings/content_hash.ts`:
- Around line 8-13: computeContentHash currently implements a 32-bit DJB2 hash
which risks collisions at scale; replace it with a stronger 64-bit (or
cryptographic) hash to avoid silent content-change misses. Update the
computeContentHash function to use a 64-bit FNV-1a or Node's crypto (e.g.,
sha256 and truncate to 64 bits or emit full hex) and return a longer hex string
(e.g., 16 hex chars for 64-bit or full sha256 hex); ensure any
storage/comparison logic that reads this hash (calls to computeContentHash) is
adapted to the new string length/format so change detection continues to work
correctly.
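
For reference, a 64-bit FNV-1a hash of the kind suggested above might be sketched as follows. The function name `fnv1a64Hex` is hypothetical, the constants are the standard FNV-1a 64-bit offset basis and prime, and nothing here reflects what the PR ultimately shipped.

```typescript
// Standard FNV-1a 64-bit parameters, kept as BigInt because they exceed 2^53.
const FNV_OFFSET_BASIS = BigInt('0xcbf29ce484222325');
const FNV_PRIME = BigInt('0x100000001b3');
const MASK_64 = BigInt('0xffffffffffffffff');

// Hypothetical replacement for computeContentHash: hashes the UTF-8 bytes of
// the content and returns a fixed-width, 16-character hex string.
function fnv1a64Hex(content: string): string {
  const bytes = new TextEncoder().encode(content);
  let hash = FNV_OFFSET_BASIS;
  for (let i = 0; i < bytes.length; i++) {
    hash = hash ^ BigInt(bytes[i]);
    hash = (hash * FNV_PRIME) & MASK_64; // keep the value within 64 bits
  }
  return hash.toString(16).padStart(16, '0');
}
```

Any stored 8-character hashes would no longer match after such a change, so existing pages would be treated as changed once and re-embedded.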

In `@services/platform/convex/website_page_embeddings/internal_actions.ts`:
- Around line 193-206: The current inline 5s blocking delay inside the Convex
action (see debugLog, setTimeout, then calling embedMany and assigning
queryEmbedding) consumes the action time budget; replace the blocking await new
Promise(setTimeout) by scheduling the retry via ctx.scheduler.runAfter to
execute embedMany asynchronously (pass the same params: userId, threadId,
values, textEmbeddingModel) and then persist or return the embedding when the
scheduled job runs, or if you must keep an inline retry reduce the delay to a
much smaller value (e.g., ~1s) and add a retry count/timeout guard around
embedMany to avoid further action time exhaustion.
- Around line 374-380: The EmbeddingRecord interface declares websiteId as
string but the code actually uses Id<'websites'>; update the EmbeddingRecord
definition in
services/platform/convex/website_page_embeddings/internal_actions.ts to type
websiteId as Id<'websites'> (or a union that includes it), and import the Id
type from the Convex types (e.g., import { Id } from 'convex') if not already
present; also scan any callers of EmbeddingRecord and adjust
casts/serializations where code assumed plain string to preserve correct typing.

In `@services/platform/convex/website_page_embeddings/internal_mutations.ts`:
- Around line 34-107: The switch on dimension currently falls through for
unsupported values and simply returns count = 0; update the switch to include a
default case that throws a clear error (e.g., throw new Error) listing allowed
dimensions (256, 512, 1024, 1536, 2048, 2560, 4096) so invalid inputs fail fast;
locate the switch that branches on the variable dimension in the deletion
mutation (the block that queries websitePageEmbeddings{N} and increments count)
and add the default case there to surface configuration errors.
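
A minimal sketch of the fail-fast behavior described above, assuming the per-dimension tables follow the `websitePageEmbeddings{N}` naming mentioned in the comment. `embeddingTableFor` is a hypothetical helper, not a function from the PR.

```typescript
// Hypothetical helper: map a dimension to its table name, or fail loudly.
function embeddingTableFor(dimension: number): string {
  switch (dimension) {
    case 256:
    case 512:
    case 1024:
    case 1536:
    case 2048:
    case 2560:
    case 4096:
      return `websitePageEmbeddings${dimension}`;
    default:
      // Surfacing the allowed values makes configuration errors easy to spot.
      throw new Error(
        `Unsupported embedding dimension ${dimension}; expected one of 256, 512, 1024, 1536, 2048, 2560, 4096`,
      );
  }
}
```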

In `@services/platform/convex/website_page_embeddings/internal_queries.ts`:
- Around line 80-138: fullTextSearch currently forwards the incoming limit
directly to DB queries when websiteId is omitted, allowing unbounded scans;
clamp the requested limit to a safe max (e.g., const MAX_LIMIT = 256) at the
start of the fullTextSearch handler (before computing searchFilter and the
switch) and use the clamped value in .take(...); apply the same clamping logic
in the runVectorSearch function so both paths enforce a predictable maximum and
the websiteId-specific branch still respects its existing 256 cap.
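
The clamping suggested above can be sketched in a few lines. `clampLimit` is a hypothetical name, `MAX_LIMIT = 256` comes from the comment, and falling back to 1 for non-positive or non-finite input is an assumption.

```typescript
const MAX_LIMIT = 256;

// Hypothetical helper: normalize a caller-supplied limit before it reaches .take(...).
function clampLimit(requested: number, max: number = MAX_LIMIT): number {
  if (!Number.isFinite(requested) || requested < 1) {
    return 1; // assumed fallback for invalid input
  }
  return Math.min(Math.floor(requested), max);
}
```

Calling the same helper from both `fullTextSearch` and `runVectorSearch` would keep the two paths consistent.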

In `@services/platform/convex/websites/bulk_upsert_pages.ts`:
- Around line 133-143: The current loop in bulk_upsert_pages.ts enqueues one job
per page using ctx.scheduler.runAfter(0,
internal.website_page_embeddings.internal_actions.generateForPage) for each
pageId in pageIdsToEmbed which can flood the scheduler for large sites; change
this to batch or throttle: group pageIdsToEmbed into chunks (e.g., 50–200 ids)
and either (A) call a new batched action (create
internal.website_page_embeddings.internal_actions.generateForPages that accepts
an array of pageIds) or (B) schedule generateForPage with incremental delays
(e.g., runAfter(i * delayMs)) to spread load, and update callers to use the
chunking logic so the scheduler queue is not overwhelmed. Ensure references to
pageIdsToEmbed, ctx.scheduler.runAfter, and
internal.website_page_embeddings.internal_actions.generateForPage are updated
accordingly.
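
As a sketch of the batching idea above, the helper below groups page ids into fixed-size batches and assigns each batch an incremental delay. `planEmbeddingBatches` and its return shape are hypothetical; in the mutation, each planned batch would presumably be handed to `ctx.scheduler.runAfter(delayMs, ...)` with a batched action such as the proposed `generateForPages`.

```typescript
interface PlannedBatch<T> {
  delayMs: number;
  ids: T[];
}

// Hypothetical helper: split ids into batches and stagger their start times.
function planEmbeddingBatches<T>(
  ids: readonly T[],
  batchSize: number,
  delayStepMs: number,
): PlannedBatch<T>[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: PlannedBatch<T>[] = [];
  for (let start = 0; start < ids.length; start += batchSize) {
    batches.push({
      // The first batch runs immediately; later ones are spread out.
      delayMs: batches.length * delayStepMs,
      ids: ids.slice(start, start + batchSize),
    });
  }
  return batches;
}

// Example: five pages, two per batch, 500 ms apart.
const plan = planEmbeddingBatches(['p1', 'p2', 'p3', 'p4', 'p5'], 2, 500);
```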


greptile-apps Bot commented Feb 20, 2026

Greptile Summary

Introduced comprehensive vector search infrastructure for crawled website pages, combining semantic embeddings with full-text search using reciprocal rank fusion for better relevance.

Key Changes:

  • New website_page_embeddings domain: chunk content (1500 chars, 200 overlap), generate OpenAI embeddings, store in dimension-specific Convex vector tables (256-4096), hybrid search with RRF k=60
  • Crawler improvements: switched from raw_markdown to fit_markdown (density-filtered main content), added PruningContentFilter with threshold 0.4, excluded nav/footer/header tags, improved URL discovery retry logic with seeder re-initialization
  • Web tool refactor: removed standalone web_assistant_tool sub-agent, replaced with inline search_pages helper performing hybrid vector+text search over indexed pages with mandatory source URL citation
  • Embedding generation: content-change detection in bulk_upsert_pages, automatic scheduling when pages update, retry logic for OpenAI API failures
  • UI additions: paginated website pages dialog for browsing crawled content
  • Infrastructure: centralized embedding_config utility, RLS rules for websitePages table
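
To make the chunking parameters above concrete, here is a simplified character-window sketch using 1500-character chunks with a 200-character overlap. It is an assumption-laden illustration: `chunkWithOverlap` is a hypothetical name, and the PR's own chunker also handles paragraph boundaries, word boundaries, and a minimum chunk length, none of which are modeled here.

```typescript
// Split text into fixed-size windows where consecutive windows share `overlap` characters.
function chunkWithOverlap(
  text: string,
  chunkSize = 1500,
  overlap = 200,
): string[] {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Example: 3000 characters yield windows starting at 0, 1300, and 2600.
const sampleChunks = chunkWithOverlap('x'.repeat(3000));
```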

Confidence Score: 4/5

  • Safe to merge with careful monitoring - comprehensive feature with good test coverage but embedding costs should be tracked
  • Well-structured implementation with unit tests for chunking, RRF, and embedding config. Crawler changes have test coverage. Minor concerns: no integration tests for full embedding pipeline, potential OpenAI API cost implications, and websiteId filter only applied post-vector-search rather than in query (due to Convex limitations)
  • Monitor services/platform/convex/website_page_embeddings/internal_actions.ts for embedding generation costs and retry behavior; check services/crawler/app/services/crawler_service.py for seeder re-initialization performance

Important Files Changed

| Filename | Overview |
| --- | --- |
| services/platform/convex/website_page_embeddings/internal_actions.ts | Implements embedding generation and vector search with retry logic and RRF merging; comprehensive error handling and proper content hashing |
| services/platform/convex/website_page_embeddings/schema.ts | Multi-dimension vector table definitions with appropriate indexes for org/page filtering and search |
| services/crawler/app/services/crawler_service.py | Improved seeder retry logic with client re-initialization; switched to fit_markdown for better content quality; added PruningContentFilter |
| services/platform/convex/agent_tools/web/web_tool.ts | Simplified web tool now focuses solely on semantic search; clear API with mandatory URL citation requirement |
| services/platform/convex/agents/web/agent.ts | Refactored web agent now specializes in crawled content search; mandatory search-first rule and source citation requirements |
| services/platform/convex/websites/bulk_upsert_pages.ts | Added content change detection and automatic embedding generation scheduling; updates page count on inserts |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User adds website] --> B[Crawler discovers URLs]
    B --> C[Crawler fetches pages<br/>fit_markdown + PruningContentFilter]
    C --> D[bulk_upsert_pages mutation]
    D --> E{Content changed?}
    E -->|Yes| F[Schedule generateForPage action]
    E -->|No| G[Skip embedding generation]
    F --> H[Chunk content<br/>1500 chars, 200 overlap]
    H --> I[Compute content hash]
    I --> J[Generate embeddings via OpenAI]
    J --> K[Delete old embeddings for page]
    K --> L[Insert new embeddings into<br/>dimension-specific table]
    
    M[Agent receives query] --> N[Web tool: search pages]
    N --> O[Generate query embedding]
    O --> P[Parallel: Vector search +<br/>Full-text search]
    P --> Q[Merge with RRF k=60]
    Q --> R[Deduplicate by URL]
    R --> S[Format with source URLs]
    S --> T[Return to agent]
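
The last two steps of the search path in the flowchart (deduplicate by URL, then format with source URLs) might be sketched like this. The `ScoredChunk` shape, both function names, and the output layout are hypothetical; keeping the highest-scoring chunk per URL is an assumption about the deduplication policy.

```typescript
// Hypothetical shape of a merged search result.
interface ScoredChunk {
  url: string;
  content: string;
  score: number;
}

// Keep only the best-scoring chunk for each URL, ordered by score descending.
function dedupeByUrl(results: ScoredChunk[]): ScoredChunk[] {
  const best = new Map<string, ScoredChunk>();
  for (const result of results) {
    const current = best.get(result.url);
    if (!current || result.score > current.score) {
      best.set(result.url, result);
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

// Attach the source URL to every snippet so the agent can cite it.
function formatWithSources(results: ScoredChunk[]): string {
  return results
    .map((r, i) => `[${i + 1}] ${r.content}\nSource: ${r.url}`)
    .join('\n\n');
}

const deduped = dedupeByUrl([
  { url: 'https://example.com/pricing', content: 'Plans start at...', score: 0.03 },
  { url: 'https://example.com/docs', content: 'Install with...', score: 0.02 },
  { url: 'https://example.com/pricing', content: 'Enterprise plan...', score: 0.01 },
]);
const formatted = formatWithSources(deduped);
```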

Last reviewed commit: ab7170e

@larryro larryro merged commit fcddf90 into main Feb 20, 2026
17 checks passed
@larryro larryro deleted the feature/website-page-embeddings branch February 20, 2026 17:52