feat(platform,crawler): add website page embeddings with vector search by larryro · Pull Request #501 · tale-project/tale

larryro · 2026-02-20T16:25:57Z

Summary

- Introduce a website_page_embeddings module that chunks crawled page content, generates OpenAI embeddings, and stores them in Convex vector indexes for semantic search
- Refactor the web agent tool to perform hybrid search (vector + text) with reciprocal rank fusion (RRF) to find relevant page sections when answering questions
- Require source URL citations in web search tool results for better traceability
- Add paginated website pages query and UI dialog for browsing crawled pages
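
The hybrid search in the summary merges a vector ranking and a full-text ranking with reciprocal rank fusion. A minimal sketch of that scoring follows, assuming each ranking is an ordered list of chunk IDs and using k = 60, a commonly used default. The function name and input shape are illustrative and are not taken from the PR.

```typescript
// Hypothetical helper: fuse several ranked ID lists with reciprocal rank fusion.
// Each item contributes 1 / (k + rank) per list it appears in (rank is 1-based).
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: 'b' appears in both lists, so it ends up ranked first.
const fused = reciprocalRankFusion([
  ['a', 'b', 'c'], // vector search order
  ['b', 'd'], // full-text search order
]);
console.log(fused.map((hit) => hit.id));
```

Because only ranks are used, the two searches do not need comparable raw scores, which is the usual reason to prefer RRF over a weighted sum.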

Key changes

- website_page_embeddings domain: chunking, content hashing, embedding generation, internal actions/mutations/queries, RRF scoring, and schema
- Crawler improvements: use fit_markdown (density-filtered) over raw_markdown, add PruningContentFilter, exclude nav/footer/header noise, fix seeder retry logic
- Web tool refactor: remove standalone web_assistant_tool sub-agent in favor of inline search_pages helper with hybrid search
- UI: add website pages dialog and cell components for browsing pages
- Infrastructure: centralized embedding_config utility, RLS rules for new table

Test plan

- Unit tests for chunking, content hashing, RRF scoring, embedding config
- Paginated website pages query tests
- Crawler markdown selection tests
- Manual: verify web agent tool returns relevant results with source URL citations
- Manual: verify website pages dialog displays crawled pages correctly

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added website pages viewer with pagination and modal dialog.
- Introduced semantic search over indexed website content.
- Added page count tracking for websites.
Improvements
- Enhanced website crawler reliability and error handling.
- Improved content extraction and markdown generation during crawling.
- Replaced operation-based web tool with semantic search interface.

Introduce a website_page_embeddings module that chunks crawled page content, generates OpenAI embeddings, and stores them in Convex vector indexes. The web agent tool now performs hybrid search (vector + text) with reciprocal rank fusion (RRF) to find relevant page sections when answering questions.

Key changes:
- Add website_page_embeddings domain: chunking, content hashing, embedding generation, internal actions/mutations/queries, and RRF scoring
- Refactor crawler to use fit_markdown (density-filtered) over raw_markdown, add PruningContentFilter, exclude nav/footer/header noise, and fix seeder retry logic with fresh httpx client between sources
- Remove standalone web_assistant_tool sub-agent in favor of inline search_pages helper used directly by the web tool
- Add paginated website pages query and UI dialog for browsing pages
- Add embedding_config utility for centralized embedding model/dimension config
- Add RLS rules for website_page_embeddings table
- Add tests for chunking, content hashing, RRF, embedding config, pagination, and crawler markdown selection

coderabbitai · 2026-02-20T16:34:05Z

📝 Walkthrough

Walkthrough

This PR introduces semantic search over crawled website page content as a replacement for direct web operations. It removes the web_assistant sub-agent tool, converts the web tool from operation-based (fetch_url, browser_operate) to query-based semantic search, and implements a complete embedding pipeline including content chunking, vector storage in dimension-specific tables, and Reciprocal Rank Fusion for result ranking. Additionally, it adds UI components for browsing website pages, updates the crawler to prefer the fit_markdown content representation, and implements page count tracking with automatic embedding generation during page ingestion.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

refactor(platform): consolidate crawler tools into unified web tool #369 — Modifies web/crawler tooling integration, affecting URL discovery and content extraction strategies.
feat(operator,platform): refactor operator agent loop and improve platform reliability #499 — Modifies tool-call and sub-agent handling in agent response generation, intersecting with web tool removal.
refactor: context management and sub-agents improvements #185 — Modifies SubAgentType definitions and web-related sub-agent tool handling, directly related to web_assistant removal.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main feature: adding website page embeddings with vector search capability, which aligns with the PR's primary objective.


coderabbitai

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

services/platform/convex/agents/web/agent.ts (1)
17-58: ⚠️ Potential issue | 🟠 Major

Consider alternate instructions or handling for tool-less fallback mode.

The mandatory search-first rule in WEB_AGENT_INSTRUCTIONS conflicts with the generic retry/recovery mechanism in lib/agent_response/generate_response.ts, which disables tools for all agents (lines 363, 502, 586, 682) as a fallback strategy. When the web agent is created in tool-less mode during retry/recovery, the instructions become impossible to follow.

Rather than throwing an error (which would break fallback logic), either:

Provide alternate instructions for tool-less retry scenarios, or

Handle web agent tool-less cases specially in the retry logic to avoid the instruction conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/web/agent.ts` around lines 17 - 58, The
WEB_AGENT_INSTRUCTIONS mandate always calling the web tool but createWebAgent
supports a tool-less mode, which conflicts with the retry/recovery logic in
lib/agent_response/generate_response.ts that disables tools; update the
implementation so the web agent can operate when withTools is false by either
(A) adding a second instruction set used when options.withTools === false that
removes/relaxes the "MANDATORY SEARCH-FIRST RULE" (e.g., a
WEB_AGENT_INSTRUCTIONS_TOOLLESS or internal flag) and ensures createWebAgent
returns that instruction set, or (B) change the retry/recovery logic to avoid
disabling tools for agents created by createWebAgent (detect via a creator flag)
so WEB_AGENT_INSTRUCTIONS remains enforceable; reference WEB_AGENT_INSTRUCTIONS
and createWebAgent when locating the code to alter, and adjust the retry paths
in lib/agent_response/generate_response.ts so the behavior is consistent and
does not cause impossible-to-follow instructions.
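
Option (A) above can be sketched as a small selector that picks the instruction set from the `withTools` flag. This is only an illustration: the instruction strings are placeholders, `WEB_AGENT_INSTRUCTIONS_TOOLLESS` is the name the reviewer proposes rather than something in the codebase, and `selectWebAgentInstructions` is a hypothetical helper.

```typescript
// Placeholder text; the real instructions live in agents/web/agent.ts.
const WEB_AGENT_INSTRUCTIONS =
  'MANDATORY SEARCH-FIRST RULE: always call the web tool before answering.';

// Hypothetical relaxed variant for retry/recovery runs where tools are disabled.
const WEB_AGENT_INSTRUCTIONS_TOOLLESS =
  'Tools are unavailable. Answer from the conversation context and say so if you cannot.';

interface WebAgentOptions {
  withTools?: boolean;
}

// Only an explicit `withTools: false` switches to the relaxed instructions.
function selectWebAgentInstructions(options: WebAgentOptions = {}): string {
  return options.withTools === false
    ? WEB_AGENT_INSTRUCTIONS_TOOLLESS
    : WEB_AGENT_INSTRUCTIONS;
}

console.log(selectWebAgentInstructions({ withTools: false }));
```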
services/platform/convex/agents/builtin_agents.ts (1)
63-83: ⚠️ Potential issue | 🟡 Minor

Update web agent description to match indexed-site search.

The tool now searches crawled/indexed website pages, but the description still suggests live web search. This can mislead users about freshness and coverage.
✏️ Suggested copy adjustment
-      description: 'Searches the web and retrieves the latest information',
+      description:
+        'Searches your indexed website pages and returns relevant content',
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/builtin_agents.ts` around lines 63 - 83, The
web-assistant entry currently describes live web search but the tool queries
crawled/indexed site pages; update the description string on the agent object
with type 'web' and name 'web-assistant' to indicate it searches indexed/crawled
website pages (e.g., "Searches crawled/indexed website pages for relevant
information" or similar), so users understand the source and freshness limits;
adjust any related displayName or description references for consistency if
present.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/tests/test_crawler_service.py`:
- Around line 104-107: The async helper async def gen() contains an unused "#
noqa: RUF028" directive on the unreachable "yield" line; simply remove the "#
noqa: RUF028" comment from the yield line in the async def gen() function so
Ruff warnings are resolved without changing behavior.

In `@services/platform/app/features/websites/components/website-pages-dialog.tsx`:
- Around line 96-106: The page dialog currently renders full page.content via
ReactMarkdown (in the website-pages-dialog component) which can hurt performance
for very long markdown; update the component to render the markdown inside a
container using markdownWrapperStyles but with a CSS max-height and overflow
hidden for collapsed state and add a simple "Show more"/"Show less" toggle that
flips a local expanded state to remove the max-height and overflow; detect
initial collapsed state either by measuring rendered scrollHeight or by a
character/line cutoff and ensure you still pass mdComponents into ReactMarkdown
so formatting remains intact.
- Around line 63-68: The onClose handler's type doesn't match ViewDialog's
onOpenChange signature: ViewDialog calls onOpenChange?.(false) (see
view-dialog.tsx) so update the usage in website-pages-dialog.tsx by either
changing the onClose prop type to accept a boolean (onClose: (open: boolean) =>
void) and pass it directly to ViewDialog.onOpenChange, or wrap it to ignore the
argument by passing onOpenChange={() => onClose()} so the boolean parameter is
not required; update the function/type accordingly and ensure
ViewDialog.onOpenChange vs onClose names match.
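
The wrapping option can be shown without React. The sketch below assumes `onOpenChange` has the shape `(open: boolean) => void`; `toOnOpenChange` is a hypothetical adapter, and unlike the reviewer's `() => onClose()` variant it only calls `onClose` when the dialog reports that it closed.

```typescript
type OnOpenChange = (open: boolean) => void;

// Hypothetical adapter: turn a zero-argument close handler into an
// onOpenChange-style callback that reacts only to open === false.
function toOnOpenChange(onClose: () => void): OnOpenChange {
  return (open: boolean): void => {
    if (!open) {
      onClose();
    }
  };
}

let closeCount = 0;
const handleOpenChange = toOnOpenChange(() => {
  closeCount += 1;
});
handleOpenChange(true); // dialog opened: ignored
handleOpenChange(false); // dialog closed: onClose runs
console.log(closeCount);
```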

In `@services/platform/convex/agent_tools/web/helpers/search_pages.ts`:
- Around line 38-46: The call to
internal.website_page_embeddings.internal_actions.search includes an unnecessary
TypeScript cast "websiteId: undefined as Id<'websites'> | undefined"; remove the
cast and either omit the websiteId property entirely or set it to plain
undefined so the action validator's optional v.id('websites') type is
respected—update the object passed to ctx.runAction (the call site using
ctx.runAction and internal.website_page_embeddings.internal_actions.search)
accordingly.

In `@services/platform/convex/website_page_embeddings/chunk_content.test.ts`:
- Around line 28-31: The test currently allows chunk sizes up to 250 which is
too lenient given the test passes chunkSize=200; update the assertion that
inspects each chunk (the loop over result checking chunk.content.length) to
assert <= 200 so chunks respect the configured chunkSize, or if
overlap/preceding title logic in the chunking implementation (the code that
computes effectiveChunkSize) legitimately allows larger chunks, update the test
to reflect that behavior and add a brief comment explaining why a 250 bound is
expected; reference the variables/result array and the chunk.content.length
assertion when making the change.
- Around line 65-71: The test "filters out chunks shorter than minimum length"
currently asserts chunk.content.length >= 1 which doesn't verify
MIN_CHUNK_LENGTH behavior; update the assertion in the test for chunkContent to
assert chunk.content.length >= MIN_CHUNK_LENGTH (50) or explicitly assert that
known short paragraphs (e.g., "Short" and "Another really short piece") are not
present in result — locate the test that calls chunkContent and replace the
trivial length check with an assertion comparing to the MIN_CHUNK_LENGTH
constant or checking that filtered short strings are absent.

In `@services/platform/convex/website_page_embeddings/chunk_content.ts`:
- Around line 122-128: The getOverlapText function currently uses
slice.indexOf(' ') which finds the first space in the overlap slice and can
leave a partial word at the start; change the logic in getOverlapText to locate
the last word boundary inside the overlap (use slice.lastIndexOf(' ')) and start
after that boundary (or fall back to returning the full slice if no boundary is
found) so the overlap always begins at a full-word boundary; update the function
name reference getOverlapText and its comment to document this intent.

In `@services/platform/convex/website_page_embeddings/content_hash.ts`:
- Around line 8-13: computeContentHash currently implements a 32-bit DJB2 hash
which risks collisions at scale; replace it with a stronger 64-bit (or
cryptographic) hash to avoid silent content-change misses. Update the
computeContentHash function to use a 64-bit FNV-1a or Node's crypto (e.g.,
sha256 and truncate to 64 bits or emit full hex) and return a longer hex string
(e.g., 16 hex chars for 64-bit or full sha256 hex); ensure any
storage/comparison logic that reads this hash (calls to computeContentHash) is
adapted to the new string length/format so change detection continues to work
correctly.
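
A 64-bit FNV-1a hash of the kind suggested here could look like the sketch below. It assumes `BigInt` and `TextEncoder` are available in the runtime, hashes the UTF-8 bytes of the content, and returns 16 hex characters. The name `fnv1a64Hex` is illustrative and is not the PR's `computeContentHash`.

```typescript
// Standard FNV-1a 64-bit parameters.
const FNV64_OFFSET_BASIS = BigInt('0xcbf29ce484222325');
const FNV64_PRIME = BigInt('0x100000001b3');

// Hypothetical replacement hash: XOR each byte in, then multiply by the prime,
// keeping only the low 64 bits after every step.
function fnv1a64Hex(content: string): string {
  const bytes = new TextEncoder().encode(content);
  let hash = FNV64_OFFSET_BASIS;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= BigInt(bytes[i]);
    hash = BigInt.asUintN(64, hash * FNV64_PRIME);
  }
  // Zero-pad so the stored hash always has the same length.
  return hash.toString(16).padStart(16, '0');
}

console.log(fnv1a64Hex('# Page title\n\nSome crawled markdown.'));
```

FNV-1a is not cryptographic. It lowers the accidental-collision rate compared with a 32-bit hash, but a SHA-256 digest would be the safer choice if adversarial content is a concern.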

In `@services/platform/convex/website_page_embeddings/internal_actions.ts`:
- Around line 193-206: The current inline 5s blocking delay inside the Convex
action (see debugLog, setTimeout, then calling embedMany and assigning
queryEmbedding) consumes the action time budget; replace the blocking await new
Promise(setTimeout) by scheduling the retry via ctx.scheduler.runAfter to
execute embedMany asynchronously (pass the same params: userId, threadId,
values, textEmbeddingModel) and then persist or return the embedding when the
scheduled job runs, or if you must keep an inline retry reduce the delay to a
much smaller value (e.g., ~1s) and add a retry count/timeout guard around
embedMany to avoid further action time exhaustion.
- Around line 374-380: The EmbeddingRecord interface declares websiteId as
string but the code actually uses Id<'websites'>; update the EmbeddingRecord
definition in
services/platform/convex/website_page_embeddings/internal_actions.ts to type
websiteId as Id<'websites'> (or a union that includes it), and import the Id
type from the Convex types (e.g., import { Id } from 'convex') if not already
present; also scan any callers of EmbeddingRecord and adjust
casts/serializations where code assumed plain string to preserve correct typing.

In `@services/platform/convex/website_page_embeddings/internal_mutations.ts`:
- Around line 34-107: The switch on dimension currently falls through for
unsupported values and simply returns count = 0; update the switch to include a
default case that throws a clear error (e.g., throw new Error) listing allowed
dimensions (256, 512, 1024, 1536, 2048, 2560, 4096) so invalid inputs fail fast;
locate the switch that branches on the variable dimension in the deletion
mutation (the block that queries websitePageEmbeddings{N} and increments count)
and add the default case there to surface configuration errors.
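
The fail-fast default case can be illustrated in isolation. The sketch below maps a dimension to the `websitePageEmbeddings{N}` table name mentioned above and throws for anything else. `embeddingTableFor` is a hypothetical helper, not the actual deletion mutation.

```typescript
const SUPPORTED_DIMENSIONS = [256, 512, 1024, 1536, 2048, 2560, 4096] as const;

// Hypothetical helper: resolve the dimension-specific table name, or fail loudly.
function embeddingTableFor(dimension: number): string {
  switch (dimension) {
    case 256:
    case 512:
    case 1024:
    case 1536:
    case 2048:
    case 2560:
    case 4096:
      return `websitePageEmbeddings${dimension}`;
    default:
      // Surfacing the allowed values makes configuration errors easy to spot.
      throw new Error(
        `Unsupported embedding dimension ${dimension}; expected one of ${SUPPORTED_DIMENSIONS.join(', ')}`,
      );
  }
}

console.log(embeddingTableFor(1536));
```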

In `@services/platform/convex/website_page_embeddings/internal_queries.ts`:
- Around line 80-138: fullTextSearch currently forwards the incoming limit
directly to DB queries when websiteId is omitted, allowing unbounded scans;
clamp the requested limit to a safe max (e.g., const MAX_LIMIT = 256) at the
start of the fullTextSearch handler (before computing searchFilter and the
switch) and use the clamped value in .take(...); apply the same clamping logic
in the runVectorSearch function so both paths enforce a predictable maximum and
the websiteId-specific branch still respects its existing 256 cap.
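
A clamp along these lines is small enough to share between both search paths. In this sketch `MAX_LIMIT = 256` follows the example value above, while the lower bound of 1 and the handling of NaN are assumptions.

```typescript
const MAX_LIMIT = 256;

// Hypothetical helper: force a caller-supplied limit into the range [1, max].
function clampLimit(requested: number, max: number = MAX_LIMIT): number {
  if (Number.isNaN(requested)) {
    return 1; // assumption: treat an unusable value as the smallest valid limit
  }
  return Math.min(Math.max(1, Math.floor(requested)), max);
}

console.log(clampLimit(10), clampLimit(10_000), clampLimit(0));
```

The clamped value would then be the one passed to `.take(...)`.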

In `@services/platform/convex/websites/bulk_upsert_pages.ts`:
- Around line 133-143: The current loop in bulk_upsert_pages.ts enqueues one job
per page using ctx.scheduler.runAfter(0,
internal.website_page_embeddings.internal_actions.generateForPage) for each
pageId in pageIdsToEmbed which can flood the scheduler for large sites; change
this to batch or throttle: group pageIdsToEmbed into chunks (e.g., 50–200 ids)
and either (A) call a new batched action (create
internal.website_page_embeddings.internal_actions.generateForPages that accepts
an array of pageIds) or (B) schedule generateForPage with incremental delays
(e.g., runAfter(i * delayMs)) to spread load, and update callers to use the
chunking logic so the scheduler queue is not overwhelmed. Ensure references to
pageIdsToEmbed, ctx.scheduler.runAfter, and
internal.website_page_embeddings.internal_actions.generateForPage are updated
accordingly.
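
The batching idea can be sketched as a pure planning step that groups page IDs and assigns each group an increasing delay. `planBatches` and `ScheduledBatch` are hypothetical names, and the batch size and delay step are example values.

```typescript
interface ScheduledBatch<T> {
  delayMs: number;
  items: T[];
}

// Hypothetical helper: split items into fixed-size batches and stagger them.
function planBatches<T>(
  items: readonly T[],
  batchSize: number,
  delayStepMs: number,
): ScheduledBatch<T>[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: ScheduledBatch<T>[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push({
      delayMs: batches.length * delayStepMs, // 0, step, 2 * step, ...
      items: items.slice(start, start + batchSize),
    });
  }
  return batches;
}

const plan = planBatches(['p1', 'p2', 'p3', 'p4', 'p5'], 2, 100);
console.log(plan);
```

Each planned batch could then be passed to `ctx.scheduler.runAfter(batch.delayMs, ...)` with a batched action such as the `generateForPages` the reviewer proposes, which the source does not show as already existing.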

---

Outside diff comments:
In `@services/platform/convex/agents/builtin_agents.ts`:
- Around line 63-83: The web-assistant entry currently describes live web search
but the tool queries crawled/indexed site pages; update the description string
on the agent object with type 'web' and name 'web-assistant' to indicate it
searches indexed/crawled website pages (e.g., "Searches crawled/indexed website
pages for relevant information" or similar), so users understand the source and
freshness limits; adjust any related displayName or description references for
consistency if present.

In `@services/platform/convex/agents/web/agent.ts`:
- Around line 17-58: The WEB_AGENT_INSTRUCTIONS mandate always calling the web
tool but createWebAgent supports a tool-less mode, which conflicts with the
retry/recovery logic in lib/agent_response/generate_response.ts that disables
tools; update the implementation so the web agent can operate when withTools is
false by either (A) adding a second instruction set used when options.withTools
=== false that removes/relaxes the "MANDATORY SEARCH-FIRST RULE" (e.g., a
WEB_AGENT_INSTRUCTIONS_TOOLLESS or internal flag) and ensures createWebAgent
returns that instruction set, or (B) change the retry/recovery logic to avoid
disabling tools for agents created by createWebAgent (detect via a creator flag)
so WEB_AGENT_INSTRUCTIONS remains enforceable; reference WEB_AGENT_INSTRUCTIONS
and createWebAgent when locating the code to alter, and adjust the retry paths
in lib/agent_response/generate_response.ts so the behavior is consistent and
does not cause impossible-to-follow instructions.

greptile-apps · 2026-02-20T16:35:49Z

Greptile Summary

Introduced comprehensive vector search infrastructure for crawled website pages, combining semantic embeddings with full-text search using reciprocal rank fusion for better relevance.

Key Changes:

- New website_page_embeddings domain: chunk content (1500 chars, 200 overlap), generate OpenAI embeddings, store in dimension-specific Convex vector tables (256-4096), hybrid search with RRF k=60
- Crawler improvements: switched from raw_markdown to fit_markdown (density-filtered main content), added PruningContentFilter with threshold 0.4, excluded nav/footer/header tags, improved URL discovery retry logic with seeder re-initialization
- Web tool refactor: removed standalone web_assistant_tool sub-agent, replaced with inline search_pages helper performing hybrid vector+text search over indexed pages with mandatory source URL citation
- Embedding generation: content-change detection in bulk_upsert_pages, automatic scheduling when pages update, retry logic for OpenAI API failures
- UI additions: paginated website pages dialog for browsing crawled content
- Infrastructure: centralized embedding_config utility, RLS rules for websitePages table
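
The chunking parameters above (1500 characters with a 200-character overlap) can be illustrated with a plain sliding window. This is a simplified sketch: the review comments suggest the real chunk_content.ts also handles word boundaries, titles, and a minimum chunk length, none of which is modeled here, and `chunkText` is a hypothetical name.

```typescript
// Hypothetical fixed-window chunker: each chunk starts (chunkSize - overlap)
// characters after the previous one, so neighbours share `overlap` characters.
function chunkText(
  text: string,
  chunkSize: number = 1500,
  overlap: number = 200,
): string[] {
  if (chunkSize < 1 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('require chunkSize >= 1 and 0 <= overlap < chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) {
      break; // the window already reached the end of the text
    }
  }
  return chunks;
}

const page = 'x'.repeat(3000);
console.log(chunkText(page).map((chunk) => chunk.length));
```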

Confidence Score: 4/5

Safe to merge with careful monitoring - comprehensive feature with good test coverage but embedding costs should be tracked
Well-structured implementation with unit tests for chunking, RRF, and embedding config. Crawler changes have test coverage. Minor concerns: no integration tests for full embedding pipeline, potential OpenAI API cost implications, and websiteId filter only applied post-vector-search rather than in query (due to Convex limitations)
Monitor services/platform/convex/website_page_embeddings/internal_actions.ts for embedding generation costs and retry behavior; check services/crawler/app/services/crawler_service.py for seeder re-initialization performance

Important Files Changed

Filename	Overview
services/platform/convex/website_page_embeddings/internal_actions.ts	Implements embedding generation and vector search with retry logic and RRF merging; comprehensive error handling and proper content hashing
services/platform/convex/website_page_embeddings/schema.ts	Multi-dimension vector table definitions with appropriate indexes for org/page filtering and search
services/crawler/app/services/crawler_service.py	Improved seeder retry logic with client re-initialization; switched to fit_markdown for better content quality; added PruningContentFilter
services/platform/convex/agent_tools/web/web_tool.ts	Simplified web tool now focuses solely on semantic search; clear API with mandatory URL citation requirement
services/platform/convex/agents/web/agent.ts	Refactored web agent now specializes in crawled content search; mandatory search-first rule and source citation requirements
services/platform/convex/websites/bulk_upsert_pages.ts	Added content change detection and automatic embedding generation scheduling; updates page count on inserts

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User adds website] --> B[Crawler discovers URLs]
    B --> C[Crawler fetches pages<br/>fit_markdown + PruningContentFilter]
    C --> D[bulk_upsert_pages mutation]
    D --> E{Content changed?}
    E -->|Yes| F[Schedule generateForPage action]
    E -->|No| G[Skip embedding generation]
    F --> H[Chunk content<br/>1500 chars, 200 overlap]
    H --> I[Compute content hash]
    I --> J[Generate embeddings via OpenAI]
    J --> K[Delete old embeddings for page]
    K --> L[Insert new embeddings into<br/>dimension-specific table]
    
    M[Agent receives query] --> N[Web tool: search pages]
    N --> O[Generate query embedding]
    O --> P[Parallel: Vector search +<br/>Full-text search]
    P --> Q[Merge with RRF k=60]
    Q --> R[Deduplicate by URL]
    R --> S[Format with source URLs]
    S --> T[Return to agent]

Last reviewed commit: ab7170e


larryro added 2 commits February 21, 2026 00:08

feat(platform): require source URL citation in web search tool results

ab7170e

coderabbitai Bot requested changes Feb 20, 2026


larryro added 10 commits February 21, 2026 01:21

fix(crawler): remove unused noqa directive in test helper

d23ab86

fix(platform): add collapsible markdown content in website pages dialog

3706353

fix(platform): remove unnecessary type cast when calling search action

905b7c7

fix(platform): clarify chunk size assertion bound in chunk_content tests

cc24996

fix(platform): strengthen MIN_CHUNK_LENGTH test to actually exercise filtering

19e199a

fix(platform): reduce inline retry delay in search embedding from 5s to 1s

589eee2

fix(platform): use Id<'websites'> type for websiteId in EmbeddingRecord interface

ceb34c4

fix(platform): throw on unsupported dimension in deleteByPageId switch

3b5fa40

fix(platform): clamp search limit to prevent unbounded queries

2443954

fix(platform): remove unused web_assistant sub-agent type

46a7121

larryro merged commit fcddf90 into main Feb 20, 2026
17 checks passed

larryro deleted the feature/website-page-embeddings branch February 20, 2026 17:52

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat(platform,crawler): add website page embeddings with vector search (#501)

fe9b6ec

larryro merged 12 commits into main from feature/website-page-embeddings
